Filter Data

keywords: ALFA, data filtering

Data filtering is one of the basic functionalities of any data science toolkit. It means extracting the interesting or relevant information from a large dataset. ALFA provides several ways to filter and explore large datasets. Let's look at each of them in detail.

OUTLINE

Filter names/strings
Top/Bottom value filtering
"greater than", "less than", "between"...
Logical operations
Create Categorical arrays
Print First/Last N values in an array

1. Filter names/strings

Command: contains

This functionality lets users filter a dataset based on a string or substring present in a variable. It also tolerates misspelt words. As an example, suppose we are given a sports dataset (cricket, tennis, soccer, etc.) and want to analyze the stats of a particular player, say MS Dhoni in a cricket dataset. We can do this in ALFA using the following sequence of commands:

load cricket
batsman name contains dhoni

The following commands would also work:

batsman contains dhon
batsman contains dhoi

The command is contains. It outputs a logical vector with 1s at the positions where the desired string or substring is found in the searched variable.
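To make the idea of a logical vector with typo tolerance concrete, here is a minimal Python sketch. It is not ALFA's implementation: the function name contains, the 0.8 similarity threshold, and the sample names are all assumptions chosen for illustration, and the fuzzy matching uses Python's standard difflib module.

```python
from difflib import SequenceMatcher

def contains(values, query, threshold=0.8):
    """Return a 0/1 flag per value: 1 where `query` approximately occurs in it."""
    query = query.lower()
    flags = []
    for value in values:
        text = value.lower()
        # Exact substring match, e.g. "dhon" inside "ms dhoni".
        exact = query in text
        # Approximate match against each word, to tolerate typos such as "dhoi".
        fuzzy = any(
            SequenceMatcher(None, query, word).ratio() >= threshold
            for word in text.split()
        )
        flags.append(1 if exact or fuzzy else 0)
    return flags

# Hypothetical sample column, not taken from the cricket dataset.
batsman = ["MS Dhoni", "V Kohli", "MS Dhoni", "S Tendulkar"]
for query in ("dhoni", "dhon", "dhoi"):
    print(query, contains(batsman, query))
```

With this sample column, all three queries flag the same two rows. A lower threshold would accept more distant misspellings at the cost of more false matches.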

Now, we can make this even more interesting. To apply this filter to all subsequent analysis, use the command set filter. The usage is as follows:

load cricket
batsman contains dhoni
set filter
sum of runs scored
mean of runs scored
count matches played

The above set of commands filters the cricket dataset down to MS Dhoni's records and computes his total runs scored, the mean of his runs scored, and the number of matches he played.
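Conceptually, set filter keeps only the rows where the logical vector is true, and every later computation runs on those rows. The Python sketch below illustrates that idea with made-up data. The column values are hypothetical and the code only mirrors the behavior described above, not ALFA's internals.

```python
from statistics import mean

# Hypothetical columns standing in for the cricket dataset.
batsman = ["MS Dhoni", "V Kohli", "MS Dhoni", "V Kohli"]
runs_scored = [50, 80, 30, 10]

# Logical vector: True where the batsman name contains "dhoni".
mask = ["dhoni" in name.lower() for name in batsman]

# "set filter": keep only the rows selected by the mask.
filtered_runs = [runs for runs, keep in zip(runs_scored, mask) if keep]

total_runs = sum(filtered_runs)     # sum of runs scored
average_runs = mean(filtered_runs)  # mean of runs scored
matches = len(filtered_runs)        # count of matching rows

print(total_runs, average_runs, matches)
```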

Note: The filter can be cleared/disabled by typing clear or clear filter.

2. Top/Bottom value filtering

Top and bottom filtering selects the largest or smallest N values of a variable. This is useful in many cases. For example, if we want to study only the largest tumors in a dataset, we can apply such a filter on tumor size. Here is an example of its usage:

load tumor data
find 20 largest tumors
set filter
[extract properties]

The second command can equivalently be written as find top 20 tumors.

Similar commands: top, bottom, largest, smallest
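As an illustration of what a top or bottom filter produces, the following Python sketch builds a logical vector marking the N largest or N smallest entries of a list. The helper names top_n_mask and bottom_n_mask and the sample sizes are assumptions for this example only. How ALFA breaks ties is not specified here, so the sketch simply follows the ordering of heapq.

```python
import heapq

def top_n_mask(values, n):
    """Return a boolean list that is True at the positions of the n largest values."""
    indices = heapq.nlargest(n, range(len(values)), key=values.__getitem__)
    keep = set(indices)
    return [i in keep for i in range(len(values))]

def bottom_n_mask(values, n):
    """Return a boolean list that is True at the positions of the n smallest values."""
    indices = heapq.nsmallest(n, range(len(values)), key=values.__getitem__)
    keep = set(indices)
    return [i in keep for i in range(len(values))]

# Hypothetical tumor sizes.
tumor_size = [12.0, 45.5, 7.3, 30.1, 22.8]

largest_two = top_n_mask(tumor_size, 2)
smallest_two = bottom_n_mask(tumor_size, 2)
print(largest_two)
print(smallest_two)
```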

3. "greater than", "less than", "between"...

Examples:

area > 50
perimeter < 10
area <= 400
perimeter between 20 and 50

4. Logical operations

The logical arrays generated by the filtering operations above can be combined using the logical operations AND, OR, NOT, and XOR, or their corresponding symbols &, ||, !, and ^.

Example:

height > 50
save as h1
height < 10
save as h2
h1 || h2
save as h3
set filter
[do further analysis]
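The same sequence can be pictured as element-wise operations on boolean lists. The Python sketch below uses hypothetical height values and is only meant to show what each logical operation does to a pair of logical vectors. It does not reflect how ALFA evaluates these commands.

```python
# Hypothetical height column.
height = [5, 20, 60, 8, 75]

h1 = [h > 50 for h in height]  # height > 50
h2 = [h < 10 for h in height]  # height < 10

# Element-wise logical operations on the two vectors.
h3 = [a or b for a, b in zip(h1, h2)]            # OR
both = [a and b for a, b in zip(h1, h2)]         # AND
exactly_one = [a != b for a, b in zip(h1, h2)]   # XOR
neither = [not a for a in h3]                    # NOT applied to h3

# Using h3 as the filter keeps the rows that are either very tall or very short.
selected = [h for h, keep in zip(height, h3) if keep]
print(selected)
```

Because no height can be both above 50 and below 10, the AND vector is all false here and the XOR vector equals the OR vector.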

5. Create Categorical arrays

The logical vectors can be combined to create categorical arrays. Categorical arrays can then be used as reference arrays for statistical and machine learning analysis. For example, in a housing dataset, I am interested in analyzing the characteristics of houses in different price ranges, including their differences and similarities. I can do this by creating a categorical array as follows:

load housing data
price <= 250000
save as p1
price between 250000 and 750000
save as p2
price > 750000
save as p3
create categorical array from p1 p2 and p3
save as categPrice
set categPrice as reference
[do further analysis, see statistics, visualization, and machine learning]
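One way to picture the resulting categorical array is as a single label per row, taken from whichever logical vector is true for that row. The Python sketch below is a hedged illustration: the function categorical_from_masks, the sample prices, and the choice to treat the middle range as excluding 250000 are all assumptions. How ALFA treats the boundaries of between, or rows matched by more than one vector, is not covered here, so this sketch takes the first matching label and uses None when nothing matches.

```python
from collections import Counter

def categorical_from_masks(masks, labels):
    """Assign each row the label of the first mask that is True for it."""
    categories = []
    for row in zip(*masks):
        label = next((name for name, flag in zip(labels, row) if flag), None)
        categories.append(label)
    return categories

# Hypothetical house prices.
price = [180000, 250000, 400000, 750000, 900000]

p1 = [p <= 250000 for p in price]
p2 = [250000 < p <= 750000 for p in price]  # assumed: lower bound excluded
p3 = [p > 750000 for p in price]

categ_price = categorical_from_masks([p1, p2, p3], ["p1", "p2", "p3"])
print(categ_price)

# The categorical array can then drive per-group analysis, e.g. counts per range.
print(Counter(categ_price))
```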

6. Print First/Last N values in an array

This command can be used to inspect an array or variable. For example, type:

print first 10 values of housing price
print last 10 cricketer names
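For readers who think in terms of list slicing, the two commands correspond roughly to taking the head and the tail of a sequence. The Python helpers below are hypothetical names used only for illustration.

```python
def first_n(values, n):
    """Return the first n items (fewer if the list is shorter)."""
    return values[:n]

def last_n(values, n):
    """Return the last n items; guard n == 0, since values[-0:] is the whole list."""
    return values[-n:] if n > 0 else []

# Hypothetical housing prices.
housing_price = [210000, 340000, 155000, 820000, 460000]

print(first_n(housing_price, 2))
print(last_n(housing_price, 2))
```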

Warning: If you set a new filter while another filter is already active, the two are composed, so only the data satisfying both filters remain. For example:

find area.mean > 100
set filter
find area.mean < 500
set filter
range area.mean

will give the range of the elements between 100 and 500. If you want only the data less than 500, clear the existing filter before creating the new one:

find area.mean > 100
set filter
range area.mean
clear filter
find area.mean < 500
set filter
range area.mean
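The composition behavior described in the warning can be modeled as a logical AND between the active filter and each new one, with clear filter resetting the state. The Python sketch below is a simplified model under that assumption. The FilterState class and the sample values are hypothetical and are not part of ALFA.

```python
class FilterState:
    """Tracks the active filter as a boolean mask over the rows."""

    def __init__(self, size):
        self.size = size
        self.mask = [True] * size  # no filter: every row is kept

    def set_filter(self, new_mask):
        # Composing with the existing filter is modeled as element-wise AND.
        self.mask = [old and new for old, new in zip(self.mask, new_mask)]

    def clear_filter(self):
        self.mask = [True] * self.size

    def apply(self, values):
        return [v for v, keep in zip(values, self.mask) if keep]

# Hypothetical area.mean column.
area_mean = [50, 150, 300, 450, 600, 900]
state = FilterState(len(area_mean))

# Two filters set one after the other are composed.
state.set_filter([a > 100 for a in area_mean])
state.set_filter([a < 500 for a in area_mean])
composed = state.apply(area_mean)
print(min(composed), max(composed))

# Clearing first gives only the second condition.
state.clear_filter()
state.set_filter([a < 500 for a in area_mean])
only_below_500 = state.apply(area_mean)
print(min(only_below_500), max(only_below_500))
```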