Filter Data
keywords: ALFA, data filtering
Data filtering is one of the basic functions of any data science toolkit: extracting the interesting or relevant information from a large dataset. ALFA provides many commands for filtering and exploring large datasets. Let's look at each of them in detail.
OUTLINE
- Filter names/strings
- Top/Bottom value filtering
- greater than (>), less than (<), between, ...
- Logical operations
- Create Categorical arrays
- Print First/Last N values in an array
1. Filter names/strings
Command: contains
This command filters a dataset based on a string or substring present in a variable. It also tolerates misspelled words. For example, suppose we are given a sports dataset (cricket, tennis, soccer, etc.) and want to analyze the stats of a particular player, say MS Dhoni in a cricket dataset. We can do this in ALFA using the following sequence of commands.
load cricket
batsman name contains dhoni
The following commands would also work, since substrings and misspellings are matched:
batsman contains dhon
batsman contains dhoi
The command is contains, and it outputs a logical vector with 1s where the desired string or substring is found in the search variables.
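How ALFA matches misspelled words is not described here. As a rough analogy only, the Python sketch below builds a similar 0/1 vector: it accepts an exact substring match first, and otherwise a close match against any single word. The helper name `contains`, the 0.8 similarity threshold, and the sample names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def contains(values, query, threshold=0.8):
    """Return one 0/1 flag per value (hypothetical helper, not ALFA's API)."""
    query = query.lower()
    mask = []
    for value in values:
        text = value.lower()
        # Exact substring match first.
        hit = query in text
        if not hit:
            # Otherwise accept a close match against any single word.
            hit = any(
                SequenceMatcher(None, query, word).ratio() >= threshold
                for word in text.split()
            )
        mask.append(1 if hit else 0)
    return mask


# Hypothetical sample data.
batsman = ["MS Dhoni", "Virat Kohli", "Sachin Tendulkar"]
print(contains(batsman, "dhoni"))  # full name part
print(contains(batsman, "dhon"))   # substring
print(contains(batsman, "dhoi"))   # misspelling
```

A stricter threshold rejects more misspellings; a looser one starts matching unrelated names, so the value is a trade-off rather than a fixed rule.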
Now, we can make this even more interesting. To apply this filter to all subsequent analysis, use the command set filter. The usage is as follows:
load cricket
batsman contains dhoni
set filter
sum of runs scored
mean of runs scored
count matches played
The above commands restrict the cricket dataset to MS Dhoni and compute his total runs scored, his average runs, and the number of matches he played.
Note: The filter can be cleared/disabled by typing clear or clear filter.
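Conceptually, set filter keeps only the rows where the logical vector is true, and later commands then run on that subset. A minimal Python sketch of the same idea follows; the rows are made-up values standing in for the cricket dataset, and none of this is ALFA's actual implementation.

```python
from statistics import mean

# Hypothetical rows standing in for the cricket dataset.
batsman = ["MS Dhoni", "Virat Kohli", "MS Dhoni", "MS Dhoni"]
runs_scored = [45, 80, 10, 65]

# Logical vector, analogous to "batsman contains dhoni".
mask = ["dhoni" in name.lower() for name in batsman]

# Analogous to "set filter": keep only rows where the mask is true.
filtered_runs = [runs for runs, keep in zip(runs_scored, mask) if keep]

total_runs = sum(filtered_runs)        # "sum of runs scored"
average_runs = mean(filtered_runs)     # "mean of runs scored"
matches_played = len(filtered_runs)    # "count matches played"
print(total_runs, average_runs, matches_played)
```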
2. Top/Bottom value filtering
With top and bottom filtering, the N largest or N smallest values can be selected. This is useful in many cases. For example, if we want to study only the largest tumors in a dataset, we can apply such a filter on tumor size. Here is an example of its usage.
load tumor data
find 20 largest tumors OR find top 20 tumors
set filter
[extract properties]
Similar commands: top, bottom, largest, smallest
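As an illustration of what a top/bottom filter produces, the Python sketch below builds a logical mask that marks the N largest (or smallest) entries while leaving the data in its original order. The helper `top_n_mask` and the tumor sizes are hypothetical, and how ALFA breaks ties is not specified in this document.

```python
import heapq


def top_n_mask(values, n, largest=True):
    """Mark the n largest (or smallest) entries (hypothetical helper)."""
    pick = heapq.nlargest if largest else heapq.nsmallest
    # Select row indices, ranked by the value stored at each index.
    chosen = set(pick(n, range(len(values)), key=values.__getitem__))
    return [i in chosen for i in range(len(values))]


# Hypothetical tumor sizes.
tumor_size = [12.5, 30.1, 8.2, 22.0, 17.4]

mask = top_n_mask(tumor_size, 2)
largest_tumors = [s for s, keep in zip(tumor_size, mask) if keep]
smallest_tumors = [
    s for s, keep in zip(tumor_size, top_n_mask(tumor_size, 2, largest=False)) if keep
]
print(largest_tumors, smallest_tumors)
```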
3. "greater than", "less than", "between"...
Examples:
area > 50
perimeter < 10
area <= 400
perimeter between 20 and 50
4. Logical operations
The logical arrays generated by the filtering operations above can be combined using logical operations such as AND, OR, NOT, XOR, and their corresponding symbols &, ||, !, ^.
Example:
height > 50
save as h1
height < 10
save as h2
h1 || h2
save as h3
set filter
[do further analysis]
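The element-wise meaning of these operations can be sketched in Python as follows. The height values are made up, and the variable names simply mirror the example above; this is an analogy, not ALFA syntax.

```python
# Hypothetical height values.
height = [5, 30, 60, 8, 75]

h1 = [h > 50 for h in height]                 # height > 50
h2 = [h < 10 for h in height]                 # height < 10

h3 = [a or b for a, b in zip(h1, h2)]         # OR:  h1 || h2
both = [a and b for a, b in zip(h1, h2)]      # AND: h1 & h2
exactly_one = [a != b for a, b in zip(h1, h2)]  # XOR: h1 ^ h2
neither = [not x for x in h3]                 # NOT: !h3

# Applying h3 as a filter keeps only the very short or very tall entries.
selected = [h for h, keep in zip(height, h3) if keep]
print(selected)
```

Because no height can be both above 50 and below 10, the AND result is all false here and XOR matches OR.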
5. Create Categorical arrays
Logical vectors can be combined to create categorical arrays, which can then be used as reference arrays for statistical and machine learning analysis. For example, in a housing dataset, suppose we want to analyze the characteristics of houses in different price ranges, along with their differences and similarities. We can do this by creating a categorical array as follows:
load housing data
price <= 250000
save as p1
price between 250000 and 750000
save as p2
price > 750000
save as p3
create categorical array from p1 p2 and p3
save as categPrice
set categPrice as reference
[do further analysis, see statistics, visualization, and machine learning]
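One way to picture the result is a single array of category labels, one per row. The Python sketch below is only an assumption about how such an array could be assembled: it labels each row with the position of the first mask that is true (0 if none), and it treats between as inclusive at both ends. Neither choice is confirmed by this document, and the prices are invented.

```python
def categorical_from_masks(*masks):
    """Label each row by the first true mask, 1-based (hypothetical helper)."""
    labels = []
    for row in zip(*masks):
        label = 0  # 0 means the row matched none of the masks
        for category, flag in enumerate(row, start=1):
            if flag:
                label = category
                break
        labels.append(label)
    return labels


# Hypothetical house prices.
price = [180000, 250000, 400000, 750000, 900000]

p1 = [p <= 250000 for p in price]
p2 = [250000 <= p <= 750000 for p in price]  # assumes inclusive "between"
p3 = [p > 750000 for p in price]

categ_price = categorical_from_masks(p1, p2, p3)
print(categ_price)
```

With these assumptions, a house priced at exactly 250000 satisfies both p1 and p2, and the first-match rule places it in the first category.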
6. Print First/Last N values in an array
This command prints the first or last N values of an array/variable so that it can be inspected. For example:
print first 10 values of housing price
print last 10 cricketer names
Warning: If you create a new filter while a filter is already set, the two are composed together. For example:
find area.mean > 100
set filter
find area.mean < 500
set filter
range area.mean
will give the range of elements between 100 and 500. If you want only data less than 500, use clear filter before creating the new filter, as follows:
find area.mean > 100
set filter
range area.mean
clear filter
find area.mean < 500
set filter
range area.mean
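The composition behavior described in the warning can be modeled as AND-ing each new mask into the active one. The Python sketch below shows that model; the class `FilteredData`, its method names, and the sample values are hypothetical and are not taken from ALFA.

```python
class FilteredData:
    """Toy model of a dataset column with a composable filter."""

    def __init__(self, values):
        self.values = values
        self.active = [True] * len(values)

    def set_filter(self, mask):
        # A new filter is combined (AND) with whatever is already active.
        self.active = [a and m for a, m in zip(self.active, mask)]

    def clear_filter(self):
        self.active = [True] * len(self.values)

    def value_range(self):
        kept = [v for v, a in zip(self.values, self.active) if a]
        return min(kept), max(kept)


# Hypothetical area.mean values.
area_mean = [50, 150, 300, 450, 600]
data = FilteredData(area_mean)

data.set_filter([v > 100 for v in area_mean])
data.set_filter([v < 500 for v in area_mean])
composed_range = data.value_range()   # both filters apply

data.clear_filter()
data.set_filter([v < 500 for v in area_mean])
fresh_range = data.value_range()      # only the second filter applies
print(composed_range, fresh_range)
```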