Basic instructions

Enter text to receive a response from Alfarvis

Enter "Bye" to close the program

Alfa is data exploration and analysis software that you control with natural-language commands


To begin data exploration, first load a dataset using the "load" command. For example, "load tour de france dataset" loads a cycling dataset


To get a list of datasets that can be loaded, call "list files"


Once you have loaded a dataset, you can run several groups of commands: visualization, statistics, and machine learning


You can list all commands using "list commands", or use "list visualization/statistics... commands" to list a specific group of commands


You can also get individual help for each of the commands by calling "help [command name]"


Apart from the above commands, there are also data handling commands that, for example, set row labels or set the reference variable used for legends


You can get a list of data handling commands by calling "list data handling commands"


______________________________________________________


Loading the dataset and different classifiers.

Note: When you load a classifier, a pop-up opens in the right-most column where you can modify the algorithm parameters

User: load breast cancer dataset

Alfa:

Running command in background!

Loaded breast cancer tumor

Showing Statistics for breast cancer tumor

Loaded file: breast_tumor_data.csv

______________________________________________________


User: load svm

Alfa:

Running command in background!

Loaded file: SVM.csv

Saving result to svm.support.vector.machine

______________________________________________________


User: load random forest

Alfa:

Running command in background!

Loaded file: RF.csv

Saving result to rf.random.forest

______________________________________________________


User: load decision tree

Alfa:

Running command in background!

Loaded file: DT.csv

Saving result to dt.decision.tree

______________________________________________________


Setting the reference variable, which in this case is diagnosis: Benign (1) vs. Malignant (2)

User: set reference to diagnosis

Alfa:

Setting reference to diagnosis

______________________________________________________


Let's train the random forest classifier on the breast cancer dataset. Accuracy is measured on the same data used for training, so we should get a training accuracy of 100%

User: train rf on breast cancer dataset

Alfa:

Training the classifier


Features

area

compactness

concave.points

concavity

fractal.dimension

id

perimeter

radius

smoothness

symmetry

texture


The classifier rf.random.forest has been trained

Accuracy on training set : 100.000%

Saving result to rf.random.forest.breast.cancer.tumor

______________________________________________________


Let's test the RF model we just trained on the same data it was trained on. We should still get an accuracy of 100%, although this says nothing about how well the model generalizes to new data

User: test rf on breast cancer dataset

Alfa:

Running the trained classifier...

Accuracy : 100.000%

Confusion matrix, without normalization

[[357   0]
 [  0 212]]


______________________________________________________


We can now start doing more interesting ML analysis. Let us start with leave-one-out cross validation. Any k-fold cross validation can be run instead by selecting the value of k from the pop-up on the right

User: leave one out cross validation using svm on breast cancer dataset

Alfa:

Running command in background!

Using leave one out cross validation

Confusion matrix, without normalization

[[350   7]
 [ 26 186]]



Saving result to figure.2.cval.cancer.breast.tumor
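
To make the procedure concrete, here is a minimal sketch of how leave-one-out cross validation builds a confusion matrix: each sample is held out once, a model is trained on the remaining samples, and the prediction for the held-out sample is tallied. This is not Alfa's implementation. The 1-nearest-neighbor classifier, the function names, and the toy data are assumptions chosen for illustration, and the matrix is taken to have true labels as rows and predicted labels as columns.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    best_index = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best_index]


def leave_one_out_confusion(X, y, labels):
    """Hold out each sample once, train on the rest, and tally the prediction.

    Rows of the returned matrix are true labels, columns are predicted labels.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # every sample except the held-out one
        train_y = y[:i] + y[i + 1:]
        predicted = nearest_neighbor_predict(train_X, train_y, X[i])
        matrix[index[y[i]]][index[predicted]] += 1
    return matrix


# Hypothetical toy data (not from the breast cancer dataset).
# Labels follow the tutorial's coding: 1 = benign, 2 = malignant.
X = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (3.0, 3.1), (3.2, 2.9), (1.4, 1.3)]
y = [1, 1, 1, 2, 2, 2]

matrix = leave_one_out_confusion(X, y, labels=[1, 2])
print(matrix)  # [[3, 0], [1, 2]]
```

The last toy sample lies close to the benign cluster, so it is the one misclassified when held out. With n samples the model is trained n times, which is why the command runs in the background.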

______________________________________________________


User: leave one out cross validation using rf on breast cancer dataset

Alfa:

Running command in background!

Using leave one out cross validation

Confusion matrix, without normalization

[[343  14]
 [ 18 194]]



Saving result to figure.3.cval.cancer.breast.tumor
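
Leave-one-out is the special case of k-fold cross validation in which k equals the number of samples (569 here, the sum of the confusion-matrix entries). The fold bookkeeping can be sketched as follows, assuming contiguous, unshuffled folds. The function name is hypothetical, and Alfa's own splitting may differ, for example by shuffling first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous test folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


five_fold = k_fold_indices(569, 5)
leave_one_out = k_fold_indices(569, 569)

print([len(fold) for fold in five_fold])          # [114, 114, 114, 114, 113]
print(len(leave_one_out), len(leave_one_out[0]))  # 569 1
```

Each fold serves once as the test set while the remaining samples are used for training, so every sample is predicted exactly once in both settings.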

______________________________________________________


Finding the top 5 predictors using the random forest algorithm.

User: find top 5 predictors of breast cancer dataset

Alfa:


feature importance   top features
0.28                 concave.points
0.17                 area
0.14                 perimeter
0.13                 concavity
0.10                 radius


Saving result to top.predictors.breast.cancer.tumor
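
The ranking step itself is simple: sort the features by importance and keep the first k. The sketch below uses only the five values reported above; the function name is hypothetical, and the importances of the other six features are not shown in the output, so they are left out. If the importances are normalized to sum to 1 (as scikit-learn's random forest importances are; whether Alfa uses that library is an assumption), the five reported features account for about 0.82 of the total.

```python
def top_k_features(importances, k):
    """Return the k (feature, importance) pairs with the largest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Values copied from the output above, deliberately listed out of order.
reported = {
    "radius": 0.10,
    "area": 0.17,
    "concavity": 0.13,
    "concave.points": 0.28,
    "perimeter": 0.14,
}

top3 = top_k_features(reported, 3)
share = sum(reported.values())

print([name for name, _ in top3])  # ['concave.points', 'area', 'perimeter']
print(round(share, 2))             # 0.82
```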

______________________________________________________


Let us now use the top 5 predictors for leave-one-out cross validation with the random forest algorithm. This will change the resulting confusion matrix

User: leave one out cross validation using rf using top.predictors.breast.cancer.tumor

Alfa:

Running command in background!

Using leave one out cross validation

Confusion matrix, without normalization

[[337  20]
 [ 27 185]]



Saving result to figure.4.cval.breast.cancer.tumor.top.predictors
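
One way to quantify the change is per-class recall, the fraction of each true class that was predicted correctly. The sketch below compares this run with the earlier random forest run on all features. It assumes rows are true classes in the order Benign, Malignant; the row sums of 357 and 212 are the same in every matrix, which is consistent with rows being true classes, but the order is an assumption. The function name is hypothetical.

```python
def per_class_recall(matrix):
    """Fraction of each true class (row) that landed on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


all_features = [[343, 14], [18, 194]]  # rf, all features (earlier run)
top_five = [[337, 20], [27, 185]]      # rf, top 5 predictors (this run)

for name, matrix in [("all features", all_features), ("top 5", top_five)]:
    benign, malignant = per_class_recall(matrix)
    print(f"{name}: benign {benign:.2%}, malignant {malignant:.2%}")
# all features: benign 96.08%, malignant 91.51%
# top 5: benign 94.40%, malignant 87.26%
```

Because a random forest is randomized, repeated runs can give slightly different matrices, so small differences between runs should not be over-interpreted.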

______________________________________________________


The biggest question is: how do we determine which classifier is best? This is often a tedious procedure that requires a lot of coding and time. Here you can just ask ALFA to do it for you. See below

User: which is the best classifier between svm rf and dt on breast cancer dataset

Alfa:

Training classifier using the following features:

Index(['area', 'compactness', 'concave.points', 'concavity',
       'fractal.dimension', 'id', 'perimeter', 'radius', 'smoothness',
       'symmetry', 'texture'],
      dtype='object')

Testing the following classifiers:

svm.support.vector.machine

rf.random.forest

dt.decision.tree

Finding the best classifier using k fold cross validation...



Plotting the confusion matrices...


Confusion matrix, without normalization

[[350   7]
 [ 26 186]]


Confusion matrix, without normalization

[[343  14]
 [ 16 196]]


Confusion matrix, without normalization

[[331  26]
 [ 27 185]]




Best classifier is rf.random.forest with an accuracy of - 94.73%
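
The reported accuracy can be checked by hand from the confusion matrices: accuracy is the sum of the diagonal divided by the total number of samples. The sketch below assumes the three matrices are printed in the same order as the classifiers are listed (svm, rf, dt); the function name is hypothetical and this is not necessarily how Alfa computes its score.

```python
def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices copied from the output above, in the assumed order.
results = {
    "svm.support.vector.machine": [[350, 7], [26, 186]],
    "rf.random.forest": [[343, 14], [16, 196]],
    "dt.decision.tree": [[331, 26], [27, 185]],
}

scores = {name: accuracy(matrix) for name, matrix in results.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2%}")
# svm.support.vector.machine: 94.20%
# rf.random.forest: 94.73%
# dt.decision.tree: 90.69%
print(best)  # rf.random.forest
```

The second matrix gives 539 correct out of 569, which matches the 94.73% reported for rf.random.forest.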

______________________________________________________


User: save notebook

Alfa: