Using Machine Learning Algorithms to Predict an ADHD Diagnosis from fMRI Data

Dataset

Dataset from study: Pierre Bellec, Carlton Chu, François Chouinard-Decorte, Yassine Benhajali, Daniel S. Margulies, R. Cameron Craddock (2017). The Neuro Bureau ADHD-200 Preprocessed repository. NeuroImage, 144(Part B), 275-286. doi:10.1016/j.neuroimage.2016.06.034

This notebook contains an analysis of the ADHD-200 dataset available on Nilearn. The dataset contains resting state fMRI data from 40 subjects and their phenotypic information. Half of the subjects are patients diagnosed with ADHD and the remaining half are healthy controls. The subjects in the study are all children and adolescents. The analysis in this notebook is my attempt to predict ADHD diagnosis using resting state fMRI data.

I have modified the phenotypic file: instead of the phenotypic file that ships with the Nilearn dataset, I use an imported file ("phenotypics.csv"), because the bundled file does not distinguish between healthy and patient subject types. I derived that information from the dataset supplied at "http://preprocessed-connectomes-project.org/adhd200/download.html".

Background

This project uses machine learning techniques to predict ADHD diagnosis from fMRI data.

The data used in this project comes from the ADHD-200 study.

Machine learning algorithms are used in this project to classify patients with ADHD versus healthy controls. My aim is to gain experience processing fMRI data programmatically as well as with FSL. I will also experiment with different machine learning techniques (algorithms/classifiers, cross-validation methods for multi-voxel pattern analysis, and fine-tuning the hyperparameters of the classification models) and compare how each implementation performs.

Tools and Software Used

This project uses the following technologies:

  1. Python (with pandas and scikit-learn)
  2. Nilearn
  3. Plotly Express
  4. FSL

Phenotypic info for the subjects is included with the data, but I will perform some cleaning first.
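As a rough sketch of this loading step (the "phenotypics.csv" name is taken from the note above; the rest uses Nilearn's built-in fetcher):

```python
import pandas as pd
from nilearn import datasets

# Download the 40-subject preprocessed ADHD-200 sample shipped with Nilearn
adhd = datasets.fetch_adhd(n_subjects=40)
func_files = adhd.func  # one preprocessed 4D NIfTI path per subject

# Load the modified phenotypic file described above
pheno = pd.read_csv("phenotypics.csv")
```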

I'll extract the subject ID from the NIfTI file names using index slicing and then merge the fMRI file paths into the phenotypic data.
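A minimal sketch of the extraction and merge; the exact slicing depends on the local file layout, so I assume here that each basename starts with the numeric subject ID (as in the ADHD-200 preprocessed files), and the "Subject ID" column name is illustrative:

```python
import os

# Map each fMRI file path to the subject ID embedded in its file name
func_df = pd.DataFrame({"filepath": func_files})
func_df["Subject ID"] = func_df["filepath"].map(
    lambda p: int(os.path.basename(p).split("_")[0])
)

# Attach the file paths to the phenotypic rows via the subject ID
pheno = pheno.merge(func_df, on="Subject ID", how="inner")
```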

Now let us look at the data and save the phenotypic data to a CSV in the form I need.

Now let us make subsets for patients and controls, since the file names are matched to the phenotypic data.
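Something like the following covers both the CSV export and the group subsets, assuming the diagnosis column is named "subject_type" with the values "patient" and "control":

```python
# Keep a cleaned copy of the phenotypic table for later reuse
pheno.to_csv("phenotypic_cleaned.csv", index=False)

# Boolean masks split the table into the two diagnostic groups
patients = pheno[pheno["subject_type"] == "patient"]
controls = pheno[pheno["subject_type"] == "control"]
```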

The code below creates an interactive plot using Plotly Express, showing a histogram of subject age.
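A sketch of the Plotly Express call (the "age" and "subject_type" column names are assumptions about the phenotypic table):

```python
import plotly.express as px

# Overlaid age histograms, one trace per diagnostic group
fig = px.histogram(
    pheno, x="age", color="subject_type",
    barmode="overlay", nbins=20,
    title="Age distribution: patients vs. controls",
)
fig.show()
```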

Connectivity

This analysis uses the BASC atlas to define ROIs; I will focus on the 64-ROI parcellation.
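Nilearn ships this atlas; the returned bunch exposes one parcellation image per scale, and scale064 is the 64-ROI version used here (attribute names can vary slightly across Nilearn versions):

```python
# Fetch the BASC multiscale atlas and pick the 64-ROI parcellation
basc = datasets.fetch_atlas_basc_multiscale_2015()
atlas_filename = basc.scale064
```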

Now I will generate correlation matrices for each subject and then add them to the phenotypic data.
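A sketch of the per-subject pipeline: a labels masker reduces each 4D scan to one time series per ROI, and ConnectivityMeasure turns each time series into a 64x64 correlation matrix (in older Nilearn versions the masker lives in nilearn.input_data rather than nilearn.maskers):

```python
from nilearn.maskers import NiftiLabelsMasker
from nilearn.connectome import ConnectivityMeasure

masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True)
conn_measure = ConnectivityMeasure(kind="correlation")

corr_matrices = []
for func_file in pheno["filepath"]:
    time_series = masker.fit_transform(func_file)        # (n_timepoints, 64)
    corr = conn_measure.fit_transform([time_series])[0]  # (64, 64)
    corr_matrices.append(corr)

# One correlation matrix per subject, stored alongside the phenotypics
pheno["corr_matrix"] = corr_matrices
```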

Here is the pandas DataFrame with the complete demographic information and a column containing each subject's correlation matrix as an array.

Visualizing Connectivity
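For example, the group-average matrices mentioned in the summary can be drawn with Nilearn's matrix plotter (group labels assumed as before):

```python
import numpy as np
from nilearn import plotting

# Plot the mean 64x64 correlation matrix for each diagnostic group
for group in ("patient", "control"):
    mats = np.stack(list(pheno.loc[pheno["subject_type"] == group, "corr_matrix"]))
    plotting.plot_matrix(mats.mean(axis=0), vmin=-1.0, vmax=1.0,
                         colorbar=True, title=f"Mean correlations: {group}")
plotting.show()
```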

Classification

This section contains the main data analysis: predicting the ADHD diagnosis. The features are the correlation matrices generated above, and the diagnosis labels are contained in the subject_type column of the phenotypic data.

I will first split the data into training and testing sets with an 80/20 ratio.
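A sketch of the split; flattening each symmetric matrix to its lower triangle is one common convention for turning connectivity matrices into feature vectors, assumed here rather than taken from the original code:

```python
from sklearn.model_selection import train_test_split

# Vectorize each symmetric matrix: keep only the lower triangle
tril_idx = np.tril_indices(64, k=-1)
X = np.stack([m[tril_idx] for m in pheno["corr_matrix"]])
y = (pheno["subject_type"] == "patient").astype(int).to_numpy()

# 80/20 split, stratified to preserve the patient/control balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```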

My starting classifier will be a linear support vector machine (SVC() in scikit-learn), since it is widely recommended for classification problems with small sample sizes.

I will use 10-fold cross-validation to get a rough performance benchmark for each classifier, with F1 as my performance metric. After each run I will look at the classifier's performance across the folds as well as its average performance.
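For the linear SVC, the benchmark loop looks roughly like this (scikit-learn's cross_val_score handles the folding and F1 scoring):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc = SVC(kernel="linear")
fold_scores = cross_val_score(svc, X_train, y_train, cv=10, scoring="f1")

print("per-fold F1:", np.round(fold_scores, 2))
print("mean F1:    ", fold_scores.mean())
```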

Based on the above results, the linear SVC performs well, with an average F1 score of ~0.72.

I will now try gradient boosting as my classifier. The gradient boosting model will use more estimators and a larger max depth than the defaults to try to improve performance.
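A sketch with illustrative values for the enlarged n_estimators and max_depth (the scikit-learn defaults are 100 and 3; the exact values used in the original run are not recorded here):

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=500, max_depth=5)
gb_scores = cross_val_score(gb, X_train, y_train, cv=10, scoring="f1")
print("mean F1:", gb_scores.mean())
```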

Based on the above results, the gradient boosting model is highly variable and does not match the SVC's performance. Next, I will try K-nearest neighbors as my classifier.
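KNN is benchmarked the same way, with its default settings (n_neighbors=5):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn_scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="f1")
print("mean F1:", knn_scores.mean())
```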

Based on the above performance, K-nearest neighbors performs poorly with default parameters. Given the large gap between KNN and the other classifiers, I will not tune and retry it.

Now I will try a random forest classifier, increasing the number of estimators as I did with the gradient boosting model.
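Again with an illustrative estimator count (the scikit-learn default is 100):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500)
rf_scores = cross_val_score(rf, X_train, y_train, cv=10, scoring="f1")
print("mean F1:", rf_scores.mean())
```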

Based on the above results, the random forest model performed reasonably, but not as well as the linear SVC. With some parameter tuning, similar performance might be achievable, but since the random forest classifier is more complex and takes longer to train, I will use the SVC as the final model.

Parameter Tweaking

I will now see whether I can improve the performance of my SVC model by tweaking its hyperparameters. With a linear SVC, however, the only parameter to tune is C.

I will create a range of values for C and compare them using cross-validation.
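A sketch of the sweep; the range of C values is an assumption, chosen to span several orders of magnitude around the default of 1.0:

```python
import matplotlib.pyplot as plt

C_values = np.logspace(-3, 3, 7)
mean_f1 = [
    cross_val_score(SVC(kernel="linear", C=c), X_train, y_train,
                    cv=10, scoring="f1").mean()
    for c in C_values
]

plt.semilogx(C_values, mean_f1, marker="o")
plt.xlabel("C")
plt.ylabel("mean 10-fold F1")
plt.show()
```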

Based on the above plot, the model seems to perform best at a C value of 0.1, but the difference is minor. I will try one more thing.

I will switch the SVC kernel to the default 'rbf', which lets me adjust both C and gamma, and use a grid search to see whether an optimized RBF kernel outperforms the linear kernel.
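A sketch of the grid search; the grid bounds are assumptions, made wide enough to include the C = 1e7 and gamma = 1e-8 values reported in the results below:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {"C": np.logspace(-3, 8, 12), "gamma": np.logspace(-9, 1, 11)}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="f1")
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("best CV F1: ", grid.best_score_)
```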

Based on the above results, the SVC with an RBF kernel and tuned hyperparameters performs about the same as the SVC with a linear kernel, so I could select either as my final model. I will choose the RBF-kernel SVC because it offers more room for adjustment if I spend more time fine-tuning the model.

Testing the model

I will now run the model on the held-out testing data and check how accurately it performs.
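The evaluation itself is a straightforward predict-and-score pass on the held-out 20%:

```python
from sklearn.metrics import accuracy_score, f1_score

# grid.best_estimator_ was already refit on the full training set
final_model = grid.best_estimator_
y_pred = final_model.predict(X_test)

print("test F1:      ", f1_score(y_test, y_pred))
print("test accuracy:", accuracy_score(y_test, y_pred))
```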

Based on the above results, an F1 score of 0.54 is not bad for a binary classification problem, given that my training dataset was almost perfectly balanced between control and patient subjects. I will check how the model handles each class by looking at the confusion matrix.
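For example, with scikit-learn 1.0+ the confusion matrix can be plotted directly from the predictions (class labels assumed as before, with control encoded as 0 and patient as 1):

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Rows: true class; columns: predicted class
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=["control", "patient"]
)
plt.show()
```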

Based on the above chart, the model predicts the patient group well but performs poorly on control subjects. Since my dataset was very small (total = 40, training = 32), it is difficult to improve the classifiers much further. Overall, I am satisfied with the performance of machine learning classifiers for predicting an ADHD diagnosis from a given sample, and I would like to extend this work with a larger dataset in the future, as the approach looks promising.

SUMMARY

Data Cleaning

Data was prepared by extracting subject IDs from the NIfTI file paths and merging them with the phenotypic data to create the final dataset. This made it easy to subset attributes for data visualizations. The last step was to generate a time series and correlation matrix for each subject; the correlation matrices were then added to each subject's row in the phenotypic data.

Data Visualization

Plots of the fMRI features were created showing the average activation for patients and controls. A Plotly Express interactive visualization presents a histogram of age for both groups. I also created several other visualizations, including the average correlation matrices for controls and patients.

Classification

My primary goal in this project was to predict an ADHD diagnosis from fMRI data. I tried and assessed various machine learning techniques on the dataset. I split the dataset 80/20 into training and testing sets. Each classifier was evaluated on the training data using 10-fold cross-validation, with the F1 score as the performance measure. I also used a grid search to fine-tune hyperparameters.

RESULTS

Classifier models:

  1. Linear Support Vector Machine (average CV F1 = 0.72, min F1 = 0.25, max F1 = 1.0)
  2. Gradient Boosting (average CV F1 = 0.38, min F1 = 0.0, max F1 = 0.73)
  3. K-Nearest Neighbors (average CV F1 = 0.47, min F1 = 0.25, max F1 = 1.0)
  4. Random Forest (average CV F1 = 0.52, min F1 = 0.25, max F1 = 1.0)
  5. RBF Support Vector Machine (average CV F1 = 0.72, min F1 = 0.25, max F1 = 1.0)

Both the linear-kernel Support Vector Machine Classifier (SVC) and the RBF-kernel SVC performed equally well, but I chose the RBF-kernel SVC as my final model so that I can experiment with and modify it further on a larger dataset in the future. The tuned values were C = 10000000.0 and gamma = 1e-08. I then used this model to predict the ADHD diagnosis on the testing set, where it achieved a final F1 score of 0.55 and an accuracy of 0.38.

Conclusion and Limitations

An F1 score of 0.55 is not bad for a binary classification task given such a small sample size and a dataset with an equal distribution (0.5) of patients and controls. It is certainly possible to improve the performance with several other techniques. My analysis used 64 ROIs, but the atlas supports up to 444; increasing the number of features might improve the model's performance. Alternatively, dimensionality reduction could be used to decrease the number of features available to the model. Fine-tuning the hyperparameters of more complex machine learning models could also improve performance.

I gained a lot of experience working with fMRI data and applying machine learning techniques to neuroimaging analysis. The available documentation, readings, and textbook made it easier to learn and apply the tools and techniques I used in this project. I was also able to improve my programming skills, specifically as applied to fMRI data. In the future, I would like to further develop my fMRI analysis skills and experiment with more advanced machine learning algorithms.