Analysis of biological data using machine learning algorithms, with application to the diagnosis of gastrointestinal cancer (Bachelor thesis)

Περήφανος, Αλέξανδρος


The purpose of this thesis is to identify, using machine learning methods, the smallest possible set of genes that carries enough information to build a classifier that performs very well and generalizes to unseen data. The dataset concerns gastrointestinal cancer and was taken from The Cancer Genome Atlas (TCGA) database. The form of cancer studied has five types, of which we deal with four: esophageal, stomach, pancreatic and gallbladder. The target variable is the presence or absence of the disease. The gene expression data were produced by RNA sequencing (RNA-Seq). First, the genes common to all the cancer types under analysis were identified. Then, using the dimension reduction/feature selection methods Kolmogorov-Smirnov two-sample test (KS 2 Samples Test), Mutual Information (MI) and Recursive Feature Elimination with Cross-Validation (RFE-CV), we evaluated the genes and ranked them by significance. For the first two (KS 2 Samples and MI), we selected the top-ranked genes of each method, in subsets ranging from 10 to 5000 genes. The third (RFE-CV) ranks features according to the weights assigned by the underlying model, repeatedly removes features at a given step size, and then selects the best number of features through cross-validation. We compared classifiers and then ran a grid search with k-fold cross-validation to find the parameters at which the most effective classifiers achieve their best results. In addition, we applied data standardization and, because the sample was highly imbalanced, generated synthetic data for the minority class (healthy).
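The two filter-style rankings described above (KS two-sample test and mutual information) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the data are a small synthetic stand-in rather than TCGA expression values, and the helper names `rank_genes_ks` and `rank_genes_mi`, the label coding (1 = tumour, 0 = healthy) and the subset size `top_k` are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.feature_selection import mutual_info_classif


def rank_genes_ks(X, y):
    """Return feature indices sorted by two-sample KS statistic, largest first."""
    stats = np.array([
        ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
        for j in range(X.shape[1])
    ])
    return np.argsort(stats)[::-1]


def rank_genes_mi(X, y, seed=0):
    """Return feature indices sorted by estimated mutual information with y."""
    mi = mutual_info_classif(X, y, random_state=seed)
    return np.argsort(mi)[::-1]


# Synthetic stand-in for an expression matrix (assumption: samples x genes).
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
y = (rng.random(n_samples) < 0.8).astype(int)   # imbalanced labels, 1 is the majority
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :3] += 3.0                            # make genes 0, 1, 2 informative

ks_order = rank_genes_ks(X, y)
mi_order = rank_genes_mi(X, y)

# Keep the top-k genes of each ranking; the thesis varies this from 10 to 5000.
top_k = 10
X_ks = X[:, ks_order[:top_k]]
X_mi = X[:, mi_order[:top_k]]
```

Both rankings score each gene independently of the others, which is why they are cheap enough to apply to thousands of genes before any classifier is trained.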
The experiments showed that RFE-CV is superior to the other evaluation criteria we tested: it found the smallest number of significant genes carrying enough information to build an SVM classifier with an RBF kernel that generalizes better than classifiers built on the gene subsets produced by the other criteria (the classifier's performance did not decrease even after we applied synthetic data generation). In addition, the best results were achieved when the gene values were standardized.
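A pipeline of the kind the abstract describes (standardization, RFE-CV, then an RBF-kernel SVM tuned by grid search with k-fold cross-validation) might look like the sketch below. Everything specific here is an assumption for illustration: the data come from scikit-learn's `make_classification`, the ranking model inside `RFECV` is a `LinearSVC` (an RBF-kernel `SVC` exposes no feature weights, so some weight-bearing model is needed, and the abstract does not say which one was used), and the step size, parameter grid and scoring metric are hypothetical. The synthetic minority-class generation step is left out because the abstract does not name the method.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic, imbalanced stand-in for the expression data (assumption).
X, y = make_classification(
    n_samples=200, n_features=40, n_informative=5, n_redundant=0,
    weights=[0.2, 0.8], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pipe = Pipeline([
    # Standardize each gene to zero mean and unit variance.
    ("scale", StandardScaler()),
    # Drop 5 features per round, ranked by the linear model's weights;
    # cross-validation picks how many features to keep.
    ("select", RFECV(
        LinearSVC(C=0.1, dual=False, max_iter=5000),
        step=5, cv=cv, scoring="balanced_accuracy",
    )),
    # Final classifier trained on the selected features only.
    ("clf", SVC(kernel="rbf")),
])

# Hypothetical grid; the thesis does not list the values it searched.
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="balanced_accuracy")
search.fit(X, y)

n_selected = search.best_estimator_.named_steps["select"].n_features_
print("selected features:", n_selected)
print("best parameters:", search.best_params_)
```

Keeping the scaler and the selector inside the pipeline means they are refit on each training fold, so the cross-validated score is not inflated by information from the held-out fold.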
Institution and School/Department of submitter: School of Technological Applications - Department of Informatics Engineering
Subject classification: Machine learning
Digestive organs -- Cancer -- Diagnosis
Keywords: Machine Learning;Dimension Reduction Methods;Feature Selection Methods;Mutual Information;rfecv;Kolmogorov-Smirnov 2 Samples;SVM;Gastrointestinal Cancer;Esophageal Cancer;Stomach Cancer;Pancreatic Cancer;Gallbladder Cancer;Genes;RNA Sequencing (RNA-Seq)
Description: Bachelor thesis - School of Technological Applications, 2019 (serial no. 11029)
URI: http://195.251.240.227/jspui/handle/123456789/15035
Appears in Collections: Postgraduate Theses

Files in This Item:
File: PERIFANOS.pdf
Size: 1.5 MB
Format: Adobe PDF




Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.