Data population reduction techniques with tolerance to missing values (Master's thesis)

Κουκάρας, Πολυχρόνης


In recent years, large amounts of training data from various sources have become available on a daily basis. Classification algorithms usually cannot use such volumes directly because of the high computational cost and the high memory requirements. Therefore, the data is often pre-processed by Data Reduction Techniques to reduce both. Many data reduction techniques have been proposed in the literature, most of them designed for the k-Nearest Neighbor (k-NN) classifier. However, these techniques cannot handle the missing values that commonly appear in real training data sets. Thus, before a data reduction technique is applied, a separate pre-processing step is needed to impute the missing values. Several imputation methods exist in the literature, and this thesis presents the most important ones. However, the extra pre-processing step is a major drawback because it adds computational cost. This is the motivation for this thesis, which proposes a new variant of a data reduction technique that can handle missing values without the additional imputation step. The technique is a Prototype Generation algorithm called Editing and Reduction through Homogeneous Clusters (ERHC). The new ERHC variant handles missing values by using the partial distance technique and by applying a k-means clustering that ignores the missing values. In addition, the performance of ERHC has been tested after imputing the missing values with the mean-per-class imputation method. The two ERHC variants are compared to each other and to the k-NN classifier applied to the unreduced data, through experiments on 13 data sets that estimate the classification accuracy and the reduction rate achieved by the two ERHC variants. The experimental results show remarkable performance for both variants of the ERHC algorithm.
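The two ways of handling missing values named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names `partial_distance` and `class_mean_impute` are hypothetical, missing values are assumed to be encoded as NaN, and the rescaling of the partial distance by (total attributes / attributes used) follows the commonly cited form of the technique, which may differ from the exact variant used in ERHC.

```python
import math


def partial_distance(a, b):
    """Euclidean distance computed only over attributes present in both
    vectors, rescaled by (total attributes / attributes used).
    Assumption: missing values are encoded as NaN."""
    used = [(x, y) for x, y in zip(a, b)
            if not (math.isnan(x) or math.isnan(y))]
    if not used:
        # No attribute is observed in both vectors; treating the pair as
        # infinitely distant is a choice made for this sketch.
        return math.inf
    squared = sum((x - y) ** 2 for x, y in used)
    return math.sqrt(squared * len(a) / len(used))


def class_mean_impute(rows, labels):
    """Replace each NaN with the mean of that attribute over the rows of
    the same class. A NaN is left unchanged when the class has no
    observed value for that attribute."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if not math.isnan(value):
                sums[(label, j)] = sums.get((label, j), 0.0) + value
                counts[(label, j)] = counts.get((label, j), 0) + 1
    imputed = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            if math.isnan(value) and (label, j) in counts:
                new_row.append(sums[(label, j)] / counts[(label, j)])
            else:
                new_row.append(value)
        imputed.append(new_row)
    return imputed
```

With the first function, distances can be computed directly on incomplete data, so no imputation pass is needed; the second function corresponds to the separate pre-processing step that the imputation-based variant relies on.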
Alternative title / Subtitle: Data reduction techniques with missing values tolerance
Institution and School/Department of submitter: School of Engineering - Department of Informatics and Electronic Systems Engineering
Keywords: Data Reduction Techniques; Missing Values Imputation; ERHC; Partial distance; K-means Clustering; Nearest Neighbor Classification; Calculation
Description: Master's thesis - School of Engineering - Department of Informatics and Electronic Systems Engineering, 2020 (serial no. 11965)
URI: http://195.251.240.227/jspui/handle/123456789/15604
Item type: masterThesis
General Description / Additional Comments: Master's thesis
Submission Date: 2023-01-25T12:26:09Z
Item language: el
Item access scheme: free
Publication date: 2020-07-15
Bibliographic citation: Κουκάρας, Π. (2020). Τεχνικές μείωσης του πληθυσμού των δεδομένων με ανεκτικότητα στις απούσες τιμές (Μεταπτυχιακή εργασία). ΔΙΠΑΕ.
Advisor name: Ουγιάρογλου, Στέφανος
Examining committee: Ουγιάρογλου, Στέφανος
Διαμαντάρας, Κωνσταντίνος
Δέρβος, Δημήτριος
Publishing department/division: School of Engineering - Department of Informatics and Electronic Systems Engineering
Publishing institution: ihu
Number of pages: 94
Appears in Collections: Undergraduate Theses

Files in This Item:
File: ΚΟΥΚΑΡΑΣ ΠΟΛΥΧΡΟΝΗΣ-ΔΙΠΛΩΜΑΤΙΚΗ ERHC-IMP-PD.pdf
Description: Master's thesis
Size: 2.39 MB
Format: Adobe PDF



 Please use this identifier to cite or link to this item:
http://195.251.240.227/jspui/handle/123456789/15604

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.