Nearest neighbour classification on data streams and systems with limited memory via prototype replacement policies (Bachelor thesis)

Αραμπατζής, Γεώργιος


The k Nearest Neighbour classifier, due to the fashion it functions, can not handle very large volumes of training data. Thus, rendering the use of k-NN as an inappropriate choice when it comes to streaming environments. Adding to that list, most conventional Data Reduction Techniques, that operate by either selecting training prototypes or by generating a set of training prototypes, are also categorized as unsuitable for such environments. The “dynamic Reduction through Homogeneous Clusters” algorithm, dynamic RHC in short, is a prototype abstraction algorithm. dRHC has the advantages, compared to other data reduction techniques, that it can update its condensing set every time a new training data arrives. Thus, being able to handle large amount of training data. Although, dynamic RHC does not take into account the phenomenon of concept drift, that resides in most data streams. In addition to this, after many consecutive condensing set updates, its size keeps increasing, making its use prohibitive for classification tasks. All the disadvantages mentioned earlier comprise the core motivation behind this thesis. Trying to tackle the problems dRHC algorithm was facing, the new dRHC2 algorithm was proposed. While keeping all the advantages dRHC has, dRHC2 uses a certain, predefined by the user, threshold in order to limit the infinite growth of the condensing set in dRHC. dRHC2 ranks its prototypes, inside the condensing set, in order to select the most relevant ones. More specific, dRHC2 has two different heuristics. One heuristics is considered more appropriate when the goal is solely to restraint the steadily increasing condensing set, while still maintaining high percentage during the classification accuracy. The other heuristic, takes into consideration the phenomenon of concept drift, tackling the other main weakness of dRHC. Both heuristics of the proposed algorithm dRHC2 were evaluated on several datasets, based on the classification accuracy, the reduction rate and the preprocessing cost. In addition, the algorithm was compared to its predecessor, dRHC, the experimental measurements of which were validated by statistical tests of significance. Finally, the results indicate that, not only dRHC2 manages to be a lot faster, but also excels in accuracy, establishing its superiority over dRHC.
Institution and School/Department of submitter: Σχολή Τεχνολογικών Εφαρμογών- Τμήμα Μηχανικών Πληροφορικής
Subject classification: Workflow--Management--Computer programs
Ροή εργασίας--Διαχείριση--Προγράμματα υπολογιστών
Data reduction
Μέιωση δεδομένων
Heuristic algorithms
Ευριστικοί αλγόριθμοι
Keywords: k nearest neighbours, k-NN classification, data reduction, classification, clustering, prototypes, prototype selection, prototype abstraction, editing, condensing, dynamic environments, streaming environments, data streams, k-means, dRHC, classification accuracy, reduction rate, preprocessing cost, computational cost, Wilcoxon signed ranks test;k πλησιέστεροι γείτονες, ταξινόμηση k-NN, μείωση δεδομένων, ταξινόμηση, ομαδοποίηση, πρωτότυπα, επιλογή πρωτοτύπου, αφαίρεση πρωτοτύπου, επεξεργασία, συμπύκνωση, δυναμικά περιβάλλοντα, περιβάλλοντα ροής, ροές δεδομένων, k-mean, dRHC, ακρίβεια ταξινόμησης, ποσοστό μείωσης, κόστος προεπεξεργασίας, υπολογιστικό κόστος, Wilcoxon δοκιμασμένη βαθμολογία
Description: πτυχιακή εργασία- Σχολή Τεχνολογικών Εφαρμογών- Τμήμα Μηχανικών Πληροφορικής, 2017(α/α 8413)
URI: http://195.251.240.227/jspui/handle/123456789/13324
Item type: bachelorThesis
General Description / Additional Comments: πτυχιακή εργασία
Subject classification: Workflow--Management--Computer programs
Ροή εργασίας--Διαχείριση--Προγράμματα υπολογιστών
Data reduction
Μέιωση δεδομένων
Heuristic algorithms
Ευριστικοί αλγόριθμοι
Item access scheme: account
Institution and School/Department of submitter: Σχολή Τεχνολογικών Εφαρμογών- Τμήμα Μηχανικών Πληροφορικής
Publication date: 2017-02-20
Bibliographic citation: Αραμπατζής, Γ. (2017).Κατηγοριοποίηση εγγύτερων γειτόνων σε συστήματα ροών δεδομένων και συστήματα με περιορισμένη μνήμη μέσω πολιτικών αντικατάστασης αντιπροσωπευτικών αντικειμένων . Nearest neighbour classification on data streams and systems with limited memory via prototype replacement policies(πτυχιακή εργασία).Αλεξάνδρειο ΤΕΙ Θεσσαλονίκης
Abstract: The k Nearest Neighbour classifier, due to the fashion it functions, can not handle very large volumes of training data. Thus, rendering the use of k-NN as an inappropriate choice when it comes to streaming environments. Adding to that list, most conventional Data Reduction Techniques, that operate by either selecting training prototypes or by generating a set of training prototypes, are also categorized as unsuitable for such environments. The “dynamic Reduction through Homogeneous Clusters” algorithm, dynamic RHC in short, is a prototype abstraction algorithm. dRHC has the advantages, compared to other data reduction techniques, that it can update its condensing set every time a new training data arrives. Thus, being able to handle large amount of training data. Although, dynamic RHC does not take into account the phenomenon of concept drift, that resides in most data streams. In addition to this, after many consecutive condensing set updates, its size keeps increasing, making its use prohibitive for classification tasks. All the disadvantages mentioned earlier comprise the core motivation behind this thesis. Trying to tackle the problems dRHC algorithm was facing, the new dRHC2 algorithm was proposed. While keeping all the advantages dRHC has, dRHC2 uses a certain, predefined by the user, threshold in order to limit the infinite growth of the condensing set in dRHC. dRHC2 ranks its prototypes, inside the condensing set, in order to select the most relevant ones. More specific, dRHC2 has two different heuristics. One heuristics is considered more appropriate when the goal is solely to restraint the steadily increasing condensing set, while still maintaining high percentage during the classification accuracy. The other heuristic, takes into consideration the phenomenon of concept drift, tackling the other main weakness of dRHC. Both heuristics of the proposed algorithm dRHC2 were evaluated on several datasets, based on the classification accuracy, the reduction rate and the preprocessing cost. In addition, the algorithm was compared to its predecessor, dRHC, the experimental measurements of which were validated by statistical tests of significance. Finally, the results indicate that, not only dRHC2 manages to be a lot faster, but also excels in accuracy, establishing its superiority over dRHC.
Advisor name: Δέρβος, Δημήτριος
Examining committee: Δέρβος, Δημήτριος
Publishing department/division: Σχολή Τεχνολογικών Εφαρμογών- Τμήμα Μηχανικών Πληροφορικής
Publishing institution: teithe
Number of pages: 86 σελ.
Appears in Collections:Πτυχιακές Εργασίες

Files in This Item:
There are no files associated with this item.



 Please use this identifier to cite or link to this item:
http://195.251.240.227/jspui/handle/123456789/13324
  This item is a favorite for 0 people.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.