COS7028-B Data Mining Assessment

September 14, 2023

COS7028-B Data Mining Assessment: Coursework (100%)

Date Issued: Semester 2, Monday 23rd January 2017 (Week 1)
Date Due: Semester 2, Monday 7th May 2017 (Week 13)

The main aim of this coursework is to critically analyse data sources and data sets, critically evaluate possible data analytics challenges and solutions, choose, design and implement data mining algorithms for the chosen data, and apply the data mining techniques to specific case studies. The coursework is worth 100 marks, and the marks for each section are detailed below.

Submission is electronic: a report with solutions to the exercises below, in .pdf format, and all files generated to solve the exercises should be submitted using the Blackboard student folder, following the instructions made available on the VLE. Give the files suggestive names, preferably linked to the question they relate to (e.g. Q1a, Q2e etc).

You are expected to explore a data set of your choice from (open) data mining/machine learning sources and resources, to develop case studies, and to apply data mining techniques to this data set for supervised and/or unsupervised learning, whichever you motivate as suitable given the data set characteristics.

Exercises A, B, G are compulsory, and you will choose 2 exercises from C, D, E, F:

A. [20 marks] Data Choice
Name the chosen data set (from module resources, UCI ML Repository or other open data sources or own collection) and describe the data set (e.g. attribute types and values, source of data) [5 marks]. Describe the data mining problem (and background) you will address [5 marks]. Introduce the data mining question(s) related to the problem: e.g. as a classification, prediction, association, clustering, or text mining related exercise [10 marks].

B. [20 marks] Data Analysis
Analyse the data. Apply descriptive statistics techniques (using, for example, the studied packages) to analyse the data set [10 marks]. Add functionality/ies for descriptive statistics in view of this problem: details of contextual programming and of the use of graphical representation and analysis of data are expected [10 marks]. For example, sort the data by class, and line or bar plot each of the features individually if applicable; for each feature compute characteristics like its minimum, maximum, mean, mode and standard deviation, and study the correlation between features for each class or the distance matrix. If attributes are not normalised, then this step will also be considered.

C. [20 marks] Classification Task
Pick (and motivate your choice) an appropriate classifier (e.g. multilayer perceptron, decision tree, linear regression, Naïve Bayes etc.). Choose and motivate some default parameters for the classifier: the choice of classifier and parameters, and the training/testing strategy (e.g. percentage split, cross-validation etc.), could be chosen by the user of your application. Add functionality for training the classifier using your training/testing strategy [5 marks]. Develop your application by making use of existing data mining algorithms (available, for example, as Weka classes or R scripts) for the classification/prediction problem based on the chosen data set [5 marks]. Add one functionality for visualisation of the performance of the model and interpret the obtained results (e.g. using performance metrics and/or comparison with literature) [5 marks]. A choice for re-training should be offered to the user, e.g. if the results are unsatisfactory one could consider altering the model (i.e. changing the classifier, its parameters, and/or your training/testing strategy), and re-training/testing [5 marks].

D. [20 marks] Clustering Task
Pick (and motivate your choice) an appropriate clustering algorithm to be developed (e.g. using Weka classes, R scripts) [5 marks]. Develop your application by making use of the clustering algorithm. Add functionality/ies for descriptive statistics in view of this problem [5 marks]. Include one visualisation functionality to interpret the clustering results and analyse them [5 marks]. Interpret results based on setting choices, initial data types and range [5 marks].

E. [20 marks] Association Rule Mining Task
Study the applicability of association rule mining (e.g. data types, size, and type of problem) [5 marks]. Develop your application by making use of existing algorithms (e.g. Weka classes, R scripts) for association rule mining [5 marks]. Demonstrate the correctness of the development by providing a case study [5 marks] and interpreting the results [5 marks].

F. [20 marks] Text Mining Task
Study the applicability of text mining (e.g. data types, and type of problem) [5 marks]. Develop your application by making use of existing algorithms (e.g. Weka classes, R scripts, KNIME workflows) for text mining [5 marks]. Discuss results providing a case study [5 marks] and interpreting visual and numerical outputs [5 marks].

G. [20 marks] Critical Review
What difficulties did you have using the tools/techniques as above? Provide reflections focused on technical, interpretational and functional issues [5 marks]. Discuss the results of some case studies you developed for each technique and any interesting observations by comparing them with published work (use journal papers, conference proceedings, books and online resources available) [5 marks]. What did you observe? What conclusions could you deduce/induce from each result? Which techniques were most helpful to evaluate, explore, analyse, or answer? How did you compare techniques? What techniques/activities would you do next (in future follow-up work) to continue this work? [10 marks]

COS7028-B Data Mining Assessment: Marking Scheme

A. [20 marks]:
– name and describe data set and include source: 2 marks
– attribute and value domain discussion: 2 marks
– literature review (background): 1 mark
– problem specifications provided: 5 marks
– list the data mining techniques applicable for the exercise: 5 marks
– motivate the choice and applicability of the data mining techniques (such as referring to the existing literature): 5 marks

B. [20 marks]:
– details and interpretation of data analysis using existing environments (such as KNIME, WEKA or R): 10 marks, 1 mark for each of the relevant ones: minimum, maximum, mean, mode and standard deviation, and correlation or similar for categorical values
– presentation of own code/workflow/process to make use of existing data analysis packages: 5 marks
– quality of code (comments, documentation): 5 marks

C. [20 marks]:
– name the classification/prediction algorithm: 2 marks
– motivate choice (e.g. use references to similar exercises): 2 marks
– classifier parameter settings (includes motivation): 2 marks
– training and testing strategies nomination and motivation: 2 marks (1 for training, 1 for testing)
– class/application documentation: 2 marks
– result visualisation coding: 2 marks
– visualisation code documentation: 2 marks
– result interpretation (to include results reported in the literature): 2 marks (including references 1 mark)
– setting a retraining/tuning strategy: 2 marks
– final result comparison: 2 marks

D. [20 marks]:
– name the clustering algorithm: 2 marks
– motivate choice (e.g. use descriptive statistics, references to similar exercises): 2 marks
– parameter settings (includes motivation): 2 marks
– training and testing strategies nomination and motivation: 2 marks
– class/application coding: 2 marks
– class/application documentation: 2 marks
– result visualisation coding: 2 marks
– visualisation code documentation: 2 marks
– result interpretation (to include results reported in the literature): 2 marks (including references 1 mark)
– conclusions: 2 marks

E. [20 marks]:
– name the association rule mining algorithm: 2 marks
– motivate choice (e.g. use descriptive statistics, references to similar exercises): 2 marks
– parameter settings (includes motivation): 2 marks
– discussion of applicability: 2 marks
– class/application documentation: 2 marks
– result visualisation: 2 marks
– visualisation code documentation: 2 marks
– case study description: 2 marks
– result interpretation (to include results reported in the literature): 2 marks (including references 1 mark)
– conclusions: 2 marks

F. [20 marks]:
– name the text mining algorithm: 2 marks
– motivate choice (e.g. use descriptive statistics, references to similar exercises): 2 marks
– parameter settings (includes motivation): 2 marks
– discussion of applicability: 2 marks
– class/application documentation: 2 marks
– result visualisation: 2 marks
– visualisation code documentation: 2 marks
– case study description: 2 marks
– result interpretation (to include results reported in the literature): 2 marks (including references 1 mark)
– conclusions: 2 marks

G. [20 marks]:
– name difficulties: 2 marks
– reflection on what went wrong: 3 marks
– discussion and comparison of your work with published references: 5 marks
– reflection/discussion of lessons learned on techniques used and relevant use of references: 5 marks
– future work provision: 5 marks
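
As an illustration of the per-feature statistics listed under B (minimum, maximum, mean, mode, standard deviation, and correlation between features), the sketch below computes them with the Python standard library. It is only a minimal sketch and not part of the brief, which suggests Weka, R or KNIME. The feature names and values are made-up toy data, and `describe` and `pearson` are hypothetical helper names chosen for this example.

```python
import math
import statistics


def describe(values):
    """Return the per-feature statistics named in exercise B."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length features."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data (an assumption for illustration): two numeric features.
data = {
    "length": [1.0, 2.0, 2.0, 3.0, 4.0],
    "width": [2.0, 4.0, 4.0, 6.0, 8.0],
}

summary = {name: describe(column) for name, column in data.items()}
corr = pearson(data["length"], data["width"])

for name, stats in summary.items():
    print(name, stats)
print("correlation(length, width) =", round(corr, 6))
```

In this toy data one column is exactly twice the other, so the correlation comes out at 1 (up to floating-point rounding). On a real data set the same calculation could be repeated per class after grouping the rows by their class label.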
