Data Mining and Visualisation Coursework

Task 1: Your Personal Data Warehouse Application (PDWA)

As the module progresses you will build a substantial data warehouse application for a real-world scenario of your choosing. You will

design a star schema for the data warehouse. The data warehouse is a metaphor for multidimensional data storage. The actual physical

storage of such data may differ from its logical representation. Assuming the data are stored in a relational database (RDMS), you will

create an actual data warehouse using either Microsoft Access, Microsoft SQL Server, or Oracle, etc.

Your first step is to identify the domain you would like to manage with your data warehouse, and to construct an entity-relationship

diagram for the data warehouse. I suggest that you pick an application that you will enjoy working with –a hobby, material from

another course, or a research project, etc.

Try to pick an application that is relatively substantial, but not too enormous. For example, a data warehouse for a university

consists of the following four dimensions: student, module, semester, and lecturer, and two measures: count and avg_grade. When at the

lowest conceptual level (e.g., for a given
student, module, semester and lecturer combination), the avg_grade measure stores the actual module grade of the student. At higher

conceptual levels, avg_grade stores the average grade for the given combination. [Note: in your coursework, you should not use the

university scenario or similar
ones any longer!]

Your data warehouse should consist of at least four dimensions, one of which should be time dimension, when expressed in the entity-

relationship model, you might want your design to have one fact table plus four(or more) dimensional tables, and a similar number of

relationships. You should certainly include
one-many relationships. Each dimension has at least three levels (including all), such as student < course < university (all).

a) Describe the data warehouse application you propose to work with and the objectives of the possible OLAP. Your description should be

brief and relatively formal. If there are any unique or particularly difficult aspects of your proposed application, please point them

out. Your description will be graded only on suitability and conciseness.
[3 marks]

b) [ROLAP] Draw a star schema diagram including attributes for your data warehouse. Don’t forget to underline primary key attributes

and include arrowheads indicating the multiplicity of relationships. Create database tables for your PDA in your chosen RDMS, e.g.,

using the CREATE TABLE commands (Pick suitable datatypes for each attribute), or using INSERT commands to insert tuples. You need to

populate the data warehouse with sample data (at least five attributes for each dimensional table and at least three records in each

table) for manipulating the data warehouse. For this task, you ONLY need to submit the star schema diagram, and the populated tables

(screenshots).
[8 marks]

c) [ROLAP] Starting with the base cuboid [e.g., student, module, semester, lecturer], carry out two OLAP operations that you have set

in a). For example, what specific OLAP operations should you perform in order to list the average grade of the Data Mining module for

each university student in the university scenario? Write and run an SQL query to obtain your list resembling the
above example. Provide a screenshot as a proof your query worked.
[8 marks]

Task 2: [30 marks in total] – Files Attached (Bank & Question3DataSetSupplement)

A dataset in .ARFF format has been attached for you. Analyse this dataset using the WEKA toolkit and tools introduced within this

module.

Produce a report to recognise the pattern of the data. In your answer, you should explain which algorithms you used and why, what

results you obtained, and what this tells you about the data.

Marks will be awarded for: variety of algorithms used, quality of analysis, and interpretation of the results. An extensive report is

not required (at most 2000 words), nor is detailed explanation of the algorithms employed, but any graphs or tables produced should be

described and analysed in the text.

A reasonable report could be achieved by doing a thorough analysis using three algorithms. An excellent report would use at least four

tools to analyse the dataset, and provide detailed comparisons between the results.

You should perform the following tasks:

1. Analyse the attributes in the data, and consider their relative importance with respect to the target class. You may visualise the

attributes and perform association test for this task.
[6 marks]

2. Construct graphs of classification performance against different sizes of training-testing splitting, for a range of classifiers

taken from those considered in the module. Interpret the performance criteria you choose to use in the graphs. Analyse the results and

indicate what this tells you about the classifiers.
[9 marks]

3. Analyse the data structure/representation generated by each classifier using cross validation with various numbers of folds in

tables/graphs. Explain what this tells you about the data.
[9 Marks]

4. Combine the results from the previous three tasks and all your classifiers to develop a model of why instances fall into particular

classes. The model could be a rule set, a decision tree or a formula, etc.
[6 Marks]

Produce a report containing your answers to the above.
[Total 30 marks]