Database and Data Mining
IST-4-DBM1 - ECTS 3.0
Part 1: Databases
Abstract
This course introduces the most important concepts concerning the design and the use of databases. It covers the main issues in data modeling when considering the relational data model (structure, languages like relational algebra and relational calculus) and the key features of attribute-oriented models with respect to constructor-oriented models (like, for instance, Extended Entity-Relationship models). The fundamental concepts like the constraints and the normal forms will be explained and their impact on the quality of designed databases will be discussed. Finally, practical aspects of relational database design and use will be discussed. It includes a presentation of the SQL query language but also a short introduction to data warehousing and On Line Analytical Processing queries (OLAP). To be concrete, the students will have to understand the design of a real relational database and to practice database querying with a professional DataBase Management System (DBMS).
Note: Slides are taken and partially modified from https://www.db-book.com/
Schedule
Date | Lecture | Slides | Chapters | Extra |
---|---|---|---|---|
04/10/2021 | Introduction to DB | 1 | Chp1 | |
06/10/2021 | Relational Algebra and ER Modelling | 1 2 until 43 | Chp 2, 6 | Extended ER Figure (slides) |
08/10/2021 | Project Setup + Practice with Postgres | 1 2 | Chp 3, 4 | Docker pull, build, run, exec, + Docker compose |
11/10/2021 | Normal Forms | 1 | Chp 7 | |
13/10/2021 | Normal Forms pt 2 | 1 | Chp 7 | |
15/10/2021 | Exercises on SQL | Chp 3,4 | ||
18/10/2021 | Index & Transactions | Chp 14,17 | ||
20/10/2021 | Index & Transactions | Chp 14,17 | ||
10/11/2021 | Project Presentations |
Projects Description
The course exam consists into a small project to perform in pairs. Each project should have a presentation and 2-5 pages project report.
Project Checklist
- Identify the domain of choice, e.g., internet memes
- formulate some questions in natural language that you would like to answer
- submit the questions for approval
- design the database and perform data entry in Postgres
- ER Diagram and comments about the design decision (using appropriate terminology)
- Implement the questions in SQL, analyse the query plan in relational algebra, experiment alternative implemetations
- Basic queries should cover standard RA operators, Joins, and aggregation functions
- Advanced features nice to show example of
- Indexes
- Transactions
NB: Full-stack application is not required, but encouraged.
Project Grading
- 1-15
- Natural Language Queries
- ER Diagram Design
- SQL Implementation
- DDL + Data Entry
- Basic SQL Usage (no joins) + Relational Algebra (should also go in the presentation)
- Intermediate SQL usages
- Joins and Aggregations
- Integrity Constratints (check or queries) + Ad-hoc data to test
- Indexes and Transaction
- 15-20
- 10 minutes Presentation ( you both have to speak) + 5 minutes Questions (Individual)
- Use of appropriate terminology (also when speaking)
- Clarity of the explanation
- Appropriateness of the answer
- What to put in the report
- a general presentation
- design decisions
- ER + Implementation
- problems that you solved
- what to put in the presentation (4-8 slides):
- a general presentation
- ER + Comments of entity
- Example of SQL + RA
- Demo is welcomed
- 10 minutes Presentation ( you both have to speak) + 5 minutes Questions (Individual)
- yes, you can decide not to present (max grade is 12)
- Few extra points will be assigned by interacting with each other projects by proposing extensions or commenting on the result.
Proposed Dataset
-
Meme Datasets:
- Know Your Meme Scraped
- DBPedia Enrichment
- Google Vision Dataset
- Instances from Image Flip
- Industrial ERP (shared by email) (Courtesy of LEM International)
- Hackathon Dataset (Courtesy of Prof Alex Nolte)
Suggested Material
Pre Lab Installations (Courtesy of Mohamed Ragab)
- Install PostgreSQL
- On Docker ``
- on Windows
- Download and Install PostgreSQL for all platforms
- Follow this tutorial for more details (Windows installation).
- on linux
- Install the psycopg2 Python Package to connecto to PostgreSQL
- use th command
in your command line.1
pip install psycopg2
- use th command
- Accessing PostgreSQL from Command Line:
- Add the PostgreSQL installation \/home\ and \/bin\ directories to the enviroment variables.
- use the command
to connect to the the by-default created database \postgres\ with the user \postgres.1
psql -U postgres postgres
- Enter your set \postgres\ password (i.e password of the default password that you have been asked at the time of installtion).
- Connect To a PostgreSQL Database Server
- show you how to connect to the PostgreSQL using psql tool and pgAdmin 4 GUI tool.
Part 2: Data Mining
Abstract
This course introduces the basics notions of datamining. It surveys techniques for data preprocessing, i.e., cleansing, integration, data reduction. Then, it jumps into foundational aspects of data mining like: Frequent Patterns mining, classification, clustering, outliers detection. The course examination consists of both a project and an exam with open and multiple choice questions and exercises.
but also a short introduction to data warehousing and On Line Analytical Processing queries (OLAP).
Books
Professor’s Choice: Data Mining: Concepts and Techniques, 3rd ed.
-
Exercise (we will do them) http://www.cse.msu.edu/~ptan/dmbook/software/
-
Please have a look at 0, 1 and 2 by Friday
Others Resources
Tentative Schedule
Date | Lecture | Chapters | Extra |
---|---|---|---|
10/11/2021 | Introduction to Data Mining | Chp 1, 2 | |
12/11/2021 | Data Pre Processing | Chp 3 | |
17/11/2021 | OLAP | Chp 4 | |
19/11/2021 | Practice on Pre Processing | exercises | |
24/11/2021 | Frequent Patterns | Chp 6 | |
26/11/2021 | Classification | Chp 8 | |
03/12/2021 | Clustering | Chp 10 | |
08/12/2021 | Outliers | Chp 12 | |
10/12/2021 | Project Q/A | ||
15/12/2021 | Exam + Greetings | ||
22/12/2021 | Project Submission |
Project Checklist
- Identify the domain of choice, e.g., dataset presented above
- YOU DO NOT HAVE TO PICK ONLY ONE DATASET
- BUT BE CAREFUL BECAUSE YOU MIGHT HAVE TO CLEAN MORE THAN ONE
- YOU CAN SUBMIT THE TAKS FOR APPROVAL
- you can pick out of problems (preprocessing + 2)
- cleansing/preprocessing (mandatory)
- frequent pattern mining
- classification
- clustering
- Explain why why you’re happy with the results.
- evaluate the associations rules
- evaluate the classification
- evaluate the clustering proceedure
- present a notebook with comments and discussion of your result (github)