Database and Data Mining

IST-4-DBM1 - ECTS 3.0

Part 1: Databases

Abstract

This course introduces the most important concepts concerning the design and the use of databases. It covers the main issues in data modeling when considering the relational data model (structure, languages like relational algebra and relational calculus) and the key features of attribute-oriented models with respect to constructor-oriented models (like, for instance, Extended Entity-Relationship models). The fundamental concepts like the constraints and the normal forms will be explained and their impact on the quality of designed databases will be discussed. Finally, practical aspects of relational database design and use will be discussed. It includes a presentation of the SQL query language but also a short introduction to data warehousing and On Line Analytical Processing queries (OLAP). To be concrete, the students will have to understand the design of a real relational database and to practice database querying with a professional DataBase Management System (DBMS).

Note: Slides are taken and partially modified from https://www.db-book.com/

Schedule

Date Lecture Slides Chapters Extra
04/10/2021 Introduction to DB 1 Chp1  
06/10/2021 Relational Algebra and ER Modelling 1 2 until 43 Chp 2, 6 Extended ER Figure (slides)
08/10/2021 Project Setup + Practice with Postgres 1 2 Chp 3, 4 Docker pull, build, run, exec, + Docker compose
11/10/2021 Normal Forms 1 Chp 7  
13/10/2021 Normal Forms pt 2 1 Chp 7  
15/10/2021 Exercises on SQL   Chp 3,4  
18/10/2021 Index & Transactions   Chp 14,17  
20/10/2021 Index & Transactions   Chp 14,17  
10/11/2021 Project Presentations      

Projects Description

The course exam consists into a small project to perform in pairs. Each project should have a presentation and 2-5 pages project report.

Project Checklist

  • Identify the domain of choice, e.g., internet memes
  • formulate some questions in natural language that you would like to answer
    • submit the questions for approval
  • design the database and perform data entry in Postgres
  • ER Diagram and comments about the design decision (using appropriate terminology)
  • Implement the questions in SQL, analyse the query plan in relational algebra, experiment alternative implemetations
  • Basic queries should cover standard RA operators, Joins, and aggregation functions
  • Advanced features nice to show example of
    • Indexes
    • Transactions

NB: Full-stack application is not required, but encouraged.

Project Grading

  • 1-15
    • Natural Language Queries
    • ER Diagram Design
    • SQL Implementation
      • DDL + Data Entry
      • Basic SQL Usage (no joins) + Relational Algebra (should also go in the presentation)
    • Intermediate SQL usages
      • Joins and Aggregations
      • Integrity Constratints (check or queries) + Ad-hoc data to test
      • Indexes and Transaction
  • 15-20
    • 10 minutes Presentation ( you both have to speak) + 5 minutes Questions (Individual)
      • Use of appropriate terminology (also when speaking)
      • Clarity of the explanation
      • Appropriateness of the answer
    • What to put in the report
      • a general presentation
      • design decisions
      • ER + Implementation
      • problems that you solved
    • what to put in the presentation (4-8 slides):
      • a general presentation
      • ER + Comments of entity
      • Example of SQL + RA
      • Demo is welcomed
  • yes, you can decide not to present (max grade is 12)
  • Few extra points will be assigned by interacting with each other projects by proposing extensions or commenting on the result.

Proposed Dataset

Suggested Material

Pre Lab Installations (Courtesy of Mohamed Ragab)

Lab Repository

  1. Install PostgreSQL
    1. On Docker ``
    2. on Windows
    3. on linux
  2. Install the psycopg2 Python Package to connecto to PostgreSQL
    • use th command
      1
      
      pip install psycopg2
      
      in your command line.
  3. Accessing PostgreSQL from Command Line:
    • Add the PostgreSQL installation \/home\ and \/bin\ directories to the enviroment variables.
    • use the command
      1
      
      psql -U postgres postgres
      
      to connect to the the by-default created database \postgres\ with the user \postgres.
    • Enter your set \postgres\ password (i.e password of the default password that you have been asked at the time of installtion).
  4. Connect To a PostgreSQL Database Server
    • show you how to connect to the PostgreSQL using psql tool and pgAdmin 4 GUI tool.

Part 2: Data Mining

Abstract

This course introduces the basics notions of datamining. It surveys techniques for data preprocessing, i.e., cleansing, integration, data reduction. Then, it jumps into foundational aspects of data mining like: Frequent Patterns mining, classification, clustering, outliers detection. The course examination consists of both a project and an exam with open and multiple choice questions and exercises.

but also a short introduction to data warehousing and On Line Analytical Processing queries (OLAP).

Books

Professor’s Choice: Data Mining: Concepts and Techniques, 3rd ed.

Others Resources

Tentative Schedule

Date Lecture Chapters Extra
10/11/2021 Introduction to Data Mining Chp 1, 2  
12/11/2021 Data Pre Processing Chp 3  
17/11/2021 OLAP Chp 4  
19/11/2021 Practice on Pre Processing exercises  
24/11/2021 Frequent Patterns Chp 6  
26/11/2021 Classification Chp 8  
03/12/2021 Clustering Chp 10  
08/12/2021 Outliers Chp 12  
10/12/2021 Project Q/A    
15/12/2021 Exam + Greetings    
22/12/2021 Project Submission    

Project Checklist

  • Identify the domain of choice, e.g., dataset presented above
    • YOU DO NOT HAVE TO PICK ONLY ONE DATASET
    • BUT BE CAREFUL BECAUSE YOU MIGHT HAVE TO CLEAN MORE THAN ONE
    • YOU CAN SUBMIT THE TAKS FOR APPROVAL
  • you can pick out of problems (preprocessing + 2)
    • cleansing/preprocessing (mandatory)
    • frequent pattern mining
    • classification
    • clustering
  • Explain why why you’re happy with the results.
    • evaluate the associations rules
    • evaluate the classification
    • evaluate the clustering proceedure
  • present a notebook with comments and discussion of your result (github)