CSE634, CSE590
DATA MINING
Spring 2009



Course Information


NEWS 1:


  • May 5: You must submit a project description and about 10 slides describing the results. The project description is in the Downloads section.
  • April 15: The test is in the Downloads section. For extra credit, you have to solve only the problems that you didn't finish during the regular test.
  • March 26: The Downloads section is updated; the Apriori slides are updated.
  • March 26: You can now submit grades for presentations of your classmates online here
  • March 24: There will be no project presentations (a.k.a. Presentation #3). The relevant documents will be updated on the website within a week.
  • March 24: Remember to fill out and submit presentation evaluation forms. The form is on the course web page. In spite of what the form says, do NOT email the final grade to the TA. Instead, you will enter the final grade for each evaluation into a web form. The address will be published on this website shortly.
  • March 16: Midterm date is MOVED to April 14. The midterm is open book. Covers chapters 1, 2, 5, 6, and part of 7.
  • March 16: Reminder: by March 20 all students / groups must have a presentation topic and date settled with the TA. The current schedule is here
  • March 11: CHANGE: each presentation group may have at most 2 students. If there are more than 2 students in your group, email the TA the new groups / subjects.
  • March 9: Homework and TEST examples are in Downloads
  • March 6: by March 20 every student MUST be in a group, and each group MUST have a topic
  • March 6: Midterm exam will be held on March 26
  • March 6: student presentations start March 31
  • One person from each team should mail the TA the names of all team members and all other relevant info (see NEWS 2 below) by March 15

    Time:

    Tuesday, Thursday, 3:50 - 5:10 pm

    Place:

    SB Union 237

    Professor:

    Anita Wasilewska

    1428 CS Building; tel: 632-8458
    e-mail: anita@cs.sunysb.edu
    Office Hours: Tue, Th 2:15 - 3:30 pm and by appointment

    Teaching Assistant:

    Yury Puzis
    e-mail: ypuzis@cs.sunysb.edu
    Office Hours: Tue, Th 12:50 - 2:10 pm

    Book:

    DATA MINING Concepts and Techniques
    Jiawei Han, Micheline Kamber
    Morgan Kaufmann Publishers, 2003

    General Course Description:

    Data Mining, also called Knowledge Discovery in Databases (KDD), is a new multidisciplinary field. It brings together research and ideas from database technology, machine learning, neural networks, statistics, pattern recognition, knowledge-based systems, information retrieval, high-performance computing, and data visualization. Its main focus is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.
    The course will closely follow the book and is designed to give a broad yet in-depth overview of the Data Mining field and to examine the most recognized techniques in more rigorous detail. It will also explore the newest trends and developments in the field in the form of talks based on the newest research papers.

    Student Information

    Grading

    During the course of the semester there will be:

  • 1. Presentations 1, 2 (100pts total) given in teams of 2-4 students.
    The team will be graded for presentation skills, content, organization, clarity, and the amount of work put into research and preparation.
    Each member of the team has to present their own well-defined part and will be graded individually on this part as well as in an overall evaluation of the group.
  • Presentation 1 (70pts) is a lecture-type, one-hour presentation (see description in the Syllabus) given in groups of 2-4 students. All members of the group must present a more or less equal share of the material.
  • Presentation 2 (30pts) is a short, 10-20 minute presentation of a research paper or an application (see description in the Syllabus) given by the same group as Presentation 1. All members of the group must present a more or less equal share of the material.
  • Presentations 1 and 2 can be combined into one presentation lasting a whole class period, or can be delivered separately.
  • 2. Midterm (100pts) test covering the material from chapters 1, 2, 5, 6 included in the Professor's lectures.
    Midterm will be given after we finish my lectures. I plan it for the week of March 16, but it could be changed.
  • 3. Project and Project Presentation (70pts).
  • 4. Presentation evaluation reports (30pts).
  • Final grade computation.
    During the semester you can earn 300pts or more (in the case of extra points). The grade will be determined in the following way: # of earned points divided by 3 = % grade.
    The % grade is translated into a letter grade in the standard way, i.e. 100 - 90 % is A range, 89 - 80 % is B range, 79 - 70 % is C range, 69 - 60 % is D range, and F is below 60%.
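
    The computation above can be sketched in a few lines of Python. This is only an illustration, not an official grading tool: the function names and the sample scores are made up, and the treatment of fractional percentages that fall between the listed ranges (for example 89.5%) is an assumption, since the rule above does not specify it.

```python
def percent_grade(points: float) -> float:
    """Earned points divided by 3, as stated in the grading rule."""
    return points / 3


def letter_grade(percent: float) -> str:
    """Map a % grade to a letter using the ranges listed above.

    Assumption: simple >= cutoffs with no rounding, so 89.5% counts as B.
    """
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 60:
        return "D"
    return "F"


# Hypothetical scores for one student (not real data).
scores = {
    "Presentation 1 (max 70)": 60,
    "Presentation 2 (max 30)": 25,
    "Midterm (max 100)": 85,
    "Project (max 70)": 60,
    "Evaluation reports (max 30)": 28,
}

total = sum(scores.values())      # 258 points
percent = percent_grade(total)    # 86.0
print(total, percent, letter_grade(percent))  # prints: 258 86.0 B
```

    Because extra points can push the total above 300, the % grade in this sketch can exceed 100; it still falls in the A range.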

    Downloads:

    PROJECT DESCRIPTION
    MIDTERM
    COURSE SYLLABUS 2009
    CLASSIFICATION HMK EXAMPLE
    CLASSIFICATION TEST EXAMPLE
    APRIORI PROBLEM 1
    APRIORI PROBLEM 2
    APRIORI PROBLEM 3
    APRIORI PROBLEM 4

    Lecture Notes:

    01. 2009 Introduction (Chapter 1)
    02. 2009 Preprocessing (Chapter 2)
    03. 2009 Classification 1 (Chapter 6)
    04. 2009 Classification 2 (Chapter 6)
    05. 2009 Examples of Decision Tree
    06. 2009 Testing Classification (Chapter 6)
    07. 2009 Classification By Neural Networks (Chapter 6)
    08. APRIORI Algorithm (Chapter 5)
    09. Association Analysis (Chapter 5)
    10. Classification (Example): Protein Secondary Structure Prediction

    DATASETS

    Datasets for data mining and knowledge discovery
    Datasets for data mining competitions
    University of California, Irvine KDD Archive
    World Bank datasets

    Project Data

  • Play around with the data and familiarize yourself with it.
    DOWNLOAD: PROJECT DATA
  • You can download the project description from here.
    DOWNLOAD: Project Description (to be changed)
  • This project will be done in groups.
  • More details on the project will be put up soon.
  • 2009 Presentations' Subjects:

    Please refer to the following link to see presentation subjects and schedule.

    NEWS 2

  • Check 'Possible Presentations Subjects' to get an idea of the subjects to choose from.
  • Please mail the TA the number of members in the group (not exceeding 3), the name, e-mail address, and SB ID of each group member, along with the subject of the presentation.
  • If you have not formed a group, please send the TA the subject you are interested in. We will assign you to a group.
  • All groups presenting the same subject MUST collaborate.
  • Presentations' General Principles:

    1. Groups must consist of 2-3 students
    2. No more than 2-3 presentations on the same general topic (like Clustering, Association Analysis, Neural Network, etc...) are allowed
    3. Groups that choose the same general subject MUST collaborate.
    4. No repetition of information within the same general subject is allowed, except for one or two slides referring to previous presentation(s) by another group or groups.
    5. The "no repetition" principle applies to lecture-type content as well as to research papers and applications.
    6. YOU MUST USE the language developed in the Professor's Lecture Notes and the book.

    Possible Presentations Subjects:


    1. Data Warehouse and OLAP technology for Data Mining.
    2. Data Mining Primitives, Languages and System Architectures
    3. CRISP standards for Data Mining
    4. Mining Association Rules in Large Databases
    5. Classification based on Concepts from Association rule mining.
    6. Classification Accuracy testing methods and problems
    7. Statistical Methods 1: Statistical Prediction, Prediction by regression, other purely statistical methods
    8. Statistical Methods 2: Classification by Neural Networks
    9. Statistical Methods 3: Bayesian Classification.
    10. Statistical Methods 4: Cluster Analysis. A Categorization of major Clustering methods
    11. Evolutionary Computing: Genetic algorithms as optimization, Genetic algorithms as classification. Other evolutionary computing methods.
    12. NEW ADVANCES in Data Mining, for example:
            Web Mining: an overview of methods and problems
            Text Mining: an overview of methods and problems
            Visualization and Data Mining techniques
            Natural Language Processing and Data Mining techniques
    13. FIND YOUR OWN subject and discuss it with the Professor.

    Presentations' Groups, Topics, Schedule and Peer Evaluations:


    Students' Presentations Report:

    Download a pdf of the report form from here

    Students' Presentations Spring 2009

    Support Vector Machines
    Bayesian Classification
    Cluster Analysis - I
    Web Mining - I (very good!)
    Text Mining
    Decision Trees
    Typical OLAP operations, ETL (and paper)
    Genetic Algorithms - I
    Data Warehouse
    Mining Association Rules in Large Databases
    Neural Networks (and paper plus slides)
    Cluster Analysis - II (and paper slides)
    Web Mining - II (and paper slides)
    Visualization in Data Mining (and paper slides)
    Genetic Algorithms - II (and paper slides)

    Students' Presentations Spring 2007

    Data Warehousing & OLAP Technologies
    Mining Association Rules in Large Databases
    Cluster Analysis - I
    Cluster Analysis - II
    Artificial Neural Networks
    Genetic Algorithms
    Bayesian Classification
    Web Mining - I
    Web Mining - II
    Text Mining

    Students' Presentations Spring 2006:


    Data Mining Primitives, Languages, and System Architecture
    Cluster Analysis - I
    Neural Networks - I
    Neural Networks - II
    Bayesian Network
    Web Mining - I
    Data Warehouse and OLAP Technology - I
    Data Warehouse and OLAP Technology - II
    Clustering - II
    Web Mining - II
    Text Mining
    Decision Trees I, Decision Trees II
    Visualization in Data Mining
    Genetic Algorithms

    Students' Presentations Spring 2005:


    Presentation 1 Data Mining Primitives, Languages, and System Architectures
    Presentation 2 Neural Network
    Presentation 3 Genetic Algorithms
    Presentation 4 Data Warehouse and OLAP Technology For Data Mining
    Presentation 5 CRISP-DM
    Presentation 6 Mining Association Rules in Large Databases
    Presentation 7 Association Rules Hiding (Not Mining)
    Presentation 8 Introduction of Bayesian Network
    Presentation 9 Cluster Analysis