CSE634, CSE590
DATA MINING
Spring 2009
Course Information
NEWS 1:
May 5: You must submit a project description and about 10 slides describing the results. The project description is in the Downloads section.
April 15: The test is in the Downloads section. For extra credit, you only have to solve the problems that you did not finish during the regular test.
March 26: The Downloads section is updated; the Apriori slides are updated.
March 26: You can now submit grades for presentations of your classmates online here
March 24: There will be no project presentations (a.k.a. Presentation #3). The relevant documents will be updated on the website within a week.
March 24: Remember to fill out and submit presentation evaluation forms. The form is on the course web page. In spite of what the form says, do NOT email the final grade to the TA. Instead, you will enter the final grade for each evaluation into a web form. The address will be published on this website shortly.
March 16: Midterm date is MOVED to April 14. The midterm is open book. Covers chapters 1, 2, 5, 6, and part of 7.
March 16: Reminder: by March 20 all students / groups must have a presentation topic and date settled with the TA. The current schedule is here
March 11: CHANGE: each presentation group may have at most 2 students. If there are more than 2 students in your group, email the TA the new groups / subjects.
March 9: Homework and TEST examples are in Downloads
March 6: by March 20 every student MUST be in a group, and each group MUST have a topic
March 6: Midterm exam will be held on March 26
March 6: student presentations start March 31
One person from each team should email the TA the names of all the team members and all other relevant info (see NEWS 2 below) by March 15
Time:
Tuesday, Thursday, 3:50 - 5:10 pm
Place:
SB Union 237
Professor:
Anita Wasilewska
1428 CS Building; tel: 632-8458
e-mail: anita@cs.sunysb.edu
Office Hours: Tue, Th 2:15 - 3:30 pm and by appointment
Teaching Assistant:
Yury Puzis
e-mail: ypuzis@cs.sunysb.edu
Office Hours: Tue, Th 12:50 - 2:10 pm
Book:
DATA MINING Concepts and Techniques
Jiawei Han, Micheline Kamber
Morgan Kaufmann Publishers, 2003
General Course Description:
Data Mining, also called Knowledge Discovery in Databases (KDD), is a new multidisciplinary field. It brings together research and ideas from database technology, machine learning, neural networks, statistics, pattern recognition, knowledge-based systems, information retrieval, high-performance computing, and data visualization. Its main focus is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.
The course will closely follow the book and is designed to give a broad yet in-depth overview of the Data Mining field and to examine the most recognized techniques in more rigorous detail. It will also explore the newest trends and developments in the field through talks based on the newest research papers.
Student Information
Grading
During the course of the semester there will be:
1. Presentations 1, 2 (100pts total) given in teams of 2-4 students.
The team will be graded for presentation skills, content, organization, clarity, and the amount of work put into research and preparation.
Each member of the team has to present his or her own well-defined part and will be graded individually on this part as well as on the overall evaluation of the group.
Presentation 1 (70pts) is a lecture-type, one-hour presentation (see description in the Syllabus) given in groups of 2-4 students. All members of the group must present the material in a more or less equal manner.
Presentation 2 (30pts) is a short, 10-20 minute presentation of a research paper or an application (see description in the Syllabus) given by the same group as Presentation 1. All members of the group must present the material in a more or less equal manner.
Presentations 1 and 2 can be combined into one presentation lasting the whole class period, or can be delivered separately.
2. Midterm (100pts) test covering the material from chapters 1, 2, 5, 6 included in the Professor's lectures.
The midterm will be given after we finish my lectures. I plan it for the week of March 16, but it could be changed.
3. Project and Project Presentation (70pts).
4. Presentation evaluation reports (30pts).
Final grade computation.
During the semester you can earn 300pts or more (in the case of
extra points). The grade will be determined in the following way:
# of earned points divided by 3 = % grade.
The % grade is translated into a letter grade in the standard
way, i.e. 100 - 90 % is A range, 89 - 80 % is B range, 79 - 70 %
is C range, 69 - 60 % is D range, and F is below 60%.
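As a rough illustration of the computation above, here is a minimal Python sketch. The function names are hypothetical, and the treatment of fractional percentages (for example, 89.7% counted as B range) is an assumption, since the page states the ranges only in whole percents. Extra points can push the result above 100%, which this sketch simply treats as A range.

```python
def percent_grade(earned_points: float) -> float:
    """Convert earned points (300 nominal maximum) into a % grade."""
    return earned_points / 3.0


def letter_range(percent: float) -> str:
    """Map a % grade to a letter range.

    Assumption: each range starts at its lower bound (90, 80, 70, 60),
    so fractional values such as 89.7 fall into the lower range.
    """
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 60:
        return "D"
    return "F"


if __name__ == "__main__":
    # Example: 70 + 25 (presentations) + 85 (midterm) + 60 (project) + 30 (reports)
    points = 70 + 25 + 85 + 60 + 30
    pct = percent_grade(points)
    print(f"{points} pts -> {pct:.1f}% -> {letter_range(pct)} range")
```

For example, 270 earned points give 270 / 3 = 90%, which is in the A range under the assumption above.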
Downloads:
PROJECT DESCRIPTION
MIDTERM
COURSE SYLLABUS 2009
CLASSIFICATION HMK EXAMPLE
CLASSIFICATION TEST EXAMPLE
APRIORI PROBLEM 1
APRIORI PROBLEM 2
APRIORI PROBLEM 3
APRIORI PROBLEM 4
Lecture Notes:
01. 2009 Introduction (Chapter 1)
02. 2009 Preprocessing (Chapter 2)
03. 2009 Classification 1 (Chapter 6)
04. 2009 Classification 2 (Chapter 6)
05. 2009 Examples of Decision Tree
06. 2009 Testing Classification (Chapter 6)
07. 2009 Classification By Neural Networks (Chapter 6)
08. APRIORI Algorithm (Chapter 5)
09. Association Analysis (Chapter 5)
10. Classification (Example): Protein Secondary Structure Prediction
DATASETS
Datasets for data mining and knowledge discovery
Datasets for data mining competitions
University of California, Irvine KDD Archive
World Bank datasets
Project Data
Play around with the data and familiarize yourself with it.
DOWNLOAD: PROJECT DATA
You can download the project description from here.
DOWNLOAD: Project Description (to be changed)
This project will be done in groups.
More details on the project will be put up soon.
2009 Presentations' Subjects:
Please refer to the following link to see presentation subjects and schedule.
NEWS 2
Check 'Possible Presentations Subjects' to get an idea of the subjects to choose from.
Please email the TA the number of members in the group (not exceeding 3), the name, e-mail address and SB ID of each group member, along with the subject of the presentation.
If you have not formed a group, please provide the TA with the subject you are interested in. We will assign a group for you.
All groups presenting the same subject MUST collaborate.
Presentations' General Principles:
1. Groups must consist of 2-3 students
2. No more than 2-3 presentations on the same general topic (like
Clustering, Association Analysis, Neural Network, etc...) are allowed
3. Groups that choose the same general subject MUST collaborate.
4. No repetition of information within the same general subject is
allowed, except for one or two slides referring to previous
presentation(s) of another group or groups.
5. The "no repetition" principle applies to lecture-type content as well as
research papers and applications.
6. YOU MUST USE the language developed in the Professor's Lecture Notes and the book.
Possible Presentations Subjects:
1. Data Warehouse and OLAP technology for Data Mining.
2. Data Mining Primitives, Languages and System Architectures
3. CRISP standards for Data Mining
4. Mining Association Rules in Large Databases
5. Classification based on Concepts from Association rule mining.
6. Classification Accuracy testing methods and problems
7. Statistical Methods 1: Statistical Prediction, Prediction by
regression, other purely statistical methods
8. Statistical Methods 2: Classification by Neural Networks
9. Statistical Methods 3: Bayesian Classification.
10. Statistical Methods 4: Cluster Analysis. A Categorization of major
Clustering methods
11. Evolutionary Computing: Genetic algorithms as optimization, Genetic
algorithms as classification. Other evolutionary computing methods.
12. NEW ADVANCES in Data Mining, for example:
Web Mining: an overview of methods and problems
Text Mining: an overview of methods and problems
Visualization and Data Mining techniques
Natural Language Processing and Data Mining techniques
13. FIND YOUR OWN subject and discuss it with the Professor.
Presentations' Groups, Topics, Schedule and Peer Evaluations:
Students' Presentations Report:
Download a pdf of the report form from here
Students' Presentations Spring 2009
Support Vector Machines
Bayesian Classification
Cluster Analysis - I
Web Mining - I (very good!)
Text Mining
Decision Trees
Typical OLAP operations, ETL (and paper)
Genetic Algorithms - I
Data Warehouse
Mining Association Rules in Large Databases
Neural Networks (and paper plus slides)
Cluster Analysis - II (and paper slides)
Web Mining - II (and paper slides)
Visualization in Data Mining (and paper slides)
Genetic Algorithms - II (and paper slides)
Students' Presentations Spring 2007
Data Warehousing & OLAP Technologies
Mining Association Rules in Large Databases
Cluster Analysis - I
Cluster Analysis - II
Artificial Neural Networks
Genetic Algorithms
Bayesian Classification
Web Mining - I
Web Mining - II
Text Mining
Students' Presentations Spring 2006:
Data Mining Primitives, Languages, and System Architecture
Cluster Analysis - I
Neural Networks - I
Neural Networks - II
Bayesian Network
Web Mining - I
Data Warehouse and OLAP Technology - I
Data Warehouse and OLAP Technology - II
Clustering - II
Web Mining - II
Text Mining
Decision Trees I, Decision Trees II
Visualization in Data Mining
Genetic Algorithms
Students' Presentations Spring 2005:
Presentation 1 Data Mining Primitives, Languages, and System Architectures
Presentation 2 Neural Network
Presentation 3 Genetic Algorithms
Presentation 4 Data Warehouse and OLAP Technology For Data Mining
Presentation 5 CRISP-DM
Presentation 6 Mining Association Rules in Large Databases
Presentation 7 Association Rules Hiding (Not Mining)
Presentation 8 Introduction of Bayesian Network
Presentation 9 Cluster Analysis