Data Mining and Knowledge Discovery


Textbooks:

TM = Tom Mitchell, Machine Learning. McGraw Hill, 1997.

WF = Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000

PC = R. Duda, P. Hart, D. Stork. Pattern Classification, Wiley, 2001.

HK = J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

MS = D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, 2001.

HA = S. Haykin, Neural Networks, Prentice Hall, 1999.

PY = D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.

Schedule:

Date Topics Material Instructor Further reading
26 September 2006 Introduction. Characterizing knowledge discovery as a process. Slides [pdf]

Lecture notes [pdf]

Manco HK, ch. 1; MS, ch. 1
U. Fayyad and others, "From Data Mining to Knowledge Discovery in Databases".
Data Mining Applications
S. Chaudhuri, U. Dayal, V. Ganti. "Database technology for Decision Support Systems".
28 September 2006 Data preprocessing. Descriptive statistics. Data cleaning and transformation. Discretization. Slides [pdf]

Manco HK, ch. 2-3; MS, ch. 2-3; PY.
Exploratory Data Analysis at NIST.
Codd, "Providing OLAP to the user analyst: An IT mandate".
S. Chaudhuri, U. Dayal, "An Overview of Data warehouse and OLAP technology".
H. Galhardas and others, "Declarative data cleaning: Language, model, and algorithms".
M. Hernandez, S. Stolfo, "Real-world data is dirty".
H. Lee and others, "Cleansing data for mining and warehousing".
29 September 2006 Discretization ChiMerge algorithm [java] Manco H. Liu and others, "Discretization: an enabling technique".
J. Dougherty, R. Kohavi, M. Sahami, "Supervised and Unsupervised discretization of Continuous features".
R. Holte, "Very simple classification Rules perform well on most commonly used datasets".
Weka is available from this site.
3 October 2006 Exercise session on data preprocessing. A case study. Slides [pdf]

Dataset [arff]  

Folino  
5 October 2006 Concept learning. Inductive learning and inductive bias Slides [pdf] Manco TM, ch. 2.

H. Hirsh, "Polynomial-Time Learning with Version Spaces".

6 October 2006 The Candidate Elimination algorithm. Decision trees   Manco TM, ch. 2; HK, ch. 7; TM, ch. 3.
 
10 October 2006 Decision tree learning Slides [pdf]

Lecture notes [pdf]

Manco HK, ch. 7; TM, ch. 3.
M. Mehta, R. Agrawal, J. Rissanen, "SLIQ: A scalable Decision-Tree classifier for Data Mining"
Y. Freund, L. Mason, "The alternating decision tree learning algorithm".
L. Breiman, "Random Forests".
J. Gehrke, R. Ramakrishnan, V. Ganti, "RainForest: A Framework for Fast Decision Tree Construction of Large Datasets"
Lim, Loh, Shih, "An Empirical Comparison of Decision Trees and Other Classification Methods"
12 October 2006 Decision trees. Model evaluation Slides [pdf] Manco T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers".
C. Ferri, P. Flach, J. Hernández-Orallo, "Learning Decision Trees using the area under the ROC Curve".
E. Frank et al. "Using Model Trees for Regression".
A. Moore, M. Lee, "Efficient Algorithms for minimizing Cross-Validation Error". 
13 October 2006 Exercise session on classification

Lecture notes [pdf]

Additional material

Locane  
17-20 October 2006 Model evaluation Slides [pdf] Folino
26 October 2006 Support Vector Machines Lecture notes [pdf] Astorino C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition".
J. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization".
22-27 October 2006 Neural networks Slides [pdf] Manco TM, ch. 3; HA, ch. 3-4; PC, ch. 5.1-5.5, 6.1-6.8.
A.K. Jain, J. Mao, "A tutorial on Neural Networks", IEEE Computer, March 1996.
B.D. Ripley, "Pattern Recognition via Neural Networks".
Y. Freund, R. Schapire, "Large Margin Classification using the Perceptron Algorithm".
B.D. Ripley, "Can Statistical Theory Help us use Neural Networks Better?"
M.J.J. Orr, "Introduction to Radial Basis Function Networks".

PC, ch. 3.8.1; HK, ch. 3.4.3; MS, ch. 3.6.
Various authors, The New Jersey Data Reduction Report.
M. Wall, A. Rechsteiner, L. Rocha, Singular Value Decomposition and Principal Component Analysis.

31 October - 2 November 2006 Bayesian learning Slides [pdf] Manco TM, ch. 3; HA, ch. 3-4; PC, ch. 5.1-5.5, 6.1-6.8.
J. Elder, J. Pregibon, "A statistical Perspective on Knowledge Discovery in Databases".
G. John, P. Langley, "Estimating Continuous Distributions in Bayesian Classifiers".
A. McCallum, K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification".
P. Langley et al. "An Analysis of Bayesian Classifiers".
Webb, Boughton, Wang, "Not so Naive Bayes".
J. Provost, "Naive Bayes vs Rule Learning for E-mail classification".
R. Kohavi, "Scaling up the accuracy of naive-Bayes classifiers: a decision tree hybrid".
3 November 2006 Instance-based learning. Meta-classification. Model evaluation Slides 1 [pdf]

Slides 2 [pdf]

Slides 3 [pdf]

Manco TM, ch. 8; PC, ch. 5.1-5.5, 6.1-6.8.
D. Aha, D. Kibler, M. Albert, "Instance-Based Learning Algorithms".
C. Atkeson et al. "Locally Weighted Learning".
E. Frank, M. Hall, B. Pfahringer, "Locally Weighted Naive Bayes".
P. Langley, W. Iba, "Average-Case Analysis of a Nearest Neighbor Algorithm".
R. Bouckaert, "Bayesian Network Classifiers in Weka".
D. Heckerman, "A Tutorial on Learning with Bayesian Networks".
W. Emde, D. Wettschereck, "Relational Instance-Based Learning".
R. Schapire, "The Boosting approach to Machine Learning".
L. Breiman, "Bagging predictors".
Website on ensemble learning
7 November 2006 Exercise session on SVM, neural networks, Bayesian classification Exercise sheet Folino
9 November 2006

Introduction to clustering

Slides [pdf] Manco HK, ch. 9.
A. K. Jain and R. C. Dubes. "Data Clustering: A review". 
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
Fayyad U., Reina C., Bradley P. S. "Initialization of Iterative Refinement Clustering Algorithms",
A. Strehl, J. Ghosh, R. Mooney, "Impact of Similarity Measures on Web Document Clustering".
10 November 2006 K-Means Slides [pdf] Manco
14 November 2006 Density-based clustering. Hierarchical clustering

Slides [pdf]

Slides [pdf]

Manco HK, ch. 9.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. "OPTICS: Ordering points to identify the clustering structure".
D. Fisher. "Knowledge acquisition via incremental conceptual clustering".
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases.
S. Guha, R. Rastogi, and K. Shim: "ROCK: A robust clustering algorithm for categorical Data".
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases.
16-17 November 2006 Model-based clustering. Other approaches to clustering Slides [pdf] Manco HK, ch. 9; PC, ch. 3.14; SL, ch. 14
G. J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
T. Moon, The EM Algorithm.
P. Cheeseman, J. Stutz, "Bayesian Classification (Autoclass): Theory and Results".
I. V. Cadez, and others, "Model-Based Clustering and visualization of navigation patterns on the web". 

D. Heckerman; C. Meek; B. Thiesson "Accelerating EM for large databases".
V. Ganti, J. Gehrke, R. Ramakrishnan, "CACTUS: Clustering Categorical Data".
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. "Automatic subspace clustering of high dimensional data for data mining applications".
21 November 2006 Programming in Weka: structure, extensions Slides [pdf] Scordio Weka wiki.
23 November 2006 Association rules. The Apriori algorithm Slides [pdf]

Apriori code

(with prefix tree)

Manco HK, ch. 6; SL, ch. 14
R. Agrawal, T. Imielinski, and A. Swami.  Mining association rules between sets of items in large databases.  
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
Ashoka Savasere, Edward Omiecinski, Shamkant B. Navathe: An Efficient Algorithm for Mining Association Rules in Large Databases.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules.
H. Toivonen. Sampling large databases for association rules. (citeseer)
R. Srikant and R. Agrawal. Mining generalized association rules.
R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables.
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations.
D. Tsur, and others. Query flocks:  A generalization of association-rule mining.  (citeseer)
Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules.
J. Han, J. Wang, Y. Lu, and P. Tzvetkov, “Mining Top-K Frequent Closed Patterns without Minimum Support”.
A. Savasere, E. Omiecinski, S. B. Navathe, Mining for Strong Negative Associations in a Large Database of Customer Transactions.
E. Omiecinski. Alternative Interest Measures for Mining Associations.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules.
J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation.
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. (citeseer)
Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining.
R. J. Bayardo. Efficiently mining long patterns from databases. (citeseer)
Y. Xu, J. X. Yu, G. Liu, H. Lu, From Path Tree To Frequent Patterns: A Framework for Mining Frequent Patterns.
G. Liu, H. Lu, W. Lou, J. X. Yu, On Computing, Storing and Querying Frequent Patterns.
B. Goethals, M. Zaki: FIMI: Workshop on Frequent Itemset Mining Implementations (An Introduction).  
24 November 2006 Exercise session on clustering and association rules Exercises

Solutions

Further exercises

Folino  
28 November 2006 Sequential patterns. Time series   Folino Time Series Data Mining archive, maintained by Eamonn Keogh.
R. Agrawal, C. Faloutsos, A. Swami, "Efficient Similarity Search in Sequence databases".
R. Srikant, R. Agrawal, "Finding Sequential Patterns".