Data Mining and Knowledge Discovery


 

General information about the seminars

A good seminar (about 20 slides) should summarize the problem addressed by the paper in question and the proposed solution (in particular, illustrating it on a toy example). The seminar should discuss the strengths and weaknesses of the paper and suggest directions for possible improvements.
To make this easier, below is a list of questions that the seminar should address.

  1. What problem does the paper address?
  2. What is the impact of the work? What are its practical and/or theoretical implications?
  3. How much do the problems addressed differ from those in the literature? Is related work adequately taken into account?
  4. What solution do the authors propose for the problem?
  5. What is the main contribution of the work? Are there minor contributions worth highlighting?
  6. What are the strengths and weaknesses of the work?
  7. Are there flaws or unrealistic assumptions in the work?
  8. Is there room to improve the contribution presented in the work? How?

The presentation will be evaluated according to two criteria:

 

 

SEMINAR PROPOSALS:

CLUSTERING
Number | Topic | Abstract | Assignee | Material
Spherical K-Means and Refinement   Attanà Francesco
0

Topic driven Clustering for Document Datasets

In this paper we define the problem of topic-driven clustering, which organizes a document collection according to a given set of topics. We propose three topic-driven schemes that consider the similarity between documents and topics and the relationship among documents themselves simultaneously. We present a comprehensive experimental evaluation of the proposed topic-driven schemes on five datasets. Our experimental results show that the proposed topic-driven schemes are efficient and effective with topic prototypes of different levels of specificity.

 

Article

1

K-means Clustering via Principal Component Analysis

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for the K-means objective function are derived, given by the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.

Article
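
For the paper above, a minimal sketch of the connection it studies, assuming scikit-learn and an invented synthetic dataset (not the authors' code): cluster the data in the raw space and in the space of the top k-1 principal components, then compare the two partitions.

    # Sketch: run k-means on the raw features and on a PCA projection and
    # compare the partitions, illustrating the relaxation result the paper
    # proves. Synthetic data and scikit-learn; not the authors' code.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    # Three well-separated Gaussian blobs in 50 dimensions.
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 50))
                   for c in (0.0, 4.0, 8.0)])

    k = 3
    labels_raw = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Cluster in the space spanned by the top k-1 principal components.
    Z = PCA(n_components=k - 1).fit_transform(X)
    labels_pca = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)

    # Near 1.0: the PCA subspace preserves the cluster structure.
    print(adjusted_rand_score(labels_raw, labels_pca))
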
3

ClusterMap: Labeling Clusters in Large Datasets via Visualization.

With the rapid increase of data in many areas, clustering on large datasets has become an important problem in data analysis. Since cluster analysis is a highly iterative process, cluster analysis on large datasets favors short iterations on a relatively small representative set. Thus, a two-phase framework, "sampling/summarization followed by iterative cluster analysis", is often applied in practice. Since the clustering result only labels the small representative set, there are problems with extending the result to the entire large dataset, which have been almost ignored by traditional clustering research. This extension is often called the labeling process. Labeling irregularly shaped clusters, distinguishing outliers, and extending cluster boundaries are the main problems in this stage. We address these problems and propose a visualization-based approach to dealing with them precisely. This approach partially involves a human in the process of defining and refining the "ClusterMap" structure. Based on this structure, the ClusterMap algorithm scans the large dataset to adapt the boundary extension and generate the cluster labels for the entire dataset. Experimental results show that ClusterMap can preserve cluster quality considerably with low computational cost, compared to distance-comparison-based labeling algorithms.

Article
4

Consistent Bipartite Graph Co-Partitioning for Star-Structured High-Order Heterogeneous Data Co-Clustering

Heterogeneous data co-clustering has attracted more and more attention in recent years due to its high impact on various applications. While the co-clustering algorithms for two types of heterogeneous data (denoted by pair-wise co-clustering), such as documents and terms, have been well studied in the literature, the work on more types of heterogeneous data (denoted by high-order co-clustering) is still very limited. As an attempt in this direction, in this paper, we worked on a specific case of high-order co-clustering in which there is a central type of objects that connects the other types so as to form a star structure of the interrelationships. Actually, this case can be a very good abstraction of many real-world applications, such as the co-clustering of categories, documents and terms in text mining. In our philosophy, we treated such kind of problems as the fusion of multiple pair-wise co-clustering sub-problems with the constraint of the star structure. Accordingly, we proposed the concept of consistent bipartite graph co-partitioning, and developed an algorithm based on semi-definite programming (SDP) for efficient computation of the clustering results. Experiments on toy problems and real data both verified the effectiveness of our proposed method.

Article
5

Clustering Aggregation

We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each categorical variable can be viewed as a clustering of the input rows. Moreover, clustering aggregation can be used as a meta-clustering method to improve the robustness of clusterings. The problem formulation does not require a priori information about the number of clusters, and it gives a natural way for handling missing values. We give a formal statement of the clustering-aggregation problem, we discuss related work, and we suggest a number of algorithms. For several of the methods we provide theoretical guarantees on the quality of the solutions. We also show how sampling can be used to scale the algorithms for large data sets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.

Article
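
For the clustering-aggregation paper above, a minimal sketch of the objective being minimized, the pairwise disagreement distance between two clusterings (toy data, brute force over all pairs, illustrative only):

    # Sketch: the disagreement distance between two clusterings, which
    # clustering aggregation minimizes summed over all input clusterings.
    from itertools import combinations

    def disagreements(c1, c2):
        # c1, c2: lists mapping point index -> cluster label. Count point
        # pairs placed together in one clustering but apart in the other.
        return sum(
            (c1[i] == c1[j]) != (c2[i] == c2[j])
            for i, j in combinations(range(len(c1)), 2)
        )

    # Example: two clusterings of 5 points.
    print(disagreements([0, 0, 1, 1, 2], [0, 0, 0, 1, 2]))  # -> 3
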
6

Semi-supervised Graph Clustering: A Kernel Approach

Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective. A recent theoretical connection between kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For vector data, the kernel approach also enables us to find clusters with nonlinear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.

Article
7

Discovering Large Dense Subgraphs in Massive Graphs

We present a new algorithm for finding large, dense subgraphs in massive graphs. Our algorithm is based on a recursive application of fingerprinting via shingles, and is extremely efficient, capable of handling graphs with tens of billions of edges on a single machine with modest resources. We apply our algorithm to characterize the large, dense subgraphs of a graph showing connections between hosts on the World Wide Web; this graph contains over 50M hosts and 11B edges, gathered from 2.1B web pages. We measure the distribution of these dense subgraphs and their evolution over time. We show that more than half of these hosts participate in some dense subgraph found by the analysis. There are several hundred giant dense subgraphs of at least ten thousand hosts; two thousand dense subgraphs of at least a thousand hosts; and almost 64K dense subgraphs of at least a hundred hosts. Upon examination, many of the dense subgraphs output by our algorithm are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content. We therefore propose dense subgraph extraction as a useful primitive for spam detection, and discuss its incorporation into the workflow of web search engines.

Article
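
For the dense-subgraph paper above, a minimal sketch of the fingerprinting step under the simplifying assumption of a single shingling level (the paper applies it recursively); the hash choice and data are invented:

    # Sketch: each node's outlink set is reduced to a few min-hash
    # "shingles"; nodes sharing shingles have similar neighborhoods and
    # become candidates for the same dense subgraph.
    import hashlib

    def shingles(outlinks, num_hashes=4):
        # Return num_hashes min-hash values of a node's outlink set.
        sigs = []
        for seed in range(num_hashes):
            sigs.append(min(
                int(hashlib.sha1(f"{seed}:{v}".encode()).hexdigest(), 16)
                for v in outlinks
            ))
        return tuple(sigs)

    a = shingles({"x.com", "y.com", "z.com"})
    b = shingles({"x.com", "y.com", "z.com", "w.com"})
    print(sum(u == v for u, v in zip(a, b)), "of 4 shingles shared")
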
8

Clustering With Constraints: Feasibility Issues and the k-Means Algorithm

Recent work has looked at extending the k-Means algorithm to incorporate background information in the form of instance level must-link and cannot-link constraints. We introduce two ways of specifying additional background information in the form of δ and ε constraints that operate on all instances but which can be interpreted as conjunctions or disjunctions of instance level constraints and hence are easy to implement. We present complexity results for the feasibility of clustering under each type of constraint individually and several types together. A key finding is that determining whether there is a feasible solution satisfying all constraints is, in general, NP-complete. Thus, an iterative algorithm such as k-Means should not try to find a feasible partitioning at each iteration. This motivates our derivation of a new version of the k-Means algorithm that minimizes the constrained vector quantization error but at each iteration does not attempt to satisfy all constraints. Using standard UCI datasets, we find that using constraints improves accuracy as others have reported, but we also show that our algorithm reduces the number of iterations until convergence. Finally, we illustrate these benefits and our new constraint types on a complex real world object identification problem using the infra-red detector on an Aibo robot.

Article
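
For the constrained-clustering paper above, a minimal sketch of the feasibility question it analyzes: verifying that a given assignment satisfies must-link and cannot-link constraints is easy, while finding one is NP-complete in general (illustrative code, not the authors'):

    # Sketch: checking whether a given cluster assignment satisfies
    # instance-level constraints. Finding a feasible assignment is
    # NP-complete in general (cannot-link constraints encode graph
    # coloring); verifying one, as here, is easy.
    def feasible(labels, must_link, cannot_link):
        ok_ml = all(labels[i] == labels[j] for i, j in must_link)
        ok_cl = all(labels[i] != labels[j] for i, j in cannot_link)
        return ok_ml and ok_cl

    labels = [0, 0, 1, 1]
    print(feasible(labels, must_link=[(0, 1)], cannot_link=[(1, 2)]))  # True
    print(feasible(labels, must_link=[(1, 2)], cannot_link=[]))        # False
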
9

Clustering with Bregman Divergences

A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate-distortion theory, and present an algorithm to minimize this loss. Secondly, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an efficient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.

Article
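
For the Bregman-clustering paper above, a minimal sketch of hard Bregman clustering, assuming NumPy and invented positive data; the point it illustrates is that the centroid update stays the plain mean for any Bregman divergence:

    import numpy as np

    def sqeuclid(X, mu):
        # Squared Euclidean distance: recovers classical k-means.
        return ((X - mu) ** 2).sum(axis=1)

    def gen_kl(X, mu):
        # Generalized KL divergence, a Bregman divergence for positive data.
        return (X * np.log(X / mu) - X + mu).sum(axis=1)

    def bregman_kmeans(X, k, div, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: nearest center under the chosen divergence.
            labels = np.argmin(
                np.stack([div(X, c) for c in centers], axis=1), axis=1)
            # Update step: the plain mean is optimal for ANY Bregman divergence.
            centers = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels, centers

    X = np.abs(np.random.default_rng(1).normal(2.0, 0.5, size=(200, 5)))
    labels, centers = bregman_kmeans(X, k=3, div=gen_kl)
    print(np.bincount(labels))
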

CLASSIFICATION
Number | Topic | Abstract | Assignee | Material
0

A Model for Handling Approximate, Noisy or Incomplete Labeling in Text Classification

We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training documents and class labels by using a generalization of the Expectation Maximization algorithm. The estimates can be used in standard classification models to reduce error rates. Since uncertainties in the labeling are taken into account, the model provides an elegant mechanism to deal with noisy labels. We provide an intuitive modification to the EM iterations by re-estimating the empirical distribution in order to reinforce feature values in unlabeled data and to reduce the influence of noisily labeled examples. Considerable improvement in the classification accuracies of two popular classification algorithms on standard labeled data-sets with and without artificially introduced noise, as well as in the presence and absence of unlabeled data, indicates that this may be a promising method to reduce the burden of manual labeling.

Article
1

Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges (a distributed state representation, as in dynamic Bayesian networks (DBNs)) and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data.

Article
3

Multilabelled Classification Using Maximum Entropy Method

Many classification problems require classifiers to assign each single document into more than one category, which is called multilabelled classification. The categories in such problems usually are neither conditionally independent from each other nor mutually exclusive; therefore it is not trivial to directly employ state-of-the-art classification algorithms without losing information about the relations among categories. In this paper, we explore correlations among categories with the maximum entropy method and derive a classification algorithm for multi-labelled documents. Our experiments show that this method significantly outperforms the combination of single-label approaches.

Brescia Santino Article
4

Learning the Structure of Markov Logic Networks

Markov logic networks (MLNs) combine logic and probability by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. In this paper we develop an algorithm for learning the structure of MLNs from relational databases, combining ideas from inductive logic programming (ILP) and feature induction in Markov networks. The algorithm performs a beam or shortest-first search of the space of clauses, guided by a weighted pseudo-likelihood measure. This requires computing the optimal weights for each candidate structure, but we show how this can be done efficiently. The algorithm can be used to learn an MLN from scratch, or to refine an existing knowledge base. We have applied it in two real-world domains, and found that it outperforms using off-the-shelf ILP systems to learn the MLN structure, as well as pure ILP, purely probabilistic and purely knowledge-based approaches.

Article
5

Focused Named Entity Recognition Using Machine Learning

In this paper we study the problem of finding the most topical named entities among all entities in a document, which we refer to as focused named entity recognition. We show that these focused named entities are useful for many natural language processing applications, such as document summarization, search result ranking, and entity detection and tracking. We propose a statistical model for focused named entity recognition by converting it into a classification problem. We then study the impact of various linguistic features and compare a number of classification algorithms. From experiments on an annotated Chinese news corpus, we demonstrate that the proposed method can achieve near human-level accuracy.

Article
6

On the Collective Classification of Email “Speech Acts”

We consider classification of email messages as to whether or not they contain certain "email acts", such as a request or a commitment. We show that exploiting the sequential correlation among email messages in the same thread can improve email-act classification. More specifically, we describe a new text classification algorithm based on a dependency-network based collective classification method, in which the local classifiers are maximum entropy models based on words and certain relational features. We show that statistically significant improvements over a bag-of-words baseline classifier can be obtained for some, but not all, email-act classes. Performance improvement obtained by collective classification appears to be consistent across email acts suggested by prior speech-act theory.

Article
7

Markov Logic: A Unifying Framework for Statistical Relational Learning

Interest in statistical relational learning (SRL) has grown rapidly in recent years. Several key SRL tasks have been identified, and a large number of approaches have been proposed. Increasingly, a unifying framework is needed to facilitate transfer of knowledge across tasks and approaches, to compare approaches, and to help bring structure to the field. We propose Markov logic as such a framework. Syntactically, Markov logic is indistinguishable from first-order logic, except that each formula has a weight attached. Semantically, a set of Markov logic formulas represents a probability distribution over possible worlds, in the form of a log-linear model with one feature per grounding of a formula in the set, with the corresponding weight. We show how approaches like probabilistic relational models, knowledge-based model construction and stochastic logic programs are special cases of Markov logic. We also show how tasks like collective classification, link prediction, link-based clustering, social network modeling, and object identification can be concisely formulated in Markov logic. Finally, we briefly describe learning and inference algorithms for Markov logic and report positive results on a link prediction task.

Article

MINING DATA STREAMS
Number | Topic | Abstract | Assignee | Material
0

Approximate Frequency Counts over Data Streams

 

We present algorithms for computing frequency counts exceeding a user-specified threshold over data streams. Our algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. Our algorithms can easily be deployed for streams of singleton items like those found in IP network monitoring. We can also handle streams of variable sized sets of items exemplified by a sequence of market basket transactions at a retail store. For such streams, we describe an optimized implementation to compute frequent itemsets in a single pass.  

Article
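
For the paper above, a minimal sketch of its lossy-counting idea on single items (parameter names are invented; the paper also covers itemsets):

    # Sketch of lossy counting: keep (item, count, max undercount) triples
    # and prune at bucket boundaries of width ceil(1/epsilon). Every item
    # with true frequency >= support * N is reported, and reported counts
    # undershoot the true ones by at most epsilon * N.
    from math import ceil

    def lossy_counting(stream, epsilon=0.01, support=0.05):
        width = ceil(1 / epsilon)        # bucket width
        counts, n = {}, 0                # item -> (count, undercount delta)
        for item in stream:
            n += 1
            bucket = ceil(n / width)
            count, delta = counts.get(item, (0, bucket - 1))
            counts[item] = (count + 1, delta)
            if n % width == 0:           # bucket boundary: drop weak entries
                counts = {k: (c, d) for k, (c, d) in counts.items()
                          if c + d > bucket}
        # Report items whose stored count clears (support - epsilon) * N.
        return sorted(k for k, (c, d) in counts.items()
                      if c >= (support - epsilon) * n)

    print(lossy_counting("abacabadabra" * 50))  # ['a', 'b', 'c', 'd', 'r']
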

1

Bursty and Hierarchical Structure in Streams

 

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise: the appearance of a topic in a document stream is signaled by a "burst of activity", with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts", in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.

Article
2

Better Streaming Algorithms for Clustering Problems

 

We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k-Median problem which produces a constant factor approximation in one pass using storage space O(k poly log n). This is a significant improvement over the previous best algorithm, which yielded a 2^O(1/ε) approximation using O(n^ε) space. Next we give a streaming algorithm for the k-Median problem with an arbitrary distance function. We also study algorithms for clustering problems with outliers in the streaming model. Here, we give bicriterion guarantees, producing constant factor approximations by increasing the allowed fraction of outliers slightly.

Article
3

Mining High-Speed Data Streams

 

Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example. VFDT can incorporate tens of thousands of examples per second using off-the-shelf hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. We study VFDT's properties and demonstrate its utility through an extensive set of experiments on synthetic data. We apply VFDT to mining the continuous stream of Web access data from the whole University of Washington main campus.

Silvia Salsone Article 1,

Article 2
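
For the VFDT paper above, a minimal sketch of the Hoeffding test at its heart; the gain values and parameters below are invented for illustration:

    # Sketch: split on the best attribute once the observed gain gap
    # between the best and second-best attribute exceeds the Hoeffding
    # bound, which shrinks as more examples arrive.
    from math import log, sqrt

    def hoeffding_bound(value_range, delta, n):
        # Max deviation of an n-example mean from its true value, w.p. 1 - delta.
        return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))

    def should_split(gain_best, gain_second, n, value_range=1.0, delta=1e-7):
        return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

    print(should_split(0.30, 0.25, n=1_000))    # False: gap may be noise
    print(should_split(0.30, 0.25, n=100_000))  # True: enough evidence
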

4

Finding Frequent Items in Data Streams

 

We present a 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this problem has not been previously studied in the literature.

Article
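
For the paper above, a minimal sketch of a count-sketch-style structure; Python's built-in hash stands in for the pairwise-independent hash families the paper analyzes:

    # Sketch: t rows of b counters; each item is mapped by a bucket hash
    # and a +/-1 sign hash, and the frequency estimate is the median of
    # the signed counters across rows.
    import random
    from statistics import median

    class CountSketch:
        def __init__(self, t=5, b=256, seed=0):
            rnd = random.Random(seed)
            self.salts = [(rnd.random(), rnd.random()) for _ in range(t)]
            self.table = [[0] * b for _ in range(t)]
            self.b = b

        def _hashes(self, x, row):
            h_salt, s_salt = self.salts[row]
            bucket = hash((h_salt, x)) % self.b
            sign = 1 if hash((s_salt, x)) % 2 else -1
            return bucket, sign

        def add(self, x):
            for r in range(len(self.table)):
                bucket, sign = self._hashes(x, r)
                self.table[r][bucket] += sign

        def estimate(self, x):
            return median(sign * self.table[r][bucket]
                          for r in range(len(self.table))
                          for bucket, sign in [self._hashes(x, r)])

    cs = CountSketch()
    for ch in "abacabadabra" * 100:
        cs.add(ch)
    print(cs.estimate("a"))  # close to the true count of 600
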
5

Accurate Decision Trees for Mining High-Speed Data Streams

 

In this paper we study the problem of constructing accurate decision tree models from data streams. Data streams are incremental tasks that require incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. In this paper we extend the VFDT system in two directions: the ability to deal with continuous data and the use of more powerful classification techniques at tree leaves. The proposed system, VFDTc, can incorporate and classify new information online, with a single scan of the data, in time constant per example. The most relevant property of our system is the ability to obtain a performance similar to a standard decision tree algorithm even for medium size datasets. This is relevant due to the any-time property. We study the behaviour of VFDTc in different problems and demonstrate its utility in large and medium data sets. Under a bias-variance analysis we observe that VFDTc in comparison to C4.5 is able to reduce the variance component.

Article
7

Efficient Decision Tree Construction on Streaming Data

 

Decision tree construction is a well studied problem in data mining. Recently, there has been much interest in mining streaming data. Domingos and Hulten have presented a one-pass algorithm for decision tree construction. Their work uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. In this paper, we revisit this problem. We make the following two contributions: 1) We present a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.

Article
8

Detecting Change in Data Streams

 

Detecting changes in a data stream is an important area of research with many applications. In this paper, we present novel methods for the detection and estimation of change. In contrast to previously proposed tools, our techniques provide proven guarantees on the statistical significance of detected change. In addition to providing reliable detection of change, our method allows for a meaningful description and quantification of those changes. Our techniques are nonparametric and so they require no prior assumptions on the nature of the distribution that generates the data, except for assuming that points in the stream are generated independently. Additionally, these techniques work for both continuous and discrete data. In an experimental study we demonstrate the usefulness of our techniques.

Article
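
For the change-detection paper above, a minimal sketch of the two-window scheme such methods build on, with the Kolmogorov-Smirnov statistic as the nonparametric distance; the threshold below is an arbitrary placeholder, whereas the paper derives thresholds with proven significance guarantees:

    # Sketch: compare a reference window against a sliding current window
    # and flag a change when their empirical distributions drift apart.
    import random

    def ks_distance(a, b):
        # Largest gap between the two empirical CDFs.
        def cdf(sample, x):
            return sum(v <= x for v in sample) / len(sample)
        return max(abs(cdf(a, x) - cdf(b, x)) for x in set(a) | set(b))

    random.seed(0)
    stream = [random.gauss(0, 1) for _ in range(400)] + \
             [random.gauss(2, 1) for _ in range(400)]  # shift at item 400

    ref, win = stream[:200], []
    for i, x in enumerate(stream[200:], start=200):
        win.append(x)
        if len(win) > 200:
            win.pop(0)
        if len(win) == 200 and i % 50 == 0 and ks_distance(ref, win) > 0.3:
            print("change flagged at item", i)
            break
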
9

A Framework for Projected Clustering of High Dimensional Data Streams

 

The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan, stream analysis methods have been proposed in this context. However, a lot of stream data is high-dimensional in nature. High-dimensional data is inherently more complex in clustering, classification, and similarity search. Recent research discusses methods for projected clustering over high-dimensional data sets. This method is however difficult to generalize to data streams because of the complexity of the method and the large volume of the data streams. In this paper, we propose a new, high-dimensional, projected data stream clustering method, called HPStream. The method incorporates a fading cluster structure, and the projection based clustering methodology. It is incrementally updatable and is highly scalable on both the number of dimensions and the size of the data streams, and it achieves better clustering quality in comparison with the previous stream clustering methods. Our performance study with both real and synthetic data sets demonstrates the efficiency and effectiveness of our proposed framework and implementation methods.

Article
10

Sequential Pattern Mining: A Survey

 

-----

Article
11

Loadstar: A Load Shedding Scheme for Classifying Data Streams

 

We consider the problem of resource allocation in mining multiple data streams. Due to the large volume and the high speed of streaming data, mining algorithms must cope with the effects of system overload. How to realize maximum mining benefits under resource constraints becomes a challenging task. In this paper, we propose a load shedding scheme for classifying multiple data streams. We focus on the following problems: i) how to classify data that are dropped by the load shedding scheme? and ii) how to decide when to drop data from a stream? We introduce a quality of decision (QoD) metric to measure the level of uncertainty in classification when exact feature values of the data are not available because of load shedding. A Markov model is used to predict the distribution of feature values and we make classification decisions using the predicted values and the QoD metric. Thus, resources are allocated among multiple data streams to maximize the quality of classification decisions. Furthermore, our load shedding scheme is able to learn and adapt to changing data characteristics in the data streams. Experiments on both synthetic data and real-life data show that our load shedding scheme is effective in improving the overall accuracy of classification under resource constraints.

Article

ASSOCIATION RULES
Number | Topic | Abstract | Material
0

Feasible Itemset Distributions in Data Mining: Theory and Application

 

 

Computing frequent itemsets and maximally frequent itemsets in a database are classic problems in data mining. The resource requirements of all extant algorithms for both problems depend on the distribution of frequent patterns, a topic that has not been formally investigated. In this paper, we study properties of length distributions of frequent and maximal frequent itemset collections and provide novel solutions for computing tight lower bounds for feasible distributions. We show how these bounding distributions can help in generating realistic synthetic datasets, which can be used for algorithm benchmarking.

Article

1

Tree Structures for Mining Association Rules

 

A well-known approach to Knowledge Discovery in Databases involves the identification of association rules linking database attributes. Extracting all possible association rules from a database, however, is a computationally intractable problem, because of the combinatorial explosion in the number of sets of attributes for which incidence-counts must be computed. Existing methods for dealing with this may involve multiple passes of the database, and tend still to cope badly with densely-packed database records. We describe here a class of methods we have introduced that begin by using a single database pass to perform a partial computation of the totals required, storing these in the form of a set enumeration tree, which is created in time linear to the size of the database. Algorithms for using this structure to complete the count summations are discussed, and a method is described, derived from the well-known Apriori algorithm. Results are presented demonstrating the performance advantage to be gained from the use of this approach. Finally, we discuss possible further applications of the method.

Article
2

MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases

 

We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression scheme. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.

Article
3

Alternative Interest Measures for Mining Associations in Databases
 

 

Data mining is defined as the process of discovering significant and potentially useful patterns in large volumes of data. Discovering associations between items in a large database is one such data mining activity. In finding associations, support is used as an indicator as to whether an association is interesting. In this paper, we discuss three alternative interest measures for associations: any-confidence, all-confidence, and bond. We prove that the important downward closure property applies to both all-confidence and bond. We show that downward closure does not hold for any-confidence. We also prove that, if associations have a minimum all-confidence or minimum bond, then those associations will have a given lower bound on their minimum support and the rules produced from those associations will have a given lower bound on their minimum confidence as well. However, associations that have that minimum support (and likewise their rules that have minimum confidence) may not satisfy the minimum all-confidence or minimum bond constraint. We describe the algorithms that efficiently find all associations with a minimum all-confidence or minimum bond and present some experimental results.

Article
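
For the paper above, a minimal sketch computing the two downward-closed measures it studies on toy transactions (names and data invented):

    # Sketch: all-confidence = supp(X) / max single-item support in X, and
    # bond = supp(X) / supp(any item of X present).
    def supports(transactions, itemset):
        items = set(itemset)
        all_of = sum(items <= t for t in transactions)
        any_of = sum(bool(items & t) for t in transactions)
        return all_of, any_of

    transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}, {"c"}]
    X = ("a", "b")
    n = len(transactions)
    all_of, any_of = supports(transactions, X)
    max_item = max(supports(transactions, (i,))[0] for i in X)
    print("support        :", all_of / n)         # 0.4
    print("all-confidence :", all_of / max_item)  # 2/3
    print("bond           :", all_of / any_of)    # 0.5
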
4

A Statistical Theory for Quantitative Association Rules

 

Association rules are a key data-mining tool and as such have been well researched. So far, this research has focused predominantly on databases containing categorical data only. However, many real-world databases contain quantitative attributes and current solutions for this case are so far inadequate. In this paper we introduce a new definition of quantitative association rules based on statistical inference theory. Our definition reflects the intuition that the goal of association rules is to find extraordinary and therefore interesting phenomena in databases. We also introduce the concept of sub-rules which can be applied to any type of association rule. Rigorous experimental evaluation on real-world datasets is presented, demonstrating the usefulness and characteristics of rules mined according to our definition.

Article
5

Finding Interesting Associations without Support Pruning

 

 

Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed, and conduct experiments on real and synthetic data to obtain a comparative performance analysis.

Article
6

Mining Non-Redundant Association Rules

 

 

 

The traditional association rule mining framework produces many redundant rules. The extent of redundancy is a lot larger than previously suspected. We present a new framework for associations based on the concept of closed frequent itemsets. The number of non-redundant rules produced by the new approach is exponentially (in the length of the longest frequent itemset) smaller than the rule set from the traditional approach. Experiments using several "hard" as well as "easy" real and synthetic databases confirm the utility of our framework in terms of reduction in the number of rules presented to the user, and in terms of time.

Article
7

The Complexity of Mining Maximal Frequent Itemsets and Maximal Frequent Patterns

 

 

Mining maximal frequent itemsets is one of the most fundamental problems in data mining. In this paper we study the complexity-theoretic aspects of maximal frequent itemset mining, from the perspective of counting the number of solutions. We present the first formal proof that the problem of counting the number of distinct maximal frequent itemsets in a database of transactions, given an arbitrary support threshold, is #P-complete, thereby providing strong theoretical evidence that the problem of mining maximal frequent itemsets is NP-hard. This result is of particular interest since the associated decision problem of checking the existence of a maximal frequent itemset is in P. We also extend our complexity analysis to other similar data mining problems dealing with complex data structures, such as sequences, trees, and graphs, which have attracted intensive research interests in recent years. Normally, in these problems a partial order among frequent patterns can be defined in such a way as to preserve the downward closure property, with maximal frequent patterns being those without any successor with respect to this partial order. We investigate several variants of these mining problems in which the patterns of interest are subsequences, subtrees, or subgraphs, and show that the associated problems of counting the number of maximal frequent patterns are all either #P-complete or #P-hard.

Article
8

Scalable Algorithms for Association Mining

 

 

Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets and then, forming conditional implication rules among them. In this paper, we present efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices, which can be solved in memory. Efficient lattice traversal techniques are presented which quickly identify all the long frequent itemsets and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining improvements of more than an order of magnitude for our test databases.

Article
9

Mining sequential patterns by pattern-growth: the PrefixSpan approach
 

 

 

Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [R. Agrawal et al. (1994)] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [J. Han et al. (2000)], we propose a more efficient method, called PrefixSpan, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [M. Zaki, (2001)] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.

Article
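
For the PrefixSpan paper above, a minimal sketch of prefix projection on sequences of single items (the real algorithm also handles itemset elements and uses pseudo-projection):

    # Sketch: grow a pattern one item at a time and recurse on the
    # projected database, i.e. the suffixes that follow the new item.
    def prefixspan(db, min_sup, prefix=()):
        patterns = []
        # Count items occurring in the projected database.
        freq = {}
        for seq in db:
            for item in set(seq):
                freq[item] = freq.get(item, 0) + 1
        for item, sup in freq.items():
            if sup < min_sup:
                continue
            new_prefix = prefix + (item,)
            patterns.append((new_prefix, sup))
            # Project: keep the suffix after the first occurrence of item.
            projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
            patterns += prefixspan(projected, min_sup, new_prefix)
        return patterns

    db = [list("abcd"), list("acd"), list("abd"), list("bcd")]
    for pat, sup in sorted(prefixspan(db, min_sup=3)):
        print("".join(pat), sup)
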
10

Efficient Mining of Both Positive and Negative Association Rules

 

 

This paper presents an efficient method for mining both positive and negative association rules in databases. The method extends traditional associations to include association rules of the forms A → not B, not A → B, and not A → not B, which indicate negative associations between itemsets. With a pruning strategy and an interestingness measure, our method scales to large databases. The method has been evaluated using both synthetic and real-world databases, and our experimental results demonstrate its effectiveness and efficiency.

Article
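
For the paper above, a minimal sketch of how support and confidence extend to a negative rule A → not B on toy data (the paper's pruning strategy and interestingness measure are omitted):

    # Sketch: supp(A -> not B) = supp(A) - supp(A u B), and confidence
    # divides by supp(A). Toy market-basket data.
    transactions = [{"tea"}, {"tea"}, {"coffee", "milk"},
                    {"coffee"}, {"tea", "coffee"}]
    n = len(transactions)

    def supp(itemset):
        return sum(set(itemset) <= t for t in transactions) / n

    def neg_rule(a, b):
        # Support and confidence of  a -> not b.
        s = supp(a) - supp(set(a) | set(b))
        return s, s / supp(a)

    print(neg_rule({"tea"}, {"coffee"}))  # ~(0.4, 0.67)
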

PRIVACY-PRESERVING DATA MINING
Number | Topic | Abstract | Material
0

A New Scheme on Privacy-Preserving Data Classification

We address the privacy-preserving classification problem in a distributed system. Randomization has been the approach proposed to preserve privacy in such a scenario. However, this approach is now proven to be insecure, as it has been discovered that some privacy-intrusion techniques can be used to reconstruct private information from the randomized data tuples. We introduce an algebraic-technique-based scheme. Compared to the randomization approach, our new scheme can build classifiers more accurately but disclose less private information. Furthermore, our new scheme can be readily integrated as a middleware with existing systems.

Article
1

Anonymity Preserving Data Collection

Protection of privacy has become an important problem in data mining. In particular, individuals have become increasingly unwilling to share their data, frequently resulting in individuals either refusing to share their data or providing incorrect data. In turn, such problems in data collection can affect the success of data mining, which relies on sufficient amounts of accurate data in order to produce meaningful results. Random perturbation and randomized response techniques can provide some level of privacy in data collection, but they have an associated cost in accuracy. Cryptographic privacy-preserving data mining methods provide good privacy and accuracy properties. However, in order to be efficient, those solutions must be tailored to specific mining tasks, thereby losing generality. In this paper, we propose efficient cryptographic techniques for online data collection in which data from a large number of respondents is collected anonymously, without the help of a trusted third party. That is, our solution allows the miner to collect the original data from each respondent, but in such a way that the miner cannot link a respondent's data to the respondent. An advantage of such a solution is that, because it does not change the actual data, its success does not depend on the underlying data mining problem. We provide proofs of the correctness and privacy of our solution, as well as experimental data that demonstrates its efficiency. We also extend our solution to tolerate certain kinds of malicious behavior of the participants.

Article
2

A Framework for High-Accuracy Privacy-Preserving Mining

To preserve client privacy in the data mining process, a variety of techniques based on random perturbation of individual data records have been proposed recently. In this paper, we present FRAPP, a generalized matrix-theoretic framework of random perturbation, which facilitates a systematic approach to the design of perturbation mechanisms for privacy-preserving mining. Specifically, FRAPP is used to demonstrate that (a) the prior techniques differ only in their choices for the perturbation matrix elements, and (b) a symmetric perturbation matrix with minimal condition number can be identified, maximizing the accuracy even under strict privacy guarantees. We also propose a novel perturbation mechanism wherein the matrix elements are themselves characterized as random variables, and demonstrate that this feature provides significant improvements in privacy at only a marginal cost in accuracy. The quantitative utility of FRAPP, which applies to random-perturbation-based privacy-preserving mining in general, is evaluated specifically with regard to frequent itemset mining on a variety of real datasets. Our experimental results indicate that, for a given privacy requirement, substantially lower errors are incurred, with respect to both itemset identity and itemset support, as compared to the prior techniques.

Article
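
For the FRAPP paper above, a minimal sketch of the perturbation-matrix view instantiated with the simplest matrix, uniform randomized response; this is one hand-picked matrix used for illustration, not the paper's optimized one:

    # Sketch: keep the true categorical value with probability p, else
    # report a uniformly chosen other value; the miner inverts the known
    # matrix to estimate the true distribution.
    import random
    random.seed(1)

    values = ["A", "B", "C"]
    p = 0.7  # probability of reporting the true value

    def perturb(v):
        others = [u for u in values if u != v]
        return v if random.random() < p else random.choice(others)

    truth = ["A"] * 700 + ["B"] * 200 + ["C"] * 100
    reported = [perturb(v) for v in truth]

    # For this matrix, observed_i = (a - b) * true_i + b, with a = p on
    # the diagonal and b = (1 - p) / (m - 1) off it, so inversion is direct.
    a, b = p, (1 - p) / (len(values) - 1)
    for v in values:
        observed = reported.count(v) / len(reported)
        print(v, round((observed - b) / (a - b), 3))  # ~0.7, ~0.2, ~0.1
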
3

Data Privacy Through Optimal k-Anonymization

Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k-1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a non-trivial dataset under a general model of the problem.

Article
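
For the k-anonymization paper above, a minimal sketch of the property being optimized; the hard part the paper solves, searching the space of generalizations, is omitted:

    # Sketch: a table is k-anonymous on its quasi-identifier columns if
    # every combination of quasi-identifier values occurs at least k times.
    from collections import Counter

    def is_k_anonymous(rows, quasi_ids, k):
        groups = Counter(tuple(r[c] for c in quasi_ids) for r in rows)
        return min(groups.values()) >= k

    rows = [
        {"zip": "476**", "age": "2*", "disease": "flu"},
        {"zip": "476**", "age": "2*", "disease": "cold"},
        {"zip": "479**", "age": "3*", "disease": "flu"},
        {"zip": "479**", "age": "3*", "disease": "asthma"},
    ]
    print(is_k_anonymous(rows, quasi_ids=("zip", "age"), k=2))  # True
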
4

Privacy-Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining

Association rule mining has received a lot of attention in the data mining community and several algorithms were proposed to improve the performance of association rule or frequent itemset mining. The IBM Almaden synthetic data generator has been commonly used for performance evaluation. One recent work shows that the data generated is not good enough for benchmarking as it has very different characteristics from real-world data sets. Hence there is a great need to use real-world data sets as benchmarks. However, organizations hesitate to provide their data due to privacy concerns. Recent work on privacy preserving association rule mining addresses this issue by modifying real data sets to hide sensitive or private rules. However, modifying individual values in real data may impact on other, non-sensitive rules. In this paper, we propose a feasible solution to the NP-complete problem of inverse frequent set mining. Since solving this problem by linear programming techniques is very computationally prohibitive, we apply graph-theoretical results to divide the original itemsets into components that preserve maximum likelihood estimation. We then apply the iterative proportional fitting method to each component. The technique is experimentally evaluated with two real data sets and one synthetic data set. The results show that our approach is effective and efficient for reconstructing market basket data set from a given set of frequent itemsets while preserving sensitive information.

Article
5

State-of-the-art in Privacy Preserving Data Mining

We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work in the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made.

Article
6

Top-Down Specialization for Information and Privacy Preservation

Releasing person-specific data in its most specific state poses a threat to individual privacy. This paper presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information and remains useful for modelling classification. The generalization of data is implemented by specializing or detailing the level of information in a top-down manner until a minimum privacy requirement is violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. Our approach exploits the fact that data usually contains redundant structures for classification. While generalization may eliminate some structures, other structures emerge to help. Our results show that quality of classification can be preserved even for highly restrictive privacy requirements. This work has great applicability to both public and private sectors that share information for mutual benefits and productivity.

Article
7

Privacy-Preserving Classification of Customer Data without Loss of Accuracy

Privacy has become an increasingly important issue in data mining. In this paper, we consider a scenario in which a data miner surveys a large number of customers to learn classification rules on their data, while the sensitive attributes of these customers need to be protected. Solutions have been proposed to address this problem using randomization techniques. Such solutions exhibit a tradeoff of accuracy and privacy: the more each customer's private information is protected, the less accurate the result the miner obtains; conversely, the more accurate the result, the less privacy for the customers. In this paper, we propose a simple cryptographic approach that is efficient even in a many-customer setting, provides strong privacy for each customer, and does not lose any accuracy as the cost of privacy. Our key technical contribution is a privacy-preserving method that allows a data miner to compute frequencies of values or tuples of values in the customers' data, without revealing the privacy-sensitive part of the data. Unlike general-purpose cryptographic protocols, this method requires no interaction between customers, and each customer only needs to send a single flow of communication to the data miner. However, we are still able to ensure that nothing about the sensitive data beyond the desired frequencies is revealed to the data miner. To illustrate the power of our approach, we use our frequency mining computation to obtain a privacy-preserving naive Bayes classifier learning algorithm. Initial experimental results demonstrate the practical efficiency of our solution. We also suggest some other applications of privacy-preserving frequency mining.

Article

MISCELLANEOUS
Number | Topic | Abstract | Material
0

Web Object Indexing Using Domain Knowledge

A web object is defined to represent any meaningful object embedded in web pages (e.g. images, music) or pointed to by hyperlinks (e.g. downloadable files). Users usually search for information about a certain 'object', rather than for a web page containing the query terms. To facilitate web object searching and organizing, in this paper we propose a novel approach to web object indexing that discovers an object's inherent structure information with domain knowledge. In our approach, Layered LSI spaces are built for the hierarchically structured domain knowledge, in order to emphasize the specific semantics and term space in each layer of the domain knowledge. Then, the web object representation is constructed by hyperlink analysis, and further pruned to remove noise. Finally, the structure attributes of the web object are extracted with the knowledge document that best matches the web object. Our approach also indicates a new way to use trustworthy Deep Web knowledge to help organize dispersed information on the Surface Web.

Article