Recent advances in mining patterns from complex data

Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari
Journal Paper: Journal of Intelligent Information Systems, Volume 47, Issue 1, August 2016, Pages 1-3

An end to end framework for building data cubes over trajectory data streams

Elio Masciari
Journal Paper: Journal of Intelligent Information Systems, Volume 45, Issue 2, October 2015, Pages 131–164

Abstract

In this paper we propose an end-to-end framework that enables efficient analysis of trajectory streams. Our approach consists of several steps. First, we apply a partitioning strategy to incoming streams of trajectories in order to reduce the trajectory size and represent trajectories using a suitable data structure. After the encoding step, we build specialized cuboids for trajectories in order to make the querying step effective. This problem proved really challenging, as we deal with data (trajectories) for which the order of elements is relevant, which makes the analysis considerably harder than for classical transactional data. We performed several tests on real-world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

Analysing microarray expression data through effective clustering

Elio Masciari, Giuseppe Massimiliano Mazzeo, Carlo Zaniolo
Journal Paper: Information Sciences, Volume 262, 20 March 2014, Pages 32–45

Abstract

The recent advances in genomic technologies and the availability of large-scale microarray datasets call for the development of advanced data analysis techniques, such as data mining and statistical analysis, to cite a few. Among the mining techniques proposed so far, cluster analysis has become a standard method for the analysis of microarray expression data. It can be used both for initial screening of patients and for extraction of disease molecular signatures. Moreover, clustering can be profitably exploited to characterize genes of unknown function and uncover patterns that can be interpreted as indications of the status of cellular processes. Finally, clustering biological data is useful not only for exploring the data but also for discovering implicit links between the objects. To this end, several clustering approaches have been proposed in order to obtain a good trade-off between accuracy and efficiency of the clustering process. In particular, great attention has been devoted to hierarchical clustering algorithms for their accuracy in unsupervised identification and stratification of groups of similar genes or patients, while partition-based approaches are exploited when fast computations are required. Indeed, it is well known that no existing clustering algorithm completely satisfies both accuracy and efficiency requirements; thus a good clustering algorithm has to be evaluated with respect to some external criteria that are independent of the metric being used to compute clusters. In this paper, we propose a clustering algorithm called M-CLUBS (for Microarray data CLustering Using Binary Splitting) exhibiting higher accuracy than the hierarchical ones proposed so far while allowing a faster computation with respect to partition-based approaches. Indeed, M-CLUBS is faster and more accurate than other algorithms, including k-means and its recently proposed refinements, as we will show in the experimental section.
The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion possessing unique analytical properties that we exploit to achieve a very fast computation. M-CLUBS derives good clusters without requiring input from users, and it is robust and impervious to noise, while providing better speed and accuracy than methods, such as BIRCH, that are endowed with the same critical properties. Because microarray data are represented as arrays of numeric values, M-CLUBS is well suited to analyzing them, since it is designed to perform well with Euclidean distances. To strengthen the results, the obtained clusters were interpreted by a domain expert and evaluated with quality measures specifically tailored for biological validity assessment.
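
The role of a least quadratic distance criterion in a divisive step can be illustrated with a minimal sketch. It is not the M-CLUBS implementation: the function names `ssq` and `best_binary_split` are hypothetical, the exhaustive search over axis-aligned cuts is an assumption made for brevity, and the agglomerative phase is omitted. The sketch relies only on the standard identity that, per dimension, the sum of squared distances to the centroid equals Σx² − (Σx)²/n, so the cost of a candidate split follows from sums and sums of squares.

```python
def ssq(points):
    """Sum of squared Euclidean distances from the points to their centroid.

    Computed per dimension as sum(x^2) - (sum(x))^2 / n.
    """
    n = len(points)
    if n == 0:
        return 0.0
    total = 0.0
    for d in range(len(points[0])):
        s = sum(p[d] for p in points)
        q = sum(p[d] ** 2 for p in points)
        total += q - s * s / n
    return total


def best_binary_split(points):
    """Return (cost, dimension, threshold) of the axis-aligned cut that
    minimizes the total within-group sum of squares (hypothetical helper)."""
    best = None
    for d in range(len(points[0])):
        ordered = sorted(points, key=lambda p: p[d])
        for i in range(1, len(ordered)):
            if ordered[i - 1][d] == ordered[i][d]:
                continue  # no threshold separates equal coordinates
            cost = ssq(ordered[:i]) + ssq(ordered[i:])
            threshold = (ordered[i - 1][d] + ordered[i][d]) / 2
            if best is None or cost < best[0]:
                best = (cost, d, threshold)
    return best


# Two well-separated groups along the first coordinate.
samples = [(1.0, 5.0), (1.2, 5.1), (0.8, 4.9),
           (9.0, 5.0), (9.2, 5.2), (8.8, 4.8)]
cost, dim, threshold = best_binary_split(samples)
```

A real divisive phase would apply such a split recursively and stop when the reduction in the criterion becomes negligible; the stopping rule is left out here because the abstract does not specify it.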

Dealing with trajectory streams by clustering and mathematical transforms

Gianni Costa, Giuseppe Manco, Elio Masciari
Journal Paper: Journal of Intelligent Information Systems, February 2014, Volume 42, Issue 1, pp 155-177

Abstract

Nowadays, almost all kinds of electronic devices (e.g., smartphones, GPS devices and so on) leave traces of their movements. Thus, the huge number of these “tiny” data sources leads to the generation of massive streams of geo-referenced data. The effective analysis of such amounts of data is challenging, and the ability to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data stream management poses new challenges both for their proper definition and acquisition, thus making the overall process harder than for classical point data. In particular, we are interested in solving the problem of effective clustering of trajectory data streams, which proved really intriguing as we deal with sequential data that have to be properly managed due to their ordering. We propose a framework that allows data pre-processing in order to make the mining step more effective. As for every data mining tool, the experimental evaluation is crucial; thus we performed several tests on real-world datasets that confirmed the efficiency and effectiveness of the proposed approach.

Mining complex patterns

Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Elio Masciari, Giuseppe Manco
Journal Paper: Journal of Intelligent Information Systems, April 2014, Volume 42, Issue 2, pp 179-180

RFID-data compression for supporting aggregate queries

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Elio Masciari
Journal Paper: ACM Transactions on Database Systems (TODS), Volume 38, Issue 2, June 2013

Abstract

RFID-based systems for object tracking and supply chain management have been emerging since the RFID technology proved effective in monitoring movements of objects. The monitoring activity typically results in huge numbers of readings, thus making the problem of efficiently retrieving aggregate information from the collected data a challenging issue. In fact, tackling this problem is of crucial importance, as fast answers to aggregate queries are often mandatory to support the decision making process. In this regard, a compression technique for RFID data is proposed, and used as the core of a system supporting the efficient estimation of aggregate queries. Specifically, this technique aims at constructing a lossy synopsis of the data over which aggregate queries can be estimated, without accessing the original data. Owing to the lossy nature of the compression, query estimates are approximate, and are returned along with intervals that are guaranteed to contain the exact query answers. The effectiveness of the proposed approach has been experimentally validated, showing a remarkable trade-off between the efficiency and the accuracy of the query estimation.
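
The idea of estimating an aggregate query over a lossy synopsis, and returning an interval guaranteed to contain the exact answer, can be illustrated with a deliberately simple sketch. It swaps in a fixed-width bucket histogram over integer reading timestamps, which is not the compression technique of the paper; the names `build_synopsis` and `count_bounds` and the bucket width are assumptions introduced only for illustration.

```python
from collections import Counter


def build_synopsis(timestamps, width):
    """Lossy synopsis: number of readings per fixed-width time bucket."""
    return Counter(t // width for t in timestamps)


def count_bounds(synopsis, width, lo, hi):
    """Bounds on the number of readings with lo <= t < hi.

    Buckets entirely inside the range contribute to both bounds; buckets
    that only overlap the range contribute to the upper bound alone, so
    the exact count always lies in [lower, upper].
    """
    lower = upper = 0
    for bucket, count in synopsis.items():
        start, end = bucket * width, (bucket + 1) * width
        if lo <= start and end <= hi:
            lower += count
            upper += count
        elif start < hi and lo < end:
            upper += count
    return lower, upper


readings = [1, 3, 4, 9, 12, 15, 18, 21, 22, 29]
synopsis = build_synopsis(readings, width=10)
lower, upper = count_bounds(synopsis, 10, 5, 25)
exact = sum(1 for t in readings if 5 <= t < 25)
```

The query never touches `readings`; only the per-bucket counts are consulted, and the width of the returned interval reflects how much information the synopsis discarded.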

SMART: Stream Monitoring enterprise Activities by RFID Tags

Elio Masciari
Journal Paper: Information Sciences, Volume 195, 15 July 2012, Pages 25–44

Abstract

Data streams are potentially infinite data sources that flow continuously while monitoring a physical phenomenon, like temperature levels, or other kinds of human activities, such as clickstreams, telephone call records, and so on. In recent years, RFID technology has led to the generation of huge streams of data. Moreover, RFID-based systems allow the effective management of items tagged by RFID tags, especially for supply chain management or object tracking. In this paper we introduce SMART (Stream Monitoring enterprise Activities by RFID Tags), a system based on an outlier template definition for detecting anomalies in RFID streams. We describe the features of SMART and its application to a real-life scenario that shows the effectiveness of the proposed method for enterprise management. Moreover, we describe an outlier detection approach we defined and effectively exploited in SMART.

The IS-BioBank project: a framework for biological data normalization, interoperability, and mining for cancer microenvironment analysis

Michelangelo Ceci, Pietro Hiram Guzzi, Elio Masciari, Mauro Coluccia, Federica Mandreoli, Massimo Mecella, Fabio Fumarola, Riccardo Martoglia, Wilma Penzo
Journal Paper: ACM SIGHIT Record, Volume 2, Issue 2, September 2012, Pages 16-21

Abstract

Advances in high-throughput technologies have made it possible to investigate human cells, both healthy and diseased, at different levels. Consequently, this has made possible the discovery of new biological and biomedical data and the proliferation of a large number of databases. In this paper, we describe the IS-BioBank (Integrated Semantic Biological Data Bank) proposal. It consists of the realization of a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. In this framework, a key role has been played by the Connectivity Map, a databank which relates diseases, physiological processes, and the action of drugs. The system will be used in a pilot study on Multiple Myeloma (MM).

A Fuzzy Logic Approach to Wrapping PDF Documents

Sergio Flesca, Elio Masciari, Andrea Tagarelli
Journal Paper: IEEE Transactions on Knowledge and Data Engineering, Volume 23, Issue 12

Abstract

PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the “uncertainty” which is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure to groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation working in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains.
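
How a fuzzy condition over spatial and content predicates might be evaluated for a pair of tokens can be sketched as follows. This is an illustration under stated assumptions, not the wrapper language of the paper: the `Token` type, the predicates `same_line` and `is_number`, the linear membership function, and the use of the minimum as fuzzy conjunction are all hypothetical choices.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float  # horizontal position on the page
    y: float  # baseline position on the page


def same_line(a, b, tolerance=10.0):
    """Fuzzy spatial predicate: 1.0 when the baselines coincide,
    decreasing linearly to 0.0 at a vertical distance of `tolerance`."""
    return max(0.0, 1.0 - abs(a.y - b.y) / tolerance)


def is_number(token):
    """Content predicate, expressed as a degree in {0.0, 1.0}."""
    return 1.0 if re.fullmatch(r"\d+(\.\d+)?", token.text) else 0.0


def label_value_degree(label, value):
    """Degree to which `value` is a numeric field on the same line as
    `label`; conjunction is taken as the minimum (an assumption)."""
    return min(same_line(label, value), is_number(value))


label = Token("Total:", 50.0, 100.0)
value = Token("42.50", 200.0, 102.0)
degree = label_value_degree(label, value)
```

A wrapper in this spirit would accept a token grouping when its degree exceeds a chosen threshold, which lets slightly misaligned tokens still be grouped.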

On the minimization of XPath queries

Sergio Flesca, Filippo Furfaro, Elio Masciari
Journal Paper: Journal of the ACM (JACM), Volume 55, Issue 1, February 2008

Abstract

XPath expressions define navigational queries on XML data and are issued on XML documents to select sets of element nodes. Due to the wide use of XPath, which is embedded into several languages for querying and manipulating XML data, the problem of efficiently answering XPath queries has received increasing attention from the research community. As the efficiency of computing the answer of an XPath query depends on its size, replacing XPath expressions with equivalent ones having the smallest size is a crucial issue in this direction. This article investigates the minimization problem for a wide fragment of XPath (namely XP[✶]), where the use of the most common operators (child, descendant, wildcard and branching) is allowed with some syntactic restrictions. The examined fragment consists of expressions which have not been specifically studied in the relational setting before: neither are they mere conjunctive queries (as the combination of “//” and “*” enables an implicit form of disjunction to be expressed) nor do they coincide with disjunctive ones (as the latter are more expressive). Three main contributions are provided. The “global minimality” property is shown to hold: the minimization of a given XPath expression can be accomplished by removing pieces of the expression, without having to re-formulate it (as for “general” disjunctive queries). Then, the complexity of the minimization problem is characterized, showing that it is the same as the containment problem. Finally, specific forms of XPath expressions are identified, which can be minimized in polynomial time.

Mining categories for emails via clustering and pattern discovery

Giuseppe Manco, Elio Masciari, Andrea Tagarelli
Journal Paper: Journal of Intelligent Information Systems, April 2008, Volume 30, Issue 2, pp 153-181

Abstract

The continuous exchange of information by means of the popular email service has raised the problem of managing the huge amounts of messages received from users in an effective and efficient way. We deal with the problem of email classification by conceiving suitable strategies for: (1) organizing messages into homogeneous groups, (2) redirecting further incoming messages according to an initial organization, and (3) building reliable descriptions of the message groups discovered. We propose a unified framework for handling and classifying email messages. In our framework, messages sharing similar features are clustered in a folder organization. Clustering and pattern discovery techniques for mining structured and unstructured information from email messages are the basis of an overall process of folder creation/maintenance and email redirection. Pattern discovery is also exploited for generating suitable cluster descriptions that play a leading role in cluster updating. Experimental evaluation performed on several personal mailboxes shows the effectiveness of our approach.

Exploiting structural similarity for effective Web information extraction

Sergio Flesca, Giuseppe Manco, Elio Masciari, Luigi Pontieri, Andrea Pugliese
Journal Paper: Data & Knowledge Engineering, Volume 60, Issue 1, January 2007, Pages 222–234

Abstract

In this paper, we propose a classification technique for Web pages, based on the detection of structural similarities among semistructured documents, and devise an architecture exploiting such technique for the purpose of information extraction. The proposal significantly differs from standard methods based on graph-matching algorithms, and is based on the idea of representing the structure of a document as a time series in which each occurrence of a tag corresponds to an impulse. The degree of similarity between documents is then stated by analyzing the frequencies of the corresponding Fourier transform. Experiments on real data show the effectiveness of the proposed technique.

Fast Detection of XML Structural Similarity

Sergio Flesca, Giuseppe Manco, Elio Masciari, Luigi Pontieri, Andrea Pugliese
Journal Paper: IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 2, February 2005, Pages 160-175

Abstract

Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be compared as to their structural similarity, in order to group them into clusters so that different storage, retrieval, and processing techniques can be effectively exploited. In this scenario, an efficient and effective similarity function is the key of a successful data management process. We present an approach for detecting structural similarity between XML documents which significantly differs from standard methods based on graph-matching algorithms, and allows a significant reduction of the required computation costs. Our proposal roughly consists of linearizing the structure of each XML document, by representing it as a numerical sequence and, then, comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Moreover, the theory of Discrete Fourier Transform is exploited to effectively and efficiently compare the encoded documents (i.e., signals) in the domain of frequencies. Experimental results reveal the effectiveness of the approach, also in comparison with standard methods.
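
The pipeline described above (linearize the tag structure into a numerical sequence, then compare documents in the frequency domain) can be sketched in a few lines. The sketch makes several assumptions that are not taken from the paper: tags are encoded as signed integers in order of first appearance, sequences of different length are zero-padded, a naive O(n²) DFT stands in for an FFT, and the distance is the Euclidean distance between magnitude spectra. The names `encode_tags`, `dft_magnitudes` and `structural_distance` are hypothetical.

```python
import cmath
import math
import re

TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>")


def encode_tags(xml_text, codes):
    """Linearize a document: each tag becomes an integer code, negated
    for closing tags. `codes` is shared across documents."""
    signal = []
    for closing, name in TAG.findall(xml_text):
        code = codes.setdefault(name, len(codes) + 1)
        signal.append(-code if closing else code)
    return signal


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point discrete Fourier transform of the
    zero-padded signal (naive implementation)."""
    padded = signal + [0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(doc_a, doc_b):
    """Euclidean distance between the magnitude spectra of two documents."""
    codes = {}
    sig_a = encode_tags(doc_a, codes)
    sig_b = encode_tags(doc_b, codes)
    n = max(len(sig_a), len(sig_b), 1)
    mag_a = dft_magnitudes(sig_a, n)
    mag_b = dft_magnitudes(sig_b, n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mag_a, mag_b)))


same = structural_distance("<a><b>x</b><b>y</b></a>",
                           "<a><b>hello</b><b>world</b></a>")
different = structural_distance("<a><b>x</b><b>y</b></a>",
                                "<a><c><d>z</d></c></a>")
```

Because only tags are encoded, two documents with identical structure and different text content yield identical signals, so their distance is zero; documents with different nesting yield a positive distance.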

Web wrapper induction: a brief survey

Sergio Flesca, Giuseppe Manco, Elio Masciari, Eugenio Rende, Andrea Tagarelli
Journal Paper: AI Communications, Volume 17, Number 2/2004, Pages 57-61

Abstract

Nowadays several companies use the information available on the Web for a number of purposes. However, since most of this information is only available as HTML documents, several techniques that allow information from the Web to be automatically extracted have recently been defined. In this paper we review the main techniques and tools for extracting information available on the Web, devising a taxonomy of existing systems. In particular we emphasize the advantages and drawbacks of the techniques analyzed from a user point of view.

Efficient and effective Web change detection

Sergio Flesca, Elio Masciari
Journal Paper: Data & Knowledge Engineering, Volume 46, Issue 2, August 2003, Pages 203–224

Abstract

In this paper we present a new technique for detecting changes in Web documents. The technique is based on a new method to measure the similarity of two documents, which represent the current and the previous version of the monitored page. The technique has been effectively used to discover changes in selected portions of the original document. The proposed technique has been implemented in the CMW system, which provides a change monitoring service on the Web. The main features of CMW are the detection of changes on selected portions of Web documents and the possibility to express complex queries on the changed information. For instance, a query can check whether the value of a given stock has increased by more than 10%. Several tests on stock exchange and auction Web pages proved the effectiveness of the proposed approach.

Combining inductive and deductive tools for data analysis

Sergio Greco, Elio Masciari, Luigi Pontieri
Journal Paper: AI Communications, Volume 14, Number 2/2001, Pages 69-82

Abstract

In this paper we propose the combined use of different methods to improve the data analysis process. This is obtained by combining inductive and deductive techniques. We also use different inductive techniques, such as clustering algorithms to derive data partitions and decision tree induction to characterize classes in terms of logical rules. Inductive techniques are used for generating hypotheses from data, whereas deductive techniques are used to derive knowledge and to verify hypotheses. In order to guide users in the analysis process, we have developed a system which integrates deductive tools and data mining tools such as classification algorithms, feature selection algorithms, visualization tools and tools to manipulate data sets easily. The system developed is currently used in a large project whose aim is the integration of information sources containing data concerning the socio-economic aspects of Calabria and its subsequent analysis. Several experiments on the socio-economic data have shown that the combined use of different techniques improves both the comprehensibility and the accuracy of models.

A Big Data Approach For Querying Data in EHR Systems

Nunziato Cassavia, Mario Ciampi, Giuseppe De Pietro, Elio Masciari
Conference Papers: IDEAS '16, Proceedings of the 20th International Database Engineering & Applications Symposium, Pages 212-217

Abstract

Information management in healthcare is nowadays experiencing a great revolution. After the impressive progress in digitizing medical data by private organizations, the federal government and other public stakeholders have also started to make use of healthcare data for analysis purposes in order to extract actionable knowledge. In this paper, we propose an architecture for supporting interoperability in healthcare systems by exploiting Big Data techniques. In particular, we describe a proposal based on Big Data techniques to implement a nationwide system able to improve EHR data access efficiency and reduce costs.

How, Who and When: Enhancing Business Process Warehouses By Graph Based Queries

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Elio Masciari, Luigi Pontieri, Chiara Pulice
Conference Papers: IDEAS '16, Proceedings of the 20th International Database Engineering & Applications Symposium, Pages 242-247

Abstract

Log analysis and querying have recently received renewed interest from the research community, as the effective understanding of process behavior is crucial for improving business process management. Indeed, currently available log querying tools are not completely satisfactory, especially from the viewpoint of ease of use. As a matter of fact, there is no framework which meets the requirements of ease of use, flexibility and efficiency of query evaluation. In this paper, we propose a framework for graphical querying of (process) log data that makes the log analysis task easy and efficient, adopting a very general model of process log data which guarantees a high level of flexibility. We implemented our framework by using a flexible storage architecture and a user-friendly data analysis interface, based on an intuitive and yet expressive graph-based query language. Experiments performed on real data confirm the validity of the approach.

Enhancing EHR Systems Interoperability by Big Data Techniques

Nunziato Cassavia, Mario Ciampi, Giuseppe De Pietro, Elio Masciari
Conference Papers: ITBAM 2016, Information Technology in Bio- and Medical Informatics, Volume 9832 of the series Lecture Notes in Computer Science, Pages 34-48

Abstract

Information management in healthcare is nowadays experiencing a great revolution. After the impressive progress in digitizing medical data by private organizations, the federal government and other public stakeholders have also started to make use of healthcare data for analysis purposes in order to extract actionable knowledge. In this paper, we propose an architecture for supporting interoperability in healthcare systems by exploiting Big Data techniques. In particular, we describe a proposal based on Big Data techniques to implement a nationwide system able to improve EHR data access efficiency and reduce costs.

A Framework Enhancing the User Search Activity Through Data Posting

Nunziato Cassavia, Elio Masciari, Chiara Pulice, Domenico Saccà
Conference Papers: RuleML 2016, Rule Technologies. Research, Tools, and Applications, Volume 9718 of the series Lecture Notes in Computer Science, Pages 373-377

Abstract

Due to the increasing availability of huge amounts of data, traditional data management techniques prove inadequate in many real-life scenarios. Furthermore, the heterogeneity and high speed of these data require suitable data storage and management tools to be designed from scratch. In this paper, we describe a framework tailored for analyzing user interactions with intelligent systems while seeking some domain-specific information (e.g., choosing a good restaurant in a visited area). The framework enhances the user's quest for information by performing a data exchange activity (called data posting) which enriches the information sources with additional background information and knowledge derived from experiences and behavioral properties of domain experts and users.

Enhanced User Search Activity by Big Data Tools

Nunziato Cassavia, Elio Masciari, Chiara Pulice, Domenico Saccà
Conference Papers: SEBD 2016, Proceedings of the 24th Italian Symposium on Advanced Database Systems, Pages 206-213

Efficient Analysis of Process Logs

Bettina Fazzinga, Filippo Furfaro, Elio Masciari, Giuseppe Massimiliano Mazzeo
Conference Papers: SEBD 2016, Proceedings of the 24th Italian Symposium on Advanced Database Systems, Pages 214-221

Privacy or Security?: Take A Look And Then Decide

Bettina Fazzinga, Filippo Furfaro, Elio Masciari, Giuseppe Massimiliano Mazzeo
Conference Papers: SSDBM 2016, Proceedings of the 28th International Conference on Scientific and Statistical Database Management, Article No. 23

Abstract

The Big Data paradigm is currently the leading paradigm for data production and management. As a matter of fact, new information is generated at high rates in specialized fields (e.g., the cybersecurity scenario). As a result, the events to be studied may occur at rates that are too fast to be effectively analyzed in real time. For example, in order to detect possible security threats, millions of records in a high-speed flow stream must be screened. To ameliorate this problem, a viable solution is the use of data compression for reducing the amount of data to be analyzed. In this paper we propose the use of privacy-preserving histograms, which provide approximate answers to 'safe' queries, for analyzing data in the cybersecurity scenario without compromising individuals' privacy, and we describe our system, which has been used in a real-life scenario.

Surfing Big Data Warehouses for Effective Information Gathering

Nunziato Cassavia, Pietro Dicosta, Elio Masciari, Domenico Saccà
Conference Papers: DATA 2015, Proceedings of 4th International Conference on Data Management Technologies and Applications, Pages 373-377

Abstract

Due to the emerging Big Data paradigm, traditional data management techniques prove inadequate in many real-life scenarios. In particular, OLAP techniques require substantial changes in order to offer useful analysis, due to the huge amount of data to be analyzed and their velocity and variety. In this paper, we describe an approach for dynamic Big Data searching that, based on data collected by a suitable storage system, enriches data in order to guide users through data exploration in an efficient and effective way.

Big Data Techniques For Supporting Accurate Predictions of Energy Production From Renewable Sources

Michelangelo Ceci, Roberto Corizzo, Fabio Fumarola, Michele Ianni, Donato Malerba, Gaspare Maria, Elio Masciari, Marco Oliverio, Aleksandra Rashkovska
Conference Papers: IDEAS 2015, Proceedings of the 19th International Database Engineering & Applications Symposium, Pages 62-71

Abstract

Predicting the output power of renewable energy production plants distributed over a wide territory is a really valuable goal, both for marketing and energy management purposes. The Vi-POC (Virtual Power Operating Center) project aims at designing and implementing a prototype which is able to achieve this goal. Due to the heterogeneity and the high volume of data, it is necessary to exploit suitable Big Data analysis techniques in order to provide quick and secure access to data, which cannot be obtained with traditional approaches to data management. In this paper, we describe Vi-POC, a distributed system for storing huge amounts of data gathered from energy production plants and weather prediction services. We use HBase over the Hadoop framework on a cluster of commodity servers in order to provide a system that can be used as a basis for running machine learning algorithms. Indeed, we perform one-day-ahead forecasts of PV energy production based on Artificial Neural Networks in two learning settings, that is, structured and non-structured output prediction. Preliminary experimental results confirm the validity of the approach, also when compared with a baseline approach.

Improving tourist experience by Big Data tools

Nunziato Cassavia, Pietro Dicosta, Elio Masciari, Domenico Saccà
Conference Papers: HPCS 2015, 2015 International Conference on High Performance Computing & Simulation

Abstract

Due to the emerging Big Data applications, traditional data management techniques prove inadequate in many real-life scenarios. In particular, OLAP techniques require substantial changes in order to offer useful analysis, due to the huge amount of data to be analyzed and their velocity and variety. In this paper, we describe an approach for dynamic Big Data searching that, based on data collected by a suitable storage system, enriches data in order to guide users through data exploration in an efficient and effective way.

A Framework Supporting the Analysis of Process Logs Stored in Either Relational or NoSQL DBMSs

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Elio Masciari, Luigi Pontieri
Conference Papers: ISMIS 2015, Foundations of Intelligent Systems, Volume 9384 of the series Lecture Notes in Computer Science, Pages 52-58

Abstract

The issue of devising efficient and effective solutions for supporting the analysis of process logs has recently received great attention from the research community, as effectively accomplishing any business process management task requires understanding the behavior of the processes. In this paper, we propose a new framework supporting the analysis of process logs, exhibiting two main features: a flexible data model (enabling an exhaustive representation of the facets of the business processes that are typically of interest for the analysis) and a graphical query language, providing a user-friendly tool for easily expressing both selection and aggregate queries over the business processes and the activities they are composed of. The framework can be easily and efficiently implemented by leveraging either “traditional” relational DBMSs or “innovative” NoSQL DBMSs, such as Neo4J.

A Probabilistic Unified Framework for Event Abstraction and Process Detection from Log Data

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Elio Masciari, Luigi Pontieri
Conference Papers: On the Move to Meaningful Internet Systems: OTM 2015 Conferences, Volume 9415 of the series Lecture Notes in Computer Science, Pages 320-328

Abstract

We consider the scenario where the executions of different business processes are traced into a log, where each trace describes a process instance as a sequence of low-level events (representing basic kinds of operations). In this context, we address a novel problem: given a description of the processes’ behaviors in terms of high-level activities (instead of low-level events), and in the presence of uncertainty in the mapping between events and activities, find all the interpretations of each trace Φ. Specifically, an interpretation is a pair ⟨σ,W⟩ that provides a two-level “explanation” for Φ: σ is a sequence of activities that may have triggered the events in Φ, and W is a process whose model admits σ. To solve this problem, we propose a probabilistic framework representing “consistent” Φ’s interpretations, where each interpretation is associated with a probability score.
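
The notion of an interpretation ⟨σ,W⟩ with an associated probability score can be made concrete with a small enumeration sketch. It is not the framework of the paper and rests on strong simplifying assumptions: each event maps independently to exactly one activity with a given probability, each process model is given as an explicit set of admitted activity sequences, and scores are renormalized over the consistent interpretations. The function name `interpretations` and the sample data are hypothetical.

```python
from itertools import product


def interpretations(trace, mapping, models):
    """Enumerate (sigma, W, probability) triples for a trace.

    trace   : list of low-level events
    mapping : event -> {activity: probability}
    models  : process name -> set of admitted activity sequences (tuples)
    """
    results = []
    for combo in product(*(mapping[event].items() for event in trace)):
        sigma = tuple(activity for activity, _ in combo)
        score = 1.0
        for _, prob in combo:
            score *= prob  # independence between events is assumed
        for name, admitted in models.items():
            if sigma in admitted:
                results.append((sigma, name, score))
    total = sum(score for _, _, score in results)
    if total == 0:
        return []
    # Renormalize so that the consistent interpretations sum to one.
    return [(sigma, name, score / total) for sigma, name, score in results]


mapping = {
    "e1": {"A": 0.7, "B": 0.3},
    "e2": {"C": 0.6, "D": 0.4},
}
models = {
    "W1": {("A", "C")},
    "W2": {("B", "C"), ("B", "D")},
}
result = interpretations(["e1", "e2"], mapping, models)
```

In this example the sequence (A, D) is admitted by no model, so it is discarded and its probability mass is redistributed over the three remaining interpretations. Exhaustive enumeration is exponential in the trace length and serves only to show what the output looks like.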

VIPOC Project Research Summary (Discussion Paper)

Michelangelo Ceci, Roberto Corizzo, Fabio Fumarola, Michele Ianni, Donato Malerba, Gaspare Maria, Elio Masciari, Marco Oliverio, Aleksandra Rashkovska
Conference Papers: SEBD 2015, Proceedings of the 23rd Italian Symposium on Advanced Database Systems, Pages 208-215

Hierarchical Big Data Clustering (Discussion Paper)

Michele Ianni, Elio Masciari, Giuseppe Massimiliano Mazzeo, Marco Oliverio, Carlo Zaniolo
Conference Papers: SEBD 2015, Proceedings of the 23rd Italian Symposium on Advanced Database Systems, Pages 224-231

Effective Big Data Warehouses Surfing (Discussion Paper)

Nunziato Cassavia, Pietro Dicosta, Elio Masciari, Domenico Saccà
Conference Papers: SEBD 2015, Proceedings of the 23rd Italian Symposium on Advanced Database Systems, Pages 264-271

A compression-based framework for the efficient analysis of business process logs

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Elio Masciari, Luigi Pontieri
Conference PapersSSDBM '15, Proceedings of the 27th International Conference on Scientific and Statistical Database Management, Article No. 6

Abstract

The increasing availability of large process log repositories calls for efficient solutions for their analysis. In this regard, a novel specialized compression technique for process logs is proposed, that builds a synopsis supporting a fast estimation of aggregate queries, which are of crucial importance in exploratory and high-level analysis tasks. The synopsis is constructed by progressively merging the original log-tuples, which represent single activity executions within the process instances, into aggregate tuples, summarizing sets of activity executions. The compression strategy is guided by a heuristic aiming at limiting the loss of information caused by summarization, while guaranteeing that no information is lost on the set of activities performed within the process instances and on the order among their executions. The selection conditions in an aggregate query are specified in terms of a graph pattern, that allows precedence relationships over activity executions to be expressed, along with conditions on their starting times, durations, and executors. The efficacy of the compression technique, in terms of capability of reducing the size of the log and of accuracy of the estimates retrieved from the synopsis, has been experimentally validated.

Data Preparation for Tourist Data Big Data Warehousing

Nunziato Cassavia, Pietro Dicosta, Elio Masciari, Domenico Saccà
Conference PapersDATA 2014: 419-426

Abstract

The pervasive diffusion of new generation devices like smartphones and tablets, along with the widespread use of social networks, generates massive data flows containing heterogeneous information produced at different rates and in different formats. These data are referred to as Big Data and require new storage and analysis approaches. In this paper we describe a system for dealing with massive tourism flows that we exploited for the analysis of tourist behavior in Italy. We defined a framework that exploits a NoSQL approach for data management and MapReduce for improving the analysis of the data gathered from different sources.

Innovative power operating center management exploiting big data techniques

Michelangelo Ceci, Nunziato Cassavia, Roberto Corizzo, Pietro Dicosta, Donato Malerba, Gaspare Maria, Elio Masciari, Camillo Pastura
Conference PapersIDEAS 2014, Proceedings of the 18th International Database Engineering & Applications Symposium, Pages 326-329

Abstract

The problem of accurately predicting the energy production from renewable sources has recently received increasing attention from both the industrial and the research communities. It presents several challenges, such as coping with the rate at which data are provided by sensors, the heterogeneity of the data collected, and power plant efficiency, as well as uncontrollable factors, such as weather conditions and user consumption profiles. In this paper we describe Vi-POC (Virtual Power Operating Center), a project conceived to assist energy producers and decision makers in the energy market, and how we address the challenges posed by this specific application. The solutions we propose have roots both in big data management and in stream data mining.

A Discussion on the Biological Relevance of Clustering Results

Pietro Hiram Guzzi, Elio Masciari, Giuseppe Massimiliano Mazzeo, Carlo Zaniolo
Conference PapersInformation Technology in Bio- and Medical Informatics, Lecture Notes in Computer Science Volume 8649, 2014, pp 30-44

Abstract

The recent advances in genomic technologies and the availability of large-scale datasets call for the development of advanced data analysis techniques, such as data mining and statistical analysis to cite a few. A main goal in understanding cell mechanisms is to explain the relationship among genes and related molecular processes through the combined use of technological platforms and bioinformatics analysis. High throughput platforms, such as microarrays, enable the investigation of the whole genome in a single experiment. Among the mining techniques proposed so far, cluster analysis has become a standard method for the analysis of microarray expression data. It can be used both for initial screening of patients and for extraction of disease molecular signatures. Moreover, clustering can be profitably exploited to characterize genes of unknown function and uncover patterns that can be interpreted as indications of the status of cellular processes. Finally, clustering biological data would be useful not only for exploring the data but also for discovering implicit links between the objects. Indeed, a key feature lacking in many proposed approaches is the biological interpretation of the obtained results. In this paper, we discuss this issue by analysing the results obtained by several clustering algorithms w.r.t. their biological relevance.

Effective Analysis Of Massive Tourist Information Flows

Nunziato Cassavia, Pietro Dicosta, Elio Masciari, Domenico Saccà
Conference PapersSEBD 2014: 345-352

Big Data Techniques For Renewable Energy Market

Michelangelo Ceci, Nunziato Cassavia, Roberto Corizzo, Pietro Dicosta, Donato Malerba, Gaspare Maria, Elio Masciari, Camillo Pastura
Conference PapersSEBD 2014: 369-377

Sequential pattern mining from trajectory data

Elio Masciari, Shi Gao, Carlo Zaniolo
Conference PapersIDEAS 2013, Proceedings of the 17th International Database Engineering & Applications Symposium, Pages 162-167

Abstract

In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks, and supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. In order to make counting really efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can then be used to expedite the computation.
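The abstract does not spell out the prime number encoding, so the following is only a minimal sketch of one common scheme, under stated assumptions: every region gets a distinct prime, a trajectory visits each region at most once, and every prime is larger than the trajectory length. The names `crt`, `encode`, `position` and `prime_of` are hypothetical and are not taken from the paper.

```python
from math import prod


def crt(residues, moduli):
    """Solve x = r_i (mod m_i) for pairwise coprime moduli (Chinese remainder theorem)."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse of partial modulo m (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % total


def encode(trajectory, prime_of):
    """Encode an ordered list of distinct regions as (product of primes, order number).

    Assumption: each prime exceeds len(trajectory), so positions survive the modulo.
    """
    primes = [prime_of[region] for region in trajectory]
    positions = list(range(1, len(trajectory) + 1))
    return prod(primes), crt(positions, primes)


def position(code, region, prime_of):
    """Return the 1-based position of region in the encoded trajectory, or None."""
    product, order = code
    p = prime_of[region]
    if product % p != 0:
        return None  # the region's prime does not divide the product: not visited
    return order % p


# Hypothetical region-to-prime assignment.
prime_of = {"A": 5, "B": 7, "C": 11, "D": 13}
code = encode(["B", "A", "C"], prime_of)
print(code, position(code, "A", prime_of), position(code, "D", prime_of))
```

With this assignment the trajectory B, A, C is stored as the pair (385, 267): a divisibility test answers membership and a single modulo recovers the visiting order.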

A New, Fast and Accurate Algorithm for Hierarchical Clustering on Euclidean Distances

Elio Masciari, Giuseppe Massimiliano Mazzeo, Carlo Zaniolo
Conference PapersPAKDD (2) 2013: 111-122

Abstract

A simple hierarchical clustering algorithm called CLUBS (for CLustering Using Binary Splitting) is proposed. CLUBS is faster and more accurate than existing algorithms, including k-means and its recently proposed refinements. The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion possessing unique analytical properties that we exploit to achieve a very fast computation. CLUBS derives good clusters without requiring input from users, and it is robust and impervious to noise, while providing better speed and accuracy than methods, such as BIRCH, that are endowed with the same critical properties.
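As a rough illustration of a single binary split driven by a least-squares criterion, the sketch below picks the cut point along one dimension that minimises the summed squared distance of the points to the means of the two resulting groups. It is not the CLUBS algorithm itself (which combines a divisive and an agglomerative phase and exploits analytical properties not reproduced here); `sse` and `best_binary_split` are hypothetical names.

```python
def sse(values):
    """Sum of squared distances of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(values):
    """Return (cost, left, right) for the cut of the sorted values with minimal total SSE.

    Returns None when fewer than two values are given.
    """
    ordered = sorted(values)
    best = None
    for i in range(1, len(ordered)):
        left, right = ordered[:i], ordered[i:]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, left, right)
    return best


cost, left, right = best_binary_split([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print(left, right)
```

This brute-force version recomputes the cost for every cut and is meant only to make the criterion concrete; a practical implementation would update the sums incrementally.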

Trajectory Data Pattern Mining

Elio Masciari, Shi Gao, Carlo Zaniolo
Conference PapersNew Frontiers in Mining Complex Patterns, Lecture Notes in Computer Science 2014, pp 51-66

Abstract

In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks, and supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. In order to make counting really efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can then be used to expedite the computation.

Warehousing and querying trajectory data streams with error estimation

Elio Masciari
Conference PapersDOLAP 2012 Proceedings of the fifteenth international workshop on Data warehousing and OLAP, Pages 113-120

Abstract

In this paper, we address the problem of warehousing and querying trajectory data streams, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end to end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

XML class outlier detection

Giuseppe Manco, Elio Masciari
Conference PapersIDEAS 2012 Proceedings of the 16th International Database Engineering & Applications Symposium, Pages 155-164

Abstract

XML (eXtensible Markup Language) has become in recent years the new standard for data representation and exchange on the WWW. This has resulted in a great need for data cleaning techniques in order to identify outlying data. In this paper, we present a technique for outlier detection that singles out anomalies with respect to a relevant group of objects. We exploit a suitable encoding of XML documents as signals of fixed frequency that can be transformed using Fourier Transforms. Outliers are identified by simply looking at the signal spectra. The results show the effectiveness of our approach.

Efficient MD5 hash reversing using D.E.A. framework for sharing computational resources

Nunzio Cassavia, Elio Masciari
Conference PapersIDEAS 2012 Proceedings of the 16th International Database Engineering & Applications Symposium, Pages 211-215

Abstract

The recent advances in computing technology have led to the availability of a huge number of computational resources that can be easily connected through network infrastructures. However, only a small fraction of the available computing power is fully exploited for performing effective computation of user tasks. At the same time, several research projects require a lot of computing power to reach their goals, but they usually lack adequate resources, which makes the project activities hard to complete. In this paper we describe D.E.A. (Distributed Execution Agent), a framework for sharing computational resources. We exploit the D.E.A. framework to tackle the computationally demanding problem of MD5 hash reversing. We performed several experiments that confirmed the validity of our approach.

Toward a Semantic Framework for the Querying, Mining and Visualization of Cancer Microenvironment Data

Michelangelo Ceci, Fabio Fumarola, Pietro Hiram Guzzi, Federica Mandreoli, Riccardo Martoglia, Elio Masciari, Massimo Mecella, Wilma Penzo
Conference PapersInformation Technology in Bio- and Medical Informatics, Lecture Notes in Computer Science Volume 7451, 2012, pp 109-123

Abstract

Over the last decade, the advances in the high-throughput omic technologies have given the possibility to profile tumor cells at different levels, fostering the discovery of new biological data and the proliferation of a large number of bio-technological databases. In this paper we describe a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. The system will be used in a pilot study on the Multiple Myeloma (MM).

Effective Detection of XML Outliers

Alfredo Cuzzocrea, Giuseppe Manco, Elio Masciari
Conference PapersKES 2012: 1221-1232

Abstract

XML (eXtensible Markup Language) has become in recent years the new standard for data representation and exchange on the WWW. This has resulted in a great need for data cleaning techniques in order to identify outlying data. In this paper, we present a technique for outlier detection that singles out anomalies with respect to a relevant group of objects. We exploit a suitable encoding of XML documents as signals of fixed frequency that can be transformed using Fourier Transforms. Outliers are identified by simply looking at the signal spectra. The results show the effectiveness of our approach.

Effectively Grouping Trajectory Streams

Gianni Costa, Giuseppe Manco, Elio Masciari
Conference PapersNew Frontiers in Mining Complex Patterns, Lecture Notes in Computer Science Volume 7765, 2013, pp 94-108

Abstract

Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is a challenging problem, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data streams pose interesting challenges for their proper representation, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data streams clustering, which proved really intriguing as we deal with a kind of data (trajectories) for which the order of elements is relevant. We propose a complete framework starting from the data preparation task that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed technique.

A Framework for Biological Data Normalization, Interoperability, and Mining for Cancer Microenvironment Analysis

Michelangelo Ceci, Mauro Coluccia, Fabio Fumarola, Pietro Hiram Guzzi, Federica Mandreoli, Riccardo Martoglia, Elio Masciari, Massimo Mecella, Wilma Penzo
Conference PapersSEBD 2012: 67-74

Abstract

Over the last decade, the advances in the high-throughput omic technologies have given the possibility to profile tumor cells at different levels, fostering the discovery of new biological data and the proliferation of a large number of bio-technological databases. In this paper we describe a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. In this framework, a key role is played by the Connectivity Map, a databank which relates diseases, physiological processes, and the action of drugs. The system will be used in a pilot study on the Multiple Myeloma (MM).

Efficient and Effective Query Answering for Trajectory Cuboids

Elio Masciari
Conference PapersFlexible Query Answering Systems, Lecture Notes in Computer Science Volume 7022, 2011, pp 270-281

Abstract

Trajectory data streams are huge amounts of data pertaining to time and position of moving objects, continuously generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of On Line Analytical Processing of trajectory data streams, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end to end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

Trajectory Outlier Detection Using an Analytical Approach

Elio Masciari
Conference Papers2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pp: 377-384

Abstract

Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data outlier detection, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a complete framework starting from the data preparation task that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed technique.

Query answering on trajectory cuboids using prime numbers encodings

Elio Masciari
Conference PapersIDEAS 2011 Proceedings of the 15th Symposium on International Database Engineering & Applications, Pages 214-218

Abstract

Trajectory data streams are huge amounts of data pertaining to time and position of moving objects, continuously generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of On Line Analytical Processing of trajectory data streams, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end to end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

Non-separable Transforms for Clustering Trajectories

Alfredo Cuzzocrea, Elio Masciari
Conference PapersKnowledge-Based and Intelligent Information and Engineering Systems, Lecture Notes in Computer Science Volume 6882, 2011, pp 571-580

Abstract

Trajectory data refer to time and position of moving objects generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. In this paper, we address the problem of trajectory data streams clustering, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a complete framework starting from the data preparation task that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

A Fast and Accurate Algorithm for Hierarchical Clustering on Euclidean Distances (Extended Abstract)

Elio Masciari, Giuseppe M. Mazzeo, Carlo Zaniolo
Conference PapersSEBD 2011: 41-48

Fast and Accurate Trajectory Streams Clustering

Elio Masciari
Conference PapersScientific and Statistical Database Management, Lecture Notes in Computer Science Volume 6809, 2011, pp 592-593

Abstract

Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data streams clustering, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant.

Effectively Monitoring RFID Based Systems

Fabrizio Angiulli, Elio Masciari
Conference PapersAdvances in Databases and Information Systems, Lecture Notes in Computer Science Volume 6295, 2010, pp 31-4

Abstract

Data streams are potentially infinite sources of data that flow continuously while monitoring a physical phenomenon, like temperature levels, or other kinds of human activities, such as clickstreams, telephone call records, and so on. Radio Frequency Identification (RFID) technology has led in recent years to the generation of huge streams of data. Moreover, RFID based systems allow the effective management of items tagged by RFID tags, especially for supply chain management or object tracking. In this paper we introduce SMART (Simple Monitoring enterprise Activities by RFID Tags), a system based on outlier template definition for detecting anomalies in RFID streams. We describe the features of SMART and its application to a real life scenario that shows the effectiveness of the proposed method for enterprise management.

Lifting Trajectories for Effective Clustering

Elio Masciari
Conference Papers2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pp: 256-259

Abstract

The increasing availability of huge amounts of data pertaining to time and position of moving objects generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks) leads to large spatial data collections. Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, and supply chain management. Moreover, spatial data pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory clustering, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a complete framework starting from the data preparation task that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

Efficient and Effective RFID Data Warehousing (Extended Abstract)

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Elio Masciari
Conference PapersSEBD 2010: 274-281

Trajectory Clustering via Effective Partitioning

Elio Masciari
Conference PapersFlexible Query Answering Systems, Lecture Notes in Computer Science Volume 5822, 2009, pp 358-370

Abstract

The increasing availability of huge amounts of data pertaining to time and positions generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks) leads to large spatial data collections. Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, supply chain management. In this paper, we address the problem of clustering spatial trajectories. In the context of trajectory data, clustering is really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a novel approach based on a suitable regioning strategy and an efficient and effective clustering technique based on a proper metric. Finally, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.

A Framework for Trajectory Clustering

Elio Masciari
Conference PapersGeoSensor Networks, Lecture Notes in Computer Science Volume 5659, 2009, pp 102-111

Abstract

The increasing availability of huge amounts of “thin” data, i.e. data pertaining to time and positions generated by different sources with a wide variety of technologies (e.g., RFID tags, GPS, GSM networks) leads to large spatio-temporal data collections. Mining such amounts of data is challenging, since the possibility of extracting useful information from this particular type of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. In this paper, we address the issue of clustering spatial trajectories. In the context of trajectory data, this problem is even more challenging than in classical transactional relationships, as here we deal with data (trajectories) in which the order of items is relevant. We propose a novel approach based on a suitable regioning strategy and an efficient clustering technique based on edit distance. Experiments performed on real world datasets have confirmed the efficiency and effectiveness of the proposed techniques.
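To make the edit distance ingredient concrete, here is a minimal sketch assuming that the regioning step has already turned each trajectory into a sequence of region labels. It uses the standard Levenshtein distance with unit costs; the cost model actually used in the paper is not stated in the abstract, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free when the labels match
            ))
        prev = curr
    return prev[-1]


# Trajectories as sequences of hypothetical region labels.
t1 = ["R1", "R2", "R3", "R4"]
t2 = ["R1", "R3", "R4"]
print(edit_distance(t1, t2))
```

Because the distance is sensitive to the order of the labels, two trajectories that cross the same regions in a different order are not treated as identical, which is the property the abstract stresses.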

A Complete Framework for Clustering Trajectories

Elio Masciari
Conference Papers2009 21st IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2009), pp: 9-16

Abstract

The increasing availability of huge amounts of thin data, i.e. data pertaining to time and positions generated by different sources with a wide variety of technologies (e.g., RFID tags, GPS, GSM networks) leads to large spatio-temporal data collections. Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, supply chain management. In this paper, we address the clustering of spatial trajectories. In the context of trajectory data, this problem is even more challenging than in the classical transactions, as here we deal with data (trajectories) in which the order of items is relevant. We propose a novel approach based on a suitable regioning strategy and an efficient clustering technique based on edit distance. Experiments performed on real world datasets have confirmed the efficiency and effectiveness of the proposed techniques.

Efficient and effective RFID data warehousing

Bettina Fazzinga, Sergio Flesca, Elio Masciari, Filippo Furfaro
Conference PapersIDEAS 2009 Proceedings of the 2009 International Database Engineering & Applications Symposium, Pages 251-258

Abstract

Radio Frequency Identification (RFID) applications are emerging as key components in object tracking and supply chain management systems, since in the near future almost every major retailer will use RFID systems to track the shipment of products from suppliers to warehouses. Due to the streaming nature of RFID readings, large amounts of data are generated by these devices at high production rates. This phenomenon is even more relevant since RFIDs are so cheap that every individual item can be tagged, thus leaving a "trail" of data as it moves across different locations. This scenario raises new challenges in effectively and efficiently exploiting such large amounts of data. In this paper we address the problem of compressing RFID data in order to enable devices with a limited amount of available memory (such as PDAs) to issue queries on RFID warehouses. In particular, we designed a lossy strategy for collapsing tuples carrying information about items being delivered at different locations of the supply chain.

Sequential Pattern Mining from Trajectory Data

Elio Masciari, Barzan Mozafari
Conference PapersSEBD 2009: 125-132

A wrapper generation system for PDF documents

Bettina Fazzinga, Sergio Flesca, Andrea Tagarelli, Salvatore Garruzzo, Elio Masciari
Conference PapersSAC 2008 Proceedings of the 2008 ACM symposium on Applied computing, Pages 442-446

Abstract

The widespread use of the PDF format for exchanging print-oriented documents raises new challenges in the research field of information extraction. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Objects in a PDF document are accessible by their position, thus we exploit spatial constraints for driving the extraction of relevant information according to a set of group type definitions. Moreover, using fuzzy logic based conditions makes it possible to effectively handle uncertainty in the comprehension of the layout structure of PDF documents. The experimental results reported in the paper show good accuracy of our PDF wrapping system.

A Framework for Outlier Mining in RFID Data

Elio Masciari, Giuseppe M. Mazzeo
Conference PapersSEBD 2008: 287-293

A Framework for Outlier Mining in RFID data

Elio Masciari
Conference Papers2007 11th International Database Engineering and Applications Symposium, IDEAS-07, pp: 263-267

Abstract

Radio Frequency Identification (RFID) applications are emerging as key components in object tracking and supply chain management systems. In the near future almost every major retailer will use RFID systems to track the shipment of products from suppliers to warehouses. Due to the features of RFID readings, this will result in a huge amount of information generated by such systems once costs reach a level such that each individual item can be tagged, thus leaving a trail of data as it moves through different locations. We define a technique for efficiently detecting anomalous data in order to prevent problems related to inefficient shipment or fraudulent actions. Since items usually move together in large groups through distribution centers and only in stores do they move in smaller groups, we exploit this feature in order to design our technique. The preliminary experiments show the effectiveness of our approach.

RFID data management for effective objects tracking

Elio Masciari
Conference PapersSAC 2007 Proceedings of the 2007 ACM symposium on Applied computing, Pages 457-461

Abstract

Radio Frequency Identification (RFID) applications are emerging as key components in object tracking and supply chain management systems. In the near future almost every major retailer will use RFID systems to track the shipment of products from suppliers to warehouses. Due to the features of RFID readings, this will result in a huge amount of information generated by such systems once costs reach a level such that each individual item can be tagged, thus leaving a trail of data as it moves through different locations. We define a technique for efficiently detecting anomalous data in order to prevent problems related to inefficient shipment or fraudulent actions. Since items usually move together in large groups through distribution centers and only in stores do they move in smaller groups, we exploit this feature in order to design our technique. The preliminary experiments show the effectiveness of our approach.

Wrapping PDF Documents Exploiting Uncertain Knowledge

Sergio Flesca, Salvatore Garruzzo, Elio Masciari, Andrea Tagarelli
Conference Papers
Advanced Information Systems Engineering, Lecture Notes in Computer Science Volume 4001, 2006, pp 175-189

Abstract

The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. The proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because of the intrinsic uncertainty about the structure and presentation of PDF documents, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation that runs in polynomial time with respect to the size of a PDF document.

Efficiently Representing and Querying Sensor Network Readings on Data Grids

Alfredo Cuzzocrea, Filippo Furfaro, Elio Masciari, Giuseppe M. Mazzeo, Domenico Saccà
Conference Papers
SEBD 2006: 373-382

Exploiting Structural Similarity For Effective Web Information Extraction

Elio Masciari, Sergio Flesca, Giuseppe Manco, Luigi Pontieri, Andrea Pugliese
Conference Papers
Foundations of Semistructured Data 2005

Abstract

In this paper we propose an architecture that exploits the structural information of web pages to extract relevant information from them. In this architecture, a primary role is played by a distance-based classification methodology. This methodology is based on an efficient and effective technique for detecting structural similarities among semistructured documents, which differs significantly from standard methods based on graph-matching algorithms. The technique represents the structure of a document as a time series in which each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can then state the degree of similarity between documents. Experiments on real data show the effectiveness of the proposed technique.

A Distributed System for Answering Range Queries on Sensor Network Data

Alfredo Cuzzocrea, Filippo Furfaro, Sergio Greco, Elio Masciari, Giuseppe M. Mazzeo, Domenico Saccà
Conference Papers
Pervasive Computing and Communications Workshops, IEEE International Conference on, pp: 369-373

Abstract

A distributed system for approximate query answering on sensor network data is proposed, in which a suitable compression technique is used to represent data and support query answering. Each node of the system stores either detailed or summarized sensor readings. Query answers are computed by identifying the set of nodes that contain data (compressed or not) involved in the query and, if necessary, partitioning the query into a set of sub-queries to be evaluated at different nodes. Queries are partitioned according to a cost model that aims to make the evaluation efficient while guaranteeing the desired degree of accuracy of query answers.

Wrapping PDF Documents: A Preliminary Study

Sergio Flesca, Salvatore Garruzzo, Elio Masciari, Andrea Tagarelli
Conference Papers
SEBD 2005: 272-283

A Framework for minimizing Xpath queries

Sergio Flesca, Filippo Furfaro, Elio Masciari, Francesco Parisi
Conference Papers
SEBD 2004: 326-333

Approximate Query Answering on Sensor Network Data Streams

Alfredo Cuzzocrea, Filippo Furfaro, Elio Masciari, Cristina Sirangelo
Conference Papers
SEBD 2003: 93-108

On the minimization of Xpath queries

Sergio Flesca, Filippo Furfaro, Elio Masciari
Conference Papers
VLDB 2003: 153-164

A Framework for Adaptive Mail Classification

Giuseppe Manco, Elio Masciari, Andrea Tagarelli
Conference Papers
2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pp: 387

Abstract

We introduce a technique based on data mining algorithms for classifying incoming messages, as a basis for an overall architecture for maintenance and management of e-mail messages. We exploit clustering techniques for grouping structured and unstructured information extracted from e-mail messages in an unsupervised way, and exploit the resulting algorithm in the process of folder creation (and maintenance) and e-mail redirection. Some initial experimental results show the effectiveness of the technique, both from an efficiency and a quality-of-results viewpoint.

Fast Detection of XML Structural Similarity

Sergio Flesca, Giuseppe Manco, Elio Masciari, Luigi Pontieri, Andrea Pugliese
Conference Papers
SEBD 2002: 193-207

Detecting Structural Similarities between XML Documents.

Sergio Flesca, Giuseppe Manco, Elio Masciari, Luigi Pontieri, Andrea Pugliese
Conference Papers
WebDB 2002: 55-60

Meaningful Change Detection on the Web

Sergio Flesca, Filippo Furfaro, Elio Masciari
Conference Papers
Database and Expert Systems Applications, Lecture Notes in Computer Science Volume 2113, 2001, pp 22-31

Abstract

In this paper we present a new technique for detecting changes on the Web. We propose a new method for measuring the similarity of two documents, which can be used efficiently to discover changes in selected portions of the original document. The proposed technique has been implemented in the CDWeb system, which provides a change monitoring service on the Web. CDWeb differs from previously proposed systems in that it allows the detection of changes in portions of documents and of specific changes expressed by means of complex conditions; for example, users might want to know whether the value of a given stock has increased by more than 10%. Several tests on stock exchange and auction web pages proved the effectiveness of the proposed approach.

Using an Out-of-Core Technique for Clustering Large Data Sets

Elio Masciari, Clara Pizzuti, Giuseppe Raimondo, Domenico Talia
Conference Papers
2012 23rd International Workshop on Database and Expert Systems Applications, pp: 0133

Abstract

Data mining algorithms generally deal with very large data sets that do not fit in main memory. Therefore, techniques that manage huge data sets need to be developed. Any algorithm proposed for mining data should account for out-of-core data structures. However, most existing algorithms have not yet addressed this issue. In this paper we describe the implementation of an out-of-core technique for analyzing very large data sets with the sequential and parallel versions of the clustering algorithm AutoClass. We discuss the out-of-core technique and show performance results in terms of execution time and speedup.

Monitoring Web Information Changes

Sergio Flesca, Filippo Furfaro, Elio Masciari
Conference Papers
Information Technology: Coding and Computing, International Conference on, pp: 0421

Abstract

Web users often want to be notified when a specific piece of information contained in a web page has been modified. The problem of detecting web document changes has been deeply investigated, and several systems providing notification of web page changes are available. These systems, however, do not provide notification of changes to a specific piece of information contained in a web page. In this work we present a system called CDWeb that performs this task. It allows users to monitor a whole document or specific portions of it. Users can also specify what kind of changes they are interested in, such as structural changes or semantic changes. The system provides a flexible and adaptive view of the Web: it tracks user queries and creates user profiles in order to associate a personalized view with each user.

A Hybrid Technique for Data Mining on Balance-Sheet Data

Giuseppe Dattilo, Sergio Greco, Elio Masciari, Luigi Pontieri
Conference Papers
Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science Volume 1874, 2000, pp 419-424

Abstract

The recent rapid growth in the ability to generate and store data, driven by more powerful Database Management Systems and hardware architectures, leads to a question: how can we take advantage of this large amount of information? Traditional methods for querying and reporting are inadequate because they can only manipulate data, and the information content they derive is very low. Obtaining new relationships among data and new hypotheses about them is the aim of Knowledge Discovery in Databases (KDD), which makes use of Data Mining techniques. These techniques have interesting applications for business data, such as market basket analysis, financial resource planning, fraud detection and the scheduling of production processes. In this work we consider the application of Data Mining techniques to the analysis of the balance sheets of Italian companies.

Combining Different Data Mining Techniques to Improve Data Analysis

Sergio Greco, Elio Masciari, Luigi Pontieri
Conference Papers
Flexible Query Answering Systems, Advances in Soft Computing Volume 7, 2001, pp 455-464

Abstract

In this paper we propose the combined use of different methods to improve the data analysis process. This is obtained by combining inductive and deductive techniques. Inductive techniques are used for generating hypotheses from data, whereas deductive techniques are used to derive knowledge and to verify hypotheses. In order to guide users in the analysis process, we have developed a system that integrates deductive tools, data mining tools (such as classification algorithms and feature selection algorithms), visualization tools and tools for the easy manipulation of data sets. The system is currently used in a large project whose aim is to integrate information sources containing data on the socio-economic aspects of Calabria and to analyze the integrated data. Several experiments on socio-economic indicators of Calabrian cities have shown that the combined use of different techniques improves both the comprehensibility and the accuracy of models.

Un sistema per la classificazione dei bilanci aziendali

Giuseppe Dattilo, Elio Masciari, Luigi Pontieri
Conference Papers
SEBD 2000: 257-270

New Frontiers in Mining Complex Patterns

Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari, Zbigniew W. Ras
Book
4th International Workshop, NFMCP 2015, Held in Conjunction with ECML-PKDD 2015, Porto, Portugal, September 7, 2015, Revised Selected Papers, ISBN: 978-3-319-39314-8

New Frontiers in Mining Complex Patterns

Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari, Zbigniew W. Ras
Book
Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, ISBN: 978-3-319-17875-2

New Frontiers in Mining Complex Patterns

Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari, Zbigniew W. Ras
Book
Second International Workshop, NFMCP 2013, Held in Conjunction with ECML-PKDD 2013, Prague, Czech Republic, September 27, 2013, Revised Selected Papers, ISBN: 978-3-319-08406-0

New Frontiers in Mining Complex Patterns

Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari, Zbigniew W. Ras
Book
First International Workshop, NFMCP 2012, Held in Conjunction with ECML/PKDD 2012, Bristol, UK, September 24, 2012, Revised Selected Papers, ISBN: 978-3-642-37381-7

Current Teaching

  • 2020–Present

    Hardware and Software for Big Data

    Università Federico II di Napoli

  • 2020–Present

    Artificial Intelligence and Big Data

    Università Federico II di Napoli

  • 2019–Present

    Technologies for Information Systems

    Università Federico II di Napoli

  • 2019–Present

    Sistemi Informativi

    Università Federico II di Napoli

  • 2019–Present

    Basi di Dati

    Università Federico II di Napoli

Teaching History

  • 2011–2019

    Fondamenti di Informatica per Scienze e Tecnologie Biologiche

    Università della Calabria, Faculty of S. M. F. N.

  • 2007–2019

    Sistemi di Elaborazione

    Università Magna Graecia

  • 2014–2019

    Informatica per STPA

    Università Magna Graecia

  • 2014–2019

    Informatica per Biotecnologie

    Università Magna Graecia

  • 2001–2005

    Introduzione all'Informatica

    Università della Calabria

  • 2004–2006

    Fondamenti di Informatica

    Università della Calabria

  • 2006–2007

    Data e Text Mining

    Università della Calabria

  • 2001–2005

    Laboratorio di Programmazione

    Università della Calabria

  • 2008–2011

    Informatica per l'ambiente ed il territorio

    Università della Calabria

  • 2001–2002

    Programmazione Orientata Agli Oggetti

    Università della Calabria

  • 2002–2009

    Laboratorio di Programmazione

    Università Magna Graecia

  • 2005–2012

    Informatica per le professioni sanitarie

    Università Magna Graecia

  • 2006–2010

    Informatica per la facoltà di Giurisprudenza

    Università Magna Graecia

  • 2003–2009

    Fondamenti di Telecomunicazioni e Reti, modulo di Informatica

    Università Magna Graecia

  • 2012–2014

    Abilità Informatiche e Telematiche

    Università Magna Graecia

  • 2012–2014

    Matematica Statistica e Informatica

    Università Magna Graecia

At My Office

Napoli: DIETI - Via Claudio 21, Palazzina 3/A, room 4.05

Rende: ICAR-CNR - Via P. Bucci 9C, second floor

I am at my office every day from 9:30 am to 7:00 pm, but please consider calling or emailing to arrange an appointment.

At My Spin-off

Sometimes you can find me at my spin-off, Coremuniti, located at TechNest, Piazza Vermicelli, Arcavacata di Rende, Cosenza.

At My Lab

My lab is located close to my office, on the same floor.

Please refer to the laboratory staff.