Elio Masciari is currently an associate professor at the University Federico II of Naples.
Previously he was a senior researcher at the Institute for High Performance Computing and Networking (ICAR-CNR) of the National Research Council of Italy. He has advised several master's theses at the University of Calabria and at the University Magna Graecia of Catanzaro, and PhD theses in computer engineering at the University of Calabria. He has served as a member of the program committee of several international conferences and as a reviewer for several international scientific journals. He is the author of more than 140 publications in journals and in both national and international conferences. He also holds the "Abilitazione Scientifica Nazionale" (Italian National Scientific Qualification) for the role of Full Professor.
NFMCP Series
ECML PKDD Discovery Challenge 2016
NFMCP@ECML/PKDD
University Federico II of Naples
ICAR CNR
UCLA, WEB Information System Laboratory
Master's Degree in Physics
Università della Calabria
Fifteenth International Summer School for Computer Science Researchers: "Algorithmics for Data Mining and Pattern Discovery"
Lipari
"Scientific Data Mining 2002" IPAM International School for Researcher
Ph.D. "Ingegneria Informatica e Dei Sistemi XV ciclo"
Università della Calabria
Thirteenth International Summer School for Computer Science Researchers: "Foundations of wide area network programming".
Lipari
Engineering Licence
Università della Calabria
Master's Degree in Computer Science Engineering
Università della Calabria
Elio Masciari was a visiting researcher at the Department of Computer Science of the University of California, Los Angeles (2005 and 2006). Since 2001 he has also taught at the University of Calabria, and since 2002 at the University Magna Graecia of Catanzaro.
His current research interests include, but are not limited to: Knowledge Discovery and Data Mining, Web Databases and Semistructured Data Management, Data Stream Mining and Spatio-Temporal Data Mining. Within ICAR-CNR, he has been coordinator of and contributor to several national and international research/industrial projects. Among these, we mention: "Pushing Intelligence into Workflow Systems" (a national project aimed at defining novel techniques for extending current Workflow Management Systems with advanced analytical methods that support workflow design, optimization and monitoring); "GeoPKDD: Geographic Privacy-Aware Knowledge Discovery and Delivery" (a STREP European project and a homonymous national project, both aimed at studying techniques for harvesting and analyzing spatio-temporal information by privacy-aware methods); "TESEO: Techniques and methods for personalization of on-line services" (a regional project aimed at studying techniques and methods for providing multimodal and multichannel adaptive services to web users); "INFOMIX: Boosting the Information Integration" (a STREP European project aimed at defining a robust innovative theory and methodology for flexible information integration); "ECD: Technologies and Services for Enhanced Content Delivery" (a national project aimed at creating an advanced technology for organizing and handling web contents by means of data mining techniques); "Discovery Farm" (a regional project aimed at defining and implementing a Pervasive Knowledge Management platform); "DICET-INMOTO" (a Big Data application for tourism); and "VIPOC" (a Big Data application for the renewable energy market).
Here you can find a selected list of my publications.
For a more detailed (but still incomplete) list you can visit my profile on Google Scholar.
In this paper we propose an end-to-end framework that allows efficient analysis of trajectory streams. In particular, our approach consists of several steps. First, we perform a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories using a suitable data structure. After the encoding step we build specialized cuboids for trajectories in order to make the querying step quite effective. This problem proved really challenging, as we deal with data (trajectories) for which the order of elements is relevant, thus making the analysis considerably harder than for classical transactional data. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
The recent advances in genomic technologies and the availability of large-scale microarray datasets call for the development of advanced data analysis techniques, such as data mining and statistical analysis to cite a few. Among the mining techniques proposed so far, cluster analysis has become a standard method for the analysis of microarray expression data. It can be used both for initial screening of patients and for extraction of disease molecular signatures. Moreover, clustering can be profitably exploited to characterize genes of unknown function and uncover patterns that can be interpreted as indications of the status of cellular processes. Finally, clustering biological data would be useful not only for exploring the data but also for discovering implicit links between the objects. To this end, several clustering approaches have been proposed in order to obtain a good trade-off between accuracy and efficiency of the clustering process. In particular, great attention has been devoted to hierarchical clustering algorithms for their accuracy in unsupervised identification and stratification of groups of similar genes or patients, while partition-based approaches are exploited when fast computations are required. Indeed, it is well known that no existing clustering algorithm completely satisfies both accuracy and efficiency requirements; thus, a good clustering algorithm has to be evaluated with respect to some external criteria that are independent of the metric being used to compute clusters. In this paper, we propose a clustering algorithm called M-CLUBS (for Microarray data CLustering Using Binary Splitting) exhibiting higher accuracy than the hierarchical approaches proposed so far while allowing a faster computation with respect to partition-based approaches. Indeed, M-CLUBS is faster and more accurate than other algorithms, including k-means and its recently proposed refinements, as we will show in the experimental section. The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion possessing unique analytical properties that we exploit to achieve a very fast computation. M-CLUBS derives good clusters without requiring input from users, and it is robust and impervious to noise, while providing better speed and accuracy than methods, such as BIRCH, that are endowed with the same critical properties. Due to the structural features of microarray data (they are represented as arrays of numeric values), M-CLUBS is suitable for analyzing them, since it is designed to perform well for Euclidean distances. In order to strengthen the obtained results, the clusters have been interpreted by a domain expert and evaluated by quality measures specifically tailored for biological validity assessment.
Nowadays, almost all kinds of electronic devices leave traces of their movements (e.g. smartphones, GPS devices and so on). Thus, the huge number of these “tiny” data sources leads to the generation of massive data streams of geo-referenced data. As a matter of fact, the effective analysis of such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data stream management poses new challenges both for the proper definition and for the acquisition of such data, thus making the overall process harder than for classical point data. In particular, we are interested in solving the problem of effective trajectory data stream clustering, which proved really intriguing as we deal with sequential data that have to be properly managed due to their ordering. We propose a framework that allows data pre-processing in order to make the mining step more effective. As for every data mining tool, the experimental evaluation is crucial, thus we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed approach.
RFID-based systems for object tracking and supply chain management have been emerging since the RFID technology proved effective in monitoring movements of objects. The monitoring activity typically results in huge numbers of readings, thus making the problem of efficiently retrieving aggregate information from the collected data a challenging issue. In fact, tackling this problem is of crucial importance, as fast answers to aggregate queries are often mandatory to support the decision making process. In this regard, a compression technique for RFID data is proposed, and used as the core of a system supporting the efficient estimation of aggregate queries. Specifically, this technique aims at constructing a lossy synopsis of the data over which aggregate queries can be estimated, without accessing the original data. Owing to the lossy nature of the compression, query estimates are approximate, and are returned along with intervals that are guaranteed to contain the exact query answers. The effectiveness of the proposed approach has been experimentally validated, showing a remarkable trade-off between the efficiency and the accuracy of the query estimation.
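To give a flavour of how a lossy synopsis can return bounded answers, the sketch below builds a toy equi-width bucket synopsis over timestamped readings and answers range-count queries with a guaranteed [lower, upper] interval. It is only an illustrative sketch: the bucket layout, the estimation rule and the uniformity assumption are mine, not the compression scheme proposed in the paper.

```python
class TimeBucketSynopsis:
    """Lossy synopsis: timestamped readings are merged into equi-width time buckets.

    Each bucket keeps only the total count of the readings it absorbed, so a
    range-count query can be answered without the original data, together with
    an interval guaranteed to contain the exact answer."""

    def __init__(self, t_min, t_max, n_buckets):
        self.t_min, self.width = t_min, (t_max - t_min) / n_buckets
        self.totals = [0] * n_buckets

    def add(self, timestamp, count=1):
        idx = min(int((timestamp - self.t_min) / self.width), len(self.totals) - 1)
        self.totals[idx] += count

    def range_count(self, a, b):
        """Return (estimate, lower_bound, upper_bound) for readings with time in [a, b]."""
        est = low = high = 0.0
        for i, tot in enumerate(self.totals):
            lo = self.t_min + i * self.width
            hi = lo + self.width
            overlap = max(0.0, min(b, hi) - max(a, lo))
            if overlap == 0:
                continue
            if overlap == self.width:
                # Bucket fully covered by the query: exact contribution.
                est += tot
                low += tot
                high += tot
            else:
                # Partial overlap: the contribution is only bounded.
                est += tot * overlap / self.width  # uniformity assumption for the estimate
                high += tot                        # at most all of the bucket falls in [a, b]
        return est, low, high

# Toy usage: 1000 readings spread over one day, queried on a 6-hour range.
syn = TimeBucketSynopsis(0, 24, n_buckets=8)
for t in range(1000):
    syn.add(timestamp=(t * 0.024) % 24)
print(syn.range_count(6, 12))
```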
Data streams are potentially infinite data sources that flow continuously while monitoring a physical phenomenon, like temperature levels, or other kinds of human activities, such as clickstreams, telephone call records, and so on. RFID technology has led in recent years to the generation of huge streams of data. Moreover, RFID-based systems allow the effective management of items tagged by RFID tags, especially for supply chain management or object tracking. In this paper we introduce SMART (Stream Monitoring enterprise Activities by RFID Tags), a system based on an outlier template definition for detecting anomalies in RFID streams. We describe SMART features and its application in a real life scenario that shows the effectiveness of the proposed method for enterprise management. Moreover, we describe an outlier detection approach we defined and effectively exploited in SMART.
Advances in high throughput technologies have yielded the possibility to investigate healthy and morbid human cells at different levels. Consequently, this has made possible the discovery of new biological and biomedical data and the proliferation of a large number of databases. In this paper, we describe the IS-BioBank (Integrated Semantic Biological Data Bank) proposal. It consists of the realization of a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. In this framework, a key role has been played by the Connectivity Map, a databank which relates diseases, physiological processes, and the action of drugs. The system will be used in a pilot study on the Multiple Myeloma (MM).
PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the “uncertainty” which is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure to groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation working in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains.
XPath expressions define navigational queries on XML data and are issued on XML documents to select sets of element nodes. Due to the wide use of XPath, which is embedded into several languages for querying and manipulating XML data, the problem of efficiently answering XPath queries has received increasing attention from the research community. As the efficiency of computing the answer of an XPath query depends on its size, replacing XPath expressions with equivalent ones having the smallest size is a crucial issue in this direction. This article investigates the minimization problem for a wide fragment of XPath (namely XP[*]), where the use of the most common operators (child, descendant, wildcard and branching) is allowed with some syntactic restrictions. The examined fragment consists of expressions which have not been specifically studied in the relational setting before: neither are they mere conjunctive queries (as the combination of “//” and “*” enables an implicit form of disjunction to be expressed) nor do they coincide with disjunctive ones (as the latter are more expressive). Three main contributions are provided. The “global minimality” property is shown to hold: the minimization of a given XPath expression can be accomplished by removing pieces of the expression, without having to re-formulate it (as for “general” disjunctive queries). Then, the complexity of the minimization problem is characterized, showing that it is the same as the containment problem. Finally, specific forms of XPath expressions are identified, which can be minimized in polynomial time.
The continuous exchange of information by means of the popular email service has raised the problem of managing the huge amounts of messages received from users in an effective and efficient way. We deal with the problem of email classification by conceiving suitable strategies for: (1) organizing messages into homogeneous groups, (2) redirecting further incoming messages according to an initial organization, and (3) building reliable descriptions of the message groups discovered. We propose a unified framework for handling and classifying email messages. In our framework, messages sharing similar features are clustered in a folder organization. Clustering and pattern discovery techniques for mining structured and unstructured information from email messages are the basis of an overall process of folder creation/maintenance and email redirection. Pattern discovery is also exploited for generating suitable cluster descriptions that play a leading role in cluster updating. Experimental evaluation performed on several personal mailboxes shows the effectiveness of our approach.
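As a minimal illustration of the clustering step only (not the exact techniques used in the paper), the following sketch groups a toy mailbox by textual similarity using off-the-shelf TF-IDF features and k-means; the sample messages and the number of clusters are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy mailbox: subject + body concatenated per message (illustrative data only).
messages = [
    "meeting agenda for project review",
    "project review slides attached",
    "lunch on friday?",
    "friday lunch confirmed",
]

# Represent each message by TF-IDF features over its text.
X = TfidfVectorizer(stop_words="english").fit_transform(messages)

# Group messages into candidate folders by similarity; k is fixed by hand here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for msg, folder in zip(messages, labels):
    print(folder, msg)
```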
In this paper, we propose a classification technique for Web pages, based on the detection of structural similarities among semistructured documents, and devise an architecture exploiting such a technique for the purpose of information extraction. The proposal significantly differs from standard methods based on graph-matching algorithms, and is based on the idea of representing the structure of a document as a time series in which each occurrence of a tag corresponds to an impulse. The degree of similarity between documents is then stated by analyzing the frequencies of the corresponding Fourier transform. Experiments on real data show the effectiveness of the proposed technique.
Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be compared as to their structural similarity, in order to group them into clusters so that different storage, retrieval, and processing techniques can be effectively exploited. In this scenario, an efficient and effective similarity function is the key of a successful data management process. We present an approach for detecting structural similarity between XML documents which significantly differs from standard methods based on graph-matching algorithms, and allows a significant reduction of the required computation costs. Our proposal roughly consists of linearizing the structure of each XML document, by representing it as a numerical sequence and, then, comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Moreover, the theory of Discrete Fourier Transform is exploited to effectively and efficiently compare the encoded documents (i.e., signals) in the domain of frequencies. Experimental results reveal the effectiveness of the approach, also in comparison with standard methods.
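The following sketch illustrates the core idea under simplifying assumptions of mine (a plain tag-to-integer encoding, zero padding, Euclidean distance between magnitude spectra); the encodings and the similarity measure studied in the paper are more refined.

```python
import numpy as np

def doc_signal(tags, alphabet):
    """Encode a document's tag sequence as a numeric signal:
    each tag occurrence becomes an impulse whose height identifies the tag."""
    code = {t: i + 1 for i, t in enumerate(sorted(alphabet))}
    return np.array([code[t] for t in tags], dtype=float)

def spectral_distance(tags_a, tags_b):
    """Compare the magnitude spectra of the two encoded documents.
    Signals are zero-padded to a common length so the DFT bins line up."""
    alphabet = set(tags_a) | set(tags_b)
    a, b = doc_signal(tags_a, alphabet), doc_signal(tags_b, alphabet)
    n = max(len(a), len(b))
    fa = np.abs(np.fft.rfft(a, n))
    fb = np.abs(np.fft.rfft(b, n))
    return np.linalg.norm(fa - fb) / n

# Two structurally similar documents vs. a structurally different one (toy tag sequences).
d1 = ["book", "title", "author", "author", "year"]
d2 = ["book", "title", "author", "year"]
d3 = ["html", "body", "div", "div", "div", "span"]
print(spectral_distance(d1, d2), spectral_distance(d1, d3))
```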
Nowadays several companies use the information available on the Web for a number of purposes. However, since most of this information is only available as HTML documents, several techniques that allow information from the Web to be automatically extracted have recently been defined. In this paper we review the main techniques and tools for extracting information available on the Web, devising a taxonomy of existing systems. In particular we emphasize the advantages and drawbacks of the techniques analyzed from a user point of view.
In this paper we present a new technique for detecting changes in Web documents. The technique is based on a new method for measuring the similarity of two documents, representing the current and the previous version of the monitored page. The technique has been effectively used to discover changes in selected portions of the original document. The proposed technique has been implemented in the CMW system, providing a change monitoring service on the Web. The main features of CMW are the detection of changes on selected portions of web documents and the possibility to express complex queries on the changed information. For instance, a query can check whether the value of a given stock has increased by more than 10%. Several tests on stock exchange and auction web pages proved the effectiveness of the proposed approach.
In this paper we propose the combined use of different methods to improve the data analysis process. This is obtained by combining inductive and deductive techniques. We also use different inductive techniques, such as clustering algorithms to derive data partitions and decision tree induction to characterize classes in terms of logical rules. Inductive techniques are used for generating hypotheses from data whereas deductive techniques are used to derive knowledge and to verify hypotheses. In order to guide users in the analysis process, we have developed a system which integrates deductive tools and data mining tools such as classification algorithms, feature selection algorithms, visualization tools and tools to manipulate data sets easily. The system developed is currently used in a large project whose aim is the integration of information sources containing data concerning the socio-economic aspects of Calabria and their subsequent analysis. Several experiments on the socio-economic data have shown that the combined use of different techniques improves both the comprehensibility and the accuracy of models.
Information management in healthcare is nowadays experiencing a great revolution. After the impressive progress in digitizing medical data by private organizations, the federal government and other public stakeholders have also started to make use of healthcare data for data analysis purposes in order to extract actionable knowledge. In this paper, we propose an architecture for supporting interoperability in healthcare systems by exploiting Big Data techniques. In particular, we describe a proposal based on Big Data techniques to implement a nationwide system able to improve EHR data access efficiency and reduce costs.
Log analysis and querying have recently received a renewed interest from the research community, as the effective understanding of process behavior is crucial for improving business process management. Indeed, currently available log querying tools are not completely satisfactory, especially from the viewpoint of ease of use. As a matter of fact, there is no framework which meets the requirements of ease of use, flexibility and efficiency of query evaluation. In this paper, we propose a framework for graphical querying of (process) log data that makes the log analysis task quite easy and efficient, adopting a very general model of process log data which guarantees a high level of flexibility. We implemented our framework by using a flexible storage architecture and a user-friendly data analysis interface, based on an intuitive and yet expressive graph-based query language. Experiments performed on real data confirm the validity of the approach.
Information management in healthcare is nowadays experiencing a great revolution. After the impressive progress in digitizing medical data by private organizations, the federal government and other public stakeholders have also started to make use of healthcare data for data analysis purposes in order to extract actionable knowledge. In this paper, we propose an architecture for supporting interoperability in healthcare systems by exploiting Big Data techniques. In particular, we describe a proposal based on Big Data techniques to implement a nationwide system able to improve EHR data access efficiency and reduce costs.
Due to the increasing availability of huge amounts of data, traditional data management techniques prove inadequate in many real life scenarios. Furthermore, the heterogeneity and high speed of these data require suitable data storage and management tools to be designed from scratch. In this paper, we describe a framework tailored for analyzing user interactions with intelligent systems while seeking domain-specific information (e.g., choosing a good restaurant in a visited area). The framework enhances the user quest for information by performing a data exchange activity (called data posting) which enriches the information sources with additional background information and knowledge derived from experiences and behavioral properties of domain experts and users.
The Big Data paradigm is currently the leading paradigm for data production and management. As a matter of fact, new information is generated at high rates in specialized fields (e.g., the cybersecurity scenario). As a consequence, the events to be studied may occur at rates that are too fast to be effectively analyzed in real time. For example, in order to detect possible security threats, millions of records in a high-speed flow stream must be screened. To ameliorate this problem, a viable solution is the use of data compression for reducing the amount of data to be analyzed. In this paper we propose the use of privacy-preserving histograms, that provide approximate answers to 'safe' queries, for analyzing data in the cybersecurity scenario without compromising individuals' privacy, and we describe our system that has been used in a real life scenario.
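As an illustration only: one standard way to realize a privacy-preserving histogram is to perturb bucket counts with Laplace noise (a differentially private histogram). The sketch below follows that route on synthetic flow data; the notion of 'safe' queries and the actual mechanism used in our system may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_histogram(values, bins, epsilon=1.0):
    """Equi-width histogram whose counts are perturbed with Laplace noise
    (scale 1/epsilon), so releasing it does not expose any single record.
    This is the textbook differentially-private histogram, used here purely
    as an illustration of the privacy-preserving idea."""
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0, None), edges

def range_count(noisy_counts, edges, lo, hi):
    """Approximate number of records in [lo, hi] using only the noisy buckets."""
    mask = (edges[:-1] >= lo) & (edges[1:] <= hi)
    return float(noisy_counts[mask].sum())

# Toy flow stream: per-connection packet counts (illustrative data only).
packets = rng.poisson(50, size=10_000)
hist, edges = private_histogram(packets, bins=20, epsilon=0.5)
print(range_count(hist, edges, 40, 60))
```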
Due to the emerging Big Data paradigm, traditional data management techniques prove inadequate in many real life scenarios. In particular, OLAP techniques require substantial changes in order to offer useful analysis, due to the huge amount of data to be analyzed and to their velocity and variety. In this paper, we describe an approach for dynamic Big Data searching that, based on data collected by a suitable storage system, enriches the data in order to guide users through data exploration in an efficient and effective way.
Predicting the output power of renewable energy production plants distributed over a wide territory is a really valuable goal, both for marketing and energy management purposes. The Vi-POC (Virtual Power Operating Center) project aims at designing and implementing a prototype which is able to achieve this goal. Due to the heterogeneity and the high volume of data, it is necessary to exploit suitable Big Data analysis techniques in order to provide quick and secure access to data, which cannot be obtained with traditional approaches for data management. In this paper, we describe Vi-POC -- a distributed system for storing huge amounts of data gathered from energy production plants and weather prediction services. We use HBase over the Hadoop framework on a cluster of commodity servers in order to provide a system that can be used as a basis for running machine learning algorithms. Indeed, we perform one-day ahead forecasts of PV energy production based on Artificial Neural Networks in two learning settings, that is, structured and non-structured output prediction. Preliminary experimental results confirm the validity of the approach, also when compared with a baseline approach.
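A minimal sketch of the non-structured (one-day ahead) forecasting setting, assuming a small feed-forward network over synthetic weather features; in Vi-POC the features come from the HBase store and the models are richer.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic daily records: [forecast irradiance, temperature, cloud cover]
# with a noisy PV yield; these stand in for the data gathered by the system.
n = 500
X = np.column_stack([
    rng.uniform(0, 8, n),      # forecast irradiance (kWh/m^2)
    rng.uniform(-5, 35, n),    # temperature (Celsius)
    rng.uniform(0, 1, n),      # cloud cover fraction
])
y = 4.0 * X[:, 0] * (1 - 0.6 * X[:, 2]) + rng.normal(0, 1.0, n)  # toy yield (kWh)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One-day-ahead forecast with a small feed-forward ANN (illustrative hyper-parameters).
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out days:", round(model.score(X_te, y_te), 3))
```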
Due to the emerging Big Data applications, traditional data management techniques prove inadequate in many real life scenarios. In particular, OLAP techniques require substantial changes in order to offer useful analysis, due to the huge amount of data to be analyzed and to their velocity and variety. In this paper, we describe an approach for dynamic Big Data searching that, based on data collected by a suitable storage system, enriches the data in order to guide users through data exploration in an efficient and effective way.
The issue of devising efficient and effective solutions for supporting the analysis of process logs has recently received great attention from the research community, as effectively accomplishing any business process management task requires understanding the behavior of the processes. In this paper, we propose a new framework supporting the analysis of process logs, exhibiting two main features: a flexible data model (enabling an exhaustive representation of the facets of the business processes that are typically of interest for the analysis) and a graphical query language, providing a user-friendly tool for easily expressing both selection and aggregate queries over the business processes and the activities they are composed of. The framework can be easily and efficiently implemented by leveraging either “traditional” relational DBMSs or “innovative” NoSQL DBMSs, such as Neo4J.
We consider the scenario where the executions of different business processes are traced into a log, where each trace describes a process instance as a sequence of low-level events (representing basic kinds of operations). In this context, we address a novel problem: given a description of the processes’ behaviors in terms of high-level activities (instead of low-level events), and in the presence of uncertainty in the mapping between events and activities, find all the interpretations of each trace Φ. Specifically, an interpretation is a pair ⟨σ,W⟩ that provides a two-level “explanation” for Φ: σ is a sequence of activities that may have triggered the events in Φ, and W is a process whose model admits σ. To solve this problem, we propose a probabilistic framework representing “consistent” Φ’s interpretations, where each interpretation is associated with a probability score.
The increasing availability of large process log repositories calls for efficient solutions for their analysis. In this regard, a novel specialized compression technique for process logs is proposed, that builds a synopsis supporting a fast estimation of aggregate queries, which are of crucial importance in exploratory and high-level analysis tasks. The synopsis is constructed by progressively merging the original log-tuples, which represent single activity executions within the process instances, into aggregate tuples, summarizing sets of activity executions. The compression strategy is guided by a heuristic aiming at limiting the loss of information caused by summarization, while guaranteeing that no information is lost on the set of activities performed within the process instances and on the order among their executions. The selection conditions in an aggregate query are specified in terms of a graph pattern, that allows precedence relationships over activity executions to be expressed, along with conditions on their starting times, durations, and executors. The efficacy of the compression technique, in terms of capability of reducing the size of the log and of accuracy of the estimates retrieved from the synopsis, has been experimentally validated.
The pervasive diffusion of new generation devices like smartphones and tablets, along with the widespread use of social networks, causes the generation of massive data flows containing heterogeneous information generated at different rates and having different formats. These data are referred to as Big Data and require new storage and analysis approaches to be investigated for managing them. In this paper we describe a system for dealing with massive tourism flows that we exploited for the analysis of tourist behavior in Italy. We defined a framework that exploits a NoSQL approach for data management and MapReduce for improving the analysis of the data gathered from different sources.
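A toy sketch of the MapReduce-style aggregation step, written in plain Python for illustration (the system itself relies on a NoSQL store and a Hadoop-style engine); the record layout and region names are made up.

```python
from collections import Counter
from functools import reduce

# Toy records gathered from heterogeneous sources: (tourist_id, region visited).
records = [
    ("u1", "Calabria"), ("u2", "Campania"), ("u1", "Campania"),
    ("u3", "Calabria"), ("u2", "Campania"),
]

# Map phase: each record is turned into a (key, 1) pair.
mapped = [(region, 1) for _, region in records]

# Shuffle/reduce phase: counts are merged per key, as a MapReduce engine would do.
def reducer(acc, pair):
    acc[pair[0]] += pair[1]
    return acc

visits_per_region = reduce(reducer, mapped, Counter())
print(visits_per_region.most_common())
```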
The problem of accurately predicting the energy production from renewable sources has recently received increasing attention from both the industrial and the research communities. It presents several challenges, such as coping with the rate at which data are provided by sensors, the heterogeneity of the collected data and power plant efficiency, as well as uncontrollable factors such as weather conditions and user consumption profiles. In this paper we describe Vi-POC (Virtual Power Operating Center), a project conceived to assist energy producers and decision makers in the energy market, and how we face the challenges posed by this specific application. The solutions we propose have roots both in big data management and in stream data mining.
The recent advances in genomic technologies and the availability of large-scale datasets call for the development of advanced data analysis techniques, such as data mining and statistical analysis to cite a few. A main goal in understanding cell mechanisms is to explain the relationship among genes and related molecular processes through the combined use of technological platforms and bioinformatics analysis. High throughput platforms, such as microarrays, enable the investigation of the whole genome in a single experiment. Among the mining techniques proposed so far, cluster analysis has become a standard method for the analysis of microarray expression data. It can be used both for initial screening of patients and for extraction of disease molecular signatures. Moreover, clustering can be profitably exploited to characterize genes of unknown function and uncover patterns that can be interpreted as indications of the status of cellular processes. Finally, clustering biological data would be useful not only for exploring the data but also for discovering implicit links between the objects. Indeed, a key feature lacking in many proposed approaches is the biological interpretation of the obtained results. In this paper, we will discuss such an issue by analysing the results obtained by several clustering algorithms w.r.t. their biological relevance.
In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks and supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding-window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. In order to make counting really efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can then be used to expedite the computation.
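The sketch below only hints at the idea in a simplified form: regions are mapped to distinct primes, and the product of the primes in a window is used as a cheap necessary condition (a divisibility test) before the exact, order-aware check. The actual counting algorithm and the use of the Chinese remainder theorem in the paper are more sophisticated.

```python
# Map each region symbol to a distinct prime (illustrative assignment).
PRIMES = {"A": 2, "B": 3, "C": 5, "D": 7, "E": 11}

def product(symbols):
    """Product of the primes associated with a sequence of region symbols."""
    p = 1
    for s in symbols:
        p *= PRIMES[s]
    return p

def count_pattern(stream, pattern, window):
    """Count sliding windows of the region stream that contain `pattern` as a
    contiguous subsequence. The prime product gives a quick necessary condition
    (multiset containment via divisibility) before the exact order-aware check."""
    target = product(pattern)
    pat = "".join(pattern)
    hits = 0
    for i in range(len(stream) - window + 1):
        w = stream[i:i + window]
        if product(w) % target == 0 and pat in "".join(w):
            hits += 1
    return hits

trajectory_stream = list("ABCABDABCE")
print(count_pattern(trajectory_stream, ["A", "B", "C"], window=4))
```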
A simple hierarchical clustering algorithm called CLUBS (for CLustering Using Binary Splitting) is proposed. CLUBS is faster and more accurate than existing algorithms, including k-means and its recently proposed refinements. The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion possessing unique analytical properties that we exploit to achieve a very fast computation. CLUBS derives good clusters without requiring input from users, and it is robust and impervious to noise, while providing better speed and accuracy than methods, such as BIRCH, that are endowed with the same critical properties.
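A rough sketch of the divisive phase under assumptions of mine: clusters are split by the axis-parallel cut that most reduces the within-cluster sum of squares (SSQ), with a simple relative-gain stopping rule in place of the analytical criterion used by CLUBS, and the agglomerative phase is omitted.

```python
import numpy as np

def ssq(points):
    """Within-cluster sum of squared distances to the centroid."""
    if len(points) == 0:
        return 0.0
    return float(((points - points.mean(axis=0)) ** 2).sum())

def best_split(points):
    """Best axis-parallel binary split: the cut that minimises the total SSQ
    of the two resulting halves. Returns (cost, split_info) where split_info
    is None if no cut improves on keeping the cluster whole."""
    best = (ssq(points), None)
    for d in range(points.shape[1]):
        order = points[:, d].argsort()
        sorted_pts = points[order]
        for i in range(1, len(points)):
            cost = ssq(sorted_pts[:i]) + ssq(sorted_pts[i:])
            if cost < best[0]:
                best = (cost, (d, sorted_pts[i, d], order[:i], order[i:]))
    return best

def divisive_phase(points, min_rel_gain=0.5):
    """Recursively split clusters while the best split removes at least
    `min_rel_gain` of the cluster's SSQ (a crude heuristic stopping rule)."""
    clusters, queue = [], [points]
    while queue:
        c = queue.pop()
        total = ssq(c)
        cost, split = best_split(c)
        if split is None or total == 0 or (total - cost) / total < min_rel_gain:
            clusters.append(c)
        else:
            _, _, left, right = split
            queue += [c[left], c[right]]
    return clusters

# Toy 2-D data with two obvious groups.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print([len(c) for c in divisive_phase(data)])
```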
In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks and supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding-window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. In order to make counting really efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can then be used to expedite the computation.
In this paper, we address the problem of trajectory data stream warehousing and querying, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end-to-end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
XML (eXtensible Markup Language) has become in recent years the standard for data representation and exchange on the WWW. This has resulted in a great need for data cleaning techniques in order to identify outlying data. In this paper, we present a technique for outlier detection that singles out anomalies with respect to a relevant group of objects. We exploit a suitable encoding of XML documents as signals of fixed frequency that can be transformed using the Fourier Transform. Outliers are identified by simply looking at the signal spectra. The results show the effectiveness of our approach.
The recent advances in computing technology lead to the availability of a huge number of computational resources that can be easily connected through network infrastructures. However, only a really small fraction of the available computing power is fully exploited for performing effective computation of user tasks. On the contrary, there are several research projects that require a lot of computing power to reach their goals, but they usually lack adequate resources, thus making the project activities quite hard to complete. In this paper we describe D.E.A. (Distributed Execution Agent), a framework for sharing computational resources. We exploit the D.E.A. framework to tame the computationally demanding problem of MD5 hash reversing. We performed several experiments that confirmed the validity of our approach.
Over the last decade, the advances in the high-throughput omic technologies have given the possibility to profile tumor cells at different levels, fostering the discovery of new biological data and the proliferation of a large number of bio-technological databases. In this paper we describe a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. The system will be used in a pilot study on the Multiple Myeloma (MM).
XML (eXtensible Markup Language) has become in recent years the standard for data representation and exchange on the WWW. This has resulted in a great need for data cleaning techniques in order to identify outlying data. In this paper, we present a technique for outlier detection that singles out anomalies with respect to a relevant group of objects. We exploit a suitable encoding of XML documents as signals of fixed frequency that can be transformed using the Fourier Transform. Outliers are identified by simply looking at the signal spectra. The results show the effectiveness of our approach.
Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is a challenging problem, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data streams pose interesting challenges for their proper representation, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data stream clustering, which proved really intriguing as we deal with a kind of data (trajectories) for which the order of elements is relevant. We propose a complete framework, starting from the data preparation task, that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed technique.
Over the last decade, the advances in high-throughput omic technologies have given the possibility to profile tumor cells at different levels, fostering the discovery of new biological data and the proliferation of a large number of bio-technological databases. In this paper we describe a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. In this framework, a key role is played by the Connectivity Map, a databank which relates diseases, physiological processes, and the action of drugs. The system will be used in a pilot study on the Multiple Myeloma (MM).
Trajectory data streams are huge amounts of data pertaining to time and position of moving objects, continuously generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of On Line Analytical Processing of trajectory data streams, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end-to-end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data outlier detection, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a complete framework, starting from the data preparation task, that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed technique.
Trajectory data streams are huge amounts of data pertaining to time and position of moving objects, continuously generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of On Line Analytical Processing of trajectory data streams, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end-to-end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
Trajectory data refer to time and position of moving objects generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. In this paper, we address the problem of trajectory data stream clustering, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a complete framework, starting from the data preparation task, that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data stream clustering, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant.
Data streams are potentially infinite sources of data that flow continuously while monitoring a physical phenomenon, like temperature levels, or other kinds of human activities, such as clickstreams, telephone call records, and so on. Radio Frequency Identification (RFID) technology has led in recent years to the generation of huge streams of data. Moreover, RFID-based systems allow the effective management of items tagged by RFID tags, especially for supply chain management or object tracking. In this paper we introduce SMART (Simple Monitoring enterprise Activities by RFID Tags), a system based on an outlier template definition for detecting anomalies in RFID streams. We describe SMART features and its application in a real life scenario that shows the effectiveness of the proposed method for effective enterprise management.
The increasing availability of huge amounts of data pertaining to time and position of moving objects, generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks), leads to large spatial data collections. Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. Moreover, spatial data pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory clustering, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a complete framework, starting from the data preparation task, that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
The increasing availability of huge amounts of data pertaining to time and positions generated by different sources using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks) leads to large spatial data collections. Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, supply chain management. In this paper, we address the problem of clustering spatial trajectories. In the context of trajectory data, clustering is really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose a novel approach based on a suitable regioning strategy and an efficient and effective clustering technique based on a proper metric. Finally, we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.
The increasing availability of huge amounts of “thin” data, i.e. data pertaining to time and positions generated by different sources with a wide variety of technologies (e.g., RFID tags, GPS, GSM networks), leads to large spatio-temporal data collections. Mining such amounts of data is challenging, since the possibility of extracting useful information from this particular type of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. In this paper, we address the issue of clustering spatial trajectories. In the context of trajectory data, this problem is even more challenging than for classical transactional data, as here we deal with data (trajectories) in which the order of items is relevant. We propose a novel approach based on a suitable regioning strategy and an efficient clustering technique based on edit distance. Experiments performed on real world datasets have confirmed the efficiency and effectiveness of the proposed techniques.
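A minimal sketch of the two ingredients, with a grid-based regioning and plain Levenshtein distance standing in for the regioning strategy and the metric studied in the paper; the cell size and the toy trajectories are illustrative assumptions.

```python
import numpy as np

def regionize(trajectory, cell=1.0):
    """Map each (x, y) point to a grid-cell label, collapsing consecutive
    repetitions, so a trajectory becomes a short string of visited regions."""
    labels = []
    for x, y in trajectory:
        lab = f"{int(x // cell)}_{int(y // cell)}"
        if not labels or labels[-1] != lab:
            labels.append(lab)
    return labels

def edit_distance(a, b):
    """Standard Levenshtein distance between two region sequences."""
    dp = np.arange(len(b) + 1, dtype=int)
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return int(dp[-1])

# Three toy trajectories: the first two follow the same route.
t1 = [(0.1, 0.1), (1.2, 0.3), (2.4, 0.5), (3.1, 1.2)]
t2 = [(0.3, 0.2), (1.1, 0.4), (2.2, 0.6), (3.3, 1.1)]
t3 = [(0.2, 3.1), (0.4, 2.2), (0.5, 1.1), (0.6, 0.2)]
regions = [regionize(t) for t in (t1, t2, t3)]
D = [[edit_distance(a, b) for b in regions] for a in regions]
print(D)  # this distance matrix can feed any distance-based clustering step
```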
The increasing availability of huge amounts of thin data, i.e. data pertaining to time and positions generated by different sources with a wide variety of technologies (e.g., RFID tags, GPS, GSM networks), leads to large spatio-temporal data collections. Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks and supply chain management. In this paper, we address the clustering of spatial trajectories. In the context of trajectory data, this problem is even more challenging than for classical transactional data, as here we deal with data (trajectories) in which the order of items is relevant. We propose a novel approach based on a suitable regioning strategy and an efficient clustering technique based on edit distance. Experiments performed on real world datasets have confirmed the efficiency and effectiveness of the proposed techniques.
Radio Frequency Identification (RFID) applications are emerging as key components in object tracking and supply chain management systems, since in the near future almost every major retailer will use RFID systems to track the shipment of products from suppliers to warehouses. Due to the streaming nature of RFID readings, large amounts of data are generated by these devices at high production rates. This phenomenon is even more relevant since RFID tags are so cheap that every individual item can be tagged, thus leaving a "trail" of data as it moves across different locations. This scenario raises new challenges in effectively and efficiently exploiting such large amounts of data. In this paper we address the problem of compressing RFID data in order to enable devices with a limited amount of available memory (such as PDAs) to issue queries on RFID warehouses. In particular, we designed a lossy strategy for collapsing tuples carrying information about items being delivered at different locations of the supply chain.
The widespread use of the PDF format for exchanging print-oriented documents raises new challenges in the research field of information extraction. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Objects in a PDF document are accessible by their position, thus we exploit spatial constraints for driving the extraction of relevant information according to a set of group type definitions. Moreover, using fuzzy logic based conditions enables effectively handling uncertainty in the comprehension of the layout structure of PDF documents. The experimental results shown in the paper demonstrate the good accuracy of our PDF wrapping system.
Radio Frequency Identification (RFID) applications are emerging as key components in object tracking and supply chain management systems. In the near future almost every major retailer will use RFID systems to track the shipment of products from suppliers to warehouses. Given the features of RFID readings, this will result in a huge amount of information generated by such systems once costs reach a level at which each individual item can be tagged, thus leaving a trail of data as it moves through different locations. We define a technique for efficiently detecting anomalous data in order to prevent problems related to inefficient shipment or fraudulent actions. Since items usually move together in large groups through distribution centers and only in stores do they move in smaller groups, we exploit such a feature in order to design our technique. The preliminary experiments show the effectiveness of our approach.
Radio Frequency Identification (RFID) applications are emerging as key components in object tracking and supply chain management systems. In the near future almost every major retailer will use RFID systems to track the shipment of products from suppliers to warehouses. Given the features of RFID readings, this will result in a huge amount of information generated by such systems once costs reach a level at which each individual item can be tagged, thus leaving a trail of data as it moves through different locations. We define a technique for efficiently detecting anomalous data in order to prevent problems related to inefficient shipment or fraudulent actions. Since items usually move together in large groups through distribution centers and only in stores do they move in smaller groups, we exploit such a feature in order to design our technique. The preliminary experiments show the effectiveness of our approach.
The PDF format represents the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. The proposal is based on a novel bottom-up wrapping approach to extract information tokens and integrate them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions which impose a target structure to token groups containing the required information. Due to the intrinsic uncertainty on the structure and presentation of PDF documents, we devise constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation working in polynomial time with respect to the size of a PDF document.
In this paper we propose an architecture that exploits the structural information of web pages for the extraction of relevant information from them. In this architecture, a primary role is played by a distance-based classification methodology, based on an efficient and effective technique for detecting structural similarities among semistructured documents which significantly differs from standard methods based on graph-matching algorithms. The technique is based on the idea of representing the structure of a document as a time series in which each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can hence state the degree of similarity between documents. Experiments on real data show the effectiveness of the proposed technique.
A distributed system for approximate query answering on sensor network data is proposed, where a suitable compression technique is exploited to represent data and support query answering. Each node of the system stores either detailed or summarized sensor readings. Query answers are computed by identifying the set of nodes that contain (either compressed or not) data involved in the query, and eventually partitioning the query in a set of sub-queries to be evaluated at different nodes. Queries are partitioned according to a cost model aiming at making the evaluation efficient and guaranteeing the desired degree of accuracy of query answers.
We introduce a technique based on data mining algorithms for classifying incoming messages, as a basis for an overall architecture for maintenance and management of e-mail messages. We exploit clustering techniques for grouping structured and unstructured information extracted from e-mail messages in an unsupervised way, and exploit the resulting algorithm in the process of folder creation (and maintenance) and e-mail redirection. Some initial experimental results show the effectiveness of the technique, both from an efficiency and a quality-of-results viewpoint.
In this paper we present a new technique for detecting changes on the Web. We propose a new method to measure the similarity of two documents, that can be efficiently used to discover changes in selected portions of the original document. The proposed technique has been implemented in the CDWeb system providing a change monitoring service on the Web. CDWeb differs from other previously proposed systems since it allows the detection of changes on portions of documents and of specific changes expressed by means of complex conditions, e.g. users might want to know if the value of a given stock has increased by more than 10%. Several tests on stock exchange and auction web pages proved the effectiveness of the proposed approach.
Data mining algorithms generally deal with very large data sets that do not fit in main memory. Therefore, techniques that manage huge data sets need to be developed. Any algorithm that is proposed for mining data should account for out-of-core data structures. However, most of the existing algorithms have not yet addressed this issue. In this paper we describe the implementation of an out-of-core technique for the analysis of very large data sets with the sequential and parallel versions of the clustering algorithm AutoClass. We discuss the out-of-core technique and show performance results in terms of execution time and speed-up.
Often web users want to be notified when specific information contained in a web page has been modified. The problem of detecting web document changes has been deeply investigated, and several systems providing notification of web page changes are available. However, these systems do not provide notification of changes to specific information contained in a web page. In this work we present a system, called CDWeb, that performs this task. It allows users to monitor a whole document or specific portions of it. Users can also specify what kind of changes they are interested in, such as structural changes or semantic changes. The system provides a flexible and adaptive view of the Web: it tracks user queries and creates user profiles, in order to associate a personalized view to each user.
The recent rapid growth in the ability to generate and store data, enabled by more powerful Database Management Systems and hardware architectures, leads to a question: how can we take advantage of this large amount of information? Traditional methods for querying and reporting are inadequate because they can only manipulate data, and the information content derived is very low. Obtaining new relationships among data and new hypotheses about them is the aim of Knowledge Discovery in Databases (KDD), which makes use of Data Mining techniques. These techniques have interesting applications for business data such as market basket analysis, financial resource planning, fraud detection and the scheduling of production processes. In this work we consider the application of Data Mining techniques to the analysis of the balance-sheets of Italian companies.
In this paper we propose the combined use of different methods to improve the data analysis process. This is obtained by combining inductive and deductive techniques. Inductive techniques are used for generating hypotheses from data whereas deductive techniques are used to derive knowledge and to verify hypotheses. In order to guide users in the analysis process, we have developed a system which integrates deductive tools, data mining tools (such as classification algorithms and feature selection algorithms), visualization tools and tools for the easy manipulation of data sets. The system developed is currently used in a large project whose aim is the integration of information sources containing data concerning the socio-economic aspects of Calabria and the analysis of the integrated data. Several experiments on socio-economic indicators of Calabrian cities have shown that the combined use of different techniques improves both the comprehensibility and the accuracy of models.
Teaching activity has been carried out at three universities: Università della Calabria, Università Magna Graecia and Università Federico II di Napoli.
Università Federico II di Napoli
Università Federico II di Napoli
Università Federico II di Napoli
Università Federico II di Napoli
Università Federico II di Napoli
Università della Calabria, Faculty of S. M. F. N.
Università Magna Graecia
Università Magna Graecia
Università Magna Graecia
Università della Calabria
Università della Calabria
Università della Calabria
Università della Calabria
Università della Calabria
Università della Calabria
Università Magna Graecia
Università Magna Graecia
Università Magna Graecia
Università Magna Graecia
Università Magna Graecia
Università Magna Graecia
I would be happy to talk to you if you need my assistance. Since my time for students is limited, please contact me in advance.
Napoli: DIETI - via Claudio 21, Palazzina 3/A, room 4.05
Rende: ICAR-CNR - via P. Bucci 9C, second floor
I am at my office every day from 9:30 am until 7:00 pm, but you may want to call or email in order to fix an appointment.
Sometimes you can find me at my spin-off Coremuniti, located at TechNest, piazza Vermicelli, Arcavacata di Rende, Cosenza.
My lab is located close to my office, on the same floor
Please refer to laboratory staff