WikiISI/Data/Text/Web Mining

From AI Wiki


Data/Text/Web Mining in ISI


Abstract

In the aftermath of the terrorist attacks on the World Trade Center and the Pentagon, data mining has become one of the key features of many homeland security initiatives. Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets[1][2]. Text mining and web mining deal with textual and web data sources respectively, and they share many similarities with data mining; sometimes they are even treated as special cases of it. At the same time, various recommendations and efforts have been made to improve information sharing among government entities at all levels within the United States, the private sector, and certain foreign governments, with a view to countering terrorists and strengthening homeland security[3]. Without information sharing, the findings produced by data mining cannot be shared among agencies, and collaboration and coordinated action among those agencies would be seriously affected. In this chapter, we first introduce a few important concepts in data/text/web mining and present a literature review of current research on data/text/web mining for ISI. We then discuss research and system design, followed by three representative case studies.

Introduction

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets[1][2]. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction. Text mining and web mining deal with textual and web data sources respectively, and they share many similarities with data mining. Sometimes they are even treated as special cases of data mining.

In the aftermath of the terrorist attacks on the World Trade Center and the Pentagon, data mining has become one of the key features of many homeland security initiatives. Often used as a means of detecting fraud, assessing risk, and retailing products, data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets[4]. At the same time, various recommendations and efforts have been made to improve information sharing among government entities at all levels within the United States, the private sector, and certain foreign governments, with a view to countering terrorists and strengthening homeland security[3]. Without information sharing, the findings produced by data mining cannot be shared among agencies, and collaboration and coordinated action among those agencies would be seriously affected.

In section 2, we introduce a few concepts and review prior literature on ISI data/text/web mining as well as information sharing and collaboration. In section 3, we present a conceptual framework for information sharing and collaboration along with a generic framework for data/text/web mining. Finally, in section 4, we discuss three case studies: COPLINK, Dark Web, and TIA.

Overview of the Field

Important Concepts

Information Sharing and Collaboration

Information Sharing and Collaboration is a common information requirement which has been simply defined as the right information in the right amount in the right place at the right time. It is the sine qua non that enables success in homeland security[5].

  • Information Sharing: for homeland security, information sharing enables the interchange of terrorism information among appropriate Federal, State, local, tribal, and territorial authorities, foreign partners, and the private sector. It will support the ability of agencies to acquire additional such information, and it will protect or enhance the freedom, information privacy, and other legal rights of Americans in the conduct of their activities.[6]
  • Collaboration: collaboration means different things to different people, but a commonly accepted definition is working together to provide a shared and improved result. Collaboration adds the richness of context, sharing and questioning, opposing viewpoints and considerations, legal, technical, and logistical limitations, rapid access to experts, and modeling and simulation insight, and it develops the common understanding that enables the components to operate in a unified manner.[5]

Data Mining

Data Mining is similar to text mining and uses many of the same techniques. However, data mining operates on highly structured data, such as the data stored in relational databases. We can thus think of the two as closely related, differing mainly in the types of data they handle and in how structured that data is. Knowledge discovery from databases (KDD) [7] is a process that includes data mining steps. Common data mining techniques are as follows:

  • Classification puts the pieces of information extracted from documents or other sources into common categories based on their contents. This can be done either with supervised methods that use available training data or with unsupervised methods in which the system must classify without reference to external training data.
  • Association Rule Mining finds rules in which one or more data items, called antecedents, are associated with other data items, called consequents, with a certain probability, called the confidence factor of the rule. [8]
  • Clustering is a type of unsupervised learning. It groups similar data items into clusters without knowing their class membership, using hierarchical or partitioning methods. Hierarchical methods build clusters from previously established clusters, either splitting large clusters into smaller ones (divisive) or merging small clusters into larger ones (agglomerative). In contrast, partitioning methods find all clusters at once. [9]
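
The confidence factor mentioned under Association Rule Mining can be illustrated with a minimal sketch. It assumes the common estimate confidence(A → C) = support(A ∪ C) / support(A); the function names and the toy incident records are hypothetical and are not taken from the cited sources.

```python
from typing import FrozenSet, Sequence


def support(transactions: Sequence[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    transactions: Sequence[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Estimate how often the consequent appears when the antecedent does."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0  # the rule cannot be evaluated without any matching record
    return support(transactions, antecedent | consequent) / base


# Hypothetical toy data: each set lists attributes recorded for one incident.
RECORDS = [
    frozenset({"vehicle_theft", "night", "downtown"}),
    frozenset({"vehicle_theft", "night"}),
    frozenset({"burglary", "night", "downtown"}),
    frozenset({"vehicle_theft", "day"}),
]

# Rule under test: {vehicle_theft} -> {night}
rule_conf = confidence(RECORDS, frozenset({"vehicle_theft"}), frozenset({"night"}))
print(f"confidence = {rule_conf:.2f}")
```

In this toy data, three records contain the antecedent and two of those also contain the consequent, so the rule holds in two out of three cases. Practical systems usually also apply a minimum support threshold so that rules backed by very few records are discarded.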

Text Mining

Text Mining concerns the automated discovery of new and relevant information from textual sources. This discovery can happen with statistical and linguistic methods. The ultimate goal of text mining is always to expand the knowledge of the user and to help in decision making. [10]

  • Information Extraction distills structured data from unstructured text using natural language processing techniques such as Named Entity Recognition, Coreference Identification, and Terminology Extraction. Terminology Extraction is a technique used in Information Retrieval that relies on bag-of-words models and feature vectors based on the occurrence frequencies of terms in a document. [11] Named Entity Recognition locates simple text elements and classifies them into predefined categories. In the field of ISI, the concept of a named entity was first defined at the Message Understanding-7 conference [12]. It is also used as a component of the Arizona TerrorNet tool. [13] Coreferences are used in text to avoid repeating the same nouns; Coreference Identification determines which expressions refer to the same entity. In the field of cognitive science, Hobbs founded the concept. [14]
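
The bag-of-words representation described above can be sketched as follows. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer and raw term counts as features; the function names and sample documents are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(doc: str) -> Counter:
    """Bag of words: term order is discarded, only occurrence counts remain."""
    return Counter(tokenize(doc))


def feature_vector(doc: str, vocabulary: List[str]) -> List[int]:
    """Map a document to a count vector aligned with a fixed vocabulary."""
    counts = term_frequencies(doc)
    return [counts[term] for term in vocabulary]


# Hypothetical sample documents.
DOCS = [
    "The suspect met the courier downtown.",
    "The courier left the city.",
]

# A shared, sorted vocabulary gives every document a vector of the same length.
vocabulary = sorted({tok for d in DOCS for tok in tokenize(d)})
vectors = [feature_vector(d, vocabulary) for d in DOCS]

for doc, vec in zip(DOCS, vectors):
    print(vec, doc)
```

Raw counts are the simplest possible weighting. Retrieval systems often rescale them, for example with TF-IDF, so that terms occurring in nearly every document carry less weight.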

The key distinction between text mining and information extraction is that information extraction aims to extract information from text and place it into a predefined schema. In contrast, text mining seeks to discover new information and patterns from unstructured text.

Web Mining

Web Mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining, and Web structure mining.[15]
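
As a rough illustration of one of these types, Web structure mining analyzes the hyperlinks between pages. The sketch below counts in-links over a small link graph; the graph, the page names, and the `in_link_counts` helper are hypothetical, and in-link counting is only one simple structural measure chosen for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical link graph: each page maps to the pages it links to.
LINKS: Dict[str, List[str]] = {
    "a.example/home": ["b.example/forum", "c.example/news"],
    "b.example/forum": ["c.example/news"],
    "d.example/blog": ["c.example/news", "b.example/forum"],
    "c.example/news": [],
}


def in_link_counts(links: Dict[str, List[str]]) -> Counter:
    """Count how many distinct pages link to each page (self-links ignored)."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # count each source at most once per target
            if target != source:
                counts[target] += 1
    return counts


# Pages ordered from most to least linked-to.
ranking = in_link_counts(LINKS).most_common()
for page, count in ranking:
    print(count, page)
```

By comparison, Web content mining would work on the text of each page, and Web usage mining on logs of how visitors navigate between pages.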
