Data/text/web mining
Abstract
Scientists today face the challenge of information overload. Biomedical research has rapidly produced a large amount of data in various forms, including text data, genomic map data, genomic sequence data, and genomic expression data. Hospitals generate vast amounts of patient information in both structured and unstructured forms. Large collections of news on public health and disease outbreaks, gathered from websites worldwide, are available for epidemiologists to study. To address this challenge, scientists have applied data/text/web mining techniques to help them discover knowledge effectively. Data/text/web mining has been used to organize unstructured data and make it accessible to other applications, to predict gene function, and to discover patterns that lead to new hypotheses. In this chapter, we introduce a few concepts in data/text/web mining that have been used in the biomedical domain, such as statistical machine learning, rule-based approaches, and cluster analysis. We also present a review of current research on data/text/web mining for biomedicine and its challenges, discuss general research design, and conclude with case studies.
Introduction
Scientists today face the challenge of information overload. Biomedical research has rapidly produced a large amount of data in various forms, including text data, genomic map data, genomic sequence data, and genomic expression data. For instance, the PubMed collection, made available through the National Library of Medicine (NLM), contains information on over 12 million articles and continues to grow at a rate of 2,000 articles per week. The National Center for Biotechnology Information (NCBI) also provides more than 2 million gene records covering more than 155 thousand species.[1]
In addition to the research literature, hospitals generate vast amounts of patient information in both structured and unstructured forms. Narrative clinical notes such as operative reports, discharge summaries, and nursing notes are increasingly available in electronic form, either through transcription or direct data entry[2]. In 2006, the WebMD Health Network had an average of more than 31 million unique monthly users and generated over 3 billion aggregate page views (WebMD Annual Report, 2006). Such data can be used to enhance disease surveillance, adverse event detection, and care quality assessment.
To address this challenge of information overload, scientists have applied data/text/web mining techniques to help them discover knowledge effectively from large amounts of data. Data/text/web mining has been used to organize unstructured data, e.g. patient information, and make it accessible to other applications. More structured data, such as genomic sequences, has been processed using statistical methods to predict gene and protein functions. Patterns discovered statistically in genomic expression data have also led to the generation of new hypotheses.
In the next section, we introduce a few concepts and approaches in biomedical data/text/web mining and review current research and its challenges. We then discuss general research design and conclude with two case studies.
Overview of the Field
Important Concepts
Data mining in the biomedical domain is concerned with discovering knowledge from biomedical data in various forms, including text, genomic maps, genomic expression, and genomic sequences. It is specifically called text mining when the data is text in natural language form, and web mining when the data is extracted from websites, which might include web usage, web content, and web structure data.
Approaches to Data/text/web mining
We classify approaches to data/text/web mining into 2 major categories: knowledge-based approaches and statistical approaches.
Knowledge-based approaches
Knowledge-based approaches in data mining rely on human-generated rules and dictionaries. For instance, named entity recognition studies that use a knowledge-based approach usually require human-developed grammar rules to indicate the location of candidate phrases. This approach may require only a small amount of training data to validate the rules and dictionaries. However, developing and maintaining such a system can be time-consuming and expensive in the long run, and a knowledge-based system can be hard to adapt to a similar problem in another area. On the other hand, the performance of systems using this approach, in terms of sensitivity and specificity, is usually high. In text mining, studies that used a knowledge-based approach include Fukuda et al. (1998)[3], Narayanaswamy et al. (2003)[4], and Koike et al. (2005)[5].
Statistical approaches
Unlike knowledge-based approaches, statistical approaches aim to reduce the human effort required to maintain rules and dictionaries of biomedical terms. In this section, we present Bayes' theorem, cluster analysis, machine learning with point-wise classification, and machine learning with graphical models, all of which have been used in biomedical data/text/web mining.
Bayes' Theorem
Many statistical approaches rely on Bayes' theorem:
Pr(H | D) = Pr(D | H)Pr(H) / Pr(D)
where D is the data and H is a model. Pr(H | D) is the posterior probability of the model given the data D, Pr(H) is the prior probability of the model, and Pr(D) is the probability of the data. The key data-dependent term Pr(D | H) is the likelihood, and is sometimes called the evidence for model H. In mining applications, the given data is usually some pattern of text or graphics, which is compared with the available data to determine which data should be mined.[6]
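As a worked illustration of the theorem, the following minimal sketch computes a posterior for a binary hypothesis. The function name `posterior` and all of the numbers are illustrative assumptions, not values taken from the cited work.

```python
def posterior(prior_h, pr_d_given_h, pr_d_given_not_h):
    """Return Pr(H | D) for a binary hypothesis using Bayes' theorem."""
    # Pr(D) by the law of total probability over H and not-H.
    pr_d = pr_d_given_h * prior_h + pr_d_given_not_h * (1.0 - prior_h)
    return pr_d_given_h * prior_h / pr_d


# Hypothetical numbers: H = "the document is about a given gene",
# D = "the document mentions a given keyword".
p = posterior(prior_h=0.01, pr_d_given_h=0.9, pr_d_given_not_h=0.05)
print(round(p, 3))  # 0.154
```

Even with a likelihood of 0.9, the posterior stays low here because the assumed prior is small, which is why the prior term matters when mining for rare patterns.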
Cluster analysis
Cluster analysis has been widely used for microarray data analysis. This method partitions a data set into subsets (clusters) whose members have similar features. It is also regarded as unsupervised learning in the context of machine learning. Methods commonly used alongside cluster analysis include principal component analysis (PCA) and factor analysis, which are based on the distribution of the data. They are frequently effective with mixed populations and non-ideal data, and subpopulations can be seen when the components are plotted in lower dimensions. The result of a hierarchical cluster analysis can be visualized as a tree called a dendrogram.[7]
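To make the idea of partitioning by similarity concrete, here is a minimal sketch of agglomerative (bottom-up) clustering with single linkage, the kind of procedure whose merge order a dendrogram depicts. The function name `single_linkage` and the toy two-dimensional "expression profiles" are assumptions made for illustration. Real microarray analyses would typically use an established statistics library.

```python
def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def dist(a, b):
        # Euclidean distance between two points, given by index.
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    def cluster_dist(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical profiles: three lie near the origin, two lie near (5, 5).
profiles = [(0.1, 0.2), (0.0, 0.3), (5.0, 5.1), (5.2, 4.9), (0.2, 0.1)]
groups = single_linkage(profiles, 2)
print(groups)
```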
Machine learning with point-wise classification
Machine learning with point-wise classification has been used in cancer classification. This family of methods includes Naïve Bayes (NB), Support Vector Machines (SVM), Maximum Entropy (ME), K-Nearest Neighbor (KNN), decision trees, neural networks, and inductive rule learning. These methods try to make the best local decision by minimizing the classification error at each data point. They have been widely used in relation extraction and text classification studies.[8]
The Naïve Bayes model is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. In the joint probability model, each feature Fi is assumed to be conditionally independent of every other feature Fj (j ≠ i) given the class. In real-world cases, it often works well in spite of this simplifying assumption.
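The independence assumption means the class score is simply the class prior multiplied by one term per feature. The sketch below shows this for a toy text classification task. The function names, the add-one smoothing, and the four tiny training documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (tokens, label) pairs. Returns the count tables."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Each feature contributes independently (add-one smoothing).
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
training = [
    (["gene", "expression", "tumor"], "cancer"),
    (["tumor", "mutation", "gene"], "cancer"),
    (["fever", "cough", "outbreak"], "surveillance"),
    (["outbreak", "influenza", "fever"], "surveillance"),
]
model = train_naive_bayes(training)
print(classify(["tumor", "gene"], model))
```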
Support Vector Machine methods create functions from a set of labeled training data. The function can be a classification function (the output is binary: whether or not the input belongs to a category) or a general regression function. These methods are powerful but are often treated as “black boxes”. Support Vector Machines non-linearly map their n-dimensional input space into a high-dimensional feature space, in which a linear classifier is then constructed.
Neural networks are learning methods that start from a particular assumption about the relations they will be able to find in the given data. The researcher trains them, in a supervised or unsupervised manner, anticipating the kind of data expected in the application. In gene-relation applications, the genes or RNAs whose relations need to be extracted are represented as nodes (neurons), and the relations between them as weighted connections. The numerical weights are initialized and then continually adjusted during training (the weights on an activated neuron and its neighborhood are increased while the others are decreased), so that they finally reflect the importance and strength of the relationship, or the effect of one variable (gene or process) on another.
Common types include back-propagation/feed-forward models; Kohonen’s self-organizing map, which provides unsupervised learning, clustering, and pattern recognition; and the Hopfield network, which specializes in search and optimization applications. These models offer predictive power and classification accuracy.
Machine learning with graphical model
When the data mining problem is framed as a pairing between an input sequence and a label sequence, machine learning with graphical models can be applied. For instance, the named entity recognition problem is framed as a pairing between the sequence of words in a sentence and a sequence of tags. This family of methods includes Hidden Markov Models (HMMs), linearly interpolated HMMs, back-off HMMs, Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs).
Specifically, CRFs are a state-of-the-art sequence labeling technique. They can trade off decisions at different sequence positions to obtain a globally optimal labeling. The CRF framework has already been used to obtain promising results in a number of natural language processing tasks, including tagging, parsing, and information extraction. Settles (2004)[9] used CRFs to recognize biomedical named entities. CRFs are presented in more complete detail by Lafferty et al. (2001)[10].
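The idea of choosing a globally optimal label sequence, rather than the best label at each position, can be sketched with Viterbi decoding over a simple HMM, the most basic of the graphical models listed above. This is not a CRF. The two tags, the probability tables, and the fallback probability for unseen words are all hypothetical values chosen for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], unseen), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, unseen), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical model: tag each word as part of a gene name (GENE) or not (O).
states = ["GENE", "O"]
start_p = {"GENE": 0.2, "O": 0.8}
trans_p = {"GENE": {"GENE": 0.2, "O": 0.8}, "O": {"GENE": 0.3, "O": 0.7}}
emit_p = {
    "GENE": {"BRCA1": 0.8, "p53": 0.2},
    "O": {"the": 0.3, "gene": 0.2, "is": 0.3, "mutated": 0.2},
}
tags = viterbi(["the", "BRCA1", "gene", "is", "mutated"],
               states, start_p, trans_p, emit_p)
print(tags)
```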
Data/Text/Web mining Applications
Data mining application for Biomedicine
In recent years, experimental research in biomedicine has produced large amounts of data. There is an urgent need for efficient methods of processing and understanding this data. This has resulted in the rapid development of an interdisciplinary research area that applies data mining techniques to the analysis of biomedical datasets. Data mining techniques play a crucial role in analyzing and integrating these large datasets, as well as in discovering the biological processes underlying the data. We can categorize the application areas of data mining into three types: genomic data mining, clinical data mining, and public health/biosurveillance data mining.
- Genomic data mining
- Genomic data mining examines structured biological data, i.e. genomic maps, genomic expression, and genomic sequences. Genomic map data identify the position of a gene on a chromosome or on the DNA itself, which helps in identifying human disease genes and mutations. Many disease-related genes are found by linkage to chromosomal regions. For example, chromosomal aberrations have been found to be associated with cancer[11]. The main source of genomic data is the National Center for Biotechnology Information (NCBI).
- Clinical data mining
- Clinical data mining examines clinical measurements used for diagnosis, e.g. of cancer or neurological disease. In cancer diagnosis, data mining has been used to predict breast cancer survivability, as in the work of Delen et al. (2005)[12]. The source of cancer diagnosis data is the SEER Cancer Incidence Public-Use Database. The SEER data files can be requested through the Surveillance, Epidemiology, and End Results (SEER) web site (http://www.seer.cancer.gov). The SEER Program is a part of the Surveillance Research Program (SRP) at the National Cancer Institute (NCI) and is responsible for collecting incidence and survival data from the nine participating registries, and for disseminating these datasets (along with descriptive information about the data itself) to institutions and laboratories for the purpose of conducting analytical research[13].
- In neurology, data mining has been used to recognize the presence and severity of motor fluctuations in patients with Parkinson’s disease, such as the work of Bonato et al. (2004)[14]. The neurology data can be collected using ACC (accelerometer) and EMG (electromyographic) sensors which gather bioelectrical signals during standardized clinical tests including sitting, finger-to-nose, tapping, sit-to-stand, walking, and stand-to-sit.
- Public health/Biosurveillance data mining
- Nowadays, it is important to detect outbreaks of disease, whether natural or induced by bioterrorism. Governments need to detect outbreaks as early as possible in order to respond to them. Public health/biosurveillance data mining collects and processes a wide array of data sources. These data sources include chief complaints, emergency department (ED) visits, ambulatory visits, hospital admissions, triage nurse calls, 911 calls, work or school absenteeism data, veterinary health records, laboratory test orders, and health department requests for influenza testing, among others. For instance, one of the most established syndromic surveillance projects, the Real-time Outbreak Detection System (RODS), uses laboratory orders, dictated radiology reports, dictated hospital reports, poison control center calls, chief complaints data, and daily sales data for over-the-counter (OTC) medications for syndromic surveillance[15].
Text mining application for Biomedicine
Text mining is an emerging field in computational linguistics that has been applied to the biomedical domain. It refers to an automated process that uses a data-driven approach to extract knowledge from large text collections. These text collections may include scientific abstracts, full-text articles, and websites that contain the collective facts known about nearly all biomedical substances and functions that have ever been studied. Likewise, text mining can be applied to extract knowledge from clinical information. Biomedical text mining studies focus on six areas[8]:
- Named Entity Recognition
- The goal of named entity recognition (NER) is to recognize all of the substance names mentioned in text, for example, all of the drug names within a collection of journal articles, or all of the gene names and symbols within a collection of MEDLINE abstracts. NER is a basis for the further extraction of relationships and other information. This task is difficult because there is no complete dictionary for most types of biological named entities, and because the majority of biological names are ambiguous: the same word or phrase can refer to different things depending on context. Many biological entities also have several names. Currently, state-of-the-art gene and protein NER systems achieve F-scores between 75 and 85 percent.[8]
- Text Classification
- Text classification attempts to automatically determine whether a document or part of a document has particular characteristics of interest, usually based on whether the document discusses a given topic or contains a certain type of information. Text classification systems must automatically extract the features that help determine classes and apply those features to candidate documents using some kind of decision-making process. Accurate text classification systems can be especially valuable to database curators, who may have to review many documents to find a few that contain the kind of information they are collecting in their database. [8]
- Synonym and Abbreviation Extraction
- Most of the work on this type of extraction has focused on uncovering gene name synonyms and biomedical term abbreviations. An automated system that collects synonyms and abbreviations aids users doing literature searches, because many biomedical entities have multiple names and abbreviations. Furthermore, if all of the synonyms and abbreviations for an entity could be mapped to a single term representing the concept, other NLP tasks could be done more efficiently.
- Relationship Extraction
- For relationship extraction, the goal is to determine whether a relationship exists between a pair of biomedical entities. In the current genomic era, most studies have focused on relation extraction between genes and proteins. Some researchers use sentence patterns to extract relationships; others use co-occurrence and other similarity measures to estimate the probability of a relationship between a candidate pair. Grouping genes by functional relationships could potentially aid gene expression analysis and database annotation[16].
- Hypothesis Generation
- Hypothesis generation attempts to uncover relationships that are not present in the text but instead are inferred by the presence of other more explicit relationships. The goal is to uncover previously unrecognised relationships worthy of further investigation[8]. Practically all of the work in hypothesis generation makes use of large databases of scientific literature that allow discoveries to be made by connecting concepts using logical inference.
- Integration Frameworks
- Several research groups are developing integrated text-mining frameworks intended to be able to address a variety of user needs, e.g. to extract relationships between biomedical entities, to perform gene-based text profiling and clustering using the information contained on multiple on-line biological databases, and to locate and retrieve full text journal articles.
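The co-occurrence approach mentioned under Relationship Extraction can be sketched as follows: count how often two entities appear in the same sentence and normalize by how often each appears overall. The Dice coefficient used here is one of several possible similarity measures and is our choice for illustration. The sentences and the entity list are hypothetical, and the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(sentences, entities):
    """Score entity pairs by the Dice coefficient of sentence co-occurrence."""
    entity_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present = sorted(e for e in entities if e.lower() in tokens)
        entity_counts.update(present)
        pair_counts.update(combinations(present, 2))
    return {
        pair: 2 * n / (entity_counts[pair[0]] + entity_counts[pair[1]])
        for pair, n in pair_counts.items()
    }


# Hypothetical sentences, written without punctuation to keep tokenization trivial.
sentences = [
    "BRCA1 interacts with RAD51 in DNA repair",
    "RAD51 binds BRCA1",
    "TP53 regulates apoptosis",
    "BRCA1 and TP53 are tumor suppressors",
]
scores = cooccurrence_scores(sentences, ["BRCA1", "RAD51", "TP53"])
print(scores)
```

A high score only indicates that two entities tend to be mentioned together; it does not by itself establish the type or direction of the relationship.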
Web mining application for Biomedicine
Web mining for biomedicine is a new field in which few studies have been conducted. Data is extracted from websites, and might include web usage, web content, and web structure data.
Web usage mining applies data mining to analyze and discover interesting patterns in users' usage data on the web, in order to better understand and serve the needs of users or Web-based applications. The usage data records the user’s behavior when the user browses or makes transactions on the web site. Web usage data includes data from Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interactions. It is an activity that involves the automatic discovery of patterns from one or more Web servers. For example, Bracke (2004)[17] conducted an exploratory study on Web usage mining at an academic health sciences library, and Liu et al. (2006)[18] implemented a similar system at SenseLab, a Web-based neuroscience database system.
Web content mining is the process of discovering useful information from the content of web pages. Web content may consist of text, image, audio, or video data. Web content mining is sometimes called web text mining, because text content is the most widely researched area. The technologies normally used in web content mining are natural language processing (NLP) and information retrieval (IR). Multimedia data is also growing at an enormous rate. Because of its unstructured or semi-structured nature, it is difficult for researchers to extract information from it without advanced tools.
Web structure mining tries to discover the model underlying the link structures of the Web. According to the type of web structural data, web structure mining can be divided into two kinds. The model is based on the topology of the hyperlinks, with or without descriptions of the links. This model can be used to categorize Web pages and is useful for generating information such as the similarity and relationships between different Web sites. Web structure mining can be used to discover authoritative sites for a subject (authorities) and overview sites that point to many authorities on that subject (hubs).
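The notions of hubs and authorities can be illustrated with a small iterative computation in the style of the HITS algorithm: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph, page names, and iteration count below are hypothetical, and this is a sketch, not a production implementation.

```python
def hits(links, iterations=50):
    """links maps each page to the pages it links to. Returns (hub, authority)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to p.
        auth = {p: sum(hub[q] for q, targets in links.items() if p in targets)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages that p links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph of health-related pages.
links = {
    "portal": ["siteA", "siteB"],
    "directory": ["siteA", "siteB", "siteC"],
    "siteC": ["siteA"],
}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```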
The major sources of data for web mining in biomedicine include WebMD and ClinicalTrials.gov. WebMD is a comprehensive site that provides information and services for physicians, consumers, providers, and insurance professionals to help them navigate the complexity of today’s health care system. Its information, products, and services help to streamline medical processes, thereby reducing medical costs. ClinicalTrials.gov offers up-to-date information for locating federally and privately supported clinical trials for a wide range of diseases and conditions. It covers medical studies involving humans. Clinical trials show the cause-and-effect relationship between the variables and the outcome. The site contains more than 36,100 clinical studies sponsored by the National Institutes of Health, other federal agencies, and private industry[19].
Other websites in the e-health domain include revolutionhealth.com, Medscape.com, doctor.com, doctorslink.com, drkoop.com, healthanswers.com, americasdoctor.com, and intelihealth.com.
Research/System Design
Data/text/web mining systems strive to simultaneously analyze data from multiple sources and formats, filter through the evidence, identify implicit connections that have potential, and then ensure that these are novel in the sense that they have not yet been explicitly addressed. Such systems usually consist of four main components.
Preprocessing
At the preprocessing stage, data in various forms are collected, filtered, and cleansed. In web mining, the data can be collected using spidering. If the data is in a relational database, a query must be written to identify the target collection of interest. The data collection is then filtered and cleansed to ensure validity and consistency.
- De-identification
De-identification is the process of removing or altering data in a medical record that could be used to identify the patient. It is a technique employed to allow research, training, or other non-clinical applications to use real medical data without compromising patient privacy. According to HIPAA, once 18 specified identifiers are deleted, the restrictions and requirements of federal and state privacy laws no longer apply. There are two methods of de-identification: 1) the use of statistical methods proven to render information not individually identifiable, and 2) the deletion of the 18 specified identifiers.
Data Transformation
In this stage, features are extracted from the data collection. Feature extraction involves reducing the amount of resources required to describe a large set of data accurately. Feature selection is the technique of selecting a subset of relevant features for building robust learning models. Analysis with a large number of variables generally requires a large amount of memory and computation power. Feature extraction and feature selection address this problem.
Researchers choose features based on the research objective and the data set. For instance, in biomedical text mining, researchers may be interested in part-of-speech tags, bags of words, and certain trigger words in the text. On the other hand, a data mining researcher who works on cancer diagnosis may be interested in the age, gender, weight, height, and cholesterol level of patients.
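As a small example of feature extraction for text, the sketch below turns documents into bag-of-words count vectors over a shared vocabulary. The function name, the two example documents, and the lowercase whitespace tokenization are assumptions for illustration. A real pipeline would usually add tokenization rules, stop-word removal, or feature selection.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors), one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


# Hypothetical documents.
vocabulary, vectors = bag_of_words([
    "gene expression in tumor cells",
    "tumor suppressor gene",
])
print(vocabulary)
print(vectors)
```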
Data Mining/Pattern Discovery
At this stage, the data is partitioned into a training set, a test set, and sometimes a tuning set, if needed. (Data partitioning is not needed for cluster analysis.) The model is trained (learned) from the training set. That model is then applied to the test set for evaluation.
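A minimal sketch of such a partition is shown below. The 70/10/20 split, the fixed random seed, and the function name `partition` are assumptions chosen for illustration. The appropriate proportions depend on the study.

```python
import random


def partition(records, train_frac=0.7, tune_frac=0.1, seed=0):
    """Shuffle the records and split them into train, tuning, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_tune = round(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]  # everything that remains
    return train, tune, test


train, tune, test = partition(range(20))
print(len(train), len(tune), len(test))  # 14 2 4
```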
Interpretation/Evaluation
Further interpretation can also be done using visualization tools such as the self-organizing map (SOM). The performance of the data mining can be evaluated by analyzing the results on the test set in terms of precision, recall, and F-measure.
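These measures can be computed from the counts of true positives, false positives, and false negatives, as in the sketch below. Precision is TP / (TP + FP), recall is TP / (TP + FN), and the balanced F-measure is their harmonic mean. The gold and predicted labels are hypothetical.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the given positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical test-set labels: 3 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
print(precision_recall_f1(gold, predicted))  # (0.75, 0.75, 0.75)
```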
Case Studies
We have selected two case studies to present here. The first performed data mining on clinical trial data. The second performed web mining using web usage data. Both studies were conducted by prominent researchers on real-life, large-scale datasets.
Case Study 1: Information Mining Over Heterogeneous and High-Dimensional Time-Series Data in Clinical Trials Databases (Altiparmak et al., 2006)
- Objective
- An effective analysis of clinical trials data involves analyzing different types of data, such as heterogeneous and high-dimensional time-series data. In this case study (Altiparmak et al., 2006)[20], the authors proposed a novel approach for information mining that involves two major steps: applying a data mining algorithm over homogeneous subsets of the data, and identifying common or distinct patterns in the information gathered in the first step. The objective is to identify a small set of analytes (substances in the blood or urine) that effectively models the state of normal health.
- Preprocessing
- Data set
- This study used industry-sponsored clinical trial data. The main division of the data was based on the drugs that were studied for marketing. For each drug, a sample of patients was selected from different regions, genders, and ages. There are more than 28,000 patients across the different drugs. Each patient was in only one study and was measured at a limited number of unequally spaced time points. At each visit (time point), the patient’s blood and urine samples were taken and measured for analytes such as hemoglobin, calcium, or phosphate. For a given drug, the set of evaluated analytes might differ across studies.
- Data Collection and Filtering
- According to the authors, "the subset of data taken as the input to the algorithm contains the patients that have at least k observations for each member of a set of analytes, which is determined as follows. First, for each analyte, the total number of patients that has at least k observations is calculated. Second, based on the numbers found in step one, a threshold is decided, and each analyte that passes the threshold test is selected as a member of analyte set. Patients that have at least k observations for each member of the previously selected analyte set are chosen. k was set to four after a set of tries in our experiments. The result of these tries validated two facts: 1) while k increases, the total number of analytes and patients in the subset decreases, and 2) while k decreases, the lengths of series become inadequate to analyze and compare."
- Data Transformation
- The authors proposed two new similarity distance metrics suited to the nature of the clinical trial data: mean-wise comparison (MWC) and slope-wise comparison (SWC). Given data points of the series X and Y, MWC takes four inputs: X, the mean of X, Y, and the mean of Y. If Xi and Yi are both greater than, or both less than, the mean of their own series, then the distance is set to 0; otherwise, the distance is set to 1. A fuzzy region is also built into the algorithm: if |(X/Mean of X) - (Y/Mean of Y)| is less than a threshold, then the distance is also set to 0. The SWC metric takes four inputs, x1, x2, y1, and y2, and compares the relationship between x1 and x2 with that between y1 and y2. There are five possible distances that can be assigned: 0, 0.25, 0.5, 0.75, and 1. The sum of the absolute values of x1 and x2 is used to find an artificial slope. These artificial slopes are compared to a positive threshold (pt) and a negative threshold (nt) in order to determine the distance between the two pairs.
- Data Mining
- Using these distance metrics, they cluster the series of attributes in each data source. The information is further mined to find groups of attributes that occur in all subsets. These are:
- 1) Selected Features set-1: Hematocrit, Neutrophils(%), Total Bilirubin, Globulin, SGOT(AST), BUN, Creatinine, Phosphorus; and
- 2) Selected Features set-2: Hematocrit, Total Bilirubin, Globulin, SGOT(AST), BUN, Creatinine, Phosphorus, Neutrophils(abs).
- Interpretation/Evaluation
- The results verified two biological panels (of blood analytes) that are already well known in the medical field. Besides the well-known groups, the authors also identified biological groups that are not commonly used. The proposed algorithm is general and can be applied to pharmaceutical and clinical data, as well as to other high-dimensional and heterogeneous data sets. It illustrates the interactions between blood analytes, which are crucial to understanding the human body.
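The mean-wise comparison (MWC) metric described in the Data Transformation step of Case Study 1 can be sketched as below. This follows only the description given in this chapter, not the authors' published algorithm in full. The fuzzy threshold value (0.05), the averaging of point distances into a single series distance, the function names, and the sample series are all our assumptions.

```python
def mwc_point(x, mean_x, y, mean_y, fuzzy_threshold=0.05):
    """Distance (0 or 1) between one pair of points, relative to the series means."""
    # Fuzzy region: the points sit at nearly the same position relative to their means.
    if abs(x / mean_x - y / mean_y) < fuzzy_threshold:
        return 0
    same_side = (x > mean_x and y > mean_y) or (x < mean_x and y < mean_y)
    return 0 if same_side else 1


def mwc_distance(series_x, series_y, fuzzy_threshold=0.05):
    """Average the point distances (assumed aggregation); means must be non-zero."""
    mean_x = sum(series_x) / len(series_x)
    mean_y = sum(series_y) / len(series_y)
    points = [mwc_point(x, mean_x, y, mean_y, fuzzy_threshold)
              for x, y in zip(series_x, series_y)]
    return sum(points) / len(points)


# Hypothetical analyte series measured at four visits.
x = [10, 12, 14, 16]
y = [2.0, 2.2, 3.0, 2.8]  # moves with x relative to its own mean
z = [3.0, 2.8, 2.0, 2.2]  # moves against x relative to its own mean
print(mwc_distance(x, y), mwc_distance(x, z))  # 0.0 1.0
```

Because each point is compared only with the mean of its own series, the metric is insensitive to the different units and scales of the analytes, which is what makes it suitable for heterogeneous data.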
Case Study 2: Automated Discovery of Patient-Specific Clinician Information Needs Using Clinical Information System Log Files (Chen and Cimino, 2003)
In this study, Chen and Cimino (2003)[21] applied their pattern discovery method to WebCIS (Web-based Clinical Information System) at New York Presbyterian Hospital (NYPH). WebCIS enables clinicians to browse the content of patients’ medical records. Log analysis helps identify patient-specific information needs. The results can be used to guide the design and development of relevant clinical information systems.
- Preprocessing
- Data set
- Logs record the actions of all users of WebCIS in chronological order. Each line in the logs consists of seven fields: timestamp, application name, username or user ID, client machine name or IP address, 7-digit medical record number (MRN), data type and action.
- Data Collection and Filtering
- Preprocessing tasks include de-identification, data cleaning, enrichment, and transformation. The authors de-identified the data by encrypting all usernames and MRNs using the MD5 hash function. Data cleaning removes duplicates and filters out unnecessary data.
- Data Transformation
- Most of the techniques view the log as a set of user sessions. The authors transformed their data into two types: 1) data containing only data types and actions, and 2) data containing data types with generalized subtypes and actions with generalized modifiers.
- Pattern Discovery
- The four pattern discovery techniques they have chosen to perform are:
- Descriptive statistics
- Path analysis: To identify the frequently visited pages in WebCIS
- Association rule generation: To relate data types that are most often referenced together in a user session, disregarding the order. They used the Apriori algorithm to identify items that commonly occur together.
- Sequential pattern discovery: To generate rules that take into account the order or sequence of data types in a user session.
- Interpretation/Evaluation
- Results from all of pattern discovery techniques indicate that WebCIS users commonly view laboratory and radiology results in a session. The frequent association rules and sequential patterns only give a general idea of the access patterns of users, e.g., abdominal ultrasonography (USG) results are commonly viewed after liver function test (LFT) results.
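The association rule step of Case Study 2 can be illustrated with a simplified pair-counting sketch that computes the support and confidence of item pairs within sessions. This is not the full Apriori algorithm used in the study, which extends the same idea to larger itemsets with candidate pruning. The sessions, data type names, and support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support=0.5):
    """Return (a, b, support, confidence of a -> b) for frequent item pairs."""
    item_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        items = sorted(set(session))  # order within a session is disregarded
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / len(sessions)
        if support >= min_support:
            rules.append((a, b, support, count / item_counts[a]))
    return rules


# Hypothetical sessions, each listing the data types viewed.
sessions = [
    ["lab", "radiology"],
    ["lab", "radiology", "notes"],
    ["lab"],
    ["radiology", "lab"],
]
print(frequent_pairs(sessions))  # [('lab', 'radiology', 0.75, 0.75)]
```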
Conclusions
Data/text/web mining helps scientists discover knowledge in large amounts of data. In this chapter, we presented a review of current research on data/text/web mining for biomedicine and its challenges, and discussed general research design. Data/text/web mining has been used to organize unstructured data and make it accessible to other applications, to predict gene function, and to discover patterns that lead to new hypotheses. In the future, we should see more studies that utilize online resources such as WebMD.
References
- [1] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/
- [2] Murff, H.J., A.J. Forster, J.F. Peterson, J.M. Fiskio, H.L. Heiman, and D.W. Bates (2003). "Electronically screening discharge summaries for adverse medical events." J Am Med Inform Assoc 10: 339-50.
- [3] Fukuda, K., A. Tamura, et al. (1998). "Toward information extraction: identifying protein names from biological papers." In Proceedings of Pac Symp Biocomput: 707-18.
- [4] Narayanaswamy, M., K. E. Ravikumar, et al. (2003). "A biological named entity recognizer." In Proceedings of Pac Symp Biocomput: 427-38.
- [5] Koike, A., Y. Niwa, et al. (2005). "Automatic extraction of gene/protein biological functions from biomedical text." Bioinformatics 21(7): 1227-36.
- [6] Duda, R.O., P.E. Hart, and N.J. Nilsson (1976). "Subjective Bayesian Methods for a Rule-Based Inference System." In Proceedings of Nat'l Computer Conf., Vol. 45, pp. 1075-1082.
- [7] Smith, L. (2005). "Chapter 20 Exploratory Genomic Data Analysis." Medical Informatics: Knowledge Management and Data Mining in Biomedicine. Springer.
- [8] Cohen, A.M. and W.R. Hersh (2005). "A survey of current work in biomedical text mining." Briefings in Bioinformatics 6: 57-71.
- [9] Settles, B. (2004). "Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets." In Proceedings of the COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA). Geneva, Switzerland.
- [10] Lafferty, J., A. McCallum, et al. (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." In Proceedings of 18th International Conf. on Machine Learning.
- [11] Mitelman, F., F. Mertens, and B. Johansson (1997). "A Breakpoint Map of Recurrent Chromosomal Rearrangements in Human Neoplasia." Nature Genet. 15: 417-474.
- [12] Delen, D., G. Walker, and A. Kadam (2005). "Predicting breast cancer survivability: A comparison of three data mining methods." Artificial Intelligence in Medicine 34(2): 113-127.
- [13] SEER Cancer Statistics Review. Surveillance, Epidemiology, and End Results (SEER) program. http://www.seer.cancer.gov
- [14] Bonato, P., D.M. Sherrill, D.G. Standaert, S.S. Salles, and M. Akay (2004). "Data Mining Techniques to Detect Motor Fluctuations in Parkinson's Disease." 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Francisco (California). September 1-5, 2004.
- [15] Tsui, F.C., J.U. Espino, V.M. Dato, P.H. Gesteland, J. Hutman, and M.M. Wagner (2003). "Technical Description of RODS: a Real-time Public Health Surveillance System." J Am Med Inform Assoc 10: 399-408.
- [16] Raychaudhuri, S., H. Schutze, and R.B. Altman (2002). "Using text analysis to identify functionally coherent gene groups." Genome Res. 12(10): 1582-1590.
- [17] Bracke, P.J. (2004). "Web usage mining at an academic health sciences library: an exploratory study." J Med Libr Assoc 92: 421-428.
- [18] Liu, N., L. Marenco, and P.L. Miller (2006). "ResourceLog: An Embeddable Tool for Dynamically Monitoring the Usage of Web-Based Bioscience Resources." J Am Med Inform Assoc 13: 432-437.
- [19] ClinicalTrials.gov - Information on Clinical Trials and Human Research Studies. http://www.clinicaltrial.gov/
- [20] Altiparmak, F., H. Ferhatosmanoglu, S. Erdal, and D.C. Trost (2006). "Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases." IEEE Transactions on Information Technology in Biomedicine 10(2): 254-263.
- [21] Chen, E.S. and J.J. Cimino (2003). "Automated discovery of patient-specific clinician information needs using clinical information system log files." Proc AMIA Symp. 2003.

