DATA WAREHOUSING & DATA MINING
Fundamentals of Data Mining:
What Motivated Data Mining? Why Is It Important?
Data mining has attracted attention due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has followed an evolutionary path in the development of the following functionalities: data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining).
Evolution of Database Technology
1960s: Data collection, database creation, IMS, and network DBMS.
Since the 1960s, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful database systems.
1970s: Relational data model, relational DBMS implementation.
Research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to relational database systems (where data are stored in relational table structures), data modeling tools, and indexing and accessing methods. This period also brought efficient methods for on-line transaction processing (OLTP).
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.).
Research and development on new and powerful database systems has been active since the 1980s. Application-oriented database systems, including spatial, temporal, multimedia, active, stream and sensor, and scientific and engineering databases, as well as knowledge bases and office information bases, have flourished.
1990s: Data mining, data warehousing, multimedia databases, and Web databases.
2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems.
What Is Data Mining?
Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
The term is actually a misnomer. Mining gold from rock is called gold mining, not rock mining, so “knowledge mining from data” would be more accurate; even so, the shorter term that carries both “data” and “mining” became the popular choice.
Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Knowledge Discovery (KDD) Process
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery is a process that consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures; see Section 1.5)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
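The seven steps can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, the "above-average spender" pattern, and the `min_total` threshold are all assumptions made for the example, not part of any standard KDD toolkit.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing (noisy) value.
store_a = [{"customer": "c1", "amount": 20.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": 35.0}, {"customer": "c3", "amount": 500.0}]


def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    return [r for source in sources for r in source]


def select(records, customers):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]


def transform(records):
    """Step 4: aggregate amounts per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def mine(totals):
    """Step 5: a trivial 'pattern': customers spending above the average."""
    avg = mean(totals.values())
    return {c: t for c, t in totals.items() if t > avg}


def evaluate(patterns, min_total):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}


def present(patterns):
    """Step 7: render the mined knowledge for the user."""
    return [f"{c} spent {t:.2f}" for c, t in sorted(patterns.items())]


cleaned = integrate(clean(store_a), clean(store_b))
selected = select(cleaned, {"c1", "c2", "c3"})
totals = transform(selected)
report = present(evaluate(mine(totals), min_total=100.0))
print(report)  # ['c3 spent 500.00']
```

In practice the process is iterative: the result of pattern evaluation often sends the analyst back to the selection or transformation step.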
Data Mining: On What Kind of Data?
Data mining should be applicable to
any kind of data repository, as well as to transient data, such as data
streams. Thus the scope of our examination of data repositories will include
relational databases, data warehouses, transactional databases, advanced
database systems, flat files, data streams, and the World Wide Web. Advanced
database systems include object-relational databases and specific
application-oriented databases, such as
spatial databases, time-series databases, text databases, and multimedia
databases.
1.3.1 Relational Databases
A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
A relational database is a collection of tables,
each of which is assigned a unique name. Each table consists of a set of attributes
(columns or fields)
and usually stores a large set of tuples (records
or rows). Each tuple
in a relational table represents an object identified by a unique key
and described by a set of attribute values. A semantic data
model, such as an entity-relationship (ER) data
model, is often constructed for relational databases.
1.3.2 Data Warehouses
A data
warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and that usually resides
at a single site. Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading, and periodic
data refreshing.
Figure shows the typical framework for construction and use of a data warehouse
for AllElectronics.
1.3.3 Transactional Databases
In
general, a transactional database consists
of a file where each record represents a transaction.
A
transaction typically includes a unique transaction identity number (trans
ID) and a list of the items
making up the transaction (such as items purchased in a store).
The transactional database may have additional tables associated with it, which
contain other information regarding the sale, such as the date of the
transaction, the customer ID number and so on.
1.3.4 Advanced Data and Information Systems and Advanced Applications
Various kinds of advanced data and information systems have emerged and are undergoing development to address the requirements of new applications.
The new database applications include handling spatial data (such as maps), engineering design data (such as the design of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), time-related data (such as historical records or stock exchange data), stream data (such as video surveillance and sensor data, where data flow in and out like streams), and the World Wide Web (a huge, widely distributed information repository made available by the Internet).
1.3.5 Object-Relational Databases
Object-relational
databases are constructed based on an
object-relational data model. This model extends the relational model by
providing a rich data type for handling complex objects and object orientation.
Conceptually, the object-relational data model inherits the essential concepts
of
object-oriented
databases, where, in general terms, each entity is
considered as an object.
Each
object has associated with it the following:
A set of variables
that describe the objects. These correspond to attributes in
the entity-relationship and relational models.
A set of messages
that the object can use to communicate with other objects, or
with the rest of the database system.
A set of methods,
where each method holds the code to implement a message. Upon receiving a
message, the method returns a value in response.
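The relationship between variables, messages, and methods can be illustrated with an ordinary Python class. This is a loose analogy rather than how an object-relational DBMS is implemented; the `Customer` class and the `send` helper are hypothetical names introduced for the example.

```python
class Customer:
    """An object with variables, and a method that implements a message."""

    def __init__(self, name, purchases):
        # Variables: describe the object (like attributes in the relational model).
        self.name = name
        self.purchases = list(purchases)

    # Method: the code that implements the "total_spent" message.
    def total_spent(self):
        return sum(self.purchases)


def send(obj, message, *args):
    """Deliver a message to an object by calling the method of the same name."""
    return getattr(obj, message)(*args)


alice = Customer("Alice", [20.0, 35.5])
print(send(alice, "total_spent"))  # 55.5
```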
1.3.6 Temporal Databases, Sequence Databases, and Time-Series Databases
A temporal database typically
stores relational data that include time-related attributes.
These
attributes may involve several timestamps, each having different semantics.
A sequence database stores
sequences of ordered events, with or without a concrete
notion of time.
A time-series
database stores sequences of values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly).
1.3.7 Spatial Databases and Spatiotemporal Databases
Spatial databases contain spatial-related information. Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases. Spatial data may be represented in raster
format, consisting of n-dimensional
bit maps or pixel maps.
A spatial database that stores
spatial objects that change with time is called a spatiotemporal database, from which
interesting information can be mined. For example, we may be able to group the
trends of moving objects and identify some strangely moving vehicles.
1.3.8 Text Databases and Multimedia Databases
Text
databases are databases that contain word
descriptions for objects. These word descriptions are usually not simple
keywords but rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, summary reports, notes,
or other documents. Text databases may be highly unstructured.
Multimedia databases store
image, audio, and video data. They are used in applications such as picture
content-based retrieval, voice-mail systems, video-on-demand systems, the World
Wide Web, and speech-based user interfaces that recognize spoken commands.
1.3.9 Heterogeneous Databases and Legacy Databases
A heterogeneous
database consists of a set of interconnected,
autonomous component databases. The components communicate in order to exchange
information and answer queries.
A legacy
database is a group of heterogeneous
databases that combines different kinds of data systems,
such as relational or object-oriented databases, hierarchical databases,
network databases, spreadsheets, multimedia databases, or file systems. The
heterogeneous databases in a legacy database may be connected by intra- or
inter-computer networks.
1.3.10 Data Streams
Many
applications involve the generation and analysis of a new kind of data, called stream data,
where data flow in and out of an observation platform (or window) dynamically.
Such data streams have the following unique features: huge
or possibly infinite volume, dynamically changing, flowing in and out in a
fixed order, allowing only one or a small number of scans, and demanding fast
(often real-time) response time.
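The single-scan constraint can be made concrete with a sliding-window computation that reads each value exactly once and keeps only a bounded window in memory. This is a minimal sketch; the function name, the window size, and the sample readings are assumptions chosen for illustration.

```python
from collections import deque


def sliding_window_means(stream, window_size):
    """Yield the mean of the most recent `window_size` values, reading each value once."""
    window = deque()
    running_sum = 0.0
    for value in stream:
        window.append(value)
        running_sum += value
        if len(window) > window_size:
            # Evict the oldest value so memory use stays bounded.
            running_sum -= window.popleft()
        yield running_sum / len(window)


readings = iter([10, 20, 30, 40])  # an iterator can be consumed only once
means = list(sliding_window_means(readings, window_size=2))
print(means)  # [10.0, 15.0, 25.0, 35.0]
```

Because the function is a generator, each result is available as soon as its input arrives, which suits streams whose total length is unknown or unbounded.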
1.3.11 The World Wide Web
The World Wide Web and its associated distributed information services, such as Yahoo!, Google, America Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.
1.3 Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data
mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two
categories: descriptive and
predictive. Descriptive mining tasks
characterize the general properties of the data in the database. Predictive
mining tasks perform inference on the current data in order to make predictions.
Data mining systems should also
allow users to specify hints to guide or focus the search for interesting
patterns. Because some patterns may not hold for all of the data in the
database, a measure of certainty or “trustworthiness” is usually associated with
each discovered pattern. Data mining functionalities, and the kinds of patterns
they can discover, are described below.
1.3.1 Concept/Class Description: Characterization and Discrimination
Data can be associated with classes
or concepts. Descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived via
(1) data
characterization, by summarizing the data of the class
under study (often called the target class)
in general terms, or
(2) data
discrimination, by comparison of the target class with
one or a set of comparative classes (often called the contrasting
classes), or
(3) both
data characterization and discrimination. Data
characterization is a summarization of the general
characteristics or features of a target class of data. The data corresponding
to the user-specified class are typically collected by a database query.
An attribute-oriented
induction technique can be used to perform data generalization and
characterization without
step-by-step user interaction.
The output of data characterization
can be presented in various forms. Examples include pie
charts, bar charts,
curves, multidimensional
data cubes, and multidimensional
tables, including crosstabs. The resulting
descriptions can also be presented as generalized
relations or in rule form (called characteristic
rules).
Data discrimination is
a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes. The
target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through
database queries. The methods used for data discrimination are similar to those
used for data characterization.
1.3.2 Mining Frequent Patterns, Associations, and Correlations
Frequent
patterns, as the name suggests, are patterns that
occur frequently in data. There are many kinds of frequent patterns, including
itemsets, subsequences, and substructures. A substructure can refer
to
different structural forms, such as graphs, trees, or lattices, which may be
combined with item sets or subsequences. If a substructure occurs frequently,
it is called a (frequent) structured
pattern. Mining frequent patterns leads to the
discovery of interesting associations and correlations within data.
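As a rough illustration of frequent itemsets, the sketch below counts every itemset of up to two items by brute force and keeps those whose support (the fraction of transactions containing the itemset) meets a threshold. The transactions, the `min_support` value, and the `max_size` cap are assumptions for the example; practical algorithms prune the search space rather than enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support >= min_support."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            # Sorting makes each itemset a canonical, hashable tuple.
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


result = frequent_itemsets(transactions, min_support=0.6)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

Here bread and milk appear together in three of the four transactions, so the pair is frequent at a 0.6 threshold, while any itemset containing butter is not.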
1.3.3 Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks (Figure 1.10). A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
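The structure of a decision tree can be shown with nested dictionaries. Note that the tree below is written by hand purely to illustrate nodes, branches, and leaves; it is not learned from training data, and the attribute names and class labels are hypothetical.

```python
# A hand-written, hypothetical decision tree: internal nodes test an attribute,
# branches are the outcomes of the test, and leaves are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": "does_not_buy",
    },
}


def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


print(classify(tree, {"age": "youth", "student": "yes"}))  # buys
print(classify(tree, {"age": "senior", "student": "no"}))  # does_not_buy
```

Each root-to-leaf path corresponds to one IF-THEN rule, for example: IF age = youth AND student = yes THEN buys.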
1.3.4 Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.
In
general, the class labels are not present in the training data simply because
they are not known to begin with. Clustering can be used to generate such
labels. The objects are clustered or grouped based on the principle of maximizing
the intraclass similarity and minimizing the interclass similarity.
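One common way to group unlabeled objects is k-means, sketched here for one-dimensional values. The text does not prescribe a particular algorithm, so treat this as one possible instance; the function name, the initial centers, and the data are assumptions for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given initial centers (k-means style)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return clusters, centers


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
clusters, centers = k_means_1d(values, centers=[1.0, 12.0])
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print(centers)   # [2.0, 11.0]
```

The cluster index assigned to each value can then serve as a generated class label: values within a cluster are close to one another, and values in different clusters are far apart.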
1.3.5 Outlier Analysis
A
database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers.
Most data mining methods discard outliers as noise or exceptions. However, in
some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier
data is referred to as outlier mining.
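A simple statistical way to flag outliers is to measure how many standard deviations each value lies from the mean (a z-score test). This is only one of many possible approaches, and the threshold of 2.0 and the sample amounts are assumptions for the example.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical purchase amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(z_score_outliers(amounts))  # [500]
```

In a fraud detection setting, the flagged value would be the record to investigate, not the one to discard. The test is crude: the outlier itself inflates the mean and standard deviation, so small samples need a low threshold.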
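1.3.6 Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.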
1.4 Classification of Data Mining Systems
Data
mining is an interdisciplinary field, the confluence of a set of disciplines,
including database systems, statistics, machine learning, visualization, and
information science.
Depending
on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from spatial data analysis,
information retrieval, pattern recognition, image analysis, signal processing,
computer graphics, Web technology, economics, business, bioinformatics, or
psychology.
Classification according to the kinds of databases mined:
A data mining system can be classified according to the kinds of databases
mined. Database systems can be classified according to different criteria (such
as data models, or the types of data or applications involved), each of which
may require its own data mining technique.
Classification according to the kinds of knowledge mined:
Data mining systems can be categorized according to the kinds of knowledge they
mine, that is, based on data mining functionalities, such as characterization,
discrimination, association and correlation analysis, classification,
prediction, clustering, outlier analysis, and evolution analysis.
Classification according to the kinds of techniques utilized:
Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on).
Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they adapt. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
1.5 Major Issues in Data Mining
Major issues in data mining concern mining methodology, user interaction, performance, and diverse data types. These issues are introduced below:
Mining methodology and user interaction issues:
These reflect the kinds of knowledge mined, the ability to mine
knowledge at multiple granularities, the use of domain knowledge, and knowledge
visualization.
Mining different kinds of knowledge in databases:
Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and so on.
Interactive mining of knowledge at multiple levels of abstraction:
Because it is difficult to know exactly what can be discovered within a database, the data mining process should be interactive. For databases containing a huge amount of data, appropriate sampling techniques can first be applied to facilitate interactive data exploration. Interactive mining allows users to focus the search for patterns, providing and refining data mining requests based on returned results.
Incorporation of background knowledge:
Background knowledge, or information regarding the domain under
study, may be used to guide the discovery process and allow discovered patterns
to be expressed in concise terms and at different levels of abstraction.
Data mining query languages and ad hoc data mining:
Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar vein, high-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks. Such a language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results:
Discovered knowledge should be expressed in high-level
languages, visual representations, or other expressive forms so that the
knowledge can be easily understood and directly usable by humans. This is
especially crucial if the data mining system is to be interactive.
Pattern evaluation—the interestingness problem:
A data mining system can uncover thousands of patterns. Many of
the patterns discovered may be uninteresting to the given user, either because
they represent common knowledge or lack novelty. The use of interestingness
measures or user-specified constraints to guide the discovery process and
reduce the search space is another active area of research.
Performance issues: These include efficiency, scalability, and parallelization of data mining algorithms.
a) Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.
b) Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods are factors motivating the development of parallel and distributed data mining algorithms.