Data Mining Notes For B.Tech, BCA, BSc ...
What is Data Mining?
* Data mining can be defined as the extraction of useful information from a large amount of data.
Another term for data mining is KDD (Knowledge Discovery in Databases).
--------------------
Data :- Raw, unorganised facts that need to be processed.
Information :- Data that has been processed, organised and structured in a given context to make it useful or meaningful.
Knowledge :- The expertise to infer results from the information you have obtained.
--------------------
Areas where Data Mining is used:-
1. Healthcare
2. Fraud Detection
3. Education
4. Lie Detection
----------------
Challenges of Implementing Data Mining :-
* Data mining becomes effective only when its challenges or problems are correctly recognized.
1. Noisy and incomplete data
2. Distributed data
3. Complex data
4. Performance
---------------
Domains that Data Mining Techniques draw on:-
1. AI
2. ML
3. Statistics
4. Database
---------------
Steps in Data Mining -
Data Selection
👇
Data Pre-Processing
👇
Data Transformation
👇
Data Mining
👇
Pattern Evaluation
👇
Knowledge Representation
---------------------
1. Data Selection: The process of determining which data is appropriate and collecting it from different sources. Examples - databases, the WWW, etc.
2. Data Pre-Processing: An important step in data mining. It refers to cleaning, transforming and integrating data to make it ready for analysis. The goal of data pre-processing is to improve the quality of the data and make it more suitable for data mining.
3. Data Transformation : The process of converting the data into forms appropriate for mining, for example by smoothing, aggregation or normalization.
4. Data Mining : The extraction of useful information from a large amount of data. Another term for data mining is KDD (Knowledge Discovery in Databases).
5. Pattern Evaluation : The process of assessing the quality of the discovered patterns, so that only the patterns that are truly interesting are kept.
6. Knowledge Representation: The discovered knowledge is presented to the user visually, in the form of tables, graphs, charts, etc.
---------------------------------------
Issues in Data Mining:- Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place, so it needs to be integrated from various heterogeneous data sources. These factors create some issues. The major issues concern:
* Mining Methodology and User Interaction
* Performance Issues
* Diverse Data Types Issues
The following diagram describes the major issues :
---------------
Database :- A collection of different kinds of data stored in a particular format.
Data Warehouse :- A collection of databases stored in a particular format.
* The term data warehouse was coined by Bill Inmon in 1990.
* According to him, "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process".
- Subject-oriented : A data warehouse can be used to analyse a particular subject area.
For Ex- Sales, Customer, etc.
- Integrated : A data warehouse integrates data from multiple data sources.
- Time-variant : Historical data is kept in the data warehouse. Data can be identified with a particular time period.
- Non-volatile : Once data has entered the warehouse, it should not change.
-------------------------------------
Data Mining Tasks :- There are two types of task:
1. Descriptive Task: Descriptive functions deal with the general properties of the data in the database.
Here is the list of descriptive functions -
> Class / concept description
> Mining of Association
> Clustering
> Sequence Discovery
2. Predictive Task: It predicts unknown data values by making use of known results from a different set of sample data.
Examples -
> Classification
> Regression
> Prediction
> Time series analysis
------------------------
Data Pre-Processing :- The major steps involved in data pre-processing are -
a) Data Cleaning
b) Data Integration
c) Data Reduction
d) Data Transformation
Need for Data Pre-Processing - Data pre-processing is done to improve the quality of the data in the data warehouse. It increases the efficiency of the data mining process by removing noisy, inconsistent and incomplete data.
* Data Cleaning : It cleans the data by filling in missing values, smoothing noisy data, resolving inconsistencies and removing outliers.
Filling the missing values -
1. Manual Entry of missing data
2. Using Attribute Mean
3. Using Global Constant
4. Ignore the tuple (drop the record that has the missing value)
5. Using the most probable value
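The filling strategies listed above can be sketched in a few lines of Python. This is only an illustration, not a standard implementation: missing values are assumed to be marked with `None`, the function names are made up for this example, and "most probable value" is approximated by the most frequent known value (real systems may instead predict it with regression or a decision tree).

```python
from statistics import mean, mode

def fill_with_mean(values):
    """Replace None with the mean of the known values (attribute mean)."""
    known = [v for v in values if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in values]

def fill_with_constant(values, constant="Unknown"):
    """Replace None with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_most_frequent(values):
    """Replace None with the most frequent known value.
    Assumption: used here as a simple stand-in for 'most probable value'."""
    known = [v for v in values if v is not None]
    most_common = mode(known)
    return [most_common if v is None else v for v in values]

def drop_incomplete(rows):
    """Ignore (drop) every row that contains a missing value."""
    return [row for row in rows if None not in row]

ages = [20, None, 30, 40]
print(fill_with_mean(ages))
print(fill_with_constant(["Delhi", None]))
print(fill_with_most_frequent(["BCA", "BSc", "BCA", None]))
print(drop_incomplete([(1, "a"), (None, "b")]))
```

Manual entry is not shown because it is done by a person, not by code. Dropping rows is the simplest option, but it throws away the other values stored in the same row.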
* Data Integration : Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping, and transformation.
* Data Reduction: Data reduction produces a smaller representation of the data set that still gives the same, or almost the same, analytical results. It reduces the capacity required to store the data, which increases storage efficiency and reduces costs.
* Data Transformation : A technique that transforms the data into alternate forms appropriate for the mining technique.
Techniques of Data Transformation :
1. Smoothing -
> Binning
> Regression
> Clustering
2. Aggregation - Summary or aggregation functions are applied to the data.
For ex- Data Cube
Data Cube : Grouping data into a multidimensional array or matrix is called a data cube. In a data warehouse we generally deal with multidimensional data models, because the data is described by multiple dimensions and multiple attributes.
3. Generalization - In generalization, low-level concepts are replaced with higher-level concepts.
4. Normalization - In normalization, attribute values are scaled so that they fall within a specified range, such as 0 to 1.
V' = (V - min x) / (max x - min x)
where min x and max x are the smallest and largest values of attribute x.
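As a quick check of the normalization formula, here is a minimal Python sketch of min-max scaling to the range 0 to 1. The function name and the sample incomes are made up for illustration. The case where all values are equal is handled by returning 0.0 for every value, which is an assumption of this sketch and not part of the formula.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using v' = (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every value is the same, map all of them to 0.0
        # to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [12000, 30000, 48000]
print(min_max_normalize(incomes))  # prints [0.0, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so attributes measured on very different scales become comparable.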
------------
Noisy Data :- Noise is a random error or variance in a measured variable. Noisy data consists of unwanted, meaningless or corrupt data items, that is, any data that cannot be understood or interpreted correctly by a system.
Techniques to Remove Noisy Data :-
1. Binning - It groups sorted values into a small number of bins, which reduces the cardinality of continuous and discrete data.
2. Regression - It is a data mining technique which fits an equation to a data set, so that noisy values can be replaced by the fitted values.
3. Clustering - Groups, or clusters, are formed from data having similar characteristics.
Outliers: Data values that differ greatly from the others, for example values that fall outside every cluster.
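Binning is the easiest of these techniques to show in code. The sketch below assumes one common variant, smoothing by bin means with equal-frequency bins: the values are sorted, split into bins of a fixed size, and every value is replaced by the mean of its bin. The function name, the bin size and the sample prices are chosen only for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# prints [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

The nine distinct-looking prices collapse to three values, which is how binning both smooths noise and reduces cardinality. Other variants replace each value with the bin median or the nearest bin boundary.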
---------------
Frequent Item Set :- A frequent item set is an item set whose support is at least some user-specified minimum support.
support = support count / total no. of transactions
Here the support count is the number of transactions that contain every item of the item set.
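The support formula can be tried out with a small Python sketch. The transactions, the function names and the minimum support of 0.5 are invented for this example, and the frequency test uses `>=`, which is an assumption about how the threshold is compared.

```python
def support(itemset, transactions):
    """support = (transactions containing the whole itemset) / (all transactions)."""
    itemset = set(itemset)
    support_count = sum(1 for t in transactions if itemset <= set(t))
    return support_count / len(transactions)

def is_frequent(itemset, transactions, min_support):
    """Assumption: an itemset is frequent when its support >= min_support."""
    return support(itemset, transactions) >= min_support

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, transactions))           # prints 0.5
print(is_frequent({"bread", "milk"}, transactions, 0.5))  # prints True
print(is_frequent({"butter", "milk"}, transactions, 0.5)) # prints False
```

The item set {bread, milk} appears in 2 of the 4 transactions, so its support count is 2 and its support is 2 / 4 = 0.5.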
------------------