Table of Contents and Abstracts
1 Introduction
1.1 Data Mining
1.2 R
1.3 Datasets
1.3.1 The Iris Dataset
1.3.2 The Bodyfat Dataset
2 Data Import and Export
2.1 Save and Load R Data
2.2 Import from and Export to .CSV Files
2.3 Import Data from SAS
2.4 Import/Export via ODBC
2.4.1 Read from Databases
2.4.2 Output to and Input from EXCEL Files
3 Data Exploration
3.1 Have a Look at Data
3.2 Explore Individual Variables
3.3 Explore Multiple Variables
3.4 More Explorations
3.5 Save Charts into Files
4 Decision Trees and Random Forest
4.1 Decision Trees with Package party
4.2 Decision Trees with Package rpart
4.3 Random Forest
5 Regression
5.1 Linear Regression
5.2 Logistic Regression
5.3 Generalized Linear Regression
5.4 Non-linear Regression
6 Clustering
6.1 The k-Means Clustering
6.2 The k-Medoids Clustering
6.3 Hierarchical Clustering
6.4 Density-based Clustering
7 Outlier Detection
7.1 Univariate Outlier Detection
7.2 Outlier Detection with LOF
7.3 Outlier Detection by Clustering
7.4 Outlier Detection from Time Series
7.5 Discussions
8 Time Series Analysis and Mining
8.1 Time Series Data in R
8.2 Time Series Decomposition
8.3 Time Series Forecasting
8.4 Time Series Clustering
8.4.1 Dynamic Time Warping
8.4.2 Synthetic Control Chart Time Series Data
8.4.3 Hierarchical Clustering with Euclidean Distance
8.4.4 Hierarchical Clustering with DTW Distance
8.5 Time Series Classification
8.5.1 Classification with Original Data
8.5.2 Classification with Extracted Features
8.5.3 k-NN Classification
8.6 Discussions
8.7 Further Readings
9 Association Rules
9.1 Basics of Association Rules
9.2 The Titanic Dataset
9.3 Association Rule Mining
9.4 Removing Redundancy
9.5 Interpreting Rules
9.6 Visualizing Association Rules
9.7 Discussions and Further Readings
10 Text Mining
10.1 Retrieving Text from Twitter
10.2 Transforming Text
10.3 Stemming Words
10.4 Building a Term-Document Matrix
10.5 Frequent Terms and Associations
10.6 Word Cloud
10.7 Clustering Words
10.8 Clustering Tweets
10.8.1 Clustering Tweets with the k-means Algorithm
10.8.2 Clustering Tweets with the k-medoids Algorithm
10.9 Packages, Further Readings and Discussions
11 Social Network Analysis
11.1 Network of Terms
11.2 Network of Tweets
11.3 Two-Mode Network
11.4 Discussions and Further Readings
12 Case Study I: Analysis and Forecasting of House Price Indices
12.1 Importing HPI Data
12.2 Exploration of HPI Data
12.3 Trend and Seasonal Components of HPI
12.4 HPI Forecasting
12.5 The Estimated Price of a Property
12.6 Discussion
13 Case Study II: Customer Response Prediction and Profit Optimization
13.1 Introduction
13.2 The Data of KDD Cup 1998
13.3 Data Exploration
13.4 Training Decision Trees
13.5 Model Evaluation
13.6 Selecting the Best Tree
13.7 Scoring
13.8 Discussions and Conclusions
14 Case Study III: Predictive Modeling of Big Data with Limited Memory
14.1 Introduction
14.2 Methodology
14.3 Data and Variables
14.4 Random Forest
14.5 Memory Issue
14.6 Train Models on Sample Data
14.7 Build Models with Selected Variables
14.8 Scoring
14.9 Print Rules
14.9.1 Print Rules in Text
14.9.2 Print Rules for Scoring with SAS
14.10 Conclusions and Discussion
15 Online Resources
15.1 R Reference Cards
15.2 R
15.3 Data Mining
15.4 Data Mining with R
15.5 Classification/Prediction with R
15.6 Time Series Analysis with R
15.7 Association Rule Mining with R
15.8 Spatial Data Analysis with R
15.9 Text Mining with R
15.10 Social Network Analysis with R
15.11 Data Cleansing and Transformation with R
15.12 Big Data and Parallel Computing with R
R Reference Card for Data Mining
Bibliography
General Index
Package Index
Function Index
1 Introduction
Abstract: This chapter introduces basic concepts and techniques for data mining, including a data mining process and popular data mining techniques. It also presents R and its packages, functions, and task views for data mining. Finally, some datasets used in this book are described.
Keywords: R, Data mining, Data mining process, Package, Dataset
2 Data Import and Export
Abstract: This chapter shows how to import foreign data into R and export R objects to other formats. First, examples demonstrate saving R objects to and loading them from .Rdata files. The chapter then demonstrates importing data from and exporting data to .CSV files, SAS databases, ODBC databases, and EXCEL files.
Keywords: Data import, Data export, Foreign data, CSV, SAS
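The save/load and .CSV round trips described in the abstract can be sketched with base R alone. This is a minimal illustration, not the chapter's own example: the data frame, the object names, and the temporary file paths are all hypothetical.

```r
# Hypothetical example data; object and file names are illustrative only.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
df_original <- df

# Save the object to an .Rdata file, remove it, and load it back.
rdata_file <- tempfile(fileext = ".Rdata")
save(df, file = rdata_file)
rm(df)
load(rdata_file)  # restores the object under its original name, df

# Export to a .CSV file and import it again.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)  # row.names = FALSE avoids an extra index column
df_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
```

Note that load() restores objects under the names they were saved with, whereas read.csv() returns a value that must be assigned.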
3 Data Exploration
Abstract: This chapter shows examples of data exploration in R. It starts with inspecting the dimensionality, structure, and data of an R object, followed by basic statistics and various charts such as pie charts and histograms. Exploration of multiple variables is then demonstrated, including grouped distributions, grouped boxplots, scatter plots, and pairs plots. After that, examples are given of level plots, contour plots, and 3D plots. It also shows how to save charts into files of various formats.
Keywords: Data exploration, Statistics, Graphic, Chart, Distribution
4 Decision Trees and Random Forest
Abstract: This chapter shows how to build predictive models with packages party, rpart, and randomForest. It starts with building decision trees with package party and using the built tree for classification, followed by another way to build decision trees with package rpart. After that, it presents an example of training a random forest model with package randomForest.
Keywords: Decision tree, Random forest, Classification, Prediction, Predictive model
5 Regression
Abstract: This chapter introduces basic concepts and presents examples of various regression techniques. First, it shows an example of building a linear regression model to predict CPI data. After that, it introduces logistic regression. The generalized linear model (GLM) is then presented, followed by a brief introduction to non-linear regression.
Keywords: Regression, Prediction, Linear regression, Logistic regression, Generalized linear model, Non-linear regression
6 Clustering
Abstract: This chapter presents examples of various clustering techniques in R, including k-means clustering, k-medoids clustering, hierarchical clustering, and density-based clustering. The first two sections demonstrate how to use the k-means and k-medoids algorithms to cluster the iris data. The third section shows an example of hierarchical clustering on the same data. The last section describes the idea of density-based clustering and the DBSCAN algorithm, and shows how to cluster with DBSCAN and then label new data with the clustering model.
Keywords: Clustering, Cluster, k-means, k-medoids, DBSCAN, Partitioning, Hierarchical clustering, Density-based clustering
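As a rough sketch of the first and third techniques named in the abstract, the built-in iris data can be clustered with base R's kmeans() and hclust(). The seed, the number of clusters (3), and the "average" linkage method are assumptions made for illustration, not settings taken from the chapter.

```r
# Drop the Species column so that only the four numeric measurements are clustered.
iris_features <- iris[, 1:4]

# k-means starts from random centers, so fix an (arbitrary) seed for repeatability.
set.seed(8953)
km <- kmeans(iris_features, centers = 3)

# Cross-tabulate the known species against the cluster assignments.
confusion <- table(iris$Species, km$cluster)

# Hierarchical clustering on Euclidean distances, cut into 3 groups.
hc <- hclust(dist(iris_features), method = "average")
hc_groups <- cutree(hc, k = 3)
```

Printing confusion shows how well the clusters line up with the species labels; the cluster numbers themselves are arbitrary and can change with the seed.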
7 Outlier Detection
Abstract: This chapter presents examples of outlier detection with R. First, it demonstrates univariate outlier detection. After that, an example of outlier detection with LOF (Local Outlier Factor) is given, followed by examples of outlier detection by clustering. Finally, it demonstrates outlier detection from time series data.
Keywords: Outlier, Anomaly, Extreme value, Outlier detection, LOF
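One common way to do univariate outlier detection in base R is the boxplot rule, sketched below with boxplot.stats(). The data vector is made up for illustration, and the choice of the boxplot rule here is an assumption rather than a statement of the chapter's method.

```r
# Hypothetical measurements with two planted extreme values (45 and -20).
x <- c(10, 12, 11, 13, 12, 11, 10, 14, 13, 12, 45, -20)

# boxplot.stats() flags points lying more than 1.5 times the hinge spread
# beyond the lower or upper hinge.
outlier_values <- boxplot.stats(x)$out

# Positions of the flagged values in the original vector.
outlier_index <- which(x %in% outlier_values)
```

Here the hinges are 10.5 and 13, so the limits are 6.75 and 16.75, and only the two planted values fall outside them.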
8 Time Series Analysis and Mining
Abstract: This chapter presents examples of time series decomposition, forecasting, clustering, and classification. The first section briefly introduces time series data in R. The second section shows an example of decomposing a time series into trend, seasonal, and random components. The third section presents how to build an autoregressive integrated moving average (ARIMA) model in R and use it to predict future values. The fourth section introduces Dynamic Time Warping (DTW) and hierarchical clustering of time series data with Euclidean distance and with DTW distance. The fifth section shows three examples of time series classification: one with the original data, one with DWT (Discrete Wavelet Transform) transformed data, and one with k-NN classification. The chapter ends with discussions and further readings.
Keywords: Time series, Decomposition, Forecasting, Time series classification, Time series clustering, Dynamic Time Warping (DTW), Discrete Wavelet Transform (DWT), k-NN classification
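The decomposition and ARIMA forecasting steps mentioned in the abstract can be sketched with base R on the built-in AirPassengers series. The dataset choice, the model orders, and the 24-month horizon are assumptions picked for illustration; the orders are not tuned.

```r
# Monthly airline passenger counts, 1949 to 1960, shipped with R.
ap <- AirPassengers

# Split the series into trend, seasonal, and random components
# (decompose() uses an additive model by default).
components <- decompose(ap)

# Fit a seasonal ARIMA model; the orders below are illustrative only.
fit <- arima(ap, order = c(1, 0, 0),
             seasonal = list(order = c(2, 1, 0), period = 12))

# Predict the next 24 months, with standard errors in fore$se.
fore <- predict(fit, n.ahead = 24)
```

The components are available as components$trend, components$seasonal, and components$random, and plot(components) draws them in stacked panels.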
9 Association Rules
Abstract: This chapter presents examples of association rule mining with R. It starts with basic concepts of association rules, and then demonstrates association rule mining with R. After that, it presents examples of pruning redundant rules and of interpreting and visualizing association rules. The chapter concludes with discussions and recommended readings.
Keywords: Association rule, Frequent itemset, Interestingness, APRIORI, Redundant rule, Visualization
10 Text Mining
Abstract: This chapter presents examples of text mining with R. Twitter text of @RDataMining is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a document-term matrix. After that, frequent words and associations are found from the matrix. A word cloud is used to present important words in documents. Finally, words and tweets are clustered to find groups of words and groups of tweets.
Keywords: Text mining, Text clustering, Word cloud, Twitter, Stop word, Stemming, Document-term matrix
11 Social Network Analysis
Abstract: This chapter presents examples of social network analysis with R, specifically with package igraph. The data analyzed is the Twitter text data used in the previous chapter on text mining. In this chapter, we first build a network of terms based on their co-occurrence in the same tweets, and then build a network of tweets based on the terms they share. Finally, we build a two-mode network composed of both terms and tweets. We also demonstrate some tricks for plotting nice network graphs.
Keywords: Social network, Twitter, Two-mode network, Social network analysis
12 Case Study I: Analysis and Forecasting of House Price Indices
Abstract: This chapter presents a case study on analyzing and forecasting House Price Indices (HPI). It demonstrates data import from a CSV file, descriptive analysis of HPI time series data, and decomposition and forecasting of the data.
Keywords: Time series, Decomposition, Forecasting, Seasonal component
13 Case Study II: Customer Response Prediction and Profit Optimization
Abstract: This chapter presents a case study on using decision trees to predict customer response and optimize profit. To improve the customer contact process and maximize profit, decision trees were built with R to model customer contact history and predict the response of customers. Customers can then be prioritized for contact based on the prediction, so that profit can be maximized given a limited amount of time, cost, and human resources.
Keywords: Decision tree, Prediction, Profit optimization
14 Case Study III: Predictive Modeling of Big Data with Limited Memory
Abstract: This chapter shows a case study on building a predictive model with limited memory. Because the training dataset was too large for decision trees to be built on it easily in R, multiple subsets were drawn from it by random sampling, and a decision tree was built for each subset. After that, only the variables appearing in at least one of the built trees were kept from the original training dataset, which reduced the data size. In the scoring process, the scoring dataset was also split into subsets, so that the scoring could be done with limited memory. R code for printing rules in plain English and in SAS format is also presented in this chapter.
Keywords: Predictive model, Limited memory, Large data, Training, Scoring
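The workflow in the abstract (sample subsets, build one tree per subset, keep the variables the trees use, then score in chunks) can be sketched as follows. This is an illustration under loud assumptions: the iris data stands in for a large dataset, package rpart is used because the abstract does not name a tree package, and the subset count, subset size, and chunk size are arbitrary.

```r
library(rpart)  # assumption: the abstract does not say which tree package was used

set.seed(42)            # arbitrary seed for repeatable sampling
train <- iris           # stand-in for a training set too large to model at once
n_subsets <- 5          # illustrative values, not taken from the case study
subset_size <- 60

# Build one tree per random subset and collect every variable used in a split.
selected <- character(0)
for (i in seq_len(n_subsets)) {
  idx <- sample(nrow(train), subset_size)
  tree <- rpart(Species ~ ., data = train[idx, ])
  used <- setdiff(unique(as.character(tree$frame$var)), "<leaf>")
  selected <- union(selected, used)
}

# Keep only the selected variables plus the target, then build the final model.
reduced <- train[, c(selected, "Species"), drop = FALSE]
final_tree <- rpart(Species ~ ., data = reduced)

# Score in chunks so that only one chunk is processed at a time.
score_data <- iris[, selected, drop = FALSE]   # stand-in for a large scoring set
chunk_id <- ceiling(seq_len(nrow(score_data)) / 40)
pred <- unlist(lapply(split(seq_len(nrow(score_data)), chunk_id), function(rows) {
  as.character(predict(final_tree, newdata = score_data[rows, , drop = FALSE],
                       type = "class"))
}), use.names = FALSE)
```

With real data the chunks would typically be read from disk one at a time rather than indexed out of an in-memory data frame, which is what actually bounds memory use.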
15 Online Resources
Abstract: This chapter presents links to online resources on R and data mining, including books, documents, tutorials, and slides.
Keywords: Resource, Document, Tutorial, Slides