Table of Contents and Abstracts

1 Introduction

1.1 Data Mining

1.2 R

1.3 Datasets

1.3.1 The Iris Dataset

1.3.2 The Bodyfat Dataset

2 Data Import and Export

2.1 Save and Load R Data

2.2 Import from and Export to .CSV Files

2.3 Import Data from SAS

2.4 Import/Export via ODBC

2.4.1 Read from Databases

2.4.2 Output to and Input from EXCEL Files

3 Data Exploration

3.1 Have a Look at Data

3.2 Explore Individual Variables

3.3 Explore Multiple Variables

3.4 More Explorations

3.5 Save Charts into Files

4 Decision Trees and Random Forest

4.1 Decision Trees with Package party

4.2 Decision Trees with Package rpart

4.3 Random Forest

5 Regression

5.1 Linear Regression

5.2 Logistic Regression

5.3 Generalized Linear Regression

5.4 Non-linear Regression

6 Clustering

6.1 The k-Means Clustering

6.2 The k-Medoids Clustering

6.3 Hierarchical Clustering

6.4 Density-based Clustering

7 Outlier Detection

7.1 Univariate Outlier Detection

7.2 Outlier Detection with LOF

7.3 Outlier Detection by Clustering

7.4 Outlier Detection from Time Series

7.5 Discussions

8 Time Series Analysis and Mining

8.1 Time Series Data in R

8.2 Time Series Decomposition

8.3 Time Series Forecasting

8.4 Time Series Clustering

8.4.1 Dynamic Time Warping

8.4.2 Synthetic Control Chart Time Series Data

8.4.3 Hierarchical Clustering with Euclidean Distance

8.4.4 Hierarchical Clustering with DTW Distance

8.5 Time Series Classification

8.5.1 Classification with Original Data

8.5.2 Classification with Extracted Features

8.5.3 k-NN Classification

8.6 Discussions

8.7 Further Readings

9 Association Rules

9.1 Basics of Association Rules

9.2 The Titanic Dataset

9.3 Association Rule Mining

9.4 Removing Redundancy

9.5 Interpreting Rules

9.6 Visualizing Association Rules

9.7 Discussions and Further Readings

10 Text Mining

10.1 Retrieving Text from Twitter

10.2 Transforming Text

10.3 Stemming Words

10.4 Building a Term-Document Matrix

10.5 Frequent Terms and Associations

10.6 Word Cloud

10.7 Clustering Words

10.8 Clustering Tweets

10.8.1 Clustering Tweets with the k-means Algorithm

10.8.2 Clustering Tweets with the k-medoids Algorithm

10.9 Packages, Further Readings and Discussions

11 Social Network Analysis

11.1 Network of Terms

11.2 Network of Tweets

11.3 Two-Mode Network

11.4 Discussions and Further Readings

12 Case Study I: Analysis and Forecasting of House Price Indices

12.1 Importing HPI Data

12.2 Exploration of HPI Data

12.3 Trend and Seasonal Components of HPI

12.4 HPI Forecasting

12.5 The Estimated Price of a Property

12.6 Discussion

13 Case Study II: Customer Response Prediction and Profit Optimization

13.1 Introduction

13.2 The Data of KDD Cup 1998

13.3 Data Exploration

13.4 Training Decision Trees

13.5 Model Evaluation

13.6 Selecting the Best Tree

13.7 Scoring

13.8 Discussions and Conclusions

14 Case Study III: Predictive Modeling of Big Data with Limited Memory

14.1 Introduction

14.2 Methodology

14.3 Data and Variables

14.4 Random Forest

14.5 Memory Issue

14.6 Train Models on Sample Data

14.7 Build Models with Selected Variables

14.8 Scoring

14.9 Print Rules

14.9.1 Print Rules in Text

14.9.2 Print Rules for Scoring with SAS

14.10 Conclusions and Discussion

15 Online Resources

15.1 R Reference Cards

15.2 R

15.3 Data Mining

15.4 Data Mining with R

15.5 Classification/Prediction with R

15.6 Time Series Analysis with R

15.7 Association Rule Mining with R

15.8 Spatial Data Analysis with R

15.9 Text Mining with R

15.10 Social Network Analysis with R

15.11 Data Cleansing and Transformation with R

15.12 Big Data and Parallel Computing with R

R Reference Card for Data Mining

Bibliography

General Index

Package Index

Function Index

1 Introduction

Abstract: This chapter introduces basic concepts and techniques for data mining, including a data mining process and popular data mining techniques. It also presents R and its packages, functions and task views for data mining. Finally, the datasets used in this book are described.

Keywords: R, Data mining, Data mining process, Package, Dataset

2 Data Import and Export

Abstract: This chapter shows how to import foreign data into R and export R objects to other formats. First, examples demonstrate saving R objects to and loading them from .Rdata files. The chapter then demonstrates importing data from and exporting data to .CSV files, SAS databases, ODBC databases and EXCEL files.

Keywords: Data import, Data export, Foreign data, CSV, SAS
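
As a minimal sketch of the save/load and .CSV round trips described in the abstract above, the following uses only base R. The object names, file names, and the use of a temporary directory are assumptions made for illustration, not the chapter's own example.

```r
# Save an R object to an .Rdata file, remove it, and load it back.
# The temporary directory and file names are illustrative assumptions.
a <- 1:10
rdata_file <- file.path(tempdir(), "dumData.Rdata")
save(a, file = rdata_file)
rm(a)
load(rdata_file)  # restores the object `a` into the workspace

# Export a small data frame to a .CSV file and import it again.
df1 <- data.frame(id = 1:3,
                  score = c(2.5, 3.0, 4.5),
                  name = c("A", "B", "C"))
csv_file <- file.path(tempdir(), "dummyData.csv")
write.csv(df1, csv_file, row.names = FALSE)  # omit the row-name column
df2 <- read.csv(csv_file)
print(df2)
```

Passing row.names = FALSE keeps the exported file free of an extra index column, so the re-imported data frame has the same columns as the original.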

3 Data Exploration

Abstract: This chapter shows examples of data exploration in R. It starts with inspecting the dimensionality, structure, and data of an R object, followed by basic statistics and various charts such as pie charts and histograms. Exploration of multiple variables is then demonstrated, including grouped distributions, grouped boxplots, scatter plots, and pairs plots. After that, examples are given of level plots, contour plots, and 3D plots. It also shows how to save charts into files of various formats.

Keywords: Data exploration, Statistics, Graphic, Chart, Distribution

4 Decision Trees and Random Forest

Abstract: This chapter shows how to build predictive models with packages party, rpart, and randomForest. It starts with building decision trees with package party and using the built tree for classification, followed by another way to build decision trees with package rpart. After that, it presents an example on training a random forest model with package randomForest.

Keywords: Decision tree, Random forest, Classification, Prediction, Predictive model

5 Regression

Abstract: This chapter introduces basic concepts and presents examples of various regression techniques. First, it shows an example of building a linear regression model to predict CPI data. After that, it introduces logistic regression. The generalized linear model (GLM) is then presented, followed by a brief introduction to non-linear regression.

Keywords: Regression, Prediction, Linear regression, Logistic regression, Generalized linear model, Non-linear regression

6 Clustering

Abstract: This chapter presents examples of various clustering techniques in R, including k-means clustering, k-medoids clustering, hierarchical clustering, and density-based clustering. The first two sections demonstrate how to use the k-means and k-medoids algorithms to cluster the iris data. The third section shows an example of hierarchical clustering of the same data. The last section describes the idea of density-based clustering and the DBSCAN algorithm, and shows how to cluster with DBSCAN and then label new data with the clustering model.

Keywords: Clustering, Cluster, k-means, k-medoids, DBSCAN, Partitioning, Hierarchical clustering, Density-based clustering
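
To make the k-means step in the abstract above concrete, here is a hedged sketch using the iris data bundled with R and the base function kmeans(). The seed, the choice of three clusters, and nstart = 10 are assumptions for illustration rather than settings taken from the chapter.

```r
# Cluster the iris data with k-means, ignoring the Species label.
iris2 <- iris[, 1:4]          # keep the four numeric measurements only

set.seed(42)                  # assumed seed, for reproducible initial centers
km <- kmeans(iris2, centers = 3, nstart = 10)

# Cross-tabulate the known species against the assigned clusters
# to see how well the clusters line up with the true classes.
tab <- table(iris$Species, km$cluster)
print(tab)
print(km$centers)             # one row of feature means per cluster
```

Because k-means starts from random centers, cluster labels (1, 2, 3) can be permuted between runs; only the grouping itself is meaningful.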

7 Outlier Detection

Abstract: This chapter presents examples of outlier detection with R. First, it demonstrates univariate outlier detection. After that, an example of outlier detection with LOF (Local Outlier Factor) is given, followed by examples of outlier detection by clustering. Finally, it demonstrates outlier detection from time series data.

Keywords: Outlier, Anomaly, Extreme value, Outlier detection, LOF
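
One simple way to illustrate the univariate case mentioned in the abstract above is the boxplot rule available in base R through boxplot.stats(). This is a sketch on simulated data; the seed and sample size are assumptions, and the chapter itself may use a different dataset.

```r
# Univariate outlier detection with the boxplot rule.
set.seed(3147)                # assumed seed, for reproducibility
x <- rnorm(100)               # simulated data; not from the book

# boxplot.stats() returns, in $out, the points lying beyond the whiskers
# (more than 1.5 times the interquartile range from the box by default).
out <- boxplot.stats(x)$out
print(out)

# Positions of the flagged values in the original vector.
out_idx <- which(x %in% out)
print(out_idx)
```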

8 Time Series Analysis and Mining

Abstract: This chapter presents examples of time series decomposition, forecasting, clustering, and classification. The first section briefly introduces time series data in R. The second section shows an example of decomposing time series into trend, seasonal, and random components. The third section presents how to build an autoregressive integrated moving average (ARIMA) model in R and use it to predict future values. The fourth section introduces Dynamic Time Warping (DTW) and hierarchical clustering of time series data with Euclidean distance and with DTW distance. The fifth section shows three examples of time series classification: one with the original data, one with DWT (Discrete Wavelet Transform) transformed data, and one with k-NN classification. The chapter ends with discussions and further readings.

Keywords: Time series, Decomposition, Forecasting, Time series classification, Time series clustering, Dynamic Time Warping (DTW), Discrete Wavelet Transform (DWT), k-NN classification
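
The decomposition and ARIMA forecasting steps summarized in the abstract above can be sketched with base R as follows. The AirPassengers dataset ships with R; the ARIMA orders and the 24-month horizon are illustrative assumptions, not a recommended model.

```r
# Decompose a monthly time series into trend, seasonal, and random parts.
f <- decompose(AirPassengers)   # additive decomposition by default
print(f$figure)                 # the 12 estimated seasonal effects
# plot(f) would draw the observed, trend, seasonal, and random panels.

# Fit a seasonal ARIMA model; the orders below are assumed for illustration.
fit <- arima(AirPassengers,
             order = c(1, 0, 0),
             seasonal = list(order = c(2, 1, 0), period = 12))

# Forecast the next 24 months and form an approximate 95% interval.
fore <- predict(fit, n.ahead = 24)
upper <- fore$pred + 2 * fore$se
lower <- fore$pred - 2 * fore$se
```

In practice the model orders would be chosen by inspecting the data (for example with autocorrelation plots) rather than fixed in advance as they are here.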

9 Association Rules

Abstract: This chapter presents examples of association rule mining with R. It starts with basic concepts of association rules, and then demonstrates association rule mining with R. After that, it presents examples of pruning redundant rules and of interpreting and visualizing association rules. The chapter concludes with discussions and recommended readings.

Keywords: Association rule, Frequent itemset, Interestingness, APRIORI, Redundant rule, Visualization

10 Text Mining

Abstract: This chapter presents examples of text mining with R. Twitter text of @RDataMining is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a document-term matrix. After that, frequent words and associations are found from the matrix. A word cloud is used to present important words in documents. Finally, words and tweets are clustered to find groups of words and groups of tweets.

Keywords: Text mining, Text clustering, Word cloud, Twitter, Stop word, Stemming, Document-term matrix

11 Social Network Analysis

Abstract: This chapter presents examples of social network analysis with R, specifically with package igraph. The data analyzed is the Twitter text data used in the previous chapter on text mining. In this chapter, we first build a network of terms based on their co-occurrence in the same tweets, and then build a network of tweets based on the terms shared by them. Finally, we build a two-mode network composed of both terms and tweets. We also demonstrate some tricks to plot nice network graphs.

Keywords: Social network, Twitter, Two-mode network, Social network analysis

12 Case Study I: Analysis and Forecasting of House Price Indices

Abstract: This chapter presents a case study on the analysis and forecasting of House Price Indices (HPI). It demonstrates data import from a CSV file, descriptive analysis of HPI time series data, and decomposition and forecasting of the data.

Keywords: Time series, Decomposition, Forecasting, Seasonal component

13 Case Study II: Customer Response Prediction and Profit Optimization

Abstract: This chapter presents a case study on using decision trees to predict customer response and optimize profit. To improve the customer contact process and maximize profit, decision trees were built with R to model customer contact history and predict the response of customers. Customers can then be prioritized for contact based on the prediction, so that profit can be maximized given a limited amount of time, cost, and human resources.

Keywords: Decision tree, Prediction, Profit optimization

14 Case Study III: Predictive Modeling of Big Data with Limited Memory

Abstract: This chapter shows a case study on building a predictive model with limited memory. Because the training dataset was too large to build decision trees on easily in R, multiple subsets were drawn from it by random sampling, and a decision tree was built for each subset. After that, the variables appearing in any one of the built trees were used for variable selection from the original training dataset to reduce data size. In the scoring process, the scoring dataset was also split into subsets, so that the scoring could be done with limited memory. R code for printing rules in plain English and in SAS format is also presented in this chapter.

Keywords: Predictive model, Limited memory, Large data, Training, Scoring

15 Online Resources

Abstract: This chapter presents links to online resources on R and data mining, including books, documents, tutorials, and slides.

Keywords: Resource, Document, Tutorial, Slides