Table of Contents and Abstracts

    1 Introduction
        1.1 Data Mining
        1.2 R
        1.3 Datasets
            1.3.1 The Iris Dataset
            1.3.2 The Bodyfat Dataset

    2 Data Import and Export
        2.1 Save and Load R Data
        2.2 Import from and Export to .CSV Files
        2.3 Import Data from SAS
        2.4 Import/Export via ODBC
            2.4.1 Read from Databases
            2.4.2 Output to and Input from EXCEL Files

    3 Data Exploration
        3.1 Have a Look at Data
        3.2 Explore Individual Variables
        3.3 Explore Multiple Variables
        3.4 More Explorations
        3.5 Save Charts into Files

    4 Decision Trees and Random Forest
        4.1 Decision Trees with Package party
        4.2 Decision Trees with Package rpart
        4.3 Random Forest

    5 Regression
        5.1 Linear Regression
        5.2 Logistic Regression
        5.3 Generalized Linear Regression
        5.4 Non-linear Regression

    6 Clustering
        6.1 The k-Means Clustering
        6.2 The k-Medoids Clustering
        6.3 Hierarchical Clustering
        6.4 Density-based Clustering

    7 Outlier Detection
        7.1 Univariate Outlier Detection
        7.2 Outlier Detection with LOF
        7.3 Outlier Detection by Clustering
        7.4 Outlier Detection from Time Series
        7.5 Discussions

    8 Time Series Analysis and Mining
        8.1 Time Series Data in R
        8.2 Time Series Decomposition
        8.3 Time Series Forecasting
        8.4 Time Series Clustering
            8.4.1 Dynamic Time Warping
            8.4.2 Synthetic Control Chart Time Series Data
            8.4.3 Hierarchical Clustering with Euclidean Distance
            8.4.4 Hierarchical Clustering with DTW Distance
        8.5 Time Series Classification
            8.5.1 Classification with Original Data
            8.5.2 Classification with Extracted Features
            8.5.3 k-NN Classification
        8.6 Discussions
        8.7 Further Readings

    9 Association Rules
        9.1 Basics of Association Rules
        9.2 The Titanic Dataset
        9.3 Association Rule Mining
        9.4 Removing Redundancy
        9.5 Interpreting Rules
        9.6 Visualizing Association Rules
        9.7 Discussions and Further Readings

    10 Text Mining
        10.1 Retrieving Text from Twitter
        10.2 Transforming Text
        10.3 Stemming Words
        10.4 Building a Term-Document Matrix
        10.5 Frequent Terms and Associations
        10.6 Word Cloud
        10.7 Clustering Words
        10.8 Clustering Tweets
            10.8.1 Clustering Tweets with the k-means Algorithm
            10.8.2 Clustering Tweets with the k-medoids Algorithm
        10.9 Packages, Further Readings and Discussions

    11 Social Network Analysis
        11.1 Network of Terms
        11.2 Network of Tweets
        11.3 Two-Mode Network
        11.4 Discussions and Further Readings

    12 Case Study I: Analysis and Forecasting of House Price Indices
        12.1 Importing HPI Data
        12.2 Exploration of HPI Data
        12.3 Trend and Seasonal Components of HPI
        12.4 HPI Forecasting
        12.5 The Estimated Price of a Property
        12.6 Discussion

    13 Case Study II: Customer Response Prediction and Profit Optimization
        13.1 Introduction
        13.2 The Data of KDD Cup 1998
        13.3 Data Exploration
        13.4 Training Decision Trees
        13.5 Model Evaluation
        13.6 Selecting the Best Tree
        13.7 Scoring
        13.8 Discussions and Conclusions

    14 Case Study III: Predictive Modeling of Big Data with Limited Memory
        14.1 Introduction
        14.2 Methodology
        14.3 Data and Variables
        14.4 Random Forest
        14.5 Memory Issue
        14.6 Train Models on Sample Data
        14.7 Build Models with Selected Variables
        14.8 Scoring
        14.9 Print Rules
            14.9.1 Print Rules in Text
            14.9.2 Print Rules for Scoring with SAS
        14.10 Conclusions and Discussion

    15 Online Resources
        15.1 R Reference Cards
        15.2 R
        15.3 Data Mining
        15.4 Data Mining with R
        15.5 Classification/Prediction with R
        15.6 Time Series Analysis with R
        15.7 Association Rule Mining with R
        15.8 Spatial Data Analysis with R
        15.9 Text Mining with R
        15.10 Social Network Analysis with R
        15.11 Data Cleansing and Transformation with R
        15.12 Big Data and Parallel Computing with R

    R Reference Card for Data Mining

    Bibliography

    General Index

    Package Index

    Function Index

    1 Introduction
    Abstract: This chapter introduces basic concepts and techniques for data mining, including a data mining process and popular data mining techniques. It also presents R and its packages, functions, and task views for data mining. Finally, it describes some datasets used in this book.
    Keywords: R, Data mining, Data mining process, Package, Dataset

    2 Data Import and Export
    Abstract: This chapter shows how to import foreign data into R and export R objects to other formats. First, examples demonstrate saving R objects to and loading them from .Rdata files. It then demonstrates importing data from and exporting data to .CSV files, SAS databases, ODBC databases, and EXCEL files.
    Keywords: Data import, Data export, Foreign data, CSV, SAS
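
As a rough illustration of the first two steps in this abstract, the sketch below saves and reloads an R object and then round-trips a data frame through a .CSV file. The objects `a` and `df` and the file names are hypothetical and are not taken from the book; only base R functions are used.

```r
# Hypothetical example objects; none of these names come from the book.
a <- 1:10
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Save an R object to an .Rdata file, remove it, and load it back.
rdata_file <- file.path(tempdir(), "example.Rdata")
save(a, file = rdata_file)
rm(a)
load(rdata_file)   # restores the object under its original name, `a`

# Export a data frame to a .CSV file and import it again.
csv_file <- file.path(tempdir(), "example.csv")
write.csv(df, csv_file, row.names = FALSE)
df2 <- read.csv(csv_file)
```

Writing to `tempdir()` keeps the sketch from touching the working directory; a real script would normally use a project path instead.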

    3 Data Exploration
    Abstract: This chapter shows examples of data exploration in R. It starts with inspecting the dimensionality, structure, and data of an R object, followed by basic statistics and various charts such as pie charts and histograms. Exploration of multiple variables is then demonstrated, including grouped distributions, grouped boxplots, scatter plots, and pairs plots. After that, examples are given of level plots, contour plots, and 3D plots. It also shows how to save charts into files of various formats.
    Keywords: Data exploration, Statistics, Graphic, Chart, Distribution

    4 Decision Trees and Random Forest
    Abstract: This chapter shows how to build predictive models with packages party, rpart, and randomForest. It starts with building decision trees with package party and using the built tree for classification, followed by another way to build decision trees with package rpart. After that, it presents an example of training a random forest model with package randomForest.
    Keywords: Decision tree, Random forest, Classification, Prediction, Predictive model

    5 Regression
    Abstract: This chapter introduces basic concepts and presents examples of various regression techniques. First, it shows an example of building a linear regression model to predict CPI data. After that, it introduces logistic regression. The generalized linear model (GLM) is then presented, followed by a brief introduction to non-linear regression.
    Keywords: Regression, Prediction, Linear regression, Logistic regression, Generalized linear model, Non-linear regression

    6 Clustering
    Abstract: This chapter presents examples of various clustering techniques in R, including k-means clustering, k-medoids clustering, hierarchical clustering, and density-based clustering. The first two sections demonstrate how to use the k-means and k-medoids algorithms to cluster the iris data. The third section shows an example of hierarchical clustering on the same data. The last section describes the idea of density-based clustering and the DBSCAN algorithm, and shows how to cluster with DBSCAN and then label new data with the clustering model.
    Keywords: Clustering, Cluster, k-means, k-medoids, DBSCAN, Partitioning, Hierarchical clustering, Density-based clustering
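
A minimal sketch of k-means clustering on the iris data, using only `kmeans()` from base R. The choice of three centers, the seed, and `nstart = 10` are assumptions made for this illustration and are not necessarily the settings used in the chapter.

```r
# Cluster the four numeric iris measurements; the Species column is held out.
iris_num <- iris[, 1:4]

set.seed(42)   # k-means starts from randomly chosen centers
km <- kmeans(iris_num, centers = 3, nstart = 10)

# Cross-tabulate the known species against the cluster assignments.
print(table(iris$Species, km$cluster))
```

Cluster labels are arbitrary, so the table is read by looking at which cluster dominates each species row rather than by matching label numbers.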

    7 Outlier Detection
    Abstract: This chapter presents examples of outlier detection with R. First, it demonstrates univariate outlier detection. After that, an example of outlier detection with LOF (Local Outlier Factor) is given, followed by examples of outlier detection by clustering. Finally, it demonstrates outlier detection from time series data.
    Keywords: Outlier, Anomaly, Extreme value, Outlier detection, LOF
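
One common way to do univariate outlier detection in base R is the boxplot rule, sketched below with `boxplot.stats()`. The data vector is synthetic, with a single planted extreme value, and is only an assumption for illustration.

```r
set.seed(1)
x <- c(rnorm(100), 10)   # hypothetical data with one planted extreme value

# Values beyond 1.5 * IQR from the hinges are reported in $out.
out <- boxplot.stats(x)$out

# Positions of the flagged values in the original vector.
idx <- which(x %in% out)
```

The multiplier 1.5 is the default `coef` argument of `boxplot.stats()`; raising it makes the rule more conservative.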

    8 Time Series Analysis and Mining
    Abstract: This chapter presents examples of time series decomposition, forecasting, clustering, and classification. The first section briefly introduces time series data in R. The second section shows an example of decomposing time series into trend, seasonal, and random components. The third section presents how to build an autoregressive integrated moving average (ARIMA) model in R and use it to predict future values. The fourth section introduces Dynamic Time Warping (DTW) and hierarchical clustering of time series data with Euclidean distance and with DTW distance. The fifth section shows three examples of time series classification: one with the original data, one with DWT (Discrete Wavelet Transform) transformed data, and one with k-NN classification. The chapter ends with discussions and further readings.
    Keywords: Time series, Decomposition, Forecasting, Time series classification, Time series clustering, Dynamic Time Warping (DTW), Discrete Wavelet Transform (DWT), k-NN classification
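
The decomposition and forecasting steps described above can be sketched with base R alone. The example uses the built-in `AirPassengers` series; the ARIMA orders below are illustrative assumptions, not a tuned model.

```r
# AirPassengers is a monthly time series shipped with base R (datasets package).
parts <- decompose(AirPassengers)
# parts$trend, parts$seasonal and parts$random hold the three components.

# Fit a seasonal ARIMA model; the orders are assumptions for this sketch.
fit <- arima(AirPassengers, order = c(1, 0, 0),
             seasonal = list(order = c(2, 1, 0), period = 12))

# Forecast the next 24 months: $pred holds point forecasts, $se standard errors.
fore <- predict(fit, n.ahead = 24)
```

Calling `plot(parts)` draws the observed series together with the three components, which is a quick check that the seasonal period was picked up correctly.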

    9 Association Rules
    Abstract: This chapter presents examples of association rule mining with R. It starts with basic concepts of association rules, and then demonstrates association rule mining with R. After that, it presents examples of pruning redundant rules and of interpreting and visualizing association rules. The chapter concludes with discussions and recommended readings.
    Keywords: Association rule, Frequent itemset, Interestingness, APRIORI, Redundant rule, Visualization
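
To make the basic interestingness measures concrete, the sketch below computes support, confidence, and lift by hand for a single rule on the `Titanic` table that ships with base R. The rule {Class=2nd, Age=Child} => {Survived=Yes} is chosen here only as an illustration, and no association rule package is used.

```r
# Titanic is a 4-way contingency table; flatten it to one row per cell.
titanic <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq
n <- sum(titanic$Freq)              # total number of people

lhs <- titanic$Class == "2nd" & titanic$Age == "Child"
rhs <- titanic$Survived == "Yes"

# support    = P(lhs and rhs)
# confidence = P(rhs | lhs)
# lift       = confidence / P(rhs)
support    <- sum(titanic$Freq[lhs & rhs]) / n
confidence <- sum(titanic$Freq[lhs & rhs]) / sum(titanic$Freq[lhs])
lift       <- confidence / (sum(titanic$Freq[rhs]) / n)
```

A lift above 1 means the left-hand side makes the right-hand side more likely than it is overall.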

    10 Text Mining
    Abstract: This chapter presents examples of text mining with R. Twitter text of @RDataMining is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a document-term matrix. After that, frequent words and associations are found from the matrix. A word cloud is used to present important words in the documents. Finally, words and tweets are clustered to find groups of words and groups of tweets.
    Keywords: Text mining, Text clustering, Word cloud, Twitter, Stop word, Stemming, Document-term matrix

    11 Social Network Analysis
    Abstract: This chapter presents examples of social network analysis with R, specifically with package igraph. The data to analyze is the Twitter text data used in the previous chapter on text mining. In this chapter, we first build a network of terms based on their co-occurrence in the same tweets, and then build a network of tweets based on the terms they share. Finally, we build a two-mode network composed of both terms and tweets. We also demonstrate some tricks for plotting nice network graphs.
    Keywords: Social network, Twitter, Two-mode network, Social network analysis
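
The two one-mode networks described above can both be derived from a term-document matrix by matrix multiplication. The sketch below shows this in base R with a small, made-up matrix; the terms and tweet names are hypothetical, and the construction is one plausible way to obtain the adjacency matrices, not necessarily the chapter's exact code.

```r
# Hypothetical term-document matrix: rows are terms, columns are tweets.
tdm <- matrix(c(1, 1, 0,
                1, 0, 1,
                0, 1, 1,
                1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("r", "mining", "data", "analysis"),
                              c("tweet1", "tweet2", "tweet3")))

tdm[tdm >= 1] <- 1   # reduce counts to presence/absence

# Network of terms: entry [i, j] counts the tweets containing both terms.
term_adj <- tdm %*% t(tdm)
# Network of tweets: entry [i, j] counts the terms shared by two tweets.
tweet_adj <- t(tdm) %*% tdm

# Drop self-links before building a graph.
diag(term_adj) <- 0
diag(tweet_adj) <- 0
```

Either matrix can then be handed to igraph, for example with `graph_from_adjacency_matrix(term_adj, mode = "undirected", weighted = TRUE)`, to obtain a graph object for plotting.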

    12 Case Study I: Analysis and Forecasting of House Price Indices
    Abstract: This chapter presents a case study on analyzing and forecasting House Price Indices (HPI). It demonstrates data import from a CSV file, descriptive analysis of HPI time series data, and decomposition and forecasting of the data.
    Keywords: Time series, Decomposition, Forecasting, Seasonal component

    13 Case Study II: Customer Response Prediction and Profit Optimization
    Abstract: This chapter presents a case study on using decision trees to predict customer response and optimize profit. To improve the customer contact process and maximize profit, decision trees were built with R to model customer contact history and predict the response of customers. The customers can then be prioritized for contact based on the prediction, so that profit can be maximized given a limited amount of time, cost, and human resources.
    Keywords: Decision tree, Prediction, Profit optimization

    14 Case Study III: Predictive Modeling of Big Data with Limited Memory
    Abstract: This chapter shows a case study on building a predictive model with limited memory. Because the training dataset was too large to build decision trees on easily within R, multiple subsets were drawn from it by random sampling, and a decision tree was built for each subset. After that, the variables appearing in any one of the built trees were used for variable selection from the original training dataset to reduce data size. In the scoring process, the scoring dataset was also split into subsets, so that the scoring could be done with limited memory. R code for printing rules in plain English and in SAS format is also presented in this chapter.
    Keywords: Predictive model, Limited memory, Large data, Training, Scoring
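
The sample-then-select procedure in this abstract can be sketched on a small scale as follows. This is only an illustration under stated assumptions: it uses package rpart as a stand-in tree learner and the iris data as a stand-in for the large training set, and the sample count, sample size, and chunk size are arbitrary.

```r
library(rpart)   # stand-in tree learner for this sketch

set.seed(123)
n_samples   <- 5    # number of random subsets (assumed)
sample_size <- 60   # rows per subset (assumed)
used_vars   <- character(0)

for (i in seq_len(n_samples)) {
  rows <- sample(nrow(iris), sample_size)
  tree <- rpart(Species ~ ., data = iris[rows, ])
  # Split variables are stored in tree$frame$var; "<leaf>" marks terminal nodes.
  vars <- setdiff(unique(as.character(tree$frame$var)), "<leaf>")
  used_vars <- union(used_vars, vars)
}

# Keep only the variables that appeared in at least one tree.
reduced    <- iris[, c(used_vars, "Species")]
final_tree <- rpart(Species ~ ., data = reduced)

# Score in chunks of 50 rows so only one chunk is processed at a time.
chunks <- split(seq_len(nrow(reduced)), ceiling(seq_len(nrow(reduced)) / 50))
pred <- unname(unlist(lapply(chunks, function(idx) {
  as.character(predict(final_tree, reduced[idx, ], type = "class"))
})))
```

With genuinely large data, each chunk would be read from disk inside the loop rather than indexed from an in-memory data frame.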

    15 Online Resources
    Abstract: This chapter presents links to online resources on R and data mining, including books, documents, tutorials, and slides.
    Keywords: Resource, Document, Tutorial, Slides