1 Introduction 1.1 Data Mining 1.2 R 1.3 Datasets 1.3.1 The Iris Dataset 1.3.2 The Bodyfat Dataset 2 Data Import and Export 2.1 Save and Load R Data 2.2 Import from and Export to .CSV Files 2.3 Import Data from SAS 2.4 Import/Export via ODBC 2.4.1 Read from Databases 2.4.2 Output to and Input from EXCEL Files 3 Data Exploration 3.1 Have a Look at Data 3.2 Explore Individual Variables 3.3 Explore Multiple Variables 3.4 More Explorations 3.5 Save Charts into Files 4 Decision Trees and Random Forest 4.1 Decision Trees with Package party 4.2 Decision Trees with Package rpart 4.3 Random Forest 5 Regression 5.1 Linear Regression 5.2 Logistic Regression 5.3 Generalized Linear Regression 5.4 Non-linear Regression 6 Clustering 6.1 The k-Means Clustering 6.2 The k-Medoids Clustering 6.3 Hierarchical Clustering 6.4 Density-based Clustering 7 Outlier Detection 7.1 Univariate Outlier Detection 7.2 Outlier Detection with LOF 7.3 Outlier Detection by Clustering 7.4 Outlier Detection from Time Series 7.5 Discussions 8 Time Series Analysis and Mining 8.1 Time Series Data in R 8.2 Time Series Decomposition 8.3 Time Series Forecasting 8.4 Time Series Clustering 8.4.1 Dynamic Time Warping 8.4.2 Synthetic Control Chart Time Series Data 8.4.3 Hierarchical Clustering with Euclidean Distance 8.4.4 Hierarchical Clustering with DTW Distance 8.5 Time Series Classification 8.5.1 Classification with Original Data 8.5.2 Classification with Extracted Features 8.5.3 k-NN Classification 8.6 Discussions 8.7 Further Readings 9 Association Rules 9.1 Basics of Association Rules 9.2 The Titanic Dataset 9.3 Association Rule Mining 9.4 Removing Redundancy 9.5 Interpreting Rules 9.6 Visualizing Association Rules 9.7 Discussions and Further Readings 10 Text Mining 10.1 Retrieving Text from Twitter 10.2 Transforming Text 10.3 Stemming Words 10.4 Building a Term-Document Matrix 10.5 Frequent Terms and Associations 10.6 Word Cloud 10.7 Clustering Words 10.8 Clustering Tweets 10.8.1 Clustering Tweets with the 
k-means Algorithm 10.8.2 Clustering Tweets with the k-medoids Algorithm 10.9 Packages, Further Readings and Discussions 11 Social Network Analysis 11.1 Network of Terms 11.2 Network of Tweets 11.3 Two-Mode Network 11.4 Discussions and Further Readings 12 Case Study I: Analysis and Forecasting of House Price Indices 12.1 Importing HPI Data 12.2 Exploration of HPI Data 12.3 Trend and Seasonal Components of HPI 12.4 HPI Forecasting 12.5 The Estimated Price of a Property 12.6 Discussion 13 Case Study II: Customer Response Prediction and Profit Optimization 13.1 Introduction 13.2 The Data of KDD Cup 1998 13.3 Data Exploration 13.4 Training Decision Trees 13.5 Model Evaluation 13.6 Selecting the Best Tree 13.7 Scoring 13.8 Discussions and Conclusions 14 Case Study III: Predictive Modeling of Big Data with Limited Memory 14.1 Introduction 14.2 Methodology 14.3 Data and Variables 14.4 Random Forest 14.5 Memory Issue 14.6 Train Models on Sample Data 14.7 Build Models with Selected Variables 14.8 Scoring 14.9 Print Rules 14.9.1 Print Rules in Text 14.9.2 Print Rules for Scoring with SAS 14.10 Conclusions and Discussion 15 Online Resources 15.1 R Reference Cards 15.2 R 15.3 Data Mining 15.4 Data Mining with R 15.5 Classification/Prediction with R 15.6 Time Series Analysis with R 15.7 Association Rule Mining with R 15.8 Spatial Data Analysis with R 15.9 Text Mining with R 15.10 Social Network Analysis with R 15.11 Data Cleansing and Transformation with R 15.12 Big Data and Parallel Computing with R R Reference Card for Data Mining Bibliography General Index Package Index Function Index | 1 Introduction Abstract: This chapter introduces basic concepts and techniques for data mining, including a data mining process and popular data mining techniques. It also presents R and its packages, functions, and task views for data mining. Finally, some of the datasets used in this book are described. 
Keywords: R, Data mining, Data mining process, Package, Dataset 2 Data Import and Export Abstract: This chapter shows how to import foreign data into R and export R objects to other formats. First, examples demonstrate saving R objects to and loading them from .Rdata files. After that, it demonstrates importing data from and exporting data to .CSV files, SAS databases, ODBC databases, and EXCEL files. Keywords: Data import, Data export, Foreign data, CSV, SAS 3 Data Exploration Abstract: This chapter shows examples of data exploration in R. It starts with inspecting the dimensionality, structure, and data of an R object, followed by basic statistics and various charts such as pie charts and histograms. Exploration of multiple variables is then demonstrated, including grouped distributions, grouped boxplots, scatter plots, and pairs plots. After that, examples are given of level plots, contour plots, and 3D plots. It also shows how to save charts into files of various formats. Keywords: Data exploration, Statistics, Graphic, Chart, Distribution 4 Decision Trees and Random Forest Abstract: This chapter shows how to build predictive models with packages party, rpart, and randomForest. It starts with building decision trees with package party and using the built tree for classification, followed by another way to build decision trees with package rpart. After that, it presents an example of training a random forest model with package randomForest. Keywords: Decision tree, Random forest, Classification, Prediction, Predictive model 5 Regression Abstract: This chapter introduces basic concepts and presents examples of various regression techniques. First, it shows an example of building a linear regression model to predict CPI data. After that, it introduces logistic regression. The generalized linear model (GLM) is then presented, followed by a brief introduction to non-linear regression. 
Keywords: Regression, Prediction, Linear regression, Logistic regression, Generalized linear model, Non-linear regression 6 Clustering Abstract: This chapter presents examples of various clustering techniques in R, including k-means clustering, k-medoids clustering, hierarchical clustering, and density-based clustering. The first two sections demonstrate how to use the k-means and k-medoids algorithms to cluster the iris data. The third section shows an example of hierarchical clustering on the same data. The last section describes the idea of density-based clustering and the DBSCAN algorithm, and shows how to cluster with DBSCAN and then label new data with the clustering model. Keywords: Clustering, Cluster, k-means, k-medoids, DBSCAN, Partitioning, Hierarchical clustering, Density-based clustering 7 Outlier Detection Abstract: This chapter presents examples of outlier detection with R. First, it demonstrates univariate outlier detection. After that, an example of outlier detection with LOF (Local Outlier Factor) is given, followed by examples of outlier detection by clustering. Finally, it demonstrates outlier detection from time series data. Keywords: Outlier, Anomaly, Extreme value, Outlier detection, LOF 8 Time Series Analysis and Mining Abstract: This chapter presents examples of time series decomposition, forecasting, clustering, and classification. The first section briefly introduces time series data in R. The second section shows an example of decomposing time series into trend, seasonal, and random components. The third section shows how to build an autoregressive integrated moving average (ARIMA) model in R and use it to predict future values. The fourth section introduces Dynamic Time Warping (DTW) and hierarchical clustering of time series data with Euclidean distance and with DTW distance. 
The fifth section shows three examples of time series classification: one with the original data, one with DWT (Discrete Wavelet Transform) transformed data, and one with k-NN classification. The chapter ends with discussions and further readings. Keywords: Time series, Decomposition, Forecasting, Time series classification, Time series clustering, Dynamic Time Warping (DTW), Discrete Wavelet Transform (DWT), k-NN classification 9 Association Rules Abstract: This chapter presents examples of association rule mining with R. It starts with basic concepts of association rules, and then demonstrates association rule mining with R. After that, it presents examples of pruning redundant rules and of interpreting and visualizing association rules. The chapter concludes with discussions and recommended readings. Keywords: Association rule, Frequent itemset, Interestingness, APRIORI, Redundant rule, Visualization 10 Text Mining Abstract: This chapter presents examples of text mining with R. Twitter text from @RDataMining is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a document-term matrix. After that, frequent words and associations are found in the matrix. A word cloud is used to present important words in documents. Finally, words and tweets are clustered to find groups of words and groups of tweets. Keywords: Text mining, Text clustering, Word cloud, Twitter, Stop word, Stemming, Document-term matrix 11 Social Network Analysis Abstract: This chapter presents examples of social network analysis with R, specifically with package igraph. The data to analyze is the Twitter text data used in the previous chapter on text mining. In this chapter, we first build a network of terms based on their co-occurrence in the same tweets, and then build a network of tweets based on the terms shared by them. Finally, we build a two-mode network composed of both terms and tweets. 
We also demonstrate some tricks for plotting nice network graphs. Keywords: Social network, Twitter, Two-mode network, Social network analysis 12 Case Study I: Analysis and Forecasting of House Price Indices Abstract: This chapter presents a case study on analyzing and forecasting House Price Indices (HPI). It demonstrates data import from a CSV file, descriptive analysis of HPI time series data, and decomposition and forecasting of the data. Keywords: Time series, Decomposition, Forecasting, Seasonal component 13 Case Study II: Customer Response Prediction and Profit Optimization Abstract: This chapter presents a case study on using decision trees to predict customer response and optimize profit. To improve the customer contact process and maximize profit, decision trees were built with R to model customer contact history and predict the response of customers. The customers can then be prioritized for contact based on the prediction, so that profit can be maximized given a limited amount of time, cost, and human resources. Keywords: Decision tree, Prediction, Profit optimization 14 Case Study III: Predictive Modeling of Big Data with Limited Memory Abstract: This chapter shows a case study on building a predictive model with limited memory. Because the training dataset was large and decision trees could not easily be built on it within R, multiple subsets were drawn from it by random sampling, and a decision tree was built for each subset. After that, the variables appearing in any of the built trees were selected from the original training dataset to reduce the data size. In the scoring process, the scoring dataset was also split into subsets, so that the scoring could be done with limited memory. R code for printing rules in plain English and in SAS format is also presented in this chapter. 
Keywords: Predictive model, Limited memory, Large data, Training, Scoring 15 Online Resources Abstract: This chapter presents links to online resources on R and data mining, including books, documents, tutorials, and slides. Keywords: Resource, Document, Tutorial, Slides |