Frequently Asked Questions

Here are FAQs for the book R and Data Mining: Examples and Case Studies. If you have any questions or comments, or come across any problems with the book, please feel free to post them to the RDataMining group or email them to me. Thanks.

There are errors when using tm v0.6 for text cleaning and/or stemming

posted May 23, 2015, 9:06 AM by Yanchang Zhao

I have updated the slides on text mining with R for tm v0.6. Please see the solutions at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf.

Remove URLs from text

posted Mar 24, 2015, 12:55 PM by Yanchang Zhao

Q: The function below does not remove URLs completely.

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

A:
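Use the code below, where "[:alnum:]" matches any alphanumeric character (letters and digits) and "[:punct:]" matches punctuation characters. The original pattern stops at the first non-alphanumeric character, such as the ":" right after "http", so the rest of the URL is left behind. For details, run "?regex" in R or search for "regular expression".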

removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)

If the URL contains non-ASCII characters, you can use the function below, which removes any string that starts with "http" and is followed by any number of non-space characters.

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
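As a quick check, the three patterns above can be compared on one sample string. This is only a minimal sketch: the sample text, its URL, and the names removeURL1 to removeURL3 are made up for illustration.

```r
# Hypothetical sample text; the URL is invented for this example.
x <- "Slides at http://www.example.com/docs/a-b.pdf and more text"

# The three patterns from the question and the answer above.
removeURL1 <- function(x) gsub("http[[:alnum:]]*", "", x)
removeURL2 <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
removeURL3 <- function(x) gsub("http[^[:space:]]*", "", x)

# Pattern 1 stops at ":" and removes only "http".
removeURL1(x)  # "Slides at ://www.example.com/docs/a-b.pdf and more text"

# Patterns 2 and 3 remove the whole URL; the two spaces around it remain.
removeURL2(x)  # "Slides at  and more text"
removeURL3(x)  # "Slides at  and more text"
```

Note that all three patterns match any token starting with "http", not only well-formed URLs.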

Where to find the bodyfat dataset?

posted Sep 12, 2014, 5:49 AM by Yanchang Zhao

The data is now provided in package TH.data and is no longer in package mboost.

To use the data, run the code below.

data("bodyfat", package="TH.data")

Error in text mining: no applicable method for '***' applied to an object of class "character"

posted Aug 17, 2014, 4:12 PM by Yanchang Zhao

Q: When trying the code in the text mining chapter, some readers came across this error:
no applicable method for '***' applied to an object of class "character".

A: It is caused by changes in package tm between v0.5-10 and v0.6. Some functions need to be wrapped with "content_transformer", which is new in tm v0.6. See the solutions at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf.
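For illustration, wrapping a character function with content_transformer might look like the sketch below. It assumes tm v0.6 or later is installed; the two sample documents and the URL in them are made up, and the slides linked above remain the reference for the book's own code.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only.
docs <- c("Text Mining with R http://www.example.com/page",
          "R and Data Mining")
myCorpus <- VCorpus(VectorSource(docs))

# In tm v0.6, plain character functions such as tolower are wrapped
# with content_transformer() before being passed to tm_map().
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

# A user-defined function is wrapped in the same way.
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))

# The space that preceded the URL is left at the end of the text.
content(myCorpus[[1]])  # "text mining with r "
```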

Another way is to use tm v0.5-10, with which all the original code runs without any changes. Package tm v0.5-10 can be installed with the code below. Note that the URL points to a Windows binary built for R 3.0.
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip")


Stem completion does not work in section 10.3 - stemming words

posted Aug 6, 2014, 1:46 PM by Yanchang Zhao

Some readers had problems with the stemCompletion step in section 10.3 - stemming words. It is a bit tricky to set it up correctly.

Please check that Java is installed, as well as the R packages Snowball, RWeka, rJava and RWekajars, as suggested at the beginning of section 10.3. If one or more of them are missing, the stemCompletion code may still run without error, but it will not actually do any stem completion.
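One possible way to run this check from within R is sketched below. The variable names are made up for illustration, and looking for a "java" executable on the PATH is only an assumption about how Java is installed on your machine, not a definitive test.

```r
# Packages named at the beginning of section 10.3.
pkgs <- c("Snowball", "RWeka", "rJava", "RWekajars")

# TRUE for each package found in the local R library.
installed <- setNames(pkgs %in% rownames(installed.packages()), pkgs)
print(installed)

# Assumption: a "java" executable on the PATH indicates that Java is installed.
hasJava <- unname(nzchar(Sys.which("java")))
print(hasJava)

# List any packages that still need to be installed.
missingPkgs <- pkgs[!installed]
if (length(missingPkgs) > 0) {
  message("Missing packages: ", paste(missingPkgs, collapse = ", "))
}
```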

If it still does not work, I suggest skipping stem completion, unless it is absolutely necessary for your work, so that you can go ahead with text mining.

Error in converting into Boolean matrix for social network analysis in chapter 11

posted Aug 6, 2014, 1:26 PM by Yanchang Zhao

Q: At the beginning of Chapter 11, social network analysis, the code
> # change it to a Boolean matrix
> termDocMatrix[termDocMatrix>=1] <- 1

produced the error: `[<-.simple_sparse_array`(as.simple_sparse_array(x), …, value = value): Only numeric subscripting is implemented.

A: The "termDocMatrix" is a normal matrix, not a term-document matrix created with package tm. Sorry for the misleading statement at the beginning of section 11.1 saying that "termDocMatrix" is "a term-document matrix".

Referring to section 10.7, "myTdm2" is a term-document matrix, and then it is converted into a normal matrix "m2" with as.matrix(). "termDocMatrix" is a copy of "m2", which is used as input in section 11.1 for social network analysis.
m2 <- as.matrix(myTdm2)

Therefore, to use your own data of class term-document matrix, you need to convert it with as.matrix() first, before running code for social network analysis in chapter 11.
yourMatrix <- as.matrix(yourTermDocMatrix)
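The Boolean conversion from the question works as expected once the input is a normal matrix. The sketch below uses a small made-up matrix of term counts as a stand-in for the result of as.matrix(); the terms and counts are hypothetical.

```r
# Hypothetical term counts, standing in for as.matrix(yourTermDocMatrix).
termDocMatrix <- matrix(c(2, 0, 1,
                          0, 3, 1),
                        nrow = 3,
                        dimnames = list(Terms = c("data", "mining", "r"),
                                        Docs = c("1", "2")))

# Logical subscripting is supported on a normal matrix.
stopifnot(is.matrix(termDocMatrix))

# change it to a Boolean matrix
termDocMatrix[termDocMatrix >= 1] <- 1
print(termDocMatrix)
```

Every count of 1 or more becomes 1 and zeros stay as they are, so the matrix records only whether a term occurs in a document.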

Where to download KDD Cup 1998 Data

posted Jul 23, 2013, 4:16 AM by Yanchang Zhao

The data should be provided at http://www.kdd.org/kdd-cup-1998-direct-marketing-profit-optimization.

If you have problems downloading from the above link, you can try another website: http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html.

Where to find data used in the book?

posted Jul 18, 2013, 6:09 AM by Yanchang Zhao

All code and data used in the book are provided at this page.

Problem with the twitteR package

posted Apr 25, 2013, 4:21 AM by Yanchang Zhao

Q: When using package twitteR to fetch data from Twitter, there is an error as below.

> rdmTweets <- userTimeline(user, n=100)

Error: twInterfaceObj$doAPICall(cmd, params, method, ...) :
  OAuth authentication is required with Twitter's API v1.1

A: This is because the Twitter API has required authentication since March 2013. After authenticating, you can run userTimeline().

For instructions on Twitter authentication, please refer to "Section 3 - Authentication with OAuth" in the twitteR user vignette.
