Frequently Asked Questions
I have updated the slides on text mining with R for tm v0.6. Please see the solutions at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf.
Q: The function below does not remove URLs completely.
A: Use the code below, where "[:alnum:]" matches any alphanumeric character (letters and digits) and "[:punct:]" matches punctuation characters. For details, run "?regex" in R or search for "regular expression".
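A minimal sketch of such a function is given here. The name removeURL and the sample text are illustrative assumptions, not necessarily the exact code from the book.

```r
# Illustrative helper (name assumed): remove "http" together with any run of
# alphanumeric or punctuation characters that follows it
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)

tweet <- "Slides on text mining http://www.rdatamining.com/docs now online"
cleaned <- removeURL(tweet)
print(cleaned)
```

Note that the spaces on both sides of the removed URL are kept, so a further whitespace clean-up step may be needed.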
If there are non-ASCII characters in the URL, you can use the function below, which removes any string that starts with "http" and is followed by any number of non-space characters.
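A sketch of that variant follows. The function name and the sample URL are assumptions made for illustration; only the idea of matching non-space characters comes from the text above.

```r
# Illustrative helper (name assumed): remove "http" and everything after it
# up to the next whitespace character, whatever those characters are
removeURLNonSpace <- function(x) gsub("http[^[:space:]]*", "", x)

# The URL below contains a non-ASCII character (e with acute accent)
tweet <- "see http://example.com/caf\u00e9 for details"
cleaned <- removeURLNonSpace(tweet)
print(cleaned)
```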
The data is provided in package TH.data, but no longer in package mboost. To use the data, run the code below.
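The loading step might look like the sketch below. The dataset name "bodyfat" is an assumption on my part, since the question is not shown here; replace it with the dataset you need.

```r
# Assumption: the dataset in question is "bodyfat" (not stated in the text above)
if (requireNamespace("TH.data", quietly = TRUE)) {
  # Load the dataset from TH.data without attaching the whole package
  data("bodyfat", package = "TH.data")
  str(bodyfat)
} else {
  message("Package TH.data is not installed; run install.packages(\"TH.data\") first.")
}
```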
Q: When trying the code in the text mining chapter, some readers came across an error:
no applicable method for '***' applied to an object of class "character".
A: It is caused by the changes from tm v0.5-10 to tm v0.6. Some functions need to be wrapped with "content_transformer", which is new in tm v0.6. See the solutions at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf.
Another way is to use tm v0.5-10, with which all the original code runs successfully without any changes. Package tm v0.5-10 can be installed with the code below.
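As a rough illustration of the wrapping, the sketch below applies tolower and a custom function through content_transformer. The two-document corpus and the removeURL helper are made up for this example; see the slides for the full solution.

```r
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # Tiny made-up corpus, for illustration only
  docs <- c("Text Mining with R http://example.com/page", "R and Data Mining")
  myCorpus <- VCorpus(VectorSource(docs))

  # Illustrative custom function working on plain character vectors
  removeURL <- function(x) gsub("http[^[:space:]]*", "", x)

  # In tm v0.6, functions that work on character vectors are wrapped
  # with content_transformer() before being passed to tm_map()
  myCorpus <- tm_map(myCorpus, content_transformer(tolower))
  myCorpus <- tm_map(myCorpus, content_transformer(removeURL))

  print(as.character(myCorpus[[1]]))
}
```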
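One possible way to do this is sketched below. It assumes the source tarball sits at the usual CRAN Archive location and that build tools and the dependencies of tm are already in place; such an old version may not build on recent versions of R. The helper name install_old_tm is hypothetical.

```r
# Assumption: tm 0.5-10 is kept under the standard CRAN Archive path
tm_version <- "0.5-10"
tm_archive_url <- sprintf(
  "https://cran.r-project.org/src/contrib/Archive/tm/tm_%s.tar.gz",
  tm_version
)

# Hypothetical helper: install the archived source package from its URL
install_old_tm <- function(url = tm_archive_url) {
  install.packages(url, repos = NULL, type = "source")
}

# install_old_tm()  # uncomment to actually install
```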
Some readers had problems with the stemCompletion step in section 10.3 - stemming words. It is a bit tricky to set it up correctly.
Please check whether Java is installed, along with the R packages Snowball, RWeka, rJava and RWekajars, as suggested at the beginning of section 10.3. If one or more of them are missing, you may still be able to run the stemCompletion code without error, but it will not actually do any stem completion.
If it still does not work, I suggest skipping stem completion if it is not absolutely necessary for your work, so that you can go ahead with text mining.
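The check on Java and the four packages can be sketched as follows. This only reports what is found on your machine; looking for a "java" executable on the PATH is my simplification and does not prove that rJava can use it.

```r
# Packages named at the beginning of section 10.3
required <- c("Snowball", "RWeka", "rJava", "RWekajars")

# TRUE for each package that is installed (checked without loading it)
installed <- vapply(
  required,
  function(pkg) nzchar(system.file(package = pkg)),
  logical(1)
)
print(installed)

# Simplified check: is a "java" executable available on the PATH?
java_found <- unname(nzchar(Sys.which("java")))
print(java_found)

missing <- required[!installed]
if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
}
```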
Q: At the beginning of Chapter 11, social network analysis, the code produced the error '[<-.simple_sparse_array'(as.simple_sparse_array(x), …, value = value): Only numeric subscripting is implemented.
A: The "termDocMatrix" is a normal matrix, not a term-document matrix created with package tm. Sorry for the misleading statement at the beginning of section 11.1 saying that "termDocMatrix" is "a term-document matrix".
Referring to section 10.7, "myTdm2" is a term-document matrix, and then it is converted into a normal matrix "m2" with as.matrix(). "termDocMatrix" is a copy of "m2", which is used as input in section 11.1 for social network analysis.
Therefore, to use your own data of class term-document matrix, you need to convert it with as.matrix() first, before running code for social network analysis in chapter 11.
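A small sketch of the conversion is shown below. The three toy documents are made up, and the logical-index assignment is my illustration of the kind of operation that fails on a sparse term-document matrix but works on a normal matrix.

```r
if (requireNamespace("tm", quietly = TRUE)) {
  # Toy documents, for illustration only
  docs <- c("r data mining", "text mining with r", "social network analysis")
  corpus <- tm::VCorpus(tm::VectorSource(docs))

  # A term-document matrix as created by package tm (sparse)
  myTdm <- tm::TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

  # Convert to a normal matrix before running the chapter 11 code
  termDocMatrix <- as.matrix(myTdm)

  # Logical subscripting now works, e.g. turning counts into 0/1 values
  termDocMatrix[termDocMatrix >= 1] <- 1

  # Term-by-term co-occurrence matrix
  termMatrix <- termDocMatrix %*% t(termDocMatrix)
  print(dim(termMatrix))
}
```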
The data should be available at http://www.kdd.org/kdd-cup-1998-direct-marketing-profit-optimization.
If you have problems downloading it from the above link, you can try another website: http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html.
Q: When using package twitteR to fetch data from Twitter, there is an error as below.
A: It is because the Twitter API has required authentication since March 2013. After authentication, you can then run userTimeline().
For instructions on Twitter authentication, please refer to "Section 3 - Authentication with OAuth" in the twitteR user vignette.
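For orientation, a sketch of authenticating and then fetching a timeline is given below. It assumes a version of twitteR that provides setup_twitter_oauth() (older versions use a different handshake, as described in the vignette), and the environment variable names, the helper fetch_timeline and the user name are all made up for this example.

```r
# Hypothetical helper: authenticate, then fetch a user's timeline
fetch_timeline <- function(user, n = 100) {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("Package twitteR is not installed.")
  }
  # Assumption: credentials are stored in these (made-up) environment variables
  twitteR::setup_twitter_oauth(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
  # Only call userTimeline() once authentication has been set up
  twitteR::userTimeline(user, n = n)
}

# tweets <- fetch_timeline("rdatamining")  # needs valid credentials
```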