Remove URLs from text

posted Mar 24, 2015, 12:55 PM by Yanchang Zhao
Q: The function below does not remove URLs completely.

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

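Use the code below instead. "[:alnum:]" matches any alphanumeric character (letters and digits), and "[:punct:]" matches punctuation characters. The pattern in the question stops at the first non-alphanumeric character, which is the ":" right after "http" or "https", so the rest of the URL is left behind. For details, run "?regex" in R or search for "regular expression".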

removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
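To see the difference between the two patterns, here is a small sketch. The names removeURL_alnum, removeURL_punct and tweet, and the example.com address, are made up for illustration only.

```r
# Pattern from the question: stops at the first non-alphanumeric character
removeURL_alnum <- function(x) gsub("http[[:alnum:]]*", "", x)

# Pattern from the answer: also consumes punctuation such as ":", "/" and "."
removeURL_punct <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)

# Hypothetical sample text
tweet <- "Read this http://example.com/docs/page.html now"

removeURL_alnum(tweet)  # "Read this ://example.com/docs/page.html now"
removeURL_punct(tweet)  # "Read this  now"
```

Note that the second result keeps the spaces on both sides of the removed URL, so two spaces remain between "this" and "now".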

If a URL contains non-ASCII characters, you can use the function below, which removes any string that starts with "http" and is followed by any number of non-space characters.

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
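A quick check of the non-space pattern on a character vector might look like the sketch below. The name removeURL_nonspace, the vector x and the sample addresses are assumptions for illustration; the accented character is written as the escape "\u00e9" so the code does not depend on the file encoding.

```r
# Same pattern as above, under a hypothetical name
removeURL_nonspace <- function(x) gsub("http[^[:space:]]*", "", x)

# Hypothetical inputs: a URL with a non-ASCII character, no URL, a URL at the end
x <- c("see http://example.com/caf\u00e9 ok",
       "no link here",
       "end https://example.org/a-b_c")

removeURL_nonspace(x)
# "see  ok"  "no link here"  "end "
```

Because the pattern has no word boundary, it removes any run of non-space characters that begins with "http", wherever it appears, so plain words starting with "http" are removed as well.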