Outlier Detection

This page shows an example of outlier detection with the LOF (Local Outlier Factor) algorithm.

The LOF algorithm

LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [Breunig et al., 2000]. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (giving an LOF value greater than one), the point lies in a sparser region than its neighbors, which suggests that it is an outlier.
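The density comparison can be made concrete with a small base-R sketch. It follows the definitions of Breunig et al. (k-distance, reachability distance, local reachability density) on a toy data set. The names lof_sketch and toy are introduced here for illustration only, and the code is not the implementation used by any of the packages mentioned on this page. It assumes Euclidean distance and data without large groups of exact duplicates.

```r
# Minimal LOF sketch in base R (illustrative; lof_sketch is a hypothetical name).
lof_sketch <- function(data, k) {
  d <- as.matrix(dist(data))            # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf                        # a point is not its own neighbor

  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- apply(d, 1, function(row) sort(row)[k])

  # neighborhood: every point within the k-distance (ties are included)
  nbrs <- lapply(seq_len(n), function(i) unname(which(d[i, ] <= kdist[i])))

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- sapply(seq_len(n), function(i) {
    o <- nbrs[[i]]
    1 / mean(pmax(kdist[o], d[i, o]))
  })

  # LOF: mean density of the neighbors divided by the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

# Toy data: a regular 5 x 5 grid plus one isolated point at (10, 10)
toy <- rbind(expand.grid(x = 0:4, y = 0:4), data.frame(x = 10, y = 10))
scores <- lof_sketch(toy, k = 3)

print(which.max(scores))                # index of the most outlying point
print(round(scores[nrow(toy)], 2))      # score of the isolated point
```

Points inside the grid have the same density as their neighbors and score close to one, whereas the isolated point sits in a much sparser region than its nearest grid neighbors and therefore receives a score well above one.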

The function lofactor(data, k) in the packages DMwR and dprep calculates local outlier factors using the LOF algorithm, where k is the number of neighbors used in the calculation.

Calculate Outlier Scores

> library(DMwR)

> # remove "Species", which is a categorical column

> iris2 <- iris[,1:4]

> outlier.scores <- lofactor(iris2, k=5)

> plot(density(outlier.scores))

> # pick top 5 as outliers

> outliers <- order(outlier.scores, decreasing=TRUE)[1:5]

> # which observations are the outliers

> print(outliers)

[1] 42 107 23 110 63
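Note that order() returns row indices rather than the scores themselves. The short sketch below illustrates this with a made-up score vector; the values are invented for illustration and are not LOF scores of the iris data.

```r
# Hypothetical scores, invented for illustration only
scores <- c(1.02, 0.98, 2.41, 1.10, 1.75, 0.95, 3.08, 1.01)

# Indices of the three largest scores, largest first
top <- order(scores, decreasing = TRUE)[1:3]
print(top)           # 7 3 5

# The scores themselves are recovered by indexing with those positions
print(scores[top])   # 3.08 2.41 1.75
```

The same indexing, applied to outlier.scores and outliers, gives the LOF values of the five flagged observations.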

Visualize Outliers with Plots

Next, we show outliers with a biplot of the first two principal components.

> n <- nrow(iris2)

> labels <- 1:n

> labels[-outliers] <- "."

> biplot(prcomp(iris2), cex=.8, xlabs=labels)
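To check where the flagged observations fall, their coordinates on the first two principal components can be read directly from the prcomp() result. The sketch below uses only base R and hard-codes the five indices printed above. The names pc and pc.outliers are introduced here for illustration.

```r
# Sketch: principal component scores of the flagged observations.
iris2 <- iris[, 1:4]                    # numeric columns only
outliers <- c(42, 107, 23, 110, 63)     # indices copied from the output above

pc <- prcomp(iris2)                     # the same PCA that biplot() draws
pc.outliers <- pc$x[outliers, 1:2]      # scores on PC1 and PC2
print(pc.outliers)

# Label vector as used for the biplot: row number for outliers, "." otherwise
labels <- as.character(seq_len(nrow(iris2)))
labels[-outliers] <- "."
print(table(labels == "."))
```

biplot() applies its own scaling to the scores by default, so these numbers need not match the plot axes exactly, but the relative positions of the points are the same.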

We can also show outliers with a pairs plot as below, where outliers are labeled with "+" in red.

> pch <- rep(".", n)

> pch[outliers] <- "+"

> col <- rep("black", n)

> col[outliers] <- "red"

> pairs(iris2, pch=pch, col=col)

Parallel Computation of LOF Scores

Package Rlof provides the function lof(), a parallel implementation of the LOF algorithm. Its usage is similar to that of lofactor() above, but lof() additionally supports multiple values of k and several choices of distance metric. Below is an example of lof().

> library(Rlof)

> outlier.scores <- lof(iris2, k=5)

> # try different numbers of neighbors (k = 5, 6, 7, 8, 9 and 10)

> outlier.scores <- lof(iris2, k=5:10)
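When several values of k are passed, the Rlof documentation describes the result as a matrix with one row per observation and one column per value of k. One simple way to reduce such a matrix to a single score per observation is to take the row-wise maximum, a heuristic suggested in the original LOF paper. The sketch below uses a small made-up matrix (score.matrix, with invented values) in place of real lof() output, so it runs without Rlof installed.

```r
# Hypothetical score matrix: 4 observations, k = 5, 6, 7 (values are invented)
score.matrix <- matrix(
  c(1.01, 1.03, 0.99,
    2.10, 1.80, 1.95,
    0.97, 1.00, 1.02,
    1.40, 1.65, 1.50),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("5", "6", "7"))
)

# One score per observation: the largest LOF value over all k
max.scores <- apply(score.matrix, 1, max)

# Observations ranked from most to least outlying
ranked <- order(max.scores, decreasing = TRUE)
print(ranked)        # 2 4 1 3
```

Other summaries, such as the row-wise mean, are equally easy to compute. Which one is appropriate depends on the data.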