Outlier Detection

This page shows an example of outlier detection with the LOF (Local Outlier Factor) algorithm.

The LOF algorithm

LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [Breunig et al., 2000]. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (giving an LOF value greater than one), the point lies in a sparser region than its neighbors, which suggests that it is an outlier.
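
The density comparison described above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions, not the implementation used by any package: the function name lof_sketch is hypothetical, Euclidean distance is assumed, and the data are assumed not to contain k or more exact duplicates of any point (which would make a local density infinite).

```r
# Minimal base-R sketch of LOF (lof_sketch is a hypothetical helper name).
lof_sketch <- function(data, k) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances (assumption)
  n <- nrow(d)
  diag(d) <- Inf               # a point is not its own neighbor

  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- apply(d, 1, function(row) sort(row)[k])

  # k-neighborhood: every point within the k-distance (ties included)
  nbrs <- lapply(seq_len(n), function(i) which(d[i, ] <= kdist[i]))

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- sapply(seq_len(n), function(i) {
    o <- nbrs[[i]]
    1 / mean(pmax(kdist[o], d[i, o]))
  })

  # LOF: mean density of the neighbors divided by the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

iris2 <- iris[, 1:4]
scores <- lof_sketch(iris2, k = 5)
# indices of the five highest scores
head(order(scores, decreasing = TRUE), 5)
```

Points whose neighbors are about as dense as they are score close to one, while a point that is much sparser than its neighbors scores well above one.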

The function lofactor(data, k), available in the packages DMwR and dprep, calculates local outlier factors using the LOF algorithm, where k is the number of neighbors used in the calculation.

Calculate Outlier Scores

> library(DMwR)
> # remove "Species", which is a categorical column
> iris2 <- iris[,1:4]
> outlier.scores <- lofactor(iris2, k=5)
> plot(density(outlier.scores))

A density plot of outlier scores

> # pick top 5 as outliers
> outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
> # who are outliers
> print(outliers)
[1] 42 107 23 110 63

Visualize Outliers with Plots

Next, we show outliers with a biplot of the first two principal components.

> n <- nrow(iris2)
> labels <- 1:n
> labels[-outliers] <- "."
> biplot(prcomp(iris2), cex=.8, xlabs=labels)
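
To see what the biplot is built from, the principal component scores can also be extracted and plotted directly. The following is a minimal sketch using base R only; it assumes the five outlier indices printed earlier on this page and hard-codes them so that it runs on its own. The variable names pca, pc.scores, and var.explained are illustrative.

```r
iris2 <- iris[, 1:4]
# outlier indices as printed earlier on this page (hard-coded assumption)
outliers <- c(42, 107, 23, 110, 63)

pca <- prcomp(iris2)
# coordinates of each observation on the first two principal components
pc.scores <- pca$x[, 1:2]
# proportion of total variance carried by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(pc.scores, pch = ".",
     xlab = sprintf("PC1 (%.1f%% of variance)", 100 * var.explained[1]),
     ylab = sprintf("PC2 (%.1f%% of variance)", 100 * var.explained[2]))
# label only the outliers with their row numbers
text(pc.scores[outliers, , drop = FALSE], labels = outliers, col = "red")
```

Unlike biplot(), this plot omits the variable arrows, which keeps the focus on where the flagged rows fall relative to the rest of the data.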

Visualization of outliers in a biplot of the first two principal components

We can also show outliers with a pairs plot as below, where outliers are labeled with "+" in red.

> pch <- rep(".", n)
> pch[outliers] <- "+"
> col <- rep("black", n)
> col[outliers] <- "red"
> pairs(iris2, pch=pch, col=col)

Visualization of outliers with a matrix of scatter plots

Parallel Computation of LOF Scores

Package Rlof provides the function lof(), a parallel implementation of the LOF algorithm. Its usage is similar to that of lofactor() above, but lof() has two additional features: it accepts multiple values of k, and it offers several choices of distance metric. Below is an example of lof().

> library(Rlof)
> outlier.scores <- lof(iris2, k=5)
> # try different numbers of neighbors (k = 5, 6, 7, 8, 9 and 10)
> outlier.scores <- lof(iris2, k=5:10)
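
When several values of k are passed, the scores need to be combined before points can be ranked. The sketch below assumes that lof() returns a matrix with one row per observation and one column per value of k, as the Rlof documentation describes; taking the maximum score per observation is one heuristic, suggested in the original LOF paper, and not the only reasonable choice. The names scores.by.k and max.scores are illustrative.

```r
library(Rlof)

iris2 <- iris[, 1:4]
# assumed result: a 150-row matrix with one column per value of k
scores.by.k <- lof(iris2, k = 5:10)

# combine the columns by taking each observation's largest score over all k
max.scores <- apply(scores.by.k, 1, max)

# indices of the five observations with the highest combined score
head(order(max.scores, decreasing = TRUE), 5)
```

Using the maximum flags a point if it looks sparse at any of the tested neighborhood sizes; averaging across columns would instead favor points that look sparse consistently.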