
    Outlier Detection

    This page shows an example of outlier detection with the LOF (Local Outlier Factor) algorithm.

    The LOF Algorithm

    LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [Breunig et al., 2000]. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (giving an LOF value greater than one), the point lies in a sparser region than its neighbors, which suggests that it is an outlier.
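
    To make the density comparison concrete, here is a minimal base R sketch of the LOF computation. The function name lof_sketch and the toy data are hypothetical and introduced only for illustration; this is not the implementation used by any package, and it ignores ties in the k-th nearest-neighbor distance for simplicity.

```r
# Minimal LOF sketch in base R (hypothetical helper, for illustration only).
# Assumes numeric data with more than k rows; ties in distance are not handled.
lof_sketch <- function(data, k) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf               # a point is not its own neighbor
  # indices of the k nearest neighbors, one row per point
  nn <- matrix(unlist(lapply(1:n, function(i) order(d[i, ])[1:k])),
               ncol = k, byrow = TRUE)
  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- d[cbind(1:n, nn[, k])]
  # local reachability density: inverse of the mean reachability distance,
  # where reach(p, o) = max(k-distance(o), d(p, o))
  lrd <- sapply(1:n, function(i) {
    reach <- pmax(kdist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  })
  # LOF: mean density of the neighbors divided by the point's own density
  sapply(1:n, function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

set.seed(1)
# toy data: 30 points in one cluster plus one isolated point (row 31)
toy <- rbind(matrix(rnorm(60), ncol = 2), c(10, 10))
scores <- lof_sketch(toy, k = 5)
which.max(scores)  # should be the isolated point, row 31
```

    Points inside the cluster get scores near one, while the isolated point gets a score well above one because its neighbors sit in a much denser region than it does.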

    The function lofactor(data, k) in packages DMwR and dprep computes local outlier factors with the LOF algorithm, where k is the number of neighbors used in the calculation.

    Calculate Outlier Scores

    > library(DMwR)
    > # remove "Species", which is a categorical column
    > iris2 <- iris[,1:4]
    > outlier.scores <- lofactor(iris2, k=5)
    > plot(density(outlier.scores))

    A density plot of outlier scores

    > # pick the 5 observations with the highest scores as outliers
    > outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
    > # row numbers of the outliers
    > print(outliers)
    [1] 42 107 23 110 63
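
    The selection step can be illustrated on a small vector. The sketch below uses made-up scores (not output from lofactor()) to show how order() returns the positions of the largest values. It also shows an alternative that flags every score above a cutoff; the cutoff of 1.5 is an arbitrary assumption, not a recommended value.

```r
# hypothetical LOF scores for six observations (made-up values)
toy.scores <- c(1.02, 0.98, 2.40, 1.10, 1.75, 0.95)

# top-n selection, as above: positions of the 2 largest scores
top2 <- order(toy.scores, decreasing = TRUE)[1:2]
print(top2)

# alternative: flag every score above a cutoff (1.5 is an assumed value)
flagged <- which(toy.scores > 1.5)
print(flagged)
```

    A fixed top-n always returns n observations even when none is unusual, whereas a cutoff can return none or many, so the choice depends on how the results will be used.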

    Visualize Outliers with Plots

    Next, we show outliers with a biplot of the first two principal components.

    > n <- nrow(iris2)
    > labels <- 1:n
    > labels[-outliers] <- "."
    > biplot(prcomp(iris2), cex=.8, xlabs=labels)

    Visualization of outliers in a biplot of the 1st two principal components

    We can also show outliers with a pairs plot as below, where outliers are labeled with "+" in red.

    > pch <- rep(".", n)
    > pch[outliers] <- "+"
    > col <- rep("black", n)
    > col[outliers] <- "red"
    > pairs(iris2, pch=pch, col=col)

    Visualization of outliers with a matrix of scatter plots

    Parallel Computation of LOF Scores

    Package Rlof provides function lof(), a parallel implementation of the LOF algorithm. Its usage is similar to that of lofactor() above, but lof() has two additional features: it accepts multiple values of k, and it offers several choices of distance metric. Below is an example of lof().

    > library(Rlof)
    > outlier.scores <- lof(iris2, k=5)
    > # try different numbers of neighbors (k = 5, 6, 7, 8, 9 and 10)
    > outlier.scores <- lof(iris2, k=5:10)