Vectorized Rolling/Cumulative Mahalanobis Distance in R

Question

I'm trying to calculate a rolling mahalanobis distance without resorting to for loops and failing miserably.

Here is an example dataset:

df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
               date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
               valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
               valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
               valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))

I'm trying to calculate, by group (label), the mahalanobis distance of valx, valy, and valz, but only using rows from that date (date) or previous. My current solution is to loop through each label, loop through each date, filter the dataset down to the matching data, calculate distance using stats::mahalanobis, add that distance to a list, and then do.call and rbind them outside the loop*. Clearly this isn't ideal.

I suspect that there's some way to write:

cum.mdist <- function(df, cols) {...}
df %>%
  group_by(label) %>%
  arrange(date) %>%
  mutate(mdist = xapply(., c(valx, valy, valz), cum.mdist)) %>%
  ungroup()

in a similar way to calculating a rolling unary function like so:

cumsd <- function(x) sapply(seq_along(x), function(k, z) sd(z[1:k]), z = x)

I could calculate the distance from component parts if there were no covariance (rolling variance variance is simple to calculate using a function like the above one), but I think my variables do have covariance, and I'm not sure how to build a rolling covariance matrix...

Does a solution to this exist outside of for loops?

*code for a looped solution is below:

library("tidyverse")
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
                 date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
                 valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
                 valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
                 valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
mdist.list <- vector(length = nrow(df), mode = "list")
counter <- 1

for(l in seq_along(unique(df$label))){
  label_data <- df %>%
    filter(label == unique(df$label)[l])

  for(d in seq_along(unique(label_data$date))){
    label_date_data <- label_data %>%
      filter(date <= unique(label_data$date)[d])

    if(nrow(label_date_data) > 3){
      label_date_data$mdist <- mahalanobis(label_date_data %>% select(contains("val")),
                                           colMeans(label_date_data %>% select(contains("val"))),
                                           cov(label_date_data %>% select(contains("val"))))
    } else{
      label_date_data$mdist <- NA
    }

    mdist.list[[counter]] <- filter(label_date_data, 
                                    date == unique(label_data$date)[d])

    counter <- counter + 1
  }
}

mdist.df <- do.call(rbind, mdist.list)

can you show an example of how you would call the mahalanobis function with your df dataset? — chinsoon12, Jan 23 '19 at 01:01
I added the code I'm using for a looped solution to the question, does that help? — Forrest, Jan 23 '19 at 23:04

score 1 · Answer 1 · answered Jan 24 '19 at 00:26

Not sure if I understand your requirements or desired output correctly, here is something using data.table to get you started:

library(data.table)
setDT(df)
df[, mdist := 
    .SD[, transpose(lapply(1L:.N, function(n) {
        ma <- .SD[1L:n]
        ans <- tryCatch(mahalanobis(ma, colMeans(ma), var(ma)), error=function(e) NA)
        ans[length(ans)]            
    })), by=.(label), .SDcols=valx:valz]$V1]

output:

    label       date         valx       valy        valz     mdist
 1:     A 2018-01-01  1.262954285   7.635935  -2.2426789        NA
 2:     A 2018-01-02 -0.326233361  -7.990092   3.7739565        NA
 3:     A 2018-01-03  1.329799263 -11.476570   1.3333636        NA
 4:     A 2018-01-04  1.272429321  -2.894616   8.0418951 2.2500000
 5:     A 2018-01-05  0.414641434  -2.992151  -0.5710677 0.7260652
 6:     B 2018-01-01 -1.539950042  -4.115108  15.1082392        NA
 7:     B 2018-01-02 -0.928567035   2.522234  32.5730809        NA
 8:     B 2018-01-03 -0.294720447  -8.919211 -20.7286152        NA
 9:     B 2018-01-04 -0.005767173   4.356833 -38.5379806 2.2500000
10:     B 2018-01-05  2.404653389 -12.375384   1.4017852 3.0800360

data:

set.seed(0L)
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
    date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
    valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 0, sd = 1)),
    valy = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 10)),
    valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))

I will delete this post if you are looking only for a tidyverse solution.

While I would prefer a `tidyverse` solution, if this massively reduces my runtime I think I can figure out how to apply it to my problem. Is there an issue with leaving this solution if I don't accept it as solved? — Forrest, Jan 24 '19 at 18:11

Vectorized Rolling/Cumulative Mahalanobis Distance in R

1 Answers1