I'm trying to calculate a rolling mahalanobis distance without resorting to for loops and failing miserably.
Here is an example dataset:
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
I'm trying to calculate, by group (label), the mahalanobis distance of valx, valy, and valz, but only using rows from that date (date) or previous. My current solution is to loop through each label, loop through each date, filter the dataset down to the matching data, calculate distance using stats::mahalanobis, add that distance to a list, and then do.call and rbind them outside the loop*. Clearly this isn't ideal.
I suspect that there's some way to write:
cum.mdist <- function(df, cols) {...}
df %>%
group_by(label) %>%
arrange(date) %>%
mutate(mdist = xapply(., c(valx, valy, valz), cum.mdist)) %>%
ungroup()
in a similar way to calculating a rolling unary function like so:
cumsd <- function(x) sapply(seq_along(x), function(k, z) sd(z[1:k]), z = x)
I could calculate the distance from component parts if there were no covariance (rolling variance variance is simple to calculate using a function like the above one), but I think my variables do have covariance, and I'm not sure how to build a rolling covariance matrix...
Does a solution to this exist outside of for loops?
*code for a looped solution is below:
library("tidyverse")
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
mdist.list <- vector(length = nrow(df), mode = "list")
counter <- 1
for(l in seq_along(unique(df$label))){
label_data <- df %>%
filter(label == unique(df$label)[l])
for(d in seq_along(unique(label_data$date))){
label_date_data <- label_data %>%
filter(date <= unique(label_data$date)[d])
if(nrow(label_date_data) > 3){
label_date_data$mdist <- mahalanobis(label_date_data %>% select(contains("val")),
colMeans(label_date_data %>% select(contains("val"))),
cov(label_date_data %>% select(contains("val"))))
} else{
label_date_data$mdist <- NA
}
mdist.list[[counter]] <- filter(label_date_data,
date == unique(label_data$date)[d])
counter <- counter + 1
}
}
mdist.df <- do.call(rbind, mdist.list)