Let x be a numeric vector of non-negative data (mostly < 10) and let qx <- quantile(x, probs = pq), where length(pq) is typically greater than length(x) * (3/4).
I need a vector of indices into qx, call it q_i, such that x[i] falls in the quantile qx[q_i[i]].
The catch, as the title indicates, is that there may be non-unique values present in qx, e.g. multiple 0-valued quantiles if x is zero-inflated, and potentially other duplicate values. I would like to handle these cases by either (a) recycling the sequence of indices of these equivalent quantiles, or (b) randomly assigning the indices of equivalent quantiles. I think I would prefer option (a), but a solution for either would be useful.
Here is an edit to provide the rules for determining q_i[i] for a particular x[i]:
Consider that qx has one or more runs of duplicate values, i.e. for some j and some n >= 1 there is a run qx[j:(j + n)] where qx[j] == qx[j + 1] == ... == qx[j + n] < qx[j + n + 1]. Let k = c(j, j + 1, ..., j + n). Then q_i[i] <- k[r] where qx[j] <= x[i] <= qx[j + n + 1] if j == 1, or qx[j] < x[i] <= qx[j + n + 1] if j > 1, and where r <- ((m - 1) %% (n + 1)) + 1 such that x[i] is the m-th occurrence in x for which the inequality is satisfied.
NOTE: based on this rule, I realized I omitted a 4 in my original q_i - this has been changed.
NOTE: @hodgenovice brought up a good point regarding special cases where data values that are strictly smaller than two quantiles may be grouped into the "bin" between two such quantiles. I am not particularly concerned with the special case because, if for example there were no duplicate quantiles but we had the same quantile values, those special cases would correctly be binned together.
I suspect there is an efficient way to do this. I have essentially done it with a for loop, but I am looking for a vectorized approach.
I started with cut(), which of course doesn't allow non-unique breaks. I found this question here, which helped in that I discovered the .bincode() function, which does allow non-unique breaks. However, it has no rule for "distributing" the indices: all values tied at a duplicated quantile value get the same single index.
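To make the contrast concrete, here is a small sketch on toy data. The vectors x_toy and br_toy are made up for illustration and are not the data from this question. It shows cut() stopping on duplicated breaks, and .bincode() accepting them while leaving a zero-width bin empty.

```r
x_toy  <- c(0, 0, 0, 1, 2, 5)
br_toy <- c(0, 0, 0, 1, 5)   # the break 0 appears three times

# cut() rejects duplicated breaks; capture the error message instead of stopping.
cut_msg <- tryCatch(
  as.character(cut(x_toy, br_toy, include.lowest = TRUE)),
  error = function(e) conditionMessage(e)
)
cut_msg

# .bincode() accepts the same breaks and returns integer bin codes.
bins <- .bincode(x_toy, br_toy, include.lowest = TRUE)
bins

# Count how many values landed in each of the 4 bins; the second bin,
# (0, 0], cannot contain anything.
tabulate(bins, nbins = length(br_toy) - 1)
```

All three zeros end up in the first bin, which is the behaviour I would like to spread across the tied bins.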
Some example code for this problem:
x <- c(5.8, 0.0, 16.1, 5.8, 3.5, 13.8, 6.9, 5.8, 11.5, 9.2, 11.5,
       3.5, 0.0, 8.1, 0.0, 4.6, 5.8, 3.5, 0.0, 10.3, 0.0, 0.0,
       3.5, 6.9, 3.5)
pq <- seq(0, 1, length.out = 20)
qx <- quantile(x, pq)
# quantiles for reference, rounded for readability
round(as.numeric(qx), 2)
[1] 0.00 0.00 0.00 0.00 0.18 3.50 3.50 3.50 3.62 5.04 5.80 5.80 5.97
[14] 6.90 7.72 9.14 10.55 11.50 13.19 16.10
q_i <- .bincode(x, qx, include.lowest = TRUE)
q_i
[1] 10 1 19 10 5 19 13 10 17 16 17 5 1 15 1 9 10 5 1 16 1 1 5 13 5
Here are the results I would be looking for, if .bincode() was magic and I could talk it into doing what I need:
Under scenario (a) above:
(I edited this too, as I was originally missing a value of 4)
q_i
[1] 10 1 19 11 5 19 13 10 17 16 17 6 2 15 3 9 11 7 4 16 1 2 5 13 6
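A minimal sketch of one vectorized way to arrive at this vector, under assumptions that go beyond what .bincode() itself provides: only values exactly equal to a duplicated break are spread out, the spread starts at the bin .bincode() returns, and its width is the number of breaks tied at that value. The helper name spread_ties_cycle is hypothetical, and the sketch assumes x has no NA values.

```r
x <- c(5.8, 0.0, 16.1, 5.8, 3.5, 13.8, 6.9, 5.8, 11.5, 9.2, 11.5,
       3.5, 0.0, 8.1, 0.0, 4.6, 5.8, 3.5, 0.0, 10.3, 0.0, 0.0,
       3.5, 6.9, 3.5)
qx <- quantile(x, seq(0, 1, length.out = 20))

# Hypothetical helper (not part of base R): cycle tied values over tied bins.
spread_ties_cycle <- function(x, breaks) {
  breaks <- as.numeric(breaks)
  # Starting bin for every value, exactly as .bincode() assigns it.
  base <- .bincode(x, breaks, include.lowest = TRUE)
  # Number of breaks tied at each value of x. Values that are not themselves
  # a break get width 1 (assumption: they stay where .bincode() put them).
  u <- unique(breaks)
  run_len <- tabulate(match(breaks, u), nbins = length(u))
  width <- run_len[match(x, u)]
  width[is.na(width)] <- 1L
  # m = running occurrence count of each distinct value, in order of appearance.
  m <- ave(seq_along(x), x, FUN = seq_along)
  # Cycle through base, base + 1, ..., base + width - 1.
  base + (m - 1L) %% width
}

q_i_a <- spread_ties_cycle(x, qx)
q_i_a
```

On the example data this should reproduce the scenario (a) vector above. It relies on exact == matches between x and qx, so it would need a tolerance if the breaks came from somewhere other than quantile() on the same data.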
Under scenario (b), it could, with low probability, look the same as directly above, or it could look something like:
q_i
[1] 10 1 19 10 6 19 13 11 17 16 17 5 3 15 2 9 11 6 2 16 1 4 5 13 7
Note that here the full recycled index vector for each run of "equivalent" qx values (e.g. c(1, 2, 3, 4, 1, 2) for the six zeros) is essentially sampled without replacement, i.e. randomly permuted.
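A rough sketch of how scenario (b) might be vectorized, under the same kind of assumptions as for cycling: only values exactly equal to a duplicated break are spread, starting at the bin .bincode() returns, over as many bins as there are breaks tied at that value. The only difference is that the occurrence numbers are shuffled within each group of identical values. The helper name spread_ties_random and the seed are hypothetical choices for illustration.

```r
x <- c(5.8, 0.0, 16.1, 5.8, 3.5, 13.8, 6.9, 5.8, 11.5, 9.2, 11.5,
       3.5, 0.0, 8.1, 0.0, 4.6, 5.8, 3.5, 0.0, 10.3, 0.0, 0.0,
       3.5, 6.9, 3.5)
qx <- as.numeric(quantile(x, seq(0, 1, length.out = 20)))

# Hypothetical helper: recycled bin indices, permuted within each tied group.
spread_ties_random <- function(x, breaks) {
  # Starting bin for every value, exactly as .bincode() assigns it.
  base <- .bincode(x, breaks, include.lowest = TRUE)
  # Number of breaks tied at each value of x (1 if x is not itself a break).
  u <- unique(breaks)
  run_len <- tabulate(match(breaks, u), nbins = length(u))
  width <- run_len[match(x, u)]
  width[is.na(width)] <- 1L
  # Randomly permute the occurrence numbers 1..g within each group of size g.
  m <- ave(seq_along(x), x, FUN = function(i) sample.int(length(i)))
  base + (m - 1L) %% width
}

set.seed(1)  # arbitrary seed, only so the result can be reproduced
q_i_b <- spread_ties_random(x, qx)
q_i_b
```

Because each group's occurrence numbers are permuted rather than drawn independently, the tied bins are still filled as evenly as under scenario (a); only the order changes.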
Thanks!