Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy.
First, let's dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0] --> [.66, .72, .79, .87, .96], which is explained to be a "progressively increasing value of 10%" (in such a way that the total is the "value to spread": 4.0).
This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...] and then normalized to sum to 1. For example:
r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
We'd like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is a well-established quantity and is:
To generalize, we need 3 quantities to calculate the value of each cell:
a: value to distribute
i: index of run (0 .. n-1)
n: run length
- then, the value is
v = a * r**i * (r - 1) / (r**n - 1).
To illustrate with the first column in the OP's example, where the input is: [1, nan, nan, nan, nan, 4], we would like:
a = [1, 4, 4, 4, 4, 4]
i = [0, 0, 1, 2, 3, 4]
n = [1, 5, 5, 5, 5, 5]
- then, the value
v would be (rounded at 2 decimals): [1. , 0.66, 0.72, 0.79, 0.87, 0.96].
Now comes the part where we go about getting these three quantities as numpy arrays.
a is the easiest and is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
z = df.values
nrows, ncols = z.shape
For i, we start with the cumulative count of NaNs, with reset when values are not NaN. This is strongly inspired by this SO answer for "Cumulative counts in NumPy without iteration". But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:
def rcount(z):
na = np.isnan(z)
without_reset = na.cumsum(axis=0)
reset_at = ~na
overcount = np.maximum.accumulate(without_reset * reset_at)
result = without_reset - overcount
return result
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
For n, we need to do some dancing on our own, using first principles of numpy (I'll break down the steps if I have time):
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
So, putting it all together:
def spread_bfill(df, r=1.1):
z = df.values
nrows, ncols = z.shape
a = df.bfill().values
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
v = a * r**i * (r - 1) / (r**n - 1)
return pd.DataFrame(v, columns=df.columns, index=df.index)
On your example data, we get:
>>> spread_bfill(df).round(2) # round(2) for printing purposes
A B
a b c d e a b c d e
S
2020-10-15 1.00 2.00 0.52 1.21 1.17 10.00 11.00 1.68 3.93 1.68
2020-10-16 0.66 0.98 0.57 1.33 1.28 1.64 0.33 1.85 4.32 1.85
2020-10-17 0.72 1.08 0.63 1.46 1.41 1.80 0.36 2.04 4.75 2.04
2020-10-18 0.79 1.19 0.69 0.30 1.55 1.98 0.40 2.24 1.21 2.24
2020-10-19 0.87 1.31 0.76 0.33 1.71 2.18 0.44 2.47 1.33 2.47
2020-10-20 0.96 1.44 0.83 0.37 1.88 2.40 0.48 2.71 1.46 2.71
For inspection, let's look at each of the 3 quantities in that example:
>>> a
[[ 1 2 4 4 9 10 11 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]]
>>> i
[[0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 1 0 0 1 1 1]
[1 1 2 2 2 1 1 2 2 2]
[2 2 3 0 3 2 2 3 0 3]
[3 3 4 1 4 3 3 4 1 4]
[4 4 5 2 5 4 4 5 2 5]]
>>> n
[[1 1 6 3 6 1 1 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]]
And here is a final example, to illustrate what happens if a column ends with 1 or several NaNs (they remain NaN):
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
>>> df
0 1 2 3 4 5
0 NaN NaN NaN NaN NaN 0.0
1 NaN NaN 9.0 NaN 8.0 NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN 8.0 4.0 NaN NaN NaN
4 NaN NaN NaN 6.0 9.0 NaN
5 NaN NaN 2.0 NaN 7.0 8.0
Then:
>>> spread_bfill(df).round(2) # round(2) for printing
0 1 2 3 4 5
0 NaN 1.72 4.29 0.98 3.81 0.00
1 NaN 1.90 4.71 1.08 4.19 1.31
2 NaN 2.09 1.90 1.19 2.72 1.44
3 NaN 2.29 2.10 1.31 2.99 1.59
4 NaN NaN 0.95 1.44 3.29 1.74
5 NaN NaN 1.05 NaN 7.00 1.92
Speed
a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)