Filling missing data using a custom condition in a Pandas time series dataframe

Question

Below is a portion of mydataframe which has many missing values.

            A                                       B
S           a       b       c       d       e       a       b       c       d       e
date                                        
2020-10-15  1.0     2.0     NaN     NaN     NaN     10.0    11.0    NaN     NaN     NaN
2020-10-16  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2020-10-17  NaN     NaN     NaN     4.0     NaN     NaN     NaN     NaN     13.0    NaN
2020-10-18  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2020-10-19  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2020-10-20  4.0     6.0     4.0     1.0     9.0     10.0    2.0     13.0    4.0     13.0

I would like to replace the NANs in each column using a specific backward fill condition .

For example, in column (A,a) missing values appear for dates 16th, 17th, 18th and 19th. The next value is '4' against 20th. I want this value (the next non missing value in the column) to be distributed among all these dates including 20th at a progressively increasing value of 10%. That is column (A,a) gets values of .655, .720,.793,.872 & .96 approximately for the dates 16th, 17th, 18th, 19th & 20th. This shall be the approach for all columns for all missing values across rows.

I tried using bfill() function but unable to fathom how to incorporate the required formula as an option.

I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on stackoverflow. This is somewhat similar, but in my case the the number of NANs in a given column are variable in nature and span multiple rows. Compare columns (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adopt the solution to my problem.

Appreciate any inputs.

Pierre D · Accepted Answer · 2021-04-16T17:52:58.167

Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy.

First, let's dig a bit into the requirements, specifically what exactly the value for each cell needs to be.

The example given is [nan, nan, nan, nan, 4.0] --> [.66, .72, .79, .87, .96], which is explained to be a "progressively increasing value of 10%" (in such a way that the total is the "value to spread": 4.0).

This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...] and then normalized to sum to 1. For example:

r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])

We'd like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is a well-established quantity and is:

To generalize, we need 3 quantities to calculate the value of each cell:

a: value to distribute
i: index of run (0 .. n-1)
n: run length
then, the value is v = a * r**i * (r - 1) / (r**n - 1).

To illustrate with the first column in the OP's example, where the input is: [1, nan, nan, nan, nan, 4], we would like:

a = [1, 4, 4, 4, 4, 4]
i = [0, 0, 1, 2, 3, 4]
n = [1, 5, 5, 5, 5, 5]
then, the value v would be (rounded at 2 decimals): [1. , 0.66, 0.72, 0.79, 0.87, 0.96].

Now comes the part where we go about getting these three quantities as numpy arrays.

a is the easiest and is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:

z = df.values
nrows, ncols = z.shape

For i, we start with the cumulative count of NaNs, with reset when values are not NaN. This is strongly inspired by this SO answer for "Cumulative counts in NumPy without iteration". But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:

def rcount(z):
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    result = without_reset - overcount
    return result

i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]

For n, we need to do some dancing on our own, using first principles of numpy (I'll break down the steps if I have time):

runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]

So, putting it all together:

def spread_bfill(df, r=1.1):
    z = df.values
    nrows, ncols = z.shape

    a = df.bfill().values
    i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
    runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
    n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
    v = a * r**i * (r - 1) / (r**n - 1)
    return pd.DataFrame(v, columns=df.columns, index=df.index)

On your example data, we get:

>>> spread_bfill(df).round(2)  # round(2) for printing purposes
               A                              B                         
               a     b     c     d     e      a      b     c     d     e
S                                                                       
2020-10-15  1.00  2.00  0.52  1.21  1.17  10.00  11.00  1.68  3.93  1.68
2020-10-16  0.66  0.98  0.57  1.33  1.28   1.64   0.33  1.85  4.32  1.85
2020-10-17  0.72  1.08  0.63  1.46  1.41   1.80   0.36  2.04  4.75  2.04
2020-10-18  0.79  1.19  0.69  0.30  1.55   1.98   0.40  2.24  1.21  2.24
2020-10-19  0.87  1.31  0.76  0.33  1.71   2.18   0.44  2.47  1.33  2.47
2020-10-20  0.96  1.44  0.83  0.37  1.88   2.40   0.48  2.71  1.46  2.71

For inspection, let's look at each of the 3 quantities in that example:

>>> a
[[ 1  2  4  4  9 10 11 13 13 13]
 [ 4  6  4  4  9 10  2 13 13 13]
 [ 4  6  4  4  9 10  2 13 13 13]
 [ 4  6  4  1  9 10  2 13  4 13]
 [ 4  6  4  1  9 10  2 13  4 13]
 [ 4  6  4  1  9 10  2 13  4 13]]

>>> i
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 0 0 1 1 1]
 [1 1 2 2 2 1 1 2 2 2]
 [2 2 3 0 3 2 2 3 0 3]
 [3 3 4 1 4 3 3 4 1 4]
 [4 4 5 2 5 4 4 5 2 5]]

>>> n
[[1 1 6 3 6 1 1 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]]

And here is a final example, to illustrate what happens if a column ends with 1 or several NaNs (they remain NaN):

np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
>>> df
    0    1    2    3    4    5
0 NaN  NaN  NaN  NaN  NaN  0.0
1 NaN  NaN  9.0  NaN  8.0  NaN
2 NaN  NaN  NaN  NaN  NaN  NaN
3 NaN  8.0  4.0  NaN  NaN  NaN
4 NaN  NaN  NaN  6.0  9.0  NaN
5 NaN  NaN  2.0  NaN  7.0  8.0

Then:

>>> spread_bfill(df).round(2)  # round(2) for printing
    0     1     2     3     4     5
0 NaN  1.72  4.29  0.98  3.81  0.00
1 NaN  1.90  4.71  1.08  4.19  1.31
2 NaN  2.09  1.90  1.19  2.72  1.44
3 NaN  2.29  2.10  1.31  2.99  1.59
4 NaN   NaN  0.95  1.44  3.29  1.74
5 NaN   NaN  1.05   NaN  7.00  1.92

Speed

a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)

%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Brilliant. Thanks a lot for the detailed explanation. Luckily the data will always have values in the last row.:):). May be for the last row if it ever has NANs, i might do a ffill at 'r'. Thanks you very much. — Srinivas, Apr 16 '21 at 17:56
the solution got executed instantaneously, super super quick on my actual dataframe with 331 rows and 156 columns. Thanks a lot. — Srinivas, Apr 16 '21 at 18:07

Corralien · Answer 2 · 2021-04-16T17:57:17.627

Initial data:

>>> df
              A                         B
              a    b    c    d    e     a     b     c     d     e
date
2020-10-15  1.0  2.0  NaN  NaN  NaN  10.0  11.0   NaN   NaN   NaN
2020-10-16  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN
2020-10-17  NaN  NaN  NaN  4.0  NaN   NaN   NaN   NaN  13.0   NaN
2020-10-18  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN
2020-10-19  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN
2020-10-20  4.0  6.0  4.0  1.0  9.0  10.0   2.0  13.0   4.0  13.0

Define your geometric sequence:

def geomseq(seq):
    q = 1.1
    n = len(seq)
    S = seq.max()
    Uo = S * (1-q) / (1-q**n)
    Un = [Uo * q**i for i in range(0, n)]
    return Un

TL;DR

>>> df.unstack().groupby(df.unstack().sort_index(ascending=False).notna().cumsum().sort_index()).transform(geomseq).unstack(level=[0, 1])
                   A                                                  B
                   a         b         c         d         e          a          b         c         d         e
date
2020-10-15  1.000000  2.000000  0.518430  1.208459  1.166466  10.000000  11.000000  1.684896  3.927492  1.684896
2020-10-16  0.655190  0.982785  0.570272  1.329305  1.283113   1.637975   0.327595  1.853386  4.320242  1.853386
2020-10-17  0.720709  1.081063  0.627300  1.462236  1.411424   1.801772   0.360354  2.038724  4.752266  2.038724
2020-10-18  0.792780  1.189170  0.690030  0.302115  1.552567   1.981950   0.396390  2.242597  1.208459  2.242597
2020-10-19  0.872058  1.308087  0.759033  0.332326  1.707823   2.180144   0.436029  2.466856  1.329305  2.466856
2020-10-20  0.959264  1.438895  0.834936  0.365559  1.878606   2.398159   0.479632  2.713542  1.462236  2.713542

Details

Convert your dataframe to series:

>>> sr = df.unstack()
>>> sr.head(10)
      date
A  a  2020-10-15    1.0
      2020-10-16    NaN  # <= group X (final value: .655)
      2020-10-17    NaN  # <= group X (final value: .720)
      2020-10-18    NaN  # <= group X (final value: .793)
      2020-10-19    NaN  # <= group X (final value: .872)
      2020-10-20    4.0  # <= group X (final value: .960)
   b  2020-10-15    2.0
      2020-10-16    NaN
      2020-10-17    NaN
      2020-10-18    NaN
dtype: float64

Now you can build groups:

>>> groups = sr.sort_index(ascending=False).notna().cumsum().sort_index()
>>> groups.head(10)
      date
A  a  2020-10-15    16
      2020-10-16    15  # <= group X15
      2020-10-17    15  # <= group X15
      2020-10-18    15  # <= group X15
      2020-10-19    15  # <= group X15
      2020-10-20    15  # <= group X15
   b  2020-10-15    14
      2020-10-16    13
      2020-10-17    13
      2020-10-18    13
dtype: int64

Apply your geometric progression:

>>> sr = sr.groupby(groups).transform(geomseq)
>>> sr.head(10)
      date
A  a  2020-10-15    1.000000
      2020-10-16    0.655190  # <= group X15
      2020-10-17    0.720709  # <= group X15
      2020-10-18    0.792780  # <= group X15
      2020-10-19    0.872058  # <= group X15
      2020-10-20    0.959264  # <= group X15
   b  2020-10-15    2.000000
      2020-10-16    0.982785
      2020-10-17    1.081063
      2020-10-18    1.189170
dtype: float64

And finally, reshape series according to your initial dataframe:

>>> df = sr.unstack(level=[0, 1])
>>> df
                   A                                                  B
                   a         b         c         d         e          a          b         c         d         e
date
2020-10-15  1.000000  2.000000  0.518430  1.208459  1.166466  10.000000  11.000000  1.684896  3.927492  1.684896
2020-10-16  0.655190  0.982785  0.570272  1.329305  1.283113   1.637975   0.327595  1.853386  4.320242  1.853386
2020-10-17  0.720709  1.081063  0.627300  1.462236  1.411424   1.801772   0.360354  2.038724  4.752266  2.038724
2020-10-18  0.792780  1.189170  0.690030  0.302115  1.552567   1.981950   0.396390  2.242597  1.208459  2.242597
2020-10-19  0.872058  1.308087  0.759033  0.332326  1.707823   2.180144   0.436029  2.466856  1.329305  2.466856
2020-10-20  0.959264  1.438895  0.834936  0.365559  1.878606   2.398159   0.479632  2.713542  1.462236  2.713542

This is even cooler. Awesome. It takes much less code. Thanks a lot again for the great explanation. This code runs fast too. However, while this solution took several seconds to execute on my actual dataframe of 331 rows and 156 columns, the solution provided by @Pierre D was instantaneous, super quick. — Srinivas, Apr 16 '21 at 18:06

Filling missing data using a custom condition in a Pandas time series dataframe

2 Answers2