4

I have a DataFrame and I am trying to set every value in each column to the sum of that column.

import pandas as pd

x = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])
x 
   a   b
1  1   2
2  3   4
3  5   6
4  7   8
5  9  10

The output should be:

   a    b
1  25   30
2  25   30
3  25   30
4  25   30
5  25   30

I want to use x.apply(f, axis=0), but I do not know how to define a lambda that converts a column into the sum of its values. The following line raises SyntaxError: can't assign to lambda:

f = lambda x : x[:]= x.sum()
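A lambda body must be a single expression, so assignment is not allowed inside it; a function passed to apply has to return the new column instead. A minimal sketch along those lines:

# Sketch: apply expects the function to return the new column,
# so return a Series filled with the sum instead of assigning in place.
f = lambda col: pd.Series(col.sum(), index=col.index)
x.apply(f, axis=0)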
Warren

5 Answers

5
for col in df:
    df[col] = df[col].sum()  # the scalar sum is broadcast to every row

or a slower solution that avoids an explicit loop (it builds a list of len(df) references to the sum Series and assembles a new DataFrame from it):

df = pd.DataFrame([df.sum()] * len(df))

Timings

@jezrael Thanks for the timings. The run below uses a larger DataFrame and also includes the for loop. Most of the time is spent creating the DataFrame rather than calculating the sums, so the most efficient method appears to be the one from @ayhan that assigns the sums to the values directly:

import numpy as np
import pandas as pd
from string import ascii_letters

df = pd.DataFrame(np.random.randn(10000, 52), columns=list(ascii_letters))

# A baseline timing figure to determine sum of each column.
%timeit df.sum()
1000 loops, best of 3: 1.47 ms per loop

# Solution 1 from @Alexander
%%timeit
for col in df:
    df[col] = df[col].sum()
100 loops, best of 3: 21.3 ms per loop

# Solution 2 from @Alexander (without the for loop, but much slower)
%timeit df2 = pd.DataFrame([df.sum()] * len(df))
1 loops, best of 3: 270 ms per loop

# Solution from @PiRSquared
%timeit df.stack().groupby(level=1).transform('sum').unstack()
10 loops, best of 3: 159 ms per loop

# Solution 1 from @Jezrael
%timeit (pd.DataFrame(np.tile(df.sum().values, (len(df.index),1)), columns=df.columns, index=df.index))
100 loops, best of 3: 2.32 ms per loop

# Solution 2 from @Jezrael
%%timeit
df2 = pd.DataFrame(df.sum().values[np.newaxis,:].repeat(len(df.index), axis=0),
                 columns=df.columns,
                 index=df.index)
100 loops, best of 3: 2.3 ms per loop

# Solution from @ayhan
%time df.values[:] = df.values.sum(0)
CPU times: user 1.54 ms, sys: 485 µs, total: 2.02 ms
Wall time: 1.36 ms  # <<<< FASTEST
Alexander
  • Thanks, is there a way to avoid the for loop? – Warren Aug 12 '16 at 14:39
  • The for loop is perfectly valid in this case. – Alexander Aug 12 '16 at 14:48
  • yes, but my case is that I have thousands of columns and I am trying to find a more efficient way. – Warren Aug 12 '16 at 14:50
  • This is efficient. – Alexander Aug 12 '16 at 14:50
  • there is some discussion of iterating over rows at http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas/34311080#34311080, covering iterrows(), itertuples(), and zip(), as well as apply(); not sure which one is best – Warren Aug 12 '16 at 15:07
  • 2
    Iterating over rows is problematic. Iterating over columns, not so much. I've tested this on a (10^5, 10^4) dataframe and it took less than 2 seconds. If this is not something you are doing over and over, no need to avoid looping over columns. – ayhan Aug 12 '16 at 15:15
  • 2
    @Warren I've also tested this technique. It is efficient. Using numpy can often be more efficient, but it becomes a choice between more intuitive pandas and faster numpy. Alexander has shown time and again that loops in specific situations are fast and that the generalization that all loops are bad is naive. – piRSquared Aug 12 '16 at 15:35
  • @piRSquared Just tested with my case 4000+ rows and 2000+ cols. Time spent is roughly 1.2s, which is comparable with the Numpy method and I agree it is more intuitive. – Warren Aug 12 '16 at 15:49
5

Another, faster numpy solution using numpy.tile:

print (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), 
                    columns=x.columns, 
                    index=x.index))
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30

Another solution with numpy.repeat:

h = pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),
                 columns=x.columns,
                 index=x.index)

print (h)
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
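
For reference, both calls build the same (len(x), n_columns) array from the 1-D vector of column sums; a minimal numpy-only illustration using the sums from the example:

import numpy as np

s = np.array([25, 30])              # column sums from the example
np.tile(s, (5, 1))                  # stack 5 copies -> shape (5, 2)
s[np.newaxis, :].repeat(5, axis=0)  # the same (5, 2) result via repeat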


In [431]: %timeit df = pd.DataFrame([x.sum()] * len(x))
1000 loops, best of 3: 786 µs per loop

In [432]: %timeit (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), columns=x.columns, index=x.index))
1000 loops, best of 3: 192 µs per loop

In [460]: %timeit pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),columns=x.columns, index=x.index)
The slowest run took 8.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 184 µs per loop
jezrael
  • This also works with fairly arbitrary indices... meaning it's robust when indices are not unique. – piRSquared Aug 12 '16 at 15:27
  • thanks, just tested with my case of 4000+ rows and 2000+ cols; the time spent on the three methods is roughly [1.08s, 0.59s, 0.58s] – Warren Aug 12 '16 at 15:43
4

If your DataFrame consists of numbers, you can directly change its values:

df.values[:] = df.sum()
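
A minimal check on the question's example, assuming a homogeneous numeric dtype so that .values exposes the underlying array rather than a temporary copy (with mixed dtypes the in-place write could be lost):

import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])
x.values[:] = x.sum()  # broadcast the column sums into every row, in place
print(x)               # every row is now [25, 30]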
ayhan
2

Using transform

x.stack().groupby(level=1).transform('sum').unstack()

    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
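
A step-by-step sketch of the same pipeline (the intermediate names are mine):

# x is the DataFrame from the question
stacked = x.stack()                               # long Series indexed by (row label, column label)
sums = stacked.groupby(level=1).transform('sum')  # replace each value with its column's total
result = sums.unstack()                           # pivot back to the original wide shape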

piRSquared
0

I don't know exactly what you're trying to do, but you can do something with a list comprehension, like f = lambda x: [column.sum() for column in x]

DenverCoder9