4

I have a DataFrame and I am trying to set every value in each column to the sum of that column.

import pandas as pd

x = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])
x 
   a   b
1  1   2
2  3   4
3  5   6
4  7   8
5  9  10

The output should be:

   a    b
1  25   30
2  25   30
3  25   30
4  25   30
5  25   30

I want to use x.apply(f, axis=0), but I do not know how to define a lambda that converts a column into the sum of its values. The following line raises SyntaxError: can't assign to lambda:

f = lambda x : x[:]= x.sum()
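A lambda body must be a single expression, so assignment is not allowed inside it; a function passed to apply has to return the new column instead. A minimal sketch along those lines:

# Sketch: apply expects the function to return the new column,
# so return a Series filled with the sum instead of assigning in place.
f = lambda col: pd.Series(col.sum(), index=col.index)
x.apply(f, axis=0)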
Warren

5 Answers

5
for col in df:
    df[col] = df[col].sum()  # the scalar sum is broadcast to every row

or a slower solution that avoids an explicit loop (it builds a list of len(df) references to the sum Series and assembles a new DataFrame from it):

df = pd.DataFrame([df.sum()] * len(df))

Timings

@jezrael Thanks for the timings. The run below uses a larger DataFrame and also includes the for loop. Most of the time is spent creating the DataFrame rather than calculating the sums, so the most efficient method appears to be the one from @ayhan that assigns the sums to the values directly:

import numpy as np
import pandas as pd
from string import ascii_letters

df = pd.DataFrame(np.random.randn(10000, 52), columns=list(ascii_letters))

# A baseline timing figure to determine sum of each column.
%timeit df.sum()
1000 loops, best of 3: 1.47 ms per loop

# Solution 1 from @Alexander
%%timeit
for col in df:
    df[col] = df[col].sum()
100 loops, best of 3: 21.3 ms per loop

# Solution 2 from @Alexander (without the for loop, but much slower)
%timeit df2 = pd.DataFrame([df.sum()] * len(df))
1 loops, best of 3: 270 ms per loop

# Solution from @PiRSquared
%timeit df.stack().groupby(level=1).transform('sum').unstack()
10 loops, best of 3: 159 ms per loop

# Solution 1 from @Jezrael
%timeit (pd.DataFrame(np.tile(df.sum().values, (len(df.index),1)), columns=df.columns, index=df.index))
100 loops, best of 3: 2.32 ms per loop

# Solution 2 from @Jezrael
%%timeit
df2 = pd.DataFrame(df.sum().values[np.newaxis,:].repeat(len(df.index), axis=0),
                 columns=df.columns,
                 index=df.index)
100 loops, best of 3: 2.3 ms per loop

# Solution from @ayhan
%time df.values[:] = df.values.sum(0)
CPU times: user 1.54 ms, sys: 485 µs, total: 2.02 ms
Wall time: 1.36 ms  # <<<< FASTEST
Alexander
  • Thanks, is there a way to avoid the for loop? – Warren Aug 12 '16 at 14:39
  • The for loop is perfectly valid in this case. – Alexander Aug 12 '16 at 14:48
  • yes, but my case is that I have thousands of columns and I am trying to find a more efficient way. – Warren Aug 12 '16 at 14:50
  • This is efficient. – Alexander Aug 12 '16 at 14:50
  • there is some discussion of iterating over rows at http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas/34311080#34311080, covering iterrows(), itertuples(), and zip(), as well as apply(); not sure which one is best – Warren Aug 12 '16 at 15:07
  • 2
    Iterating over rows is problematic. Iterating over columns, not so much. I've tested this on a (10^5, 10^4) dataframe and it took less than 2 seconds. If this is not something you are doing over and over, no need to avoid looping over columns. – ayhan Aug 12 '16 at 15:15
  • 2
    @Warren I've also tested this technique. It is efficient. Using numpy can often be more efficient, but it becomes a choice between more intuitive pandas and faster numpy. Alexander has shown time and again that loops in specific situations are fast and that the generalization that all loops are bad is naive. – piRSquared Aug 12 '16 at 15:35
  • @piRSquared Just tested with my case 4000+ rows and 2000+ cols. Time spent is roughly 1.2s, which is comparable with the Numpy method and I agree it is more intuitive. – Warren Aug 12 '16 at 15:49
5

Another, faster numpy solution using numpy.tile:

print (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), 
                    columns=x.columns, 
                    index=x.index))
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30

Another solution with numpy.repeat:

h = pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),
                 columns=x.columns,
                 index=x.index)

print (h)
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
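
For reference, both calls build the same (len(x), n_columns) array from the 1-D vector of column sums; a minimal numpy-only illustration using the sums from the example:

import numpy as np

s = np.array([25, 30])              # column sums from the example
np.tile(s, (5, 1))                  # stack 5 copies -> shape (5, 2)
s[np.newaxis, :].repeat(5, axis=0)  # the same (5, 2) result via repeat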


In [431]: %timeit df = pd.DataFrame([x.sum()] * len(x))
1000 loops, best of 3: 786 µs per loop

In [432]: %timeit (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), columns=x.columns, index=x.index))
1000 loops, best of 3: 192 µs per loop

In [460]: %timeit pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),columns=x.columns, index=x.index)
The slowest run took 8.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 184 µs per loop
jezrael
  • This also works with fairly arbitrary indices... meaning it's robust when indices are not unique. – piRSquared Aug 12 '16 at 15:27
  • thanks, just tested with my case of 4000+ rows and 2000+ cols; the time spent on the three methods is roughly [1.08s, 0.59s, 0.58s] – Warren Aug 12 '16 at 15:43
4

If your DataFrame consists of numbers, you can directly change its values:

df.values[:] = df.sum()
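
A minimal check on the question's example, assuming a homogeneous numeric dtype so that .values exposes the underlying array rather than a temporary copy (with mixed dtypes the in-place write could be lost):

import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])
x.values[:] = x.sum()  # broadcast the column sums into every row, in place
print(x)               # every row is now [25, 30]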
ayhan
2

Using transform

x.stack().groupby(level=1).transform('sum').unstack()

    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
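
A step-by-step sketch of the same pipeline (the intermediate names are mine):

# x is the DataFrame from the question
stacked = x.stack()                               # long Series indexed by (row label, column label)
sums = stacked.groupby(level=1).transform('sum')  # replace each value with its column's total
result = sums.unstack()                           # pivot back to the original wide shape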

piRSquared
0

I don't know exactly what you're trying to do, but you can do something with a list comprehension, like f = lambda x: [column.sum() for column in x]

DenverCoder9