I am so confused with different indexing methods using iloc in pandas.

Let's say I am trying to convert a 1-d DataFrame to a 2-d DataFrame. First I have the following 1-d DataFrame:

import pandas as pd
a_array = [1,2,3,4,5,6,7,8]
a_df = pd.DataFrame(a_array).T

I want to convert it into a 2-d DataFrame of size 2x4. I start by preallocating the 2-d DataFrame as follows:

b_df = pd.DataFrame(columns=range(4),index=range(2))

Then I use a for loop to copy a_df (1-d) into b_df (2-d) with the following code:

for i in range(2):
    b_df.iloc[i,:] = a_df.iloc[0,i*4:(i+1)*4]

It only gives me the following result:

     0    1    2    3
0    1    2    3    4
1  NaN  NaN  NaN  NaN

But when I change b_df.iloc[i,:] to b_df.iloc[i][:], the result is correct, which is what I want:

   0  1  2  3
0  1  2  3  4
1  5  6  7  8

Could anyone explain to me what the difference between .iloc[i,:] and .iloc[i][:] is, and why .iloc[i][:] worked in my example above but .iloc[i,:] did not?

cs95
Yippee
  • This is curious. `b_df.iloc[1] = a_df.iloc[0, 4:8]` assigns a series with index `[4, 5, 6, 7]` to a series with index `[0, 1, 2, 3]`. There is no overlap so `NaN`s get assigned to all elements. Up to this point it makes sense to me. But like you I am unclear on why `b_df.iloc[1][:] = ...` behaves differently—inspecting the objects `b_df.iloc[1]` and `b_df.iloc[1][:]` reveals no difference between the indices. My best guess would be that assigning directly to a copy (`[:]`) is treated as a special case by Pandas which makes it ignore the assignee's index and create this discrepancy. – Seb Feb 21 '20 at 12:06
  • I think it is because of the index; the first row succeeds because it has the same index – Phung Duy Phong Feb 21 '20 at 12:10
  • 2
    A key thing to remember about pandas is that almost all of its operations use a concept called 'intrinsic data alignment', meaning that almost any operation you do with pandas will align the indexes of both sides of the statement. Here you are trying to set index 1 using index 0, so pandas will assign NaNs because there is no index 0 on the right side of that assignment. Also remember that column headers are an index too, so pandas will align column header to column header. – Scott Boston Feb 25 '20 at 03:10
  • 4
    Secondly, using .iloc[i][:] is called index chaining and it is generally a pretty big "no-no" in pandas. There are some issues with pandas creating views of an object or creating a brand new object in memory that may yield some unexpected results. – Scott Boston Feb 25 '20 at 03:12

3 Answers

There is a very, very big difference between series.iloc[:] and series[:] when assigning back. (i)loc always aligns whatever you're assigning with the index of the assignee, so labels that do not match are dropped. Meanwhile, the [:] syntax assigns to the underlying NumPy array, bypassing index alignment.

s = pd.Series(index=[0, 1, 2, 3], dtype='float')  
s                                                                          

0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

# Let's get a reference to the underlying array with `copy=False`
arr = s.to_numpy(copy=False) 
arr 
# array([nan, nan, nan, nan])

# Reassign using slicing syntax
s[:] = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])                 
s                                                                          

0    1
1    2
2    3
3    4
dtype: int64

arr 
# array([1., 2., 3., 4.]) # underlying array has changed

# Now, reassign again with `iloc`
s.iloc[:] = pd.Series([5, 6, 7, 8], index=[3, 4, 5, 6]) 
s                                                                          

0    NaN
1    NaN
2    NaN
3    5.0
dtype: float64

arr 
# array([1., 2., 3., 4.])  # `iloc` created a new array for the series
                           # during reassignment leaving this unchanged

s.to_numpy(copy=False)     # the new underlying array, for reference                                                   
# array([nan, nan, nan,  5.]) 

Now that you understand the difference, let's look at what happens in your code. Just print out the RHS of your loops to see what you are assigning:

for i in range(2): 
    print(a_df.iloc[0, i*4:(i+1)*4]) 

# output - first row                                                                   
0    1
1    2
2    3
3    4
Name: 0, dtype: int64
# second row. Notice the index is different
4    5
5    6
6    7
7    8
Name: 0, dtype: int64   

When assigning to b_df.iloc[i, :] in the second iteration, the index of the RHS (4 to 7) shares no labels with the columns of b_df (0 to 3), so nothing is assigned and you only see NaNs. However, changing b_df.iloc[i, :] to b_df.iloc[i][:] means you assign to the underlying NumPy array, so index alignment is bypassed. This operation is better expressed as

for i in range(2):
    b_df.iloc[i, :] = a_df.iloc[0, i*4:(i+1)*4].to_numpy()

b_df                                                                       

   0  1  2  3
0  1  2  3  4
1  5  6  7  8

It's also worth mentioning that b_df.iloc[i][:] = ... is a form of chained assignment, which is not a good thing, and also makes your code harder to read and understand.
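
If the goal is only to lay the eight values out as 2x4, the loop can be skipped altogether. The following is a minimal sketch rather than part of the original answer. It assumes the data is numeric and that going through NumPy is acceptable, and `values` is a name introduced here for illustration.

```python
import pandas as pd

a_array = [1, 2, 3, 4, 5, 6, 7, 8]
a_df = pd.DataFrame(a_array).T  # one row, eight columns

# Drop to a plain NumPy array so no index labels are involved,
# then reshape row-major into 2 rows x 4 columns.
values = a_df.to_numpy().reshape(2, 4)
b_df = pd.DataFrame(values)

print(b_df)
```

Because the reshape happens on a bare array, there are no labels to align, which is the same reason the `.to_numpy()` version of the loop works.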

cs95
  • 1
    Now I understand it, thank you. Before I award the bounty, could you add a reference for this: "the `[:]` syntax assigns to the underlying NumPy array"? – Seb Feb 29 '20 at 22:49
  • @Seb You won't really find references to this in the documentation because it's somewhat of an implementation detail. It may be easier to find the code on GitHub that is responsible for this, but I think the easiest way is to just demonstrate what happens. I've edited the little example at the top of my answer to show how the underlying array is manipulated during the different kinds of reassignment. Hope that makes things clearer! – cs95 Feb 29 '20 at 22:59

The difference is that in the first case the Python interpreter executed the code as:

b_df.iloc[i,:] = a_df.iloc[0,i*4:(i+1)*4]
#as
b_df.iloc.__setitem__((i, slice(None)), value)

where value is the right-hand side of the assignment. Whereas in the second case the Python interpreter executed the code as:

b_df.iloc[i][:] = a_df.iloc[0,i*4:(i+1)*4]
#as
b_df.iloc.__getitem__(i).__setitem__(slice(None), value)

where again value is the right-hand side of the assignment.

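In each of those two cases a different method is called inside __setitem__, because the keys differ: (i, slice(None)) in the first case and slice(None) in the second. In the second case __setitem__ is also called on a different object, the one returned by b_df.iloc[i]. Therefore we get different behavior.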
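
The dispatch described above can be observed without pandas at all. The sketch below uses a hypothetical `Recorder` class, invented here for illustration, that only logs how it is indexed. It says nothing about what pandas does inside those methods, only which methods Python calls and with which keys.

```python
class Recorder:
    """Hypothetical stand-in for an indexer; it only logs how it is indexed."""

    def __init__(self, log, name="obj"):
        self.log = log
        self.name = name

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Return a new object, like .iloc[i] returning a row.
        return Recorder(self.log, name=f"{self.name}[{key!r}]")

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
obj = Recorder(log)

obj[1, :] = "value"  # one __setitem__ call with a tuple key
obj[1][:] = "value"  # __getitem__ first, then __setitem__ on its result

for entry in log:
    print(entry)
```

The first statement produces a single `__setitem__` on `obj` with the key `(1, slice(None))`. The second produces a `__getitem__` on `obj` followed by a `__setitem__` on the returned object with the key `slice(None)`.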

MaPy
  • `b_df.iloc[i]` and `b_df.iloc[i][:]` have the same indices though. Why can you assign a series with non-matching index to one but not the other? – Seb Feb 21 '20 at 13:13
  • In the first case _set_item would be called, and in the second _setitem_slice would be called. So I suspect the difference between those methods causes the behavior above – MaPy Feb 21 '20 at 16:55

Could anyone explain to me what the difference between .iloc[i,:] and .iloc[i][:] is

The difference between .iloc[i,:] and .iloc[i][:]

In the case of .iloc[i,:] you are directly accessing a specific position of the DataFrame, by selecting all (:) columns of the ith row. As far as I know, it is equivalent to leaving the 2nd dimension unspecified (.iloc[i]).

In the case of .iloc[i][:] you are performing 2 chained operations: [:] is applied to the result of .iloc[i]. Using this to set values is discouraged by Pandas itself here with a warning, so you shouldn't use it:

Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided


... and why .iloc[i][:] worked in my example above but not .iloc[i,:]

As @Scott mentioned in the comments on the question, data alignment is intrinsic, so the index labels on the right side of the = won't be included if they are not present on the left side. This is why there are NaN values in the 2nd row.
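
One way to picture alignment is as a reindex of the right-hand side onto the labels of the target. The sketch below illustrates that idea with `Series.reindex`; it is not the code path pandas runs during assignment, and whether a given setter aligns may depend on the pandas version. The names `rhs`, `target_labels`, `aligned` and `positional` are introduced here for illustration.

```python
import pandas as pd

# Right-hand side of the second loop iteration: labels 4..7.
rhs = pd.Series([5, 6, 7, 8], index=[4, 5, 6, 7])

# Column labels of the target row in b_df: 0..3.
target_labels = [0, 1, 2, 3]

# Label alignment: look up each target label in rhs.
# None of them exist there, so every entry is NaN.
aligned = rhs.reindex(target_labels)
print(aligned)

# Discarding the labels leaves only positions, so nothing can be misaligned.
positional = pd.Series(rhs.to_numpy(), index=target_labels)
print(positional)
```

In the first iteration the labels 0..3 match on both sides, which is why only the first row was filled.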

So, to make things clear, you could do as follows:

for i in range(2):
    # Get the slice
    a_slice = a_df.iloc[0, i*4:(i+1)*4]
    # Reset the indices
    a_slice.reset_index(drop=True, inplace=True)
    # Set the slice into b_df
    b_df.iloc[i,:] = a_slice

Or you can convert to list instead of using reset_index:

for i in range(2):
    # Get the slice
    a_slice = a_df.iloc[0, i*4:(i+1)*4]
    # Convert the slice into a list and set it into b_df
    b_df.iloc[i,:] = list(a_slice)
alan.elkin