Split pandas df based on unique values

Question

I have the following pandas df.

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1','hello','1.0'],
        ['1','hello', '1.0'],
        ['1','hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)
df

I want to split it into two dfs based on question_id: 80% of the unique question_ids should go to df1 and the remaining 20% to df2, with the 80% count rounded up.

Dummy example with the df above: df1 includes ids 1-5 and df2 includes id 6.

df1_data = [['1','hello','1.0'],
       ['1','hello', '1.0'],
       ['1','hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0']]
  

df2_data = [['6', '5.5', '1.0'],
            ['6', '5.2', '0.0']]

score 1 · Accepted Answer · answered Jan 08 '21 at 12:15


First, get the unique question ids:

unique_qid = df['question_id'].unique()
array(['1', '2', '3', '4', '5', '6'], dtype=object)

Then take the first 80% of the unique question ids and use the corresponding boolean mask to build the two output dfs:

df1_idx = df['question_id'].isin(unique_qid[:round(0.8 * len(unique_qid))])
df1_data = df.loc[df1_idx, :]
df2_data = df.loc[~df1_idx, :]

df1_data

df2_data
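
One caveat on the rounding: the question asks for rounding up, while round() rounds to the nearest integer. The two agree for 6 ids (4.8 becomes 5), but with 3 unique ids round(2.4) gives 2 whereas rounding up gives 3. A minimal sketch using math.ceil follows; the function name split_by_unique_ids, its parameters, and the small demo frame are made up for illustration and are not part of the original post.

```python
import math

import pandas as pd


def split_by_unique_ids(df, id_col="question_id", frac=0.8):
    """Put the first ceil(frac * n_unique) ids in the first frame, the rest in the second."""
    unique_ids = df[id_col].unique()             # ids in order of first appearance
    n_first = math.ceil(frac * len(unique_ids))  # always rounds up
    in_first = df[id_col].isin(unique_ids[:n_first])
    return df[in_first], df[~in_first]


# Hypothetical stand-in for the question's frame: 6 unique ids, 2 rows each.
demo = pd.DataFrame({
    "question_id": ["1", "1", "2", "2", "3", "3", "4", "4", "5", "5", "6", "6"],
    "answer": list("abcdefghijkl"),
})
df1, df2 = split_by_unique_ids(demo)
print(sorted(df1["question_id"].unique()))  # ids '1' through '5'
print(sorted(df2["question_id"].unique()))  # id '6'
```

Like the answer above, this takes the ids in order of first appearance, so it is not a random split.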


ggaurav



Awesome, thanks a lot! What's the correct term for the "~" within df.loc[~df1_idx, :]? I would like to read into how it works. – Exa Jan 08 '21 at 13:40

It's called the tilde operator. For how it works in Python and pandas, see https://stackoverflow.com/questions/46054318/tilde-sign-in-python-dataframe – ggaurav Jan 08 '21 at 13:42
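
For a quick feel of what the tilde does, here is a small sketch, not taken from the linked post: applied to a boolean pandas Series it flips every element, which is why df.loc[~df1_idx, :] selects exactly the rows that df1_idx leaves out. The variable names are only illustrative.

```python
import pandas as pd

mask = pd.Series([True, False, True])
inverted = ~mask  # element-wise logical NOT on a boolean Series
print(inverted.tolist())  # [False, True, False]

# On a plain Python int, ~ is bitwise inversion instead: ~5 is -6.
print(~5)  # -6
```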
