I have the following pandas DataFrame:
import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1','hello','1.0'],
['1','hello', '1.0'],
['1','hello', '1.0'],
['2', 'dog', '0.0'],
['2', 'cat', '1.0'],
['2', 'dog', '0.0'],
['2', 'the answer is cat', '1.0'],
['3', 'Milan', '1.0'],
['3', 'Paris', '0.0'],
['3', 'The capital is Paris', '0.0'],
['3', 'MILAN', '1.0'],
['4', 'The capital is Paris', '1.0'],
['4', 'London', '0.0'],
['4', 'Paris', '1.0'],
['4', 'paris', '1.0'],
['5', 'lol', '0.0'],
['5', 'rofl', '0.0'],
['6', '5.5', '1.0'],
['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)
df
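I want to split it into two DataFrames based on question_id: 80% of the unique question_id values, rounded up, should go into df1 and the remaining ids into df2. All rows for a given question_id should end up in the same DataFrame.
Dummy example with the df above: there are 6 unique ids, and 80% of 6 is 4.8, which rounds up to 5, so df1 includes ids 1-5 and df2 includes id 6: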
df1_data = [['1','hello','1.0'],
['1','hello', '1.0'],
['1','hello', '1.0'],
['2', 'dog', '0.0'],
['2', 'cat', '1.0'],
['2', 'dog', '0.0'],
['2', 'the answer is cat', '1.0'],
['3', 'Milan', '1.0'],
['3', 'Paris', '0.0'],
['3', 'The capital is Paris', '0.0'],
['3', 'MILAN', '1.0'],
['4', 'The capital is Paris', '1.0'],
['4', 'London', '0.0'],
['4', 'Paris', '1.0'],
['4', 'paris', '1.0'],
['5', 'lol', '0.0'],
['5', 'rofl', '0.0']]
df2_data = [['6', '5.5', '1.0'],
['6', '5.2', '0.0']]
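To make the intended behaviour concrete, here is a minimal sketch of the split described above. It assumes the ids are taken in order of first appearance (no shuffling); the helper name `split_by_question_id` and its parameters `frac` and `id_col` are hypothetical, introduced only for illustration.

```python
import math

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'], ['1', 'hello', '1.0'], ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'], ['2', 'cat', '1.0'], ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'], ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'], ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'], ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'], ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'], ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'], ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)


def split_by_question_id(frame, frac=0.8, id_col='question_id'):
    """Hypothetical helper: put ceil(frac * n_unique_ids) ids in the first part."""
    ids = frame[id_col].unique()             # unique ids, in order of first appearance
    n_first = math.ceil(len(ids) * frac)     # "rounding up": 6 ids * 0.8 = 4.8 -> 5
    in_first = frame[id_col].isin(ids[:n_first])
    # Rows are selected by id, so all rows of one question stay together.
    return frame[in_first], frame[~in_first]


df1, df2 = split_by_question_id(df)
print(df1['question_id'].unique())  # ids that went into df1
print(df2['question_id'].unique())  # ids that went into df2
```

If a random assignment of ids is wanted instead, the `ids` array could be shuffled before slicing, for example with `numpy.random.default_rng(seed).permutation(ids)`.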

