Python pandas: conditionally select a uniform sample from a dataframe
Say I have a dataframe such as:

    category1  category2  other_col  another_col  ....
    a          1
    a          2
    a          2
    a          3
    a          3
    a          1
    b          10
    b          10
    b          10
    b          11
    b          11
    b          11
I want to obtain a sample from the dataframe in which each value of category1 appears a uniform number of times. I'm assuming there is an equal number of each type in category1. I know this can be done in pandas using DataFrame.sample(). However, I also want to ensure that the sample I select has category2 equally represented as well. So, for example, if I have a sample size of 5, I would want something such as:
    a          1
    a          2
    b          10
    b          11
    b          10

I would not want something such as:

    a          1
    a          1
    b          10
    b          10
    b          10

While this is a valid random sample of n=5, it does not meet my requirements, as I want to vary the types of category2 as much as possible.
Notice that in the first example, because a was only sampled twice, 3 was not represented from category2. This is okay. The goal is to represent the sample data as uniformly as possible.
If it helps to provide a clearer example, think of having the categories fruit, vegetables, meat, grains and junk. In a sample size of 10, I would want to represent each category as much as possible, so ideally 2 of each. Then the 2 selected rows belonging to each chosen category would have their subcategories represented as uniformly as possible. So, for example, fruit could have subcategories of red_fruits, yellow_fruits, etc. For the 2 fruit rows selected out of 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, if we had a larger sample size, we would include more of the subcategories of fruit (green_fruits, blue_fruits, etc.).
The trick is building a balanced array. I've provided a clumsy way of doing it. Then cycle through the groupby object, sampling each group by referencing the balanced array.
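    import numpy as np
    import pandas as pd

    def rep_sample(df, col, n, *args, **kwargs):
        nu = df[col].nunique()   # number of distinct groups in col
        mpb = n // nu            # minimum draws per group
        mku = n - mpb * nu       # leftover draws, one each for the first groups
        fills = np.zeros(nu)
        fills[:mku] = 1

        # balanced array: draws per group, differing by at most one
        sample_sizes = (np.ones(nu) * mpb + fills).astype(int)

        gb = df.groupby(col)

        # extra positional and keyword arguments are passed on to DataFrame.sample
        sample = lambda sub_df, i: sub_df.sample(sample_sizes[i], *args, **kwargs)

        subs = [sample(sub_df, i) for i, (_, sub_df) in enumerate(gb)]

        return pd.concat(subs)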
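To make the balanced array easier to see on its own, here is a minimal sketch that computes the same per-group sizes as sample_sizes in rep_sample. The helper name balanced_sizes is hypothetical and not part of the answer or of pandas.

```python
import numpy as np

def balanced_sizes(n, n_groups):
    """Split n draws across n_groups so that sizes differ by at most one.

    Hypothetical helper: mirrors the sample_sizes array built in rep_sample.
    """
    base, extra = divmod(n, n_groups)
    sizes = np.full(n_groups, base, dtype=int)
    sizes[:extra] += 1  # the first `extra` groups get one additional draw
    return sizes

print(balanced_sizes(5, 2))   # [3 2]
print(balanced_sizes(10, 5))  # [2 2 2 2 2]
```

Note that the leftover draws always go to the first groups in groupby order. If that bias matters, shuffling the sizes before sampling is one possible way around it.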
Demonstration:

    rep_sample(df, 'category1', 5)
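As written, rep_sample balances only the column it is given, so category2 is not balanced within each category1 group. One possible extension, sketched below, applies the same balanced split a second time inside each group. The names nested_rep_sample and balanced_sizes are hypothetical. The sketch assumes every subcategory has at least as many rows as the number of draws assigned to it, and that the pandas version is 1.1 or later, where DataFrame.sample accepts a NumPy Generator as random_state.

```python
import numpy as np
import pandas as pd

def balanced_sizes(n, n_groups):
    # Hypothetical helper: sizes summing to n that differ by at most one.
    base, extra = divmod(n, n_groups)
    sizes = np.full(n_groups, base, dtype=int)
    sizes[:extra] += 1
    return sizes

def nested_rep_sample(df, col1, col2, n, random_state=None):
    # Sketch only: balance draws across col1, then across col2 within each group.
    rng = np.random.default_rng(random_state)
    groups = list(df.groupby(col1))
    outer = balanced_sizes(n, len(groups))
    parts = []
    for (_, sub_df), k in zip(groups, outer):
        subgroups = list(sub_df.groupby(col2))
        inner = balanced_sizes(int(k), len(subgroups))
        rng.shuffle(inner)  # leftover draws go to random subcategories
        for (_, leaf), j in zip(subgroups, inner):
            if j > 0:
                # Assumes len(leaf) >= j; otherwise sample() raises ValueError.
                parts.append(leaf.sample(int(j), random_state=rng))
    return pd.concat(parts)

# Example data shaped like the dataframe in the question.
df = pd.DataFrame({
    "category1": list("aaaaaabbbbbb"),
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

result = nested_rep_sample(df, "category1", "category2", 5, random_state=0)
print(result)
```

With n=5 on this data, a receives 3 draws and b receives 2, and every draw within a group comes from a different category2 value. Which rows are picked depends on the seed.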