Python pandas: conditionally select a uniform sample from a dataframe
Say I have a dataframe such as:

    category1  category2  other_col  another_col  ....
    a          1
    a          2
    a          2
    a          3
    a          3
    a          1
    b          10
    b          10
    b          10
    b          11
    b          11
    b          11

I want to obtain a sample from this dataframe in which each value of category1 appears a uniform number of times. I'm assuming there is an equal number of each type in category1. I know this can be done in pandas using DataFrame.sample(). However, I also want to ensure that the sample I select has category2 equally represented as well. So, for example, if I have a sample size of 5, I would want something such as:

    a          1
    a          2
    b          10
    b          11
    b          10

I would not want something such as:

    a          1
    a          1
    b          10
    b          10
    b          10

While this is a valid random sample of n=5, it does not meet my requirements, because I want to vary the types of category2 as much as possible.

Notice that in the first example, because a was sampled only twice, 3 was not represented from category2. This is okay. The goal is to represent the sample data as uniformly as possible.
If it helps to provide a clearer example, one could think of having the categories fruit, vegetables, meat, grains, and junk. In a sample size of 10, I would want to represent each category as much as possible. Ideally, that means 2 of each. Then each of the 2 selected rows belonging to a chosen category would have its subcategories represented as uniformly as possible. So, for example, fruit could have subcategories of red_fruits, yellow_fruits, etc. For the 2 fruit rows selected out of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, if we had a larger sample size, we would include more of the subcategories of fruit (green_fruits, blue_fruits, etc.).
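The requirement above can be made checkable. What follows is only a minimal sketch: the helper name `coverage` and the frames `wanted` and `unwanted` are made up for illustration, and the two frames simply restate the desired and undesired n=5 samples from the question. It counts, per category1 value, how many rows were drawn and how many distinct category2 values those rows cover.

```python
import pandas as pd

def coverage(sample, outer="category1", inner="category2"):
    """Per outer category: rows drawn and distinct inner values covered."""
    return sample.groupby(outer).agg(
        rows=(inner, "count"),
        distinct_inner=(inner, "nunique"),
    )

# Hypothetical frames restating the two n=5 samples from the question.
wanted = pd.DataFrame(
    {"category1": list("aabbb"), "category2": [1, 2, 10, 11, 10]}
)
unwanted = pd.DataFrame(
    {"category1": list("aabbb"), "category2": [1, 1, 10, 10, 10]}
)

print(coverage(wanted))    # each category1 covers 2 distinct category2 values
print(coverage(unwanted))  # each category1 covers only 1 distinct value
```

Both samples draw the same number of rows per category1, so a plain row count cannot tell them apart. The distinct count is what separates the acceptable sample from the unacceptable one.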
The trick is building a balanced array. I've provided a clumsy way of doing it. Then cycle through the groupby object, sampling each group by referencing the balanced array.
    import numpy as np
    import pandas as pd

    def rep_sample(df, col, n, *args, **kwargs):
        nu = df[col].nunique()   # number of distinct groups in col
        mpb = n // nu            # minimum rows per group
        mku = n - mpb * nu       # leftover rows; the first mku groups get one extra
        fills = np.zeros(nu)
        fills[:mku] = 1

        # balanced array: per-group sample sizes that sum to n
        sample_sizes = (np.ones(nu) * mpb + fills).astype(int)

        gb = df.groupby(col)

        sample = lambda sub_df, i: sub_df.sample(sample_sizes[i], *args, **kwargs)

        subs = [sample(sub_df, i) for i, (_, sub_df) in enumerate(gb)]

        return pd.concat(subs)

Demonstration

    rep_sample(df, 'category1', 5)
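The function above balances only one column. The question also asks for category2 to be spread out inside each category1 group, and one way to get there is to apply the same balanced-array idea twice. The following is a rough sketch rather than a definitive implementation: `balanced_sizes`, `_allocate`, `nested_rep_sample` and the `seed` parameter are names invented here, the frame is a reconstruction of the question's example, and the code assumes n >= 1 and that every subgroup holds at least as many rows as it is asked to supply.

```python
import numpy as np
import pandas as pd

def balanced_sizes(n, k):
    """Split n into k integer parts that differ by at most 1."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

def _allocate(groups, n, rng):
    """Pair each group with its share of n, handing out extras at random."""
    sizes = rng.permutation(balanced_sizes(n, len(groups)))
    return [(g, int(s)) for g, s in zip(groups, sizes) if s > 0]

def nested_rep_sample(df, outer, inner, n, seed=None):
    """Balance n rows across `outer`, then each share across `inner`."""
    rng = np.random.default_rng(seed)
    parts = []
    outer_groups = [g for _, g in df.groupby(outer)]
    for outer_df, n_outer in _allocate(outer_groups, n, rng):
        inner_groups = [g for _, g in outer_df.groupby(inner)]
        for inner_df, n_inner in _allocate(inner_groups, n_outer, rng):
            # Assumes len(inner_df) >= n_inner; rng.choice raises otherwise.
            rows = rng.choice(len(inner_df), size=n_inner, replace=False)
            parts.append(inner_df.iloc[rows])
    return pd.concat(parts)

# Reconstruction of the example frame from the question.
df = pd.DataFrame({
    "category1": list("aaaaaabbbbbb"),
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

sample = nested_rep_sample(df, "category1", "category2", 5, seed=0)
print(sample.groupby("category1")["category2"].value_counts())
```

Shuffling the sizes instead of always giving the extra rows to the first groups is a design choice made here so that no category is systematically favoured. The original rep_sample always gives the extras to the first groups in sort order.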