Python pandas: conditionally select a uniform sample from a dataframe


Say I have a dataframe such as:

    category1  category2  other_col  another_col  ....
    a          1
    a          2
    a          2
    a          3
    a          3
    a          1
    b          10
    b          10
    b          10
    b          11
    b          11
    b          11

I want to obtain a sample from this dataframe in which each value of category1 appears a uniform number of times. I'm assuming there is an equal number of each type in category1. I know this can be done in pandas using DataFrame.sample(). However, I also want to ensure that the sample I select has category2 equally represented as well. So, for example, if I have a sample size of 5, I would want something such as:

    a  1
    a  2
    b  10
    b  11
    b  10

I would not want something such as:

    a  1
    a  1
    b  10
    b  10
    b  10

While this is a valid random sample of n=5, it would not meet my requirements, as I want to vary the types of category2 as much as possible.

Notice that in the first example, because a was only sampled twice, 3 was not represented from category2. This is okay. The goal is to represent the data in the sample as uniformly as possible.

If it helps to provide a clearer example, one could think of having the categories fruit, vegetables, meat, grains, and junk. In a sample size of 10, I would want to represent each category as much as possible, so ideally 2 of each. Then each of the 2 selected rows belonging to a chosen category would have its subcategories represented as uniformly as possible. So, for example, fruit could have subcategories of red_fruits, yellow_fruits, etc. For the 2 fruit rows selected out of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, if we had a larger sample size, we would include more of the subcategories of fruit (green_fruits, blue_fruits, etc.).
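For reference, here is a minimal sketch of what plain per-group sampling gives. It assumes a toy frame shaped like the example above and pandas 1.1 or later (for `DataFrameGroupBy.sample`). It balances category1 but leaves category2 to chance, which is exactly the gap described in the question.

```python
import pandas as pd

# Toy frame mirroring the example table (the values are illustrative).
df = pd.DataFrame({
    "category1": ["a"] * 6 + ["b"] * 6,
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# Draw the same number of rows from every category1 group.
per_cat1 = df.groupby("category1").sample(n=2, random_state=0)

# category1 is balanced by construction...
print(per_cat1["category1"].value_counts())

# ...but how many distinct category2 values show up per group is random.
print(per_cat1.groupby("category1")["category2"].nunique())
```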

The trick is building up a balanced array. I provided a clumsy way of doing it. Then cycle through the groupby object, sampling by referencing the balanced array.

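    import numpy as np
    import pandas as pd

    def rep_sample(df, col, n, *args, **kwargs):
        nu = df[col].nunique()   # number of distinct groups in col
        mpb = n // nu            # minimum rows drawn per group
        mku = n - mpb * nu       # leftover rows still to hand out
        fills = np.zeros(nu)
        fills[:mku] = 1          # the first mku groups get one extra row

        # the balanced array: one sample size per group, summing to n
        sample_sizes = (np.ones(nu) * mpb + fills).astype(int)

        gb = df.groupby(col)

        # extra arguments are forwarded to DataFrame.sample
        sample = lambda sub_df, i: sub_df.sample(sample_sizes[i], *args, **kwargs)

        subs = [sample(sub_df, i) for i, (_, sub_df) in enumerate(gb)]

        return pd.concat(subs)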
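The balanced array that rep_sample builds could probably be written more compactly with `divmod`. The sketch below is only an illustration of that arithmetic; the helper name `balanced_sizes` is made up here and is not part of pandas or NumPy.

```python
import numpy as np

def balanced_sizes(n, n_groups):
    """Split n draws over n_groups as evenly as possible (hypothetical helper)."""
    base, extra = divmod(n, n_groups)      # every group gets base rows
    sizes = np.full(n_groups, base, dtype=int)
    sizes[:extra] += 1                     # the first `extra` groups get one more
    return sizes

print(balanced_sizes(5, 2))    # [3 2]
print(balanced_sizes(10, 5))   # [2 2 2 2 2]
```

Note that the leftover rows always go to the first groups in order, so with a sample size of 5 and two categories the first one receives 3 rows and the second receives 2.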

Demonstration

    rep_sample(df, 'category1', 5)
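The call above balances category1 only. One way the same idea might be pushed down to category2 is to apply the balanced split twice: once across the category1 groups, then again across the category2 groups inside each of them. This is a rough sketch under stated assumptions, not a tested general solution. The names `balanced_sizes` and `nested_rep_sample` are hypothetical, the toy frame mirrors the example in the question, and a leaf group that is smaller than its share simply contributes fewer rows, so the result can come out shorter than n.

```python
import numpy as np
import pandas as pd

def balanced_sizes(n, n_groups):
    # Hypothetical helper: spread n draws over n_groups as evenly as possible.
    base, extra = divmod(int(n), int(n_groups))
    sizes = np.full(n_groups, base, dtype=int)
    sizes[:extra] += 1
    return sizes

def nested_rep_sample(df, outer, inner, n, random_state=None):
    # Hypothetical helper: balance over `outer`, then over `inner` within each group.
    rng = np.random.default_rng(random_state)
    parts = []
    outer_groups = list(df.groupby(outer))
    outer_sizes = balanced_sizes(n, len(outer_groups))
    for size, (_, sub) in zip(outer_sizes, outer_groups):
        inner_groups = list(sub.groupby(inner))
        inner_sizes = balanced_sizes(size, len(inner_groups))
        for k, (_, leaf) in zip(inner_sizes, inner_groups):
            k = min(int(k), len(leaf))   # never ask for more rows than exist
            if k > 0:
                seed = int(rng.integers(0, 2**32 - 1))
                parts.append(leaf.sample(k, random_state=seed))
    return pd.concat(parts)

# Toy frame mirroring the example table (the values are illustrative).
df = pd.DataFrame({
    "category1": ["a"] * 6 + ["b"] * 6,
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

result = nested_rep_sample(df, "category1", "category2", 5, random_state=0)
print(result)
```

With this frame and a sample size of 5, a receives 3 rows (one each for 1, 2 and 3) and b receives 2 rows (one each for 10 and 11), because groupby iterates the keys in sorted order and the leftover row goes to the first group.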


