python - How to impute each categorical column in a NumPy array


There are solutions for imputing a pandas DataFrame. But since I am working with NumPy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a NumPy array as follows:

    import pandas as pd

    nomdf = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
    nomdf = nomdf.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
    x_nominal = nomdf.values  # convert pd.DataFrame back to np.array

Is there a way to impute directly in the NumPy array?

We could use SciPy's mode to get the most frequent value in each column. The leftover work is to get the NaN indices and replace those positions in the input array with the mode values by indexing.

So, the implementation would look something like this -

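    import numpy as np
    from scipy.stats import mode

    r, c = np.where(np.isnan(x_nominal))  # row and column indices of the NaNs
    # nan_policy='omit' keeps NaNs out of the mode calculation; with the default
    # 'propagate', newer SciPy releases may return NaN for columns containing NaN
    vals = mode(x_nominal, axis=0, nan_policy='omit')[0].ravel()  # per-column mode
    x_nominal[r, c] = vals[c]  # fill each NaN with the mode of its column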
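For comparison, here is a minimal NumPy-only sketch of the same idea (find the NaN positions, then fill them with the per-column mode). It does not use SciPy; the function name impute_mode_numeric and the sample array are made up for illustration, and it assumes a 2-D float array.

```python
import numpy as np

def impute_mode_numeric(arr):
    """Fill NaNs in each column of a 2-D float array with that column's mode, in place."""
    for j in range(arr.shape[1]):
        col = arr[:, j]                  # a view, so writing to col modifies arr
        missing = np.isnan(col)
        if missing.all() or not missing.any():
            continue                     # nothing to fill, or nothing to fill with
        values, counts = np.unique(col[~missing], return_counts=True)
        col[missing] = values[np.argmax(counts)]  # ties resolve to the smallest value
    return arr

# Hypothetical sample data: one NaN in each column
x_nominal = np.array([[1.0, 2.0],
                      [1.0, np.nan],
                      [np.nan, 3.0],
                      [0.0, 3.0]])
impute_mode_numeric(x_nominal)
```

The loop over columns is slower than a single vectorized call on very wide arrays, but it avoids any dependence on how a particular SciPy version treats NaNs.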

Please note that with pandas' value_counts, we would be choosing the highest value when several categories/elements share the same highest count, i.e. in tie situations. With SciPy's mode, it would be the lowest one in such tie cases.
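If the tie-breaking rule matters for your data, it can be made explicit rather than left to the library. The following is only a sketch: mode_with_tie_rule and its prefer parameter are hypothetical names, and the input is assumed to be a 1-D array with the NaNs already removed.

```python
import numpy as np

def mode_with_tie_rule(values, prefer="lowest"):
    """Most frequent entry of a 1-D array without NaNs; `prefer` settles ties."""
    uniq, counts = np.unique(values, return_counts=True)  # uniq is sorted ascending
    tied = uniq[counts == counts.max()]                   # every value sharing the top count
    return tied[0] if prefer == "lowest" else tied[-1]

col = np.array([2.0, 5.0, 5.0, 2.0, 7.0])  # 2.0 and 5.0 both occur twice
low = mode_with_tie_rule(col, prefer="lowest")    # 2.0
high = mode_with_tie_rule(col, prefer="highest")  # 5.0
```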

If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -

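    x_nominal_u3 = x_nominal.astype('U3')  # 3-character unicode strings; NaN becomes 'nan', longer labels are truncated
    r, c = np.where(x_nominal_u3 == 'nan')  # positions of the missing entries
    vals = mode(x_nominal_u3, axis=0)[0].ravel()  # per-column mode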

This throws a warning for the mode calculation: "RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored." But since we actually want to ignore NaNs in the mode calculation, we should be okay there.
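As far as I am aware, recent SciPy releases document mode as accepting numeric input only, so the string variant may not run there. A NumPy-only sketch for the string case is shown here; impute_mode_strings, missing_token and the sample array are hypothetical, and it assumes missing entries turn into the string 'nan' when cast to str (so a real category literally named "nan" would also be treated as missing).

```python
import numpy as np

def impute_mode_strings(arr, missing_token="nan"):
    """Return a string copy of a 2-D array with missing entries replaced by each column's mode."""
    out = arr.astype(str)  # NaN becomes 'nan'; width is chosen from the data, so labels are not cut to 3 chars
    for j in range(out.shape[1]):
        col = out[:, j]                  # a view into out
        missing = col == missing_token
        if missing.all() or not missing.any():
            continue                     # nothing to fill, or nothing to fill with
        values, counts = np.unique(col[~missing], return_counts=True)
        col[missing] = values[np.argmax(counts)]  # ties resolve to the alphabetically first label
    return out

# Hypothetical sample data mixing strings and NaN
x_nominal = np.array([["a", "x"],
                      [np.nan, "y"],
                      ["a", np.nan],
                      ["b", "y"]], dtype=object)
filled = impute_mode_strings(x_nominal)
```

Unlike the in-place numeric version, this returns a new string array, because the cast to str already creates a copy.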

