python - How to impute each categorical column in numpy array
There are solutions for imputing a pandas DataFrame. Since I am working with numpy arrays, I have to create a new pandas DataFrame object, impute it, and then convert it back to a numpy array, as follows:

    import pandas as pd
    nomdf = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
    nomdf = nomdf.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
    x_nominal = nomdf.values  # convert pd.DataFrame back to np.array

Is there a way to impute directly in the numpy array?
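We could use scipy's mode to get the most frequent value in each column. The leftover work is to get the NaN indices and replace those positions in the input array with the mode values by indexing.

So, the implementation would look something like this -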
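    import numpy as np
    from scipy.stats import mode

    r, c = np.where(np.isnan(x_nominal))  # row and column indices of the NaNs
    # nan_policy='omit' keeps NaNs out of the count; recent SciPy versions otherwise return NaN for columns that contain one
    vals = mode(x_nominal, axis=0, nan_policy='omit')[0].ravel()  # one mode per column
    x_nominal[r, c] = vals[c]  # fill each NaN with the mode of its column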
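If you would rather not depend on SciPy at all, the same idea can be sketched with numpy only. This is a minimal sketch for a 2-D float array, not a definitive implementation; the function name `impute_columns_with_mode` and the sample data are my own illustrations, not part of the question.

```python
import numpy as np

def impute_columns_with_mode(x):
    """Return a copy of a 2-D float array with NaNs replaced by each column's mode."""
    x = np.array(x, dtype=float)  # copy so the caller's array is left untouched
    for j in range(x.shape[1]):
        col = x[:, j]  # a view, so writing to it updates x
        missing = np.isnan(col)
        if not missing.any() or missing.all():
            continue  # nothing to fill, or no observed value to fill with
        values, counts = np.unique(col[~missing], return_counts=True)
        # np.unique returns sorted values and argmax takes the first maximum,
        # so ties resolve to the smallest value
        col[missing] = values[np.argmax(counts)]
    return x

# Hypothetical sample data
x_nominal = np.array([[1.0, 2.0],
                      [np.nan, 2.0],
                      [1.0, np.nan],
                      [3.0, 5.0]])
filled = impute_columns_with_mode(x_nominal)
```

Because the function works on a copy, `x_nominal` still contains its NaNs afterwards; assign the result back if you want in-place behaviour.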
Please note that pandas, with value_counts, would be choosing the highest value when several categories/elements share the same highest count, i.e. in tie situations. scipy's mode would pick the lowest one in such tie cases.
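To make the tie behaviour concrete, here is a small numpy-only sketch that lets you choose which tied value wins. The helper `column_mode` and its `prefer` argument are hypothetical names introduced for illustration; they are not part of pandas, SciPy or numpy.

```python
import numpy as np

def column_mode(col, prefer="lowest"):
    """Most frequent value of a 1-D array; ties go to the lowest or highest value."""
    values, counts = np.unique(col, return_counts=True)  # values come back sorted
    if prefer == "lowest":
        return values[np.argmax(counts)]  # first maximum -> smallest tied value
    # reverse both arrays so the first maximum is the largest tied value
    return values[::-1][np.argmax(counts[::-1])]

col = np.array([3, 1, 3, 1, 2])  # 1 and 3 both occur twice
lowest = column_mode(col, prefer="lowest")
highest = column_mode(col, prefer="highest")
```

Here `lowest` is 1 and `highest` is 3, so you can match whichever convention the rest of your pipeline expects.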
If you are dealing with such a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -
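    x_nominal_u3 = x_nominal.astype('U3')  # unicode strings of up to 3 characters; NaN becomes 'nan'
    r, c = np.where(x_nominal_u3 == 'nan')  # row and column indices of the former NaNs
    vals = mode(x_nominal_u3, axis=0)[0].ravel()  # one mode per column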
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs in that mode calculation, we should be okay there.
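As an alternative for the string case, here is a hedged numpy-only sketch that does not rely on how scipy's mode treats string input (as far as I know, newer SciPy releases restrict mode to numeric arrays, so treat that as an assumption to check against your version). The function name `impute_string_columns` and the sample data are hypothetical. Note that a genuine category literally named 'nan' would be treated as missing.

```python
import numpy as np

def impute_string_columns(x):
    """Return a string array with missing entries replaced by each column's mode."""
    # Casting through object lets numpy pick a string width wide enough for
    # every category; float NaN turns into the string 'nan'
    x_str = np.asarray(x, dtype=object).astype(str)
    missing = x_str == "nan"
    for j in range(x_str.shape[1]):
        observed = x_str[~missing[:, j], j]
        if observed.size == 0 or not missing[:, j].any():
            continue  # no observed value to fill with, or nothing to fill
        values, counts = np.unique(observed, return_counts=True)
        # sorted values + first maximum: ties go to the alphabetically first category
        x_str[missing[:, j], j] = values[np.argmax(counts)]
    return x_str

# Hypothetical sample data mixing strings and NaNs
x_mixed = [["a", "x"],
           [np.nan, "y"],
           ["a", np.nan],
           ["b", "y"]]
filled_str = impute_string_columns(x_mixed)
```

Unlike the 'U3' cast, this keeps categories longer than three characters intact, at the cost of a Python-level loop over columns.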