python - How to calculate a partial Area Under the Curve (AUC)
In scikit-learn, one can compute the area under the curve for a binary classifier with

    roc_auc_score(y, clf.predict_proba(x)[:, 1])

I am only interested in the part of the curve where the false positive rate is less than 0.1.
Given such a threshold false positive rate, how can I compute the AUC only for the part of the curve up to the threshold?
Here is an example with several ROC curves, for illustration:
[image: several ROC curves]
The scikit-learn docs show how to use roc_curve:

    >>> import numpy as np
    >>> from sklearn import metrics
    >>> y = np.array([1, 1, 2, 2])
    >>> scores = np.array([0.1, 0.4, 0.35, 0.8])
    >>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
    >>> fpr
    array([ 0. ,  0.5,  0.5,  1. ])
    >>> tpr
    array([ 0.5,  0.5,  1. ,  1. ])
    >>> thresholds
    array([ 0.8 ,  0.4 ,  0.35,  0.1 ])

Is there a simple way to go from this to the partial AUC?
It seems the only problem is how to compute the TPR value at FPR = 0.1, as roc_curve doesn't necessarily give you that.
Say we start with

    import numpy as np
    from sklearn import metrics

Now we set the true y and the predicted scores:

    y = np.array([0, 0, 1, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8])

(Note that y has been shifted down by 1 from your problem. This is inconsequential: the exact same results (fpr, tpr, thresholds, etc.) are obtained whether predicting 1, 2 or 0, 1, but some sklearn.metrics functions are a drag if not using 0, 1.)
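Let's see the AUC here:

    >>> metrics.roc_auc_score(y, scores)
    0.75

As in your example:

    >>> fpr, tpr, thresholds = metrics.roc_curve(y, scores)
    >>> fpr, tpr
    (array([ 0. ,  0.5,  0.5,  1. ]), array([ 0.5,  0.5,  1. ,  1. ]))

(Newer scikit-learn releases also return the starting point (0, 0) at the front of both arrays. That extra segment is vertical, has zero width, and changes none of the areas below.)

This gives the following plot:

    import matplotlib.pyplot as plt
    plt.plot([0, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 1], [0.5, 1], [1, 1])

By construction, the ROC for a finite-length y will be composed of rectangles: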
For a high enough threshold, everything is classified as negative.
As the threshold decreases continuously, at discrete points, some negative classifications change to positive.
So, for a finite y, the ROC is always characterized by a sequence of connected horizontal and vertical lines leading from (0, 0) to (1, 1).
The AUC is the sum of these rectangles. Here, as shown above, the AUC is 0.75, as the rectangles have areas 0.5 * 0.5 + 0.5 * 1 = 0.75.
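To make the staircase concrete, here is a minimal sketch that builds the ROC points by sweeping the threshold from above the highest score downwards, then sums the rectangles. The helper names roc_points and rectangle_auc are made up for illustration (they are not part of sklearn), and the sketch assumes 0/1 labels with no tied scores.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and record (fpr, tpr) after each step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    # Visit samples from highest to lowest score; each one flips to "positive" in turn.
    # Assumes no tied scores, so every step is purely horizontal or purely vertical.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1  # a true positive: the curve moves up
        else:
            fp += 1  # a false positive: the curve moves right
        points.append((fp / n_neg, tp / n_pos))
    return points


def rectangle_auc(points):
    """Sum width * left-hand height over consecutive points."""
    return sum((x1 - x0) * y0 for (x0, y0), (x1, _) in zip(points, points[1:]))


points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(points)
print(rectangle_auc(points))
```

Only the horizontal moves contribute area, which is why the sum reduces to 0.5 * 0.5 + 0.5 * 1.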
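In some cases, people choose to calculate the AUC by linear interpolation. Say the length of y is much larger than the actual number of points calculated for FPR and TPR. Then, in this case, a linear interpolation is an approximation of what the points in between might have been. In some cases people also follow the conjecture that, had y been large enough, the points in between would be interpolated linearly. sklearn.metrics does not use this conjecture for the curve here: metrics.auc applies the trapezoidal rule to every corner that roc_curve returns, which on a staircase is the same as summing rectangles. So, to get results consistent with sklearn.metrics once the vertical segments are collapsed to one point per FPR value, it is necessary to use rectangle, not trapezoidal, summation.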
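The difference between the two summations can be checked by hand on the four points from the example. This is only a sketch: trapezoid_area and rectangle_area are hypothetical helpers, and the "collapsed" lists are my own reduction of the curve to one point per distinct FPR value (the top of each vertical segment).

```python
def trapezoid_area(xs, ys):
    """Width times mean height, summed over consecutive points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def rectangle_area(xs, ys):
    """Width times left-hand height, summed over consecutive points."""
    return sum((x1 - x0) * y0 for x0, x1, y0 in zip(xs, xs[1:], ys))


# Every corner of the staircase, as in the example output of roc_curve.
fpr_all = [0.0, 0.5, 0.5, 1.0]
tpr_all = [0.5, 0.5, 1.0, 1.0]

# One point per distinct fpr value (vertical segments collapsed to their top).
fpr_collapsed = [0.0, 0.5, 1.0]
tpr_collapsed = [0.5, 1.0, 1.0]

print(trapezoid_area(fpr_all, tpr_all), rectangle_area(fpr_all, tpr_all))
print(trapezoid_area(fpr_collapsed, tpr_collapsed),
      rectangle_area(fpr_collapsed, tpr_collapsed))
```

With all corners present, both rules agree. After collapsing, the trapezoid rule draws a diagonal across each step and overestimates the area, while the rectangle rule is unchanged.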
Let's write our own function to calculate the AUC directly from fpr and tpr:

    def auc_from_fpr_tpr(fpr, tpr, trapezoid=False):
        # Keep the last point of each run of equal fpr values (collapses vertical segments).
        inds = [i for (i, (s, e)) in enumerate(zip(fpr[:-1], fpr[1:])) if s != e] + [len(fpr) - 1]
        fpr, tpr = fpr[inds], tpr[inds]
        area = 0
        ft = list(zip(fpr, tpr))  # a list, so that it can be sliced on Python 3
        for p0, p1 in zip(ft[:-1], ft[1:]):
            # Width times either the mean height (trapezoid) or the left height (rectangle).
            area += (p1[0] - p0[0]) * ((p1[1] + p0[1]) / 2 if trapezoid else p0[1])
        return area

This function takes the fpr and tpr arrays, and an optional parameter stating whether to use trapezoidal summation. Running it, we get:

    >>> auc_from_fpr_tpr(fpr, tpr), auc_from_fpr_tpr(fpr, tpr, True)
    (0.75, 0.875)

We get the same result as sklearn.metrics for the rectangle summation, and a different, higher, result for trapezoid summation.
So, now we just need to see what happens to the fpr/tpr points if we terminate at an FPR of 0.1. We can do this with the bisect module:

    import bisect

    def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
        p = bisect.bisect_left(fpr, thresh)  # first index with fpr[p] >= thresh
        fpr = fpr.copy()
        fpr[p] = thresh  # shorten the rectangle that straddles thresh
        return fpr[:p + 1], tpr[:p + 1]

How does it work? It simply checks where the insertion point of thresh would be in fpr. Given the properties of fpr (it must start at 0), the insertion point must be in a horizontal line. Thus all rectangles before this one should be unaffected, all rectangles after this one should be removed, and this one should possibly be shortened.
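To see what the insertion point looks like for a few cutoffs, here is a small sketch on a plain list version of the example fpr values. The helper truncate_fpr is hypothetical and only mirrors the fpr half of the function above.

```python
import bisect

fpr = [0.0, 0.5, 0.5, 1.0]


def truncate_fpr(fpr, thresh):
    """Return the insertion index of thresh and the fpr values cut off at thresh."""
    p = bisect.bisect_left(fpr, thresh)  # first index whose value is >= thresh
    return p, fpr[:p] + [thresh]


for thresh in (0.1, 0.5, 0.7):
    print(thresh, truncate_fpr(fpr, thresh))
```

A cutoff of 0.1 lands inside the first horizontal segment, 0.5 lands exactly on the vertical step, and 0.7 lands inside the second horizontal segment. A cutoff larger than the last fpr value would give an index equal to len(fpr), so that case needs separate handling before indexing with it.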
Let's apply it:

    >>> fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, 0.1)
    >>> fpr_thresh, tpr_thresh
    (array([ 0. ,  0.1]), array([ 0.5,  0.5]))

Finally, we just need to calculate the AUC from the updated versions:

    >>> auc_from_fpr_tpr(fpr_thresh, tpr_thresh), auc_from_fpr_tpr(fpr_thresh, tpr_thresh, True)
    (0.050000000000000003, 0.050000000000000003)

In this case, both the rectangle and trapezoid summations give the same results. Note that in general, they will not. For consistency with sklearn.metrics, the first one should be used.
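For reference, the two steps can also be folded into one function. This is a rough sketch rather than a drop-in replacement: partial_auc is a hypothetical name, it assumes fpr is sorted, starts at 0, and that 0 < max_fpr <= fpr[-1], and it applies the trapezoidal rule to every corner of the curve (so it interpolates linearly if tied scores produce a diagonal segment).

```python
import bisect


def partial_auc(fpr, tpr, max_fpr):
    """Raw (unnormalised) area under the ROC curve for fpr in [0, max_fpr]."""
    fpr, tpr = list(fpr), list(tpr)
    stop = bisect.bisect_right(fpr, max_fpr)  # first index with fpr > max_fpr
    if stop < len(fpr):
        # Interpolate the TPR at max_fpr between the two neighbouring points.
        x0, x1 = fpr[stop - 1], fpr[stop]
        y0, y1 = tpr[stop - 1], tpr[stop]
        tpr_end = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
    else:
        tpr_end = tpr[-1]  # max_fpr is at the end of the curve
    xs = fpr[:stop] + [max_fpr]
    ys = tpr[:stop] + [tpr_end]
    # Trapezoidal rule over every retained corner of the curve.
    return sum((b - a) * (ya + yb) / 2
               for a, b, ya, yb in zip(xs, xs[1:], ys, ys[1:]))


fpr_example = [0.0, 0.5, 0.5, 1.0]
tpr_example = [0.5, 0.5, 1.0, 1.0]
print(partial_auc(fpr_example, tpr_example, 0.1))
print(partial_auc(fpr_example, tpr_example, 1.0))
```

Recent scikit-learn versions also accept roc_auc_score(y, scores, max_fpr=0.1). That call returns a standardized partial AUC (the McClish correction) rather than the raw area, so it will not print 0.05 for this example.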

