python - How to calculate a partial Area Under the Curve (AUC)


In scikit-learn, one can compute the area under the curve for a binary classifier with

roc_auc_score( y, clf.predict_proba(x)[:,1] ) 

I am only interested in the part of the curve where the false positive rate is less than 0.1.

Given such a threshold false positive rate, how can I compute the AUC only for the part of the curve up to that threshold?

Here is an example with several ROC curves, for illustration:

[Figure: ROC curves plotted for several types of classifier.]

The scikit-learn docs show how to use roc_curve:

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
>>> fpr
array([ 0. ,  0.5,  0.5,  1. ])
>>> tpr
array([ 0.5,  0.5,  1. ,  1. ])
>>> thresholds
array([ 0.8 ,  0.4 ,  0.35,  0.1 ])

Is there a simple way to go from this to the partial AUC?


It seems that the only problem is how to compute the TPR value at FPR = 0.1, since roc_curve doesn't necessarily give you that.

Say we start with

import numpy as np
from sklearn import metrics

Now we set the true y and the predicted scores:

y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

(Note that y has been shifted down by 1 from your problem. This is inconsequential: the exact same results (fpr, tpr, thresholds, etc.) are obtained whether predicting 1, 2 or 0, 1, but some sklearn.metrics functions are a drag if not using 0, 1.)

Let's see the AUC here:

>>> metrics.roc_auc_score(y, scores)
0.75

As in your example:

>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores)
>>> fpr, tpr
(array([ 0. ,  0.5,  0.5,  1. ]), array([ 0.5,  0.5,  1. ,  1. ]))

This gives the following plot:

plot([0, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 1], [0.5, 1], [1, 1]); 


By construction, the ROC for a finite-length y will be composed of rectangles:

  • For a high enough threshold, everything will be classified as negative.

  • As the threshold decreases continuously, at discrete points, some negative classifications will be changed to positive.

So, for a finite y, the ROC is characterized by a sequence of connected horizontal and vertical lines leading from (0, 0) to (1, 1).
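To make the staircase shape concrete, here is a minimal sketch that builds the ROC points by sweeping the threshold over the distinct scores. The helper name roc_points is hypothetical and this is not how sklearn.metrics.roc_curve is implemented; it assumes labels are 0/1 and that there are no tied scores across classes. It also lists the (0, 0) starting point, which the fpr/tpr arrays shown above do not include.

```python
def roc_points(y_true, scores):
    """Return ROC points (fpr, tpr), from (0, 0) to (1, 1), via a threshold sweep."""
    n_pos = sum(1 for label in y_true if label == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    # Visit each distinct score as a threshold, highest first.
    for thresh in sorted(set(scores), reverse=True):
        tp = sum(1 for label, s in zip(y_true, scores) if s >= thresh and label == 1)
        fp = sum(1 for label, s in zip(y_true, scores) if s >= thresh and label == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(points)
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Each step changes either the FPR or the TPR but not both, which is why the curve consists only of horizontal and vertical segments.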

The AUC is the sum of these rectangles. Here, as shown above, the AUC is 0.75, as the rectangles have areas 0.5 * 0.5 + 0.5 * 1 = 0.75.

In some cases, people choose to calculate the AUC by linear interpolation. Say the length of y is much larger than the actual number of points calculated for FPR and TPR. Then, in this case, a linear interpolation is an approximation of what the points in between might have been. In some cases people also follow the conjecture that, had y been large enough, the points in between would be interpolated linearly. sklearn.metrics does not use this conjecture, and to get results consistent with sklearn.metrics, it is necessary to use rectangle, not trapezoidal, summation.
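The difference between the two summations only appears when the points on the vertical steps are dropped and the remaining corners are joined by sloped lines. A small sketch in plain Python, using the ROC points of the running example (step_area and trapezoid_area are hypothetical helper names, not library functions):

```python
def step_area(fpr, tpr):
    """Rectangle summation: each segment uses the TPR at its left end."""
    return sum((f1 - f0) * t0 for f0, f1, t0 in zip(fpr, fpr[1:], tpr))


def trapezoid_area(fpr, tpr):
    """Trapezoidal summation: each segment uses the mean of its two TPRs."""
    return sum((f1 - f0) * (t0 + t1) / 2
               for f0, f1, t0, t1 in zip(fpr, fpr[1:], tpr, tpr[1:]))


# Every ROC point of the example, including both ends of the vertical step.
full_fpr, full_tpr = [0.0, 0.5, 0.5, 1.0], [0.5, 0.5, 1.0, 1.0]
# Only the top of each vertical step, so consecutive points are joined by sloped lines.
corner_fpr, corner_tpr = [0.0, 0.5, 1.0], [0.5, 1.0, 1.0]

print(step_area(full_fpr, full_tpr))           # 0.75
print(trapezoid_area(full_fpr, full_tpr))      # 0.75
print(trapezoid_area(corner_fpr, corner_tpr))  # 0.875
```

With both ends of every vertical step present, the trapezoids degenerate into rectangles and the two rules agree; the larger value comes from interpolating between corners.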

Let's write our own function to calculate the AUC directly from fpr and tpr:

def auc_from_fpr_tpr(fpr, tpr, trapezoid=False):
    # Keep the last point at each distinct FPR value (the top of each vertical step).
    inds = [i for (i, (s, e)) in enumerate(zip(fpr[:-1], fpr[1:])) if s != e] + [len(fpr) - 1]
    fpr, tpr = fpr[inds], tpr[inds]
    area = 0
    ft = list(zip(fpr, tpr))  # list() so the pairs can be sliced in Python 3
    for p0, p1 in zip(ft[:-1], ft[1:]):
        # Width times height: mean of both TPRs (trapezoid) or the left TPR (rectangle).
        area += (p1[0] - p0[0]) * ((p1[1] + p0[1]) / 2 if trapezoid else p0[1])
    return area

This function takes the fpr and tpr, and an optional parameter stating whether to use trapezoidal summation. Running it, we get:

>>> auc_from_fpr_tpr(fpr, tpr), auc_from_fpr_tpr(fpr, tpr, True)
(0.75, 0.875)

We get the same result as sklearn.metrics for the rectangle summation, and a different, higher, result for the trapezoid summation.

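So, now we just need to see what would happen to the FPR/TPR points if we terminate at an FPR of 0.1. We can do this with the bisect module:

import bisect

def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
    # Index of the first FPR value that is >= thresh.
    p = bisect.bisect_left(fpr, thresh)
    fpr = fpr.copy()
    # Shorten the segment that contains the cutoff so that it ends at thresh.
    fpr[p] = thresh
    return fpr[: p + 1], tpr[: p + 1]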

How does this work? It simply checks where the insertion point of thresh would be in fpr. Given the properties of fpr (it must start at 0), the insertion point must be in a horizontal line. All rectangles before this one should be unaffected, all rectangles after this one should be removed, and this one should possibly be shortened.
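To see what the insertion point looks like in practice, here is a short standalone illustration of bisect.bisect_left on the FPR values of the running example, written with a plain list instead of a NumPy array:

```python
import bisect

fpr_values = [0.0, 0.5, 0.5, 1.0]  # FPR values from the running example

# bisect_left returns the first index whose value is >= the cutoff.
p_inside = bisect.bisect_left(fpr_values, 0.1)  # cutoff inside the first horizontal segment
p_equal = bisect.bisect_left(fpr_values, 0.5)   # cutoff equal to an existing FPR value
p_late = bisect.bisect_left(fpr_values, 0.7)    # cutoff inside the last horizontal segment

print(p_inside, p_equal, p_late)
# 1 1 3
```

A cutoff larger than the last FPR value would give an index past the end of the array, so the function above assumes that thresh lies between 0 and 1.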

Let's apply it:

>>> fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, 0.1)
>>> fpr_thresh, tpr_thresh
(array([ 0. ,  0.1]), array([ 0.5,  0.5]))

Finally, we just need to calculate the AUC from the updated versions:

>>> auc_from_fpr_tpr(fpr_thresh, tpr_thresh), auc_from_fpr_tpr(fpr_thresh, tpr_thresh, True)
(0.050000000000000003, 0.050000000000000003)

In this case, both the rectangle and trapezoid summations give the same results. Note that in general, they will not. For consistency with sklearn.metrics, the first one should be used.
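The truncation and the rectangle summation can also be folded into one helper. The following is only a sketch in plain Python (partial_auc_step is a hypothetical name); it assumes fpr is sorted in non-decreasing order, as roc_curve returns it, and that max_fpr lies within the range of fpr.

```python
import bisect


def partial_auc_step(fpr, tpr, max_fpr):
    """Area under a step ROC curve for FPR in [0, max_fpr], by rectangle summation."""
    # Keep one corner per distinct FPR value: the last (highest) TPR seen at that FPR.
    corners = {}
    for f, t in zip(fpr, tpr):
        corners[f] = t
    xs = sorted(corners)
    ys = [corners[x] for x in xs]
    if not xs[0] <= max_fpr <= xs[-1]:
        raise ValueError("max_fpr must lie within the range of fpr")
    # Drop everything past the cutoff and end the last segment exactly at max_fpr.
    p = bisect.bisect_left(xs, max_fpr)
    xs = xs[:p] + [max_fpr]
    ys = ys[:p + 1]
    return sum((x1 - x0) * y0 for x0, x1, y0 in zip(xs, xs[1:], ys))


fpr_example = [0.0, 0.5, 0.5, 1.0]
tpr_example = [0.5, 0.5, 1.0, 1.0]
print(partial_auc_step(fpr_example, tpr_example, 0.1))
print(partial_auc_step(fpr_example, tpr_example, 1.0))
```

With a cutoff of 1.0 the helper reproduces the full AUC of 0.75. Newer scikit-learn releases also accept a max_fpr argument in roc_auc_score; as far as I know it returns a standardized partial AUC rather than the raw area, so its value is not directly comparable to the 0.05 obtained here.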

