Welcome To My SAS Showcase


ROC comparison using stratified k-fold cross-validation in logistic regression, adjusting for oversampling


Cross-validation is a commonly used resampling method for estimating how well a predictive model will perform. This macro uses stratified k-fold cross-validation to evaluate a logistic model: it fits the model to the complete data set and then uses the cross-validated predicted probabilities to perform an ROC analysis. In stratified k-fold cross-validation, the original data set is randomly divided into k subsets of roughly equal size, with each subset constructed so that the proportion of each response value is approximately the same across subsets. Of the k folds, a single fold is held out as validation data for testing the model, and the remaining k-1 folds are used as training data. The process is repeated k times, with each fold used exactly once as the validation data. K-fold cross-validation overcomes a limitation of the holdout method, in which the evaluation can depend heavily on how the data happen to be split.
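The macro builds the stratified split with PROC RANK by ranking a uniform random number within each response level; the same idea can be sketched in plain Python. The function name `stratified_folds` and the 20% event rate below are illustrative, not part of the macro.

```python
import random

def stratified_folds(labels, k, seed=81):
    """Assign each observation to one of k folds so that every fold
    contains roughly the same proportion of each response value."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    # Group observation indices by response value (the strata).
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Within each stratum, shuffle and deal the indices out round-robin.
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[i] = j % k
    return folds

labels = [1] * 20 + [0] * 80          # toy data: 20% event rate
folds = stratified_folds(labels, k=5)
for f in range(5):
    members = [labels[i] for i in range(len(labels)) if folds[i] == f]
    print(f, len(members), sum(members))   # each fold: 20 obs, 4 events
```

Because the shuffling and dealing happen within each stratum separately, every fold inherits the overall event rate, which is exactly what the BY-group PROC RANK step accomplishes in the macro.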

This macro generates a graph of the ROC curves for comparison and tests the equality of the AUCs of the fitted model with and without cross-validation. It also outputs two temporary data sets: "pred", which contains the predicted probabilities with and without cross-validation, and "rocdata", which contains the cutoff probabilities, sensitivity, and 1-specificity, so you can use them to build a lift chart or gains chart. In addition, the macro uses the sampling-weights method to adjust for oversampling. If the response event is oversampled in your data set and you are interested in the predicted probabilities, specify the weight variable so the predicted probabilities are corrected for oversampling. The macro does not create the weight variable for you, but you can easily derive it from the population probability and the sample probability.
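The macro expects the weight variable to already exist on the input data set. Under the usual sampling-weights scheme, it can be derived from the population and sample event rates, as sketched below; the 2% population rate and 50% sample rate are made-up values for illustration.

```python
def oversample_weight(event, pop_rate, samp_rate):
    """Sampling weight that corrects for an oversampled event:
    events are down-weighted and non-events up-weighted so the
    weighted event rate matches the population rate."""
    if event:
        return pop_rate / samp_rate
    return (1 - pop_rate) / (1 - samp_rate)

# Population event rate 2%, oversampled to 50% in the modeling sample.
sample = [1] * 50 + [0] * 50
w = [oversample_weight(y, 0.02, 0.50) for y in sample]
weighted_event_rate = sum(wi for wi, y in zip(w, sample) if y) / sum(w)
print(round(weighted_event_rate, 4))   # 0.02
```

In SAS terms, this is the value you would compute in a DATA step and pass to the macro through the `_wt=` parameter.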


/*----------------------------------------------------------------------------
| MACRO NAME : cvlogit
| CREATED BY : Gu, Peihua (Dec 29, 2013)
| CONTACT    : lilygu10@yahoo.com
*----------------------------------------------------------------------------*
| PURPOSE
| Perform an ROC analysis using the stratified k-fold cross-validation
| method for a binary response.
*----------------------------------------------------------------------------*
| MACRO CALL
| %cvlogit(_delmiss=, _libnm=, _indata=, _seed=, _dep=, _grpcnt=, _deplevel=, _wt=, _charvar=, _numvar=)
*----------------------------------------------------------------------------*
| PARAMETERS
| _delmiss : enter y to drop all observations with any missing values from the input data set
| _libnm   : library of the input data set
| _indata  : input data set name
| _seed    : seed for the random fold assignment
| _dep     : dependent variable
| _grpcnt  : total number of cross-validation folds
| _deplevel: response level to be modeled; enter the formatted value if the variable is formatted
| _wt      : weight variable (optional; leave blank if no oversampling adjustment is needed)
| _charvar : list of categorical variables
| _numvar  : list of numeric variables
*----------------------------------------------------------------------------*
| EXAMPLE
| %cvlogit(_delmiss=n, _libnm=work, _indata=test, _seed=81, _dep=dep, _grpcnt=5,
|          _deplevel=1, _wt=, _charvar=x2, _numvar=x1);
*----------------------------------------------------------------------------*/

%macro cvlogit(_delmiss=, _libnm=, _indata=, _seed=, _dep=, _grpcnt=, _deplevel=, _wt=, _charvar=,
_numvar=);

title;
%local _grpnum _i _cvcnt;

data &_indata;
set &_libnm..&_indata;
_rannum=ranuni(&_seed);
%if %upcase(&_delmiss)=Y %then
%do;
if cmiss(of _all_ )=0;
%end;
run;

proc sort data=&_indata;
by &_dep;
run;

proc rank data=&_indata out=_indata_rk groups=&_grpcnt;
by &_dep;
var _rannum;
ranks _grp;
run;

proc sql noprint;
select distinct _grp into :_grpnum separated by ' '
from _indata_rk;
quit;

proc freq data=_indata_rk;
tables &_dep.*_grp / missing;
run;

proc datasets library=work nolist;
delete pred rocdata;
run;

%let _cvcnt=%sysfunc(countw(&_grpnum));

%do _i=1 %to &_cvcnt;
proc logistic data=_indata_rk outmodel=model&_i noprint;
class &_charvar/param=ref;
model &_dep(event="&_deplevel")=&_numvar &_charvar;
%if %length(&_wt) ne 0 %then
%do;
weight &_wt;
%end;
where _grp ne %scan(&_grpnum,&_i,' ');
run;

proc logistic inmodel=model&_i noprint;
score data=_indata_rk (where=(_grp=%scan(&_grpnum,&_i,' '))) out=pred&_i (drop=F_&_dep I_&_dep);
%if %length(&_wt) ne 0 %then
%do;
weight &_wt;
%end;
run;

proc append base=pred data=pred&_i;
run;
%end;

ods graphics on;
proc logistic data=pred;
class &_charvar/param=ref;
model &_dep(event=”&_deplevel”)=&_numvar &_charvar/outroc=rocdata(keep=_source_ _prob_
_sensit_ _1mspec_);
roc "&_cvcnt.-fold Cross-Validation" pred=p_&_deplevel;
roccontrast / estimate;
%if %length(&_wt) ne 0 %then
%do;
weight &_wt;
%end;
output out=pred(rename=(p_&_deplevel=pred_cv)) predicted=pred_model;
title "Model Assessment using &_cvcnt.-fold Cross-Validation";
run;
ods graphics off;

proc datasets library=work nolist;
delete _indata_rk
%do _i=1 %to &_cvcnt;
model&_i pred&_i
%end;;
run;

%mend;
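The ROCCONTRAST step above compares the apparent AUC (model scored on its own training data) with the cross-validated AUC. The AUC itself is just the Mann-Whitney statistic on the predicted probabilities, which can be sketched in a few lines of Python; the toy labels and probabilities below are made up to show that the cross-validated curve typically sits below the apparent one.

```python
def auc(labels, probs):
    """Area under the ROC curve computed as the Mann-Whitney statistic:
    the probability that a randomly chosen event outscores a randomly
    chosen non-event (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels   = [1, 1, 1, 0, 0, 0]
apparent = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # scored on the training data
crossval = [0.9, 0.4, 0.7, 0.8, 0.3, 0.2]   # out-of-fold scores
print(auc(labels, apparent), auc(labels, crossval))   # 1.0 vs about 0.778
```

In the macro, these two sets of probabilities correspond to the `pred_model` and `pred_cv` columns of the "pred" data set.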

Written by sasshowcase

December 30, 2013 at 1:43 pm
