SAS macro | Welcome To My SAS Showcase

Archive for the ‘SAS macro’ Category

ROC comparison using stratified k-fold cross-validation in logistic regression adjusting for oversampling

Cross-validation is one commonly used resampling method to evaluate how well a predictive model will perform. This macro uses stratified k-fold cross-validation method to evaluate model by fitting the model to the complete data set and using the cross-validated predicted probabilities to perform an ROC analysis. In stratified k-fold cross-validation, the original data set is randomly divided into K equal size subsets and each subset is selected so that the proportion of response value is approximately equal in all the subsets. Of the k folds, a single fold is used as the validation data for testing the model, and the remaining k-1 folds are used as training data. The process is repeated k times with each fold used exactly once as the validation data. K-fold cross-validation method overcomes the limitation of the holdout method in which the evaluation may depend heavily on how data is split.

This macro will generate the graph for ROC curves for comparison and test the equality of the AUCs of the fitted model with and without crossvalidation. It also outputs two temporary datasets named as “pred” which contains predictive probabilities with and without cross-validation and “rocdata” which contains cutoff probabilities, sensitivity and 1-specificity so you can use them to generate lift chart or gains chart. In addition, this macro uses sampling weights method to adjust for oversampling problem. If the response event is oversampled in your data set and you are interested in predictive probabilities, you need to specify the weight variable to correct the predictive probabilities for oversampling. However, the macro doesn’t generate the weight variable for you, you can easily create it based on population probability and sample probability.

P.S: wordpress changed quotation marks to fancy ones and changed the display after I modified the post. I haven’t figured out how to display them correctly. If you want to use this code, please change quotation marks back.

/*—————————————————————————————————————————————————————-
| MACRO NAME :   cvlogit
| CREATED BY      :   Gu, Peihua    (Dec 29, 2013)
| CONTACT          :   lilygu10@yahoo.com
*—————————————————————————————————————————————————————–*
| PURPOSE
| This macro is to perform an ROC analysis using the stratified k-fold cross-validation method for a binary response;
*—————————————————————————————————————————————————————–*
| MACRO CALL
| %cvlogit(_delmiss=, _libnm=, _indata=, _seed=, _dep=, _grpcnt=, _deplevel=, _wt=, _charvar=, _numvar=)
*—————————————————————————————————————————————————————–*
| REQUIRED PARAMETERS
| _delmiss : enter y if you want to drop all observations with any missing values from the input data set
| _libnm :    the library of the input data set
| _indata:    input data set name
| _seed:       seed number
| _dep:       dependent variable
| _grpcnt:    the total number of cross-validation folds
| _deplevel: the response level need to be modelled; enter the formatted value if the variable is formatted
| _wt:         weight variable
| _charvar: a list of categorical variables
| _numvar: a list of numeric variables
*—————————————————————————————————————————————————————–*
| EXAMPLE
| %cvlogit(_delmiss=n, _libnm=work, _indata=test, _seed=81, _dep=dep, _grpcnt=5,
|              _deplevel=1, _wt=, _charvar=x2, _numvar=x1);
*—————————————————————————————————————————————————————-*/

%macro cvlogit(_delmiss=, _libnm=, _indata=, _seed=, _dep=, _grpcnt=, _deplevel=, _wt=, _charvar=,
_numvar=);

title;
%local _grpnum _i;

data &_indata;
set &_libnm..&_indata;
_rannum=ranuni(&_seed);
%if %upcase(&_delmiss)=Y %then
%do;
if cmiss(of _all_ )=0;
%end;
run;

proc sort data=&_indata;
by &_dep;
run;

proc rank data=&_indata out=_indata_rk groups=&_grpcnt; by &_dep; var _rannum; ranks _grp; run;

proc sql noprint;
select distinct _grp into: _grpnum separated by ‘ ‘
from _indata_rk;
quit;

proc freq data=_indata_rk;
tables &_dep.*_grp / missing;
run;

proc datasets library=work nolist;
delete pred rocdata;
run;

%let _cvcnt=%sysfunc(countw(&_grpnum));

%do _i=1 %to &_cvcnt;
proc logistic data=_indata_rk outmodel=model&_i noprint;
class &_charvar/param=ref;
model &_dep(event=”&_deplevel”)=&_numvar &_charvar;
%if %length(&_wt) ne 0 %then
%do;
weight &_wt;
%end;
where _grp ne %scan(&_grpnum,&_i,’ ‘);
run;

proc logistic inmodel=model&_i noprint;
score data=_indata_rk (where=(_grp=%scan(&_grpnum,&_i, ‘ ‘))) out=pred&_i (drop=F_&_dep I_&_dep);
%if %length(&_wt) ne 0 %then
%do;
weight &_wt;
%end;
run;

proc append base=pred data=pred&_i;
run;
%end;

ods graphics on;
proc logistic data=pred;
class &_charvar/param=ref;
model &_dep(event=”&_deplevel”)=&_numvar &_charvar/outroc=rocdata(keep=_source_ _prob_
_sensit_ _1mspec_);
roc “&_cvcnt.-fold Cross-Validation” pred=p_&_deplevel;
roccontrast / estimate;
%if %length(&_wt) ne 0 %then
%do;
weight &_wt;
%end;
output out=pred(rename=(p_&_deplevel=pred_cv)) predicted=pred_model;
title “Model Assessment using &_cvcnt.-fold Cross-Validation”;
run;
ods graphics off;

proc datasets library=work nolist;
delete _indata_rk
%do _i=1 %to &_cvcnt;
model&_i pred&_i
%end;;
run;

%mend;

Written by sasshowcase

December 30, 2013 at 1:43 pm

Posted in SAS macro, Statistical Modeling

A Macro To Rank Values Within / Without By Groups

leave a comment »

/*—————————————————————————————————————————————————————-

| MACRO NAME : _grp

| CREATED BY : Gu, Peihua (Aug 17, 2013)

| CONTACT : lilygu10@yahoo.com

*—————————————————————————————————————————————————————–*

| PURPOSE

| This macro is to rank variables within By groups or without By groups;

| Ranked variables are named as original variable name plus _rk. (ex: if weight needs to be ranked,

| weight_rk will be the ranked variable)

*—————————————————————————————————————————————————————–*

| LIMITATION

| Can’t perform a weighted ranking. Combining Proc univariate and other procedures can do it.

| Length for variables need to be ranked should be less than 30 characters.

*—————————————————————————————————————————————————————–*

| MACRO CALL

| %_grp (_indata, _outdata, _varlist, _grpcnt, _bylist);

*—————————————————————————————————————————————————————–*

| REQUIRED PARAMETERS

| _indata : Input dataset name

| _outdata : Output dataset name

| _varlist: a list of variables need to be ranked

| _grpcnt: the number of groups variables need to be grouped into

| _bylist: a list of variables to form By groups; leave it as blank if users only rank variables for the whole group

*—————————————————————————————————————————————————————–*

| EXAMPLE

| %_grp(perm.score, perm.score_rank, height weight, 5, gender race)

*—————————————————————————————————————————————————————-*/

%macro _grp(_indata, _outdata, _varlist, _grpcnt, _bylist);

%local _varcnt _i;

%let _varcnt=%sysfunc(countw(&_varlist));

%if %length(&_bylist) ne 0 %then %do;

proc sort data=%str(&_indata); by &_bylist; run;

%end;

proc rank data=%str(&_indata) out=%str(&_outdata) groups=&_grpcnt;

%if %length(&_bylist) ne 0 %then %do;

by &_bylist;

%end;

var &_varlist;

ranks

%do _i=1 %to &_varcnt;

%scan(&_varlist, &_i)_rk

%end;;

run;

%mend;

Written by sasshowcase

October 19, 2013 at 12:40 pm

Posted in Data management, SAS macro

A Macro To Automate Importing Multiple Excel Files Containing Multiple Worksheets In Which Common Variables Have Different Length Or Types

leave a comment »

Note: Sorry that I forgot to mention another limitation. You may slightly change input function for your specific purpose. If you don’t want to print invalid data error message, you can add double question marks in your input function. Also, the code cannot be used in situations where variables which have date, time and dollar format have mixed data types (See the following example) in some Excel worksheets. I may modify the code when I have time.

BTW, you need to list all character variables for macro variable named “charvar” so the program can detect mixed type variables for each worksheet and convert them to numeric variables.

date	a
12/5/2009	—
—	—
—	—
—	—
—	—
—	—
—	$20
—	$5
	$6
12/8/2012	$189
11/19/2007	$19.40
8/25/1999	$1,568,941

Written by sasshowcase

March 20, 2013 at 6:17 am

Posted in Data management, SAS macro

A Macro to Build Both Bootstrap Percentile and BCA Confidence Intervals of Indirect Effect with Binary Outcomes

leave a comment »

Mediation addresses how an independent variable causes a change in a dependent variable. When a third variable is intermediate in the relationship between two variables, it is called a mediator. In general, a mediator is a variable that accounts for all or part of the relation between a predictor and an outcome. The figure below displays the path diagram for a single mediator model.

Three regression equations are used to investigate mediation,

Model 1: Y = i1 + cX + e1 (1)

Model 2: Y = i2 +c’X + bM + e2 (2)

Model 3: M = i3 + aX +e3 (3)

Where c is a total effect which represents the total relationship between independent variable and dependent variable without consideration of other variables, c’ is a direct effect which accounts for the relationship between X and Y adjusting for the effect of M, a mediator, b is the parameter relating the mediator to the dependent variable adjusted for the effect of the independent variable and a represents the relationship between X and M.

The product of a and b, ab, is the mediated effect which is also known as the indirect effect. The indirect effect is also equal to the difference between c and c’, c-c’. As a result, the total effect is the sum of a direct effect and an indirect effect.

These two approaches to quantify the mediated effects are equivalent when the dependent variable is continuous and ordinary regression is used. This is not the case in logistic regression. The residual variance in logistic regression is set to equal π*π/3 so the scale of latent variable Y is not the same across models. Therefore, the estimated c-c’ and ab are not equal and can be quite different.

The logistic regression coefficients need to be standardized so the scale is equivalent across equations. Standardized coefficients can be computed by dividing each coefficient by the square root of the variance of the predicted logit from which those coefficients arose.

An INDIRECT SPSS macro for mediation with categorical outcomes developed by Preacher and Hayes is available online. However, the authors didn’t standardize coefficients when logistic regression is used.

This SAS macro that I created can investigate mediation with a binary outcome for a single mediator model. As indirect effects do not meet normal assumption for statistical analysis, bootstrap method is used to estimate the significance of indirect effects. Both bootstrap percentile intervals and bias corrected accelerated 95% confidence intervals are computed.

Reference:

Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40, 879-891.

MacKinnon, D.P. (2008). Introduction to statistical mediation analysis. Mahwah, NJ: Erlbaum.