International Journal of Statistics and Management Systems 1, 151–177 (2006)
Vol. 5, No. 1 (January-June, 2020)
© Serials Publications 2006
Submitted: 13th March 2020; Revised: 25th April 2020; Accepted: 29th May 2020

Specification and Informational Issues in Credit Scoring

Nicholas M. Kiefer† and C. Erik Larson‡

Abstract

Lenders use rating and scoring models to rank credit applicants on their expected performance. The models and approaches are numerous. We explore the possibility that estimates generated by models developed with data drawn solely from extended loans are less valuable than they should be because of selectivity bias. We investigate the value of "reject inference" – methods that use a rejected applicant's characteristics, rather than loan performance data, in scoring model development. In the course of making this investigation, we also discuss the advantages of using parametric as well as nonparametric modeling. These issues are discussed and illustrated in the context of a simple stylized model.

Disclaimer: Portions of this paper were completed while Larson was employed by the Office of the Comptroller of the Currency. Views expressed herein are those of the authors, and do not necessarily represent the views of Fannie Mae, the Office of the Comptroller of the Currency, the U.S. Department of the Treasury, or their staffs.

Acknowledgements: We thank colleagues and workshop participants at Cornell University, the OCC and Syracuse University, as well as the referees, for helpful comments.

Received: January 9, 2006; Accepted: October 3, 2006.

Key words and phrases: Logistic regression, specification testing, risk management, nonparametrics, reject inference.

JEL Classification: C13, C14, C52, G11, G32

† Mailing Address: Departments of Economics and Statistical Sciences, Cornell University; and Risk Analysis Division, Office of the Comptroller of the Currency, US Department of the Treasury. Email: [email protected]

‡ Fannie Mae. E-mail: c erik [email protected]

1. Introduction

Credit scoring models rate credit applications on the basis of current application and past performance data. Typically, credit performance measures and borrower characteristics are calculated as functions of the data for a sample of borrowers. These measures are then used to develop statistical scoring models, whose outputs, scores, are forecasts of credit performance for borrowers with similar characteristics. For example, a model might generate a predicted performance measure as a function of the applicant's utilization rate on existing credit lines. A lender will typically use this performance predictor as part of a decision on whether or not to extend credit in response to the application. A simple decision rule would be to accept the application only if the estimated performance measure (say, the probability of delinquency or default) is less than a critical value $\alpha$. The appropriate performance metric may vary across applications. A natural metric in the stylized models we will discuss is default probability; although we find it useful to reference "default probability" throughout the paper, the discussion holds for essentially any performance measure. A practical, though more complicated, approach is to estimate a loan's profitability. We note that in retail banking practice, it is more common than not to report performance forecasts (scores) that increase in value as the probability of default decreases. In contrast, corporate and other business rating and scoring models usually report scores and grades that increase with the probability of default. In the balance of this paper, we make use of the latter convention.

Discussions of credit scoring, including various approaches for different applications (mortgage lending, credit card lending, small commercial lending, etc.), are given by Thomas, Edelman and Crook (2002), Hand (1997), Thomas, Crook, and Edelman (1996) (a collection of relevant articles) and others.
A recent review of the credit scoring problem, including an insightful discussion of evaluating scoring mechanisms (scoring the score), is given by Hand (2001). Early treatments of the scoring problem are Bierman and Hausman (1970), and Dirickx and Wakeman (1976); this work has been followed up by Srinivasan and Kim (1987) and others.

A critical issue in credit modeling is the relevance of the data on the experience of loans extended to the potential experience of loans declined. Can the relation between default and characteristics in the sample of loans made be realistically extended to predicting the default probabilities for loans not made? This problem is known as a "selectivity" problem. A number of methods based on "reject sampling" have been proposed to try to use data from rejected loan applications together with the experience of existing loans. A related issue is the relevance of the experience with loans presently or previously outstanding to current and future potential loans. Demographic changes (an aging population) or a different stage in the business cycle could diminish the relevance of such experience.

The procedure we examine is essentially sequential, though the full implications of the sequential updating process are not explored here. Loan applications are accepted according to some rule, possibly stochastic. The experience of these loans is recorded and used to estimate, or sharpen, existing estimates of default probabilities. Of course, repayment and default information is available only on loans that were extended. However, data are available on rejected loans, and we explore the potential for bias in using the data only on accepted loans. We also address the possibility of using "reject sampling" schemes to improve scoring models or default risk estimates. Our simple framework abstracts from some difficult practical problems (such as what exactly default is; how to account for loans applied for, accepted by the bank, and then declined by the applicant; and how default probabilities change over the duration of a loan). Nevertheless, our focus on default as the outcome of interest is a useful abstraction: in practice it may be appropriate to study the expected profit performance of a given loan application. This involves the default probability, but adds other considerations, including for example the pricing of the loan.

Throughout we emphasize a key conceptual distinction between two closely related questions: Should the bank continue to make loans to applicants with marginally acceptable characteristics? Should the bank extend loans to applicants whose characteristics under current rules are marginally unacceptable? There is data on the former question, as default probabilities can be directly measured from experience with loan performance. Because the latter question cannot be answered using this conventional data, we must turn to parametric assumptions or other methods by which to extrapolate from the given sample. The only reliable way to answer the second question is to use these parametric assumptions or to collect additional information. We suggest carrying out experiments with the scoring rule.

To sum up: we first cast some doubt on the likely importance of selectivity bias in credit scoring; we consider reject inference and raise doubts about its practical application; we consider advantages and possible disadvantages of parametric inference on default probabilities; and finally we turn to potential gains from experimentation with the credit-granting rule for the purpose of generating information.

2. A Stylized Model

A simple model allows concentration on key conceptual issues. Suppose the income from serving an account over a fixed period is $\pi$, the probability of default is $p$, and the loss given default is $\lambda$ (defined here as a positive number). Then the expected profit from this account over the period (we will return to the question of the period) is $\pi(1 - p) - \lambda p$. In this case, loans are profitable only if $p \le \pi/(\pi + \lambda)$. As a practical matter, banks often rank applicants according to the estimated value of $p$ and extend loans to those applicants with the smallest default probabilities (as funds are available) up to the critical value $p^* = \pi/(\pi + \lambda)$. Of course, there is a lot missing in this calculation, including the important question of estimation error in $p$ and how that might vary across applicants.

A minor variation on this calculation can get around the awkward question of the definition of the period. Let us reinterpret profit as a rate of income accrual, $\pi$. Assume the discount rate is $r$. Let $T$, a random variable, be the time of default and suppose for simplicity that $T$ is distributed exponentially with parameter $\theta$, $f(t) = \theta e^{-\theta t}$. Then expected profit is given by

$$E(\text{profit} \mid T) = \pi \int_0^T e^{-rt}\,dt - \lambda e^{-rT} = \pi/r - (\pi/r + \lambda)e^{-rT} \quad (2.1)$$

and unconditionally

$$E(\text{profit}) = \pi/r - \theta(\pi/r + \lambda)/(r + \theta). \quad (2.2)$$

Again, we get a cutoff rule: order the applicants in terms of $\theta$ and extend loans to those with the smallest values of $\theta$, up to the critical value. For a fixed period, there is a monotone map between $\theta$ and $p$, the default probability in the previous model.

The point of this exercise is not to exhibit a realistic model, but to illustrate that the lesson from the simple model is fairly robust. Namely, the optimal lending policy will involve ranking applicants according to a performance measure and lending funds as available up to a cutoff point.
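As a numerical sketch (our own illustration; the parameter values for $\pi$, $\lambda$ and $r$ are assumed, not from the paper), the two cutoff rules above can be coded directly:

```python
# One-period rule: expected profit pi*(1 - p) - lam*p, cutoff p* = pi/(pi + lam).
# Flow version (eq. 2.2): E(profit) = pi/r - theta*(pi/r + lam)/(r + theta).
# Parameter values below are illustrative only.

def one_period_profit(p, pi_=100.0, lam=400.0):
    """Expected one-period profit from an account with default probability p."""
    return pi_ * (1 - p) - lam * p

def cutoff(pi_=100.0, lam=400.0):
    """Critical default probability p* at which expected profit is zero."""
    return pi_ / (pi_ + lam)

def flow_profit(theta, pi_=100.0, lam=400.0, r=0.05):
    """Expected discounted profit when the default time is exponential(theta), eq. (2.2)."""
    return pi_ / r - theta * (pi_ / r + lam) / (r + theta)

p_star = cutoff()   # 100/(100+400) = 0.2 for these values
# Profit is positive below the cutoff, zero at it, negative above it;
# flow_profit is decreasing in theta, so ranking applicants by theta and
# lending up to a critical value is the same kind of rule.
```

Both versions deliver the same qualitative policy: rank applicants by the performance parameter ($p$ or $\theta$) and lend, as funds allow, up to the critical value.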
Note that, as a practical matter, essentially all of the "fixed" parameters in the simple model will vary across applicants and possibly over time according to macroeconomic and local economic conditions.

3. Information and Identification

Suppose the application data consists of $X$. At present, $X$ can be rather abstract, perhaps a collection of numbers indicating financial history, discrete or continuous variables, etc. On the basis of $X$, a decision is made whether to approve a loan application. Let $A$ be the variable indicating loan approval ($A = 1$) or decline ($A = 0$). We partition $X = (X_A, X_R)$ corresponding to characteristics associated with approved and rejected loans. $X$ is observed in both cases.

Suppose the population relationship between default $D$ ($= 1$ for default, $0$ for no default) and data $X$ is $P(D \mid X)$. $P(D \mid X)$ is thus the probability of default given characteristics $X$ in the population. The chain determining events is

$$X \rightarrow (A, X) \rightarrow (D \cdot A, A, X) = (D_A, A, X) \quad (3.1)$$

where the final state $D_A = D \cdot A$ consists of $D$ if it is observed, that is if $A = 1$, and no information on $D$ if $A = 0$. $D$ is partitioned $(D_A, D_R)$ and $D_R$ is not observed. Here $X$ determines $A$ and $X$ is simply carried along as a determinant of $D$. The final state $D_A$ is determined by $A$ and $X$.

The key observation here is that the intermediate state $(A, X)$ contains no information not already contained in $X$. $A$ is determined as a (possibly random) function of $X$. For example, $X$ might be the predictor variables in a default risk model and $A$ might be chosen to equal 1 (accept the application) if the predicted default probability is less than $\alpha$. In this case, $A$ is a deterministic function of $X$. Alternatively, $A$ could be completely random, determined, for example, by a coin flip. In the language of the statistical literature on missing data, the mechanism determining $A$ and hence $D_R$ is missing at random (MAR); see Little and Rubin (2002) and Hand and Henley (1994). The deterministic case, possibly relevant here, in which $A$ is determined by some function of $X$, is a special case of the MAR mechanism.

Since $A$ contains no information not contained in $X$, inference on $P(D \mid X)$ does not depend on $A$. Of course, this inference can only be made for $X$ configurations actually observed.
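A small simulation (our own sketch; the default probabilities and the acceptance rule are assumptions chosen for illustration) makes the MAR point concrete: when $A$ is a deterministic function of $X$, the accepted sample gives unbiased estimates of $P(D \mid X)$ for the $X$ configurations that are observed, and no information at all for the others.

```python
import random

random.seed(0)

# Assumed true conditional default probabilities for a binary X.
P_DEFAULT = {0: 0.30, 1: 0.05}

accepted = []                     # (x, d) records for accepted loans only
for _ in range(100_000):
    x = random.randint(0, 1)
    a = 1 if x == 1 else 0        # deterministic rule: approve only X = 1
    if a == 1:
        d = int(random.random() < P_DEFAULT[x])
        accepted.append((x, d))

# Estimate of P(D | X = 1) from the accepted loans: unbiased, because
# selection depended on X alone (a MAR mechanism).
n1 = sum(1 for x, _ in accepted if x == 1)
p_hat_1 = sum(d for x, d in accepted if x == 1) / n1

# No accepted loan has X = 0, so P(D | X = 0) is simply not identified.
assert all(x == 1 for x, _ in accepted)
```

The same code with a randomized rule (setting `a = 1` with a probability that depends only on `x`) identifies both conditional probabilities; what matters is that acceptance carries no information beyond $X$.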
Which credit histories are observed depends on $X$ (and possibly a random mechanism), so there is no bias associated with estimating those probabilities that are identified. To illustrate, suppose $X$ is binary and the deterministic selection rule takes only applications with $X = 1$. In this case, no information on $P(D \mid X = 0)$ will be generated, though additional information on $P(D \mid X = 1)$ will be. This illustrates the difference between the two central questions: First, are loans being made that shouldn't be made (a question that can be answered using estimates of $P(D \mid X = 1)$)? Second, are loans that should be made not being made (a question that must be answered using $P(D \mid X = 0)$, on which there is no data)?¹

¹ Note that $P(A \mid X)$ can be estimated, and such an estimate might help an outside examiner trying to determine, for example, whether an institutional loan policy satisfies various legal requirements. Nevertheless, it does not provide information on $P(D \mid X)$.

4. Hidden Variables and Selectivity

The potential for biases in using the accepted loan data only arises when the selection mechanism proxies for omitted, but important, variables in the default equation. To see this in our Markov setup, we augment the variables by including the hidden variable $U$. Thus

$$(X, U) \rightarrow (A, X, U) \rightarrow (D_A, A, X) \quad (4.1)$$

If $U$ were observed, the problem duplicates the previous one; if not, things become more complicated. Specifically, we would like to estimate $P(D \mid X, A)$, the conditional probability of default given characteristics, marginally with respect to the hidden $U$, on the basis of our observed data, which are informative on $P(D_A \mid X, A)$. In the previous section, $P(D \mid X)$ and $P(D_A \mid X, A)$ were the same, because $A$ carried no relevant information given $X$. In the present case, $A$ might be relevant as a proxy for $U$. This is the case referred to as not missing at random, NMAR.

This point can be made in the simpler context of inference on the marginal probability of default. Thus we focus temporarily on the selection issue, abstracting from the problem of inference on the effects of the $X$ variables. The chain becomes

$$U \rightarrow (A, U) \rightarrow (D_A, A) \quad (4.2)$$

and we wish to make inference on $P(D)$ on the basis of the data, which are informative on $P(D_A)$. Now, $P(D)$ is the marginal probability of default in the population, given by

$$P(D) = \int P(D \mid U)\, g(U)\, dU, \quad (4.3)$$

while

$$P(D_A) = \int P(D \mid U, A)\, g(U \mid A)\, dU = \int P(D \mid U)\, g(U \mid A)\, dU \quad (4.4)$$

(the second equality holds since $A$ carries no new information given $U$). Here $g(U)$ is the marginal distribution of $U$ in the population and $g(U \mid A)$ is the conditional distribution. Thus

$$P(D_A) \neq P(D) \quad (4.5)$$

unless $A$ and $U$ are independent. Hence using information on the accepted loans to make inference about the population default probability leads to bias.

The argument is easily extended to inference about the effects of characteristics $X$ on the conditional distribution $P(D \mid X)$ using data generated by the distribution $P(D_A \mid X, A = 1)$. If the hidden variable $U$ affects $D$ and $A$, then $A$ will proxy for the effect of $U$ in $P(D_A \mid X, A = 1)$, leading to incorrect inferences. Note that

$$P(D_A \mid X, U, A = 1) = P(D \mid X, U), \quad (4.6)$$

so $A$ is irrelevant given $U$ and $X$. Nevertheless

$$P(D_A \mid X, A = 1) \neq P(D \mid X). \quad (4.7)$$

It is only through the interdependence of $A$ and the missing hidden variable $U$ that bias arises.

What is the hidden variable $U$? This is not so clear. One obvious example arises when a variable used in scoring, and relevant for predicting default, does not enter the default probability model. It would be a clear mistake to include a variable in the scoring model that was not in the default model (although one could argue that not all variables in the default model need appear in the scoring model); thus, we suspect that this is not a likely source of bias.

The key is that the hidden variable must affect the decision to approve the loan and the default probability. This variable can be observed by whoever makes the lending decision but not by the default modeler. If loans are made in person, for example, an experienced loan officer may be able to get a "feel" that the applicant is more (or less) reliable than the paper data indicate. There may be many components to this "feel" not reflected in the application data: promptness in showing up for appointments, openness vs. shiftiness, vagueness or precision in answering questions.
Such observations will affect the loan decision and, if they are accurate, also the default probability. If the variable is observed by the loan originator and used in the acceptance decision, but is in fact not relevant to the default probability, there will be no induced bias in using the default data on the accepted loans.
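A small simulation (our own sketch; the distributions and coefficients are assumptions, not from the paper) makes the NMAR bias concrete: when a hidden $U$ shifts both the acceptance decision and the default probability, the default rate among accepted loans understates the population rate.

```python
import math
import random

random.seed(1)

# Assumed model for illustration: a hidden variable u (the officer's "feel")
# raises the default probability and, when high, triggers a decline.
n = 200_000
defaults = accepted = accepted_defaults = 0
for _ in range(n):
    u = random.gauss(0.0, 1.0)                  # seen by the officer, not the modeler
    p = 1.0 / (1.0 + math.exp(2.0 - 1.5 * u))   # P(D | u), increasing in u
    d = int(random.random() < p)
    a = int(u < 0.0)                            # decline applicants who "feel" risky
    defaults += d
    if a:
        accepted += 1
        accepted_defaults += d

p_pop = defaults / n                   # marginal P(D) in the population
p_acc = accepted_defaults / accepted   # P(D_A): all the modeler can observe
# p_acc falls well below p_pop: the accepted sample is selectively safer,
# so it is a biased basis for inference about the population probability.
```

If the acceptance rule instead depended only on recorded observables (replace `u < 0.0` with a function of application data), the estimate from accepted loans would be unbiased for the corresponding conditional probability, as in the previous section.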

Bias only arises if the data is relevant, is available to the acceptance decision maker and used, and is not available to the default modeler.

This bias cannot be corrected without adding information. One source of information is a priori – parametric assumptions on the joint distribution of $A$ and $D$ given $X$, $P(A, D \mid X)$. If these assumptions are sufficient to allow estimation of the parameters of the distribution given only the selected data, then the bias can be corrected. This approach has led to a huge literature in labor economics, beginning with Heckman (1976). Of course, a better source of information is more data. Impractical in the labor economics applications, where the decisions are made by the same individual (in the classical application $D$ is wages or hours of work and $A$ is employment), it is feasible when the institution determines $A$ and the applicant determines $D$. Much less restrictive assumptions can sometimes be used to bound the probabilities (Manski (1995) gives an insightful treatment of this approach and the identification question generally).

5. Reject Inference

Modelers typically employ "reject sampling" or "reject inference" because they are concerned that potentially relevant information in the application data for rejected loans ought to be used in the default model. In this section we ask whether there is any relevant information in such data. The answer is usually no. That is, in studying default probabilities conditional on characteristics $X$, the relevant random variables generating information about the probabilities of interest are the default/non-default records. The additional $X$ variables alone are not of great interest in studying defaults (although they are of course informative on the scoring process, since the associated dependent variable, accept/reject, is observed).
Useful discussions of reject sampling include Crook and Banasik (2004) and Hand and Henley (1993, 1994).

Many reject sampling procedures assign values of the missing dependent variable, default/non-default, for the rejected applications according to the values of the $X$ variables. This phase is referred to as "data augmentation." These values then enter a secondary analysis as real data. But the new default values are not random variables relevant to inference about defaults. That is, they are not default data. They are functions (possibly stochastic) of the existing default data. On a purely conceptual basis we have

$$\left.\begin{array}{l}(X_A, D_A)\ \text{for accepted loans}\\ X_R\ \text{for rejected loans}\end{array}\right\} \longrightarrow (X_A, X_R, D_A, D_R)\ \text{"augmented" data} \quad (5.1)$$

We have not been specific about how $D_R$, the default history for the rejected loans, is constructed, but the details are irrelevant for the concept. Namely, the augmented data do not contain any information not in the original data $X_A$, $D_A$ and $X_R$.

In this example, when the information content of the augmented data and the original data is the same, a proper data analysis (taking account of the singular conditional distribution of $D_A$ and $D_R$ in the augmented data set) will get the same answers from either of the two data sets. If the augmented data set is analyzed as though it were real data, the results will reflect the assignment $D_R$. At the very least, the results will offer false precision, as illustrated below. If the assignment is arbitrary, the results may distort the information in the actual data.

Consider the simple example with $X$ a single binary variable, and only one realized value chosen for the loan. There is information about only one of the default probabilities, corresponding to the chosen value of $X$, not about both. The fact that one of the probabilities is unidentified is telling. If reject sampling produces a data set that purports to identify the other probability, it is being identified with non-data information. Thus suppose

$$\left.\begin{array}{l}(X_A, D_A)\ \text{for accepted loans}\\ X_R\ \text{for rejected loans}\\ \text{Non-data information } Z\end{array}\right\} \longrightarrow (X_A, X_R, D_A, D_R)\ \text{"augmented" data} \quad (5.2)$$

The non-data information $Z$ consists of (in a common case) functional form assumptions or other assumptions made by the rejection sample design.
For example, in our simple case the default probability corresponding to the value of $X_R$ might just be assigned as, say, $\beta$. The result would be that an analysis of the augmented data set, treating it as a real data set, would discover that the default probability for the unselected value of $X_R$ is $\beta$. But would it be sensible for a bank to base decisions on this kind of inference? The point is that the information being recovered by an analysis of the augmented data is generated by $X_A$, $X_R$, $D_A$ and $Z$. One should ask whether $Z$ really deserves equal weight with the data.
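A numerical sketch (our own; the sample sizes and default counts are assumed) of why $Z$ does not deserve equal weight with the data: imputing the rejects' defaults at the accepted-sample rate leaves the point estimate unchanged, but the naive variance falls by the factor $n_1/(n_1+n_2)$, anticipating the calculation at the end of this section.

```python
import math

n1, defaults = 1000, 80     # accepted loans and observed defaults (assumed counts)
n2 = 1000                   # rejected applications: no performance data at all

p_hat = defaults / n1                                 # estimate from real data
se_real = math.sqrt(p_hat * (1.0 - p_hat) / n1)

# "Augment": credit each reject with p_hat fractional defaults.
p_aug = (defaults + n2 * p_hat) / (n1 + n2)
se_naive = math.sqrt(p_aug * (1.0 - p_aug) / (n1 + n2))

# The point estimate is unchanged -- no bias introduced -- but the naive
# variance is n1/(n1 + n2) times the real one: pure false precision.
variance_ratio = (se_naive / se_real) ** 2            # about n1/(n1 + n2) = 0.5 here
```

Nothing new has been learned between the two calculations; the extra apparent precision is generated entirely by the assumption $Z$.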

Here is a less obvious, and less arbitrary, example. Suppose, in the context of our example with binary $X$, the acceptance decision is randomized so that there are some loans with $X = 1$ and some with $X = 0$. Then there is data information on both default probabilities. Suppose these are estimated from the accepted data as $\hat{\beta}_0$ and $\hat{\beta}_1$, corresponding to $X = 0$ and $X = 1$. We propose to assign default data (the dependent variable) to $X_R$, the sample of application data from rejected loan applications. One way to do this would simply be to assign $\hat{\beta}_i$ as the value of the 0/1 variable $D_R$ corresponding to $X_R = i$. These non-0/1 dependent variables will pose problems for some estimation methods, however. Another assignment method is simply to draw $D_{Ri} = 1$ with probability $\hat{\beta}_i$ and zero otherwise. Another method in use is to assign, for each $X_R$, $\hat{\beta}_i$ observations to the sample of defaults and $1 - \hat{\beta}_i$ to the sample of non-defaults. Some methods multiply these fractions by a factor generating integer numbers of additional observations. The point is that no new information is contained in the augmented data set, though an analysis of the augmented data as though it were real data seems to produce much more precise parameter estimates than the accepted data alone. Here the non-data "information" $Z$ is the assumption that defaults in the rejected sample look exactly like their predicted values on the basis of the accepted sample. Thus, bias is not introduced, but a false sense of precision is.

Another common method of assignment is based on functional form assumptions. For example, suppose $X$ is a continuous scalar variable and the dependence of the default probability on $X$ is estimated by a logit model using data from the sample of loans extended. Suppose only values of $X$ greater than a cutoff $x^*$ are selected. Then, the accepted sample has $X > x^*$ and the declined $X \le x^*$.
Under the assumption that the logit model holds throughout the range of $X$ in the population, predicted default probabilities or predicted defaults can be made for the declined sample on the basis of information in the accepted sample. Adding these "observations" to the augmented data set will give seemingly more precise estimates of the same parameters used to generate the new observations. This is merely a classic example of double-counting.

Consider this effect in the case where the $X$ are all the same, so the default probability to be estimated is simply the marginal default probability. Using the sample of $n_1$ accepted loans, we estimate this probability by $\hat{p} = \#\text{defaults}/(\#\text{defaults} + \#\text{non-defaults})$ with sampling variance $\hat{p}(1 - \hat{p})/n_1$. Now consider augmenting the dataset with information from the $n_2$ declined loan applications. Assign defaults to these applications using one of the methods described above (for example, for each new observation, assign $\hat{p}$ new defaults and $1 - \hat{p}$ new non-defaults). Using the augmented sample, we calculate a new estimate, $\hat{\hat{p}} = \#\text{defaults in the augmented data}/(n_1 + n_2)$. Clearly $\hat{\hat{p}} = \hat{p}$, so our procedure has not introduced bias. (Assuming that the acceptance mechanism is not informative about the default probability, $\hat{p}$ is a correct estimator for the default probability.) However, the standard calculation of the sampling variance of the estimator gives $V(\hat{\hat{p}}) = \hat{\hat{p}}(1 - \hat{\hat{p}})/(n_1 + n_2) = n_1/(n_1 + n_2)$ times $V(\hat{p})$. If the accepted and declined samples are equal in size, the augmented data gives an estimator with one-half the variance of the accepted sample.

The ridiculousness of this procedure is easily illustrated by a further extension. Suppose there are an additional $n_3$ people who did not apply. In this example, knowing the $X$ for these people (everyone has the same $X$), we apply the same procedure. This leads to the new estimate $\hat{\hat{\hat{p}}} = \hat{\hat{p}} = \hat{p}$, but now with estimated variance $\hat{\hat{\hat{p}}}(1 - \hat{\hat{\hat{p}}})/(n_1 + n_2 + n_3)$. The opportunities for increased apparent precision here are endless...

6. Reject Inference: Mixture Models

Mixture models allow use of the $X_R$ data from rejected applications through modeling assumptions on the joint distribution of the $X$ characteristics and defaults. That is, the rejected applications are certainly informative on the distribution of $X$. If an assumption on the relationship between the marginal distribution of $X$ and the conditional distribution of $D$ given $X$ can be plausibly maintained, then the distribution of $X$ can be informative on defaults in the rejected sample. Note that this is a very strong assumption.

To see how this works, suppose the population consists of two groups, "defaulters" and "non-defaulters," with population (unconditional) proportions $\pi$ and $(1 - \pi)$.
The characteristics $X$ data are generated in the population according to the mixture model $p(x) = \pi p_d(x) + (1 - \pi) p_n(x)$, where $p_d$ and $p_n$ are the marginal distributions of characteristics in the default and non-default populations respectively.

The likelihood contribution of the $i$-th observation from the accepted sample is the joint probability of default and $X$ for those who default, namely $\pi p_d(x_i)$, and the joint probability of non-default and $X$ for those who do not, $(1 - \pi) p_n(x_i)$. The contribution of the $j$-th observation from the reject sample is the marginal probability of $X$, namely

$$p(x_j) = \pi p_d(x_j) + (1 - \pi) p_n(x_j), \quad (6.1)$$

and the likelihood function is the product of the likelihood contributions from both samples. A parametric model can be selected for each of the $p_i$ distributions and these parameters can be estimated along with $\pi$. The object of primary interest is the conditional probability of default given $x$, and this is given by

$$P(D \mid X) = \pi p_d(X)/(\pi p_d(X) + (1 - \pi) p_n(X)). \quad (6.2)$$

Feelders (2000) gives an example in which $p_n$ and $p_d$ are two different normal distributions. In this example he finds that the mixture approach (known to be the correct model) improves on an approach based on fitting a logistic regression using the complete data. Hand and Henley (1997) give an assessment similar to ours: without new information, perhaps in the form of functional form assumptions, reject inference is unlikely to be productive. To illustrate just how dependent this approach is on functional form assumptions, note that the model can be estimated, and predicted default probabilities calculated, without any data whatever on defaults! Closely related techniques go by the names cluster analysis and discriminant analysis.

How can the data on rejected applicants plausibly be used? The only hope is to get measurements on some proxy for the dependent variable on default experience. Here, external data such as credit bureau data may be useful. If the bureau data are available, and the declined applicant shows an additional credit line, then the payment performance on that credit line could be used as a measure of the performance of the loan had it been extended. Of course, there are a number of assumptions that must be made here. These are practical matters (Was the loan extended similar to the loan that was declined, and do the loan terms affect the default behavior? Is the bureau information comparable to the data on existing loans?), but the possibility remains that data could be assembled on rejected applicants. The requirement here is that payment performance be measured, albeit with noise. It cannot simply be imputed.

7. Parametric Models

The $X$ data used in default models typically contain continuous variables, for example, financial ratios, as well as discrete variables.
It is natural to experiment with parameterized models for the parsimonious description of the effects of these variables. A common specification is the logit, in which the log-odds follow the linear model $\ln(P(D = 1 \mid x)/P(D = 0 \mid x)) = x'\beta$, where $x$ is a vector consisting of values of the elements of $X$ and $\beta$ is a vector of coefficients. This model can be fit to data on accepted loans. In the absence of bias due to relevant hidden variables, and subject to well-known regularity conditions, the parameter $\beta$ will be consistently estimated. Under the maintained assumption that the functional form of the relationship between the characteristics $X$ and the default probability is the same in the accepted and declined samples, predicted values of the default probabilities in the declined sample are appropriate estimates of the default probabilities for those observations, and are appropriate for use as a scoring rule (or part of a scoring rule).

If the selection has been completely at random (MCAR), so that the $X$ configuration in the declined sample is the same as the $X$ configuration in the accepted sample, we are on firm ground. However, if selection is on the basis
