IFRS 9.5.5 introduces an impairment model for financial assets based on Expected Credit Losses (ECL), which requires entities to recognise a loss allowance before losses materialise, using both historical and forward-looking information. IFRS 9.5.5.1 stipulates that an entity “shall recognise a loss allowance for expected credit losses on a financial asset that is measured in accordance with paragraphs 4.1.2”, that is, financial assets “measured at amortised cost” held “to collect contractual cash flows” and whose “contractual terms of the financial asset give rise on specified dates to cash flows that are solely payments of principal and interest on the principal amount outstanding”. Referring to the impairment model’s input data, IFRS 9.5.5.4 expects entities to consider “all reasonable and supportable information, including that which is forward-looking”.
The ECL for Trade Receivables that contain a "significant financing component"1 under IFRS 15, such as credit card receivables, can be measured under the “Simplified Approach”. In contrast with the "General Approach", the Simplified Approach allows entities to recognise lifetime expected losses on all these assets without the need to identify significant increases in credit risk. In any case, because the maturities will typically be 12 months or less, the 12-month and lifetime ECLs would be the same. IFRS 9.5.5.15 states that "an entity shall always measure the loss allowance at an amount equal to lifetime expected credit losses for...trade receivables or contract assets that result from transactions that are within the scope of IFRS 15, and that…contain a significant financing component in accordance with IFRS 15, if the entity chooses as its accounting policy to measure the loss allowance at an amount equal to lifetime expected credit losses."
Lifetime expected credit loss is the discounted value of the expected credit losses that result from all possible default events over the expected life of a financial instrument. IFRS 9.5.5.17 clarifies that "An entity shall measure expected credit losses of a financial instrument in a way that reflects: (a) an unbiased and probability-weighted amount that is determined by evaluating a range of possible outcomes; (b) the time value of money." The term ‘default’ is not defined in IFRS 9. IFRS 9:B5.5.37 states that a definition of default should be "consistent with the definition used for internal credit risk management purposes". Entities will need to consider the requirements of this paragraph where it states there is a "rebuttable presumption that default does not occur later than when a financial asset is 90 days past due unless an entity has reasonable and supportable information to demonstrate that a more lagging default criterion is more appropriate".
IFRS 9.5.5.19 indicates that the maximum expected life is generally understood as the contractual life: "The maximum period to consider when measuring expected credit losses is the maximum contractual period (including extension options) over which the entity is exposed to credit risk and not a longer period". The expected period of exposure is more subjective. IFRS 9:B5.5.40 states that when determining expected life "an entity should consider factors such as historical information and experience about: (a) the period over which the entity was exposed to credit risk on similar financial instruments; (b) the length of time for related defaults to occur on similar financial instruments following a significant increase in credit risk; and (c) the credit risk management actions that an entity expects to take once the credit risk on the financial instrument has increased, such as the reduction or removal of undrawn limits."
With specific reference to revolving credit facilities, IFRS 9:B5.5.39 requires the entity to apply judgement regarding the time horizon of the credit exposure. Where financial instruments include both a loan and an undrawn commitment component (such as credit cards and overdraft facilities), the contractual ability to demand repayment and cancel the undrawn commitment does not necessarily limit the exposure to credit losses to the contractual period. For those financial instruments, management should measure ECL over the period that the entity is exposed to credit risk and over which ECL would not be mitigated by credit risk management actions, even if that period extends beyond the maximum contractual period. In the Illustrative Examples, IFRS 9:IE60 provides further guidance on which factors should be taken into consideration when determining the size and time horizon of the credit exposure: "At the reporting date the outstanding balance on the credit card portfolio is CU60,000 and the available undrawn facility is CU40,000. Bank A determines the expected life of the portfolio by estimating the period over which it expects to be exposed to credit risk on the facilities at the reporting date, taking into account: (a) the period over which it was exposed to credit risk on a similar portfolio of credit cards; (b) the length of time for related defaults to occur on similar financial instruments; and (c) past events that led to credit risk management actions because of an increase in credit risk on similar financial instruments, such as the reduction or removal of undrawn credit limits."
1 A significant financing component exists if the timing of payments agreed to by the parties to the contract (either explicitly or implicitly) provides the customer or the entity with a significant benefit of financing the transfer of goods or services to the customer. [IFRS 15:60]
- The ECL calculation model should calculate an unbiased and probability-weighted amount to be presented as an impairment to the book value of the financial assets in the Balance sheet.
- This unbiased and probability weighted amount is the difference between the present value of cashflows due under contract and the present value of cashflows that an entity expects to receive.
- The Expected Credit Loss is determined by the probability of default, the size of the exposure to defaulting customers, the expected recoverable amount in the event of default and the discount rate applied.
- The estimated size of the exposure is necessarily related to the expectations of the customers' drawdown of the undrawn commitment component over a defined time frame. The time frame will be governed by subjective evaluations focusing on how long it will take the entity to identify and take remedial action in relation to problem credit.
- The Lifetime Expected Credit Losses will have to incorporate the term structure of the default probability of the assets. In other words, the hazard rate or default intensity, which connotes an instantaneous rate of failure, should be used along with the exponential distribution to compute the cumulative probability of default for a given time horizon.
- The entity should apply a granular and dynamic approach to portfolio segmentation by grouping financial assets based on shared credit characteristics.
- As with all such forward-looking models, expected loss should be considered at an aggregate portfolio level, which generally involves incorporating some expectation of the effect of correlation between the constituent assets.
The future value of the Lifetime Expected Credit Loss of the portfolio at a future time $t$ is defined as a function of the probability of default $PD_t$, the expected exposure at the time of default $EAD_t$ and the size of the expected loss in the event of default $LGD_t$. The present value of this future value is obtained by discounting it at the Effective Interest Rate of the portfolio assets, $EIR$. Thus:

$$ECL_0 = \frac{PD_t \times EAD_t \times LGD_t}{(1 + EIR)^t}$$

$\lambda$ is the hazard rate or default intensity. More precisely, it is the (instantaneous) conditional probability of default over an infinitesimally small time interval $dt$, where $\tau$ denotes the default time:

$$\lambda \, dt = \Pr(t \le \tau < t + dt \mid \tau \ge t)$$

The estimation of the default probability of each credit portfolio constituent is achieved with the logit model, which employs the technique of logistic transformation to generate a sigmoid function bounded by 0 and 1:

$$PD_i = \frac{1}{1 + e^{-z_i}}$$

where $z_i$ is a linear regression function of the form:

$$z_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_n x_{n,i}$$

where $\beta_0, \beta_1, \dots, \beta_n$ are parameters that are estimated statistically and $x_{1,i}, \dots, x_{n,i}$ are scores, ratios and other explanatory variables for obligor $i$, transformed into binary "dummy" variables.

$PD_T$ is the average cumulative probability of default of the portfolio over $[0, T]$, that is, the output of the cumulative default-time distribution at time horizon $T$, where $T$ denotes the weighted average lifetime of the credit portfolio:

$$PD_T = 1 - e^{-\lambda T}$$
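As a minimal illustrative sketch (the hazard rate, exposure, LGD and discount-rate inputs below are assumptions for demonstration, not values from the text), the cumulative default probability implied by a constant hazard rate and the resulting discounted ECL could be computed as follows:

import numpy as np

def cumulative_pd(hazard_rate, horizon):
    # Cumulative probability of default over [0, T] under an exponential default-time distribution
    return 1 - np.exp(-hazard_rate * horizon)

def lifetime_ecl(pd_t, ead, lgd, eir, horizon):
    # Discounted expected credit loss: PD x EAD x LGD, discounted at the effective interest rate
    return (pd_t * ead * lgd) / (1 + eir) ** horizon

# Illustrative assumptions: 5% annual default intensity, 1-year weighted average life,
# CU60,000 drawn exposure, 80% loss given default, 20% effective interest rate
pd_1y = cumulative_pd(0.05, 1.0)
print(lifetime_ecl(pd_1y, 60000, 0.80, 0.20, 1.0))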
The Vasicek Model offers an elegant solution, allowing the computation of a portfolio default rate, $DR$, which integrates the impact of (negative) assumptions about future economic conditions and the effect of the correlation between the portfolio assets. The model takes three inputs:
* The weighted average standalone probability of default, denoted by $PD$;
* The average correlation of portfolio assets with the broader economy, denoted by $\rho$;
* A common systematic economic factor (such as GDP growth, general levels of credit quality, etc.), denoted by $Z$.
The default rate for an asymptotic portfolio, having estimated the average default probability, the default correlation parameter and the common market factor, is given by:

$$DR = N\left(\frac{N^{-1}(PD) - \sqrt{\rho}\,Z}{\sqrt{1-\rho}}\right)$$

$Z$ is a standard normal variable, $Z \sim N(0,1)$, representing the assumed severity of the economic downturn (downturns corresponding to negative realisations of $Z$). The higher the probability of default, the greater the correlation coefficient and the larger the assumed market downturn, the smaller the distance to default and the higher the associated default rate for the portfolio.
It may make more intuitive sense if the variable $Z$ is restated in terms of the inverse of the standard normal cumulative distribution and a probability input $\alpha$ ranging from 0.5 to 0.999, where the higher the input value, the more severe the assumed economic downturn. This results in:

$$DR = N\left(\frac{N^{-1}(PD) + \sqrt{\rho}\,N^{-1}(\alpha)}{\sqrt{1-\rho}}\right)$$
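A minimal sketch of this restated formula using scipy's standard normal distribution (the PD, correlation and downturn-severity inputs are illustrative assumptions):

import numpy as np
from scipy.stats import norm

def vasicek_default_rate(pd_avg, rho, alpha):
    # DR = N( (N^-1(PD) + sqrt(rho) * N^-1(alpha)) / sqrt(1 - rho) )
    return norm.cdf((norm.ppf(pd_avg) + np.sqrt(rho) * norm.ppf(alpha)) / np.sqrt(1 - rho))

# Illustrative assumptions: 5% average PD, 2% asset correlation, 99% assumed downturn severity
print(vasicek_default_rate(0.05, 0.02, 0.99))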
The correlation coefficient, $\rho$, can be obtained by adapting the Basel II IRB risk-weight formula for corporate exposures, which is based on the Vasicek model and which prescribes that correlations are bounded by upper and lower limits and are a function of the weighted average probability of default. For credit card default correlations, we employ the empirical study of Crook and Bellotti2 to set the lower bound at 0.396% and the upper bound at 4%, and we assume that correlation is an increasing function of the default probability:
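The exact functional form of the interpolation is not reproduced here; as a minimal sketch, one could borrow the exponential weighting scheme of the Basel IRB corporate formula with the credit card bounds above (the decay constant k = 50 is an assumption carried over from Basel, not a value from the Crook and Bellotti study), flipped so that correlation increases with PD as assumed in the text:

import numpy as np

def credit_card_correlation(pd_avg, rho_min=0.00396, rho_max=0.04, k=50):
    # Basel-style exponential weighting between a lower and an upper correlation bound.
    # NOTE: the Basel IRB formula makes correlation a decreasing function of PD;
    # here the weight is flipped so that correlation increases with PD, as assumed in the text.
    weight = (1 - np.exp(-k * pd_avg)) / (1 - np.exp(-k))
    return rho_min * (1 - weight) + rho_max * weight

print(credit_card_correlation(0.05))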
A "Two-stage" LGD model is implemented. The "Stage 1" model is a classification model to predict whether the loan will have a recovery rate (RR) greater than zero. The "Stage 2" model a regression-type model to predict the value of the recovered amount of when the recovery rate is expected to be positive. The predicted recovery is the expected value of the two combined models, that is, the product of a binary value representing the event of recovery and the expected recovery value. So, for obligor , predicted will be either:
Or:
Where is the predicted amount of postive RR obtained from a multivariate linear regression, is the probability of a postive RR obtained from a multivariate logistic regression assuming some threshold and is the obligor-specific recovery rate.
LGD is therefore:
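As a minimal sketch of this two-stage structure using scikit-learn (the feature matrix, recovery-rate vector and the 0.5 classification threshold are illustrative assumptions rather than values from the text):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def two_stage_lgd(X_train, rr_train, X_new, threshold=0.5):
    # Stage 1: classify whether any recovery occurs (RR > 0)
    stage1 = LogisticRegression(solver='lbfgs', max_iter=200)
    stage1.fit(X_train, (rr_train > 0).astype(int))
    # Stage 2: regress the size of the recovery rate on the observations with a positive RR
    stage2 = LinearRegression()
    stage2.fit(X_train[rr_train > 0], rr_train[rr_train > 0])
    # Combine: predicted RR is zero unless the recovery probability exceeds the threshold
    recovery_prob = stage1.predict_proba(X_new)[:, 1]
    rr_hat = np.where(recovery_prob > threshold, np.clip(stage2.predict(X_new), 0, 1), 0.0)
    return 1 - rr_hat  # LGD = 1 - RR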
For credit card portfolios, EAD estimation is bedevilled by the revolving nature of the credit line, which makes the exposure at the time of default difficult to predict. Because additional amounts are typically drawn in the period prior to default, taking the current balance of non-defaulted customers does not produce a sufficiently conservative estimate of the amount drawn by the time of default. One solution is to use historical data to derive a Credit Conversion Factor (CCF), the proportion of the current undrawn amount that is likely to be drawn down by the time of default. The dependent variable in the regression analysis will be:

$$CCF = \frac{\text{balance at default} - \text{balance at observation}}{\text{credit limit} - \text{balance at observation}}$$

So, for obligor $i$, the predicted $EAD_i$ will be:

$$EAD_i = \text{current balance}_i + CCF_i \times (\text{credit limit}_i - \text{current balance}_i)$$

where $CCF_i$ is the obligor-specific CCF multiplier obtained by applying the multivariate linear regression function to the obligor's data.
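A minimal sketch of the resulting EAD calculation, assuming a CCF regression has already been fitted on historical default data (the function and variable names below are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

def predict_ead(ccf_model, X_new, balance, limit):
    # EAD_i = current balance + CCF_i * undrawn commitment, with the predicted CCF capped to [0, 1]
    ccf_hat = np.clip(ccf_model.predict(X_new), 0, 1)
    return balance + ccf_hat * (limit - balance)

# Illustrative usage, assuming a CCF regression already fitted on historical defaults:
# ccf_model = LinearRegression().fit(X_hist, ccf_hist)
# ead_hat = predict_ead(ccf_model, X_new, current_balance, credit_limit)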
2 J. Crook & T. Bellotti (2012) Asset correlations for credit card defaults, Applied Financial Economics, 22:2, 87-95
To avoid any suggestion of the selective use of raw data or the gaming of model results, the procedure for treating raw data should be transparent and rigorous. For example:
import numpy as np
import pandas as pd
# 1) Retrieve loan data into dataframe
loan_data = pd.read_csv('loan_data_2007_2014.csv')
# 2) Convert string values to integers where necessary. First removing text...
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace(r'\+ years', '', regex=True)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')
#...then converting string datatype to numeric datatype
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])
# 2) (continued) Convert loan term strings to integers by removing the ' months' text
loan_data['term_int'] = pd.to_numeric(loan_data['term'].str.replace(' months', ''))
# 3) Convert string points in time to numeric periods of time where necessary.First converting to datetime format...
loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format = '%b-%y')
#...then converting to a new passage of time variable
loan_data['mths_since_earliest_cr_line'] = round(pd.to_numeric((pd.to_datetime('2017-12-01')
- loan_data['earliest_cr_line_date'])
/ np.timedelta64(1, 'M')))
loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format = '%b-%y')
loan_data['mths_since_issue_d'] = round(pd.to_numeric((pd.to_datetime('2017-12-01')
- loan_data['issue_d_date'])
/ np.timedelta64(1, 'M')))
# 4) Transform all discrete variables into dummy variables and concatenate in single dataframe
loan_data_dummies = [pd.get_dummies(loan_data['grade'], prefix = 'grade', prefix_sep = ':'),
pd.get_dummies(loan_data['sub_grade'], prefix = 'sub_grade', prefix_sep = ':'),
pd.get_dummies(loan_data['home_ownership'], prefix = 'home_ownership', prefix_sep = ':'),
pd.get_dummies(loan_data['verification_status'], prefix = 'verification_status', prefix_sep = ':'),
pd.get_dummies(loan_data['loan_status'], prefix = 'loan_status', prefix_sep = ':'),
pd.get_dummies(loan_data['purpose'], prefix = 'purpose', prefix_sep = ':'),
pd.get_dummies(loan_data['addr_state'], prefix = 'addr_state', prefix_sep = ':'),
pd.get_dummies(loan_data['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ':')]
loan_data_dummies = pd.concat(loan_data_dummies, axis = 1)
# 5) Incorporate new dummy variables into master dataframe
loan_data = pd.concat([loan_data, loan_data_dummies], axis = 1)
# 6) Replace missing values with appropriate alternative value or remove from dataset
loan_data['total_rev_hi_lim'].fillna(loan_data['funded_amnt'], inplace=True) # other variable
loan_data['annual_inc'].fillna(loan_data['annual_inc'].mean(), inplace=True) # mean value
loan_data['mths_since_earliest_cr_line'].fillna(0, inplace=True) # zero value
loan_data['acc_now_delinq'].fillna(0, inplace=True) # zero value
loan_data['total_acc'].fillna(0, inplace=True) # zero value
loan_data['pub_rec'].fillna(0, inplace=True) # zero value
loan_data['open_acc'].fillna(0, inplace=True) # zero value
loan_data['inq_last_6mths'].fillna(0, inplace=True) # zero value
loan_data['delinq_2yrs'].fillna(0, inplace=True) # zero value
loan_data['emp_length_int'].fillna(0, inplace=True) # zero value
# To remove null values from dataset:
#indices = loan_data[loan_data['emp_length_int'].isnull()].index
#loan_data.drop(indices, inplace=True)
# 7) Search for errors/anomalies/outliers in the dataset. Remove or replace
pd.crosstab(loan_data['home_ownership'],
loan_data['emp_length_int'],
values=loan_data['mths_since_earliest_cr_line'],
aggfunc='min').round(2)
loan_data['mths_since_earliest_cr_line'].describe()
# Replace all negative values of the variable with its maximum value
loan_data.loc[loan_data['mths_since_earliest_cr_line'] < 0,
              'mths_since_earliest_cr_line'] = loan_data['mths_since_earliest_cr_line'].max()
# Remove all negative values from dataset
#indices = loan_data[loan_data['mths_since_earliest_cr_line'] < 0].index
#loan_data.drop(indices, inplace=True)
The data should be divided into training and testing datasets. All discrete and continuous feature variables should be transformed into dummy variables. The initial transformation of the feature variables of the training dataset into narrow categories of arbitrary size is referred to as "fine classing". The process of creating new, refined and usually larger categories from these initial ones is known as "coarse classing".
A metric called Weight of Evidence (WoE) is employed to this end. The objective is to reduce the number of dummy variables. Weight of Evidence shows to what extent each category of an independent variable explains the dependent variable, and the aim is to obtain categories with a similar WoE. Ideally, each category (bin) should contain at least 5% of the observations and should have non-zero counts for both events and non-events. The WoE should be monotonic, i.e. either increasing or decreasing across the groupings.
The formula for WoE is:

$$WoE = \ln\left(\frac{n_{good,\,cat} / n_{good,\,total}}{n_{bad,\,cat} / n_{bad,\,total}}\right)$$

The steps to calculate WoE are:
# Define dependent 'Default' variable and add to loan_data dataframe
loan_data['good_bad'] = np.where(loan_data['loan_status'].isin(['Charged Off', 'Default',
'Does not meet the credit policy. Status:Charged Off',
'Late (31-120 days)']), 0, 1)
# Imports the libraries we need.
from sklearn.model_selection import train_test_split
cr_inp_train, cr_inp_test, cr_tgt_train, cr_tgt_test = train_test_split(loan_data.drop('good_bad', axis = 1),
loan_data['good_bad'],
test_size = 0.2,
random_state = 42)
# WoE function for discrete unordered variables
# The function takes 3 arguments: a feature dataframe, a string, and a target dataframe.
# The function returns a dataframe as a result.
def woe_discrete(df, discrete_variabe_name, good_bad_variable_df):
df = pd.concat([df[discrete_variabe_name], good_bad_variable_df], axis = 1)
df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
df = df.iloc[:, [0, 1, 3]]
df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
df['n_good'] = df['prop_good'] * df['n_obs']
df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
df = df.sort_values(['WoE'])
df = df.reset_index(drop = True)
df['diff_prop_good'] = df['prop_good'].diff().abs()
df['diff_WoE'] = df['WoE'].diff().abs()
df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
df['IV'] = df['IV'].sum()
return df
# NOTE ON GROUPBY
# Groups the data according to a criterion contained in one column (1st = Grade)
# Does not turn the names of the values of the criterion into index if as_index = False
# Aggregates the data in another column (good_bad) to these groups, using a selected function (mean)
# Syntax: Produces Pandas DataFrame >>> df.groupby('month')[['duration']].sum()
# WoE function for ordered discrete and continuous variables
def woe_ordered_continuous(df, discrete_variabe_name, good_bad_variable_df):
df = pd.concat([df[discrete_variabe_name], good_bad_variable_df], axis = 1)
df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
df = df.iloc[:, [0, 1, 3]]
df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
df['n_good'] = df['prop_good'] * df['n_obs']
df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
#df = df.sort_values(['WoE'])
#df = df.reset_index(drop = True)
df['diff_prop_good'] = df['prop_good'].diff().abs()
df['diff_WoE'] = df['WoE'].diff().abs()
df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
df['IV'] = df['IV'].sum()
return df
# NOTE: For ordered variables the results are kept in their natural order rather than sorted by WoE.
# WoE Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Imports the libraries we need.
sns.set()
# We set the default style of the graphs to the seaborn style.
# Below we define a graphing function that takes 2 arguments: a WoE dataframe and a number to rotate x labels
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
x = np.array(df_WoE.iloc[:, 0].apply(str))
# Turns the values of the column with index 0 to strings, makes an array from these strings, and passes it to variable x.
y = df_WoE['WoE']
plt.figure(figsize=(18, 6))
plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
plt.xlabel(df_WoE.columns[0])
# Names the x-axis with the name of the column with index 0.
plt.ylabel('Weight of Evidence')
# Names the y-axis 'Weight of Evidence'.
plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
# Names the graph 'Weight of Evidence by ' the name of the column with index 0.
plt.xticks(rotation = rotation_of_x_axis_labels)
# Rotates the labels of the x-axis a predefined number of degrees.
##### Procedure will be run twice. Once with training data and once with testing data #####
# New dataframe with training/test inputs and targets
df_inputs_prepr = cr_inp_train
df_targets_prepr = cr_tgt_train
#df_inputs_prepr = cr_inp_test
#df_targets_prepr = cr_tgt_test
df_targets_prepr
df_temp = woe_discrete(df_inputs_prepr, 'grade', df_targets_prepr)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_temp
plot_by_woe(df_temp)
df_temp = woe_ordered_continuous(df_inputs_prepr, 'emp_length_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# Using WoE we combine residential status categories.
df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'],
df_inputs_prepr['home_ownership:OTHER'],
df_inputs_prepr['home_ownership:NONE'],
df_inputs_prepr['home_ownership:ANY']])
# IF a region does not feature in the address (state) column, then it should be added and assigned zero values
if 'addr_state:ND' in df_inputs_prepr.columns.values:
pass
else:
df_inputs_prepr['addr_state:ND'] = 0
# Using WoE we combine region categories.
df_inputs_prepr['addr_state:ND_NE_IA_NV_FL_HI_AL'] = sum([df_inputs_prepr['addr_state:ND'], df_inputs_prepr['addr_state:NE'],
df_inputs_prepr['addr_state:IA'], df_inputs_prepr['addr_state:NV'],
df_inputs_prepr['addr_state:FL'], df_inputs_prepr['addr_state:HI'],
df_inputs_prepr['addr_state:AL']])
df_inputs_prepr['addr_state:NM_VA'] = sum([df_inputs_prepr['addr_state:NM'], df_inputs_prepr['addr_state:VA']])
df_inputs_prepr['addr_state:OK_TN_MO_LA_MD_NC'] = sum([df_inputs_prepr['addr_state:OK'], df_inputs_prepr['addr_state:TN'],
df_inputs_prepr['addr_state:MO'], df_inputs_prepr['addr_state:LA'],
df_inputs_prepr['addr_state:MD'], df_inputs_prepr['addr_state:NC']])
df_inputs_prepr['addr_state:UT_KY_AZ_NJ'] = sum([df_inputs_prepr['addr_state:UT'], df_inputs_prepr['addr_state:KY'],
df_inputs_prepr['addr_state:AZ'], df_inputs_prepr['addr_state:NJ']])
df_inputs_prepr['addr_state:AR_MI_PA_OH_MN'] = sum([df_inputs_prepr['addr_state:AR'], df_inputs_prepr['addr_state:MI'],
df_inputs_prepr['addr_state:PA'], df_inputs_prepr['addr_state:OH'],
df_inputs_prepr['addr_state:MN']])
df_inputs_prepr['addr_state:RI_MA_DE_SD_IN'] = sum([df_inputs_prepr['addr_state:RI'], df_inputs_prepr['addr_state:MA'],
df_inputs_prepr['addr_state:DE'], df_inputs_prepr['addr_state:SD'],
df_inputs_prepr['addr_state:IN']])
df_inputs_prepr['addr_state:GA_WA_OR'] = sum([df_inputs_prepr['addr_state:GA'], df_inputs_prepr['addr_state:WA'],
df_inputs_prepr['addr_state:OR']])
df_inputs_prepr['addr_state:WI_MT'] = sum([df_inputs_prepr['addr_state:WI'], df_inputs_prepr['addr_state:MT']])
df_inputs_prepr['addr_state:IL_CT'] = sum([df_inputs_prepr['addr_state:IL'], df_inputs_prepr['addr_state:CT']])
df_inputs_prepr['addr_state:KS_SC_CO_VT_AK_MS'] = sum([df_inputs_prepr['addr_state:KS'], df_inputs_prepr['addr_state:SC'],
df_inputs_prepr['addr_state:CO'], df_inputs_prepr['addr_state:VT'],
df_inputs_prepr['addr_state:AK'], df_inputs_prepr['addr_state:MS']])
df_inputs_prepr['addr_state:WV_NH_WY_DC_ME_ID'] = sum([df_inputs_prepr['addr_state:WV'], df_inputs_prepr['addr_state:NH'],
df_inputs_prepr['addr_state:WY'], df_inputs_prepr['addr_state:DC'],
df_inputs_prepr['addr_state:ME'], df_inputs_prepr['addr_state:ID']])
# Using WoE we combine purpose categories.
df_inputs_prepr['purpose:educ__sm_b__wedd__ren_en__mov__house'] = sum([df_inputs_prepr['purpose:educational'], df_inputs_prepr['purpose:small_business'],
df_inputs_prepr['purpose:wedding'], df_inputs_prepr['purpose:renewable_energy'],
df_inputs_prepr['purpose:moving'], df_inputs_prepr['purpose:house']])
df_inputs_prepr['purpose:oth__med__vacation'] = sum([df_inputs_prepr['purpose:other'], df_inputs_prepr['purpose:medical'],
df_inputs_prepr['purpose:vacation']])
df_inputs_prepr['purpose:major_purch__car__home_impr'] = sum([df_inputs_prepr['purpose:major_purchase'], df_inputs_prepr['purpose:car'],
df_inputs_prepr['purpose:home_improvement']])
df_inputs_prepr['term:36'] = np.where((df_inputs_prepr['term_int'] == 36), 1, 0)
df_inputs_prepr['term:60'] = np.where((df_inputs_prepr['term_int'] == 60), 1, 0)
# We create the following categories: '0', '1', '2 - 4', '5 - 6', '7 - 9', '10'
# '0' will be the reference category
df_inputs_prepr['emp_length:0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)
df_inputs_prepr['emp_length:1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)
df_inputs_prepr['emp_length:2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2, 5)), 1, 0)
df_inputs_prepr['emp_length:5-6'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5, 7)), 1, 0)
df_inputs_prepr['emp_length:7-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(7, 10)), 1, 0)
df_inputs_prepr['emp_length:10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['mths_since_issue_d_factor'] = pd.cut(df_inputs_prepr['mths_since_issue_d'], 50)
# Here we perform coarse-classing: we create the following categories:
# < 38, 38 - 39, 40 - 41, 42 - 48, 49 - 52, 53 - 64, 65 - 84, > 84.
df_inputs_prepr['mths_since_issue_d:<38'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(38)), 1, 0)
df_inputs_prepr['mths_since_issue_d:38-39'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(38, 40)), 1, 0)
df_inputs_prepr['mths_since_issue_d:40-41'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(40, 42)), 1, 0)
df_inputs_prepr['mths_since_issue_d:42-48'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(42, 49)), 1, 0)
df_inputs_prepr['mths_since_issue_d:49-52'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(49, 53)), 1, 0)
df_inputs_prepr['mths_since_issue_d:53-64'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(53, 65)), 1, 0)
df_inputs_prepr['mths_since_issue_d:65-84'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(65, 85)), 1, 0)
df_inputs_prepr['mths_since_issue_d:>84'] = np.where(df_inputs_prepr['mths_since_issue_d'] > 84, 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['int_rate_factor'] = pd.cut(df_inputs_prepr['int_rate'], 50)
# Here we perform coarse-classing: we create the following categories:
# '< 9.548', '9.548 - 12.025', '12.025 - 15.74', '15.74 - 20.281', '> 20.281'
df_inputs_prepr['int_rate:<9.548'] = np.where((df_inputs_prepr['int_rate'] <= 9.548), 1, 0)
df_inputs_prepr['int_rate:9.548-12.025'] = np.where((df_inputs_prepr['int_rate'] > 9.548) & (df_inputs_prepr['int_rate'] <= 12.025), 1, 0)
df_inputs_prepr['int_rate:12.025-15.74'] = np.where((df_inputs_prepr['int_rate'] > 12.025) & (df_inputs_prepr['int_rate'] <= 15.74), 1, 0)
df_inputs_prepr['int_rate:15.74-20.281'] = np.where((df_inputs_prepr['int_rate'] > 15.74) & (df_inputs_prepr['int_rate'] <= 20.281), 1, 0)
df_inputs_prepr['int_rate:>20.281'] = np.where((df_inputs_prepr['int_rate'] > 20.281), 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['funded_amnt_factor'] = pd.cut(df_inputs_prepr['funded_amnt'], 50)
# We retain these categories
# Fine classed. Categories: Evenly split into 50 bins
df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 50)
# Here we perform coarse-classing: we create the following categories:
#< 140, # 141 - 164, # 165 - 247, # 248 - 270, # 271 - 352, # > 352
df_inputs_prepr['mths_since_earliest_cr_line:<140'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:141-164'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140, 165)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:165-247'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(165, 248)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:248-270'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(248, 271)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:271-352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(271, 353)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:>352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'] > 352, 1, 0)
# Here we perform coarse-classing: we create the following categories:
# Categories: 0, 1-3, >=4
df_inputs_prepr['delinq_2yrs:0'] = np.where((df_inputs_prepr['delinq_2yrs'] == 0), 1, 0)
df_inputs_prepr['delinq_2yrs:1-3'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 1) & (df_inputs_prepr['delinq_2yrs'] <= 3), 1, 0)
df_inputs_prepr['delinq_2yrs:>=4'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 4), 1, 0)
# Categories: 0, 1 - 2, 3 - 6, > 6
df_inputs_prepr['inq_last_6mths:0'] = np.where((df_inputs_prepr['inq_last_6mths'] == 0), 1, 0)
df_inputs_prepr['inq_last_6mths:1-2'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 1) & (df_inputs_prepr['inq_last_6mths'] <= 2), 1, 0)
df_inputs_prepr['inq_last_6mths:3-6'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 3) & (df_inputs_prepr['inq_last_6mths'] <= 6), 1, 0)
df_inputs_prepr['inq_last_6mths:>6'] = np.where((df_inputs_prepr['inq_last_6mths'] > 6), 1, 0)
# Categories: '0', '1-3', '4-12', '13-17', '18-22', '23-25', '26-30', '>30'
df_inputs_prepr['open_acc:0'] = np.where((df_inputs_prepr['open_acc'] == 0), 1, 0)
df_inputs_prepr['open_acc:1-3'] = np.where((df_inputs_prepr['open_acc'] >= 1) & (df_inputs_prepr['open_acc'] <= 3), 1, 0)
df_inputs_prepr['open_acc:4-12'] = np.where((df_inputs_prepr['open_acc'] >= 4) & (df_inputs_prepr['open_acc'] <= 12), 1, 0)
df_inputs_prepr['open_acc:13-17'] = np.where((df_inputs_prepr['open_acc'] >= 13) & (df_inputs_prepr['open_acc'] <= 17), 1, 0)
df_inputs_prepr['open_acc:18-22'] = np.where((df_inputs_prepr['open_acc'] >= 18) & (df_inputs_prepr['open_acc'] <= 22), 1, 0)
df_inputs_prepr['open_acc:23-25'] = np.where((df_inputs_prepr['open_acc'] >= 23) & (df_inputs_prepr['open_acc'] <= 25), 1, 0)
df_inputs_prepr['open_acc:26-30'] = np.where((df_inputs_prepr['open_acc'] >= 26) & (df_inputs_prepr['open_acc'] <= 30), 1, 0)
df_inputs_prepr['open_acc:>=31'] = np.where((df_inputs_prepr['open_acc'] >= 31), 1, 0)
# Categories '0-2', '3-4', '>=5'
df_inputs_prepr['pub_rec:0-2'] = np.where((df_inputs_prepr['pub_rec'] >= 0) & (df_inputs_prepr['pub_rec'] <= 2), 1, 0)
df_inputs_prepr['pub_rec:3-4'] = np.where((df_inputs_prepr['pub_rec'] >= 3) & (df_inputs_prepr['pub_rec'] <= 4), 1, 0)
df_inputs_prepr['pub_rec:>=5'] = np.where((df_inputs_prepr['pub_rec'] >= 5), 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 50)
# Here we perform coarse-classing: we create the following categories: '<=27', '28-51', '>=52'
df_inputs_prepr['total_acc:<=27'] = np.where((df_inputs_prepr['total_acc'] <= 27), 1, 0)
df_inputs_prepr['total_acc:28-51'] = np.where((df_inputs_prepr['total_acc'] >= 28) & (df_inputs_prepr['total_acc'] <= 51), 1, 0)
df_inputs_prepr['total_acc:>=52'] = np.where((df_inputs_prepr['total_acc'] >= 52), 1, 0)
# Coarse classed. Categories: '0', '>=1'
df_inputs_prepr['acc_now_delinq:0'] = np.where((df_inputs_prepr['acc_now_delinq'] == 0), 1, 0)
df_inputs_prepr['acc_now_delinq:>=1'] = np.where((df_inputs_prepr['acc_now_delinq'] >= 1), 1, 0)
# Fine classed. Categories: Evenly split into 2000 bins
df_inputs_prepr['total_rev_hi_lim_factor'] = pd.cut(df_inputs_prepr['total_rev_hi_lim'], 2000)
# Coarse classed. Categories: <=5K', '5K-10K', '10K-20K', '20K-30K', '30K-40K', '40K-55K', '55K-95K', '>95K'
df_inputs_prepr['total_rev_hi_lim:<=5K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] <= 5000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:5K-10K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 5000) & (df_inputs_prepr['total_rev_hi_lim'] <= 10000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:10K-20K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 10000) & (df_inputs_prepr['total_rev_hi_lim'] <= 20000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:20K-30K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 20000) & (df_inputs_prepr['total_rev_hi_lim'] <= 30000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:30K-40K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 30000) & (df_inputs_prepr['total_rev_hi_lim'] <= 40000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:40K-55K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 40000) & (df_inputs_prepr['total_rev_hi_lim'] <= 55000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:55K-95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 55000) & (df_inputs_prepr['total_rev_hi_lim'] <= 95000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:>95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 95000), 1, 0)
# Fine classed. Categories: Evenly split into 50 bins
df_inputs_prepr['installment_factor'] = pd.cut(df_inputs_prepr['installment'], 50)
# Fine classed. Categories: Evenly split into 100 bins
df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 100)
# Coarse classed. Categories: '<20K', 10K-wide bands from 20K to 100K, '100K-120K', '120K-140K', '>140K'
df_inputs_prepr['annual_inc:<20K'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)
df_inputs_prepr['annual_inc:20K-30K'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)
df_inputs_prepr['annual_inc:30K-40K'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)
df_inputs_prepr['annual_inc:40K-50K'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)
df_inputs_prepr['annual_inc:50K-60K'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)
df_inputs_prepr['annual_inc:60K-70K'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)
df_inputs_prepr['annual_inc:70K-80K'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)
df_inputs_prepr['annual_inc:80K-90K'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)
df_inputs_prepr['annual_inc:90K-100K'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)
df_inputs_prepr['annual_inc:100K-120K'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)
df_inputs_prepr['annual_inc:120K-140K'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)
df_inputs_prepr['annual_inc:>140K'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)
# Categories: Missing, 0-3, 4-30, 31-56, >=57
df_inputs_prepr['mths_since_last_delinq:Missing'] = np.where((df_inputs_prepr['mths_since_last_delinq'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_delinq:0-3'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 0) & (df_inputs_prepr['mths_since_last_delinq'] <= 3), 1, 0)
df_inputs_prepr['mths_since_last_delinq:4-30'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 4) & (df_inputs_prepr['mths_since_last_delinq'] <= 30), 1, 0)
df_inputs_prepr['mths_since_last_delinq:31-56'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 31) & (df_inputs_prepr['mths_since_last_delinq'] <= 56), 1, 0)
df_inputs_prepr['mths_since_last_delinq:>=57'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 57), 1, 0)
# Fine classed. Categories: Evenly split into 100 bins
df_inputs_prepr['dti_factor'] = pd.cut(df_inputs_prepr['dti'], 100)
# Coarse classed. Categories: '<=1.4', '1.4-3.5', '3.5-7.7', '7.7-10.5', '10.5-16.1', '16.1-20.3', '20.3-21.7', '21.7-22.4', '22.4-35', '>35'
df_inputs_prepr['dti:<=1.4'] = np.where((df_inputs_prepr['dti'] <= 1.4), 1, 0)
df_inputs_prepr['dti:1.4-3.5'] = np.where((df_inputs_prepr['dti'] > 1.4) & (df_inputs_prepr['dti'] <= 3.5), 1, 0)
df_inputs_prepr['dti:3.5-7.7'] = np.where((df_inputs_prepr['dti'] > 3.5) & (df_inputs_prepr['dti'] <= 7.7), 1, 0)
df_inputs_prepr['dti:7.7-10.5'] = np.where((df_inputs_prepr['dti'] > 7.7) & (df_inputs_prepr['dti'] <= 10.5), 1, 0)
df_inputs_prepr['dti:10.5-16.1'] = np.where((df_inputs_prepr['dti'] > 10.5) & (df_inputs_prepr['dti'] <= 16.1), 1, 0)
df_inputs_prepr['dti:16.1-20.3'] = np.where((df_inputs_prepr['dti'] > 16.1) & (df_inputs_prepr['dti'] <= 20.3), 1, 0)
df_inputs_prepr['dti:20.3-21.7'] = np.where((df_inputs_prepr['dti'] > 20.3) & (df_inputs_prepr['dti'] <= 21.7), 1, 0)
df_inputs_prepr['dti:21.7-22.4'] = np.where((df_inputs_prepr['dti'] > 21.7) & (df_inputs_prepr['dti'] <= 22.4), 1, 0)
df_inputs_prepr['dti:22.4-35'] = np.where((df_inputs_prepr['dti'] > 22.4) & (df_inputs_prepr['dti'] <= 35), 1, 0)
df_inputs_prepr['dti:>35'] = np.where((df_inputs_prepr['dti'] > 35), 1, 0)
# Categories: 'Missing', '0-2', '3-20', '21-31', '32-80', '81-86', '>86'
df_inputs_prepr['mths_since_last_record:Missing'] = np.where((df_inputs_prepr['mths_since_last_record'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_record:0-2'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 0) & (df_inputs_prepr['mths_since_last_record'] <= 2), 1, 0)
df_inputs_prepr['mths_since_last_record:3-20'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 3) & (df_inputs_prepr['mths_since_last_record'] <= 20), 1, 0)
df_inputs_prepr['mths_since_last_record:21-31'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 21) & (df_inputs_prepr['mths_since_last_record'] <= 31), 1, 0)
df_inputs_prepr['mths_since_last_record:32-80'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 32) & (df_inputs_prepr['mths_since_last_record'] <= 80), 1, 0)
df_inputs_prepr['mths_since_last_record:81-86'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 81) & (df_inputs_prepr['mths_since_last_record'] <= 86), 1, 0)
df_inputs_prepr['mths_since_last_record:>86'] = np.where((df_inputs_prepr['mths_since_last_record'] > 86), 1, 0)
# View metadata
df_inputs_prepr.info()
##### Store training inputs in dataframe #####
cr_inp_train = df_inputs_prepr
##### Store test inputs in dataframe
#cr_inp_test = df_inputs_prepr
##### Save training data to CSV file #####
cr_inp_train.to_csv('cr_inp_train.csv')
cr_tgt_train.to_csv('cr_tgt_train.csv')
##### Save test data to CSV file #####
#cr_inp_test.to_csv('cr_inp_test.csv')
#cr_tgt_test.to_csv('cr_tgt_test.csv')
Having performed an initial filtration of predictor variables, a preliminary model is run with these variables. Care should be taken to remove one dummy variable (the reference category) for each original variable to avoid the so-called dummy variable trap.
loan_data_inputs_train = pd.read_csv('cr_inp_train.csv', index_col = 0)
loan_data_targets_train = pd.read_csv('cr_tgt_train.csv', index_col = 0)
loan_data_inputs_test = pd.read_csv('cr_inp_test.csv', index_col = 0)
loan_data_targets_test = pd.read_csv('cr_tgt_test.csv', index_col = 0)
# Select a limited set of input variables in a new dataframe.
inputs_train_with_ref_cat = loan_data_inputs_train.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'delinq_2yrs:0',
'delinq_2yrs:1-3',
'delinq_2yrs:>=4',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'open_acc:0',
'open_acc:1-3',
'open_acc:4-12',
'open_acc:13-17',
'open_acc:18-22',
'open_acc:23-25',
'open_acc:26-30',
'open_acc:>=31',
'pub_rec:0-2',
'pub_rec:3-4',
'pub_rec:>=5',
'total_acc:<=27',
'total_acc:28-51',
'total_acc:>=52',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'total_rev_hi_lim:<=5K',
'total_rev_hi_lim:5K-10K',
'total_rev_hi_lim:10K-20K',
'total_rev_hi_lim:20K-30K',
'total_rev_hi_lim:30K-40K',
'total_rev_hi_lim:40K-55K',
'total_rev_hi_lim:55K-95K',
'total_rev_hi_lim:>95K',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
# Here we store the names of the reference category dummy variables in a list.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'delinq_2yrs:>=4',
'inq_last_6mths:>6',
'open_acc:0',
'pub_rec:0-2',
'total_acc:<=27',
'acc_now_delinq:0',
'total_rev_hi_lim:<=5K',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
# Drop the variables with variable names in the list with reference categories to avoid dummy variable trap
inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Create an instance of an object from the 'LogisticRegression' class with specified parameters
reg = LogisticRegression(solver='lbfgs', max_iter=200,)
# Sets the pandas dataframe options to display all columns/ rows.
#pd.options.display.max_rows = None
# Estimates the coefficients of the object from the 'LogisticRegression' class
# np.ravel(training_labels) is required to convert the target data into a 1D numpy array
reg.fit(inputs_train, np.ravel(loan_data_targets_train))
# Displays the intercept contained in the estimated ("fitted") object from the 'LogisticRegression' class.
reg.intercept_
# Displays the coefficients contained in the estimated ("fitted") object from the 'LogisticRegression' class.
reg.coef_
feature_name = inputs_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LogisticRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
summary_table.head()
Having fitted the preliminary model, the p-values of the beta coefficients of the feature variables should be analysed to ascertain their statistical significance and to determine if they should be retained or discarded.
# P values for sklearn logistic regression.
# Class to display p-values for logistic regression in sklearn.
from sklearn import linear_model
import scipy.stats as stat
class LogisticRegression_with_p_values:

    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)
        #### Get p-values for the fitted model ####
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom, (X.shape[1], 1)).T
        F_ij = np.dot((X / denom).T, X)  ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij)  ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0] / sigma_estimates  # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]  ### two-tailed test for p-values
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        self.p_values = p_values
reg = LogisticRegression_with_p_values()
# We create an instance of an object from the newly created 'LogisticRegression_with_p_values()' class.
reg.fit(inputs_train, loan_data_targets_train)
# Estimates the coefficients of the object from the 'LogisticRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
# Same as above.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
# We take the result of the newly added method 'p_values' and store it in a variable 'p_values'.
p_values = reg.p_values
# Add the intercept for completeness.
p_values = np.append(np.nan, np.array(p_values))
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' var.
summary_table['p_values'] = p_values
summary_table.head()
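For instance, candidate variables for removal can be flagged by filtering the summary table on a conventional significance level (the 5% cut-off below is an illustrative assumption):

# Dummy variables whose coefficients are not statistically significant at the 5% level
insignificant = summary_table[summary_table['p_values'] > 0.05]
insignificant.sort_values('p_values', ascending=False)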
# We are going to remove features whose dummy-variable coefficients are not statistically significant
# for all, or almost all, of their categories.
# We do that by specifying a reduced list of dummy variables and an updated list of reference categories.
# Then we drop the reference categories from the reduced list of dummy variables.
# Variables
inputs_train_with_ref_cat = loan_data_inputs_train.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
inputs_train.head()
# Here we run a new model.
reg2 = LogisticRegression_with_p_values()
reg2.fit(inputs_train, loan_data_targets_train)
feature_name = inputs_train.columns.values
# Results for our final PD model.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg2.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg2.intercept_[0]]
summary_table = summary_table.sort_index()
p_values = reg2.p_values
p_values = np.append(np.nan,np.array(p_values))
summary_table['p_values'] = p_values
summary_table.head()
import pickle
# pickle.dump() takes two arguments: the object you want to pickle and the file to which the object has to be saved.
# To open the file for writing, simply use the open() function. The first argument should be the name of your file.
# The second argument is 'wb'. The w means that you'll be writing to the file, and b refers to binary mode.
# Here we export our model to a 'SAV' file with file name 'pd_model1.sav'.
pickle.dump(reg2, open('pd_model1.sav', 'wb'))
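To reload the saved model for later scoring, a minimal usage sketch (assuming the same 'pd_model1.sav' file):

# Re-open the pickled PD model for scoring new data
with open('pd_model1.sav', 'rb') as model_file:
    pd_model = pickle.load(model_file)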
# Here, from the dataframe with inputs for testing, we keep the same variables that we used in our final PD model.
inputs_test_with_ref_cat = loan_data_inputs_test.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
# And here, in the list below, we keep the variable names for the reference categories,
# only for the variables we used in our final PD model.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
inputs_test = inputs_test_with_ref_cat.drop(ref_categories, axis = 1)
inputs_test.head()
# Calculates the predicted binary values for the dependent variable (targets)
# based on the out of sample values of the independent variables (inputs) and the coefficients of the refined model
# Output values > 0.5 = 1; Output values < 0.5 = 0;
y_hat_test = reg2.model.predict(inputs_test)
y_hat_test
loan_data_targets_test_temp = loan_data_targets_test
loan_data_targets_test_temp.reset_index(drop = True, inplace = True)
# We reset the index of a dataframe.
# Concatenates two dataframes.
df_actual_predicted = pd.concat([loan_data_targets_test_temp, pd.DataFrame(y_hat_test)], axis = 1)
# Names Columns
df_actual_predicted.columns = ['loan_data_targets_test', 'y_hat_test (0.5)']
# Makes the index of one dataframe equal to the index of another dataframe.
df_actual_predicted.index = loan_data_inputs_test.index
df_actual_predicted.head()
import itertools
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
normalize=False,
title='CONFUSION MATRIX',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float')
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, fontsize=20)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.3f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "white")
plt.tight_layout()
plt.ylabel('*TRUE LABEL*', fontsize=14)
plt.xlabel('*PREDICTED LABEL*', fontsize=14)
plt.show()
cm = confusion_matrix(loan_data_targets_test_temp, y_hat_test)
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm, classes,
normalize=False,
title='CONFUSION MATRIX - Threshold = 0.5',
cmap=plt.cm.RdYlGn)
# Actual vs Predicted binary target variables (where 0.5 is the cutoff for predicted default/non-default)
from sklearn.metrics import accuracy_score
print("Accuracy (Out-of-Sample, threshold=0.5): ", accuracy_score(loan_data_targets_test_temp, y_hat_test))
# Calculates the predicted probability values for the dependent variable (targets)
# based on the out of sample values of the independent variables (inputs) and the coefficients of the refined model.
# This is an array of arrays of predicted class probabilities for all classes.
# In this case, the first value of every sub-array is the probability for the observation to belong to the first class, i.e. 0,
# and the second value is the probability for the observation to belong to the second class, i.e. 1.
y_hat_test_proba = reg2.model.predict_proba(inputs_test)
y_hat_test_proba = y_hat_test_proba[:][:,1]
y_hat_test_proba
df_actual_predicted_probs = pd.concat([loan_data_targets_test_temp, pd.DataFrame(y_hat_test_proba)], axis = 1)
df_actual_predicted_probs.columns = ['loan_data_targets_test', 'y_hat_test_proba']
df_actual_predicted_probs.index = loan_data_inputs_test.index
df_actual_predicted_probs.head()
import matplotlib.pyplot as plt
plt.hist(df_actual_predicted_probs['y_hat_test_proba'], bins=50)
plt.title('Probability Distribution - No Default', fontsize=20)
plt.show()
tr = 0.9
# We create a new column with an indicator,
# where every observation that has predicted probability greater than the threshold has a value of 1,
# and every observation that has predicted probability lower than the threshold has a value of 0.
df_actual_predicted_probs['y_hat_test'] = np.where(df_actual_predicted_probs['y_hat_test_proba'] > tr, 1, 0)
# Creates a cross-table where the actual values are displayed by rows and the predicted values by columns.
# This table is known as a Confusion Matrix.
cm_df = pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'],
df_actual_predicted_probs['y_hat_test'],
rownames = ['Actual'], colnames = ['Predicted'])
# Confusion Matrix as numpy array
cm_arr = np.array(cm_df)
cm_arr
# Confusion Matrix normalized by number of observations
cm_arr_norm = np.array([[cm_arr[0,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])),
cm_arr[0,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))],
[cm_arr[1,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])),
cm_arr[1,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))]])
cm_arr_norm
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm_arr, classes,
normalize=False,
title='CONFUSION MATRIX - Threshold = 0.9',
cmap=plt.cm.RdYlGn)
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm_arr_norm, classes,
normalize=True,
title='NORM. CONFUSION MATRIX - Threshold = 0.9',
cmap=plt.cm.RdYlGn)
print("Accuracy (Out-of-Sample, threshold=0.9): ", cm_arr[0,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))
+ cm_arr[1,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])))
Model performance is evaluated by considering the shape of the ROC curve, the Area Under the ROC Curve (AUROC) and the Gini coefficient on the Testing (Out-of-Sample) data.
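As a minimal illustrative sketch (not part of the modelling pipeline; the toy labels and probabilities below are made up), the Gini coefficient reported further down is simply 2 * AUROC - 1:
import numpy as np
from sklearn.metrics import roc_auc_score
toy_actual = np.array([1, 0, 1, 1, 0, 1, 0, 1])                   # 1 = good (no default), 0 = bad (default)
toy_proba = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.95])   # made-up estimated probabilities of being good
toy_auroc = roc_auc_score(toy_actual, toy_proba)
toy_gini = 2 * toy_auroc - 1                                      # Gini = 2 * AUROC - 1
print(toy_auroc, toy_gini)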
from sklearn.metrics import roc_curve, roc_auc_score
roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Returns the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.
# As a result, we get three arrays: the false positive rates, the true positive rates, and the thresholds.
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Here we store each of the three arrays in a separate variable.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.plot(fpr, tpr)
# We plot the false positive rate along the x-axis and the true positive rate along the y-axis,
# thus plotting the ROC curve.
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('False Pos rate (% of Bad Loans Incorr. classified)')
# We name the x-axis "False positive rate".
plt.ylabel('True Pos rate (% of Good Loans Corr. Classified)')
# We name the x-axis "True positive rate".
plt.title('ROC curve',fontsize=20)
# We name the graph "ROC curve".
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
optimal_tpr = tpr[optimal_idx]
optimal_fpr = fpr[optimal_idx]
optimal_threshold, optimal_tpr, optimal_fpr
print("Optimal Threshold of : ", optimal_threshold)
print("At Index : ", optimal_idx)
print("With Optimal True Pos Rate of : ", optimal_tpr)
print("And Optimal False Pos Rate of : ", optimal_fpr)
j_scores = tpr-fpr
j_ordered = sorted(zip(fpr, tpr, j_scores, thresholds))
j_ordered_df = pd.DataFrame(data=j_ordered, columns=['FPR', 'TPR', 'TPR-FPR','Thresholds'])
j_ordered_df.head()
j_ordered_df.tail()
AUROC = roc_auc_score(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Calculates the Area Under the Receiver Operating Characteristic Curve (AUROC)
# from a set of actual values and their predicted probabilities.
AUROC
df_actual_predicted_probs = df_actual_predicted_probs.sort_values('y_hat_test_proba')
# Sorts a dataframe by the values of a specific column.
df_actual_predicted_probs.head()
df_actual_predicted_probs.tail()
df_actual_predicted_probs = df_actual_predicted_probs.reset_index()
# We reset the index of a dataframe and overwrite it.
df_actual_predicted_probs.head()
df_actual_predicted_probs['Cumulative N Population'] = df_actual_predicted_probs.index + 1
# We calculate the cumulative number of all observations.
# We use the new index for that. Since indexing in Python starts from 0, we add 1 to each index.
df_actual_predicted_probs['Cumulative N Good'] = df_actual_predicted_probs['loan_data_targets_test'].cumsum()
# We calculate cumulative number of 'good', which is the cumulative sum of the column with actual observations.
df_actual_predicted_probs['Cumulative N Bad'] = df_actual_predicted_probs['Cumulative N Population'] - df_actual_predicted_probs['loan_data_targets_test'].cumsum()
# We calculate cumulative number of 'bad', which is
# the difference between the cumulative number of all observations and cumulative number of 'good' for each row.
df_actual_predicted_probs.head()
df_actual_predicted_probs['Cumulative Perc Population'] = df_actual_predicted_probs['Cumulative N Population'] / (df_actual_predicted_probs.shape[0])
# We calculate the cumulative percentage of all observations.
df_actual_predicted_probs['Cumulative Perc Good'] = df_actual_predicted_probs['Cumulative N Good'] / df_actual_predicted_probs['loan_data_targets_test'].sum()
# We calculate cumulative percentage of 'good'.
df_actual_predicted_probs['Cumulative Perc Bad'] = df_actual_predicted_probs['Cumulative N Bad'] / (df_actual_predicted_probs.shape[0] - df_actual_predicted_probs['loan_data_targets_test'].sum())
# We calculate the cumulative percentage of 'bad'.
df_actual_predicted_probs.head()
df_actual_predicted_probs.tail()
# Plot Prob of Default of Population
x = 1-(df_actual_predicted_probs['y_hat_test_proba'])
# Since 'y_hat_test_proba' is the estimated probability of being 'good', 1 minus it is the estimated probability of default.
plt.scatter(df_actual_predicted_probs['Cumulative Perc Population'], x)
# We plot the cumulative percentage of the population along the x-axis and the estimated probability of default along the y-axis.
plt.xlabel('Cumulative % Observed Population')
plt.ylabel('Probability Default')
plt.title('Probability Default - Portfolio Constituents',fontsize=20)
# Plot Prob of Default of Population, this time against the cumulative number of observations
x = 1-(df_actual_predicted_probs['y_hat_test_proba'])
plt.scatter(df_actual_predicted_probs['Cumulative N Population'], x)
# We plot the cumulative number of observations along the x-axis and the estimated probability of default along the y-axis.
plt.xlabel('Cumulative N Observed Population')
plt.ylabel('Probability Default')
plt.title('Probability Default - Portfolio Constituents',fontsize=20)
# Plot Gini
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Bad'])
# We plot the cumulative percentage of all along the x-axis and the cumulative percentage 'bad' along the y-axis,
# thus plotting the Gini curve.
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Population'],
linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('Cumulative % Observed Population')
# We name the x-axis "Cumulative % Population".
plt.ylabel('Cumulative % Observed Bad')
# We name the y-axis "Cumulative % Bad".
plt.title('Gini',fontsize=20)
# We name the graph "Gini".
Gini = AUROC * 2 - 1
# Here we calculate Gini from AUROC.
Gini
Model performance is also evaluated using the Kolmogorov-Smirnov (KS) coefficient on the Testing (Out-of-Sample) data, which measures the maximum difference between the cumulative distribution functions of observed good and bad borrowers with respect to the estimated probabilities of being "Good" according to the model. The greater the difference, the better the model.
# Plot KS
plt.plot(df_actual_predicted_probs['y_hat_test_proba'], df_actual_predicted_probs['Cumulative Perc Bad'], color = 'r')
# We plot the predicted (estimated) probabilities along the x-axis and the cumulative percentage 'bad' along the y-axis,
# colored in red.
plt.plot(df_actual_predicted_probs['y_hat_test_proba'], df_actual_predicted_probs['Cumulative Perc Good'], color = 'b')
# We plot the predicted (estimated) probabilities along the x-axis and the cumulative percentage 'good' along the y-axis,
# colored in blue.
plt.xlabel('Estimated Probability for being Good')
# We name the x-axis "Estimated Probability for being Good".
plt.ylabel('Cumulative %')
# We name the y-axis "Cumulative %".
plt.legend(['Cumulative Perc Bad', 'Cumulative Perc Good'])
plt.title('Kolmogorov-Smirnov',fontsize=20)
# We name the graph "Kolmogorov-Smirnov".
KS = max(df_actual_predicted_probs['Cumulative Perc Bad'] - df_actual_predicted_probs['Cumulative Perc Good'])
# We calculate KS from the data. It is the maximum of the difference between the cumulative percentage of 'bad'
# and the cumulative percentage of 'good'.
print("KS Coefficient: ", KS)
#pd.options.display.max_columns = None
# Sets the pandas dataframe options to display all columns/ rows.
inputs_test_with_ref_cat.head()
summary_table.head()
y_hat_test_proba
summary_table.head()
ref_categories
df_ref_categories = pd.DataFrame(ref_categories, columns = ['Feature name'])
# We create a new dataframe with one column. Its values are the values from the 'ref_categories' list.
# We name it 'Feature name'.
df_ref_categories['Coefficients'] = 0
# We create a second column, called 'Coefficients', which contains only 0 values.
df_ref_categories['p_values'] = np.nan
# We create a third column, called 'p_values', which contains only NaN values.
df_ref_categories.head()
df_scorecard = pd.concat([summary_table, df_ref_categories])
# Concatenates two dataframes.
df_scorecard = df_scorecard.reset_index()
# We reset the index of a dataframe.
df_scorecard
df_scorecard['Original feature name'] = df_scorecard['Feature name'].str.split(':').str[0]
# We create a new column, called 'Original feature name', which contains the value of the 'Feature name' column,
# up to the colon (':') symbol.
df_scorecard
min_score = 300
max_score = 850
df_scorecard.groupby('Original feature name')['Coefficients'].min()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Coefficients' column, calculating their minimum.
min_sum_coef = df_scorecard.groupby('Original feature name')['Coefficients'].min().sum()
# Up to the 'min()' method everything is the same as in the line above.
# Then, we aggregate further and sum all the minimum values.
min_sum_coef
df_scorecard.groupby('Original feature name')['Coefficients'].max()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Coefficients' column, calculating their maximum.
max_sum_coef = df_scorecard.groupby('Original feature name')['Coefficients'].max().sum()
# Up to the 'max()' method everything is the same as in the line above.
# Then, we aggregate further and sum all the maximum values.
max_sum_coef
df_scorecard['Score - Calculation'] = df_scorecard['Coefficients'] * (max_score - min_score) / (max_sum_coef - min_sum_coef)
# We multiply the value of the 'Coefficients' column by the ratio of the difference between
# the maximum score and the minimum score to the difference between the maximum and the minimum sum of coefficients.
df_scorecard
df_scorecard['Score - Calculation'][0] = ((df_scorecard['Coefficients'][0] - min_sum_coef) / (max_sum_coef - min_sum_coef)) * (max_score - min_score) + min_score
# We divide the difference of the value of the 'Coefficients' column and the minimum sum of coefficients by
# the difference of the maximum sum of coefficients and the minimum sum of coefficients.
# Then, we multiply that by the difference between the maximum score and the minimum score.
# Then, we add minimum score.
df_scorecard.head()
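As a minimal numeric sketch of the scaling above (the coefficient of 0.8 and the spread of 11.0 are made-up values, not outputs of the fitted model), a single coefficient maps to score points as follows:
example_coef = 0.8                    # hypothetical model coefficient
example_spread = 11.0                 # stand-in for (max_sum_coef - min_sum_coef)
example_points = example_coef * (max_score - min_score) / example_spread
print(example_points)                 # roughly 40 score points for this hypothetical coefficient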
df_scorecard['Score - Preliminary'] = df_scorecard['Score - Calculation'].round()
# We round the values of the 'Score - Calculation' column.
df_scorecard.head()
min_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Preliminary'].min().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Preliminary' column, calculating their minimum.
# Sums all minimum values.
min_sum_score_prel
max_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Preliminary'].max().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Preliminary' column, calculating their maximum.
# Sums all maximum values.
max_sum_score_prel
# Because of rounding, the preliminary scores do not add up to the intended 300-850 range exactly,
# so one point has to be subtracted from the score of one original variable. Which one? We'll evaluate based on the rounding differences.
df_scorecard['Difference'] = df_scorecard['Score - Preliminary'] - df_scorecard['Score - Calculation']
df_scorecard.head()
df_scorecard['Score - Final'] = df_scorecard['Score - Preliminary']
df_scorecard['Score - Final'][77] = 16
df_scorecard.head()
min_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Final'].min().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Final' column, calculating their minimum.
# Sums all minimum values.
min_sum_score_prel
max_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Final'].max().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Final' column, calculating their maximum.
# Sums all maximum values.
max_sum_score_prel
inputs_test_with_ref_cat.head()
df_scorecard.head()
inputs_test_with_ref_cat_w_intercept = inputs_test_with_ref_cat.copy()
# We copy the dataframe so that inserting the intercept column does not modify the original.
inputs_test_with_ref_cat_w_intercept.insert(0, 'Intercept', 1)
# We insert a column in the dataframe, with an index of 0, that is, in the beginning of the dataframe.
# The name of that column is 'Intercept', and its values are 1s.
inputs_test_with_ref_cat_w_intercept.head()
inputs_test_with_ref_cat_w_intercept = inputs_test_with_ref_cat_w_intercept[df_scorecard['Feature name'].values]
# Here, from the 'inputs_test_with_ref_cat_w_intercept' dataframe, we keep only the columns with column names,
# exactly equal to the row values of the 'Feature name' column from the 'df_scorecard' dataframe.
inputs_test_with_ref_cat_w_intercept.head()
scorecard_scores = df_scorecard['Score - Final']
inputs_test_with_ref_cat_w_intercept.shape
scorecard_scores.shape
scorecard_scores = scorecard_scores.values.reshape(102, 1)
scorecard_scores.shape
y_scores = inputs_test_with_ref_cat_w_intercept.dot(scorecard_scores)
# Here we multiply the values of each row of the dataframe by the values of each column of the variable,
# which is an argument of the 'dot' method, and sum them. It's essentially the sum of the products.
y_scores.head()
y_scores.tail()
sum_coef_from_score = ((y_scores - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
# We divide the difference between the scores and the minimum score by
# the difference between the maximum score and the minimum score.
# Then, we multiply that by the difference between the maximum sum of coefficients and the minimum sum of coefficients.
# Then, we add the minimum sum of coefficients.
y_hat_proba_from_score = np.exp(sum_coef_from_score) / (np.exp(sum_coef_from_score) + 1)
# Here we divide an exponent raised to sum of coefficients from score by
# an exponent raised to sum of coefficients from score plus one.
y_hat_proba_from_score.head()
y_hat_test_proba[0: 5]
df_actual_predicted_probs['y_hat_test_proba'].head()
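As a hedged sketch of the same back-transformation for a single score (the score of 600 below is hypothetical, not one produced by the scorecard):
example_score = 600
example_coef_sum = ((example_score - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
example_proba_good = np.exp(example_coef_sum) / (np.exp(example_coef_sum) + 1)
print(example_proba_good)             # estimated probability of being 'good' implied by a score of 600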
# We need the confusion matrix again.
#np.where(np.squeeze(np.array(loan_data_targets_test)) == np.where(y_hat_test_proba >= tr, 1, 0), 1, 0).sum() / loan_data_targets_test.shape[0]
tr = 0.9
df_actual_predicted_probs['y_hat_test'] = np.where(df_actual_predicted_probs['y_hat_test_proba'] > tr, 1, 0)
#df_actual_predicted_probs['loan_data_targets_test'] == np.where(df_actual_predicted_probs['y_hat_test_proba'] >= tr, 1, 0)
pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted'])
pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
(pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[0, 0] + (pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[1, 1]
from sklearn.metrics import roc_curve, roc_auc_score
roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.plot(fpr, tpr)
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
thresholds
thresholds.shape
df_cutoffs = pd.concat([pd.DataFrame(thresholds), pd.DataFrame(fpr), pd.DataFrame(tpr)], axis = 1)
# We concatenate 3 dataframes along the columns.
df_cutoffs.columns = ['thresholds', 'fpr', 'tpr']
# We name the columns of the dataframe 'thresholds', 'fpr', and 'tpr'.
df_cutoffs.head()
df_cutoffs['thresholds'][0] = 1 - 1 / np.power(10, 16)
# Let the first threshold (the value of the thresholds column with index 0) be equal to a number, very close to 1
# but smaller than 1, say 1 - 1 / 10 ^ 16.
df_cutoffs['Score'] = ((np.log(df_cutoffs['thresholds'] / (1 - df_cutoffs['thresholds'])) - min_sum_coef) * ((max_score - min_score) / (max_sum_coef - min_sum_coef)) + min_score).round()
# The score corresponding to each threshold equals:
# the difference between the natural logarithm of the odds (threshold / (1 - threshold)) and the minimum sum of coefficients,
# multiplied by the ratio of the difference between the maximum score and the minimum score
# to the difference between the maximum sum of coefficients and the minimum sum of coefficients,
# plus the minimum score.
df_cutoffs.head()
df_cutoffs['Score'][0] = max_score
df_cutoffs.head()
df_cutoffs.tail()
# We define a function called 'n_approved' which assigns a value of 1 if a predicted probability
# is greater than or equal to the parameter p, which is a threshold, and a value of 0 if it is not.
# Then it sums the column.
# Thus, given any threshold, the function returns
# the number of rows with estimated probabilities greater than or equal to that threshold.
def n_approved(p):
return np.where(df_actual_predicted_probs['y_hat_test_proba'] >= p, 1, 0).sum()
df_cutoffs['N Approved'] = df_cutoffs['thresholds'].apply(n_approved)
# Assuming that all credit applications above a given probability of being 'good' will be approved,
# when we apply the 'n_approved' function to a threshold, it will return the number of approved applications.
# Thus, here we calculate the number of approved applications for all thresholds.
df_cutoffs['N Rejected'] = df_actual_predicted_probs['y_hat_test_proba'].shape[0] - df_cutoffs['N Approved']
# Then, we calculate the number of rejected applications for each threshold.
# It is the difference between the total number of applications and the approved applications for that threshold.
df_cutoffs['Approval Rate'] = df_cutoffs['N Approved'] / df_actual_predicted_probs['y_hat_test_proba'].shape[0]
# Approval rate equals the ratio of the approved applications to all applications.
df_cutoffs['Rejection Rate'] = 1 - df_cutoffs['Approval Rate']
# Rejection rate equals one minus approval rate.
df_cutoffs.head()
df_cutoffs.tail()
df_cutoffs.iloc[5000: 5200, ]
# Here we display the dataframe with cutoffs from the row with index 5000 to the row with index 5200.
df_cutoffs.iloc[1000: 1200, ]
# Here we display the dataframe with cutoffs from the row with index 1000 to the row with index 1200.
inputs_train_with_ref_cat.to_csv('inputs_train_with_ref_cat.csv')
df_scorecard.to_csv('df_scorecard.csv')
import numpy as np
import pandas as pd
# Import data.
loan_data_preprocessed_backup = pd.read_csv('loan_data_2007_2014_preprocessed.csv')
loan_data_preprocessed = loan_data_preprocessed_backup.copy()
loan_data_preprocessed.columns.values
# Displays all column names.
loan_data_preprocessed.head()
pd.options.display.max_columns = None
loan_data_preprocessed
# Create a series of Boolean values indicating whether loan is recognised as "Charged Off"
loan_data_preprocessed['loan_status'].isin(['Charged Off','Does not meet the credit policy. Status:Charged Off'])
# Create a dataframe with data only for those accounts recognized as "Charged Off"
loan_data_defaults = loan_data_preprocessed[loan_data_preprocessed['loan_status'].isin(['Charged Off',
'Does not meet the credit policy. Status:Charged Off'])]
loan_data_defaults
pd.options.display.max_rows = None
loan_data_defaults.isnull().sum()
# We fill the missing values with zeroes.
loan_data_defaults['mths_since_last_delinq'].fillna(0, inplace = True)
loan_data_defaults['mths_since_last_record'].fillna(0, inplace=True)
# We calculate the dependent variable for the LGD model, the recovery rate, and add it to the defaults dataframe.
loan_data_defaults['recovery_rate'] = loan_data_defaults['recoveries'] / loan_data_defaults['funded_amnt']
loan_data_defaults['recovery_rate'].describe()
formatted_mean = "{:.4f}".format(loan_data_defaults['recovery_rate'].mean())
print("Total Defaulted Loans : " ,loan_data_defaults['recovery_rate'].count())
print("Mean Recovery Rate on Defaulted Loans : " ,formatted_mean)
loan_data_defaults['recovery_rate'] = np.where(loan_data_defaults['recovery_rate'] > 1,
1, loan_data_defaults['recovery_rate'])
loan_data_defaults['recovery_rate'] = np.where(loan_data_defaults['recovery_rate'] < 0,
0, loan_data_defaults['recovery_rate'])
# We set recovery rates that are greater than 1 to 1 and recovery rates that are less than 0 to 0.
loan_data_defaults['CCF'] = (loan_data_defaults['funded_amnt'] - loan_data_defaults['total_rec_prncp']) / loan_data_defaults['funded_amnt']
# We calculate the dependent variable for the EAD model: credit conversion factor.
# It is the ratio of the difference between the funded amount and the total principal recovered before default to the funded amount,
# i.e. the proportion of the funded amount still outstanding at the moment of default.
loan_data_defaults['CCF'].describe()
# Shows some descriptive statistics for the values of a column.
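For intuition, a purely illustrative CCF calculation (the amounts below are made up, not taken from the data):
example_funded = 10000                # hypothetical funded amount
example_rec_prncp = 4000              # hypothetical principal recovered before default
example_ccf = (example_funded - example_rec_prncp) / example_funded
print(example_ccf)                    # 0.6, i.e. 60% of the funded amount was still outstanding at default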
loan_data_defaults.to_csv('loan_data_defaults.csv')
# We save the data to a CSV file.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.title('Distribution Recovery Rate',fontsize=20)
plt.hist(loan_data_defaults['recovery_rate'], bins = 100);
# We plot a histogram of a variable with 100 bins.
plt.title('Distribution CCF',fontsize=20)
plt.hist(loan_data_defaults['CCF'], bins = 100);
# We plot a histogram of a variable with 100 bins.
loan_data_defaults['recovery_rate_0_1'] = np.where(loan_data_defaults['recovery_rate'] == 0, 0, 1)
loan_data_defaults['recovery_rate_0_1'].head()
# We create a new variable which is 0 if recovery rate is 0 and 1 otherwise.
loan_data_defaults['recovery_rate_0_1'].tail()
from sklearn.model_selection import train_test_split
# LGD model stage 1 datasets: recovery rate 0 or greater than 0.
lgd_inputs_stage_1_train, lgd_inputs_stage_1_test, lgd_targets_stage_1_train, lgd_targets_stage_1_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['recovery_rate_0_1'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
features_all = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:MORTGAGE',
'home_ownership:NONE',
'home_ownership:OTHER',
'home_ownership:OWN',
'home_ownership:RENT',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:car',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:educational',
'purpose:home_improvement',
'purpose:house',
'purpose:major_purchase',
'purpose:medical',
'purpose:moving',
'purpose:other',
'purpose:renewable_energy',
'purpose:small_business',
'purpose:vacation',
'purpose:wedding',
'initial_list_status:f',
'initial_list_status:w',
'term_int',
'emp_length_int',
'mths_since_issue_d',
'mths_since_earliest_cr_line',
'funded_amnt',
'int_rate',
'installment',
'annual_inc',
'dti',
'delinq_2yrs',
'inq_last_6mths',
'mths_since_last_delinq',
'mths_since_last_record',
'open_acc',
'pub_rec',
'total_acc',
'acc_now_delinq',
'total_rev_hi_lim']
# List of all independent variables for the models.
features_reference_cat = ['grade:G',
'home_ownership:RENT',
'verification_status:Verified',
'purpose:credit_card',
'initial_list_status:f']
# List of the dummy variable reference categories.
lgd_inputs_stage_1_train = lgd_inputs_stage_1_train[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_1_train = lgd_inputs_stage_1_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
lgd_inputs_stage_1_train.isnull().sum()
# Check for missing values. We check whether the value of each row for each column is missing or not,
# then sum across columns.
# P values for sklearn logistic regression.
# Class to display p-values for logistic regression in sklearn.
from sklearn import linear_model
import scipy.stats as stat
class LogisticRegression_with_p_values:
def __init__(self, *args, **kwargs):
self.model = linear_model.LogisticRegression(*args, **kwargs)
def fit(self,X,y):
self.model.fit(X,y)
#### Get p-values for the fitted model ####
denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
denom = np.tile(denom,(X.shape[1],1)).T
F_ij = np.dot((X / denom).T,X) ## Fisher Information Matrix
Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
z_scores = self.model.coef_[0] / sigma_estimates # z-score for each model coefficient
p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores] ### two tailed test for p-values
self.coef_ = self.model.coef_
self.intercept_ = self.model.intercept_
#self.z_scores = z_scores
self.p_values = p_values
#self.sigma_estimates = sigma_estimates
#self.F_ij = F_ij
reg_lgd_st_1 = LogisticRegression_with_p_values()
# We create an instance of an object from the 'LogisticRegression_with_p_values' class.
reg_lgd_st_1.fit(lgd_inputs_stage_1_train, lgd_targets_stage_1_train)
# Estimates the coefficients of the object from the 'LogisticRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
feature_name = lgd_inputs_stage_1_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_lgd_st_1.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LogisticRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg_lgd_st_1.intercept_[0]]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_1.p_values
# We take the result of the newly added method 'p_values' and store it in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add the value 'NaN' in the beginning of the variable with p-values.
summary_table['p_values'] = p_values
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
lgd_inputs_stage_1_test = lgd_inputs_stage_1_test[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_1_test = lgd_inputs_stage_1_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
y_hat_test_lgd_stage_1 = reg_lgd_st_1.model.predict(lgd_inputs_stage_1_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
y_hat_test_lgd_stage_1
y_hat_test_proba_lgd_stage_1 = reg_lgd_st_1.model.predict_proba(lgd_inputs_stage_1_test)
# Calculates the predicted probability values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
y_hat_test_proba_lgd_stage_1
# This is an array of arrays of predicted class probabilities for all classes.
# In this case, the first value of every sub-array is the probability for the observation to belong to the first class, i.e. 0,
# and the second value is the probability for the observation to belong to the second class, i.e. 1.
y_hat_test_proba_lgd_stage_1 = y_hat_test_proba_lgd_stage_1[: ][: , 1]
# Here we take all the arrays in the array, and from each array, we take all rows, and only the element with index 1,
# that is, the second element.
# In other words, we take only the probabilities for being 1.
y_hat_test_proba_lgd_stage_1
lgd_targets_stage_1_test_temp = lgd_targets_stage_1_test
lgd_targets_stage_1_test_temp.reset_index(drop = True, inplace = True)
# We reset the index of a dataframe.
df_actual_predicted_probs = pd.concat([lgd_targets_stage_1_test_temp, pd.DataFrame(y_hat_test_proba_lgd_stage_1)], axis = 1)
# Concatenates two dataframes.
df_actual_predicted_probs.columns = ['lgd_targets_stage_1_test', 'y_hat_test_proba_lgd_stage_1']
df_actual_predicted_probs.index = lgd_inputs_stage_1_test.index
# Makes the index of one dataframe equal to the index of another dataframe.
df_actual_predicted_probs.head()
import itertools
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
normalize=False,
title='CONFUSION MATRIX',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float')
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, fontsize=20)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.3f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "white")
plt.tight_layout()
plt.ylabel('*TRUE LABEL*', fontsize=14)
plt.xlabel('*PREDICTED LABEL*', fontsize=14)
plt.show()
tr = 0.5
# We create a new column with an indicator,
# where every observation that has predicted probability greater than the threshold has a value of 1,
# and every observation that has predicted probability lower than the threshold has a value of 0.
df_actual_predicted_probs['y_hat_test_lgd_stage_1'] = np.where(df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'] > tr, 1, 0)
pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted'])
# Creates a cross-table where the actual values are displayed by rows and the predicted values by columns.
# This table is known as a Confusion Matrix.
cm_lgd_N = confusion_matrix(df_actual_predicted_probs['lgd_targets_stage_1_test'],
df_actual_predicted_probs['y_hat_test_lgd_stage_1'])
classes = ['No Recovery', 'Recovery']
plot_confusion_matrix(cm_lgd_N, classes,
normalize=False,
title='CONFUSION MATRIX - Threshold = 0.5',
cmap=plt.cm.RdYlGn)
pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
# Here we divide each value of the table by the total number of observations,
# thus getting percentages, or, rates.
cm_lgd_pc = pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'],
df_actual_predicted_probs['y_hat_test_lgd_stage_1'],
rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
cm_arr = np.array(cm_lgd_pc)
classes = ['No Recovery', 'Recovery']
plot_confusion_matrix(cm_arr, classes,
normalize=True,
title='CONFUSION MATRIX - Threshold = 0.5',
cmap=plt.cm.RdYlGn)
(pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[0, 0] + (pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[1, 1]
# Here we calculate Accuracy of the model, which is the sum of the diagonal rates.
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['lgd_targets_stage_1_test'],
df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
# Returns the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.
# As a result, we get three arrays: the false positive rates, the true positive rates, and the thresholds.
# we store each of the three arrays in a separate variable.
plt.plot(fpr, tpr)
# We plot the false positive rate along the x-axis and the true positive rate along the y-axis,
# thus plotting the ROC curve.
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('False positive rate')
# We name the x-axis "False positive rate".
plt.ylabel('True positive rate')
# We name the x-axis "True positive rate".
plt.title('ROC curve')
# We name the graph "ROC curve".
AUROC = roc_auc_score(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
# Calculates the Area Under the Receiver Operating Characteristic Curve (AUROC)
# from a set of actual values and their predicted probabilities.
AUROC
import pickle
pickle.dump(reg_lgd_st_1, open('lgd_model_stage_1.sav', 'wb'))
# Here we export our model to a 'SAV' file with file name 'lgd_model_stage_1.sav'.
lgd_stage_2_data = loan_data_defaults[loan_data_defaults['recovery_rate_0_1'] == 1]
# Here we take only rows where the original recovery rate variable is greater than zero,
# i.e. where the indicator variable we created is equal to 1.
# LGD model stage 2 datasets: how much more than 0 is the recovery rate
lgd_inputs_stage_2_train, lgd_inputs_stage_2_test, lgd_targets_stage_2_train, lgd_targets_stage_2_test = train_test_split(lgd_stage_2_data.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), lgd_stage_2_data['recovery_rate'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Since the p-values are obtained through certain statistics, we need the 'stat' module from scipy.stats
import scipy.stats as stat
# Since we are using an object oriented language such as Python, we can simply define our own
# LinearRegression class (the same one from sklearn)
# By typing the code below we will overwrite a part of the class with one that includes p-values
# Here's the full source code of the ORIGINAL class: https://github.com/scikit-learn/scikit-learn/blob/7b136e9/sklearn/linear_model/base.py#L362
class LinearRegression(linear_model.LinearRegression):
"""
LinearRegression class after sklearn's, but calculate t-statistics
and p-values for model coefficients (betas).
Additional attributes available after .fit()
are `t` and `p`, which are of shape (y.shape[1], X.shape[1]),
i.e. (n_targets, n_features).
Here the intercept is fitted by default (fit_intercept=True),
so it does not need to be included in X.
"""
# nothing changes in __init__
def __init__(self, fit_intercept=True, normalize=False, copy_X=True,
n_jobs=1):
self.fit_intercept = fit_intercept
self.normalize = normalize
self.copy_X = copy_X
self.n_jobs = n_jobs
def fit(self, X, y):
self = super(LinearRegression, self).fit(X, y)
# Calculate SSE (sum of squared errors)
# and SE (standard error)
sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
se = np.array([np.sqrt(np.diagonal(sse * np.linalg.inv(np.dot(X.T, X))))])
# compute the t-statistic for each feature
self.t = self.coef_ / se
# find the p-value for each feature
self.p = np.squeeze(2 * (1 - stat.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1])))
return self
lgd_inputs_stage_2_train = lgd_inputs_stage_2_train[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_2_train = lgd_inputs_stage_2_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
reg_lgd_st_2 = LinearRegression()
# We create an instance of an object from the 'LinearRegression' class.
reg_lgd_st_2.fit(lgd_inputs_stage_2_train, lgd_targets_stage_2_train)
# Estimates the coefficients of the object from the 'LinearRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
feature_name = lgd_inputs_stage_2_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_lgd_st_2.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LinearRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg_lgd_st_2.intercept_]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_2.p
# We take the newly added attribute 'p' and store it in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add the value 'NaN' in the beginning of the variable with p-values.
summary_table['p_values'] = p_values.round(3)
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
lgd_inputs_stage_2_test = lgd_inputs_stage_2_test[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_2_test = lgd_inputs_stage_2_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
lgd_inputs_stage_2_test.columns.values
y_hat_test_lgd_stage_2 = reg_lgd_st_2.predict(lgd_inputs_stage_2_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
lgd_targets_stage_2_test_temp = lgd_targets_stage_2_test
lgd_targets_stage_2_test_temp = lgd_targets_stage_2_test_temp.reset_index(drop = True)
# We reset the index of a dataframe.
pd.concat([lgd_targets_stage_2_test_temp, pd.DataFrame(y_hat_test_lgd_stage_2)], axis = 1).corr()
# We calculate the correlation between actual and predicted values.
corr_mat = pd.concat([lgd_targets_stage_2_test_temp, pd.DataFrame(y_hat_test_lgd_stage_2)], axis = 1).corr()
corr_arr = np.array(corr_mat)
classes = [' ', ' ']
plot_confusion_matrix(corr_arr, classes,
normalize=True,
title='CORRELATION MATRIX - Act Vs Pred Recov Rates',
cmap=plt.cm.RdYlGn)
sns.distplot(lgd_targets_stage_2_test - y_hat_test_lgd_stage_2)
# We plot the distribution of the residuals.
pickle.dump(reg_lgd_st_2, open('lgd_model_stage_2.sav', 'wb'))
# Here we export our model to a 'SAV' file with file name 'lgd_model_stage_2.sav'.
y_hat_test_lgd_stage_2_all = reg_lgd_st_2.predict(lgd_inputs_stage_1_test)
y_hat_test_lgd_stage_2_all
y_hat_test_lgd = y_hat_test_lgd_stage_1 * y_hat_test_lgd_stage_2_all
# Here we combine the predictions of the models from the two stages.
pd.DataFrame(y_hat_test_lgd).describe()
# Shows some descriptive statistics for the values of a column.
pd.DataFrame(y_hat_test_lgd).sum()/pd.DataFrame(y_hat_test_lgd).count()
y_hat_test_lgd = np.where(y_hat_test_lgd < 0, 0, y_hat_test_lgd)
y_hat_test_lgd = np.where(y_hat_test_lgd > 1, 1, y_hat_test_lgd)
# We set predicted values that are greater than 1 to 1 and predicted values that are less than 0 to 0.
pd.DataFrame(y_hat_test_lgd).describe()
# Shows some descriptive statistics for the values of a column.
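As a hedged toy illustration of the two-stage combination above (the values are made up): if the stage-1 classifier predicts that some recovery occurs and the stage-2 regression predicts a recovery rate of 0.35, the combined estimate is 0.35; if stage 1 predicts no recovery, the combined estimate is 0 regardless of stage 2.
toy_stage_1 = np.array([1, 0, 1])          # 1 = some recovery predicted, 0 = no recovery predicted
toy_stage_2 = np.array([0.35, 0.50, 0.10]) # hypothetical stage-2 recovery rate predictions
toy_recovery = toy_stage_1 * toy_stage_2   # combined recovery rate estimates: [0.35, 0.0, 0.10]
toy_lgd = 1 - toy_recovery                 # LGD = 1 - recovery rate
print(toy_lgd)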
# EAD model datasets
ead_inputs_train, ead_inputs_test, ead_targets_train, ead_targets_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['CCF'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
ead_inputs_train.columns.values
ead_inputs_train = ead_inputs_train[features_all]
# Here we keep only the variables we need for the model.
ead_inputs_train = ead_inputs_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
reg_ead = LinearRegression()
# We create an instance of an object from the 'LinearRegression' class.
reg_ead.fit(ead_inputs_train, ead_targets_train)
# Estimates the coefficients of the object from the 'LinearRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
feature_name = ead_inputs_train.columns.values
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_ead.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LinearRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg_ead.intercept_]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_ead.p
# We take the newly added attribute 'p' of the EAD model and store it in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add the value 'NaN' in the beginning of the variable with p-values.
summary_table['p_values'] = p_values
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
ead_inputs_test = ead_inputs_test[features_all]
# Here we keep only the variables we need for the model.
ead_inputs_test = ead_inputs_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
ead_inputs_test.columns.values
y_hat_test_ead = reg_ead.predict(ead_inputs_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
ead_targets_test_temp = ead_targets_test
ead_targets_test_temp = ead_targets_test_temp.reset_index(drop = True)
# We reset the index of a dataframe.
pd.concat([ead_targets_test_temp, pd.DataFrame(y_hat_test_ead)], axis = 1).corr()
# We calculate the correlation between actual and predicted values.
corr_mat_2 = pd.concat([ead_targets_test_temp, pd.DataFrame(y_hat_test_ead)], axis = 1).corr()
corr_arr_2 = np.array(corr_mat_2)
classes = [' ', ' ']
plot_confusion_matrix(corr_arr_2, classes,
normalize=True,
title='CORRELATION MATRIX - Act Vs Pred EAD',
cmap=plt.cm.RdYlGn)
sns.distplot(ead_targets_test - y_hat_test_ead)
# We plot the distribution of the residuals.
(ead_targets_test - y_hat_test_ead).mean()
pd.DataFrame(y_hat_test_ead).describe()
# Shows some descriptive statistics for the values of a column.
y_hat_test_ead = np.where(y_hat_test_ead < 0, 0, y_hat_test_ead)
y_hat_test_ead = np.where(y_hat_test_ead > 1, 1, y_hat_test_ead)
# We set predicted values that are greater than 1 to 1 and predicted values that are less than 0 to 0.
pd.DataFrame(y_hat_test_ead).describe()
# Shows some descriptive statistics for the values of a column.
pd.DataFrame(y_hat_test_ead).sum()/pd.DataFrame(y_hat_test_ead).count()
loan_data_preprocessed.head()
loan_data_preprocessed['mths_since_last_delinq'].fillna(0, inplace = True)
# We fill the missing values with zeroes.
loan_data_preprocessed['mths_since_last_record'].fillna(0, inplace = True)
# We fill the missing values with zeroes.
loan_data_preprocessed_lgd_ead = loan_data_preprocessed[features_all]
# Here we keep only the variables we need for the model.
loan_data_preprocessed_lgd_ead = loan_data_preprocessed_lgd_ead.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
loan_data_preprocessed['recovery_rate_st_1'] = reg_lgd_st_1.model.predict(loan_data_preprocessed_lgd_ead)
# We apply the stage 1 LGD model and calculate predicted values.
loan_data_preprocessed['recovery_rate_st_2'] = reg_lgd_st_2.predict(loan_data_preprocessed_lgd_ead)
# We apply the stage 2 LGD model and calculate predicted values.
loan_data_preprocessed['recovery_rate'] = loan_data_preprocessed['recovery_rate_st_1'] * loan_data_preprocessed['recovery_rate_st_2']
# We combine the predicted values from the stage 1 predicted model and the stage 2 predicted model
# to calculate the final estimated recovery rate.
loan_data_preprocessed['recovery_rate'] = np.where(loan_data_preprocessed['recovery_rate'] < 0, 0, loan_data_preprocessed['recovery_rate'])
loan_data_preprocessed['recovery_rate'] = np.where(loan_data_preprocessed['recovery_rate'] > 1, 1, loan_data_preprocessed['recovery_rate'])
# We set estimated recovery rates that are greater than 1 to 1 and estimated recovery rates that are less than 0 to 0.
loan_data_preprocessed['LGD'] = 1 - loan_data_preprocessed['recovery_rate']
# We calculate estimated LGD. Estimated LGD equals 1 - estimated recovery rate.
loan_data_preprocessed['LGD'].describe()
# Shows some descriptive statistics for the values of a column.
loan_data_preprocessed['CCF'] = reg_ead.predict(loan_data_preprocessed_lgd_ead)
# We apply the EAD model to calculate estimated credit conversion factor.
loan_data_preprocessed['CCF'] = np.where(loan_data_preprocessed['CCF'] < 0, 0, loan_data_preprocessed['CCF'])
loan_data_preprocessed['CCF'] = np.where(loan_data_preprocessed['CCF'] > 1, 1, loan_data_preprocessed['CCF'])
# We set estimated CCF that are greater than 1 to 1 and estimated CCF that are less than 0 to 0.
loan_data_preprocessed['EAD'] = loan_data_preprocessed['CCF'] * loan_data_preprocessed_lgd_ead['funded_amnt']
# We calculate estimated EAD. Estimated EAD equals estimated CCF multiplied by funded amount.
loan_data_preprocessed['EAD'].describe()
# Shows some descriptive statistics for the values of a column.
loan_data_preprocessed.head()
loan_data_inputs_train = pd.read_csv('cr_inp_train.csv', index_col = 0)
# We import data to apply the PD model.
loan_data_inputs_test = pd.read_csv('cr_inp_test.csv', index_col = 0)
# We import data to apply the PD model.
loan_data_inputs_pd = pd.concat([loan_data_inputs_train, loan_data_inputs_test], axis = 0)
# We concatenate the two dataframes along the rows.
loan_data_inputs_pd.shape
loan_data_inputs_pd.head()
features_all_pd = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86']
ref_categories_pd = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
loan_data_inputs_pd_temp = loan_data_inputs_pd[features_all_pd]
# Here we keep only the variables we need for the model.
loan_data_inputs_pd_temp = loan_data_inputs_pd_temp.drop(ref_categories_pd, axis = 1)
# Here we remove the dummy variable reference categories.
loan_data_inputs_pd_temp.shape
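# The PD model was estimated with one dummy per original variable left out as the reference category, so the
# same dummies are dropped here; keeping them would duplicate information already captured by the intercept
# and misalign the columns with the model's coefficients. A quick consistency check (a sketch, assuming the
# feature order above matches the one used when the model was trained):
loan_data_inputs_pd_temp.shape[1] == len(features_all_pd) - len(ref_categories_pd)
# Should evaluate to True: the number of model inputs equals all dummies minus the reference categories.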
import pickle
reg_pd = pickle.load(open('pd_model.sav', 'rb'))
# We import the PD model, stored in the 'pd_model.sav' file.
reg_pd.model.predict_proba(loan_data_inputs_pd_temp)[:, 0]
# We apply the PD model to calculate estimated default probabilities.
loan_data_inputs_pd['PD'] = reg_pd.model.predict_proba(loan_data_inputs_pd_temp)[:, 0]
# We store the estimated default probabilities in a new 'PD' column.
loan_data_inputs_pd['PD'].head()
loan_data_inputs_pd['PD'].describe()
# Shows some descriptive statistics for the values of a column.
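# predict_proba returns one column of probabilities per class, ordered as in the estimator's classes_
# attribute, and the code above takes column 0 as the probability of default. That assumes the target was
# coded so that the first class is the default class (for example 0 = default, 1 = good). A quick check,
# assuming reg_pd.model wraps a scikit-learn classifier:
reg_pd.model.classes_
# The probability stored in 'PD' corresponds to the first label listed here.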
loan_data_preprocessed_new = pd.concat([loan_data_preprocessed, loan_data_inputs_pd], axis = 1)
# We concatenate, along the columns, the dataframe where we calculated LGD and EAD with the dataframe where we calculated PD.
loan_data_preprocessed_new.shape
loan_data_preprocessed_new.head()
loan_data_preprocessed_new['EL'] = loan_data_preprocessed_new['PD'] * loan_data_preprocessed_new['LGD'] * loan_data_preprocessed_new['EAD']
# We calculate Expected Loss. EL = PD * LGD * EAD.
loan_data_preprocessed_new['EL'].describe()
# Shows some descriptive statistics for the values of a column.
Output = loan_data_preprocessed_new[['funded_amnt', 'PD', 'LGD', 'EAD', 'EL']]
Output = Output.loc[:,~Output.columns.duplicated()]
Output.head()
EAD_LGD = Output['EAD'] * Output['LGD']
# Loss severity per loan: exposure at default multiplied by loss given default.
EAD_LGD.sum()
Weight = EAD_LGD / EAD_LGD.sum()
# Each loan's share of total loss severity, used to weight its PD.
Output['Weight'] = Weight
Wtd_PD = Output['Weight'] * Output['PD']
Output['Wtd_PD'] = Wtd_PD
Output.head()
Output['Wtd_PD'].sum()
# The severity-weighted average probability of default for the portfolio.
EAD_LGD.sum() * Output['Wtd_PD'].sum()
# Portfolio expected credit loss assuming defaults are independent.
Output['EAD'].sum()
EAD_LGD.sum() / Output['EAD'].sum() * Output['EAD'].sum() * Output['Wtd_PD'].sum()
# The same figure expressed as average LGD times total EAD times the weighted average PD.
EAD_LGD.sum() / Output['EAD'].sum()
# Exposure-weighted average LGD for the portfolio.
RR = 1 - (EAD_LGD.sum() / Output['EAD'].sum())
# Expected recovery rate: one minus the exposure-weighted average LGD.
RR
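# To summarise the logic above: total loss severity is the sum of EAD times LGD; each loan's PD is weighted by
# its share of that severity; ECL assuming independence is total severity times the weighted average PD; and
# the recovery rate is one minus the exposure-weighted average LGD. A minimal sketch (a hypothetical helper,
# not part of the original notebook) that packages these portfolio figures:
def portfolio_summary(df):
    # df is assumed to carry 'PD', 'LGD' and 'EAD' columns, as in Output above.
    ead_lgd = df['EAD'] * df['LGD']
    weighted_pd = (ead_lgd / ead_lgd.sum() * df['PD']).sum()
    avg_lgd = ead_lgd.sum() / df['EAD'].sum()
    return {'weighted_avg_pd': weighted_pd,
            'avg_lgd': avg_lgd,
            'recovery_rate': 1 - avg_lgd,
            'ecl_independence': ead_lgd.sum() * weighted_pd}
portfolio_summary(Output)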
import math
Ave_PD = Output['Wtd_PD'].sum()
# Severity-weighted average portfolio probability of default.
Ass_corr_1 = 0.04 * (1 - math.exp(-50 * Ave_PD)) / (1 - math.exp(-50))
Ass_corr_2 = 0.0039 * (1 - (1 - math.exp(-50 * Ave_PD)) / (1 - math.exp(-50)))
Ass_corr = Ass_corr_1 + Ass_corr_2
# Asset correlation as a PD-dependent blend of the two components.
Ass_corr
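# The two terms above make the asset correlation a PD-dependent blend: it approaches 0.0039 as the average PD
# tends to zero and 0.04 as the average PD grows, using the same exponential weighting shape as the Basel IRB
# correlation formulas (the coefficients here are this portfolio's own calibration, not the Basel parameters).
# A hypothetical generalised helper, shown only as a sketch:
def asset_correlation(avg_pd, rho_high = 0.04, rho_low = 0.0039, k = 50):
    w = (1 - math.exp(-k * avg_pd)) / (1 - math.exp(-k))
    # w rises from 0 towards 1 as the average PD increases.
    return rho_high * w + rho_low * (1 - w)
asset_correlation(Ave_PD)
# Returns the same figure as Ass_corr above.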
from scipy.stats import norm
Ave_PD
Eco_Scen = 0.70
# Percentile of the assumed economic scenario for the systematic factor.
a = norm.ppf(Ave_PD)
# Inverse standard normal of the average PD (the default threshold).
b = norm.ppf(Eco_Scen)
# Inverse standard normal of the scenario percentile.
Cor_Coef = Ass_corr
c = (a + math.sqrt(Cor_Coef) * b) / math.sqrt(1 - Cor_Coef)
PD_Corr = norm.cdf(c, loc = 0, scale = 1)
# Portfolio probability of default conditional on the economic scenario, given the asset correlation.
PD_Corr
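# The calculation above is the Vasicek single-factor formula for the conditional default probability:
# PD_scenario = N((N^-1(PD) + sqrt(rho) * N^-1(scenario)) / sqrt(1 - rho)), where the scenario percentile
# stands in for the systematic economic factor. A hypothetical helper to evaluate it for any scenario,
# shown only as a sketch:
def conditional_pd(avg_pd, rho, scenario_quantile):
    numerator = norm.ppf(avg_pd) + math.sqrt(rho) * norm.ppf(scenario_quantile)
    return norm.cdf(numerator / math.sqrt(1 - rho))
conditional_pd(Ave_PD, Ass_corr, Eco_Scen)
# Returns the same figure as PD_Corr above.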
title = "CREDIT CARD RECEIVABLES PORTFOLIO - RISK SUMMARY"
bolded_title = "\033[34;1;4m" + title + "\033[0m"
formatted_PD = "{:.4f}".format(Output['Wtd_PD'].sum())
formatted_EAD = "{:,.0f}".format(Output['EAD'].sum())
formatted_RR = "{:.4f}".format(1- (EAD_LGD.sum()/Output['EAD'].sum()))
formatted_ECL= "{:,.0f}".format(Output['EL'].sum())
formatted_ECL_pc = "{:.4f}".format(Output['EL'].sum() / Output['funded_amnt'].sum())
formatted_FA= "{:,.0f}".format(Output['funded_amnt'].sum())
formatted_AC= "{:,.4f}".format(Ass_corr)
formatted_Eco= "{:,.3f}".format(Eco_Scen)
formatted_PD_Corr = "{:,.4f}".format(PD_Corr)
formatted_ECL_Corr = "{:,.0f}".format(EAD_LGD.sum() / Output['EAD'].sum() * Output['EAD'].sum() * PD_Corr)
# ECL assuming correlation: average LGD times total EAD times the scenario-conditional PD.
formatted_ECL_Corr_pc = "{:.4f}".format(EAD_LGD.sum() / Output['EAD'].sum() * Output['EAD'].sum() * PD_Corr / Output['funded_amnt'].sum())
print(" ")
print(" ", bolded_title)
print(" ")
print("------------------------------------------------------------------")
print("Current Funded Amount : " , formatted_FA)
print("------------------------------------------------------------------")
print("Weighted Average Probability of Default : " , formatted_PD)
print("Expected Exposure at Default : " , formatted_EAD)
print("Expected Recovery Rate : " , formatted_RR)
print("\033[1mExpected Credit Loss Assuming Independence\033[0m : " , "\033[1m" + formatted_ECL + "\033[0m")
print("\033[1mECL Assuming Independence ÷ Funded Amount\033[0m : " , "\033[1m" + formatted_ECL_pc + "\033[0m")
print("------------------------------------------------------------------")
print("Asset Correlation : " , formatted_AC)
print("Economic Scenario : " , formatted_Eco)
print("Portfolio Prob. Default Assuming Correlation : " , formatted_PD_Corr)
print("\033[1mExpected Credit Loss Assuming Correlation\033[0m : " , "\033[1m" + formatted_ECL_Corr + "\033[0m")
print("\033[1mECL Assuming Correlation ÷ Funded Amount\033[0m : " , "\033[1m" + formatted_ECL_Corr_pc + "\033[0m")
print("__________________________________________________________________")
print(" ")