IFRS Technical Note

Design and Implementation of an Expected Credit Loss (ECL) Model for a Portfolio of Credit Card Receivables under IFRS 9: Financial Instruments

Paul McAteer, MSc, MBA

pcm353@stern.nyu.edu


1. Summary of the Guidance on Valuation Principles in the Standard


IFRS 9.5.5 introduces an impairment model for financial assets based on Expected Credit Losses (ECL), which requires entities to recognize a loss allowance prior to loss materialization, utilizing forward-looking and historical information. IFRS 9.5.5.1 stipulates that an entity “shall recognise a loss allowance for expected credit losses on a financial asset that is measured in accordance with paragraphs 4.1.2”, that is, financial assets “measured at amortised cost” held “to collect contractual cash flows” and whose “contractual terms of the financial asset give rise on specified dates to cash flows that are solely payments of principal and interest on the principal amount outstanding”. Referring to the impairment model’s input data, IFRS 9.5.5.4 expects entities to consider “all reasonable and supportable information, including that which is forward-looking”.

The ECL for trade receivables that contain a "significant financing component"1 under IFRS 15, such as credit card receivables, can be measured under the "Simplified Approach". In contrast with the "General Approach", the Simplified Approach allows entities to recognise lifetime expected losses on all such assets without the need to identify significant increases in credit risk. In any case, because the maturities will typically be 12 months or less, the 12-month and lifetime ECLs would be the same. IFRS 9.5.5.15 states that "an entity shall always measure the loss allowance at an amount equal to lifetime expected credit losses for...trade receivables or contract assets that result from transactions that are within the scope of IFRS 15, and that…contain a significant financing component in accordance with IFRS 15, if the entity chooses as its accounting policy to measure the loss allowance at an amount equal to lifetime expected credit losses."

Lifetime expected credit loss is the discounted value of the expected credit losses that result from all possible default events over the expected life of a financial instrument. IFRS 9.5.5.17 clarifies that "An entity shall measure expected credit losses of a financial instrument in a way that reflects: (a) an unbiased and probability-weighted amount that is determined by evaluating a range of possible outcomes; (b) the time value of money." The term 'default' is not defined in IFRS 9. IFRS 9:B5.5.37 states that a definition of default should be "consistent with the definition used for internal credit risk management purposes". Entities will need to consider the requirements of this paragraph where it states there is a "rebuttable presumption that default does not occur later than when a financial asset is 90 days past due unless an entity has reasonable and supportable information to demonstrate that a more lagging default criterion is more appropriate".

IFRS 9.5.5.19 indicates that the maximum expected life is generally understood as the contractual life: "The maximum period to consider when measuring expected credit losses is the maximum contractual period (including extension options) over which the entity is exposed to credit risk and not a longer period". The expected period of exposure is more subjective. IFRS 9:B5.5.40 states that when determining expected life "an entity should consider factors such as historical information and experience about: (a) the period over which the entity was exposed to credit risk on similar financial instruments; (b) the length of time for related defaults to occur on similar financial instruments following a significant increase in credit risk; and (c) the credit risk management actions that an entity expects to take once the credit risk on the financial instrument has increased, such as the reduction or removal of undrawn limits."

With specific reference to revolving credit facilities, IFRS 9:B5.5.39 prevails upon the entity to apply discretionary judgement regarding the time horizon of credit exposure. Where financial instruments include both a loan and an undrawn commitment component (such as credit cards and overdraft facilities), the contractual ability to demand repayment and cancel the undrawn commitment does not necessarily limit the exposure to credit losses beyond the contractual period. For those financial instruments, management should measure ECL over the period that the entity is exposed to credit risk and ECL would not be mitigated by credit risk management actions, even if that period extends beyond the maximum contractual period. In the Illustrative Examples, IFRS 9:IE60 provides further guidance on which factors should be taken into consideration when determining the size and time horizon of credit exposure: "At the reporting date the outstanding balance on the credit card portfolio is CU60,000 and the available undrawn facility is CU40,000. Bank A determines the expected life of the portfolio by estimating the period over which it expects to be exposed to credit risk on the facilities at the reporting date, taking into account: (a) the period over which it was exposed to credit risk on a similar portfolio of credit cards; (b) the length of time for related defaults to occur on similar financial instruments; and (c) past events that led to credit risk management actions because of an increase in credit risk on similar financial instruments, such as the reduction or removal of undrawn credit limits."


1 A significant financing component exists if the timing of payments agreed to by the parties to the contract (either explicitly or implicitly) provides the customer or the entity with a significant benefit of financing the transfer of goods or services to the customer. [IFRS 15:60]

2. Interpretation of the Guidance


- The ECL model should produce an unbiased and probability-weighted amount to be presented as an impairment to the book value of the financial assets in the balance sheet.

- This unbiased and probability-weighted amount is the difference between the present value of cash flows due under contract and the present value of cash flows that the entity expects to receive.

- The Expected Credit Loss is determined by the probability of default, the size of the exposure to defaulting customers, the expected recoverable amount in the event of default and the discount rate applied.

- The estimated size of the exposure is necessarily related to expectations of the customers' drawdown of the undrawn commitment component over a defined time frame. The time frame will be governed by subjective evaluations focusing on how long it will take the entity to identify and take remedial action in relation to problem credit.

- The Lifetime Expected Credit Losses will have to incorporate the term structure of the default probability of the assets. In other words, the hazard rate or default intensity, which connotes an instantaneous rate of failure, should be used along with the exponential distribution to compute the cumulative probability of default for a given time horizon.

- The entity should apply a granular and dynamic approach for portfolio segmentation by grouping financial assets based on shared credit characteristics.

- As with all such forward-looking models, expected loss should be considered at an aggregate portfolio level, which generally involves incorporating some expectation of the effect of correlation between the constituent assets.

3. Model Design


The future value of the Lifetime Expected Credit Loss of portfolio Π at future time T is defined as a function of the probability of default (PD), the expected exposure at the time of default (EAD) and the size of the expected loss in the event of default (LGD). The present value of this future ECL is obtained by discounting it at the Effective Interest Rate of the portfolio assets (EIR). Thus:

$$ECL_{\Pi,T} = PD_{\Pi,T} \times EAD_{\Pi,T} \times LGD_{\Pi,T} \times \frac{1}{(1+EIR)^{T}}$$
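As a minimal numerical sketch, the formula translates directly into code; the figures below are hypothetical placeholders, not model outputs:

pd_T = 0.08      # portfolio probability of default over lifetime T (hypothetical)
ead_T = 95_000   # expected exposure at default (hypothetical)
lgd_T = 0.65     # loss given default (hypothetical)
eir = 0.18       # effective interest rate used for discounting (hypothetical)
T = 1.5          # weighted average portfolio lifetime in years (hypothetical)

ecl_pv = pd_T * ead_T * lgd_T * (1 / (1 + eir) ** T)
print(f"Present value of lifetime ECL: {ecl_pv:,.2f}")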

3.1. Probability of Default


PD(t,t+dt) is the hazard rate or default intensity. More precisely, it is the (instantaneous) probability of default, (λ), over an infinitesimally small time interval (dt):

$$PD(t, t+dt) = \lambda \, dt$$

The estimation of the default probabilities of each credit portfolio constituent i is achieved with the logit model, which employs the logistic transformation to generate a sigmoid function bounded by 0 and 1:

$$PD_{i,(t,t+dt)} = 1 - P_i(\text{loan status} = 1 = \text{No Default}) = 1 - \frac{1}{1 + e^{-Y_i}}$$

Where Y is a linear regression function of the form:

$$Y_i = B_0 + B_1 X_{1(i)} + \dots + B_n X_{n(i)}$$

Where $B_n$ are parameters that are estimated statistically and $X_{n(i)}$ are scores, ratios and other explanatory variables for obligor i, transformed into binary "dummy" variables. $\overline{PD}_{\Pi,T}$ is the average cumulative probability of default of the portfolio over $(0,T)$, that is, the output of the cumulative default time distribution $F(t) = 1 - e^{-\lambda t}$ at time horizon T, where T denotes the weighted average lifetime of the credit portfolio:

$$\overline{PD}_{\Pi,T} = 1 - e^{-\lambda T}$$
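As a brief sketch of this relationship, a one-period PD (here, a hypothetical 12-month logit output) can be converted into a constant hazard rate and extended to the portfolio lifetime:

import numpy as np

pd_12m = 0.05                  # hypothetical 12-month PD from the logit model
lam = -np.log(1 - pd_12m)      # implied constant hazard rate: solves 1 - exp(-lam * 1) = pd_12m
T = 1.5                        # hypothetical weighted average lifetime in years

pd_cum = 1 - np.exp(-lam * T)  # F(T) = 1 - exp(-lambda * T)
print(f"Hazard rate: {lam:.4f}, cumulative PD over {T} years: {pd_cum:.4f}")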

The Vasicek Model offers an elegant solution allowing the computation of a portfolio default rate, $\widehat{PD}_{\Pi,T}$, which integrates the impact of (negative) assumptions about future economic conditions and the effect of the correlation between the portfolio assets. The model takes three inputs:

- The weighted average standalone probability of default, denoted by $\overline{PD}_{\Pi,T}$;
- The average correlation of portfolio assets with the broader economy, denoted by ρ;
- A common systematic economic factor (such as GDP growth, general levels of credit quality etc.), denoted by $\tilde{e}_M$.

The default rate for an asymptotic portfolio, having estimated the average default probability, the default correlation parameter and the common market factor, is given by:
$$\widehat{PD}_{\Pi,T} = \Phi\left[\frac{\Phi^{-1}\left(\overline{PD}_{\Pi,T}\right) - \sqrt{\rho}\,\tilde{e}_M}{\sqrt{1-\rho}}\right]$$

$\tilde{e}_M$ is a standard normal variable, $\tilde{e}_M \sim N(0,1)$, representing the assumed severity of the economic downturn. The higher the probability of default, the greater the correlation coefficient and the larger the assumed market downturn, the smaller the distance from default and the higher the associated default rate for the portfolio. It may make more intuitive sense if the $\tilde{e}_M$ variable is restated in terms of the inverse of the standard normal cumulative distribution and a probability input $x$ ranging from 0.5 to 0.999, where the higher the input value, the more severe the assumed economic downturn. This results in:
$$\widehat{PD}_{\Pi,T} = \Phi\left[\frac{\Phi^{-1}\left(\overline{PD}_{\Pi,T}\right) + \sqrt{\rho}\,\Phi^{-1}(x)}{\sqrt{1-\rho}}\right]$$

The correlation coefficient, ρ, can be obtained by adapting the Basel II IRB risk-weight formula for corporate exposures, which is based on the Vasicek model and which prescribes that correlations are bounded by upper and lower limits and are a function of the weighted average probability of default. For credit card default correlations, we employ the empirical study of Crook and Bellotti2 to set the lower bound at 0.396% and the upper bound at 4%, and assume that correlation is an increasing function of the default probability:
$$\rho = 0.396\% \left(1 - \frac{1 - e^{-kp}}{1 - e^{-k}}\right) + 4\% \left(\frac{1 - e^{-kp}}{1 - e^{-k}}\right)$$
Where the parameter k, which controls the exponential decline, is set to 50 as under the Basel regulations, and $p = \overline{PD}_{\Pi,T}$.
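The two formulas above can be sketched in Python using scipy.stats.norm for Φ and Φ⁻¹. The average PD and the scenario severity x below are hypothetical inputs; the correlation bounds follow the prose above:

import numpy as np
from scipy.stats import norm

def correlation(p, k=50, rho_min=0.00396, rho_max=0.04):
    # Correlation as an increasing function of the average PD,
    # bounded by the Crook & Bellotti limits; k = 50 as under Basel.
    w = (1 - np.exp(-k * p)) / (1 - np.exp(-k))
    return rho_min * (1 - w) + rho_max * w

def vasicek_pd(pd_avg, rho, x):
    # Portfolio default rate under a downturn scenario of severity x.
    return norm.cdf((norm.ppf(pd_avg) + np.sqrt(rho) * norm.ppf(x)) / np.sqrt(1 - rho))

pd_avg = 0.06                    # hypothetical average cumulative PD
rho = correlation(pd_avg)
print(vasicek_pd(pd_avg, rho, x=0.95))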


3.2. Loss Given Default


A "Two-stage" LGD model is implemented. The "Stage 1" model is a classification model to predict whether the loan will have a recovery rate (RR) greater than zero. The "Stage 2" model a regression-type model to predict the value of the recovered amount of when the recovery rate is expected to be positive. The predicted recovery is the expected value of the two combined models, that is, the product of a binary value representing the event of recovery and the expected recovery value. So, for obligor i, predicted RR will be either:

RR¯i=[P(RR>0)i=1]Y^i(RR>0)

Or:
RR¯i=[P(RR>0)i=0]

Where $\hat{Y}$ is the predicted amount of positive RR obtained from a multivariate linear regression, $P(RR>0)$ is the probability of a positive RR obtained from a multivariate logistic regression assuming some threshold, and $\overline{RR}$ is the obligor-specific recovery rate.

LGD is therefore:
$$\overline{LGD}_i = 1 - \overline{RR}_i$$
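A minimal sketch of the two-stage combination with scikit-learn follows. The data here are synthetic stand-ins (a production model would be fitted on observed recoveries of defaulted accounts), and the 0.5 classification threshold is an assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Synthetic stand-in data: X holds obligor features, rr observed recovery rates.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
rr = np.clip(rng.normal(0.3, 0.2, size=1000), 0, 1)
rr[rng.random(1000) < 0.4] = 0.0                            # many defaults recover nothing

stage1 = LogisticRegression().fit(X, (rr > 0).astype(int))  # Stage 1: recovery event
stage2 = LinearRegression().fit(X[rr > 0], rr[rr > 0])      # Stage 2: recovery amount

X_new = rng.normal(size=(10, 5))
recovers = stage1.predict_proba(X_new)[:, 1] > 0.5          # assumed threshold
rr_hat = np.where(recovers, stage2.predict(X_new), 0.0)     # combined expectation
lgd_hat = 1 - rr_hat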

3.3. Expected Exposure at Default


For credit card portfolios, EAD estimation is bedevilled by the revolving nature of the credit line, which poses challenges to predicting the exposure at the time of default. Additional borrowings in the period prior to default mean that taking the current balance for non-defaulted customers does not produce a sufficiently conservative estimate of the amount drawn by the time of default. One solution is to use historic data to derive a Credit Conversion Factor (CCF), which is the proportion of the current undrawn amount that will likely be drawn down by the time of default. The dependent variable in the regression analysis will be:

$$\frac{FundedAmount_{Defaulted\,Loan} - DrawnAmount_{Defaulted\,Loan}}{FundedAmount_{Defaulted\,Loan}}$$

So, for obligor i, predicted EAD will be:

$$\overline{EAD}_i = CurrentDrawnAmount_i + \left(CCF_i \times CurrentUndrawnAmount_i\right)$$


Where CCF is the obligor-specific CCF multiplier obtained by applying the multivariate linear regression function to the obligor's data.
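For a single obligor, the calculation is then straightforward; the balances and the regression-derived CCF below are hypothetical:

current_drawn = 6_000     # current outstanding balance (hypothetical)
current_undrawn = 4_000   # available undrawn facility (hypothetical)
ccf = 0.55                # CCF predicted by the regression model (hypothetical)

ead = current_drawn + ccf * current_undrawn   # 6,000 + 0.55 * 4,000 = 8,200
print(ead)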



2 J. Crook & T. Bellotti (2012), "Asset correlations for credit card defaults", Applied Financial Economics, 22:2, 87–95.

4. PD Model Inputs

4.1. Preliminary Data Exploration and Preprocessing


To avoid any suggestion of the selective usage of raw data and the gaming of model results, the procedure for treating raw data should be transparent and rigorous. For example:

  1. Retrieve raw data into dataframe
  2. Convert string values to integers where necessary
  3. Convert string points in time to numeric periods of time where necessary
  4. Transform all discrete variables into dummy variables and concatenate in single dataframe
  5. Incorporate new dummy variables into master dataframe
  6. Replace missing values with appropriate alternative value or remove from dataset
  7. Search for errors/anomalies/outliers in the dataset. Remove or replace
In [1]:
import numpy as np
import pandas as pd

# 1) Retrieve loan data into dataframe
loan_data = pd.read_csv('loan_data_2007_2014.csv', low_memory=False) # low_memory=False avoids mixed-dtype warnings

# 2) Convert string values to integers where necessary. First removing text...
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace(r'\+ years', '', regex=True)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a',  str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')
#...then converting string datatype to numeric datatype
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])

# 2) Convert string values to integers where necessary, replacing text with empty space
loan_data['term_int'] = pd.to_numeric(loan_data['term'].str.replace(' months', ''))


# 3) Convert string points in time to numeric periods of time where necessary.First converting to datetime format...
loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format = '%b-%y')
#...then converting to a new passage of time variable
loan_data['mths_since_earliest_cr_line'] = round(pd.to_numeric((pd.to_datetime('2017-12-01') 
                                                                - loan_data['earliest_cr_line_date']) 
                                                               / np.timedelta64(1, 'M')))


loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format = '%b-%y')
loan_data['mths_since_issue_d'] = round(pd.to_numeric((pd.to_datetime('2017-12-01') 
                                                       - loan_data['issue_d_date']) 
                                                      / np.timedelta64(1, 'M')))


# 4) Transform all discrete variables into dummy variables and concatenate in single dataframe

loan_data_dummies = [pd.get_dummies(loan_data['grade'], prefix = 'grade', prefix_sep = ':'),
                     pd.get_dummies(loan_data['sub_grade'], prefix = 'sub_grade', prefix_sep = ':'),
                     pd.get_dummies(loan_data['home_ownership'], prefix = 'home_ownership', prefix_sep = ':'),
                     pd.get_dummies(loan_data['verification_status'], prefix = 'verification_status', prefix_sep = ':'),
                     pd.get_dummies(loan_data['loan_status'], prefix = 'loan_status', prefix_sep = ':'),
                     pd.get_dummies(loan_data['purpose'], prefix = 'purpose', prefix_sep = ':'),
                     pd.get_dummies(loan_data['addr_state'], prefix = 'addr_state', prefix_sep = ':'),
                     pd.get_dummies(loan_data['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ':')]

loan_data_dummies = pd.concat(loan_data_dummies, axis = 1)

# 5) Incorporate new dummy variables into master dataframe

loan_data = pd.concat([loan_data, loan_data_dummies], axis = 1)

# 6) Replace missing values with appropriate alternative value or remove from dataset

loan_data['total_rev_hi_lim'].fillna(loan_data['funded_amnt'], inplace=True) # other variable
loan_data['annual_inc'].fillna(loan_data['annual_inc'].mean(), inplace=True) # mean value
loan_data['mths_since_earliest_cr_line'].fillna(0, inplace=True) # zero value
loan_data['acc_now_delinq'].fillna(0, inplace=True) # zero value
loan_data['total_acc'].fillna(0, inplace=True) # zero value
loan_data['pub_rec'].fillna(0, inplace=True) # zero value
loan_data['open_acc'].fillna(0, inplace=True) # zero value
loan_data['inq_last_6mths'].fillna(0, inplace=True) # zero value
loan_data['delinq_2yrs'].fillna(0, inplace=True) # zero value
loan_data['emp_length_int'].fillna(0, inplace=True) # zero value

# To remove null values from dataset:
#indices = loan_data[loan_data['emp_length_int'].isnull()].index
#loan_data.drop(indices, inplace=True)

# 7) Search for errors/anomalies/outliers in the dataset. Remove or replace
pd.crosstab(loan_data['home_ownership'], 
            loan_data['emp_length_int'], 
            values=loan_data['mths_since_earliest_cr_line'], 
            aggfunc='min').round(2)

loan_data['mths_since_earliest_cr_line'].describe()

# Replace all negative values with the column maximum (using .loc avoids the
# SettingWithCopyWarning raised by chained indexing).
loan_data.loc[loan_data['mths_since_earliest_cr_line'] < 0,
              'mths_since_earliest_cr_line'] = loan_data['mths_since_earliest_cr_line'].max()

# Remove all negative values from dataset
#indices = loan_data[loan_data['mths_since_earliest_cr_line'] < 0].index
#loan_data.drop(indices, inplace=True)

4.2. Feature and Target Variable Preparation and Selection


The data should be divided into training and testing datasets. All discrete and continuous feature variables should be transformed into dummy variables. The initial transformation of the feature variables of the training dataset into narrow categories of arbitrary size is referred to as "fine classing". The process of creating new, refined and usually enlarged categories from these initial ones is known as "coarse classing".

A metric called 'Weight of Evidence' (WoE) is employed to this end, with the objective of reducing the number of dummy variables. Weight of evidence shows to what extent each category of an independent variable explains the dependent variable; the aim is to obtain categories with a similar WoE. Ideally, each category (bin) should have at least 5% of the observations, and each should contain both events and non-events. The WoE should be monotonic, i.e. either growing or decreasing with the groupings.

The formula for WoE is:

$$WoE = \ln\left(\frac{\%\ \text{of non-events in the bin}}{\%\ \text{of events in the bin}}\right)$$

The steps to calculate WoE are:

  1. For a continuous variable, split the data into ordered parts (or fewer, depending on the distribution)
  2. Calculate the number of events and non-events in each group (bin)
  3. Calculate the % of events and % of non-events in each group
  4. Calculate WoE by taking natural log of division of % of non-events and % of events
Note: for a discrete variable, it will be unnecessary to split the data, though some discrete variables can be ordered.

The interpretation of the WoE for a given category of an independent variable is relatively straightforward: the further its distance from zero, the more powerful the category is in differentiating between the two outcomes (Default/Non-Default) of the dependent variable. Whilst WoE describes the relationship between a category value and a binary target variable, the Information Value (IV) measures the predictive power of a feature as a whole. WoE considers only the discriminatory power of each bin, without regard to the proportion of observations in the bin; IV is a weighted sum of the WoE values. It is therefore a measure, typically between 0 and 1, of how much information an independent variable brings to explaining the dependent variable, and is thus a valuable tool for feature selection. The formula for IV is:

$$IV = \sum_{i}\left(\%\ \text{of non-events}_i - \%\ \text{of events}_i\right) \times WoE_i$$

The conventional procedure for feature analysis and selection is:
  1. Define the dependent "Default" variable
  2. Split data set into training and testing sets
  3. Create a dataframe for each input variable and the target variable, ensuring where possible that the input data is organized into ordered bins
  4. For each variable, compute the WoE of each category (bin)
  5. Adjust the dimensions of the categories in accordance with the interpretation of their WoE
  6. Compute the IV of the adjusted (coarse-classed) categories
  7. Applying qualitative and quantitative criteria, select the best predictor variables
In [2]:
# Define dependent 'Default' variable and add to loan_data dataframe

loan_data['good_bad'] = np.where(loan_data['loan_status'].isin(['Charged Off', 'Default',
                                                       'Does not meet the credit policy. Status:Charged Off',
                                                       'Late (31-120 days)']), 0, 1)
In [3]:
# Imports the libraries we need.
from sklearn.model_selection import train_test_split


cr_inp_train, cr_inp_test, cr_tgt_train, cr_tgt_test = train_test_split(loan_data.drop('good_bad', axis = 1), 
                                                                        loan_data['good_bad'], 
                                                                        test_size = 0.2, 
                                                                        random_state = 42)
In [4]:
# WoE function for discrete unordered variables
# The function takes 3 arguments: a feature dataframe, a string, and a target dataframe. 
# The function returns a dataframe as a result.

def woe_discrete(df, discrete_variable_name, good_bad_variable_df):
    df = pd.concat([df[discrete_variable_name], good_bad_variable_df], axis = 1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_good'] = df['prop_good'] * df['n_obs']
    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
    df = df.sort_values(['WoE'])
    df = df.reset_index(drop = True)
    df['diff_prop_good'] = df['prop_good'].diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df

# NOTE ON GROUPBY
# Groups the data according to a criterion contained in one column (1st = Grade)
# Does not turn the names of the values of the criterion into index if as_index = False
# Aggregates the data in another column (Good_bd) to these groups, using a selected function (mean)
# Syntax: Produces Pandas DataFrame >>> df.groupby('month')[['duration']].sum()
In [5]:
# WoE function for ordered discrete and continuous variables

def woe_ordered_continuous(df, discrete_variable_name, good_bad_variable_df):
    df = pd.concat([df[discrete_variable_name], good_bad_variable_df], axis = 1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_good'] = df['prop_good'] * df['n_obs']
    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
    #df = df.sort_values(['WoE'])
    #df = df.reset_index(drop = True)
    df['diff_prop_good'] = df['prop_good'].diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df

# NOTE: We order the results by the values of a different column.
In [6]:
# WoE Visualization

import matplotlib.pyplot as plt
import seaborn as sns
# Imports the libraries we need.
sns.set()
# We set the default style of the graphs to the seaborn style.

# Below we define a graphing function that takes 2 arguments: a WoE dataframe and a number to rotate x labels
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
    x = np.array(df_WoE.iloc[:, 0].apply(str))
    # Turns the values of the column with index 0 to strings, makes an array from these strings, and passes it to variable x.
    y = df_WoE['WoE']
    plt.figure(figsize=(18, 6))
    plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
    plt.xlabel(df_WoE.columns[0])
    # Names the x-axis with the name of the column with index 0.
    plt.ylabel('Weight of Evidence')
    # Names the y-axis 'Weight of Evidence'.
    plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
    # Names the graph 'Weight of Evidence by ' the name of the column with index 0.
    plt.xticks(rotation = rotation_of_x_axis_labels)
    # Rotates the labels of the x-axis a predefined number of degrees.
In [7]:
##### Procedure will be run twice. Once with training data and once with testing data #####


# New dataframe with training/test inputs and targets

df_inputs_prepr = cr_inp_train
df_targets_prepr = cr_tgt_train

#df_inputs_prepr = cr_inp_test
#df_targets_prepr = cr_tgt_test
In [8]:
df_targets_prepr
Out[8]:
427211    1
206088    1
136020    1
412305    0
36159     0
         ..
259178    1
365838    1
131932    1
146867    1
121958    1
Name: good_bad, Length: 373028, dtype: int32
In [9]:
df_temp = woe_discrete(df_inputs_prepr, 'grade', df_targets_prepr)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_temp
Out[9]:
grade n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 G 2654 0.727958 0.007115 1932.0 722.0 0.005815 0.017706 -1.113459 NaN NaN 0.288636
1 F 10530 0.754416 0.028228 7944.0 2586.0 0.023910 0.063417 -0.975440 0.026458 0.138019 0.288636
2 E 28612 0.805257 0.076702 23040.0 5572.0 0.069345 0.136642 -0.678267 0.050841 0.297173 0.288636
3 D 61498 0.846304 0.164862 52046.0 9452.0 0.156647 0.231792 -0.391843 0.041047 0.286424 0.288636
4 C 100245 0.885770 0.268733 88794.0 11451.0 0.267251 0.280813 -0.049503 0.039466 0.342340 0.288636
5 B 109730 0.921015 0.294160 101063.0 8667.0 0.304178 0.212541 0.358476 0.035245 0.407979 0.288636
6 A 59759 0.961044 0.160200 57431.0 2328.0 0.172855 0.057090 1.107830 0.040028 0.749353 0.288636
In [10]:
plot_by_woe(df_temp)
In [11]:
df_temp = woe_ordered_continuous(df_inputs_prepr, 'emp_length_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[11]:
emp_length_int n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 45720 0.876400 0.122565 40069.0 5651.0 0.120599 0.138580 -0.138975 NaN NaN 0.006506
1 1.0 23654 0.886996 0.063411 20981.0 2673.0 0.063148 0.065550 -0.037329 0.010596 0.101645 0.006506
2 2.0 33078 0.890955 0.088674 29471.0 3607.0 0.088701 0.088455 0.002785 0.003959 0.040114 0.006506
3 3.0 29205 0.890772 0.078292 26015.0 3190.0 0.078299 0.078228 0.000907 0.000183 0.001878 0.006506
4 4.0 22468 0.890644 0.060231 20011.0 2457.0 0.060229 0.060253 -0.000404 0.000128 0.001311 0.006506
5 5.0 24602 0.884725 0.065952 21766.0 2836.0 0.065511 0.069547 -0.059790 0.005920 0.059387 0.006506
6 6.0 20887 0.883899 0.055993 18462.0 2425.0 0.055567 0.059468 -0.067862 0.000826 0.008071 0.006506
7 7.0 21049 0.887453 0.056427 18680.0 2369.0 0.056223 0.058095 -0.032759 0.003554 0.035102 0.006506
8 8.0 17853 0.889878 0.047860 15887.0 1966.0 0.047816 0.048212 -0.008245 0.002425 0.024515 0.006506
9 9.0 14267 0.886662 0.038246 12650.0 1617.0 0.038074 0.039654 -0.040660 0.003217 0.032416 0.006506
10 10.0 120245 0.900312 0.322348 108258.0 11987.0 0.325833 0.293958 0.102950 0.013650 0.143610 0.006506
In [12]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
In [13]:
# Using WoE we combine residential status categories.
df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'], 
                                                             df_inputs_prepr['home_ownership:OTHER'],
                                                             df_inputs_prepr['home_ownership:NONE'],
                                                             df_inputs_prepr['home_ownership:ANY']])

# If a region does not feature in the address (state) column, it is added and assigned zero values
if 'addr_state:ND' not in df_inputs_prepr.columns.values:
    df_inputs_prepr['addr_state:ND'] = 0

# Using WoE we combine region categories.
df_inputs_prepr['addr_state:ND_NE_IA_NV_FL_HI_AL'] = sum([df_inputs_prepr['addr_state:ND'], df_inputs_prepr['addr_state:NE'],
                                              df_inputs_prepr['addr_state:IA'], df_inputs_prepr['addr_state:NV'],
                                              df_inputs_prepr['addr_state:FL'], df_inputs_prepr['addr_state:HI'],
                                                          df_inputs_prepr['addr_state:AL']])

df_inputs_prepr['addr_state:NM_VA'] = sum([df_inputs_prepr['addr_state:NM'], df_inputs_prepr['addr_state:VA']])

df_inputs_prepr['addr_state:OK_TN_MO_LA_MD_NC'] = sum([df_inputs_prepr['addr_state:OK'], df_inputs_prepr['addr_state:TN'],
                                              df_inputs_prepr['addr_state:MO'], df_inputs_prepr['addr_state:LA'],
                                              df_inputs_prepr['addr_state:MD'], df_inputs_prepr['addr_state:NC']])

df_inputs_prepr['addr_state:UT_KY_AZ_NJ'] = sum([df_inputs_prepr['addr_state:UT'], df_inputs_prepr['addr_state:KY'],
                                              df_inputs_prepr['addr_state:AZ'], df_inputs_prepr['addr_state:NJ']])

df_inputs_prepr['addr_state:AR_MI_PA_OH_MN'] = sum([df_inputs_prepr['addr_state:AR'], df_inputs_prepr['addr_state:MI'],
                                              df_inputs_prepr['addr_state:PA'], df_inputs_prepr['addr_state:OH'],
                                              df_inputs_prepr['addr_state:MN']])

df_inputs_prepr['addr_state:RI_MA_DE_SD_IN'] = sum([df_inputs_prepr['addr_state:RI'], df_inputs_prepr['addr_state:MA'],
                                              df_inputs_prepr['addr_state:DE'], df_inputs_prepr['addr_state:SD'],
                                              df_inputs_prepr['addr_state:IN']])

df_inputs_prepr['addr_state:GA_WA_OR'] = sum([df_inputs_prepr['addr_state:GA'], df_inputs_prepr['addr_state:WA'],
                                              df_inputs_prepr['addr_state:OR']])

df_inputs_prepr['addr_state:WI_MT'] = sum([df_inputs_prepr['addr_state:WI'], df_inputs_prepr['addr_state:MT']])

df_inputs_prepr['addr_state:IL_CT'] = sum([df_inputs_prepr['addr_state:IL'], df_inputs_prepr['addr_state:CT']])

df_inputs_prepr['addr_state:KS_SC_CO_VT_AK_MS'] = sum([df_inputs_prepr['addr_state:KS'], df_inputs_prepr['addr_state:SC'],
                                              df_inputs_prepr['addr_state:CO'], df_inputs_prepr['addr_state:VT'],
                                              df_inputs_prepr['addr_state:AK'], df_inputs_prepr['addr_state:MS']])

df_inputs_prepr['addr_state:WV_NH_WY_DC_ME_ID'] = sum([df_inputs_prepr['addr_state:WV'], df_inputs_prepr['addr_state:NH'],
                                              df_inputs_prepr['addr_state:WY'], df_inputs_prepr['addr_state:DC'],
                                              df_inputs_prepr['addr_state:ME'], df_inputs_prepr['addr_state:ID']])

# Using WoE we combine purpose categories.

df_inputs_prepr['purpose:educ__sm_b__wedd__ren_en__mov__house'] = sum([df_inputs_prepr['purpose:educational'], df_inputs_prepr['purpose:small_business'],
                                                                 df_inputs_prepr['purpose:wedding'], df_inputs_prepr['purpose:renewable_energy'],
                                                                 df_inputs_prepr['purpose:moving'], df_inputs_prepr['purpose:house']])
df_inputs_prepr['purpose:oth__med__vacation'] = sum([df_inputs_prepr['purpose:other'], df_inputs_prepr['purpose:medical'],
                                             df_inputs_prepr['purpose:vacation']])
df_inputs_prepr['purpose:major_purch__car__home_impr'] = sum([df_inputs_prepr['purpose:major_purchase'], df_inputs_prepr['purpose:car'],
                                                        df_inputs_prepr['purpose:home_improvement']])
In [14]:
df_inputs_prepr['term:36'] = np.where((df_inputs_prepr['term_int'] == 36), 1, 0)
df_inputs_prepr['term:60'] = np.where((df_inputs_prepr['term_int'] == 60), 1, 0)

# We create the following categories: '0', '1', '2 - 4', '5 - 6', '7 - 9', '10'
# '0' will be the reference category
df_inputs_prepr['emp_length:0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)
df_inputs_prepr['emp_length:1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)
df_inputs_prepr['emp_length:2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2, 5)), 1, 0)
df_inputs_prepr['emp_length:5-6'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5, 7)), 1, 0)
df_inputs_prepr['emp_length:7-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(7, 10)), 1, 0)
df_inputs_prepr['emp_length:10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)

# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['mths_since_issue_d_factor'] = pd.cut(df_inputs_prepr['mths_since_issue_d'], 50)
# Here we perform coarse-classing: we create the following categories:
# < 38, 38 - 39, 40 - 41, 42 - 48, 49 - 52, 53 - 64, 65 - 84, > 84.
df_inputs_prepr['mths_since_issue_d:<38'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(38)), 1, 0)
df_inputs_prepr['mths_since_issue_d:38-39'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(38, 40)), 1, 0)
df_inputs_prepr['mths_since_issue_d:40-41'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(40, 42)), 1, 0)
df_inputs_prepr['mths_since_issue_d:42-48'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(42, 49)), 1, 0)
df_inputs_prepr['mths_since_issue_d:49-52'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(49, 53)), 1, 0)
df_inputs_prepr['mths_since_issue_d:53-64'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(53, 65)), 1, 0)
df_inputs_prepr['mths_since_issue_d:65-84'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(65, 85)), 1, 0)
df_inputs_prepr['mths_since_issue_d:>84'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(85, 
                                                                                                      int(df_inputs_prepr['mths_since_issue_d'].max()))), 1, 0)

# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['int_rate_factor'] = pd.cut(df_inputs_prepr['int_rate'], 50)
# Here we perform coarse-classing: we create the following categories:
# '< 9.548', '9.548 - 12.025', '12.025 - 15.74', '15.74 - 20.281', '> 20.281'
df_inputs_prepr['int_rate:<9.548'] = np.where((df_inputs_prepr['int_rate'] <= 9.548), 1, 0)
df_inputs_prepr['int_rate:9.548-12.025'] = np.where((df_inputs_prepr['int_rate'] > 9.548) & (df_inputs_prepr['int_rate'] <= 12.025), 1, 0)
df_inputs_prepr['int_rate:12.025-15.74'] = np.where((df_inputs_prepr['int_rate'] > 12.025) & (df_inputs_prepr['int_rate'] <= 15.74), 1, 0)
df_inputs_prepr['int_rate:15.74-20.281'] = np.where((df_inputs_prepr['int_rate'] > 15.74) & (df_inputs_prepr['int_rate'] <= 20.281), 1, 0)
df_inputs_prepr['int_rate:>20.281'] = np.where((df_inputs_prepr['int_rate'] > 20.281), 1, 0)


# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['funded_amnt_factor'] = pd.cut(df_inputs_prepr['funded_amnt'], 50)
# We retain these categories

# Fine classed. Categories: Evenly split into 50 bins
df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 50)

# Here we perform coarse-classing: we create the following categories:
#< 140, # 141 - 164, # 165 - 247, # 248 - 270, # 271 - 352, # > 352
df_inputs_prepr['mths_since_earliest_cr_line:<140'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:141-164'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140, 165)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:165-247'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(165, 248)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:248-270'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(248, 271)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:271-352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(271, 353)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:>352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(353, int(df_inputs_prepr['mths_since_earliest_cr_line'].max()))), 1, 0)

# Here we perform coarse-classing: we create the following categories:
# Categories: 0, 1-3, >=4
df_inputs_prepr['delinq_2yrs:0'] = np.where((df_inputs_prepr['delinq_2yrs'] == 0), 1, 0)
df_inputs_prepr['delinq_2yrs:1-3'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 1) & (df_inputs_prepr['delinq_2yrs'] <= 3), 1, 0)
df_inputs_prepr['delinq_2yrs:>=4'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 9), 1, 0)

# Categories: 0, 1 - 2, 3 - 6, > 6
df_inputs_prepr['inq_last_6mths:0'] = np.where((df_inputs_prepr['inq_last_6mths'] == 0), 1, 0)
df_inputs_prepr['inq_last_6mths:1-2'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 1) & (df_inputs_prepr['inq_last_6mths'] <= 2), 1, 0)
df_inputs_prepr['inq_last_6mths:3-6'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 3) & (df_inputs_prepr['inq_last_6mths'] <= 6), 1, 0)
df_inputs_prepr['inq_last_6mths:>6'] = np.where((df_inputs_prepr['inq_last_6mths'] > 6), 1, 0)

# Categories: '0', '1-3', '4-12', '13-17', '18-22', '23-25', '26-30', '>30'
df_inputs_prepr['open_acc:0'] = np.where((df_inputs_prepr['open_acc'] == 0), 1, 0)
df_inputs_prepr['open_acc:1-3'] = np.where((df_inputs_prepr['open_acc'] >= 1) & (df_inputs_prepr['open_acc'] <= 3), 1, 0)
df_inputs_prepr['open_acc:4-12'] = np.where((df_inputs_prepr['open_acc'] >= 4) & (df_inputs_prepr['open_acc'] <= 12), 1, 0)
df_inputs_prepr['open_acc:13-17'] = np.where((df_inputs_prepr['open_acc'] >= 13) & (df_inputs_prepr['open_acc'] <= 17), 1, 0)
df_inputs_prepr['open_acc:18-22'] = np.where((df_inputs_prepr['open_acc'] >= 18) & (df_inputs_prepr['open_acc'] <= 22), 1, 0)
df_inputs_prepr['open_acc:23-25'] = np.where((df_inputs_prepr['open_acc'] >= 23) & (df_inputs_prepr['open_acc'] <= 25), 1, 0)
df_inputs_prepr['open_acc:26-30'] = np.where((df_inputs_prepr['open_acc'] >= 26) & (df_inputs_prepr['open_acc'] <= 30), 1, 0)
df_inputs_prepr['open_acc:>=31'] = np.where((df_inputs_prepr['open_acc'] >= 31), 1, 0)

# Categories '0-2', '3-4', '>=5'
df_inputs_prepr['pub_rec:0-2'] = np.where((df_inputs_prepr['pub_rec'] >= 0) & (df_inputs_prepr['pub_rec'] <= 2), 1, 0)
df_inputs_prepr['pub_rec:3-4'] = np.where((df_inputs_prepr['pub_rec'] >= 3) & (df_inputs_prepr['pub_rec'] <= 4), 1, 0)
df_inputs_prepr['pub_rec:>=5'] = np.where((df_inputs_prepr['pub_rec'] >= 5), 1, 0)

# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 50)

# Here we perform coarse-classing: we create the following categories: '<=27', '28-51', '>=52'
df_inputs_prepr['total_acc:<=27'] = np.where((df_inputs_prepr['total_acc'] <= 27), 1, 0)
df_inputs_prepr['total_acc:28-51'] = np.where((df_inputs_prepr['total_acc'] >= 28) & (df_inputs_prepr['total_acc'] <= 51), 1, 0)
df_inputs_prepr['total_acc:>=52'] = np.where((df_inputs_prepr['total_acc'] >= 52), 1, 0)

# Coarse classed. Categories: '0', '>=1'
df_inputs_prepr['acc_now_delinq:0'] = np.where((df_inputs_prepr['acc_now_delinq'] == 0), 1, 0)
df_inputs_prepr['acc_now_delinq:>=1'] = np.where((df_inputs_prepr['acc_now_delinq'] >= 1), 1, 0)

# Fine classed. Categories: Evenly split into 2000 bins
df_inputs_prepr['total_rev_hi_lim_factor'] = pd.cut(df_inputs_prepr['total_rev_hi_lim'], 2000)

# Coarse classed. Categories: <=5K', '5K-10K', '10K-20K', '20K-30K', '30K-40K', '40K-55K', '55K-95K', '>95K'
df_inputs_prepr['total_rev_hi_lim:<=5K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] <= 5000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:5K-10K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 5000) & (df_inputs_prepr['total_rev_hi_lim'] <= 10000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:10K-20K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 10000) & (df_inputs_prepr['total_rev_hi_lim'] <= 20000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:20K-30K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 20000) & (df_inputs_prepr['total_rev_hi_lim'] <= 30000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:30K-40K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 30000) & (df_inputs_prepr['total_rev_hi_lim'] <= 40000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:40K-55K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 40000) & (df_inputs_prepr['total_rev_hi_lim'] <= 55000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:55K-95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 55000) & (df_inputs_prepr['total_rev_hi_lim'] <= 95000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:>95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 95000), 1, 0)

# Fine classed. Categories: Evenly split into 50 bins
df_inputs_prepr['installment_factor'] = pd.cut(df_inputs_prepr['installment'], 50)

# Fine classed. Categories: Evenly split into 100 bins
df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 100)

# Coarse classed. We split annual income into 12 categories.
df_inputs_prepr['annual_inc:<20K'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)
df_inputs_prepr['annual_inc:20K-30K'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)
df_inputs_prepr['annual_inc:30K-40K'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)
df_inputs_prepr['annual_inc:40K-50K'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)
df_inputs_prepr['annual_inc:50K-60K'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)
df_inputs_prepr['annual_inc:60K-70K'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)
df_inputs_prepr['annual_inc:70K-80K'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)
df_inputs_prepr['annual_inc:80K-90K'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)
df_inputs_prepr['annual_inc:90K-100K'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)
df_inputs_prepr['annual_inc:100K-120K'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)
df_inputs_prepr['annual_inc:120K-140K'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)
df_inputs_prepr['annual_inc:>140K'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)

# Categories: Missing, 0-3, 4-30, 31-56, >=57
df_inputs_prepr['mths_since_last_delinq:Missing'] = np.where((df_inputs_prepr['mths_since_last_delinq'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_delinq:0-3'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 0) & (df_inputs_prepr['mths_since_last_delinq'] <= 3), 1, 0)
df_inputs_prepr['mths_since_last_delinq:4-30'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 4) & (df_inputs_prepr['mths_since_last_delinq'] <= 30), 1, 0)
df_inputs_prepr['mths_since_last_delinq:31-56'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 31) & (df_inputs_prepr['mths_since_last_delinq'] <= 56), 1, 0)
df_inputs_prepr['mths_since_last_delinq:>=57'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 57), 1, 0)

# Fine classed. Categories: Evenly split into 100 bins
df_inputs_prepr['dti_factor'] = pd.cut(df_inputs_prepr['dti'], 100)

# Categories:
df_inputs_prepr['dti:<=1.4'] = np.where((df_inputs_prepr['dti'] <= 1.4), 1, 0)
df_inputs_prepr['dti:1.4-3.5'] = np.where((df_inputs_prepr['dti'] > 1.4) & (df_inputs_prepr['dti'] <= 3.5), 1, 0)
df_inputs_prepr['dti:3.5-7.7'] = np.where((df_inputs_prepr['dti'] > 3.5) & (df_inputs_prepr['dti'] <= 7.7), 1, 0)
df_inputs_prepr['dti:7.7-10.5'] = np.where((df_inputs_prepr['dti'] > 7.7) & (df_inputs_prepr['dti'] <= 10.5), 1, 0)
df_inputs_prepr['dti:10.5-16.1'] = np.where((df_inputs_prepr['dti'] > 10.5) & (df_inputs_prepr['dti'] <= 16.1), 1, 0)
df_inputs_prepr['dti:16.1-20.3'] = np.where((df_inputs_prepr['dti'] > 16.1) & (df_inputs_prepr['dti'] <= 20.3), 1, 0)
df_inputs_prepr['dti:20.3-21.7'] = np.where((df_inputs_prepr['dti'] > 20.3) & (df_inputs_prepr['dti'] <= 21.7), 1, 0)
df_inputs_prepr['dti:21.7-22.4'] = np.where((df_inputs_prepr['dti'] > 21.7) & (df_inputs_prepr['dti'] <= 22.4), 1, 0)
df_inputs_prepr['dti:22.4-35'] = np.where((df_inputs_prepr['dti'] > 22.4) & (df_inputs_prepr['dti'] <= 35), 1, 0)
df_inputs_prepr['dti:>35'] = np.where((df_inputs_prepr['dti'] > 35), 1, 0)

# Categories: 'Missing', '0-2', '3-20', '21-31', '32-80', '81-86', '>86'
df_inputs_prepr['mths_since_last_record:Missing'] = np.where((df_inputs_prepr['mths_since_last_record'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_record:0-2'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 0) & (df_inputs_prepr['mths_since_last_record'] <= 2), 1, 0)
df_inputs_prepr['mths_since_last_record:3-20'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 3) & (df_inputs_prepr['mths_since_last_record'] <= 20), 1, 0)
df_inputs_prepr['mths_since_last_record:21-31'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 21) & (df_inputs_prepr['mths_since_last_record'] <= 31), 1, 0)
df_inputs_prepr['mths_since_last_record:32-80'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 32) & (df_inputs_prepr['mths_since_last_record'] <= 80), 1, 0)
df_inputs_prepr['mths_since_last_record:81-86'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 81) & (df_inputs_prepr['mths_since_last_record'] <= 86), 1, 0)
df_inputs_prepr['mths_since_last_record:>86'] = np.where((df_inputs_prepr['mths_since_last_record'] > 86), 1, 0)
In [15]:
# View metadata
df_inputs_prepr.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 373028 entries, 427211 to 121958
Columns: 324 entries, Unnamed: 0 to mths_since_last_record:>86
dtypes: category(9), datetime64[ns](2), float64(49), int32(92), int64(10), object(22), uint8(140)
memory usage: 423.5+ MB
In [16]:
##### Store training inputs in dataframe  #####

cr_inp_train = df_inputs_prepr

##### Store test inputs in dataframe
#cr_inp_test = df_inputs_prepr
In [17]:
##### Save training data to CSV file #####
cr_inp_train.to_csv('cr_inp_train.csv')
cr_tgt_train.to_csv('cr_tgt_train.csv')

##### Save test data to CSV file #####

#cr_inp_test.to_csv('cr_inp_test.csv')
#cr_tgt_test.to_csv('cr_tgt_test.csv')

5. PD Model Estimation


5.1. Fit a Naive Model Using the Pre-Selected Predictor Variables


Having performed an initial filtration of predictor variables, a preliminary model is run with these variables. Care should be taken to remove one dummy for each original variable to avoid the so-called dummy variable trap.

In [18]:
loan_data_inputs_train = pd.read_csv('cr_inp_train.csv', index_col = 0)
loan_data_targets_train = pd.read_csv('cr_tgt_train.csv', index_col = 0)
loan_data_inputs_test = pd.read_csv('cr_inp_test.csv', index_col = 0)
loan_data_targets_test = pd.read_csv('cr_tgt_test.csv', index_col = 0)
In [19]:
# Select a limited set of input variables in a new dataframe.
inputs_train_with_ref_cat = loan_data_inputs_train.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'delinq_2yrs:0',
'delinq_2yrs:1-3',
'delinq_2yrs:>=4',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'open_acc:0',
'open_acc:1-3',
'open_acc:4-12',
'open_acc:13-17',
'open_acc:18-22',
'open_acc:23-25',
'open_acc:26-30',
'open_acc:>=31',
'pub_rec:0-2',
'pub_rec:3-4',
'pub_rec:>=5',
'total_acc:<=27',
'total_acc:28-51',
'total_acc:>=52',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'total_rev_hi_lim:<=5K',
'total_rev_hi_lim:5K-10K',
'total_rev_hi_lim:10K-20K',
'total_rev_hi_lim:20K-30K',
'total_rev_hi_lim:30K-40K',
'total_rev_hi_lim:40K-55K',
'total_rev_hi_lim:55K-95K',
'total_rev_hi_lim:>95K',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
In [20]:
# Here we store the names of the reference category dummy variables in a list.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'delinq_2yrs:>=4',
'inq_last_6mths:>6',
'open_acc:0',
'pub_rec:0-2',
'total_acc:<=27',
'acc_now_delinq:0',
'total_rev_hi_lim:<=5K',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
In [21]:
# Drop the variables with variable names in the list with reference categories to avoid dummy variable trap
inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Create an instance of an object from the 'LogisticRegression' class with specified parameters
reg = LogisticRegression(solver='lbfgs', max_iter=200)

# Sets the pandas dataframe options to display all columns/ rows.
#pd.options.display.max_rows = None

# Estimates the coefficients of the object from the 'LogisticRegression' class
# np.ravel() is required to convert the target dataframe into a 1D numpy array
reg.fit(inputs_train, np.ravel(loan_data_targets_train))
C:\Users\delga\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Out[22]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [23]:
# Displays the intercept contained in the estimated ("fitted") object from the 'LogisticRegression' class.
reg.intercept_
Out[23]:
array([-1.69641599])
In [24]:
# Displays the coefficients contained in the estimated ("fitted") object from the 'LogisticRegression' class.
reg.coef_
Out[24]:
array([[ 1.14528344e+00,  8.94099019e-01,  6.98472774e-01,
         5.09252785e-01,  3.30640758e-01,  1.42515826e-01,
         9.17426145e-02,  1.07264042e-01,  3.60975951e-02,
         5.99599905e-02,  6.02816830e-02,  6.31110559e-02,
         7.91132581e-02,  1.36599873e-01,  1.01090449e-01,
         1.85222420e-01,  2.40112427e-01,  2.24601236e-01,
         2.63215859e-01,  3.21179074e-01,  5.24236957e-01,
         8.78829789e-02, -1.05476109e-02,  3.02722220e-01,
         1.99358523e-01,  2.11662337e-01,  2.64641132e-01,
         5.40210138e-02,  7.87898747e-02,  1.00483270e-01,
         1.25279467e-01,  9.04521958e-02,  6.02747351e-02,
         1.23138866e-01,  1.07294514e+00,  8.72162779e-01,
         7.72066669e-01,  5.70767806e-01,  4.08989085e-01,
         1.63339576e-01, -7.20172808e-02,  8.62232885e-01,
         5.47195666e-01,  2.97027495e-01,  1.07258640e-01,
         5.49390592e-02,  3.70621790e-02,  7.80084156e-02,
         1.19031700e-01,  1.24564448e-01,  9.08550235e-02,
         4.84293783e-02,  6.59248926e-01,  5.15715221e-01,
         3.06846154e-01,  3.40689586e-01,  2.44551577e-01,
         2.18617828e-01,  2.02962566e-01,  1.99136097e-01,
         2.35670531e-01,  1.43405647e-01,  1.17522819e-01,
         1.32391438e-01, -2.04353597e-02,  2.42785981e-02,
         1.86650314e-01,  4.13758648e-02,  1.03429616e-02,
         9.69416557e-03,  2.38900769e-02,  4.49928796e-02,
         7.05952276e-02,  2.20377919e-01, -6.82175446e-02,
         2.17117164e-04,  9.77340972e-02,  1.66648488e-01,
         2.42140782e-01,  3.16380993e-01,  3.91557060e-01,
         4.10242844e-01,  4.87244656e-01,  5.77630174e-01,
         5.01437979e-01,  1.83312115e-01,  3.12406482e-01,
         3.42476618e-01,  2.83282138e-01,  2.07270706e-01,
         1.09447019e-01,  9.97760027e-02,  6.26082943e-02,
         2.47770690e-02,  6.92607279e-02,  1.37932727e-01,
         1.48595420e-01,  1.00664499e-01,  3.48068437e-01,
         4.36647841e-01,  3.71307633e-01,  5.34928100e-01,
         1.97559958e-01,  2.64340836e-01]])
In [25]:
feature_name = inputs_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
In [26]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LogisticRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe by 1.
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
summary_table.head()
Out[26]:
Feature name Coefficients
0 Intercept -1.696416
1 grade:A 1.145283
2 grade:B 0.894099
3 grade:C 0.698473
4 grade:D 0.509253

5.2. Compute Statistical Significance of Predictor Variables


Having fitted the preliminary model, the p-values of the estimated beta coefficients of the feature variables should be analysed to ascertain their statistical significance and to determine whether each variable should be retained or discarded.
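Concretely, the helper class defined in the next cell computes Wald-test p-values for each coefficient. As a sketch in conventional notation (standard logistic regression theory, not taken verbatim from the code): with \(W\) the diagonal matrix of weights \(\hat{p}_i(1-\hat{p}_i)\) and \(\Phi\) the standard normal CDF,

$$ \widehat{SE}(\hat{\beta}_k) = \sqrt{\left[(X^{\top} W X)^{-1}\right]_{kk}}, \qquad z_k = \frac{\hat{\beta}_k}{\widehat{SE}(\hat{\beta}_k)}, \qquad p_k = 2\,\Phi(-\lvert z_k \rvert). $$

Coefficients with large p-values (conventionally above 0.05) are candidates for removal.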

In [27]:
# P values for sklearn logistic regression.

# Class to display p-values for logistic regression in sklearn.

from sklearn import linear_model
import scipy.stats as stat

class LogisticRegression_with_p_values:

    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)

        #### Get p-values for the fitted model ####
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        # Note: 1 / denom equals p * (1 - p), the logistic variance weight of each observation.
        denom = np.tile(denom, (X.shape[1], 1)).T
        F_ij = np.dot((X / denom).T, X)  # Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij)  # Inverse Information Matrix (Cramer-Rao lower bound)
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))  # standard errors of the coefficients
        z_scores = self.model.coef_[0] / sigma_estimates  # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]  # two-tailed test for p-values

        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        self.p_values = p_values
In [28]:
reg = LogisticRegression_with_p_values()
# We create an instance of an object from the newly created 'LogisticRegression_with_p_values()' class.
In [29]:
reg.fit(inputs_train, loan_data_targets_train)
# Estimates the coefficients of the object from the 'LogisticRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
C:\Users\delga\anaconda3\lib\site-packages\sklearn\utils\validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\delga\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
In [30]:
# Same as above.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()

# We take the values of the newly added 'p_values' attribute and store them in a variable 'p_values'.
p_values = reg.p_values
# Prepend a NaN for the intercept, for which no p-value is computed.
p_values = np.append(np.nan, np.array(p_values))
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' var.
summary_table['p_values'] = p_values
summary_table.head()
Out[30]:
Feature name Coefficients p_values
0 Intercept -1.331106 NaN
1 grade:A 1.160091 1.706560e-37
2 grade:B 0.906053 1.039904e-49
3 grade:C 0.708862 6.548970e-36
4 grade:D 0.519061 4.996770e-22

5.3. Fit a Refined Model Using the Statistically Significant Predictor Variables

In [31]:
# We are going to remove the features for which the coefficients of all, or almost all,
# of the dummy variables are not statistically significant.

# We do that by specifying a new list of dummy variables and a matching list of reference categories,
# this time excluding the insignificant original variables
# (delinq_2yrs, open_acc, pub_rec, total_acc and total_rev_hi_lim).
# Then, we drop the reference categories from the new list of dummy variables.

# Variables
inputs_train_with_ref_cat = loan_data_inputs_train.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]

ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
In [32]:
inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
inputs_train.head()
Out[32]:
grade:A grade:B grade:C grade:D grade:E grade:F home_ownership:OWN home_ownership:MORTGAGE addr_state:NM_VA addr_state:NY ... mths_since_last_delinq:Missing mths_since_last_delinq:4-30 mths_since_last_delinq:31-56 mths_since_last_delinq:>=57 mths_since_last_record:Missing mths_since_last_record:3-20 mths_since_last_record:21-31 mths_since_last_record:32-80 mths_since_last_record:81-86 mths_since_last_record:>86
427211 1 0 0 0 0 0 0 1 0 0 ... 1 0 0 0 1 0 0 0 0 0
206088 0 0 1 0 0 0 0 1 0 0 ... 0 1 0 0 1 0 0 0 0 0
136020 1 0 0 0 0 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 0
412305 0 0 0 1 0 0 0 0 0 0 ... 0 1 0 0 1 0 0 0 0 0
36159 0 0 1 0 0 0 0 1 0 0 ... 1 0 0 0 1 0 0 0 0 0

5 rows × 84 columns

In [33]:
# Here we run a new model.
reg2 = LogisticRegression_with_p_values()
reg2.fit(inputs_train, loan_data_targets_train)
C:\Users\delga\anaconda3\lib\site-packages\sklearn\utils\validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\delga\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
In [34]:
feature_name = inputs_train.columns.values
In [35]:
# Results for our final PD model.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg2.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg2.intercept_[0]]
summary_table = summary_table.sort_index()
p_values = reg2.p_values
p_values = np.append(np.nan,np.array(p_values))
summary_table['p_values'] = p_values
summary_table.head()
Out[35]:
Feature name Coefficients p_values
0 Intercept -1.374054 NaN
1 grade:A 1.123655 3.233003e-35
2 grade:B 0.878922 4.274889e-47
3 grade:C 0.684800 6.707744e-34
4 grade:D 0.496923 1.346968e-20
In [36]:
import pickle
# pickle.dump() takes two arguments: the object to pickle and the file object to which it is saved.
# To open the file for writing, we use the open() function. The first argument is the file name.
# The second argument is 'wb': 'w' means writing to the file, and 'b' refers to binary mode.
# Here we export our model to a file named 'pd_model1.sav'.
pickle.dump(reg2, open('pd_model1.sav', 'wb'))
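A minimal sketch of how the saved model could be reloaded in a later session (assuming 'pd_model1.sav' is on the working path; note that the 'LogisticRegression_with_p_values' class must be defined or importable before unpickling):

import pickle

# Load the fitted PD model back from disk ('rb' = read, binary mode).
with open('pd_model1.sav', 'rb') as model_file:
    pd_model = pickle.load(model_file)

# The reloaded object exposes the same attributes as before saving, e.g.:
# pd_model.model.predict_proba(inputs_test)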

6. PD Model Performance

6.1. Accuracy Scores and Confusion Matrices

In [37]:
# Here, from the dataframe with inputs for testing, we keep the same variables that we used in our final PD model.
inputs_test_with_ref_cat = loan_data_inputs_test.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
In [38]:
# And here, in the list below, we keep the variable names for the reference categories,
# only for the variables we used in our final PD model.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
In [39]:
inputs_test = inputs_test_with_ref_cat.drop(ref_categories, axis = 1)
inputs_test.head()
Out[39]:
grade:A grade:B grade:C grade:D grade:E grade:F home_ownership:OWN home_ownership:MORTGAGE addr_state:NM_VA addr_state:NY ... mths_since_last_delinq:Missing mths_since_last_delinq:4-30 mths_since_last_delinq:31-56 mths_since_last_delinq:>=57 mths_since_last_record:Missing mths_since_last_record:3-20 mths_since_last_record:21-31 mths_since_last_record:32-80 mths_since_last_record:81-86 mths_since_last_record:>86
362514 0 0 1 0 0 0 0 1 0 0 ... 1 0 0 0 1 0 0 0 0 0
288564 0 0 0 0 1 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
213591 0 0 1 0 0 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 0
263083 0 0 1 0 0 0 0 1 0 0 ... 1 0 0 0 1 0 0 0 0 0
165001 1 0 0 0 0 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 0

5 rows × 84 columns

In [40]:
# Calculates the predicted binary values for the dependent variable (targets)
# based on the out-of-sample values of the independent variables (inputs) and the coefficients of the refined model.
# Predicted probabilities above 0.5 are mapped to 1; all others to 0.
y_hat_test = reg2.model.predict(inputs_test)
y_hat_test
Out[40]:
array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
In [41]:
loan_data_targets_test_temp = loan_data_targets_test
In [42]:
loan_data_targets_test_temp.reset_index(drop = True, inplace = True)
# We reset the index of a dataframe.
In [43]:
# Concatenates two dataframes.
df_actual_predicted = pd.concat([loan_data_targets_test_temp, pd.DataFrame(y_hat_test)], axis = 1)
# Names Columns
df_actual_predicted.columns = ['loan_data_targets_test', 'y_hat_test (0.5)']
# Makes the index of one dataframe equal to the index of another dataframe.
df_actual_predicted.index = loan_data_inputs_test.index
df_actual_predicted.head()
Out[43]:
loan_data_targets_test y_hat_test (0.5)
362514 1 1
288564 1 1
213591 1 1
263083 1 1
165001 1 1
In [44]:
import itertools
import numpy as np
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='CONFUSION MATRIX',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float')
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=20)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.3f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('*TRUE LABEL*', fontsize=14)
    plt.xlabel('*PREDICTED LABEL*', fontsize=14)
    plt.show()
In [45]:
cm = confusion_matrix(loan_data_targets_test_temp, y_hat_test)
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='CONFUSION MATRIX - Threshold = 0.5',
                          cmap=plt.cm.RdYlGn)
Confusion matrix, without normalization
[[    6 10184]
 [    6 83061]]
In [47]:
# Actual vs predicted binary target variables (where 0.5 is the cutoff for predicted default/non-default)
from sklearn.metrics import accuracy_score
print("Accuracy (Out-of-Sample, threshold=0.5): ", accuracy_score(loan_data_targets_test_temp, y_hat_test))
Accuracy (Out-of-Sample, threshold=0.5):  0.8907320630086749
In [48]:
# Calculates the predicted probability values for the dependent variable (targets)
# based on the out-of-sample values of the independent variables (inputs) and the coefficients of the refined model.
# This is an array of arrays of predicted class probabilities for all classes.
# In this case, the first value of every sub-array is the probability of the observation belonging to the first class, i.e. 0,
# and the second value is the probability of it belonging to the second class, i.e. 1.
y_hat_test_proba = reg2.model.predict_proba(inputs_test)
y_hat_test_proba = y_hat_test_proba[:, 1]
y_hat_test_proba
Out[48]:
array([0.92430569, 0.84923866, 0.88534974, ..., 0.97321347, 0.95979153,
       0.95236655])
In [49]:
df_actual_predicted_probs = pd.concat([loan_data_targets_test_temp, pd.DataFrame(y_hat_test_proba)], axis = 1)
df_actual_predicted_probs.columns = ['loan_data_targets_test', 'y_hat_test_proba']
df_actual_predicted_probs.index = loan_data_inputs_test.index
df_actual_predicted_probs.head()
Out[49]:
loan_data_targets_test y_hat_test_proba
362514 1 0.924306
288564 1 0.849239
213591 1 0.885350
263083 1 0.940636
165001 1 0.968665
In [51]:
import matplotlib.pyplot as plt
plt.hist(df_actual_predicted_probs['y_hat_test_proba'], bins=50)
plt.title('Probability Distribution - No Default', fontsize=20)
plt.show()
In [52]:
tr = 0.9
# We create a new column with an indicator,
# where every observation that has predicted probability greater than the threshold has a value of 1,
# and every observation that has predicted probability lower than the threshold has a value of 0.
df_actual_predicted_probs['y_hat_test'] = np.where(df_actual_predicted_probs['y_hat_test_proba'] > tr, 1, 0)
In [53]:
# Creates a cross-table where the actual values are displayed by rows and the predicted values by columns.
# This table is known as a Confusion Matrix.

cm_df = pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], 
                    df_actual_predicted_probs['y_hat_test'], 
                    rownames = ['Actual'], colnames = ['Predicted'])

# Confusion Matrix as numpy array
cm_arr = np.array(cm_df)
cm_arr
Out[53]:
array([[ 7374,  2816],
       [35812, 47255]], dtype=int64)
In [54]:
# Confusion Matrix normalized by number of observations
cm_arr_norm = np.array([[cm_arr[0,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])), 
                          cm_arr[0,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))], 
                         [cm_arr[1,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])), 
                          cm_arr[1,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))]])
cm_arr_norm
Out[54]:
array([[0.07907181, 0.03019612],
       [0.38401407, 0.50671799]])
In [55]:
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm_arr, classes,
                          normalize=False,
                          title='CONFUSION MATRIX - Threshold = 0.9',
                          cmap=plt.cm.RdYlGn)
Confusion matrix, without normalization
[[ 7374  2816]
 [35812 47255]]
In [56]:
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm_arr_norm, classes,
                          normalize=True,
                          title='NORM. CONFUSION MATRIX - Threshold = 0.9',
                          cmap=plt.cm.RdYlGn)
Normalized confusion matrix
[[0.07907181 0.03019612]
 [0.38401407 0.50671799]]
In [57]:
print("Accuracy (Out-of-Sample, threshold=0.9): ", cm_arr[0,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])) 
      + cm_arr[1,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])))
Accuracy (Out-of-Sample, threshold=0.9):  0.5857898066633067

6.2. Receiver Operating Characteristic (ROC) Curve, Area under the Curve (AUC) and Gini Coefficient


Model performance is evaluated taking into consideration the shape of the ROC curve, the Area under the ROC Curve (AUROC) and the Gini Coefficient for the Testing (Out-of-Sample) Data.
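The two summary statistics are linked by a simple identity, used in cell In [80] below:

$$ \text{Gini} = 2 \times \text{AUROC} - 1, $$

so a model no better than random (AUROC = 0.5) has a Gini of 0, while a perfect discriminator (AUROC = 1) has a Gini of 1.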

In [58]:
from sklearn.metrics import roc_curve, roc_auc_score
In [59]:
roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Returns the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.
# As a result, we get three arrays: the false positive rates, the true positive rates, and the thresholds.
Out[59]:
(array([0.        , 0.        , 0.        , ..., 0.99960746, 1.        ,
        1.        ]),
 array([0.00000000e+00, 1.20384750e-05, 1.20384750e-04, ...,
        9.99975923e-01, 9.99975923e-01, 1.00000000e+00]),
 array([1.99262874, 0.99262874, 0.99069789, ..., 0.48790992, 0.39373402,
        0.37527935]))
In [60]:
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Here we store each of the three arrays in a separate variable. 
In [61]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [62]:
plt.plot(fpr, tpr)
# We plot the false positive rate along the x-axis and the true positive rate along the y-axis,
# thus plotting the ROC curve.
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('False Pos rate (% of Bad Loans Incorr. classified)')
# We name the x-axis.
plt.ylabel('True Pos rate (% of Good Loans Corr. Classified)')
# We name the y-axis.
plt.title('ROC curve',fontsize=20)
# We name the graph "ROC curve".
Out[62]:
Text(0.5, 1.0, 'ROC curve')
In [63]:
# The 'optimal' cut-off below maximizes Youden's J statistic, J = TPR - FPR,
# i.e. the point on the ROC curve farthest above the diagonal.
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
optimal_tpr = tpr[optimal_idx]
optimal_fpr = fpr[optimal_idx]
print("Optimal Threshold of          : ", optimal_threshold)
print("At Index                      : ", optimal_idx)
print("With Optimal True Pos Rate of : ", optimal_tpr)
print("And Optimal False Pos Rate of : ", optimal_fpr)
Optimal Threshold of          :  0.8859712927282212
At Index                      :  6625
With Optimal True Pos Rate of :  0.6438778335560451
And Optimal False Pos Rate of :  0.34720314033366045
In [64]:
j_scores = tpr-fpr
j_ordered = sorted(zip(fpr, tpr, j_scores, thresholds))
j_ordered_df = pd.DataFrame(data=j_ordered, columns=['FPR', 'TPR', 'TPR-FPR','Thresholds'])
j_ordered_df.head()
Out[64]:
FPR TPR TPR-FPR Thresholds
0 0.000000 0.000000 0.000000 1.992629
1 0.000000 0.000012 0.000012 0.992629
2 0.000000 0.000120 0.000120 0.990698
3 0.000098 0.000120 0.000022 0.990653
4 0.000098 0.000433 0.000335 0.989762
In [65]:
j_ordered_df.tail()
Out[65]:
FPR TPR TPR-FPR Thresholds
17258 0.999411 0.999964 0.000553 0.493404
17259 0.999607 0.999964 0.000356 0.488601
17260 0.999607 0.999976 0.000368 0.487910
17261 1.000000 0.999976 -0.000024 0.393734
17262 1.000000 1.000000 0.000000 0.375279
In [66]:
AUROC = roc_auc_score(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Calculates the Area Under the Receiver Operating Characteristic Curve (AUROC)
# from a set of actual values and their predicted probabilities.
AUROC
Out[66]:
0.7022080707330224
In [67]:
df_actual_predicted_probs = df_actual_predicted_probs.sort_values('y_hat_test_proba')
# Sorts a dataframe by the values of a specific column.
In [68]:
df_actual_predicted_probs.head()
Out[68]:
loan_data_targets_test y_hat_test_proba y_hat_test
42341 1 0.375279 0
42344 1 0.392099 0
39810 0 0.393734 0
40518 0 0.448967 0
42396 0 0.457733 0
In [69]:
df_actual_predicted_probs.tail()
Out[69]:
loan_data_targets_test y_hat_test_proba y_hat_test
262480 1 0.991292 1
231463 1 0.991304 1
239228 1 0.991652 1
261086 1 0.992058 1
242624 1 0.992629 1
In [70]:
df_actual_predicted_probs = df_actual_predicted_probs.reset_index()
# We reset the index of a dataframe and overwrite it.
In [71]:
df_actual_predicted_probs.head()
Out[71]:
index loan_data_targets_test y_hat_test_proba y_hat_test
0 42341 1 0.375279 0
1 42344 1 0.392099 0
2 39810 0 0.393734 0
3 40518 0 0.448967 0
4 42396 0 0.457733 0
In [72]:
df_actual_predicted_probs['Cumulative N Population'] = df_actual_predicted_probs.index + 1
# We calculate the cumulative number of all observations.
# We use the new index for that. Since indexing in Python starts from 0, we add 1 to each index.
df_actual_predicted_probs['Cumulative N Good'] = df_actual_predicted_probs['loan_data_targets_test'].cumsum()
# We calculate cumulative number of 'good', which is the cumulative sum of the column with actual observations.
df_actual_predicted_probs['Cumulative N Bad'] = df_actual_predicted_probs['Cumulative N Population'] - df_actual_predicted_probs['loan_data_targets_test'].cumsum()
# We calculate cumulative number of 'bad', which is
# the difference between the cumulative number of all observations and cumulative number of 'good' for each row.
In [73]:
df_actual_predicted_probs.head()
Out[73]:
index loan_data_targets_test y_hat_test_proba y_hat_test Cumulative N Population Cumulative N Good Cumulative N Bad
0 42341 1 0.375279 0 1 1 0
1 42344 1 0.392099 0 2 2 0
2 39810 0 0.393734 0 3 2 1
3 40518 0 0.448967 0 4 2 2
4 42396 0 0.457733 0 5 2 3
In [74]:
df_actual_predicted_probs['Cumulative Perc Population'] = df_actual_predicted_probs['Cumulative N Population'] / (df_actual_predicted_probs.shape[0])
# We calculate the cumulative percentage of all observations.
df_actual_predicted_probs['Cumulative Perc Good'] = df_actual_predicted_probs['Cumulative N Good'] / df_actual_predicted_probs['loan_data_targets_test'].sum()
# We calculate cumulative percentage of 'good'.
df_actual_predicted_probs['Cumulative Perc Bad'] = df_actual_predicted_probs['Cumulative N Bad'] / (df_actual_predicted_probs.shape[0] - df_actual_predicted_probs['loan_data_targets_test'].sum())
# We calculate the cumulative percentage of 'bad'.
In [75]:
df_actual_predicted_probs.head()
Out[75]:
index loan_data_targets_test y_hat_test_proba y_hat_test Cumulative N Population Cumulative N Good Cumulative N Bad Cumulative Perc Population Cumulative Perc Good Cumulative Perc Bad
0 42341 1 0.375279 0 1 1 0 0.000011 0.000012 0.000000
1 42344 1 0.392099 0 2 2 0 0.000021 0.000024 0.000000
2 39810 0 0.393734 0 3 2 1 0.000032 0.000024 0.000098
3 40518 0 0.448967 0 4 2 2 0.000043 0.000024 0.000196
4 42396 0 0.457733 0 5 2 3 0.000054 0.000024 0.000294
In [76]:
df_actual_predicted_probs.tail()
Out[76]:
index loan_data_targets_test y_hat_test_proba y_hat_test Cumulative N Population Cumulative N Good Cumulative N Bad Cumulative Perc Population Cumulative Perc Good Cumulative Perc Bad
93252 262480 1 0.991292 1 93253 83063 10190 0.999957 0.999952 1.0
93253 231463 1 0.991304 1 93254 83064 10190 0.999968 0.999964 1.0
93254 239228 1 0.991652 1 93255 83065 10190 0.999979 0.999976 1.0
93255 261086 1 0.992058 1 93256 83066 10190 0.999989 0.999988 1.0
93256 242624 1 0.992629 1 93257 83067 10190 1.000000 1.000000 1.0
In [77]:
# Plot the estimated probability of default across the population
x = 1 - (df_actual_predicted_probs['y_hat_test_proba'])
plt.scatter(df_actual_predicted_probs['Cumulative Perc Population'], x)
# We plot the cumulative percentage of the population along the x-axis
# and the estimated probability of default (1 minus the probability of being 'good') along the y-axis.

plt.xlabel('Cumulative % Observed Population')
# We name the x-axis "Cumulative % Observed Population".
plt.ylabel('Probability Default')
# We name the y-axis "Probability Default".
plt.title('Probability Default - Portfolio Constituents',fontsize=20)
Out[77]:
Text(0.5, 1.0, 'Probability Default - Portfolio Constituents')
In [78]:
# Plot the estimated probability of default against the cumulative count of the population
x = 1 - (df_actual_predicted_probs['y_hat_test_proba'])
plt.scatter(df_actual_predicted_probs['Cumulative N Population'], x)
# We plot the cumulative number of observations along the x-axis
# and the estimated probability of default along the y-axis.

plt.xlabel('Cumulative N Observed Population')
# We name the x-axis "Cumulative N Observed Population".
plt.ylabel('Probability Default')
# We name the y-axis "Probability Default".
plt.title('Probability Default - Portfolio Constituents',fontsize=20)
# We name the graph "Probability Default - Portfolio Constituents".
Out[78]:
Text(0.5, 1.0, 'Probability Default - Portfolio Constituents')
In [79]:
# Plot Gini
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Bad'])
# We plot the cumulative percentage of all along the x-axis and the cumulative percentage 'bad' along the y-axis,
# thus plotting the Gini curve.
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Population'], 
         linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('Cumulative % Observed Population')
# We name the x-axis "Cumulative % Population".
plt.ylabel('Cumulative % Observed Bad')
# We name the y-axis "Cumulative % Bad".
plt.title('Gini',fontsize=20)
# We name the graph "Gini".
Out[79]:
Text(0.5, 1.0, 'Gini')
In [80]:
Gini = AUROC * 2 - 1
# Here we calculate Gini from AUROC.
Gini
Out[80]:
0.40441614146604477

6.3. Kolmogorov-Smirnov Coefficient


Model performance is evaluated taking into consideration the KS Coefficient for the Testing (Out of Sample) Data which measures the maximum difference between the cumulative distribution functions of observed good and bad borrowers with respect to the estimated probabilities of "Good" according to the model. The greater the difference, the better the model.
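Formally, writing \(F_{bad}\) and \(F_{good}\) for the cumulative distribution functions of the estimated probability of being 'good' among observed bad and good borrowers respectively,

$$ KS = \max_{s}\, \lvert F_{bad}(s) - F_{good}(s) \rvert, $$

which is exactly the maximum vertical gap between the two cumulative curves plotted below.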

In [81]:
# Plot KS
plt.plot(df_actual_predicted_probs['y_hat_test_proba'], df_actual_predicted_probs['Cumulative Perc Bad'], color = 'r')
# We plot the predicted (estimated) probabilities along the x-axis and the cumulative percentage 'bad' along the y-axis,
# colored in red.
plt.plot(df_actual_predicted_probs['y_hat_test_proba'], df_actual_predicted_probs['Cumulative Perc Good'], color = 'b')
# We plot the predicted (estimated) probabilities along the x-axis and the cumulative percentage 'good' along the y-axis,
# colored in blue.
plt.xlabel('Estimated Probability for being Good')
# We name the x-axis "Estimated Probability for being Good".
plt.ylabel('Cumulative %')
# We name the y-axis "Cumulative %".
plt.legend(['Cumulative Perc Bad', 'Cumulative Perc Good'])
plt.title('Kolmogorov-Smirnov',fontsize=20)
# We name the graph "Kolmogorov-Smirnov".
Out[81]:
Text(0.5, 1.0, 'Kolmogorov-Smirnov')
In [82]:
KS = max(df_actual_predicted_probs['Cumulative Perc Bad'] - df_actual_predicted_probs['Cumulative Perc Good'])
# We calculate KS from the data. It is the maximum of the difference between the cumulative percentage of 'bad'
# and the cumulative percentage of 'good'.
print("KS Coefficient: ", KS)
KS Coefficient:  0.2966746932223847

7. PD Model Application

Calculating PD of individual accounts
In [83]:
#pd.options.display.max_columns = None
# Sets the pandas dataframe options to display all columns/ rows.
In [84]:
inputs_test_with_ref_cat.head()
Out[84]:
grade:A grade:B grade:C grade:D grade:E grade:F grade:G home_ownership:RENT_OTHER_NONE_ANY home_ownership:OWN home_ownership:MORTGAGE ... mths_since_last_delinq:4-30 mths_since_last_delinq:31-56 mths_since_last_delinq:>=57 mths_since_last_record:Missing mths_since_last_record:0-2 mths_since_last_record:3-20 mths_since_last_record:21-31 mths_since_last_record:32-80 mths_since_last_record:81-86 mths_since_last_record:>86
362514 0 0 1 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
288564 0 0 0 0 1 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
213591 0 0 1 0 0 0 0 0 0 1 ... 0 1 0 1 0 0 0 0 0 0
263083 0 0 1 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
165001 1 0 0 0 0 0 0 0 0 1 ... 0 1 0 1 0 0 0 0 0 0

5 rows × 101 columns

In [85]:
summary_table.head()
Out[85]:
Feature name Coefficients p_values
0 Intercept -1.374054 NaN
1 grade:A 1.123655 3.233003e-35
2 grade:B 0.878922 4.274889e-47
3 grade:C 0.684800 6.707744e-34
4 grade:D 0.496923 1.346968e-20
In [86]:
y_hat_test_proba
Out[86]:
array([0.92430569, 0.84923866, 0.88534974, ..., 0.97321347, 0.95979153,
       0.95236655])
Creating a Scorecard
In [87]:
summary_table.head()
Out[87]:
Feature name Coefficients p_values
0 Intercept -1.374054 NaN
1 grade:A 1.123655 3.233003e-35
2 grade:B 0.878922 4.274889e-47
3 grade:C 0.684800 6.707744e-34
4 grade:D 0.496923 1.346968e-20
In [88]:
ref_categories
Out[88]:
['grade:G',
 'home_ownership:RENT_OTHER_NONE_ANY',
 'addr_state:ND_NE_IA_NV_FL_HI_AL',
 'verification_status:Verified',
 'purpose:educ__sm_b__wedd__ren_en__mov__house',
 'initial_list_status:f',
 'term:60',
 'emp_length:0',
 'mths_since_issue_d:>84',
 'int_rate:>20.281',
 'mths_since_earliest_cr_line:<140',
 'inq_last_6mths:>6',
 'acc_now_delinq:0',
 'annual_inc:<20K',
 'dti:>35',
 'mths_since_last_delinq:0-3',
 'mths_since_last_record:0-2']
In [89]:
df_ref_categories = pd.DataFrame(ref_categories, columns = ['Feature name'])
# We create a new dataframe with one column, named 'Feature name'.
# Its values are the values from the 'ref_categories' list.
df_ref_categories['Coefficients'] = 0
# We create a second column, called 'Coefficients', which contains only 0 values.
df_ref_categories['p_values'] = np.nan
# We create a third column, called 'p_values', which contains only NaN values.
df_ref_categories.head()
Out[89]:
Feature name Coefficients p_values
0 grade:G 0 NaN
1 home_ownership:RENT_OTHER_NONE_ANY 0 NaN
2 addr_state:ND_NE_IA_NV_FL_HI_AL 0 NaN
3 verification_status:Verified 0 NaN
4 purpose:educ__sm_b__wedd__ren_en__mov__house 0 NaN
In [90]:
df_scorecard = pd.concat([summary_table, df_ref_categories])
# Concatenates two dataframes.
df_scorecard = df_scorecard.reset_index()
# We reset the index of a dataframe.
df_scorecard
Out[90]:
index Feature name Coefficients p_values
0 0 Intercept -1.374054 NaN
1 1 grade:A 1.123655 3.233003e-35
2 2 grade:B 0.878922 4.274889e-47
3 3 grade:C 0.684800 6.707744e-34
4 4 grade:D 0.496923 1.346968e-20
... ... ... ... ...
97 12 acc_now_delinq:0 0.000000 NaN
98 13 annual_inc:<20K 0.000000 NaN
99 14 dti:>35 0.000000 NaN
100 15 mths_since_last_delinq:0-3 0.000000 NaN
101 16 mths_since_last_record:0-2 0.000000 NaN

102 rows × 4 columns

In [91]:
df_scorecard['Original feature name'] = df_scorecard['Feature name'].str.split(':').str[0]
# We create a new column, called 'Original feature name', which contains the value of the 'Feature name' column
# up to the colon symbol.
df_scorecard
Out[91]:
index Feature name Coefficients p_values Original feature name
0 0 Intercept -1.374054 NaN Intercept
1 1 grade:A 1.123655 3.233003e-35 grade
2 2 grade:B 0.878922 4.274889e-47 grade
3 3 grade:C 0.684800 6.707744e-34 grade
4 4 grade:D 0.496923 1.346968e-20 grade
... ... ... ... ... ...
97 12 acc_now_delinq:0 0.000000 NaN acc_now_delinq
98 13 annual_inc:<20K 0.000000 NaN annual_inc
99 14 dti:>35 0.000000 NaN dti
100 15 mths_since_last_delinq:0-3 0.000000 NaN mths_since_last_delinq
101 16 mths_since_last_record:0-2 0.000000 NaN mths_since_last_record

102 rows × 5 columns

In [92]:
min_score = 300
max_score = 850
In [93]:
df_scorecard.groupby('Original feature name')['Coefficients'].min()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Coefficients' column, calculating their minimum.
Out[93]:
Original feature name
Intercept                     -1.374054
acc_now_delinq                 0.000000
addr_state                     0.000000
annual_inc                    -0.081517
dti                            0.000000
emp_length                     0.000000
grade                          0.000000
home_ownership                 0.000000
initial_list_status            0.000000
inq_last_6mths                 0.000000
int_rate                       0.000000
mths_since_earliest_cr_line    0.000000
mths_since_issue_d            -0.071796
mths_since_last_delinq         0.000000
mths_since_last_record         0.000000
purpose                        0.000000
term                           0.000000
verification_status           -0.011183
Name: Coefficients, dtype: float64
In [94]:
min_sum_coef = df_scorecard.groupby('Original feature name')['Coefficients'].min().sum()
# Up to the 'min()' method everything is the same as in the line above.
# Then, we aggregate further and sum all the minimum values.
min_sum_coef
Out[94]:
-1.5385497433222481
In [95]:
df_scorecard.groupby('Original feature name')['Coefficients'].max()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Coefficients' column, calculating their maximum.
Out[95]:
Original feature name
Intercept                     -1.374054
acc_now_delinq                 0.180360
addr_state                     0.521964
annual_inc                     0.552376
dti                            0.384454
emp_length                     0.125851
grade                          1.123655
home_ownership                 0.106250
initial_list_status            0.053828
inq_last_6mths                 0.666280
int_rate                       0.883160
mths_since_earliest_cr_line    0.129361
mths_since_issue_d             1.084200
mths_since_last_delinq         0.183097
mths_since_last_record         0.502965
purpose                        0.301853
term                           0.078943
verification_status            0.085721
Name: Coefficients, dtype: float64
In [96]:
max_sum_coef = df_scorecard.groupby('Original feature name')['Coefficients'].max().sum()
# Up to the 'max()' method everything is the same as in the line above.
# Then, we aggregate further and sum all the maximum values.
max_sum_coef
Out[96]:
5.590263276946006
In [97]:
df_scorecard['Score - Calculation'] = df_scorecard['Coefficients'] * (max_score - min_score) / (max_sum_coef - min_sum_coef)
# We multiply the value of the 'Coefficients' column by the ratio of the difference between the
# maximum and minimum score to the difference between the maximum and minimum sums of coefficients.
df_scorecard
Out[97]:
index Feature name Coefficients p_values Original feature name Score - Calculation
0 0 Intercept -1.374054 NaN Intercept -106.010607
1 1 grade:A 1.123655 3.233003e-35 grade 86.691898
2 2 grade:B 0.878922 4.274889e-47 grade 67.810341
3 3 grade:C 0.684800 6.707744e-34 grade 52.833454
4 4 grade:D 0.496923 1.346968e-20 grade 38.338454
... ... ... ... ... ... ...
97 12 acc_now_delinq:0 0.000000 NaN acc_now_delinq 0.000000
98 13 annual_inc:<20K 0.000000 NaN annual_inc 0.000000
99 14 dti:>35 0.000000 NaN dti 0.000000
100 15 mths_since_last_delinq:0-3 0.000000 NaN mths_since_last_delinq 0.000000
101 16 mths_since_last_record:0-2 0.000000 NaN mths_since_last_record 0.000000

102 rows × 6 columns
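The transformation applied above is a linear rescaling of each coefficient onto the credit score range; in the notebook's variable names,

$$ \text{Score}_k = \beta_k \times \frac{\text{max\_score} - \text{min\_score}}{\text{max\_sum\_coef} - \text{min\_sum\_coef}}, $$

and the intercept is rebased in the next cell so that the worst possible combination of dummies maps to min_score (300) and the best to max_score (850).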

In [98]:
df_scorecard['Score - Calculation'][0] = ((df_scorecard['Coefficients'][0] - min_sum_coef) / (max_sum_coef - min_sum_coef)) * (max_score - min_score) + min_score
# We divide the difference of the value of the 'Coefficients' column and the minimum sum of coefficients by
# the difference of the maximum sum of coefficients and the minimum sum of coefficients.
# Then, we multiply that by the difference between the maximum score and the minimum score.
# Then, we add minimum score. 
df_scorecard.head()
C:\Users\delga\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
Out[98]:
index Feature name Coefficients p_values Original feature name Score - Calculation
0 0 Intercept -1.374054 NaN Intercept 312.691112
1 1 grade:A 1.123655 3.233003e-35 grade 86.691898
2 2 grade:B 0.878922 4.274889e-47 grade 67.810341
3 3 grade:C 0.684800 6.707744e-34 grade 52.833454
4 4 grade:D 0.496923 1.346968e-20 grade 38.338454
In [99]:
df_scorecard['Score - Preliminary'] = df_scorecard['Score - Calculation'].round()
# We round the values of the 'Score - Calculation' column.
df_scorecard.head()
Out[99]:
index Feature name Coefficients p_values Original feature name Score - Calculation Score - Preliminary
0 0 Intercept -1.374054 NaN Intercept 312.691112 313.0
1 1 grade:A 1.123655 3.233003e-35 grade 86.691898 87.0
2 2 grade:B 0.878922 4.274889e-47 grade 67.810341 68.0
3 3 grade:C 0.684800 6.707744e-34 grade 52.833454 53.0
4 4 grade:D 0.496923 1.346968e-20 grade 38.338454 38.0
In [100]:
min_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Preliminary'].min().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Preliminary' column, calculating their minimum.
# Sums all minimum values.
min_sum_score_prel
Out[100]:
300.0
In [101]:
max_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Preliminary'].max().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Preliminary' column, calculating their maximum.
# Sums all maximum values.
max_sum_score_prel
Out[101]:
851.0
In [102]:
# The rounded maximum scores sum to 851, one point above the target of 850, so one point has to be
# subtracted from the maximum score of one original variable. Which one? We'll evaluate based on the rounding differences.
In [103]:
df_scorecard['Difference'] = df_scorecard['Score - Preliminary'] - df_scorecard['Score - Calculation']
df_scorecard.head()
Out[103]:
index Feature name Coefficients p_values Original feature name Score - Calculation Score - Preliminary Difference
0 0 Intercept -1.374054 NaN Intercept 312.691112 313.0 0.308888
1 1 grade:A 1.123655 3.233003e-35 grade 86.691898 87.0 0.308102
2 2 grade:B 0.878922 4.274889e-47 grade 67.810341 68.0 0.189659
3 3 grade:C 0.684800 6.707744e-34 grade 52.833454 53.0 0.166546
4 4 grade:D 0.496923 1.346968e-20 grade 38.338454 38.0 -0.338454
In [104]:
df_scorecard['Score - Final'] = df_scorecard['Score - Preliminary']
df_scorecard['Score - Final'][77] = 16
df_scorecard.head()
C:\Users\delga\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
Out[104]:
index Feature name Coefficients p_values Original feature name Score - Calculation Score - Preliminary Difference Score - Final
0 0 Intercept -1.374054 NaN Intercept 312.691112 313.0 0.308888 313.0
1 1 grade:A 1.123655 3.233003e-35 grade 86.691898 87.0 0.308102 87.0
2 2 grade:B 0.878922 4.274889e-47 grade 67.810341 68.0 0.189659 68.0
3 3 grade:C 0.684800 6.707744e-34 grade 52.833454 53.0 0.166546 53.0
4 4 grade:D 0.496923 1.346968e-20 grade 38.338454 38.0 -0.338454 38.0
In [105]:
min_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Final'].min().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Final' column, calculating their minimum.
# Sums all minimum values.
min_sum_score_prel
Out[105]:
300.0
In [106]:
max_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Final'].max().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Final' column, calculating their maximum.
# Sums all maximum values.
max_sum_score_prel
Out[106]:
853.0
Calculating Credit Score
In [107]:
inputs_test_with_ref_cat.head()
Out[107]:
grade:A grade:B grade:C grade:D grade:E grade:F grade:G home_ownership:RENT_OTHER_NONE_ANY home_ownership:OWN home_ownership:MORTGAGE ... mths_since_last_delinq:4-30 mths_since_last_delinq:31-56 mths_since_last_delinq:>=57 mths_since_last_record:Missing mths_since_last_record:0-2 mths_since_last_record:3-20 mths_since_last_record:21-31 mths_since_last_record:32-80 mths_since_last_record:81-86 mths_since_last_record:>86
362514 0 0 1 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
288564 0 0 0 0 1 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
213591 0 0 1 0 0 0 0 0 0 1 ... 0 1 0 1 0 0 0 0 0 0
263083 0 0 1 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
165001 1 0 0 0 0 0 0 0 0 1 ... 0 1 0 1 0 0 0 0 0 0

5 rows × 101 columns

In [108]:
df_scorecard.head()
Out[108]:
index Feature name Coefficients p_values Original feature name Score - Calculation Score - Preliminary Difference Score - Final
0 0 Intercept -1.374054 NaN Intercept 312.691112 313.0 0.308888 313.0
1 1 grade:A 1.123655 3.233003e-35 grade 86.691898 87.0 0.308102 87.0
2 2 grade:B 0.878922 4.274889e-47 grade 67.810341 68.0 0.189659 68.0
3 3 grade:C 0.684800 6.707744e-34 grade 52.833454 53.0 0.166546 53.0
4 4 grade:D 0.496923 1.346968e-20 grade 38.338454 38.0 -0.338454 38.0
In [109]:
inputs_test_with_ref_cat_w_intercept = inputs_test_with_ref_cat
In [110]:
inputs_test_with_ref_cat_w_intercept.insert(0, 'Intercept', 1)
# We insert a column in the dataframe, with an index of 0, that is, in the beginning of the dataframe.
# The name of that column is 'Intercept', and its values are 1s.
In [111]:
inputs_test_with_ref_cat_w_intercept.head()
Out[111]:
Intercept grade:A grade:B grade:C grade:D grade:E grade:F grade:G home_ownership:RENT_OTHER_NONE_ANY home_ownership:OWN ... mths_since_last_delinq:4-30 mths_since_last_delinq:31-56 mths_since_last_delinq:>=57 mths_since_last_record:Missing mths_since_last_record:0-2 mths_since_last_record:3-20 mths_since_last_record:21-31 mths_since_last_record:32-80 mths_since_last_record:81-86 mths_since_last_record:>86
362514 1 0 0 1 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
288564 1 0 0 0 0 1 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
213591 1 0 0 1 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0
263083 1 0 0 1 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
165001 1 1 0 0 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0

5 rows × 102 columns

In [112]:
inputs_test_with_ref_cat_w_intercept = inputs_test_with_ref_cat_w_intercept[df_scorecard['Feature name'].values]
# Here, from the 'inputs_test_with_ref_cat_w_intercept' dataframe, we keep only the columns whose names
# exactly match the row values of the 'Feature name' column from the 'df_scorecard' dataframe, in the same order.
In [113]:
inputs_test_with_ref_cat_w_intercept.head()
Out[113]:
Intercept grade:A grade:B grade:C grade:D grade:E grade:F home_ownership:OWN home_ownership:MORTGAGE addr_state:NM_VA ... emp_length:0 mths_since_issue_d:>84 int_rate:>20.281 mths_since_earliest_cr_line:<140 inq_last_6mths:>6 acc_now_delinq:0 annual_inc:<20K dti:>35 mths_since_last_delinq:0-3 mths_since_last_record:0-2
362514 1 0 0 1 0 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 0 0
288564 1 0 0 0 0 1 0 0 1 0 ... 0 0 1 0 0 1 0 0 1 0
213591 1 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 0 0 0
263083 1 0 0 1 0 0 0 0 1 0 ... 0 0 0 1 0 1 0 0 0 0
165001 1 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 102 columns

In [114]:
scorecard_scores = df_scorecard['Score - Final']
In [115]:
inputs_test_with_ref_cat_w_intercept.shape
Out[115]:
(93257, 102)
In [116]:
scorecard_scores.shape
Out[116]:
(102,)
In [117]:
scorecard_scores = scorecard_scores.values.reshape(102, 1)
In [118]:
scorecard_scores.shape
Out[118]:
(102, 1)
In [119]:
y_scores = inputs_test_with_ref_cat_w_intercept.dot(scorecard_scores)
# Here we multiply the values of each row of the dataframe by the values of each column of the variable,
# which is an argument of the 'dot' method, and sum them. It's essentially the sum of the products.
In [120]:
y_scores.head()
Out[120]:
0
362514 614.0
288564 553.0
213591 579.0
263083 633.0
165001 685.0
In [121]:
y_scores.tail()
Out[121]:
0
115 573.0
296284 679.0
61777 696.0
91763 664.0
167512 651.0
From Credit Score to PD
In [122]:
sum_coef_from_score = ((y_scores - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
# We divide the difference between the scores and the minimum score by
# the difference between the maximum score and the minimum score.
# Then, we multiply that by the difference between the maximum sum of coefficients and the minimum sum of coefficients.
# Then, we add the minimum sum of coefficients.
In [123]:
y_hat_proba_from_score = np.exp(sum_coef_from_score) / (np.exp(sum_coef_from_score) + 1)
# Here we apply the logistic function: we divide the exponential of the sum of coefficients
# by one plus that exponential.
y_hat_proba_from_score.head()
Out[123]:
0
362514 0.926311
288564 0.850776
213591 0.888717
263083 0.941455
165001 0.969279
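In other words, the implied probability of being 'good' for any credit score is recovered by inverting the linear scaling and applying the logistic function; in the notebook's variable names,

$$ x = \frac{\text{Score} - \text{min\_score}}{\text{max\_score} - \text{min\_score}} \left( \text{max\_sum\_coef} - \text{min\_sum\_coef} \right) + \text{min\_sum\_coef}, \qquad \hat{p} = \frac{e^{x}}{1 + e^{x}}. $$

The resulting probabilities agree with the directly estimated ones (compare Out [123] with Out [124]) up to the rounding of the scores.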
In [124]:
y_hat_test_proba[0: 5]
Out[124]:
array([0.92430569, 0.84923866, 0.88534974, 0.94063609, 0.96866495])
In [125]:
df_actual_predicted_probs['y_hat_test_proba'].head()
Out[125]:
0    0.375279
1    0.392099
2    0.393734
3    0.448967
4    0.457733
Name: y_hat_test_proba, dtype: float64
Setting Cut-offs
In [126]:
# We need the confusion matrix again.
#np.where(np.squeeze(np.array(loan_data_targets_test)) == np.where(y_hat_test_proba >= tr, 1, 0), 1, 0).sum() / loan_data_targets_test.shape[0]
tr = 0.9
df_actual_predicted_probs['y_hat_test'] = np.where(df_actual_predicted_probs['y_hat_test_proba'] > tr, 1, 0)
#df_actual_predicted_probs['loan_data_targets_test'] == np.where(df_actual_predicted_probs['y_hat_test_proba'] >= tr, 1, 0)
In [127]:
pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted'])
Out[127]:
Predicted 0 1
Actual
0 7374 2816
1 35812 47255
In [128]:
pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
Out[128]:
Predicted 0 1
Actual
0 0.079072 0.030196
1 0.384014 0.506718
In [129]:
(pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[0, 0] + (pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[1, 1]
Out[129]:
0.5857898066633067
In [130]:
from sklearn.metrics import roc_curve, roc_auc_score
In [131]:
roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
Out[131]:
(array([0.        , 0.        , 0.        , ..., 0.99960746, 1.        ,
        1.        ]),
 array([0.00000000e+00, 1.20384750e-05, 1.20384750e-04, ...,
        9.99975923e-01, 9.99975923e-01, 1.00000000e+00]),
 array([1.99262874, 0.99262874, 0.99069789, ..., 0.48790992, 0.39373402,
        0.37527935]))
In [132]:
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
In [133]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [134]:
plt.plot(fpr, tpr)
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
Out[134]:
Text(0.5, 1.0, 'ROC curve')
In [135]:
thresholds
Out[135]:
array([1.99262874, 0.99262874, 0.99069789, ..., 0.48790992, 0.39373402,
       0.37527935])
In [136]:
thresholds.shape
Out[136]:
(17263,)
In [137]:
df_cutoffs = pd.concat([pd.DataFrame(thresholds), pd.DataFrame(fpr), pd.DataFrame(tpr)], axis = 1)
# We concatenate 3 dataframes along the columns.
In [138]:
df_cutoffs.columns = ['thresholds', 'fpr', 'tpr']
# We name the columns of the dataframe 'thresholds', 'fpr', and 'tpr'.
In [139]:
df_cutoffs.head()
Out[139]:
thresholds fpr tpr
0 1.992629 0.000000 0.000000
1 0.992629 0.000000 0.000012
2 0.990698 0.000000 0.000120
3 0.990653 0.000098 0.000120
4 0.989762 0.000098 0.000433
In [140]:
df_cutoffs['thresholds'][0] = 1 - 1 / np.power(10, 16)
# roc_curve sets the first threshold above 1 (the maximum predicted probability plus 1), for which the
# log-odds computed below would be undefined. We therefore replace it with a number very close to,
# but smaller than, 1: namely 1 - 1 / 10 ^ 16.
In [141]:
df_cutoffs['Score'] = ((np.log(df_cutoffs['thresholds'] / (1 - df_cutoffs['thresholds'])) - min_sum_coef) * ((max_score - min_score) / (max_sum_coef - min_sum_coef)) + min_score).round()
# The score corresponding to each threshold equals:
# the natural logarithm of the odds, ln(p / (1 - p)), minus the minimum sum of coefficients,
# multiplied by the ratio of the score range (max_score - min_score)
# to the range of the sums of coefficients (max_sum_coef - min_sum_coef),
# plus the minimum score, rounded to the nearest integer.
In [142]:
df_cutoffs.head()
Out[142]:
thresholds fpr tpr Score
0 1.000000 0.000000 0.000000 2066.0
1 0.992629 0.000000 0.000012 797.0
2 0.990698 0.000000 0.000120 779.0
3 0.990653 0.000098 0.000120 778.0
4 0.989762 0.000098 0.000433 771.0
In [143]:
df_cutoffs['Score'][0] = max_score
# We cap the first score at the maximum score: the near-1 threshold set above otherwise implies
# a score well above 850 (see Out [142]).
In [144]:
df_cutoffs.head()
Out[144]:
thresholds fpr tpr Score
0 1.000000 0.000000 0.000000 850.0
1 0.992629 0.000000 0.000012 797.0
2 0.990698 0.000000 0.000120 779.0
3 0.990653 0.000098 0.000120 778.0
4 0.989762 0.000098 0.000433 771.0
In [145]:
df_cutoffs.tail()
Out[145]:
thresholds fpr tpr Score
17258 0.493404 0.999411 0.999964 417.0
17259 0.488601 0.999607 0.999964 415.0
17260 0.487910 0.999607 0.999976 415.0
17261 0.393734 1.000000 0.999976 385.0
17262 0.375279 1.000000 1.000000 379.0
In [146]:
# We define a function called 'n_approved' which assigns a value of 1 if a predicted probability
# is greater than or equal to the parameter p, which is a threshold, and a value of 0 if it is not.
# Then it sums the column.
# Thus, given any threshold value, the function returns
# the number of rows with estimated probabilities greater than or equal to the threshold.
def n_approved(p):
    return np.where(df_actual_predicted_probs['y_hat_test_proba'] >= p, 1, 0).sum()
In [147]:
df_cutoffs['N Approved'] = df_cutoffs['thresholds'].apply(n_approved)
# Assuming that all credit applications above a given probability of being 'good' will be approved,
# when we apply the 'n_approved' function to a threshold, it will return the number of approved applications.
# Thus, here we calculate the number of approved applications for all thresholds.
df_cutoffs['N Rejected'] = df_actual_predicted_probs['y_hat_test_proba'].shape[0] - df_cutoffs['N Approved']
# Then, we calculate the number of rejected applications for each threshold.
# It is the difference between the total number of applications and the approved applications for that threshold.
df_cutoffs['Approval Rate'] = df_cutoffs['N Approved'] / df_actual_predicted_probs['y_hat_test_proba'].shape[0]
# Approval rate equals the ratio of approved applications to all applications.
df_cutoffs['Rejection Rate'] = 1 - df_cutoffs['Approval Rate']
# Rejection rate equals one minus approval rate.
In [148]:
df_cutoffs.head()
Out[148]:
thresholds fpr tpr Score N Approved N Rejected Approval Rate Rejection Rate
0 1.000000 0.000000 0.000000 850.0 0 93257 0.000000 1.000000
1 0.992629 0.000000 0.000012 797.0 1 93256 0.000011 0.999989
2 0.990698 0.000000 0.000120 779.0 10 93247 0.000107 0.999893
3 0.990653 0.000098 0.000120 778.0 11 93246 0.000118 0.999882
4 0.989762 0.000098 0.000433 771.0 37 93220 0.000397 0.999603
In [149]:
df_cutoffs.tail()
Out[149]:
thresholds fpr tpr Score N Approved N Rejected Approval Rate Rejection Rate
17258 0.493404 0.999411 0.999964 417.0 93248 9 0.999903 0.000097
17259 0.488601 0.999607 0.999964 415.0 93250 7 0.999925 0.000075
17260 0.487910 0.999607 0.999976 415.0 93251 6 0.999936 0.000064
17261 0.393734 1.000000 0.999976 385.0 93255 2 0.999979 0.000021
17262 0.375279 1.000000 1.000000 379.0 93257 0 1.000000 0.000000
In [150]:
df_cutoffs.iloc[5000: 5200, ]
# Here we display the cutoffs dataframe from the row with index 5000 to the row with index 5199.
Out[150]:
thresholds fpr tpr Score N Approved N Rejected Approval Rate Rejection Rate
5000 0.903616 0.259176 0.547462 591.0 48117 45140 0.515961 0.484039
5001 0.903598 0.259176 0.547630 591.0 48131 45126 0.516111 0.483889
5002 0.903596 0.259274 0.547630 591.0 48132 45125 0.516122 0.483878
5003 0.903592 0.259274 0.547666 591.0 48135 45122 0.516154 0.483846
5004 0.903591 0.259372 0.547666 591.0 48136 45121 0.516165 0.483835
... ... ... ... ... ... ... ... ...
5195 0.901334 0.270265 0.560945 589.0 49350 43907 0.529183 0.470817
5196 0.901333 0.270363 0.560945 589.0 49351 43906 0.529194 0.470806
5197 0.901275 0.270363 0.561330 589.0 49383 43874 0.529537 0.470463
5198 0.901272 0.270461 0.561330 589.0 49384 43873 0.529547 0.470453
5199 0.901268 0.270461 0.561426 589.0 49392 43865 0.529633 0.470367

200 rows × 8 columns

In [151]:
df_cutoffs.iloc[1000: 1200, ]
# Here we display the cutoffs dataframe from the row with index 1000 to the row with index 1199.
Out[151]:
thresholds fpr tpr Score N Approved N Rejected Approval Rate Rejection Rate
1000 0.953241 0.049166 0.206592 651.0 17662 75595 0.189391 0.810609
1001 0.953231 0.049166 0.206737 651.0 17674 75583 0.189519 0.810481
1002 0.953227 0.049264 0.206737 651.0 17675 75582 0.189530 0.810470
1003 0.953219 0.049264 0.206809 651.0 17681 75576 0.189594 0.810406
1004 0.953219 0.049362 0.206809 651.0 17682 75575 0.189605 0.810395
... ... ... ... ... ... ... ... ...
1195 0.949153 0.059961 0.233968 645.0 20046 73211 0.214954 0.785046
1196 0.949149 0.060059 0.233968 644.0 20047 73210 0.214965 0.785035
1197 0.949105 0.060059 0.234317 644.0 20076 73181 0.215276 0.784724
1198 0.949104 0.060157 0.234317 644.0 20077 73180 0.215287 0.784713
1199 0.949080 0.060157 0.234509 644.0 20093 73164 0.215458 0.784542

200 rows × 8 columns
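With the cutoff table in hand, a business rule can be mapped straight back to a probability threshold and score cutoff. A minimal sketch, using an illustrative 75% target approval rate:

target_approval_rate = 0.75  # illustrative business target
closest_row = (df_cutoffs['Approval Rate'] - target_approval_rate).abs().idxmin()
# The threshold, score cutoff and realised rates nearest the target.
df_cutoffs.loc[closest_row, ['thresholds', 'Score', 'Approval Rate', 'Rejection Rate']]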

In [152]:
inputs_train_with_ref_cat.to_csv('inputs_train_with_ref_cat.csv')
In [153]:
df_scorecard.to_csv('df_scorecard.csv')

8. LGD and EAD Models

8.1. Defining Dependent Variable for LGD and EAD Models

In [1]:
import numpy as np
import pandas as pd
In [2]:
# Import data. low_memory=False makes pandas read each column in full before inferring its dtype,
# which avoids a DtypeWarning about columns with mixed types.
loan_data_preprocessed_backup = pd.read_csv('loan_data_2007_2014_preprocessed.csv', low_memory=False)
In [3]:
loan_data_preprocessed = loan_data_preprocessed_backup.copy()
In [4]:
loan_data_preprocessed.columns.values
# Displays all column names.
Out[4]:
array(['Unnamed: 0', 'Unnamed: 0.1', 'id', 'member_id', 'loan_amnt',
       'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint',
       'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt',
       'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m',
       'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util',
       'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util',
       'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
       'emp_length_int', 'earliest_cr_line_date',
       'mths_since_earliest_cr_line', 'term_int', 'issue_d_date',
       'mths_since_issue_d', 'grade:A', 'grade:B', 'grade:C', 'grade:D',
       'grade:E', 'grade:F', 'grade:G', 'sub_grade:A1', 'sub_grade:A2',
       'sub_grade:A3', 'sub_grade:A4', 'sub_grade:A5', 'sub_grade:B1',
       'sub_grade:B2', 'sub_grade:B3', 'sub_grade:B4', 'sub_grade:B5',
       'sub_grade:C1', 'sub_grade:C2', 'sub_grade:C3', 'sub_grade:C4',
       'sub_grade:C5', 'sub_grade:D1', 'sub_grade:D2', 'sub_grade:D3',
       'sub_grade:D4', 'sub_grade:D5', 'sub_grade:E1', 'sub_grade:E2',
       'sub_grade:E3', 'sub_grade:E4', 'sub_grade:E5', 'sub_grade:F1',
       'sub_grade:F2', 'sub_grade:F3', 'sub_grade:F4', 'sub_grade:F5',
       'sub_grade:G1', 'sub_grade:G2', 'sub_grade:G3', 'sub_grade:G4',
       'sub_grade:G5', 'home_ownership:ANY', 'home_ownership:MORTGAGE',
       'home_ownership:NONE', 'home_ownership:OTHER',
       'home_ownership:OWN', 'home_ownership:RENT',
       'verification_status:Not Verified',
       'verification_status:Source Verified',
       'verification_status:Verified', 'loan_status:Charged Off',
       'loan_status:Current', 'loan_status:Default',
       'loan_status:Does not meet the credit policy. Status:Charged Off',
       'loan_status:Does not meet the credit policy. Status:Fully Paid',
       'loan_status:Fully Paid', 'loan_status:In Grace Period',
       'loan_status:Late (16-30 days)', 'loan_status:Late (31-120 days)',
       'purpose:car', 'purpose:credit_card', 'purpose:debt_consolidation',
       'purpose:educational', 'purpose:home_improvement', 'purpose:house',
       'purpose:major_purchase', 'purpose:medical', 'purpose:moving',
       'purpose:other', 'purpose:renewable_energy',
       'purpose:small_business', 'purpose:vacation', 'purpose:wedding',
       'addr_state:AK', 'addr_state:AL', 'addr_state:AR', 'addr_state:AZ',
       'addr_state:CA', 'addr_state:CO', 'addr_state:CT', 'addr_state:DC',
       'addr_state:DE', 'addr_state:FL', 'addr_state:GA', 'addr_state:HI',
       'addr_state:IA', 'addr_state:ID', 'addr_state:IL', 'addr_state:IN',
       'addr_state:KS', 'addr_state:KY', 'addr_state:LA', 'addr_state:MA',
       'addr_state:MD', 'addr_state:ME', 'addr_state:MI', 'addr_state:MN',
       'addr_state:MO', 'addr_state:MS', 'addr_state:MT', 'addr_state:NC',
       'addr_state:NE', 'addr_state:NH', 'addr_state:NJ', 'addr_state:NM',
       'addr_state:NV', 'addr_state:NY', 'addr_state:OH', 'addr_state:OK',
       'addr_state:OR', 'addr_state:PA', 'addr_state:RI', 'addr_state:SC',
       'addr_state:SD', 'addr_state:TN', 'addr_state:TX', 'addr_state:UT',
       'addr_state:VA', 'addr_state:VT', 'addr_state:WA', 'addr_state:WI',
       'addr_state:WV', 'addr_state:WY', 'initial_list_status:f',
       'initial_list_status:w', 'good_bad'], dtype=object)
In [5]:
loan_data_preprocessed.head()
Out[5]:
Unnamed: 0 Unnamed: 0.1 id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment ... addr_state:UT addr_state:VA addr_state:VT addr_state:WA addr_state:WI addr_state:WV addr_state:WY initial_list_status:f initial_list_status:w good_bad
0 0 0 1077501 1296599 5000 5000 4975.0 36 months 10.65 162.87 ... 0 0 0 0 0 0 0 1 0 1
1 1 1 1077430 1314167 2500 2500 2500.0 60 months 15.27 59.83 ... 0 0 0 0 0 0 0 1 0 0
2 2 2 1077175 1313524 2400 2400 2400.0 36 months 15.96 84.33 ... 0 0 0 0 0 0 0 1 0 1
3 3 3 1076863 1277178 10000 10000 10000.0 36 months 13.49 339.31 ... 0 0 0 0 0 0 0 1 0 1
4 4 4 1075358 1311748 3000 3000 3000.0 60 months 12.69 67.79 ... 0 0 0 0 0 0 0 1 0 1

5 rows × 209 columns

In [6]:
pd.options.display.max_columns = None
loan_data_preprocessed
Out[6]:
466285 rows × 209 columns (full dataframe display omitted: too wide to reproduce legibly)

In [7]:
# Create a series of Boolean values indicating whether a loan is recognised as "Charged Off".
loan_data_preprocessed['loan_status'].isin(['Charged Off','Does not meet the credit policy. Status:Charged Off'])
# Create a dataframe with data only for those accounts recognised as "Charged Off".
# We take an explicit copy so that later column assignments do not raise SettingWithCopyWarning.
loan_data_defaults = loan_data_preprocessed[loan_data_preprocessed['loan_status'].isin(['Charged Off',
                                                                                        'Does not meet the credit policy. Status:Charged Off'])].copy()
loan_data_defaults
Out[7]:
43236 rows × 209 columns (full display of the defaulted-loan subset omitted: too wide to reproduce legibly)

In [8]:
pd.options.display.max_rows = None
loan_data_defaults.isnull().sum()
Out[8]:
Unnamed: 0                                                             0
Unnamed: 0.1                                                           0
id                                                                     0
member_id                                                              0
loan_amnt                                                              0
funded_amnt                                                            0
funded_amnt_inv                                                        0
term                                                                   0
int_rate                                                               0
installment                                                            0
grade                                                                  0
sub_grade                                                              0
emp_title                                                           3287
emp_length                                                          2337
home_ownership                                                         0
annual_inc                                                             0
verification_status                                                    0
issue_d                                                                0
loan_status                                                            0
pymnt_plan                                                             0
url                                                                    0
desc                                                               27396
purpose                                                                0
title                                                                  3
zip_code                                                               0
addr_state                                                             0
dti                                                                    0
delinq_2yrs                                                            0
earliest_cr_line                                                       3
inq_last_6mths                                                         0
mths_since_last_delinq                                             23950
mths_since_last_record                                             37821
open_acc                                                               0
pub_rec                                                                0
revol_bal                                                              0
revol_util                                                            53
total_acc                                                              0
initial_list_status                                                    0
out_prncp                                                              0
out_prncp_inv                                                          0
total_pymnt                                                            0
total_pymnt_inv                                                        0
total_rec_prncp                                                        0
total_rec_int                                                          0
total_rec_late_fee                                                     0
recoveries                                                             0
collection_recovery_fee                                                0
last_pymnt_d                                                         376
last_pymnt_amnt                                                        0
next_pymnt_d                                                       42475
last_credit_pull_d                                                     6
collections_12_mths_ex_med                                            28
mths_since_last_major_derog                                        35283
policy_code                                                            0
application_type                                                       0
annual_inc_joint                                                   43236
dti_joint                                                          43236
verification_status_joint                                          43236
acc_now_delinq                                                         0
tot_coll_amt                                                       10780
tot_cur_bal                                                        10780
open_acc_6m                                                        43236
open_il_6m                                                         43236
open_il_12m                                                        43236
open_il_24m                                                        43236
mths_since_rcnt_il                                                 43236
total_bal_il                                                       43236
il_util                                                            43236
open_rv_12m                                                        43236
open_rv_24m                                                        43236
max_bal_bc                                                         43236
all_util                                                           43236
total_rev_hi_lim                                                       0
inq_fi                                                             43236
total_cu_tl                                                        43236
inq_last_12m                                                       43236
emp_length_int                                                         0
earliest_cr_line_date                                                  3
mths_since_earliest_cr_line                                            0
term_int                                                               0
issue_d_date                                                           0
mths_since_issue_d                                                     0
grade:A                                                                0
grade:B                                                                0
grade:C                                                                0
grade:D                                                                0
grade:E                                                                0
grade:F                                                                0
grade:G                                                                0
sub_grade:A1                                                           0
sub_grade:A2                                                           0
sub_grade:A3                                                           0
sub_grade:A4                                                           0
sub_grade:A5                                                           0
sub_grade:B1                                                           0
sub_grade:B2                                                           0
sub_grade:B3                                                           0
sub_grade:B4                                                           0
sub_grade:B5                                                           0
sub_grade:C1                                                           0
sub_grade:C2                                                           0
sub_grade:C3                                                           0
sub_grade:C4                                                           0
sub_grade:C5                                                           0
sub_grade:D1                                                           0
sub_grade:D2                                                           0
sub_grade:D3                                                           0
sub_grade:D4                                                           0
sub_grade:D5                                                           0
sub_grade:E1                                                           0
sub_grade:E2                                                           0
sub_grade:E3                                                           0
sub_grade:E4                                                           0
sub_grade:E5                                                           0
sub_grade:F1                                                           0
sub_grade:F2                                                           0
sub_grade:F3                                                           0
sub_grade:F4                                                           0
sub_grade:F5                                                           0
sub_grade:G1                                                           0
sub_grade:G2                                                           0
sub_grade:G3                                                           0
sub_grade:G4                                                           0
sub_grade:G5                                                           0
home_ownership:ANY                                                     0
home_ownership:MORTGAGE                                                0
home_ownership:NONE                                                    0
home_ownership:OTHER                                                   0
home_ownership:OWN                                                     0
home_ownership:RENT                                                    0
verification_status:Not Verified                                       0
verification_status:Source Verified                                    0
verification_status:Verified                                           0
loan_status:Charged Off                                                0
loan_status:Current                                                    0
loan_status:Default                                                    0
loan_status:Does not meet the credit policy. Status:Charged Off        0
loan_status:Does not meet the credit policy. Status:Fully Paid         0
loan_status:Fully Paid                                                 0
loan_status:In Grace Period                                            0
loan_status:Late (16-30 days)                                          0
loan_status:Late (31-120 days)                                         0
purpose:car                                                            0
purpose:credit_card                                                    0
purpose:debt_consolidation                                             0
purpose:educational                                                    0
purpose:home_improvement                                               0
purpose:house                                                          0
purpose:major_purchase                                                 0
purpose:medical                                                        0
purpose:moving                                                         0
purpose:other                                                          0
purpose:renewable_energy                                               0
purpose:small_business                                                 0
purpose:vacation                                                       0
purpose:wedding                                                        0
addr_state:AK                                                          0
addr_state:AL                                                          0
addr_state:AR                                                          0
addr_state:AZ                                                          0
addr_state:CA                                                          0
addr_state:CO                                                          0
addr_state:CT                                                          0
addr_state:DC                                                          0
addr_state:DE                                                          0
addr_state:FL                                                          0
addr_state:GA                                                          0
addr_state:HI                                                          0
addr_state:IA                                                          0
addr_state:ID                                                          0
addr_state:IL                                                          0
addr_state:IN                                                          0
addr_state:KS                                                          0
addr_state:KY                                                          0
addr_state:LA                                                          0
addr_state:MA                                                          0
addr_state:MD                                                          0
addr_state:ME                                                          0
addr_state:MI                                                          0
addr_state:MN                                                          0
addr_state:MO                                                          0
addr_state:MS                                                          0
addr_state:MT                                                          0
addr_state:NC                                                          0
addr_state:NE                                                          0
addr_state:NH                                                          0
addr_state:NJ                                                          0
addr_state:NM                                                          0
addr_state:NV                                                          0
addr_state:NY                                                          0
addr_state:OH                                                          0
addr_state:OK                                                          0
addr_state:OR                                                          0
addr_state:PA                                                          0
addr_state:RI                                                          0
addr_state:SC                                                          0
addr_state:SD                                                          0
addr_state:TN                                                          0
addr_state:TX                                                          0
addr_state:UT                                                          0
addr_state:VA                                                          0
addr_state:VT                                                          0
addr_state:WA                                                          0
addr_state:WI                                                          0
addr_state:WV                                                          0
addr_state:WY                                                          0
initial_list_status:f                                                  0
initial_list_status:w                                                  0
good_bad                                                               0
dtype: int64
In [9]:
# We fill the missing values with zeroes.
loan_data_defaults['mths_since_last_delinq'] = loan_data_defaults['mths_since_last_delinq'].fillna(0)
loan_data_defaults['mths_since_last_record'] = loan_data_defaults['mths_since_last_record'].fillna(0)
In [10]:
# We calculate the dependent variable for the LGD model, the recovery rate, and add it to the defaults dataframe.
loan_data_defaults['recovery_rate'] = loan_data_defaults['recoveries'] / loan_data_defaults['funded_amnt']
loan_data_defaults['recovery_rate'].describe()
Out[10]:
count    43236.000000
mean         0.060820
std          0.089770
min          0.000000
25%          0.000000
50%          0.029466
75%          0.114044
max          1.220774
Name: recovery_rate, dtype: float64
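As a quick sanity check, the first defaulted loan in the sample (index 1) has recoveries of 117.08 on a funded amount of 2,500:

print(loan_data_defaults.loc[1, 'recoveries'] / loan_data_defaults.loc[1, 'funded_amnt'])
# 117.08 / 2500 = 0.046832, i.e. roughly 4.7% of the funded amount was recovered.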
In [11]:
formatted_mean = "{:.4f}".format(loan_data_defaults['recovery_rate'].mean())

print("Total Defaulted Loans                 : " ,loan_data_defaults['recovery_rate'].count())
print("Mean Recovery Rate on Defaulted Loans : " ,formatted_mean)
Total Defaulted Loans                 :  43236
Mean Recovery Rate on Defaulted Loans :  0.0608
In [12]:
loan_data_defaults['recovery_rate'] = np.where(loan_data_defaults['recovery_rate'] > 1, 
                                               1, loan_data_defaults['recovery_rate'])
loan_data_defaults['recovery_rate'] = np.where(loan_data_defaults['recovery_rate'] < 0, 
                                               0, loan_data_defaults['recovery_rate'])
# We set recovery rates that are greater than 1 to 1 and recovery rates that are less than 0 to 0.
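The same bounding can be written in one step with pandas' clip; an equivalent alternative sketch, not the notebook's original code:

loan_data_defaults['recovery_rate'] = loan_data_defaults['recovery_rate'].clip(lower=0, upper=1)
# Values below 0 become 0 and values above 1 become 1; everything else is unchanged.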
In [13]:
loan_data_defaults['CCF'] = (loan_data_defaults['funded_amnt'] - loan_data_defaults['total_rec_prncp']) / loan_data_defaults['funded_amnt']
# We calculate the dependent variable for the EAD model: the credit conversion factor (CCF).
# It is the ratio of the amount still outstanding at the moment of default
# (the funded amount less the total principal received) to the funded amount.
In [14]:
loan_data_defaults['CCF'].describe()
# Shows some descriptive statistics for the values of a column.
Out[14]:
count    43236.000000
mean         0.735952
std          0.200742
min          0.000438
25%          0.632088
50%          0.789908
75%          0.888543
max          1.000000
Name: CCF, dtype: float64
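Again as a sanity check, for the defaulted loan with index 1 (funded amount of 2,500, total principal received of 456.46):

print((loan_data_defaults.loc[1, 'funded_amnt'] - loan_data_defaults.loc[1, 'total_rec_prncp'])
      / loan_data_defaults.loc[1, 'funded_amnt'])
# (2500 - 456.46) / 2500 = 0.817416: about 82% of the facility was still outstanding at default.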
In [15]:
loan_data_defaults.to_csv('loan_data_defaults.csv')
# We save the data to a CSV file.
In [16]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [17]:
plt.title('Distribution Recovery Rate',fontsize=20)
plt.hist(loan_data_defaults['recovery_rate'], bins = 100);
# We plot a histogram of the recovery rate with 100 bins.
In [18]:
plt.title('Distribution CCF',fontsize=20)
plt.hist(loan_data_defaults['CCF'], bins = 100);
# We plot a histogram of a variable with 100 bins.
In [19]:
loan_data_defaults['recovery_rate_0_1'] = np.where(loan_data_defaults['recovery_rate'] == 0, 0, 1)
loan_data_defaults['recovery_rate_0_1'].head()
# We create a new variable which is 0 if recovery rate is 0 and 1 otherwise.
Out[19]:
1     1
8     1
9     1
12    1
14    1
Name: recovery_rate_0_1, dtype: int32
In [20]:
loan_data_defaults['recovery_rate_0_1'].tail()
Out[20]:
466254    0
466256    0
466276    1
466277    0
466281    0
Name: recovery_rate_0_1, dtype: int32
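The heavy point mass at zero in the recovery rate distribution (the 25th percentile above is exactly 0) is what motivates this indicator: it supports a two-stage LGD model in which stage 1 classifies whether any recovery occurs at all, and stage 2 estimates how much is recovered conditional on a positive recovery. A minimal sketch of how the stage 2 estimation sample would be selected (stage 1 is built below):

# Hypothetical stage 2 sample: defaulted loans with a positive recovery rate.
lgd_stage_2_data = loan_data_defaults[loan_data_defaults['recovery_rate_0_1'] == 1]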

8.2. Estimating the Stage 1 LGD Model


In [21]:
from sklearn.model_selection import train_test_split
In [22]:
# LGD model stage 1 datasets: recovery rate 0 or greater than 0.
lgd_inputs_stage_1_train, lgd_inputs_stage_1_test, lgd_targets_stage_1_train, lgd_targets_stage_1_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['recovery_rate_0_1'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
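A quick shape check confirms the 80/20 split; with 43,236 defaulted loans, we would expect roughly 34,588 training and 8,648 test observations:

print(lgd_inputs_stage_1_train.shape, lgd_inputs_stage_1_test.shape)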
Preparing the Inputs
In [23]:
features_all = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:MORTGAGE',
'home_ownership:NONE',
'home_ownership:OTHER',
'home_ownership:OWN',
'home_ownership:RENT',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:car',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:educational',
'purpose:home_improvement',
'purpose:house',
'purpose:major_purchase',
'purpose:medical',
'purpose:moving',
'purpose:other',
'purpose:renewable_energy',
'purpose:small_business',
'purpose:vacation',
'purpose:wedding',
'initial_list_status:f',
'initial_list_status:w',
'term_int',
'emp_length_int',
'mths_since_issue_d',
'mths_since_earliest_cr_line',
'funded_amnt',
'int_rate',
'installment',
'annual_inc',
'dti',
'delinq_2yrs',
'inq_last_6mths',
'mths_since_last_delinq',
'mths_since_last_record',
'open_acc',
'pub_rec',
'total_acc',
'acc_now_delinq',
'total_rev_hi_lim']
# List of all independent variables for the models.
In [24]:
features_reference_cat = ['grade:G',
'home_ownership:RENT',
'verification_status:Verified',
'purpose:credit_card',
'initial_list_status:f']
# List of the dummy variable reference categories. 
In [25]:
lgd_inputs_stage_1_train = lgd_inputs_stage_1_train[features_all]
# Here we keep only the variables we need for the model.
In [26]:
lgd_inputs_stage_1_train = lgd_inputs_stage_1_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [27]:
lgd_inputs_stage_1_train.isnull().sum()
# Check for missing values. We check whether the value of each row for each column is missing or not,
# then sum across columns.
Out[27]:
grade:A                                0
grade:B                                0
grade:C                                0
grade:D                                0
grade:E                                0
grade:F                                0
home_ownership:MORTGAGE                0
home_ownership:NONE                    0
home_ownership:OTHER                   0
home_ownership:OWN                     0
verification_status:Not Verified       0
verification_status:Source Verified    0
purpose:car                            0
purpose:debt_consolidation             0
purpose:educational                    0
purpose:home_improvement               0
purpose:house                          0
purpose:major_purchase                 0
purpose:medical                        0
purpose:moving                         0
purpose:other                          0
purpose:renewable_energy               0
purpose:small_business                 0
purpose:vacation                       0
purpose:wedding                        0
initial_list_status:w                  0
term_int                               0
emp_length_int                         0
mths_since_issue_d                     0
mths_since_earliest_cr_line            0
funded_amnt                            0
int_rate                               0
installment                            0
annual_inc                             0
dti                                    0
delinq_2yrs                            0
inq_last_6mths                         0
mths_since_last_delinq                 0
mths_since_last_record                 0
open_acc                               0
pub_rec                                0
total_acc                              0
acc_now_delinq                         0
total_rev_hi_lim                       0
dtype: int64
Estimating the Model
In [28]:
# P values for sklearn logistic regression.

# Class to display p-values for logistic regression in sklearn.

from sklearn import linear_model
import scipy.stats as stat

class LogisticRegression_with_p_values:
    
    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self,X,y):
        self.model.fit(X,y)
        
        #### Get p-values for the fitted model ####
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom,(X.shape[1],1)).T
        F_ij = np.dot((X / denom).T,X) ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0] / sigma_estimates # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores] ### two tailed test for p-values
        
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        #self.z_scores = z_scores
        self.p_values = p_values
        #self.sigma_estimates = sigma_estimates
        #self.F_ij = F_ij
In [29]:
reg_lgd_st_1 = LogisticRegression_with_p_values()
# We create an instance of the 'LogisticRegression_with_p_values' class defined above.
reg_lgd_st_1.fit(lgd_inputs_stage_1_train, lgd_targets_stage_1_train)
# Estimates the coefficients of the logistic regression
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
C:\Users\delga\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
In [30]:
feature_name = lgd_inputs_stage_1_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
In [31]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_lgd_st_1.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LogisticRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe by 1.
summary_table.loc[0] = ['Intercept', reg_lgd_st_1.intercept_[0]]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_1.p_values
# We take the values of the newly added 'p_values' attribute and store them in a variable.
p_values = np.append(np.nan,np.array(p_values))
# We add a 'NaN' at the beginning, to align the p-values with the intercept row.
summary_table['p_values'] = p_values
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
Out[31]:
Feature name Coefficients p_values
0 Intercept -1.491132e-04 NaN
1 grade:A -1.806131e-05 9.998289e-01
2 grade:B -1.007435e-04 9.988205e-01
3 grade:C -1.867464e-04 9.977367e-01
4 grade:D 2.154426e-05 9.997493e-01
5 grade:E 1.856218e-05 9.998042e-01
6 grade:F 1.001171e-04 9.990811e-01
7 home_ownership:MORTGAGE -4.783431e-05 9.984942e-01
8 home_ownership:NONE 1.214327e-06 9.999988e-01
9 home_ownership:OTHER 5.544467e-07 9.999989e-01
10 home_ownership:OWN -7.193095e-06 9.998607e-01
11 verification_status:Not Verified -1.103123e-04 9.970601e-01
12 verification_status:Source Verified -2.384667e-04 9.930049e-01
13 purpose:car 1.596011e-05 9.998903e-01
14 purpose:debt_consolidation -2.185229e-04 9.942409e-01
15 purpose:educational 9.575423e-07 9.999973e-01
16 purpose:home_improvement 1.415813e-05 9.998020e-01
17 purpose:house 1.520248e-05 9.999149e-01
18 purpose:major_purchase 2.382484e-05 9.997795e-01
19 purpose:medical 1.926731e-06 9.999854e-01
20 purpose:moving 4.028908e-06 9.999731e-01
21 purpose:other 5.031387e-05 9.992368e-01
22 purpose:renewable_energy 4.301377e-06 9.999899e-01
23 purpose:small_business 6.649088e-05 9.992567e-01
24 purpose:vacation 5.392512e-06 9.999701e-01
25 purpose:wedding 1.419223e-05 9.999226e-01
26 initial_list_status:w -1.047834e-03 9.689110e-01
27 term_int -3.451490e-03 1.676719e-01
28 emp_length_int -5.371616e-04 8.607217e-01
29 mths_since_issue_d 1.902701e-02 8.120148e-105
30 mths_since_earliest_cr_line -1.276420e-03 1.436173e-18
31 funded_amnt 4.328177e-05 1.486584e-05
32 int_rate 1.289677e-04 9.822643e-01
33 installment -9.036366e-04 3.598916e-03
34 annual_inc -5.885637e-07 8.207896e-02
35 dti -7.542635e-03 3.511480e-06
36 delinq_2yrs -7.240915e-05 9.960501e-01
37 inq_last_6mths 2.245953e-04 9.809502e-01
38 mths_since_last_delinq -1.082779e-03 3.246562e-02
39 mths_since_last_record -2.154423e-03 4.783218e-04
40 open_acc -2.972121e-03 3.635158e-01
41 pub_rec -5.971938e-05 9.987920e-01
42 total_acc -7.135397e-03 8.431064e-07
43 acc_now_delinq 4.634389e-06 9.999788e-01
44 total_rev_hi_lim -4.551291e-06 2.613813e-11
Testing the Model
In [33]:
lgd_inputs_stage_1_test = lgd_inputs_stage_1_test[features_all]
# Here we keep only the variables we need for the model.
In [34]:
lgd_inputs_stage_1_test = lgd_inputs_stage_1_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [35]:
y_hat_test_lgd_stage_1 = reg_lgd_st_1.model.predict(lgd_inputs_stage_1_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
In [36]:
y_hat_test_lgd_stage_1
Out[36]:
array([1, 1, 0, ..., 1, 1, 1])
In [37]:
y_hat_test_proba_lgd_stage_1 = reg_lgd_st_1.model.predict_proba(lgd_inputs_stage_1_test)
# Calculates the predicted probability values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
In [38]:
y_hat_test_proba_lgd_stage_1
# This is an array of arrays of predicted class probabilities for all classes.
# In this case, the first value of every sub-array is the probability of the observation belonging to the first class, i.e. 0,
# and the second value is the probability of it belonging to the second class, i.e. 1.
Out[38]:
array([[0.41294934, 0.58705066],
       [0.38572094, 0.61427906],
       [0.5178263 , 0.4821737 ],
       ...,
       [0.45880386, 0.54119614],
       [0.40552032, 0.59447968],
       [0.47566543, 0.52433457]])
In [39]:
y_hat_test_proba_lgd_stage_1 = y_hat_test_proba_lgd_stage_1[:, 1]
# From each row we keep only the element with index 1, i.e. the second column:
# the predicted probabilities of belonging to class 1.
In [40]:
y_hat_test_proba_lgd_stage_1
Out[40]:
array([0.58705066, 0.61427906, 0.4821737 , ..., 0.54119614, 0.59447968,
       0.52433457])
In [41]:
lgd_targets_stage_1_test_temp = lgd_targets_stage_1_test.copy()
# We copy the target series, so that resetting its index below does not alter the original.
In [42]:
lgd_targets_stage_1_test_temp.reset_index(drop = True, inplace = True)
# We reset the index of a dataframe.
In [43]:
df_actual_predicted_probs = pd.concat([lgd_targets_stage_1_test_temp, pd.DataFrame(y_hat_test_proba_lgd_stage_1)], axis = 1)
# Concatenates two dataframes.
In [44]:
df_actual_predicted_probs.columns = ['lgd_targets_stage_1_test', 'y_hat_test_proba_lgd_stage_1']
In [45]:
df_actual_predicted_probs.index = lgd_inputs_stage_1_test.index
# Makes the index of one dataframe equal to the index of another dataframe.
In [46]:
df_actual_predicted_probs.head()
Out[46]:
lgd_targets_stage_1_test y_hat_test_proba_lgd_stage_1
178928 1 0.587051
69814 1 0.614279
101396 0 0.482174
463268 1 0.555024
253729 0 0.411905
Estimating the Accuracy of the Model
In [47]:
import itertools
import numpy as np
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='CONFUSION MATRIX',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float')
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=20)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.3f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "white")

    plt.tight_layout()
    plt.ylabel('*TRUE LABEL*', fontsize=14)
    plt.xlabel('*PREDICTED LABEL*', fontsize=14)
    plt.show()
In [48]:
tr = 0.5
# We create a new column with an indicator,
# where every observation that has predicted probability greater than the threshold has a value of 1,
# and every observation that has predicted probability lower than the threshold has a value of 0.
df_actual_predicted_probs['y_hat_test_lgd_stage_1'] = np.where(df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'] > tr, 1, 0)
In [49]:
pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted'])
# Creates a cross-table where the actual values are displayed by rows and the predicted values by columns.
# This table is known as a Confusion Matrix.
Out[49]:
Predicted 0 1
Actual
0 1036 2726
1 683 4203
In [50]:
cm_lgd_N = confusion_matrix(df_actual_predicted_probs['lgd_targets_stage_1_test'], 
                          df_actual_predicted_probs['y_hat_test_lgd_stage_1'])
classes = ['No Recovery', 'Recovery']
plot_confusion_matrix(cm_lgd_N, classes,
                          normalize=False,
                          title='CONFUSION MATRIX - Threshold = 0.5',
                          cmap=plt.cm.RdYlGn)
Confusion matrix, without normalization
[[1036 2726]
 [ 683 4203]]
In [51]:
pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
# Here we divide each value of the table by the total number of observations,
# thus getting percentages, or rates.
Out[51]:
Predicted 0 1
Actual
0 0.119796 0.315217
1 0.078978 0.486008
In [52]:
cm_lgd_pc = pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], 
                        df_actual_predicted_probs['y_hat_test_lgd_stage_1'], 
                        rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
cm_arr = np.array(cm_lgd_pc)
classes = ['No Recovery', 'Recovery']
plot_confusion_matrix(cm_arr, classes,
                          normalize=True,
                          title='CONFUSION MATRIX - Threshold = 0.5',
                          cmap=plt.cm.RdYlGn)
Normalized confusion matrix
[[0.11979648 0.31521739]
 [0.0789778  0.48600833]]
In [53]:
(pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[0, 0] + (pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[1, 1]
# Here we calculate Accuracy of the model, which is the sum of the diagonal rates.
Out[53]:
0.605804810360777
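The same figure can be obtained directly with sklearn. A minimal sketch using accuracy_score on the columns defined above:
from sklearn.metrics import accuracy_score
accuracy_score(df_actual_predicted_probs['lgd_targets_stage_1_test'],
               df_actual_predicted_probs['y_hat_test_lgd_stage_1'])
# Approximately 0.606 at the 0.5 threshold, matching the diagonal sum above.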
In [54]:
from sklearn.metrics import roc_curve, roc_auc_score
In [55]:
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['lgd_targets_stage_1_test'], 
                                 df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
# Returns the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.
# As a result, we get three arrays: the false positive rates, the true positive rates, and the thresholds.
# We store each of the three arrays in a separate variable.
In [56]:
plt.plot(fpr, tpr)
# We plot the false positive rate along the x-axis and the true positive rate along the y-axis,
# thus plotting the ROC curve.
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
# We plot a dashed black diagonal reference line (the no-skill benchmark).
plt.xlabel('False positive rate')
# We name the x-axis "False positive rate".
plt.ylabel('True positive rate')
# We name the x-axis "True positive rate".
plt.title('ROC curve')
# We name the graph "ROC curve".
Out[56]:
Text(0.5, 1.0, 'ROC curve')
In [57]:
AUROC = roc_auc_score(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
# Calculates the Area Under the Receiver Operating Characteristic Curve (AUROC)
# from a set of actual values and their predicted probabilities.
AUROC
Out[57]:
0.650978133446841
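As a side note (not computed in the original notebook), credit risk practitioners often restate the AUROC as a Gini coefficient:
gini = AUROC * 2 - 1
# Gini = 2 * AUROC - 1, approximately 0.302 for the AUROC of 0.651 above.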
Saving the Model
In [58]:
import pickle
In [59]:
pickle.dump(reg_lgd_st_1, open('lgd_model_stage_1.sav', 'wb'))
# Here we export our model to a 'SAV' file with file name 'lgd_model_stage_1.sav'.
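To reload the pickled model in a later session, the inverse call would be (a sketch, assuming the file is in the working directory):
reg_lgd_st_1 = pickle.load(open('lgd_model_stage_1.sav', 'rb'))
# Restores the fitted 'LogisticRegression_with_p_values' object from disk.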

8.3. Stage 2 LGD: Linear Regression


In [60]:
lgd_stage_2_data = loan_data_defaults[loan_data_defaults['recovery_rate_0_1'] == 1]
# Here we take only rows where the original recovery rate variable is greater than zero,
# i.e. where the indicator variable we created is equal to 1.
In [61]:
# LGD model stage 2 datasets: how much more than 0 is the recovery rate
lgd_inputs_stage_2_train, lgd_inputs_stage_2_test, lgd_targets_stage_2_train, lgd_targets_stage_2_test = train_test_split(lgd_stage_2_data.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), lgd_stage_2_data['recovery_rate'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
In [62]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
In [63]:
# Since the p-values are obtained through certain statistics, we need the 'stat' module from scipy.stats
import scipy.stats as stat

# Since Python is object oriented, we can define our own LinearRegression class
# by subclassing the one from sklearn, and overwrite part of it so that it also computes p-values.
# Here's the full source code of the ORIGINAL class: https://github.com/scikit-learn/scikit-learn/blob/7b136e9/sklearn/linear_model/base.py#L362


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    The intercept is fitted by default (fit_intercept=True), as in sklearn.
    """
    
    # nothing changes in __init__
    def __init__(self, fit_intercept=True, normalize=False, copy_X=True,
                 n_jobs=1):
        self.fit_intercept = fit_intercept
        self.normalize = normalize
        self.copy_X = copy_X
        self.n_jobs = n_jobs

    
    def fit(self, X, y, n_jobs=1):
        # n_jobs is accepted for backward compatibility but not forwarded to sklearn's fit(),
        # whose third positional argument is sample_weight, not n_jobs.
        self = super(LinearRegression, self).fit(X, y)
        
        # Calculate SSE (sum of squared errors)
        # and SE (standard error)
        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([np.sqrt(np.diagonal(sse * np.linalg.inv(np.dot(X.T, X))))])

        # compute the t-statistic for each feature
        self.t = self.coef_ / se
        # find the p-value for each feature
        self.p = np.squeeze(2 * (1 - stat.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1])))
        return self
In [65]:
lgd_inputs_stage_2_train = lgd_inputs_stage_2_train[features_all]
# Here we keep only the variables we need for the model.
In [66]:
lgd_inputs_stage_2_train = lgd_inputs_stage_2_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [67]:
reg_lgd_st_2 = LinearRegression()
# We create an instance of the 'LinearRegression' class defined above.
reg_lgd_st_2.fit(lgd_inputs_stage_2_train, lgd_targets_stage_2_train)
# Estimates the coefficients of the linear regression
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
Out[67]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [68]:
feature_name = lgd_inputs_stage_2_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
In [69]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_lgd_st_2.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LinearRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe by 1.
summary_table.loc[0] = ['Intercept', reg_lgd_st_2.intercept_]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_2.p
# We take the values of the newly added 'p' attribute and store them in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add a 'NaN' at the beginning, to align the p-values with the intercept row.
summary_table['p_values'] = p_values.round(3)
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
Out[69]:
Feature name Coefficients p_values
0 Intercept 2.406858e-01 NaN
1 grade:A -6.826892e-02 0.000
2 grade:B -5.083556e-02 0.000
3 grade:C -3.748066e-02 0.000
4 grade:D -2.717310e-02 0.000
5 grade:E -1.315941e-02 0.002
6 grade:F -5.260168e-03 0.275
7 home_ownership:MORTGAGE 2.832212e-03 0.061
8 home_ownership:NONE 1.459035e-01 0.000
9 home_ownership:OTHER -9.475922e-03 0.644
10 home_ownership:OWN 5.000678e-03 0.040
11 verification_status:Not Verified 1.056585e-03 0.553
12 verification_status:Source Verified -1.009915e-03 0.535
13 purpose:car -2.995960e-03 0.634
14 purpose:debt_consolidation 8.206319e-05 0.965
15 purpose:educational 7.625467e-02 0.000
16 purpose:home_improvement -3.702374e-03 0.273
17 purpose:house -3.786803e-03 0.620
18 purpose:major_purchase 2.914439e-03 0.538
19 purpose:medical 1.078825e-02 0.074
20 purpose:moving 1.398692e-02 0.039
21 purpose:other 4.841345e-03 0.109
22 purpose:renewable_energy 2.420645e-02 0.142
23 purpose:small_business 6.212343e-04 0.869
24 purpose:vacation -3.002398e-03 0.731
25 purpose:wedding 2.034853e-02 0.006
26 initial_list_status:w 1.464671e-02 0.000
27 term_int 3.316229e-04 0.020
28 emp_length_int 8.727462e-05 0.635
29 mths_since_issue_d -1.521649e-03 0.000
30 mths_since_earliest_cr_line 3.418678e-05 0.000
31 funded_amnt -2.186999e-07 0.699
32 int_rate -2.544714e-03 0.000
33 installment -1.037621e-05 0.557
34 annual_inc 6.389841e-08 0.001
35 dti 1.775655e-04 0.069
36 delinq_2yrs 1.757943e-03 0.050
37 inq_last_6mths 1.274095e-03 0.018
38 mths_since_last_delinq -1.094747e-06 0.971
39 mths_since_last_record -5.558083e-05 0.181
40 open_acc -1.196505e-03 0.000
41 pub_rec 3.447322e-03 0.208
42 total_acc 4.766629e-04 0.000
43 acc_now_delinq 4.278394e-03 0.658
44 total_rev_hi_lim 2.263456e-07 0.000
Stage 2 – Linear Regression Evaluation
In [71]:
lgd_inputs_stage_2_test = lgd_inputs_stage_2_test[features_all]
# Here we keep only the variables we need for the model.
In [72]:
lgd_inputs_stage_2_test = lgd_inputs_stage_2_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [73]:
lgd_inputs_stage_2_test.columns.values
# We display the names of the remaining input columns as a check.
Out[73]:
array(['grade:A', 'grade:B', 'grade:C', 'grade:D', 'grade:E', 'grade:F',
       'home_ownership:MORTGAGE', 'home_ownership:NONE',
       'home_ownership:OTHER', 'home_ownership:OWN',
       'verification_status:Not Verified',
       'verification_status:Source Verified', 'purpose:car',
       'purpose:debt_consolidation', 'purpose:educational',
       'purpose:home_improvement', 'purpose:house',
       'purpose:major_purchase', 'purpose:medical', 'purpose:moving',
       'purpose:other', 'purpose:renewable_energy',
       'purpose:small_business', 'purpose:vacation', 'purpose:wedding',
       'initial_list_status:w', 'term_int', 'emp_length_int',
       'mths_since_issue_d', 'mths_since_earliest_cr_line', 'funded_amnt',
       'int_rate', 'installment', 'annual_inc', 'dti', 'delinq_2yrs',
       'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'total_acc',
       'acc_now_delinq', 'total_rev_hi_lim'], dtype=object)
In [74]:
y_hat_test_lgd_stage_2 = reg_lgd_st_2.predict(lgd_inputs_stage_2_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
In [75]:
lgd_targets_stage_2_test_temp = lgd_targets_stage_2_test
In [76]:
lgd_targets_stage_2_test_temp = lgd_targets_stage_2_test_temp.reset_index(drop = True)
# We reset the index of a dataframe.
In [77]:
pd.concat([lgd_targets_stage_2_test_temp, pd.DataFrame(y_hat_test_lgd_stage_2)], axis = 1).corr()
# We calculate the correlation between actual and predicted values.
Out[77]:
recovery_rate 0
recovery_rate 1.000000 0.307996
0 0.307996 1.000000
In [78]:
corr_mat = pd.concat([lgd_targets_stage_2_test_temp, pd.DataFrame(y_hat_test_lgd_stage_2)], axis = 1).corr()

corr_arr = np.array(corr_mat)
classes = ['   ', '   ']
plot_confusion_matrix(corr_arr, classes,
                          normalize=True,
                          title='CORRELATION MATRIX - Act Vs Pred Recov Rates',
                          cmap=plt.cm.RdYlGn)
Normalized confusion matrix
[[1.        0.3079956]
 [0.3079956 1.       ]]
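Since seaborn is already imported, the same correlation matrix could also be drawn with a purpose-built heatmap instead of reusing the confusion-matrix plotter. A sketch:
sns.heatmap(corr_mat, annot=True, cmap='RdYlGn', vmin=-1, vmax=1)
# Draws the 2x2 correlation matrix with annotated cells on the same colour scale.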
In [79]:
sns.distplot(lgd_targets_stage_2_test - y_hat_test_lgd_stage_2)
# We plot the distribution of the residuals.
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x16f90b74cf8>
In [80]:
pickle.dump(reg_lgd_st_2, open('lgd_model_stage_2.sav', 'wb'))
# Here we export our model to a 'SAV' file with file name 'lgd_model_stage_2.sav'.

8.4. LGD: Combining Stages 1 and 2


In [81]:
y_hat_test_lgd_stage_2_all = reg_lgd_st_2.predict(lgd_inputs_stage_1_test)
# We apply the stage 2 model to the full stage 1 test set, i.e. to all defaulted loans,
# not only those with a positive recovery.
In [82]:
y_hat_test_lgd_stage_2_all
Out[82]:
array([0.1193906 , 0.09605635, 0.13367631, ..., 0.12078611, 0.11587422,
       0.15667447])
In [83]:
y_hat_test_lgd = y_hat_test_lgd_stage_1 * y_hat_test_lgd_stage_2_all
# Here we combine the predictions of the models from the two stages.
In [84]:
pd.DataFrame(y_hat_test_lgd).describe()
# Shows some descriptive statistics for the values of a column.
Out[84]:
0
count 8648.000000
mean 0.086175
std 0.049851
min -0.007634
25% 0.061983
50% 0.100503
75% 0.122541
max 0.236973
In [85]:
pd.DataFrame(y_hat_test_lgd).sum()/pd.DataFrame(y_hat_test_lgd).count()
# The mean of the combined predictions, computed explicitly as sum over count.
Out[85]:
0    0.086175
dtype: float64
In [86]:
y_hat_test_lgd = np.where(y_hat_test_lgd < 0, 0, y_hat_test_lgd)
y_hat_test_lgd = np.where(y_hat_test_lgd > 1, 1, y_hat_test_lgd)
# We set predicted values that are greater than 1 to 1 and predicted values that are less than 0 to 0.
In [87]:
pd.DataFrame(y_hat_test_lgd).describe()
# Shows some descriptive statistics for the values of a column.
Out[87]:
0
count 8648.000000
mean 0.086177
std 0.049848
min 0.000000
25% 0.061983
50% 0.100503
75% 0.122541
max 0.236973
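Wrapping the two stages into one helper makes the combination logic explicit. A minimal sketch, assuming the model and variable names used in this notebook:
def predict_recovery_rate(inputs, stage_1_model, stage_2_model):
    # Stage 1 predicts whether any recovery occurs (0/1);
    # stage 2 predicts how much is recovered, given that a recovery occurs.
    rr = stage_1_model.model.predict(inputs) * stage_2_model.predict(inputs)
    return np.clip(rr, 0, 1)   # recovery rates are bounded between 0 and 1

recovery_rates = predict_recovery_rate(lgd_inputs_stage_1_test, reg_lgd_st_1, reg_lgd_st_2)
# LGD then follows as 1 - recovery rate, as computed in section 9 below.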

8.5. EAD Computation

Estimation and Interpretation
In [88]:
# EAD model datasets
ead_inputs_train, ead_inputs_test, ead_targets_train, ead_targets_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['CCF'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
In [89]:
ead_inputs_train.columns.values
# We inspect the full list of available columns before narrowing to the model features.
Out[89]:
array(['Unnamed: 0', 'Unnamed: 0.1', 'id', 'member_id', 'loan_amnt',
       'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint',
       'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt',
       'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m',
       'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util',
       'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util',
       'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
       'emp_length_int', 'earliest_cr_line_date',
       'mths_since_earliest_cr_line', 'term_int', 'issue_d_date',
       'mths_since_issue_d', 'grade:A', 'grade:B', 'grade:C', 'grade:D',
       'grade:E', 'grade:F', 'grade:G', 'sub_grade:A1', 'sub_grade:A2',
       'sub_grade:A3', 'sub_grade:A4', 'sub_grade:A5', 'sub_grade:B1',
       'sub_grade:B2', 'sub_grade:B3', 'sub_grade:B4', 'sub_grade:B5',
       'sub_grade:C1', 'sub_grade:C2', 'sub_grade:C3', 'sub_grade:C4',
       'sub_grade:C5', 'sub_grade:D1', 'sub_grade:D2', 'sub_grade:D3',
       'sub_grade:D4', 'sub_grade:D5', 'sub_grade:E1', 'sub_grade:E2',
       'sub_grade:E3', 'sub_grade:E4', 'sub_grade:E5', 'sub_grade:F1',
       'sub_grade:F2', 'sub_grade:F3', 'sub_grade:F4', 'sub_grade:F5',
       'sub_grade:G1', 'sub_grade:G2', 'sub_grade:G3', 'sub_grade:G4',
       'sub_grade:G5', 'home_ownership:ANY', 'home_ownership:MORTGAGE',
       'home_ownership:NONE', 'home_ownership:OTHER',
       'home_ownership:OWN', 'home_ownership:RENT',
       'verification_status:Not Verified',
       'verification_status:Source Verified',
       'verification_status:Verified', 'loan_status:Charged Off',
       'loan_status:Current', 'loan_status:Default',
       'loan_status:Does not meet the credit policy. Status:Charged Off',
       'loan_status:Does not meet the credit policy. Status:Fully Paid',
       'loan_status:Fully Paid', 'loan_status:In Grace Period',
       'loan_status:Late (16-30 days)', 'loan_status:Late (31-120 days)',
       'purpose:car', 'purpose:credit_card', 'purpose:debt_consolidation',
       'purpose:educational', 'purpose:home_improvement', 'purpose:house',
       'purpose:major_purchase', 'purpose:medical', 'purpose:moving',
       'purpose:other', 'purpose:renewable_energy',
       'purpose:small_business', 'purpose:vacation', 'purpose:wedding',
       'addr_state:AK', 'addr_state:AL', 'addr_state:AR', 'addr_state:AZ',
       'addr_state:CA', 'addr_state:CO', 'addr_state:CT', 'addr_state:DC',
       'addr_state:DE', 'addr_state:FL', 'addr_state:GA', 'addr_state:HI',
       'addr_state:IA', 'addr_state:ID', 'addr_state:IL', 'addr_state:IN',
       'addr_state:KS', 'addr_state:KY', 'addr_state:LA', 'addr_state:MA',
       'addr_state:MD', 'addr_state:ME', 'addr_state:MI', 'addr_state:MN',
       'addr_state:MO', 'addr_state:MS', 'addr_state:MT', 'addr_state:NC',
       'addr_state:NE', 'addr_state:NH', 'addr_state:NJ', 'addr_state:NM',
       'addr_state:NV', 'addr_state:NY', 'addr_state:OH', 'addr_state:OK',
       'addr_state:OR', 'addr_state:PA', 'addr_state:RI', 'addr_state:SC',
       'addr_state:SD', 'addr_state:TN', 'addr_state:TX', 'addr_state:UT',
       'addr_state:VA', 'addr_state:VT', 'addr_state:WA', 'addr_state:WI',
       'addr_state:WV', 'addr_state:WY', 'initial_list_status:f',
       'initial_list_status:w'], dtype=object)
In [90]:
ead_inputs_train = ead_inputs_train[features_all]
# Here we keep only the variables we need for the model.
In [91]:
ead_inputs_train = ead_inputs_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [92]:
reg_ead = LinearRegression()
# We create an instance of the 'LinearRegression' class defined above.
reg_ead.fit(ead_inputs_train, ead_targets_train)
# Estimates the coefficients of the linear regression
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
Out[92]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [93]:
feature_name = ead_inputs_train.columns.values
In [94]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_ead.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LinearRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe by 1.
summary_table.loc[0] = ['Intercept', reg_ead.intercept_]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_2.p
# NOTE: this line mistakenly takes the p-values from the stage 2 LGD model rather than from the
# EAD model, so the 'p_values' column below is incorrect; the next cell redoes the table correctly.
p_values = np.append(np.nan,np.array(p_values))
# We add a 'NaN' at the beginning, to align the p-values with the intercept row.
summary_table['p_values'] = p_values
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
Out[94]:
Feature name Coefficients p_values
0 Intercept 1.109746e+00 NaN
1 grade:A -3.030033e-01 0.000000e+00
2 grade:B -2.364277e-01 0.000000e+00
3 grade:C -1.720232e-01 0.000000e+00
4 grade:D -1.198470e-01 1.970424e-12
5 grade:E -6.768713e-02 1.918578e-03
6 grade:F -2.045907e-02 2.748685e-01
7 home_ownership:MORTGAGE -6.343341e-03 6.050271e-02
8 home_ownership:NONE -5.539064e-03 9.092582e-05
9 home_ownership:OTHER -2.426052e-03 6.436926e-01
10 home_ownership:OWN -1.619582e-03 3.963089e-02
11 verification_status:Not Verified 5.339510e-05 5.528332e-01
12 verification_status:Source Verified 8.967822e-03 5.354622e-01
13 purpose:car 7.904787e-04 6.340924e-01
14 purpose:debt_consolidation 1.264922e-02 9.646959e-01
15 purpose:educational 9.643587e-02 5.368894e-09
16 purpose:home_improvement 1.923044e-02 2.729279e-01
17 purpose:house 1.607120e-02 6.200015e-01
18 purpose:major_purchase 2.984917e-02 5.376877e-01
19 purpose:medical 3.962479e-02 7.391253e-02
20 purpose:moving 4.577630e-02 3.865040e-02
21 purpose:other 3.706744e-02 1.089028e-01
22 purpose:renewable_energy 7.212969e-02 1.423251e-01
23 purpose:small_business 5.128674e-02 8.692143e-01
24 purpose:vacation 1.874863e-02 7.311861e-01
25 purpose:wedding 4.350522e-02 5.539872e-03
26 initial_list_status:w 1.318126e-02 4.662937e-15
27 term_int 4.551882e-03 2.042660e-02
28 emp_length_int -1.591478e-03 6.350976e-01
29 mths_since_issue_d -4.305274e-03 0.000000e+00
30 mths_since_earliest_cr_line -3.634030e-05 1.087757e-04
31 funded_amnt 2.212126e-06 6.992397e-01
32 int_rate -1.172652e-02 1.887379e-14
33 installment -6.865607e-05 5.568684e-01
34 annual_inc 5.021816e-09 1.308549e-03
35 dti 2.832769e-04 6.897933e-02
36 delinq_2yrs 4.833234e-04 5.043858e-02
37 inq_last_6mths 1.131678e-02 1.819423e-02
38 mths_since_last_delinq -1.965980e-04 9.708935e-01
39 mths_since_last_record -5.085639e-05 1.809354e-01
40 open_acc -2.142130e-03 1.365854e-09
41 pub_rec 6.782062e-03 2.079574e-01
42 total_acc 4.518110e-04 5.133116e-08
43 acc_now_delinq 9.974801e-03 6.583494e-01
44 total_rev_hi_lim 2.166527e-07 2.592628e-08
In [95]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg_ead.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg_ead.intercept_]
summary_table = summary_table.sort_index()
p_values = reg_ead.p
# This time the p-values come from the EAD model itself, correcting the previous cell.
p_values = np.append(np.nan,np.array(p_values))
summary_table['p_values'] = p_values
summary_table
Out[95]:
Feature name Coefficients p_values
0 Intercept 1.109746e+00 NaN
1 grade:A -3.030033e-01 0.000000e+00
2 grade:B -2.364277e-01 0.000000e+00
3 grade:C -1.720232e-01 0.000000e+00
4 grade:D -1.198470e-01 0.000000e+00
5 grade:E -6.768713e-02 0.000000e+00
6 grade:F -2.045907e-02 4.428795e-03
7 home_ownership:MORTGAGE -6.343341e-03 2.632464e-03
8 home_ownership:NONE -5.539064e-03 9.318931e-01
9 home_ownership:OTHER -2.426052e-03 9.335820e-01
10 home_ownership:OWN -1.619582e-03 6.366112e-01
11 verification_status:Not Verified 5.339510e-05 9.828295e-01
12 verification_status:Source Verified 8.967822e-03 7.828941e-05
13 purpose:car 7.904787e-04 9.330252e-01
14 purpose:debt_consolidation 1.264922e-02 5.898438e-07
15 purpose:educational 9.643587e-02 1.801025e-06
16 purpose:home_improvement 1.923044e-02 4.873543e-05
17 purpose:house 1.607120e-02 1.653651e-01
18 purpose:major_purchase 2.984917e-02 2.197793e-05
19 purpose:medical 3.962479e-02 5.238263e-06
20 purpose:moving 4.577630e-02 2.987383e-06
21 purpose:other 3.706744e-02 0.000000e+00
22 purpose:renewable_energy 7.212969e-02 8.889877e-03
23 purpose:small_business 5.128674e-02 0.000000e+00
24 purpose:vacation 1.874863e-02 1.152702e-01
25 purpose:wedding 4.350522e-02 2.032121e-04
26 initial_list_status:w 1.318126e-02 6.115181e-09
27 term_int 4.551882e-03 0.000000e+00
28 emp_length_int -1.591478e-03 4.404626e-10
29 mths_since_issue_d -4.305274e-03 0.000000e+00
30 mths_since_earliest_cr_line -3.634030e-05 2.742071e-03
31 funded_amnt 2.212126e-06 7.225181e-03
32 int_rate -1.172652e-02 0.000000e+00
33 installment -6.865607e-05 7.429261e-03
34 annual_inc 5.021816e-09 8.574696e-01
35 dti 2.832769e-04 3.632507e-02
36 delinq_2yrs 4.833234e-04 6.946456e-01
37 inq_last_6mths 1.131678e-02 0.000000e+00
38 mths_since_last_delinq -1.965980e-04 3.220434e-06
39 mths_since_last_record -5.085639e-05 3.291896e-01
40 open_acc -2.142130e-03 4.218847e-15
41 pub_rec 6.782062e-03 4.252750e-02
42 total_acc 4.518110e-04 1.902931e-04
43 acc_now_delinq 9.974801e-03 5.012787e-01
44 total_rev_hi_lim 2.166527e-07 8.196014e-05
Model Validation
In [96]:
ead_inputs_test = ead_inputs_test[features_all]
# Here we keep only the variables we need for the model.
In [97]:
ead_inputs_test = ead_inputs_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [98]:
ead_inputs_test.columns.values
# A check that only the model features remain after dropping the reference categories.
Out[98]:
array(['grade:A', 'grade:B', 'grade:C', 'grade:D', 'grade:E', 'grade:F',
       'home_ownership:MORTGAGE', 'home_ownership:NONE',
       'home_ownership:OTHER', 'home_ownership:OWN',
       'verification_status:Not Verified',
       'verification_status:Source Verified', 'purpose:car',
       'purpose:debt_consolidation', 'purpose:educational',
       'purpose:home_improvement', 'purpose:house',
       'purpose:major_purchase', 'purpose:medical', 'purpose:moving',
       'purpose:other', 'purpose:renewable_energy',
       'purpose:small_business', 'purpose:vacation', 'purpose:wedding',
       'initial_list_status:w', 'term_int', 'emp_length_int',
       'mths_since_issue_d', 'mths_since_earliest_cr_line', 'funded_amnt',
       'int_rate', 'installment', 'annual_inc', 'dti', 'delinq_2yrs',
       'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'total_acc',
       'acc_now_delinq', 'total_rev_hi_lim'], dtype=object)
In [99]:
y_hat_test_ead = reg_ead.predict(ead_inputs_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
In [100]:
ead_targets_test_temp = ead_targets_test
In [101]:
ead_targets_test_temp = ead_targets_test_temp.reset_index(drop = True)
# We reset the index of a dataframe.
In [102]:
pd.concat([ead_targets_test_temp, pd.DataFrame(y_hat_test_ead)], axis = 1).corr()
# We calculate the correlation between actual and predicted values.
Out[102]:
CCF 0
CCF 1.000000 0.530654
0 0.530654 1.000000
In [103]:
corr_mat_2 = pd.concat([ead_targets_test_temp, pd.DataFrame(y_hat_test_ead)], axis = 1).corr()

corr_arr_2 = np.array(corr_mat_2)
classes = ['   ', '   ']
plot_confusion_matrix(corr_arr_2, classes,
                          normalize=True,
                          title='CORRELATION MATRIX - Act Vs Pred EAD',
                          cmap=plt.cm.RdYlGn)
Normalized confusion matrix
[[1.         0.53065383]
 [0.53065383 1.        ]]
In [104]:
sns.distplot(ead_targets_test - y_hat_test_ead)
# We plot the distribution of the residuals.
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x16f90b89748>
In [105]:
(ead_targets_test - y_hat_test_ead).mean()
Out[105]:
-0.002040148452007224
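mean_squared_error and r2_score were imported earlier but not used so far; they provide two further fit summaries for the EAD model (a sketch on the test set):
mean_squared_error(ead_targets_test, y_hat_test_ead) ** 0.5
# Root mean squared error of the CCF predictions.
r2_score(ead_targets_test, y_hat_test_ead)
# Share of the variance in the actual CCFs explained by the model.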
In [106]:
pd.DataFrame(y_hat_test_ead).describe()
# Shows some descriptive statistics for the values of a column.
Out[106]:
0
count 8648.000000
mean 0.736013
std 0.105194
min 0.384774
25% 0.661553
50% 0.731750
75% 0.810625
max 1.161088
In [107]:
y_hat_test_ead = np.where(y_hat_test_ead < 0, 0, y_hat_test_ead)
y_hat_test_ead = np.where(y_hat_test_ead > 1, 1, y_hat_test_ead)
# We set predicted values that are greater than 1 to 1 and predicted values that are less than 0 to 0.
In [108]:
pd.DataFrame(y_hat_test_ead).describe()
# Shows some descriptive statistics for the values of a column.
Out[108]:
0
count 8648.000000
mean 0.735992
std 0.105127
min 0.384774
25% 0.661553
50% 0.731750
75% 0.810625
max 1.000000
In [109]:
pd.DataFrame(y_hat_test_ead).sum()/pd.DataFrame(y_hat_test_ead).count()
# The mean predicted CCF, computed explicitly as sum over count.
Out[109]:
0    0.735992
dtype: float64

9. Expected Credit Loss

9.1. ECL Calculation


In [110]:
loan_data_preprocessed.head()
Out[110]:
[Output: the first five rows of 'loan_data_preprocessed', spanning all original, engineered and dummy-variable columns; too wide to reproduce legibly here.]
In [111]:
loan_data_preprocessed['mths_since_last_delinq'].fillna(0, inplace = True)
# We fill the missing values with zeroes.
In [112]:
loan_data_preprocessed['mths_since_last_record'].fillna(0, inplace = True)
# We fill the missing values with zeroes.
In [113]:
loan_data_preprocessed_lgd_ead = loan_data_preprocessed[features_all]
# Here we keep only the variables we need for the model.
In [114]:
loan_data_preprocessed_lgd_ead = loan_data_preprocessed_lgd_ead.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
In [115]:
loan_data_preprocessed['recovery_rate_st_1'] = reg_lgd_st_1.model.predict(loan_data_preprocessed_lgd_ead)
# We apply the stage 1 LGD model and calculate predicted values.
In [116]:
loan_data_preprocessed['recovery_rate_st_2'] = reg_lgd_st_2.predict(loan_data_preprocessed_lgd_ead)
# We apply the stage 2 LGD model and calculate predicted values.
In [117]:
loan_data_preprocessed['recovery_rate'] = loan_data_preprocessed['recovery_rate_st_1'] * loan_data_preprocessed['recovery_rate_st_2']
# We combine the predicted values from the stage 1 predicted model and the stage 2 predicted model
# to calculate the final estimated recovery rate.
In [118]:
loan_data_preprocessed['recovery_rate'] = np.where(loan_data_preprocessed['recovery_rate'] < 0, 0, loan_data_preprocessed['recovery_rate'])
loan_data_preprocessed['recovery_rate'] = np.where(loan_data_preprocessed['recovery_rate'] > 1, 1, loan_data_preprocessed['recovery_rate'])
# We set estimated recovery rates that are greater than 1 to 1, and those that are less than 0 to 0.
In [119]:
loan_data_preprocessed['LGD'] = 1 - loan_data_preprocessed['recovery_rate']
# We calculate estimated LGD. Estimated LGD equals 1 - estimated recovery rate.
In [120]:
loan_data_preprocessed['LGD'].describe()
# Shows some descriptive statistics for the values of a column.
Out[120]:
count    466285.000000
mean          0.921094
std           0.057400
min           0.659786
25%           0.874298
50%           0.899998
75%           1.000000
max           1.000000
Name: LGD, dtype: float64
In [121]:
loan_data_preprocessed['CCF'] = reg_ead.predict(loan_data_preprocessed_lgd_ead)
# We apply the EAD model to calculate estimated credit conversion factor.
In [122]:
loan_data_preprocessed['CCF'] = np.where(loan_data_preprocessed['CCF'] < 0, 0, loan_data_preprocessed['CCF'])
loan_data_preprocessed['CCF'] = np.where(loan_data_preprocessed['CCF'] > 1, 1, loan_data_preprocessed['CCF'])
# We bound the estimated CCF to the interval [0, 1] in the same way.
In [123]:
loan_data_preprocessed['EAD'] = loan_data_preprocessed['CCF'] * loan_data_preprocessed_lgd_ead['funded_amnt']
# We calculate estimated EAD. Estimated EAD equals estimated CCF multiplied by funded amount.
In [124]:
loan_data_preprocessed['EAD'].describe()
# Shows some descriptive statistics for the values of a column.
Out[124]:
count    466285.000000
mean      10814.846760
std        6935.184562
min         190.347372
25%        5495.101413
50%        9208.479591
75%       14692.844549
max       35000.000000
Name: EAD, dtype: float64
In [125]:
loan_data_preprocessed.head()
Out[125]:
(first five rows of loan_data_preprocessed: the frame now carries the fitted columns recovery_rate_st_1, recovery_rate_st_2, recovery_rate, LGD, CCF and EAD alongside the original loan attributes; wide output omitted for readability)
In [126]:
loan_data_inputs_train = pd.read_csv('cr_inp_train.csv', index_col = 0)
# We import the training-split input data for the PD model.
In [127]:
loan_data_inputs_test = pd.read_csv('cr_inp_test.csv', index_col = 0)
# We import the test-split input data for the PD model.
In [128]:
loan_data_inputs_pd = pd.concat([loan_data_inputs_train, loan_data_inputs_test], axis = 0)
# We concatenate the two dataframes along the rows.
In [129]:
loan_data_inputs_pd.shape
Out[129]:
(466285, 324)
In [130]:
loan_data_inputs_pd.head()
Out[130]:
(first five rows of loan_data_inputs_pd: its 324 columns comprise the raw loan attributes plus the fine-classed dummy variables and factor bins used by the PD model; wide output omitted for readability)
In [131]:
features_all_pd = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86']
In [132]:
ref_categories_pd = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
In [133]:
loan_data_inputs_pd_temp = loan_data_inputs_pd[features_all_pd]
# Here we keep only the variables we need for the model.
In [134]:
loan_data_inputs_pd_temp = loan_data_inputs_pd_temp.drop(ref_categories_pd, axis = 1)
# Here we remove the dummy variable reference categories.
In [135]:
loan_data_inputs_pd_temp.shape
Out[135]:
(466285, 84)
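The 84 remaining columns are the 101 dummy variables in features_all_pd less the 17 reference categories dropped above.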
In [136]:
import pickle
In [137]:
reg_pd = pickle.load(open('pd_model.sav', 'rb'))
# We import the PD model, stored in the 'pd_model.sav' file.
In [138]:
reg_pd.model.predict_proba(loan_data_inputs_pd_temp)[:, 0]
# We apply the PD model to calculate estimated default probabilities.
# Column 0 of predict_proba is the probability of the first class, here the 'bad' (default) outcome under the good_bad coding.
Out[138]:
array([0.02958535, 0.09214789, 0.03735891, ..., 0.02678653, 0.04020847,
       0.04763345])
In [139]:
loan_data_inputs_pd['PD'] = reg_pd.model.predict_proba(loan_data_inputs_pd_temp)[:, 0]
# We apply the PD model to calculate estimated default probabilities.
In [140]:
loan_data_inputs_pd['PD'].head()
Out[140]:
427211    0.029585
206088    0.092148
136020    0.037359
412305    0.204329
36159     0.200845
Name: PD, dtype: float64
In [141]:
loan_data_inputs_pd['PD'].describe()
# Shows some descriptive statistics for the values of a column.
Out[141]:
count    466285.000000
mean          0.109307
std           0.070917
min           0.007314
25%           0.056064
50%           0.093493
75%           0.146558
max           0.635822
Name: PD, dtype: float64
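The mean estimated PD of about 10.9% lines up with the exposure-weighted portfolio PD of 0.1088 computed below.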
In [142]:
loan_data_preprocessed_new = pd.concat([loan_data_preprocessed, loan_data_inputs_pd], axis = 1)
# We concatenate, along the columns, the dataframe holding the LGD and EAD estimates and the dataframe holding the PD estimates.
In [143]:
loan_data_preprocessed_new.shape
Out[143]:
(466285, 540)
In [144]:
loan_data_preprocessed_new.head()
Out[144]:
(first five rows of loan_data_preprocessed_new: the column-wise concatenation duplicates the shared loan attributes and appends the PD column after the LGD, CCF and EAD columns, for 540 columns in total; wide output omitted for readability)
In [145]:
loan_data_preprocessed_new['EL'] = loan_data_preprocessed_new['PD'] * loan_data_preprocessed_new['LGD'] * loan_data_preprocessed_new['EAD']
# We calculate Expected Loss. EL = PD * LGD * EAD.
In [146]:
loan_data_preprocessed_new['EL'].describe()
# Shows some descriptive statistics for the values of a column.
Out[146]:
count    466285.000000
mean       1076.294727
std        1090.970241
min           9.542825
25%         355.816284
50%         706.183752
75%        1396.046226
max       11909.918457
Name: EL, dtype: float64
In [147]:
Output = loan_data_preprocessed_new[['funded_amnt', 'PD', 'LGD', 'EAD', 'EL']]
# The concatenation duplicated some columns, so we keep only the first occurrence of each.
Output = Output.loc[:, ~Output.columns.duplicated()]
Output.head()
Out[147]:
funded_amnt PD LGD EAD EL
0 5000 0.164761 0.913729 2949.608449 444.052967
1 2500 0.282340 0.915482 1944.433378 502.591700
2 2400 0.229758 0.919484 1579.934302 333.775488
3 10000 0.208891 0.904924 6606.559612 1248.839565
4 3000 0.129555 0.911453 2124.631667 250.883310
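As a spot check on the first receivable: EL = PD × LGD × EAD = 0.164761 × 0.913729 × 2,949.61 ≈ 444.05, which matches the EL column.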
In [148]:
EAD_LGD = Output['EAD'] * Output['LGD']
# LGD-weighted exposure per loan.
Weight = EAD_LGD / EAD_LGD.sum()
Output['Weight'] = Weight
# Each loan's contribution to the exposure-weighted portfolio PD.
Wtd_PD = Output['Weight'] * Output['PD']
Output['Wtd_PD'] = Wtd_PD
Output.head()
Out[148]:
funded_amnt PD LGD EAD EL Weight Wtd_PD
0 5000 0.164761 0.913729 2949.608449 444.052967 5.841951e-07 9.625228e-08
1 2500 0.282340 0.915482 1944.433378 502.591700 3.858504e-07 1.089410e-07
2 2400 0.229758 0.919484 1579.934302 333.775488 3.148903e-07 7.234869e-08
3 10000 0.208891 0.904924 6606.559612 1248.839565 1.295877e-06 2.706966e-07
4 3000 0.129555 0.911453 2124.631667 250.883310 4.197532e-07 5.438110e-08
In [149]:
Output['Wtd_PD'].sum()
# The exposure-weighted average PD for the portfolio.
Out[149]:
0.1087824629037051
In [151]:
EAD_LGD = Output['EAD'] * Output['LGD']
EAD_LGD.sum()
# Total loss exposure: the sum of EAD × LGD across the portfolio.
Out[151]:
4613428243.825036
In [152]:
EAD_LGD.sum() * Output['Wtd_PD'].sum()
# ECL under independence: Σ(EAD × LGD) × weighted-average PD.
Out[152]:
501860086.79280233
In [153]:
Output['EAD'].sum()
# Total expected exposure at default.
Out[153]:
5042800821.555224
In [155]:
EAD_LGD.sum()/Output['EAD'].sum() * Output['EAD'].sum() * Output['Wtd_PD'].sum()
# The same ECL, decomposed as average LGD × total EAD × weighted-average PD.
Out[155]:
501860086.79280233
In [156]:
EAD_LGD.sum()/Output['EAD'].sum()
# Portfolio-average LGD, weighted by exposure.
Out[156]:
0.9148543452490023
In [157]:
RR = 1 - (EAD_LGD.sum()/Output['EAD'].sum())
RR
# The implied portfolio-average expected recovery rate.
Out[157]:
0.08514565475099767
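These scratch cells reconcile the portfolio aggregates: Σ(EAD × LGD) is 4,613,428,244, which against a total EAD of 5,042,800,822 implies a portfolio-average LGD of 0.9149 and an expected recovery rate of 1 − 0.9149 = 0.0851; multiplying Σ(EAD × LGD) by the exposure-weighted PD of 0.1088 reproduces the independence-case ECL of 501,860,087.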
In [158]:
import math
Ave_PD = Output['Wtd_PD'].sum()
# Exponential interpolation for the asset correlation, analogous in form to the
# Basel IRB correlation function, with decay factor 50 and bounds 0.0039 and 0.04.
Ass_corr_1 = 0.04 * (1 - math.exp(-50 * Ave_PD)) / (1 - math.exp(-50))
Ass_corr_2 = 0.0039 * (1 - (1 - math.exp(-50 * Ave_PD)) / (1 - math.exp(-50)))
Ass_corr = Ass_corr_1 + Ass_corr_2
Ass_corr
Out[158]:
0.03984320722969012
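In closed form, the interpolation above is

$$\rho(\overline{PD}) \;=\; 0.04\,\frac{1-e^{-50\,\overline{PD}}}{1-e^{-50}} \;+\; 0.0039\left(1-\frac{1-e^{-50\,\overline{PD}}}{1-e^{-50}}\right),$$

so the assumed asset correlation moves between 0.0039 (as the average PD approaches zero) and 0.04 (as it grows large); at the portfolio average PD of 0.1088 it evaluates to 0.0398.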
In [159]:
from scipy.stats import norm
Eco_Scen = 0.70
# We condition the average PD on an economic scenario set at the 70th
# percentile of the systematic risk factor, using the asset correlation above.
a = norm.ppf(Ave_PD)
b = norm.ppf(Eco_Scen)
Cor_Coef = Ass_corr
c = (a + math.sqrt(Cor_Coef) * b) / math.sqrt(1 - Cor_Coef)
PD_Corr = norm.cdf(c, loc = 0, scale = 1)
PD_Corr
Out[159]:
0.12475750738254893
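The cell above is the Vasicek single-factor adjustment: the scenario-conditional portfolio PD is

$$PD_{corr} \;=\; \Phi\!\left(\frac{\Phi^{-1}(\overline{PD}) + \sqrt{\rho}\,\Phi^{-1}(s)}{\sqrt{1-\rho}}\right),$$

where $\Phi$ is the standard normal CDF, $\overline{PD} = 0.1088$ is the weighted-average PD, $\rho = 0.0398$ is the asset correlation and $s = 0.70$ is the scenario percentile; this lifts the portfolio PD from 10.88% to 12.48%.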

9.2. ECL Model Output

In [160]:
title = "CREDIT CARD RECEIVABLES PORTFOLIO - RISK SUMMARY"
bolded_title = "\033[34;1;4m" + title + "\033[0m"
formatted_PD  = "{:.4f}".format(Output['Wtd_PD'].sum())
formatted_EAD = "{:,.0f}".format(Output['EAD'].sum())
formatted_RR = "{:.4f}".format(1- (EAD_LGD.sum()/Output['EAD'].sum()))
formatted_ECL= "{:,.0f}".format(Output['EL'].sum())
formatted_ECL_pc= "{:.4f}".format(Output['EL'].sum()
                                  /Output['funded_amnt'].sum())
formatted_FA= "{:,.0f}".format(Output['funded_amnt'].sum())
formatted_AC= "{:,.4f}".format(Ass_corr)
formatted_Eco= "{:,.3f}".format(Eco_Scen)
formatted_PD_Corr  = "{:,.4f}".format(PD_Corr)
formatted_ECL_Corr  = "{:,.0f}".format(EAD_LGD.sum()/Output['EAD'].sum()*Output['EAD'].sum()* PD_Corr)
formatted_ECL_Corr_pc= "{:.4f}".format(EAD_LGD.sum()/
                                       Output['EAD'].sum()*Output['EAD'].sum()* PD_Corr/Output['funded_amnt'].sum())
print("            ")
print("            ", bolded_title)
print("            ")
print("------------------------------------------------------------------")
print("Current Funded Amount                         : " , formatted_FA)
print("------------------------------------------------------------------")
print("Weighted Average Probability of Default       : " , formatted_PD)
print("Expected Exposure at Default                  : " , formatted_EAD)
print("Expected Recovery Rate                        : " , formatted_RR)
print("\033[1mExpected Credit Loss Assuming Independence\033[0m    : " , "\033[1m" + formatted_ECL + "\033[0m")
print("\033[1mECL Assuming Independence ÷ Funded Amount\033[0m     : " , "\033[1m" + formatted_ECL_pc + "\033[0m")
print("------------------------------------------------------------------")
print("Asset Correlation                             : " , formatted_AC)
print("Economic Scenario                             : " , formatted_Eco)
print("Portfolio Prob. Default Assuming Correlation  : " , formatted_PD_Corr)
print("\033[1mExpected Credit Loss Assuming Correlation\033[0m     : " , "\033[1m" + formatted_ECL_Corr + "\033[0m")
print("\033[1mECL Assuming Correlation ÷ Funded Amount\033[0m      : " , "\033[1m" + formatted_ECL_Corr_pc + "\033[0m")
print("__________________________________________________________________")
print("            ")
            
             CREDIT CARD RECEIVABLES PORTFOLIO - RISK SUMMARY
            
------------------------------------------------------------------
Current Funded Amount                         :  6,664,052,450
------------------------------------------------------------------
Weighted Average Probability of Default       :  0.1088
Expected Exposure at Default                  :  5,042,800,822
Expected Recovery Rate                        :  0.0851
Expected Credit Loss Assuming Independence    :  501,860,087
ECL Assuming Independence ÷ Funded Amount     :  0.0753
------------------------------------------------------------------
Asset Correlation                             :  0.0398
Economic Scenario                             :  0.700
Portfolio Prob. Default Assuming Correlation  :  0.1248
Expected Credit Loss Assuming Correlation     :  575,559,808
ECL Assuming Correlation ÷ Funded Amount      :  0.0864
__________________________________________________________________