IFRS 9.5.5 introduces an impairment model for financial assets based on Expected Credit Losses (ECL), which requires entities to recognise a loss allowance before losses materialise, using both historical and forward-looking information. IFRS 9.5.5.1 stipulates that an entity “shall recognise a loss allowance for expected credit losses on a financial asset that is measured in accordance with paragraphs 4.1.2”, that is, financial assets “measured at amortised cost” held “to collect contractual cash flows” and whose “contractual terms of the financial asset give rise on specified dates to cash flows that are solely payments of principal and interest on the principal amount outstanding”. Referring to the impairment model’s input data, IFRS 9.5.5.4 expects entities to consider “all reasonable and supportable information, including that which is forward-looking”.
The ECL for Trade Receivables that contain a "significant financing component"1 under IFRS 15, such as credit card receivables, can be measured under the “Simplified Approach”. In contrast with the "General Approach", the Simplified Approach allows entities to recognise lifetime expected losses on all these assets without the need to identify significant increases in credit risk. In any case, because the maturities will typically be 12 months or less, the 12-month and lifetime ECLs would be the same. IFRS 9.5.5.15 states that "an entity shall always measure the loss allowance at an amount equal to lifetime expected credit losses for...trade receivables or contract assets that result from transactions that are within the scope of IFRS 15, and that…contain a significant financing component in accordance with IFRS 15, if the entity chooses as its accounting policy to measure the loss allowance at an amount equal to lifetime expected credit losses."
Lifetime expected credit loss is the discounted value of the expected credit losses that result from all possible default events over the expected life of a financial instrument. IFRS 9.5.5.17 clarifies that "An entity shall measure expected credit losses of a financial instrument in a way that reflects: (a) an unbiased and probability-weighted amount that is determined by evaluating a range of possible outcomes; (b) the time value of money." The term ‘default’ is not defined in IFRS 9. IFRS 9:B5.5.37 states that a definition of default should be "consistent with the definition used for internal credit risk management purposes". Entities will need to consider the requirements of this paragraph where it states there is a "rebuttable presumption that default does not occur later than when a financial asset is 90 days past due unless an entity has reasonable and supportable information to demonstrate that a more lagging default criterion is more appropriate".
IFRS 9.5.5.19 indicates that the maximum expected life is generally understood as the contractual life: "The maximum period to consider when measuring expected credit losses is the maximum contractual period (including extension options) over which the entity is exposed to credit risk and not a longer period". The expected period of exposure is more subjective. IFRS 9:B5.5.40 states that when determining expected life "an entity should consider factors such as historical information and experience about: (a) the period over which the entity was exposed to credit risk on similar financial instruments; (b) the length of time for related defaults to occur on similar financial instruments following a significant increase in credit risk; and (c) the credit risk management actions that an entity expects to take once the credit risk on the financial instrument has increased, such as the reduction or removal of undrawn limits."
With specific reference to revolving credit facilities, IFRS 9:B5.5.39 requires the entity to apply judgement regarding the time horizon of the credit exposure. Where financial instruments include both a loan and an undrawn commitment component (such as credit cards and overdraft facilities), the contractual ability to demand repayment and cancel the undrawn commitment does not necessarily limit the exposure to credit losses to the contractual period. For those financial instruments, management should measure ECL over the period that the entity is exposed to credit risk and over which ECL would not be mitigated by credit risk management actions, even if that period extends beyond the maximum contractual period. In the Illustrative Examples, IFRS 9:IE60 provides further guidance on which factors should be taken into consideration when determining the size and time horizon of the credit exposure: "At the reporting date the outstanding balance on the credit card portfolio is CU60,000 and the available undrawn facility is CU40,000. Bank A determines the expected life of the portfolio by estimating the period over which it expects to be exposed to credit risk on the facilities at the reporting date, taking into account: (a) the period over which it was exposed to credit risk on a similar portfolio of credit cards; (b) the length of time for related defaults to occur on similar financial instruments; and (c) past events that led to credit risk management actions because of an increase in credit risk on similar financial instruments, such as the reduction or removal of undrawn credit limits."
1 A significant financing component exists if the timing of payments agreed to by the parties to the contract (either explicitly or implicitly) provides the customer or the entity with a significant benefit of financing the transfer of goods or services to the customer. [IFRS 15:60]
- The ECL calculation model should calculate an unbiased and probability-weighted amount to be presented as an impairment to the book value of the financial assets in the Balance sheet.
- This unbiased and probability weighted amount is the difference between the present value of cashflows due under contract and the present value of cashflows that an entity expects to receive.
- The Expected Credit Loss is determined by the probability of default, the size of the exposure to defaulting customers, the expected recoverable amount in the event of default and the discount rate applied.
- The estimated size of the exposure is necessarily related to the expectations of the customers' drawdown of the undrawn commitment component over a defined time frame. The time frame will be governed by subjective evaluations focusing on how long it will take the entity to identify and take remedial action in relation to problem credit.
- The Lifetime Expected Credit Losses will have to incorporate the term structure of the default probability of the assets. In other words, the hazard rate or default intensity, which connotes an instantaneous rate of failure, should be used along with the exponential distribution to compute the cumulative probability of default for a given time horizon.
- The entity should apply a granular and dynamic approach to portfolio segmentation by grouping financial assets based on shared credit characteristics.
- As with all such forward-looking models, expected loss should be considered at an aggregate portfolio level, which generally involves incorporating some expectation of the effect of correlation between the constituent assets.
The future value of the Lifetime Expected Credit Loss of the portfolio at a future time $t$ is defined as a function of the probability of default $PD_t$, the expected exposure at the time of default $EAD_t$ and the size of the expected loss in the event of default $LGD_t$. The present value of this future value is obtained by discounting it at the Effective Interest Rate of the portfolio assets, $EIR$. Thus:

$$ECL_0 = \frac{PD_t \times EAD_t \times LGD_t}{(1 + EIR)^t}$$

$\lambda$ is the hazard rate or default intensity. More precisely, it is the (instantaneous) conditional probability of default over an infinitesimally small time interval $dt$, where $\tau$ denotes the default time:

$$\lambda \, dt = \Pr(t \le \tau < t + dt \mid \tau \ge t)$$

The estimation of the default probability of each credit portfolio constituent is achieved with the logit model, which employs the technique of logistic transformation to generate a sigmoid function bounded by 0 and 1:

$$PD_i = \frac{1}{1 + e^{-z_i}}$$

where $z_i$ is a linear regression function of the form:

$$z_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_n x_{n,i}$$

where $\beta_0, \beta_1, \dots, \beta_n$ are parameters that are estimated statistically and $x_{1,i}, \dots, x_{n,i}$ are scores, ratios and other explanatory variables for obligor $i$, transformed into binary "dummy" variables.

$PD_T$ is the average cumulative probability of default of the portfolio over $[0, T]$, that is, the output of the cumulative default-time distribution at time horizon $T$, where $T$ denotes the weighted average lifetime of the credit portfolio:

$$PD_T = 1 - e^{-\lambda T}$$
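As a minimal illustrative sketch (the hazard rate, exposure, LGD and discount-rate inputs below are assumptions for demonstration, not values from the text), the cumulative default probability implied by a constant hazard rate and the resulting discounted ECL could be computed as follows:

import numpy as np

def cumulative_pd(hazard_rate, horizon):
    # Cumulative probability of default over [0, T] under an exponential default-time distribution
    return 1 - np.exp(-hazard_rate * horizon)

def lifetime_ecl(pd_t, ead, lgd, eir, horizon):
    # Discounted expected credit loss: PD x EAD x LGD, discounted at the effective interest rate
    return (pd_t * ead * lgd) / (1 + eir) ** horizon

# Illustrative assumptions: 5% annual default intensity, 1-year weighted average life,
# CU60,000 drawn exposure, 80% loss given default, 20% effective interest rate
pd_1y = cumulative_pd(0.05, 1.0)
print(lifetime_ecl(pd_1y, 60000, 0.80, 0.20, 1.0))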
The Vasicek Model offers an elegant solution, allowing the computation of a portfolio default rate, $DR$, which integrates the impact of (negative) assumptions about future economic conditions and the effect of the correlation between the portfolio assets. The model takes three inputs:
* The weighted average standalone probability of default, denoted by $PD$;
* The average correlation of portfolio assets with the broader economy, denoted by $\rho$;
* A common systematic economic factor (such as GDP growth, general levels of credit quality, etc.), denoted by $Z$.
The default rate for an asymptotic portfolio, having estimated the average default probability, the default correlation parameter and the common market factor, is given by:

$$DR = N\left(\frac{N^{-1}(PD) - \sqrt{\rho}\,Z}{\sqrt{1-\rho}}\right)$$

$Z$ is a standard normal variable, $Z \sim N(0,1)$, representing the assumed severity of the economic downturn (downturns corresponding to negative realisations of $Z$). The higher the probability of default, the greater the correlation coefficient and the larger the assumed market downturn, the smaller the distance to default and the higher the associated default rate for the portfolio.
It may make more intuitive sense if the variable $Z$ is restated in terms of the inverse of the standard normal cumulative distribution and a probability input $\alpha$ ranging from 0.5 to 0.999, where the higher the input value, the more severe the assumed economic downturn. This results in:

$$DR = N\left(\frac{N^{-1}(PD) + \sqrt{\rho}\,N^{-1}(\alpha)}{\sqrt{1-\rho}}\right)$$
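A minimal sketch of this restated formula using scipy's standard normal distribution (the PD, correlation and downturn-severity inputs are illustrative assumptions):

import numpy as np
from scipy.stats import norm

def vasicek_default_rate(pd_avg, rho, alpha):
    # DR = N( (N^-1(PD) + sqrt(rho) * N^-1(alpha)) / sqrt(1 - rho) )
    return norm.cdf((norm.ppf(pd_avg) + np.sqrt(rho) * norm.ppf(alpha)) / np.sqrt(1 - rho))

# Illustrative assumptions: 5% average PD, 2% asset correlation, 99% assumed downturn severity
print(vasicek_default_rate(0.05, 0.02, 0.99))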
The correlation coefficient, $\rho$, can be obtained by adapting the Basel II IRB risk-weight formula for corporate exposures, which is based on the Vasicek model and which prescribes that correlations are bounded by upper and lower limits and are a function of the weighted average probability of default. For credit card default correlations, we employ the empirical study of Crook and Bellotti2 to set the lower bound at 0.396% and the upper bound at 4%, and we assume that correlation is an increasing function of the default probability:
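The exact functional form of the interpolation is not reproduced here; as a minimal sketch, one could borrow the exponential weighting scheme of the Basel IRB corporate formula with the credit card bounds above (the decay constant k = 50 is an assumption carried over from Basel, not a value from the Crook and Bellotti study), flipped so that correlation increases with PD as assumed in the text:

import numpy as np

def credit_card_correlation(pd_avg, rho_min=0.00396, rho_max=0.04, k=50):
    # Basel-style exponential weighting between a lower and an upper correlation bound.
    # NOTE: the Basel IRB formula makes correlation a decreasing function of PD;
    # here the weight is flipped so that correlation increases with PD, as assumed in the text.
    weight = (1 - np.exp(-k * pd_avg)) / (1 - np.exp(-k))
    return rho_min * (1 - weight) + rho_max * weight

print(credit_card_correlation(0.05))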
A "Two-stage" LGD model is implemented. The "Stage 1" model is a classification model to predict whether the loan will have a recovery rate (RR) greater than zero. The "Stage 2" model a regression-type model to predict the value of the recovered amount of when the recovery rate is expected to be positive. The predicted recovery is the expected value of the two combined models, that is, the product of a binary value representing the event of recovery and the expected recovery value. So, for obligor , predicted will be either:
Or:
Where is the predicted amount of postive RR obtained from a multivariate linear regression, is the probability of a postive RR obtained from a multivariate logistic regression assuming some threshold and is the obligor-specific recovery rate.
LGD is therefore:
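As a minimal sketch of this two-stage structure using scikit-learn (the feature matrix, recovery-rate vector and the 0.5 classification threshold are illustrative assumptions rather than values from the text):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def two_stage_lgd(X_train, rr_train, X_new, threshold=0.5):
    # Stage 1: classify whether any recovery occurs (RR > 0)
    stage1 = LogisticRegression(solver='lbfgs', max_iter=200)
    stage1.fit(X_train, (rr_train > 0).astype(int))
    # Stage 2: regress the size of the recovery rate on the observations with a positive RR
    stage2 = LinearRegression()
    stage2.fit(X_train[rr_train > 0], rr_train[rr_train > 0])
    # Combine: predicted RR is zero unless the recovery probability exceeds the threshold
    recovery_prob = stage1.predict_proba(X_new)[:, 1]
    rr_hat = np.where(recovery_prob > threshold, np.clip(stage2.predict(X_new), 0, 1), 0.0)
    return 1 - rr_hat  # LGD = 1 - RR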
For credit card portfolios, EAD estimation is bedevilled by the revolving nature of the credit line, which makes the exposure at the time of default difficult to predict. Because additional amounts are typically drawn in the period prior to default, taking the current balance of non-defaulted customers does not produce a sufficiently conservative estimate of the amount drawn by the time of default. One solution is to use historical data to derive a Credit Conversion Factor (CCF), the proportion of the current undrawn amount that is likely to be drawn down by the time of default. The dependent variable in the regression analysis will be:

$$CCF = \frac{\text{balance at default} - \text{balance at observation}}{\text{credit limit} - \text{balance at observation}}$$

So, for obligor $i$, the predicted $EAD_i$ will be:

$$EAD_i = \text{current balance}_i + CCF_i \times (\text{credit limit}_i - \text{current balance}_i)$$

where $CCF_i$ is the obligor-specific CCF multiplier obtained by applying the multivariate linear regression function to the obligor's data.
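A minimal sketch of the resulting EAD calculation, assuming a CCF regression has already been fitted on historical default data (the function and variable names below are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

def predict_ead(ccf_model, X_new, balance, limit):
    # EAD_i = current balance + CCF_i * undrawn commitment, with the predicted CCF capped to [0, 1]
    ccf_hat = np.clip(ccf_model.predict(X_new), 0, 1)
    return balance + ccf_hat * (limit - balance)

# Illustrative usage, assuming a CCF regression already fitted on historical defaults:
# ccf_model = LinearRegression().fit(X_hist, ccf_hist)
# ead_hat = predict_ead(ccf_model, X_new, current_balance, credit_limit)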
2 J. Crook & T. Bellotti (2012) Asset correlations for credit card defaults, Applied Financial Economics, 22:2, 87-95
To avoid any suggestion of the selective use of raw data or the gaming of model results, the procedure for treating raw data should be transparent and rigorous. For example:
import numpy as np
import pandas as pd
# 1) Retrieve loan data into dataframe
loan_data = pd.read_csv('loan_data_2007_2014.csv')
# 2) Convert string values to integers where necessary. First removing text...
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace(r'\+ years', '', regex=True)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')
#...then converting string datatype to numeric datatype
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])
# 2) (continued) Convert loan term strings to integers by removing the ' months' text
loan_data['term_int'] = pd.to_numeric(loan_data['term'].str.replace(' months', ''))
# 3) Convert string points in time to numeric periods of time where necessary.First converting to datetime format...
loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format = '%b-%y')
#...then converting to a new passage of time variable
loan_data['mths_since_earliest_cr_line'] = round(pd.to_numeric((pd.to_datetime('2017-12-01')
- loan_data['earliest_cr_line_date'])
/ np.timedelta64(1, 'M')))
loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format = '%b-%y')
loan_data['mths_since_issue_d'] = round(pd.to_numeric((pd.to_datetime('2017-12-01')
- loan_data['issue_d_date'])
/ np.timedelta64(1, 'M')))
# 4) Transform all discrete variables into dummy variables and concatenate in single dataframe
loan_data_dummies = [pd.get_dummies(loan_data['grade'], prefix = 'grade', prefix_sep = ':'),
pd.get_dummies(loan_data['sub_grade'], prefix = 'sub_grade', prefix_sep = ':'),
pd.get_dummies(loan_data['home_ownership'], prefix = 'home_ownership', prefix_sep = ':'),
pd.get_dummies(loan_data['verification_status'], prefix = 'verification_status', prefix_sep = ':'),
pd.get_dummies(loan_data['loan_status'], prefix = 'loan_status', prefix_sep = ':'),
pd.get_dummies(loan_data['purpose'], prefix = 'purpose', prefix_sep = ':'),
pd.get_dummies(loan_data['addr_state'], prefix = 'addr_state', prefix_sep = ':'),
pd.get_dummies(loan_data['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ':')]
loan_data_dummies = pd.concat(loan_data_dummies, axis = 1)
# 5) Incorporate new dummy variables into master dataframe
loan_data = pd.concat([loan_data, loan_data_dummies], axis = 1)
# 6) Replace missing values with appropriate alternative value or remove from dataset
loan_data['total_rev_hi_lim'].fillna(loan_data['funded_amnt'], inplace=True) # other variable
loan_data['annual_inc'].fillna(loan_data['annual_inc'].mean(), inplace=True) # mean value
loan_data['mths_since_earliest_cr_line'].fillna(0, inplace=True) # zero value
loan_data['acc_now_delinq'].fillna(0, inplace=True) # zero value
loan_data['total_acc'].fillna(0, inplace=True) # zero value
loan_data['pub_rec'].fillna(0, inplace=True) # zero value
loan_data['open_acc'].fillna(0, inplace=True) # zero value
loan_data['inq_last_6mths'].fillna(0, inplace=True) # zero value
loan_data['delinq_2yrs'].fillna(0, inplace=True) # zero value
loan_data['emp_length_int'].fillna(0, inplace=True) # zero value
# To remove null values from dataset:
#indices = loan_data[loan_data['emp_length_int'].isnull()].index
#loan_data.drop(indices, inplace=True)
# 7) Search for errors/anomalies/outliers in the dataset. Remove or replace
pd.crosstab(loan_data['home_ownership'],
loan_data['emp_length_int'],
values=loan_data['mths_since_earliest_cr_line'],
aggfunc='min').round(2)
loan_data['mths_since_earliest_cr_line'].describe()
# Replace all negative values of the variable with its maximum value
loan_data.loc[loan_data['mths_since_earliest_cr_line'] < 0,
              'mths_since_earliest_cr_line'] = loan_data['mths_since_earliest_cr_line'].max()
# Remove all negative values from dataset
#indices = loan_data[loan_data['mths_since_earliest_cr_line'] < 0].index
#loan_data.drop(indices, inplace=True)
The data should be divided into training and testing datasets. All discrete and continuous feature variables should be transformed into dummy variables. The initial transformation of the feature variables of the training dataset into narrow categories of arbitrary size is referred to as "fine classing". The process of creating new, refined and usually larger categories from these initial ones is known as "coarse classing".
A metric called Weight of Evidence (WoE) is employed to this end. The objective is to reduce the number of dummy variables. Weight of Evidence shows to what extent each category of an independent variable explains the dependent variable, and the aim is to obtain categories with a similar WoE. Ideally, each category (bin) should contain at least 5% of the observations and should have non-zero counts for both events and non-events. The WoE should be monotonic, i.e. either increasing or decreasing across the groupings.
The formula for WoE is:

$$WoE = \ln\left(\frac{n_{good,\,cat} / n_{good,\,total}}{n_{bad,\,cat} / n_{bad,\,total}}\right)$$

The steps to calculate WoE are:
# Define dependent 'Default' variable and add to loan_data dataframe
loan_data['good_bad'] = np.where(loan_data['loan_status'].isin(['Charged Off', 'Default',
'Does not meet the credit policy. Status:Charged Off',
'Late (31-120 days)']), 0, 1)
# Imports the libraries we need.
from sklearn.model_selection import train_test_split
cr_inp_train, cr_inp_test, cr_tgt_train, cr_tgt_test = train_test_split(loan_data.drop('good_bad', axis = 1),
loan_data['good_bad'],
test_size = 0.2,
random_state = 42)
# WoE function for discrete unordered variables
# The function takes 3 arguments: a feature dataframe, a string, and a target dataframe.
# The function returns a dataframe as a result.
def woe_discrete(df, discrete_variabe_name, good_bad_variable_df):
df = pd.concat([df[discrete_variabe_name], good_bad_variable_df], axis = 1)
df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
df = df.iloc[:, [0, 1, 3]]
df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
df['n_good'] = df['prop_good'] * df['n_obs']
df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
df = df.sort_values(['WoE'])
df = df.reset_index(drop = True)
df['diff_prop_good'] = df['prop_good'].diff().abs()
df['diff_WoE'] = df['WoE'].diff().abs()
df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
df['IV'] = df['IV'].sum()
return df
# NOTE ON GROUPBY
# Groups the data according to a criterion contained in one column (1st = Grade)
# Does not turn the names of the values of the criterion into index if as_index = False
# Aggregates the data in another column (good_bad) to these groups, using a selected function (mean)
# Syntax: Produces Pandas DataFrame >>> df.groupby('month')[['duration']].sum()
# WoE function for ordered discrete and continuous variables
def woe_ordered_continuous(df, discrete_variabe_name, good_bad_variable_df):
df = pd.concat([df[discrete_variabe_name], good_bad_variable_df], axis = 1)
df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
df = df.iloc[:, [0, 1, 3]]
df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
df['n_good'] = df['prop_good'] * df['n_obs']
df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
#df = df.sort_values(['WoE'])
#df = df.reset_index(drop = True)
df['diff_prop_good'] = df['prop_good'].diff().abs()
df['diff_WoE'] = df['WoE'].diff().abs()
df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
df['IV'] = df['IV'].sum()
return df
# NOTE: For ordered variables the results are kept in their natural order rather than sorted by WoE.
# WoE Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Imports the libraries we need.
sns.set()
# We set the default style of the graphs to the seaborn style.
# Below we define a graphing function that takes 2 arguments: a WoE dataframe and a number to rotate x labels
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
x = np.array(df_WoE.iloc[:, 0].apply(str))
# Turns the values of the column with index 0 to strings, makes an array from these strings, and passes it to variable x.
y = df_WoE['WoE']
plt.figure(figsize=(18, 6))
plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
plt.xlabel(df_WoE.columns[0])
# Names the x-axis with the name of the column with index 0.
plt.ylabel('Weight of Evidence')
# Names the y-axis 'Weight of Evidence'.
plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
# Names the graph 'Weight of Evidence by ' the name of the column with index 0.
plt.xticks(rotation = rotation_of_x_axis_labels)
# Rotates the labels of the x-axis a predefined number of degrees.
##### Procedure will be run twice. Once with training data and once with testing data #####
# New dataframe with training/test inputs and targets
df_inputs_prepr = cr_inp_train
df_targets_prepr = cr_tgt_train
#df_inputs_prepr = cr_inp_test
#df_targets_prepr = cr_tgt_test
df_targets_prepr
df_temp = woe_discrete(df_inputs_prepr, 'grade', df_targets_prepr)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_temp
plot_by_woe(df_temp)
df_temp = woe_ordered_continuous(df_inputs_prepr, 'emp_length_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# Using WoE we combine residential status categories.
df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership:RENT'],
df_inputs_prepr['home_ownership:OTHER'],
df_inputs_prepr['home_ownership:NONE'],
df_inputs_prepr['home_ownership:ANY']])
# IF a region does not feature in the address (state) column, then it should be added and assigned zero values
if 'addr_state:ND' in df_inputs_prepr.columns.values:
pass
else:
df_inputs_prepr['addr_state:ND'] = 0
# Using WoE we combine region categories.
df_inputs_prepr['addr_state:ND_NE_IA_NV_FL_HI_AL'] = sum([df_inputs_prepr['addr_state:ND'], df_inputs_prepr['addr_state:NE'],
df_inputs_prepr['addr_state:IA'], df_inputs_prepr['addr_state:NV'],
df_inputs_prepr['addr_state:FL'], df_inputs_prepr['addr_state:HI'],
df_inputs_prepr['addr_state:AL']])
df_inputs_prepr['addr_state:NM_VA'] = sum([df_inputs_prepr['addr_state:NM'], df_inputs_prepr['addr_state:VA']])
df_inputs_prepr['addr_state:OK_TN_MO_LA_MD_NC'] = sum([df_inputs_prepr['addr_state:OK'], df_inputs_prepr['addr_state:TN'],
df_inputs_prepr['addr_state:MO'], df_inputs_prepr['addr_state:LA'],
df_inputs_prepr['addr_state:MD'], df_inputs_prepr['addr_state:NC']])
df_inputs_prepr['addr_state:UT_KY_AZ_NJ'] = sum([df_inputs_prepr['addr_state:UT'], df_inputs_prepr['addr_state:KY'],
df_inputs_prepr['addr_state:AZ'], df_inputs_prepr['addr_state:NJ']])
df_inputs_prepr['addr_state:AR_MI_PA_OH_MN'] = sum([df_inputs_prepr['addr_state:AR'], df_inputs_prepr['addr_state:MI'],
df_inputs_prepr['addr_state:PA'], df_inputs_prepr['addr_state:OH'],
df_inputs_prepr['addr_state:MN']])
df_inputs_prepr['addr_state:RI_MA_DE_SD_IN'] = sum([df_inputs_prepr['addr_state:RI'], df_inputs_prepr['addr_state:MA'],
df_inputs_prepr['addr_state:DE'], df_inputs_prepr['addr_state:SD'],
df_inputs_prepr['addr_state:IN']])
df_inputs_prepr['addr_state:GA_WA_OR'] = sum([df_inputs_prepr['addr_state:GA'], df_inputs_prepr['addr_state:WA'],
df_inputs_prepr['addr_state:OR']])
df_inputs_prepr['addr_state:WI_MT'] = sum([df_inputs_prepr['addr_state:WI'], df_inputs_prepr['addr_state:MT']])
df_inputs_prepr['addr_state:IL_CT'] = sum([df_inputs_prepr['addr_state:IL'], df_inputs_prepr['addr_state:CT']])
df_inputs_prepr['addr_state:KS_SC_CO_VT_AK_MS'] = sum([df_inputs_prepr['addr_state:KS'], df_inputs_prepr['addr_state:SC'],
df_inputs_prepr['addr_state:CO'], df_inputs_prepr['addr_state:VT'],
df_inputs_prepr['addr_state:AK'], df_inputs_prepr['addr_state:MS']])
df_inputs_prepr['addr_state:WV_NH_WY_DC_ME_ID'] = sum([df_inputs_prepr['addr_state:WV'], df_inputs_prepr['addr_state:NH'],
df_inputs_prepr['addr_state:WY'], df_inputs_prepr['addr_state:DC'],
df_inputs_prepr['addr_state:ME'], df_inputs_prepr['addr_state:ID']])
# Using WoE we combine purpose categories.
df_inputs_prepr['purpose:educ__sm_b__wedd__ren_en__mov__house'] = sum([df_inputs_prepr['purpose:educational'], df_inputs_prepr['purpose:small_business'],
df_inputs_prepr['purpose:wedding'], df_inputs_prepr['purpose:renewable_energy'],
df_inputs_prepr['purpose:moving'], df_inputs_prepr['purpose:house']])
df_inputs_prepr['purpose:oth__med__vacation'] = sum([df_inputs_prepr['purpose:other'], df_inputs_prepr['purpose:medical'],
df_inputs_prepr['purpose:vacation']])
df_inputs_prepr['purpose:major_purch__car__home_impr'] = sum([df_inputs_prepr['purpose:major_purchase'], df_inputs_prepr['purpose:car'],
df_inputs_prepr['purpose:home_improvement']])
df_inputs_prepr['term:36'] = np.where((df_inputs_prepr['term_int'] == 36), 1, 0)
df_inputs_prepr['term:60'] = np.where((df_inputs_prepr['term_int'] == 60), 1, 0)
# We create the following categories: '0', '1', '2 - 4', '5 - 6', '7 - 9', '10'
# '0' will be the reference category
df_inputs_prepr['emp_length:0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)
df_inputs_prepr['emp_length:1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)
df_inputs_prepr['emp_length:2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2, 5)), 1, 0)
df_inputs_prepr['emp_length:5-6'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5, 7)), 1, 0)
df_inputs_prepr['emp_length:7-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(7, 10)), 1, 0)
df_inputs_prepr['emp_length:10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['mths_since_issue_d_factor'] = pd.cut(df_inputs_prepr['mths_since_issue_d'], 50)
# Here we perform coarse-classing: we create the following categories:
# < 38, 38 - 39, 40 - 41, 42 - 48, 49 - 52, 53 - 64, 65 - 84, > 84.
df_inputs_prepr['mths_since_issue_d:<38'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(38)), 1, 0)
df_inputs_prepr['mths_since_issue_d:38-39'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(38, 40)), 1, 0)
df_inputs_prepr['mths_since_issue_d:40-41'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(40, 42)), 1, 0)
df_inputs_prepr['mths_since_issue_d:42-48'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(42, 49)), 1, 0)
df_inputs_prepr['mths_since_issue_d:49-52'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(49, 53)), 1, 0)
df_inputs_prepr['mths_since_issue_d:53-64'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(53, 65)), 1, 0)
df_inputs_prepr['mths_since_issue_d:65-84'] = np.where(df_inputs_prepr['mths_since_issue_d'].isin(range(65, 85)), 1, 0)
df_inputs_prepr['mths_since_issue_d:>84'] = np.where(df_inputs_prepr['mths_since_issue_d'] > 84, 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['int_rate_factor'] = pd.cut(df_inputs_prepr['int_rate'], 50)
# Here we perform coarse-classing: we create the following categories:
# '< 9.548', '9.548 - 12.025', '12.025 - 15.74', '15.74 - 20.281', '> 20.281'
df_inputs_prepr['int_rate:<9.548'] = np.where((df_inputs_prepr['int_rate'] <= 9.548), 1, 0)
df_inputs_prepr['int_rate:9.548-12.025'] = np.where((df_inputs_prepr['int_rate'] > 9.548) & (df_inputs_prepr['int_rate'] <= 12.025), 1, 0)
df_inputs_prepr['int_rate:12.025-15.74'] = np.where((df_inputs_prepr['int_rate'] > 12.025) & (df_inputs_prepr['int_rate'] <= 15.74), 1, 0)
df_inputs_prepr['int_rate:15.74-20.281'] = np.where((df_inputs_prepr['int_rate'] > 15.74) & (df_inputs_prepr['int_rate'] <= 20.281), 1, 0)
df_inputs_prepr['int_rate:>20.281'] = np.where((df_inputs_prepr['int_rate'] > 20.281), 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['funded_amnt_factor'] = pd.cut(df_inputs_prepr['funded_amnt'], 50)
# We retain these categories
# Fine classed. Categories: Evenly split into 50 bins
df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 50)
# Here we perform coarse-classing: we create the following categories:
#< 140, # 141 - 164, # 165 - 247, # 248 - 270, # 271 - 352, # > 352
df_inputs_prepr['mths_since_earliest_cr_line:<140'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:141-164'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140, 165)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:165-247'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(165, 248)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:248-270'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(248, 271)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:271-352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(271, 353)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:>352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'] > 352, 1, 0)
# Here we perform coarse-classing: we create the following categories:
# Categories: 0, 1-3, >=4
df_inputs_prepr['delinq_2yrs:0'] = np.where((df_inputs_prepr['delinq_2yrs'] == 0), 1, 0)
df_inputs_prepr['delinq_2yrs:1-3'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 1) & (df_inputs_prepr['delinq_2yrs'] <= 3), 1, 0)
df_inputs_prepr['delinq_2yrs:>=4'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 4), 1, 0)
# Categories: 0, 1 - 2, 3 - 6, > 6
df_inputs_prepr['inq_last_6mths:0'] = np.where((df_inputs_prepr['inq_last_6mths'] == 0), 1, 0)
df_inputs_prepr['inq_last_6mths:1-2'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 1) & (df_inputs_prepr['inq_last_6mths'] <= 2), 1, 0)
df_inputs_prepr['inq_last_6mths:3-6'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 3) & (df_inputs_prepr['inq_last_6mths'] <= 6), 1, 0)
df_inputs_prepr['inq_last_6mths:>6'] = np.where((df_inputs_prepr['inq_last_6mths'] > 6), 1, 0)
# Categories: '0', '1-3', '4-12', '13-17', '18-22', '23-25', '26-30', '>30'
df_inputs_prepr['open_acc:0'] = np.where((df_inputs_prepr['open_acc'] == 0), 1, 0)
df_inputs_prepr['open_acc:1-3'] = np.where((df_inputs_prepr['open_acc'] >= 1) & (df_inputs_prepr['open_acc'] <= 3), 1, 0)
df_inputs_prepr['open_acc:4-12'] = np.where((df_inputs_prepr['open_acc'] >= 4) & (df_inputs_prepr['open_acc'] <= 12), 1, 0)
df_inputs_prepr['open_acc:13-17'] = np.where((df_inputs_prepr['open_acc'] >= 13) & (df_inputs_prepr['open_acc'] <= 17), 1, 0)
df_inputs_prepr['open_acc:18-22'] = np.where((df_inputs_prepr['open_acc'] >= 18) & (df_inputs_prepr['open_acc'] <= 22), 1, 0)
df_inputs_prepr['open_acc:23-25'] = np.where((df_inputs_prepr['open_acc'] >= 23) & (df_inputs_prepr['open_acc'] <= 25), 1, 0)
df_inputs_prepr['open_acc:26-30'] = np.where((df_inputs_prepr['open_acc'] >= 26) & (df_inputs_prepr['open_acc'] <= 30), 1, 0)
df_inputs_prepr['open_acc:>=31'] = np.where((df_inputs_prepr['open_acc'] >= 31), 1, 0)
# Categories '0-2', '3-4', '>=5'
df_inputs_prepr['pub_rec:0-2'] = np.where((df_inputs_prepr['pub_rec'] >= 0) & (df_inputs_prepr['pub_rec'] <= 2), 1, 0)
df_inputs_prepr['pub_rec:3-4'] = np.where((df_inputs_prepr['pub_rec'] >= 3) & (df_inputs_prepr['pub_rec'] <= 4), 1, 0)
df_inputs_prepr['pub_rec:>=5'] = np.where((df_inputs_prepr['pub_rec'] >= 5), 1, 0)
# Here we perform fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 50)
# Here we perform coarse-classing: we create the following categories: '<=27', '28-51', '>=52'
df_inputs_prepr['total_acc:<=27'] = np.where((df_inputs_prepr['total_acc'] <= 27), 1, 0)
df_inputs_prepr['total_acc:28-51'] = np.where((df_inputs_prepr['total_acc'] >= 28) & (df_inputs_prepr['total_acc'] <= 51), 1, 0)
df_inputs_prepr['total_acc:>=52'] = np.where((df_inputs_prepr['total_acc'] >= 52), 1, 0)
# Coarse classed. Categories: '0', '>=1'
df_inputs_prepr['acc_now_delinq:0'] = np.where((df_inputs_prepr['acc_now_delinq'] == 0), 1, 0)
df_inputs_prepr['acc_now_delinq:>=1'] = np.where((df_inputs_prepr['acc_now_delinq'] >= 1), 1, 0)
# Fine classed. Categories: Evenly split into 2000 bins
df_inputs_prepr['total_rev_hi_lim_factor'] = pd.cut(df_inputs_prepr['total_rev_hi_lim'], 2000)
# Coarse classed. Categories: <=5K', '5K-10K', '10K-20K', '20K-30K', '30K-40K', '40K-55K', '55K-95K', '>95K'
df_inputs_prepr['total_rev_hi_lim:<=5K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] <= 5000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:5K-10K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 5000) & (df_inputs_prepr['total_rev_hi_lim'] <= 10000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:10K-20K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 10000) & (df_inputs_prepr['total_rev_hi_lim'] <= 20000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:20K-30K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 20000) & (df_inputs_prepr['total_rev_hi_lim'] <= 30000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:30K-40K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 30000) & (df_inputs_prepr['total_rev_hi_lim'] <= 40000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:40K-55K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 40000) & (df_inputs_prepr['total_rev_hi_lim'] <= 55000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:55K-95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 55000) & (df_inputs_prepr['total_rev_hi_lim'] <= 95000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:>95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 95000), 1, 0)
# Fine classed. Categories: Evenly split into 50 bins
df_inputs_prepr['installment_factor'] = pd.cut(df_inputs_prepr['installment'], 50)
# Fine classed. Categories: Evenly split into 100 bins
df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 100)
# Coarse classed. Categories: '<20K', 10K-wide bands from 20K to 100K, '100K-120K', '120K-140K', '>140K'
df_inputs_prepr['annual_inc:<20K'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)
df_inputs_prepr['annual_inc:20K-30K'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)
df_inputs_prepr['annual_inc:30K-40K'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)
df_inputs_prepr['annual_inc:40K-50K'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)
df_inputs_prepr['annual_inc:50K-60K'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)
df_inputs_prepr['annual_inc:60K-70K'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)
df_inputs_prepr['annual_inc:70K-80K'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)
df_inputs_prepr['annual_inc:80K-90K'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)
df_inputs_prepr['annual_inc:90K-100K'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)
df_inputs_prepr['annual_inc:100K-120K'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)
df_inputs_prepr['annual_inc:120K-140K'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)
df_inputs_prepr['annual_inc:>140K'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)
# Categories: Missing, 0-3, 4-30, 31-56, >=57
df_inputs_prepr['mths_since_last_delinq:Missing'] = np.where((df_inputs_prepr['mths_since_last_delinq'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_delinq:0-3'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 0) & (df_inputs_prepr['mths_since_last_delinq'] <= 3), 1, 0)
df_inputs_prepr['mths_since_last_delinq:4-30'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 4) & (df_inputs_prepr['mths_since_last_delinq'] <= 30), 1, 0)
df_inputs_prepr['mths_since_last_delinq:31-56'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 31) & (df_inputs_prepr['mths_since_last_delinq'] <= 56), 1, 0)
df_inputs_prepr['mths_since_last_delinq:>=57'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 57), 1, 0)
# Fine classed. Categories: Evenly split into 100 bins
df_inputs_prepr['dti_factor'] = pd.cut(df_inputs_prepr['dti'], 100)
# Coarse classed. Categories: '<=1.4', '1.4-3.5', '3.5-7.7', '7.7-10.5', '10.5-16.1', '16.1-20.3', '20.3-21.7', '21.7-22.4', '22.4-35', '>35'
df_inputs_prepr['dti:<=1.4'] = np.where((df_inputs_prepr['dti'] <= 1.4), 1, 0)
df_inputs_prepr['dti:1.4-3.5'] = np.where((df_inputs_prepr['dti'] > 1.4) & (df_inputs_prepr['dti'] <= 3.5), 1, 0)
df_inputs_prepr['dti:3.5-7.7'] = np.where((df_inputs_prepr['dti'] > 3.5) & (df_inputs_prepr['dti'] <= 7.7), 1, 0)
df_inputs_prepr['dti:7.7-10.5'] = np.where((df_inputs_prepr['dti'] > 7.7) & (df_inputs_prepr['dti'] <= 10.5), 1, 0)
df_inputs_prepr['dti:10.5-16.1'] = np.where((df_inputs_prepr['dti'] > 10.5) & (df_inputs_prepr['dti'] <= 16.1), 1, 0)
df_inputs_prepr['dti:16.1-20.3'] = np.where((df_inputs_prepr['dti'] > 16.1) & (df_inputs_prepr['dti'] <= 20.3), 1, 0)
df_inputs_prepr['dti:20.3-21.7'] = np.where((df_inputs_prepr['dti'] > 20.3) & (df_inputs_prepr['dti'] <= 21.7), 1, 0)
df_inputs_prepr['dti:21.7-22.4'] = np.where((df_inputs_prepr['dti'] > 21.7) & (df_inputs_prepr['dti'] <= 22.4), 1, 0)
df_inputs_prepr['dti:22.4-35'] = np.where((df_inputs_prepr['dti'] > 22.4) & (df_inputs_prepr['dti'] <= 35), 1, 0)
df_inputs_prepr['dti:>35'] = np.where((df_inputs_prepr['dti'] > 35), 1, 0)
# Categories: 'Missing', '0-2', '3-20', '21-31', '32-80', '81-86', '>86'
df_inputs_prepr['mths_since_last_record:Missing'] = np.where((df_inputs_prepr['mths_since_last_record'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_record:0-2'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 0) & (df_inputs_prepr['mths_since_last_record'] <= 2), 1, 0)
df_inputs_prepr['mths_since_last_record:3-20'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 3) & (df_inputs_prepr['mths_since_last_record'] <= 20), 1, 0)
df_inputs_prepr['mths_since_last_record:21-31'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 21) & (df_inputs_prepr['mths_since_last_record'] <= 31), 1, 0)
df_inputs_prepr['mths_since_last_record:32-80'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 32) & (df_inputs_prepr['mths_since_last_record'] <= 80), 1, 0)
df_inputs_prepr['mths_since_last_record:81-86'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 81) & (df_inputs_prepr['mths_since_last_record'] <= 86), 1, 0)
df_inputs_prepr['mths_since_last_record:>86'] = np.where((df_inputs_prepr['mths_since_last_record'] > 86), 1, 0)
# View metadata
df_inputs_prepr.info()
##### Store training inputs in dataframe #####
cr_inp_train = df_inputs_prepr
##### Store test inputs in dataframe
#cr_inp_test = df_inputs_prepr
##### Save training data to CSV file #####
cr_inp_train.to_csv('cr_inp_train.csv')
cr_tgt_train.to_csv('cr_tgt_train.csv')
##### Save test data to CSV file #####
#cr_inp_test.to_csv('cr_inp_test.csv')
#cr_tgt_test.to_csv('cr_tgt_test.csv')
Having performed an initial filtration of predictor variables, a preliminary model is run with these variables. Care should be taken to remove one dummy variable (the reference category) for each original variable to avoid the so-called dummy variable trap.
loan_data_inputs_train = pd.read_csv('cr_inp_train.csv', index_col = 0)
loan_data_targets_train = pd.read_csv('cr_tgt_train.csv', index_col = 0)
loan_data_inputs_test = pd.read_csv('cr_inp_test.csv', index_col = 0)
loan_data_targets_test = pd.read_csv('cr_tgt_test.csv', index_col = 0)
# Select a limited set of input variables in a new dataframe.
inputs_train_with_ref_cat = loan_data_inputs_train.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'delinq_2yrs:0',
'delinq_2yrs:1-3',
'delinq_2yrs:>=4',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'open_acc:0',
'open_acc:1-3',
'open_acc:4-12',
'open_acc:13-17',
'open_acc:18-22',
'open_acc:23-25',
'open_acc:26-30',
'open_acc:>=31',
'pub_rec:0-2',
'pub_rec:3-4',
'pub_rec:>=5',
'total_acc:<=27',
'total_acc:28-51',
'total_acc:>=52',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'total_rev_hi_lim:<=5K',
'total_rev_hi_lim:5K-10K',
'total_rev_hi_lim:10K-20K',
'total_rev_hi_lim:20K-30K',
'total_rev_hi_lim:30K-40K',
'total_rev_hi_lim:40K-55K',
'total_rev_hi_lim:55K-95K',
'total_rev_hi_lim:>95K',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
# Here we store the names of the reference category dummy variables in a list.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'delinq_2yrs:>=4',
'inq_last_6mths:>6',
'open_acc:0',
'pub_rec:0-2',
'total_acc:<=27',
'acc_now_delinq:0',
'total_rev_hi_lim:<=5K',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
# Drop the variables with variable names in the list with reference categories to avoid dummy variable trap
inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Create an instance of an object from the 'LogisticRegression' class with specified parameters
reg = LogisticRegression(solver='lbfgs', max_iter=200,)
# Sets the pandas dataframe options to display all columns/ rows.
#pd.options.display.max_rows = None
# Estimates the coefficients of the object from the 'LogisticRegression' class
# np.ravel(training_labels) is required to convert the target data into a 1D numpy array
reg.fit(inputs_train, np.ravel(loan_data_targets_train))
# Displays the intercept contained in the estimated ("fitted") object from the 'LogisticRegression' class.
reg.intercept_
# Displays the coefficients contained in the estimated ("fitted") object from the 'LogisticRegression' class.
reg.coef_
feature_name = inputs_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LogisticRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
summary_table.head()
Having fitted the preliminary model, the p-values of the beta coefficients of the feature variables should be analysed to ascertain their statistical significance and to determine if they should be retained or discarded.
# P values for sklearn logistic regression.
# Class to display p-values for logistic regression in sklearn.
from sklearn import linear_model
import scipy.stats as stat
class LogisticRegression_with_p_values:

    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)
        #### Get p-values for the fitted model ####
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom, (X.shape[1], 1)).T
        F_ij = np.dot((X / denom).T, X)  ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij)  ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0] / sigma_estimates  # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]  ### two-tailed test for p-values
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        self.p_values = p_values
reg = LogisticRegression_with_p_values()
# We create an instance of an object from the newly created 'LogisticRegression_with_p_values()' class.
reg.fit(inputs_train, loan_data_targets_train)
# Estimates the coefficients of the object from the 'LogisticRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
# Same as above.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
# We take the result of the newly added method 'p_values' and store it in a variable 'p_values'.
p_values = reg.p_values
# Add the intercept for completeness.
p_values = np.append(np.nan, np.array(p_values))
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' var.
summary_table['p_values'] = p_values
summary_table.head()
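For instance, candidate variables for removal can be flagged by filtering the summary table on a conventional significance level (the 5% cut-off below is an illustrative assumption):

# Dummy variables whose coefficients are not statistically significant at the 5% level
insignificant = summary_table[summary_table['p_values'] > 0.05]
insignificant.sort_values('p_values', ascending=False)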
# We are going to remove features whose dummy-variable coefficients are not statistically significant
# for all, or almost all, of their categories.
# We do that by specifying a reduced list of dummy variables and an updated list of reference categories.
# Then we drop the reference categories from the reduced list of dummy variables.
# Variables
inputs_train_with_ref_cat = loan_data_inputs_train.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
inputs_train = inputs_train_with_ref_cat.drop(ref_categories, axis = 1)
inputs_train.head()
# Here we run a new model.
reg2 = LogisticRegression_with_p_values()
reg2.fit(inputs_train, loan_data_targets_train)
feature_name = inputs_train.columns.values
# Results for our final PD model.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficients'] = np.transpose(reg2.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg2.intercept_[0]]
summary_table = summary_table.sort_index()
p_values = reg2.p_values
p_values = np.append(np.nan,np.array(p_values))
summary_table['p_values'] = p_values
summary_table.head()
import pickle
# pickle.dump() takes two arguments: the object you want to pickle and the file to which the object has to be saved.
# To open the file for writing, simply use the open() function. The first argument should be the name of your file.
# The second argument is 'wb'. The w means that you'll be writing to the file, and b refers to binary mode.
# Here we export our model to a 'SAV' file with file name 'pd_model1.sav'.
pickle.dump(reg2, open('pd_model1.sav', 'wb'))
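To reload the saved model for later scoring, a minimal usage sketch (assuming the same 'pd_model1.sav' file):

# Re-open the pickled PD model for scoring new data
with open('pd_model1.sav', 'rb') as model_file:
    pd_model = pickle.load(model_file)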
# Here, from the dataframe with inputs for testing, we keep the same variables that we used in our final PD model.
inputs_test_with_ref_cat = loan_data_inputs_test.loc[: , ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86',
]]
# And here, in the list below, we keep the variable names for the reference categories,
# only for the variables we used in our final PD model.
ref_categories = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
inputs_test = inputs_test_with_ref_cat.drop(ref_categories, axis = 1)
inputs_test.head()
# Calculates the predicted binary values for the dependent variable (targets)
# based on the out of sample values of the independent variables (inputs) and the coefficients of the refined model
# Output values > 0.5 = 1; Output values < 0.5 = 0;
y_hat_test = reg2.model.predict(inputs_test)
y_hat_test
loan_data_targets_test_temp = loan_data_targets_test
loan_data_targets_test_temp.reset_index(drop = True, inplace = True)
# We reset the index of a dataframe.
# Concatenates two dataframes.
df_actual_predicted = pd.concat([loan_data_targets_test_temp, pd.DataFrame(y_hat_test)], axis = 1)
# Names Columns
df_actual_predicted.columns = ['loan_data_targets_test', 'y_hat_test (0.5)']
# Makes the index of one dataframe equal to the index of another dataframe.
df_actual_predicted.index = loan_data_inputs_test.index
df_actual_predicted.head()
import itertools
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
normalize=False,
title='CONFUSION MATRIX',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float')
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, fontsize=20)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.3f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "white")
plt.tight_layout()
plt.ylabel('*TRUE LABEL*', fontsize=14)
plt.xlabel('*PREDICTED LABEL*', fontsize=14)
plt.show()
cm = confusion_matrix(loan_data_targets_test_temp, y_hat_test)
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm, classes,
normalize=False,
title='CONFUSION MATRIX - Threshold = 0.5',
cmap=plt.cm.RdYlGn)
# Actual vs Predicted binary target variables (where 0.5 is the cutoff for predicted default/non-default)
from sklearn.metrics import accuracy_score
print("Accuracy (Out-of-Sample, threshold=0.5): ", accuracy_score(loan_data_targets_test_temp, y_hat_test))
# Calculates the predicted probability values for the dependent variable (targets)
# based on the out of sample values of the independent variables (inputs) and the coefficients of the refined model.
# This is an array of arrays of predicted class probabilities for all classes.
# In this case, the first value of every sub-array is the probability for the observation to belong to the first class, i.e. 0,
# and the second value is the probability for the observation to belong to the second class, i.e. 1.
y_hat_test_proba = reg2.model.predict_proba(inputs_test)
y_hat_test_proba = y_hat_test_proba[:][:,1]
y_hat_test_proba
df_actual_predicted_probs = pd.concat([loan_data_targets_test_temp, pd.DataFrame(y_hat_test_proba)], axis = 1)
df_actual_predicted_probs.columns = ['loan_data_targets_test', 'y_hat_test_proba']
df_actual_predicted_probs.index = loan_data_inputs_test.index
df_actual_predicted_probs.head()
import matplotlib.pyplot as plt
plt.hist(df_actual_predicted_probs['y_hat_test_proba'], bins=50)
plt.title('Probability Distribution - No Default', fontsize=20)
plt.show()
tr = 0.9
# We create a new column with an indicator,
# where every observation that has predicted probability greater than the threshold has a value of 1,
# and every observation that has predicted probability lower than the threshold has a value of 0.
df_actual_predicted_probs['y_hat_test'] = np.where(df_actual_predicted_probs['y_hat_test_proba'] > tr, 1, 0)
# Creates a cross-table where the actual values are displayed by rows and the predicted values by columns.
# This table is known as a Confusion Matrix.
cm_df = pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'],
df_actual_predicted_probs['y_hat_test'],
rownames = ['Actual'], colnames = ['Predicted'])
# Confusion Matrix as numpy array
cm_arr = np.array(cm_df)
cm_arr
# Confusion Matrix normalized by number of observations
cm_arr_norm = np.array([[cm_arr[0,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])),
cm_arr[0,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))],
[cm_arr[1,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])),
cm_arr[1,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))]])
cm_arr_norm
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm_arr, classes,
normalize=False,
title='CONFUSION MATRIX - Threshold = 0.9',
cmap=plt.cm.RdYlGn)
classes = ['Defaulted', 'Not Defaulted']
plot_confusion_matrix(cm_arr_norm, classes,
normalize=True,
title='NORM. CONFUSION MATRIX - Threshold = 0.9',
cmap=plt.cm.RdYlGn)
print("Accuracy (Out-of-Sample, threshold=0.9): ", cm_arr[0,0]/(sum(cm_arr[0,:])+sum(cm_arr[1,:]))
+ cm_arr[1,1]/(sum(cm_arr[0,:])+sum(cm_arr[1,:])))
Model performance is evaluated by considering the shape of the ROC curve, the Area Under the ROC Curve (AUROC) and the Gini coefficient on the Testing (Out-of-Sample) data.
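As a minimal illustrative sketch (not part of the modelling pipeline; the toy labels and probabilities below are made up), the Gini coefficient reported further down is simply 2 * AUROC - 1:
import numpy as np
from sklearn.metrics import roc_auc_score
toy_actual = np.array([1, 0, 1, 1, 0, 1, 0, 1])                   # 1 = good (no default), 0 = bad (default)
toy_proba = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.95])   # made-up estimated probabilities of being good
toy_auroc = roc_auc_score(toy_actual, toy_proba)
toy_gini = 2 * toy_auroc - 1                                      # Gini = 2 * AUROC - 1
print(toy_auroc, toy_gini)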
from sklearn.metrics import roc_curve, roc_auc_score
roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Returns the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.
# As a result, we get three arrays: the false positive rates, the true positive rates, and the thresholds.
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Here we store each of the three arrays in a separate variable.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.plot(fpr, tpr)
# We plot the false positive rate along the x-axis and the true positive rate along the y-axis,
# thus plotting the ROC curve.
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('False Pos rate (% of Bad Loans Incorr. classified)')
# We name the x-axis "False positive rate".
plt.ylabel('True Pos rate (% of Good Loans Corr. Classified)')
# We name the x-axis "True positive rate".
plt.title('ROC curve',fontsize=20)
# We name the graph "ROC curve".
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
optimal_tpr = tpr[optimal_idx]
optimal_fpr = fpr[optimal_idx]
optimal_threshold, optimal_tpr, optimal_fpr
print("Optimal Threshold of : ", optimal_threshold)
print("At Index : ", optimal_idx)
print("With Optimal True Pos Rate of : ", optimal_tpr)
print("And Optimal False Pos Rate of : ", optimal_fpr)
j_scores = tpr-fpr
j_ordered = sorted(zip(fpr, tpr, j_scores, thresholds))
j_ordered_df = pd.DataFrame(data=j_ordered, columns=['FPR', 'TPR', 'TPR-FPR','Thresholds'])
j_ordered_df.head()
j_ordered_df.tail()
AUROC = roc_auc_score(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
# Calculates the Area Under the Receiver Operating Characteristic Curve (AUROC)
# from a set of actual values and their predicted probabilities.
AUROC
df_actual_predicted_probs = df_actual_predicted_probs.sort_values('y_hat_test_proba')
# Sorts a dataframe by the values of a specific column.
df_actual_predicted_probs.head()
df_actual_predicted_probs.tail()
df_actual_predicted_probs = df_actual_predicted_probs.reset_index()
# We reset the index of a dataframe and overwrite it.
df_actual_predicted_probs.head()
df_actual_predicted_probs['Cumulative N Population'] = df_actual_predicted_probs.index + 1
# We calculate the cumulative number of all observations.
# We use the new index for that. Since indexing in Python starts from 0, we add 1 to each index.
df_actual_predicted_probs['Cumulative N Good'] = df_actual_predicted_probs['loan_data_targets_test'].cumsum()
# We calculate cumulative number of 'good', which is the cumulative sum of the column with actual observations.
df_actual_predicted_probs['Cumulative N Bad'] = df_actual_predicted_probs['Cumulative N Population'] - df_actual_predicted_probs['loan_data_targets_test'].cumsum()
# We calculate cumulative number of 'bad', which is
# the difference between the cumulative number of all observations and cumulative number of 'good' for each row.
df_actual_predicted_probs.head()
df_actual_predicted_probs['Cumulative Perc Population'] = df_actual_predicted_probs['Cumulative N Population'] / (df_actual_predicted_probs.shape[0])
# We calculate the cumulative percentage of all observations.
df_actual_predicted_probs['Cumulative Perc Good'] = df_actual_predicted_probs['Cumulative N Good'] / df_actual_predicted_probs['loan_data_targets_test'].sum()
# We calculate cumulative percentage of 'good'.
df_actual_predicted_probs['Cumulative Perc Bad'] = df_actual_predicted_probs['Cumulative N Bad'] / (df_actual_predicted_probs.shape[0] - df_actual_predicted_probs['loan_data_targets_test'].sum())
# We calculate the cumulative percentage of 'bad'.
df_actual_predicted_probs.head()
df_actual_predicted_probs.tail()
# Plot Prob of Default of Population
x = 1-(df_actual_predicted_probs['y_hat_test_proba'])
# Since 'y_hat_test_proba' is the estimated probability of being 'good', 1 minus it is the estimated probability of default.
plt.scatter(df_actual_predicted_probs['Cumulative Perc Population'], x)
# We plot the cumulative percentage of the population along the x-axis and the estimated probability of default along the y-axis.
plt.xlabel('Cumulative % Observed Population')
plt.ylabel('Probability Default')
plt.title('Probability Default - Portfolio Constituents',fontsize=20)
# Plot Prob of Default of Population, this time against the cumulative number of observations
x = 1-(df_actual_predicted_probs['y_hat_test_proba'])
plt.scatter(df_actual_predicted_probs['Cumulative N Population'], x)
# We plot the cumulative number of observations along the x-axis and the estimated probability of default along the y-axis.
plt.xlabel('Cumulative N Observed Population')
plt.ylabel('Probability Default')
plt.title('Probability Default - Portfolio Constituents',fontsize=20)
# Plot Gini
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Bad'])
# We plot the cumulative percentage of all along the x-axis and the cumulative percentage 'bad' along the y-axis,
# thus plotting the Gini curve.
plt.plot(df_actual_predicted_probs['Cumulative Perc Population'], df_actual_predicted_probs['Cumulative Perc Population'],
linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('Cumulative % Observed Population')
# We name the x-axis "Cumulative % Population".
plt.ylabel('Cumulative % Observed Bad')
# We name the y-axis "Cumulative % Bad".
plt.title('Gini',fontsize=20)
# We name the graph "Gini".
Gini = AUROC * 2 - 1
# Here we calculate Gini from AUROC.
Gini
Model performance is also evaluated using the Kolmogorov-Smirnov (KS) coefficient on the Testing (Out-of-Sample) data, which measures the maximum difference between the cumulative distribution functions of observed good and bad borrowers with respect to the estimated probabilities of being "Good" according to the model. The greater the difference, the better the model.
# Plot KS
plt.plot(df_actual_predicted_probs['y_hat_test_proba'], df_actual_predicted_probs['Cumulative Perc Bad'], color = 'r')
# We plot the predicted (estimated) probabilities along the x-axis and the cumulative percentage 'bad' along the y-axis,
# colored in red.
plt.plot(df_actual_predicted_probs['y_hat_test_proba'], df_actual_predicted_probs['Cumulative Perc Good'], color = 'b')
# We plot the predicted (estimated) probabilities along the x-axis and the cumulative percentage 'good' along the y-axis,
# colored in blue.
plt.xlabel('Estimated Probability for being Good')
# We name the x-axis "Estimated Probability for being Good".
plt.ylabel('Cumulative %')
# We name the y-axis "Cumulative %".
plt.legend(['Cumulative Perc Bad', 'Cumulative Perc Good'])
plt.title('Kolmogorov-Smirnov',fontsize=20)
# We name the graph "Kolmogorov-Smirnov".
KS = max(df_actual_predicted_probs['Cumulative Perc Bad'] - df_actual_predicted_probs['Cumulative Perc Good'])
# We calculate KS from the data. It is the maximum of the difference between the cumulative percentage of 'bad'
# and the cumulative percentage of 'good'.
print("KS Coefficient: ", KS)
#pd.options.display.max_columns = None
# Sets the pandas dataframe options to display all columns/ rows.
inputs_test_with_ref_cat.head()
summary_table.head()
y_hat_test_proba
summary_table.head()
ref_categories
df_ref_categories = pd.DataFrame(ref_categories, columns = ['Feature name'])
# We create a new dataframe with one column. Its values are the values from the 'ref_categories' list.
# We name it 'Feature name'.
df_ref_categories['Coefficients'] = 0
# We create a second column, called 'Coefficients', which contains only 0 values.
df_ref_categories['p_values'] = np.nan
# We create a third column, called 'p_values', which contains only NaN values.
df_ref_categories.head()
df_scorecard = pd.concat([summary_table, df_ref_categories])
# Concatenates two dataframes.
df_scorecard = df_scorecard.reset_index()
# We reset the index of a dataframe.
df_scorecard
df_scorecard['Original feature name'] = df_scorecard['Feature name'].str.split(':').str[0]
# We create a new column, called 'Original feature name', which contains the value of the 'Feature name' column,
# up to the colon (':') symbol.
df_scorecard
min_score = 300
max_score = 850
df_scorecard.groupby('Original feature name')['Coefficients'].min()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Coefficients' column, calculating their minimum.
min_sum_coef = df_scorecard.groupby('Original feature name')['Coefficients'].min().sum()
# Up to the 'min()' method everything is the same as in the line above.
# Then, we aggregate further and sum all the minimum values.
min_sum_coef
df_scorecard.groupby('Original feature name')['Coefficients'].max()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Coefficients' column, calculating their maximum.
max_sum_coef = df_scorecard.groupby('Original feature name')['Coefficients'].max().sum()
# Up to the 'max()' method everything is the same as in the line above.
# Then, we aggregate further and sum all the maximum values.
max_sum_coef
df_scorecard['Score - Calculation'] = df_scorecard['Coefficients'] * (max_score - min_score) / (max_sum_coef - min_sum_coef)
# We multiply the value of the 'Coefficients' column by the ratio of the difference between
# the maximum score and the minimum score to the difference between the maximum and the minimum sum of coefficients.
df_scorecard
df_scorecard['Score - Calculation'][0] = ((df_scorecard['Coefficients'][0] - min_sum_coef) / (max_sum_coef - min_sum_coef)) * (max_score - min_score) + min_score
# We divide the difference of the value of the 'Coefficients' column and the minimum sum of coefficients by
# the difference of the maximum sum of coefficients and the minimum sum of coefficients.
# Then, we multiply that by the difference between the maximum score and the minimum score.
# Then, we add minimum score.
df_scorecard.head()
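As a minimal numeric sketch of the scaling above (the coefficient of 0.8 and the spread of 11.0 are made-up values, not outputs of the fitted model), a single coefficient maps to score points as follows:
example_coef = 0.8                    # hypothetical model coefficient
example_spread = 11.0                 # stand-in for (max_sum_coef - min_sum_coef)
example_points = example_coef * (max_score - min_score) / example_spread
print(example_points)                 # roughly 40 score points for this hypothetical coefficient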
df_scorecard['Score - Preliminary'] = df_scorecard['Score - Calculation'].round()
# We round the values of the 'Score - Calculation' column.
df_scorecard.head()
min_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Preliminary'].min().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Preliminary' column, calculating their minimum.
# Sums all minimum values.
min_sum_score_prel
max_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Preliminary'].max().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Preliminary' column, calculating their maximum.
# Sums all maximum values.
max_sum_score_prel
# Because of rounding, the preliminary scores do not add up to the intended 300-850 range exactly,
# so one point has to be subtracted from the score of one original variable. Which one? We'll evaluate based on the rounding differences.
df_scorecard['Difference'] = df_scorecard['Score - Preliminary'] - df_scorecard['Score - Calculation']
df_scorecard.head()
df_scorecard['Score - Final'] = df_scorecard['Score - Preliminary']
df_scorecard['Score - Final'][77] = 16
df_scorecard.head()
min_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Final'].min().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Final' column, calculating their minimum.
# Sums all minimum values.
min_sum_score_prel
max_sum_score_prel = df_scorecard.groupby('Original feature name')['Score - Final'].max().sum()
# Groups the data by the values of the 'Original feature name' column.
# Aggregates the data in the 'Score - Final' column, calculating their maximum.
# Sums all maximum values.
max_sum_score_prel
inputs_test_with_ref_cat.head()
df_scorecard.head()
inputs_test_with_ref_cat_w_intercept = inputs_test_with_ref_cat.copy()
# We copy the dataframe so that inserting the intercept column does not modify the original.
inputs_test_with_ref_cat_w_intercept.insert(0, 'Intercept', 1)
# We insert a column in the dataframe, with an index of 0, that is, in the beginning of the dataframe.
# The name of that column is 'Intercept', and its values are 1s.
inputs_test_with_ref_cat_w_intercept.head()
inputs_test_with_ref_cat_w_intercept = inputs_test_with_ref_cat_w_intercept[df_scorecard['Feature name'].values]
# Here, from the 'inputs_test_with_ref_cat_w_intercept' dataframe, we keep only the columns with column names,
# exactly equal to the row values of the 'Feature name' column from the 'df_scorecard' dataframe.
inputs_test_with_ref_cat_w_intercept.head()
scorecard_scores = df_scorecard['Score - Final']
inputs_test_with_ref_cat_w_intercept.shape
scorecard_scores.shape
scorecard_scores = scorecard_scores.values.reshape(102, 1)
scorecard_scores.shape
y_scores = inputs_test_with_ref_cat_w_intercept.dot(scorecard_scores)
# Here we multiply the values of each row of the dataframe by the values of each column of the variable,
# which is an argument of the 'dot' method, and sum them. It's essentially the sum of the products.
y_scores.head()
y_scores.tail()
sum_coef_from_score = ((y_scores - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
# We divide the difference between the scores and the minimum score by
# the difference between the maximum score and the minimum score.
# Then, we multiply that by the difference between the maximum sum of coefficients and the minimum sum of coefficients.
# Then, we add the minimum sum of coefficients.
y_hat_proba_from_score = np.exp(sum_coef_from_score) / (np.exp(sum_coef_from_score) + 1)
# Here we divide an exponent raised to sum of coefficients from score by
# an exponent raised to sum of coefficients from score plus one.
y_hat_proba_from_score.head()
y_hat_test_proba[0: 5]
df_actual_predicted_probs['y_hat_test_proba'].head()
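As a hedged sketch of the same back-transformation for a single score (the score of 600 below is hypothetical, not one produced by the scorecard):
example_score = 600
example_coef_sum = ((example_score - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
example_proba_good = np.exp(example_coef_sum) / (np.exp(example_coef_sum) + 1)
print(example_proba_good)             # estimated probability of being 'good' implied by a score of 600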
# We need the confusion matrix again.
#np.where(np.squeeze(np.array(loan_data_targets_test)) == np.where(y_hat_test_proba >= tr, 1, 0), 1, 0).sum() / loan_data_targets_test.shape[0]
tr = 0.9
df_actual_predicted_probs['y_hat_test'] = np.where(df_actual_predicted_probs['y_hat_test_proba'] > tr, 1, 0)
#df_actual_predicted_probs['loan_data_targets_test'] == np.where(df_actual_predicted_probs['y_hat_test_proba'] >= tr, 1, 0)
pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted'])
pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
(pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[0, 0] + (pd.crosstab(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[1, 1]
from sklearn.metrics import roc_curve, roc_auc_score
roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.plot(fpr, tpr)
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
thresholds
thresholds.shape
df_cutoffs = pd.concat([pd.DataFrame(thresholds), pd.DataFrame(fpr), pd.DataFrame(tpr)], axis = 1)
# We concatenate 3 dataframes along the columns.
df_cutoffs.columns = ['thresholds', 'fpr', 'tpr']
# We name the columns of the dataframe 'thresholds', 'fpr', and 'tpr'.
df_cutoffs.head()
df_cutoffs['thresholds'][0] = 1 - 1 / np.power(10, 16)
# Let the first threshold (the value of the thresholds column with index 0) be equal to a number, very close to 1
# but smaller than 1, say 1 - 1 / 10 ^ 16.
df_cutoffs['Score'] = ((np.log(df_cutoffs['thresholds'] / (1 - df_cutoffs['thresholds'])) - min_sum_coef) * ((max_score - min_score) / (max_sum_coef - min_sum_coef)) + min_score).round()
# The score corresponding to each threshold equals:
# the difference between the natural logarithm of the odds (threshold / (1 - threshold)) and the minimum sum of coefficients,
# multiplied by the ratio of the difference between the maximum score and the minimum score
# to the difference between the maximum sum of coefficients and the minimum sum of coefficients,
# plus the minimum score.
df_cutoffs.head()
df_cutoffs['Score'][0] = max_score
df_cutoffs.head()
df_cutoffs.tail()
# We define a function called 'n_approved' which assigns a value of 1 if a predicted probability
# is greater than or equal to the parameter p, which is a threshold, and a value of 0 if it is not.
# Then it sums the column.
# Thus, given any threshold, the function returns
# the number of rows with estimated probabilities greater than or equal to that threshold.
def n_approved(p):
return np.where(df_actual_predicted_probs['y_hat_test_proba'] >= p, 1, 0).sum()
df_cutoffs['N Approved'] = df_cutoffs['thresholds'].apply(n_approved)
# Assuming that all credit applications above a given probability of being 'good' will be approved,
# when we apply the 'n_approved' function to a threshold, it will return the number of approved applications.
# Thus, here we calculate the number of approved applications for all thresholds.
df_cutoffs['N Rejected'] = df_actual_predicted_probs['y_hat_test_proba'].shape[0] - df_cutoffs['N Approved']
# Then, we calculate the number of rejected applications for each threshold.
# It is the difference between the total number of applications and the approved applications for that threshold.
df_cutoffs['Approval Rate'] = df_cutoffs['N Approved'] / df_actual_predicted_probs['y_hat_test_proba'].shape[0]
# Approval rate equals the ratio of the approved applications to all applications.
df_cutoffs['Rejection Rate'] = 1 - df_cutoffs['Approval Rate']
# Rejection rate equals one minus approval rate.
df_cutoffs.head()
df_cutoffs.tail()
df_cutoffs.iloc[5000: 5200, ]
# Here we display the dataframe with cutoffs from the row with index 5000 to the row with index 5200.
df_cutoffs.iloc[1000: 1200, ]
# Here we display the dataframe with cutoffs from the row with index 1000 to the row with index 1200.
inputs_train_with_ref_cat.to_csv('inputs_train_with_ref_cat.csv')
df_scorecard.to_csv('df_scorecard.csv')
import numpy as np
import pandas as pd
# Import data.
loan_data_preprocessed_backup = pd.read_csv('loan_data_2007_2014_preprocessed.csv')
loan_data_preprocessed = loan_data_preprocessed_backup.copy()
loan_data_preprocessed.columns.values
# Displays all column names.
loan_data_preprocessed.head()
pd.options.display.max_columns = None
loan_data_preprocessed
# Create a series of Boolean values indicating whether loan is recognised as "Charged Off"
loan_data_preprocessed['loan_status'].isin(['Charged Off','Does not meet the credit policy. Status:Charged Off'])
# Create a dataframe with data only for those accounts recognized as "Charged Off"
loan_data_defaults = loan_data_preprocessed[loan_data_preprocessed['loan_status'].isin(['Charged Off',
'Does not meet the credit policy. Status:Charged Off'])]
loan_data_defaults
pd.options.display.max_rows = None
loan_data_defaults.isnull().sum()
# We fill the missing values with zeroes.
loan_data_defaults['mths_since_last_delinq'].fillna(0, inplace = True)
loan_data_defaults['mths_since_last_record'].fillna(0, inplace=True)
# We calculate the dependent variable for the LGD model, the recovery rate, and add it to the defaults dataframe.
loan_data_defaults['recovery_rate'] = loan_data_defaults['recoveries'] / loan_data_defaults['funded_amnt']
loan_data_defaults['recovery_rate'].describe()
formatted_mean = "{:.4f}".format(loan_data_defaults['recovery_rate'].mean())
print("Total Defaulted Loans : " ,loan_data_defaults['recovery_rate'].count())
print("Mean Recovery Rate on Defaulted Loans : " ,formatted_mean)
loan_data_defaults['recovery_rate'] = np.where(loan_data_defaults['recovery_rate'] > 1,
1, loan_data_defaults['recovery_rate'])
loan_data_defaults['recovery_rate'] = np.where(loan_data_defaults['recovery_rate'] < 0,
0, loan_data_defaults['recovery_rate'])
# We set recovery rates that are greater than 1 to 1 and recovery rates that are less than 0 to 0.
loan_data_defaults['CCF'] = (loan_data_defaults['funded_amnt'] - loan_data_defaults['total_rec_prncp']) / loan_data_defaults['funded_amnt']
# We calculate the dependent variable for the EAD model: credit conversion factor.
# It is the ratio of the difference between the funded amount and the total principal recovered before default to the funded amount,
# i.e. the proportion of the funded amount still outstanding at the moment of default.
loan_data_defaults['CCF'].describe()
# Shows some descriptive statistics for the values of a column.
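For intuition, a purely illustrative CCF calculation (the amounts below are made up, not taken from the data):
example_funded = 10000                # hypothetical funded amount
example_rec_prncp = 4000              # hypothetical principal recovered before default
example_ccf = (example_funded - example_rec_prncp) / example_funded
print(example_ccf)                    # 0.6, i.e. 60% of the funded amount was still outstanding at default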
loan_data_defaults.to_csv('loan_data_defaults.csv')
# We save the data to a CSV file.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.title('Distribution Recovery Rate',fontsize=20)
plt.hist(loan_data_defaults['recovery_rate'], bins = 100);
# We plot a histogram of a variable with 100 bins.
plt.title('Distribution CCF',fontsize=20)
plt.hist(loan_data_defaults['CCF'], bins = 100);
# We plot a histogram of a variable with 100 bins.
loan_data_defaults['recovery_rate_0_1'] = np.where(loan_data_defaults['recovery_rate'] == 0, 0, 1)
loan_data_defaults['recovery_rate_0_1'].head()
# We create a new variable which is 0 if recovery rate is 0 and 1 otherwise.
loan_data_defaults['recovery_rate_0_1'].tail()
from sklearn.model_selection import train_test_split
# LGD model stage 1 datasets: recovery rate 0 or greater than 0.
lgd_inputs_stage_1_train, lgd_inputs_stage_1_test, lgd_targets_stage_1_train, lgd_targets_stage_1_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['recovery_rate_0_1'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
features_all = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:MORTGAGE',
'home_ownership:NONE',
'home_ownership:OTHER',
'home_ownership:OWN',
'home_ownership:RENT',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:car',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:educational',
'purpose:home_improvement',
'purpose:house',
'purpose:major_purchase',
'purpose:medical',
'purpose:moving',
'purpose:other',
'purpose:renewable_energy',
'purpose:small_business',
'purpose:vacation',
'purpose:wedding',
'initial_list_status:f',
'initial_list_status:w',
'term_int',
'emp_length_int',
'mths_since_issue_d',
'mths_since_earliest_cr_line',
'funded_amnt',
'int_rate',
'installment',
'annual_inc',
'dti',
'delinq_2yrs',
'inq_last_6mths',
'mths_since_last_delinq',
'mths_since_last_record',
'open_acc',
'pub_rec',
'total_acc',
'acc_now_delinq',
'total_rev_hi_lim']
# List of all independent variables for the models.
features_reference_cat = ['grade:G',
'home_ownership:RENT',
'verification_status:Verified',
'purpose:credit_card',
'initial_list_status:f']
# List of the dummy variable reference categories.
lgd_inputs_stage_1_train = lgd_inputs_stage_1_train[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_1_train = lgd_inputs_stage_1_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
lgd_inputs_stage_1_train.isnull().sum()
# Check for missing values. We check whether the value of each row for each column is missing or not,
# then sum across columns.
# P values for sklearn logistic regression.
# Class to display p-values for logistic regression in sklearn.
from sklearn import linear_model
import scipy.stats as stat
class LogisticRegression_with_p_values:
def __init__(self, *args, **kwargs):
self.model = linear_model.LogisticRegression(*args, **kwargs)
def fit(self,X,y):
self.model.fit(X,y)
#### Get p-values for the fitted model ####
denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
denom = np.tile(denom,(X.shape[1],1)).T
F_ij = np.dot((X / denom).T,X) ## Fisher Information Matrix
Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
z_scores = self.model.coef_[0] / sigma_estimates # z-score for each model coefficient
p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores] ### two tailed test for p-values
self.coef_ = self.model.coef_
self.intercept_ = self.model.intercept_
#self.z_scores = z_scores
self.p_values = p_values
#self.sigma_estimates = sigma_estimates
#self.F_ij = F_ij
reg_lgd_st_1 = LogisticRegression_with_p_values()
# We create an instance of an object from the 'LogisticRegression_with_p_values' class.
reg_lgd_st_1.fit(lgd_inputs_stage_1_train, lgd_targets_stage_1_train)
# Estimates the coefficients of the object from the 'LogisticRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
feature_name = lgd_inputs_stage_1_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_lgd_st_1.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LogisticRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg_lgd_st_1.intercept_[0]]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_1.p_values
# We take the result of the newly added method 'p_values' and store it in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add the value 'NaN' in the beginning of the variable with p-values.
summary_table['p_values'] = p_values
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
lgd_inputs_stage_1_test = lgd_inputs_stage_1_test[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_1_test = lgd_inputs_stage_1_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
y_hat_test_lgd_stage_1 = reg_lgd_st_1.model.predict(lgd_inputs_stage_1_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
y_hat_test_lgd_stage_1
y_hat_test_proba_lgd_stage_1 = reg_lgd_st_1.model.predict_proba(lgd_inputs_stage_1_test)
# Calculates the predicted probability values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
y_hat_test_proba_lgd_stage_1
# This is an array of arrays of predicted class probabilities for all classes.
# In this case, the first value of every sub-array is the probability for the observation to belong to the first class, i.e. 0,
# and the second value is the probability for the observation to belong to the second class, i.e. 1.
y_hat_test_proba_lgd_stage_1 = y_hat_test_proba_lgd_stage_1[: ][: , 1]
# Here we take all the arrays in the array, and from each array, we take all rows, and only the element with index 1,
# that is, the second element.
# In other words, we take only the probabilities for being 1.
y_hat_test_proba_lgd_stage_1
lgd_targets_stage_1_test_temp = lgd_targets_stage_1_test
lgd_targets_stage_1_test_temp.reset_index(drop = True, inplace = True)
# We reset the index of a dataframe.
df_actual_predicted_probs = pd.concat([lgd_targets_stage_1_test_temp, pd.DataFrame(y_hat_test_proba_lgd_stage_1)], axis = 1)
# Concatenates two dataframes.
df_actual_predicted_probs.columns = ['lgd_targets_stage_1_test', 'y_hat_test_proba_lgd_stage_1']
df_actual_predicted_probs.index = lgd_inputs_stage_1_test.index
# Makes the index of one dataframe equal to the index of another dataframe.
df_actual_predicted_probs.head()
import itertools
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
normalize=False,
title='CONFUSION MATRIX',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float')
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, fontsize=20)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.3f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "white")
plt.tight_layout()
plt.ylabel('*TRUE LABEL*', fontsize=14)
plt.xlabel('*PREDICTED LABEL*', fontsize=14)
plt.show()
tr = 0.5
# We create a new column with an indicator,
# where every observation that has predicted probability greater than the threshold has a value of 1,
# and every observation that has predicted probability lower than the threshold has a value of 0.
df_actual_predicted_probs['y_hat_test_lgd_stage_1'] = np.where(df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'] > tr, 1, 0)
pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted'])
# Creates a cross-table where the actual values are displayed by rows and the predicted values by columns.
# This table is known as a Confusion Matrix.
cm_lgd_N = confusion_matrix(df_actual_predicted_probs['lgd_targets_stage_1_test'],
df_actual_predicted_probs['y_hat_test_lgd_stage_1'])
classes = ['No Recovery', 'Recovery']
plot_confusion_matrix(cm_lgd_N, classes,
normalize=False,
title='CONFUSION MATRIX - Threshold = 0.5',
cmap=plt.cm.RdYlGn)
pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
# Here we divide each value of the table by the total number of observations,
# thus getting percentages, or, rates.
cm_lgd_pc = pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'],
df_actual_predicted_probs['y_hat_test_lgd_stage_1'],
rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]
cm_arr = np.array(cm_lgd_pc)
classes = ['No Recovery', 'Recovery']
plot_confusion_matrix(cm_arr, classes,
normalize=True,
title='CONFUSION MATRIX - Threshold = 0.5',
cmap=plt.cm.RdYlGn)
(pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[0, 0] + (pd.crosstab(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_lgd_stage_1'], rownames = ['Actual'], colnames = ['Predicted']) / df_actual_predicted_probs.shape[0]).iloc[1, 1]
# Here we calculate Accuracy of the model, which is the sum of the diagonal rates.
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['lgd_targets_stage_1_test'],
df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
# Returns the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.
# As a result, we get three arrays: the false positive rates, the true positive rates, and the thresholds.
# we store each of the three arrays in a separate variable.
plt.plot(fpr, tpr)
# We plot the false positive rate along the x-axis and the true positive rate along the y-axis,
# thus plotting the ROC curve.
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
# We plot a secondary diagonal line, with dashed line style and black color.
plt.xlabel('False positive rate')
# We name the x-axis "False positive rate".
plt.ylabel('True positive rate')
# We name the x-axis "True positive rate".
plt.title('ROC curve')
# We name the graph "ROC curve".
AUROC = roc_auc_score(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
# Calculates the Area Under the Receiver Operating Characteristic Curve (AUROC)
# from a set of actual values and their predicted probabilities.
AUROC
import pickle
pickle.dump(reg_lgd_st_1, open('lgd_model_stage_1.sav', 'wb'))
# Here we export our model to a 'SAV' file with file name 'lgd_model_stage_1.sav'.
lgd_stage_2_data = loan_data_defaults[loan_data_defaults['recovery_rate_0_1'] == 1]
# Here we take only rows where the original recovery rate variable is greater than zero,
# i.e. where the indicator variable we created is equal to 1.
# LGD model stage 2 datasets: how much more than 0 is the recovery rate
lgd_inputs_stage_2_train, lgd_inputs_stage_2_test, lgd_targets_stage_2_train, lgd_targets_stage_2_test = train_test_split(lgd_stage_2_data.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), lgd_stage_2_data['recovery_rate'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Since the p-values are obtained through certain statistics, we need the 'stat' module from scipy.stats
import scipy.stats as stat
# Since we are using an object oriented language such as Python, we can simply define our own
# LinearRegression class (the same one from sklearn)
# By typing the code below we will overwrite a part of the class with one that includes p-values
# Here's the full source code of the ORIGINAL class: https://github.com/scikit-learn/scikit-learn/blob/7b136e9/sklearn/linear_model/base.py#L362
class LinearRegression(linear_model.LinearRegression):
"""
LinearRegression class after sklearn's, but calculate t-statistics
and p-values for model coefficients (betas).
Additional attributes available after .fit()
are `t` and `p`, which are of shape (y.shape[1], X.shape[1]),
i.e. (n_targets, n_features).
Here the intercept is fitted by default (fit_intercept=True),
so it does not need to be included in X.
"""
# nothing changes in __init__
def __init__(self, fit_intercept=True, normalize=False, copy_X=True,
n_jobs=1):
self.fit_intercept = fit_intercept
self.normalize = normalize
self.copy_X = copy_X
self.n_jobs = n_jobs
def fit(self, X, y):
self = super(LinearRegression, self).fit(X, y)
# Calculate SSE (sum of squared errors)
# and SE (standard error)
sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
se = np.array([np.sqrt(np.diagonal(sse * np.linalg.inv(np.dot(X.T, X))))])
# compute the t-statistic for each feature
self.t = self.coef_ / se
# find the p-value for each feature
self.p = np.squeeze(2 * (1 - stat.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1])))
return self
lgd_inputs_stage_2_train = lgd_inputs_stage_2_train[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_2_train = lgd_inputs_stage_2_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
reg_lgd_st_2 = LinearRegression()
# We create an instance of an object from the 'LinearRegression' class.
reg_lgd_st_2.fit(lgd_inputs_stage_2_train, lgd_targets_stage_2_train)
# Estimates the coefficients of the object from the 'LinearRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
feature_name = lgd_inputs_stage_2_train.columns.values
# Stores the names of the columns of a dataframe in a variable.
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_lgd_st_2.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LinearRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg_lgd_st_2.intercept_]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_lgd_st_2.p
# We take the newly added attribute 'p' and store it in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add the value 'NaN' in the beginning of the variable with p-values.
summary_table['p_values'] = p_values.round(3)
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
lgd_inputs_stage_2_test = lgd_inputs_stage_2_test[features_all]
# Here we keep only the variables we need for the model.
lgd_inputs_stage_2_test = lgd_inputs_stage_2_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
lgd_inputs_stage_2_test.columns.values
y_hat_test_lgd_stage_2 = reg_lgd_st_2.predict(lgd_inputs_stage_2_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
lgd_targets_stage_2_test_temp = lgd_targets_stage_2_test
lgd_targets_stage_2_test_temp = lgd_targets_stage_2_test_temp.reset_index(drop = True)
# We reset the index of a dataframe.
pd.concat([lgd_targets_stage_2_test_temp, pd.DataFrame(y_hat_test_lgd_stage_2)], axis = 1).corr()
# We calculate the correlation between actual and predicted values.
corr_mat = pd.concat([lgd_targets_stage_2_test_temp, pd.DataFrame(y_hat_test_lgd_stage_2)], axis = 1).corr()
corr_arr = np.array(corr_mat)
classes = [' ', ' ']
plot_confusion_matrix(corr_arr, classes,
normalize=True,
title='CORRELATION MATRIX - Act Vs Pred Recov Rates',
cmap=plt.cm.RdYlGn)
sns.distplot(lgd_targets_stage_2_test - y_hat_test_lgd_stage_2)
# We plot the distribution of the residuals.
pickle.dump(reg_lgd_st_2, open('lgd_model_stage_2.sav', 'wb'))
# Here we export our model to a 'SAV' file with file name 'lgd_model_stage_2.sav'.
y_hat_test_lgd_stage_2_all = reg_lgd_st_2.predict(lgd_inputs_stage_1_test)
y_hat_test_lgd_stage_2_all
y_hat_test_lgd = y_hat_test_lgd_stage_1 * y_hat_test_lgd_stage_2_all
# Here we combine the predictions of the models from the two stages.
pd.DataFrame(y_hat_test_lgd).describe()
# Shows some descriptive statistics for the values of a column.
pd.DataFrame(y_hat_test_lgd).sum()/pd.DataFrame(y_hat_test_lgd).count()
y_hat_test_lgd = np.where(y_hat_test_lgd < 0, 0, y_hat_test_lgd)
y_hat_test_lgd = np.where(y_hat_test_lgd > 1, 1, y_hat_test_lgd)
# We set predicted values that are greater than 1 to 1 and predicted values that are less than 0 to 0.
pd.DataFrame(y_hat_test_lgd).describe()
# Shows some descriptive statistics for the values of a column.
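As a hedged toy illustration of the two-stage combination above (the values are made up): if the stage-1 classifier predicts that some recovery occurs and the stage-2 regression predicts a recovery rate of 0.35, the combined estimate is 0.35; if stage 1 predicts no recovery, the combined estimate is 0 regardless of stage 2.
toy_stage_1 = np.array([1, 0, 1])          # 1 = some recovery predicted, 0 = no recovery predicted
toy_stage_2 = np.array([0.35, 0.50, 0.10]) # hypothetical stage-2 recovery rate predictions
toy_recovery = toy_stage_1 * toy_stage_2   # combined recovery rate estimates: [0.35, 0.0, 0.10]
toy_lgd = 1 - toy_recovery                 # LGD = 1 - recovery rate
print(toy_lgd)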
# EAD model datasets
ead_inputs_train, ead_inputs_test, ead_targets_train, ead_targets_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['CCF'], test_size = 0.2, random_state = 42)
# Takes a set of inputs and a set of targets as arguments. Splits the inputs and the targets into four dataframes:
# Inputs - Train, Inputs - Test, Targets - Train, Targets - Test.
ead_inputs_train.columns.values
ead_inputs_train = ead_inputs_train[features_all]
# Here we keep only the variables we need for the model.
ead_inputs_train = ead_inputs_train.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
reg_ead = LinearRegression()
# We create an instance of an object from the 'LinearRegression' class.
reg_ead.fit(ead_inputs_train, ead_targets_train)
# Estimates the coefficients of the object from the 'LinearRegression' class
# with inputs (independent variables) contained in the first dataframe
# and targets (dependent variables) contained in the second dataframe.
feature_name = ead_inputs_train.columns.values
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
# Creates a dataframe with a column titled 'Feature name' and row values contained in the 'feature_name' variable.
summary_table['Coefficients'] = np.transpose(reg_ead.coef_)
# Creates a new column in the dataframe, called 'Coefficients',
# with row values the transposed coefficients from the 'LinearRegression' object.
summary_table.index = summary_table.index + 1
# Increases the index of every row of the dataframe with 1.
summary_table.loc[0] = ['Intercept', reg_ead.intercept_]
# Assigns values of the row with index 0 of the dataframe.
summary_table = summary_table.sort_index()
# Sorts the dataframe by index.
p_values = reg_ead.p
# We take the newly added attribute 'p' of the EAD model and store it in a variable 'p_values'.
p_values = np.append(np.nan,np.array(p_values))
# We add the value 'NaN' in the beginning of the variable with p-values.
summary_table['p_values'] = p_values
# In the 'summary_table' dataframe, we add a new column, called 'p_values', containing the values from the 'p_values' variable.
summary_table
ead_inputs_test = ead_inputs_test[features_all]
# Here we keep only the variables we need for the model.
ead_inputs_test = ead_inputs_test.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
ead_inputs_test.columns.values
y_hat_test_ead = reg_ead.predict(ead_inputs_test)
# Calculates the predicted values for the dependent variable (targets)
# based on the values of the independent variables (inputs) supplied as an argument.
ead_targets_test_temp = ead_targets_test
ead_targets_test_temp = ead_targets_test_temp.reset_index(drop = True)
# We reset the index of a dataframe.
pd.concat([ead_targets_test_temp, pd.DataFrame(y_hat_test_ead)], axis = 1).corr()
# We calculate the correlation between actual and predicted values.
corr_mat_2 = pd.concat([ead_targets_test_temp, pd.DataFrame(y_hat_test_ead)], axis = 1).corr()
corr_arr_2 = np.array(corr_mat_2)
classes = [' ', ' ']
plot_confusion_matrix(corr_arr_2, classes,
normalize=True,
title='CORRELATION MATRIX - Act Vs Pred EAD',
cmap=plt.cm.RdYlGn)
sns.distplot(ead_targets_test - y_hat_test_ead)
# We plot the distribution of the residuals.
(ead_targets_test - y_hat_test_ead).mean()
pd.DataFrame(y_hat_test_ead).describe()
# Shows some descriptive statistics for the values of a column.
y_hat_test_ead = np.where(y_hat_test_ead < 0, 0, y_hat_test_ead)
y_hat_test_ead = np.where(y_hat_test_ead > 1, 1, y_hat_test_ead)
# We set predicted values that are greater than 1 to 1 and predicted values that are less than 0 to 0.
pd.DataFrame(y_hat_test_ead).describe()
# Shows some descriptive statistics for the values of a column.
pd.DataFrame(y_hat_test_ead).sum()/pd.DataFrame(y_hat_test_ead).count()
loan_data_preprocessed.head()
loan_data_preprocessed['mths_since_last_delinq'].fillna(0, inplace = True)
# We fill the missing values with zeroes.
loan_data_preprocessed['mths_since_last_record'].fillna(0, inplace = True)
# We fill the missing values with zeroes.
loan_data_preprocessed_lgd_ead = loan_data_preprocessed[features_all]
# Here we keep only the variables we need for the model.
loan_data_preprocessed_lgd_ead = loan_data_preprocessed_lgd_ead.drop(features_reference_cat, axis = 1)
# Here we remove the dummy variable reference categories.
loan_data_preprocessed['recovery_rate_st_1'] = reg_lgd_st_1.model.predict(loan_data_preprocessed_lgd_ead)
# We apply the stage 1 LGD model and calculate predicted values.
loan_data_preprocessed['recovery_rate_st_2'] = reg_lgd_st_2.predict(loan_data_preprocessed_lgd_ead)
# We apply the stage 2 LGD model and calculate predicted values.
loan_data_preprocessed['recovery_rate'] = loan_data_preprocessed['recovery_rate_st_1'] * loan_data_preprocessed['recovery_rate_st_2']
# We combine the predicted values from the stage 1 predicted model and the stage 2 predicted model
# to calculate the final estimated recovery rate.
loan_data_preprocessed['recovery_rate'] = np.where(loan_data_preprocessed['recovery_rate'] < 0, 0, loan_data_preprocessed['recovery_rate'])
loan_data_preprocessed['recovery_rate'] = np.where(loan_data_preprocessed['recovery_rate'] > 1, 1, loan_data_preprocessed['recovery_rate'])
# We set estimated recovery rates that are greater than 1 to 1 and estimated recovery rates that are less than 0 to 0.
loan_data_preprocessed['LGD'] = 1 - loan_data_preprocessed['recovery_rate']
# We calculate estimated LGD. Estimated LGD equals 1 - estimated recovery rate.
loan_data_preprocessed['LGD'].describe()
# Shows some descriptive statistics for the values of a column.
loan_data_preprocessed['CCF'] = reg_ead.predict(loan_data_preprocessed_lgd_ead)
# We apply the EAD model to calculate estimated credit conversion factor.
loan_data_preprocessed['CCF'] = np.where(loan_data_preprocessed['CCF'] < 0, 0, loan_data_preprocessed['CCF'])
loan_data_preprocessed['CCF'] = np.where(loan_data_preprocessed['CCF'] > 1, 1, loan_data_preprocessed['CCF'])
# We set estimated CCF that are greater than 1 to 1 and estimated CCF that are less than 0 to 0.
loan_data_preprocessed['EAD'] = loan_data_preprocessed['CCF'] * loan_data_preprocessed_lgd_ead['funded_amnt']
# We calculate estimated EAD. Estimated EAD equals estimated CCF multiplied by funded amount.
loan_data_preprocessed['EAD'].describe()
# Shows some descriptive statistics for the values of a column.
loan_data_preprocessed.head()
loan_data_inputs_train = pd.read_csv('cr_inp_train.csv', index_col = 0)
# We import data to apply the PD model.
loan_data_inputs_test = pd.read_csv('cr_inp_test.csv', index_col = 0)
# We import data to apply the PD model.
loan_data_inputs_pd = pd.concat([loan_data_inputs_train, loan_data_inputs_test], axis = 0)
# We concatenate the two dataframes along the rows.
loan_data_inputs_pd.shape
loan_data_inputs_pd.head()
features_all_pd = ['grade:A',
'grade:B',
'grade:C',
'grade:D',
'grade:E',
'grade:F',
'grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'home_ownership:OWN',
'home_ownership:MORTGAGE',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'addr_state:NM_VA',
'addr_state:NY',
'addr_state:OK_TN_MO_LA_MD_NC',
'addr_state:CA',
'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN',
'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR',
'addr_state:WI_MT',
'addr_state:TX',
'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS',
'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'purpose:credit_card',
'purpose:debt_consolidation',
'purpose:oth__med__vacation',
'purpose:major_purch__car__home_impr',
'initial_list_status:f',
'initial_list_status:w',
'term:36',
'term:60',
'emp_length:0',
'emp_length:1',
'emp_length:2-4',
'emp_length:5-6',
'emp_length:7-9',
'emp_length:10',
'mths_since_issue_d:<38',
'mths_since_issue_d:38-39',
'mths_since_issue_d:40-41',
'mths_since_issue_d:42-48',
'mths_since_issue_d:49-52',
'mths_since_issue_d:53-64',
'mths_since_issue_d:65-84',
'mths_since_issue_d:>84',
'int_rate:<9.548',
'int_rate:9.548-12.025',
'int_rate:12.025-15.74',
'int_rate:15.74-20.281',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'mths_since_earliest_cr_line:141-164',
'mths_since_earliest_cr_line:165-247',
'mths_since_earliest_cr_line:248-270',
'mths_since_earliest_cr_line:271-352',
'mths_since_earliest_cr_line:>352',
'inq_last_6mths:0',
'inq_last_6mths:1-2',
'inq_last_6mths:3-6',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'acc_now_delinq:>=1',
'annual_inc:<20K',
'annual_inc:20K-30K',
'annual_inc:30K-40K',
'annual_inc:40K-50K',
'annual_inc:50K-60K',
'annual_inc:60K-70K',
'annual_inc:70K-80K',
'annual_inc:80K-90K',
'annual_inc:90K-100K',
'annual_inc:100K-120K',
'annual_inc:120K-140K',
'annual_inc:>140K',
'dti:<=1.4',
'dti:1.4-3.5',
'dti:3.5-7.7',
'dti:7.7-10.5',
'dti:10.5-16.1',
'dti:16.1-20.3',
'dti:20.3-21.7',
'dti:21.7-22.4',
'dti:22.4-35',
'dti:>35',
'mths_since_last_delinq:Missing',
'mths_since_last_delinq:0-3',
'mths_since_last_delinq:4-30',
'mths_since_last_delinq:31-56',
'mths_since_last_delinq:>=57',
'mths_since_last_record:Missing',
'mths_since_last_record:0-2',
'mths_since_last_record:3-20',
'mths_since_last_record:21-31',
'mths_since_last_record:32-80',
'mths_since_last_record:81-86',
'mths_since_last_record:>86']
ref_categories_pd = ['grade:G',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL',
'verification_status:Verified',
'purpose:educ__sm_b__wedd__ren_en__mov__house',
'initial_list_status:f',
'term:60',
'emp_length:0',
'mths_since_issue_d:>84',
'int_rate:>20.281',
'mths_since_earliest_cr_line:<140',
'inq_last_6mths:>6',
'acc_now_delinq:0',
'annual_inc:<20K',
'dti:>35',
'mths_since_last_delinq:0-3',
'mths_since_last_record:0-2']
loan_data_inputs_pd_temp = loan_data_inputs_pd[features_all_pd]
# Here we keep only the variables we need for the model.
loan_data_inputs_pd_temp = loan_data_inputs_pd_temp.drop(ref_categories_pd, axis = 1)
# Here we remove the dummy variable reference categories.
loan_data_inputs_pd_temp.shape
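# The PD model was estimated with one dummy per original variable left out as the reference category, so the
# same dummies are dropped here; keeping them would duplicate information already captured by the intercept
# and misalign the columns with the model's coefficients. A quick consistency check (a sketch, assuming the
# feature order above matches the one used when the model was trained):
loan_data_inputs_pd_temp.shape[1] == len(features_all_pd) - len(ref_categories_pd)
# Should evaluate to True: the number of model inputs equals all dummies minus the reference categories.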
import pickle
reg_pd = pickle.load(open('pd_model.sav', 'rb'))
# We import the PD model, stored in the 'pd_model.sav' file.
reg_pd.model.predict_proba(loan_data_inputs_pd_temp)[:, 0]
# We apply the PD model to calculate estimated default probabilities.
loan_data_inputs_pd['PD'] = reg_pd.model.predict_proba(loan_data_inputs_pd_temp)[:, 0]
# We store the estimated default probabilities in a new 'PD' column.
loan_data_inputs_pd['PD'].head()
loan_data_inputs_pd['PD'].describe()
# Shows some descriptive statistics for the values of a column.
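# predict_proba returns one column of probabilities per class, ordered as in the estimator's classes_
# attribute, and the code above takes column 0 as the probability of default. That assumes the target was
# coded so that the first class is the default class (for example 0 = default, 1 = good). A quick check,
# assuming reg_pd.model wraps a scikit-learn classifier:
reg_pd.model.classes_
# The probability stored in 'PD' corresponds to the first label listed here.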
loan_data_preprocessed_new = pd.concat([loan_data_preprocessed, loan_data_inputs_pd], axis = 1)
# We concatenate, along the columns, the dataframe where we calculated LGD and EAD with the dataframe where we calculated PD.
loan_data_preprocessed_new.shape
loan_data_preprocessed_new.head()
loan_data_preprocessed_new['EL'] = loan_data_preprocessed_new['PD'] * loan_data_preprocessed_new['LGD'] * loan_data_preprocessed_new['EAD']
# We calculate Expected Loss. EL = PD * LGD * EAD.
loan_data_preprocessed_new['EL'].describe()
# Shows some descriptive statistics for the values of a column.
Output = loan_data_preprocessed_new[['funded_amnt', 'PD', 'LGD', 'EAD', 'EL']]
Output = Output.loc[:,~Output.columns.duplicated()]
Output.head()
EAD_LGD = Output['EAD'] * Output['LGD']
# Loss severity per loan: exposure at default multiplied by loss given default.
EAD_LGD.sum()
Weight = EAD_LGD / EAD_LGD.sum()
# Each loan's share of total loss severity, used to weight its PD.
Output['Weight'] = Weight
Wtd_PD = Output['Weight'] * Output['PD']
Output['Wtd_PD'] = Wtd_PD
Output.head()
Output['Wtd_PD'].sum()
# The severity-weighted average probability of default for the portfolio.
EAD_LGD.sum() * Output['Wtd_PD'].sum()
# Portfolio expected credit loss assuming defaults are independent.
Output['EAD'].sum()
EAD_LGD.sum() / Output['EAD'].sum() * Output['EAD'].sum() * Output['Wtd_PD'].sum()
# The same figure expressed as average LGD times total EAD times the weighted average PD.
EAD_LGD.sum() / Output['EAD'].sum()
# Exposure-weighted average LGD for the portfolio.
RR = 1 - (EAD_LGD.sum() / Output['EAD'].sum())
# Expected recovery rate: one minus the exposure-weighted average LGD.
RR
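# To summarise the logic above: total loss severity is the sum of EAD times LGD; each loan's PD is weighted by
# its share of that severity; ECL assuming independence is total severity times the weighted average PD; and
# the recovery rate is one minus the exposure-weighted average LGD. A minimal sketch (a hypothetical helper,
# not part of the original notebook) that packages these portfolio figures:
def portfolio_summary(df):
    # df is assumed to carry 'PD', 'LGD' and 'EAD' columns, as in Output above.
    ead_lgd = df['EAD'] * df['LGD']
    weighted_pd = (ead_lgd / ead_lgd.sum() * df['PD']).sum()
    avg_lgd = ead_lgd.sum() / df['EAD'].sum()
    return {'weighted_avg_pd': weighted_pd,
            'avg_lgd': avg_lgd,
            'recovery_rate': 1 - avg_lgd,
            'ecl_independence': ead_lgd.sum() * weighted_pd}
portfolio_summary(Output)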
import math
Ave_PD = Output['Wtd_PD'].sum()
# Severity-weighted average portfolio probability of default.
Ass_corr_1 = 0.04 * (1 - math.exp(-50 * Ave_PD)) / (1 - math.exp(-50))
Ass_corr_2 = 0.0039 * (1 - (1 - math.exp(-50 * Ave_PD)) / (1 - math.exp(-50)))
Ass_corr = Ass_corr_1 + Ass_corr_2
# Asset correlation as a PD-dependent blend of the two components.
Ass_corr
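# The two terms above make the asset correlation a PD-dependent blend: it approaches 0.0039 as the average PD
# tends to zero and 0.04 as the average PD grows, using the same exponential weighting shape as the Basel IRB
# correlation formulas (the coefficients here are this portfolio's own calibration, not the Basel parameters).
# A hypothetical generalised helper, shown only as a sketch:
def asset_correlation(avg_pd, rho_high = 0.04, rho_low = 0.0039, k = 50):
    w = (1 - math.exp(-k * avg_pd)) / (1 - math.exp(-k))
    # w rises from 0 towards 1 as the average PD increases.
    return rho_high * w + rho_low * (1 - w)
asset_correlation(Ave_PD)
# Returns the same figure as Ass_corr above.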
from scipy.stats import norm
Ave_PD
Eco_Scen = 0.70
# Percentile of the assumed economic scenario for the systematic factor.
a = norm.ppf(Ave_PD)
# Inverse standard normal of the average PD (the default threshold).
b = norm.ppf(Eco_Scen)
# Inverse standard normal of the scenario percentile.
Cor_Coef = Ass_corr
c = (a + math.sqrt(Cor_Coef) * b) / math.sqrt(1 - Cor_Coef)
PD_Corr = norm.cdf(c, loc = 0, scale = 1)
# Portfolio probability of default conditional on the economic scenario, given the asset correlation.
PD_Corr
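# The calculation above is the Vasicek single-factor formula for the conditional default probability:
# PD_scenario = N((N^-1(PD) + sqrt(rho) * N^-1(scenario)) / sqrt(1 - rho)), where the scenario percentile
# stands in for the systematic economic factor. A hypothetical helper to evaluate it for any scenario,
# shown only as a sketch:
def conditional_pd(avg_pd, rho, scenario_quantile):
    numerator = norm.ppf(avg_pd) + math.sqrt(rho) * norm.ppf(scenario_quantile)
    return norm.cdf(numerator / math.sqrt(1 - rho))
conditional_pd(Ave_PD, Ass_corr, Eco_Scen)
# Returns the same figure as PD_Corr above.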
title = "CREDIT CARD RECEIVABLES PORTFOLIO - RISK SUMMARY"
bolded_title = "\033[34;1;4m" + title + "\033[0m"
formatted_PD = "{:.4f}".format(Output['Wtd_PD'].sum())
formatted_EAD = "{:,.0f}".format(Output['EAD'].sum())
formatted_RR = "{:.4f}".format(1- (EAD_LGD.sum()/Output['EAD'].sum()))
formatted_ECL= "{:,.0f}".format(Output['EL'].sum())
formatted_ECL_pc = "{:.4f}".format(Output['EL'].sum() / Output['funded_amnt'].sum())
formatted_FA= "{:,.0f}".format(Output['funded_amnt'].sum())
formatted_AC= "{:,.4f}".format(Ass_corr)
formatted_Eco= "{:,.3f}".format(Eco_Scen)
formatted_PD_Corr = "{:,.4f}".format(PD_Corr)
formatted_ECL_Corr = "{:,.0f}".format(EAD_LGD.sum() / Output['EAD'].sum() * Output['EAD'].sum() * PD_Corr)
# ECL assuming correlation: average LGD times total EAD times the scenario-conditional PD.
formatted_ECL_Corr_pc = "{:.4f}".format(EAD_LGD.sum() / Output['EAD'].sum() * Output['EAD'].sum() * PD_Corr / Output['funded_amnt'].sum())
print(" ")
print(" ", bolded_title)
print(" ")
print("------------------------------------------------------------------")
print("Current Funded Amount : " , formatted_FA)
print("------------------------------------------------------------------")
print("Weighted Average Probability of Default : " , formatted_PD)
print("Expected Exposure at Default : " , formatted_EAD)
print("Expected Recovery Rate : " , formatted_RR)
print("\033[1mExpected Credit Loss Assuming Independence\033[0m : " , "\033[1m" + formatted_ECL + "\033[0m")
print("\033[1mECL Assuming Independence ÷ Funded Amount\033[0m : " , "\033[1m" + formatted_ECL_pc + "\033[0m")
print("------------------------------------------------------------------")
print("Asset Correlation : " , formatted_AC)
print("Economic Scenario : " , formatted_Eco)
print("Portfolio Prob. Default Assuming Correlation : " , formatted_PD_Corr)
print("\033[1mExpected Credit Loss Assuming Correlation\033[0m : " , "\033[1m" + formatted_ECL_Corr + "\033[0m")
print("\033[1mECL Assuming Correlation ÷ Funded Amount\033[0m : " , "\033[1m" + formatted_ECL_Corr_pc + "\033[0m")
print("__________________________________________________________________")
print(" ")