toad标准化评分库

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

toad标准化评分库
toad是由厚本⾦融风控团队内部孵化，后开源并坚持维护的标准化评分卡库。

其功能全⾯、性能稳健、运⾏速度快、问题反馈后维护迅速、深受同⾏喜爱。

如果有些⼩伙伴没有⼀些标准化的信⽤评分开发⼯具或者企业级的定制化脚本，toad应该会极⼤的节省⼤家的时间
⼀、加载所需模块
import pandas as pd
from sklearn.metrics import roc_auc_score,roc_curve,auc #常见的分类评分标准
from sklearn.model_selection import train_test_split #切分数据
from sklearn.linear_model import LogisticRegression #逻辑回归
from sklearn.model_selection import GridSearchCV as gscv #⽹格搜索
from sklearn.neighbors import KNeighborsClassifier #k近邻
import numpy as np
import glob
import math
import xgboost as xgb
import toad
pc.obj_info(toad)
ObjInfo object of :
模块：['c_utils', 'detector', 'metrics', 'scorecard', 'selection', 'stats', 'transform', 'utils', 'version']
类/对象：['ScoreCard']
函数/⽅法：['F1', 'KS', 'KS_bucket', 'VIF', 'WOE', 'detect', 'entropy', 'gini', 'quality', 'select']
属性：['ChiMerge', 'DTMerge', 'IV', 'KMeansMerge', 'QuantileMerge', 'StepMerge', 'VERSION', 'entropy_cond', 'gini_cond', 'merge']
⼆、加载数据
#加载数据path = "D:/风控模型/data/"
data_all = pd.read_csv(path+"data.txt",engine='python',index_col=False)
data_all_woe = pd.read_csv(path+"ccard_all_woe.txt",engine='python',index_col=False)
#指定不参与训练列名
ex_lis = ['uid','obs_mth','ovd_dt','samp_type','weight',
'af30_status','submit_time','bad_ind']
#参与训练列名
ft_lis = list(data_all.columns)
for i in ex_lis:
ft_lis.remove(i)
三、划分训练集和测试集
#训练集与跨时间验证集合
dev = data_all[(data_all['samp_type'] == 'dev') |
(data_all['samp_type'] == 'val') |
(data_all['samp_type'] == 'off1') ]
off = data_all[data_all['samp_type'] == 'off2']
四、EDA
探索性数据分析同时处理数值型和字符型
a = toad.detector.detect(data_all)
a.head(8)
pc.obj_info(toad.detector)
ObjInfo object of :
模块：['pd']
函数/⽅法：['countBlank', 'detect', 'getDescribe', 'getTopValues', 'isNumeric']
五、特征刷选
empty：缺失率上限
iv：信息量
corr：相关系数⼤于阈值，则删除IV⼩的特征
return_drop：返回删除特征
exclude：不参与筛选的变量名
pc.obj_info(toad.selection)
ObjInfo object of :
模块：['np', 'pd', 'stats']
类/对象：['StatsModel']
函数/⽅法：['AIC', 'AUC', 'BIC', 'KS', 'MSE', 'VIF', 'drop_corr', 'drop_empty', 'drop_iv', 'drop_var', 'drop_vif', 'select', 'split_target', 'stepwise', 'to_ndarray', 'unpack_tuple']
属性：['INTERCEPT_COLS', 'IV']
dev_slct1, drop_lst= toad.selection.select(dev,dev['bad_ind'], empty = 0.7,
iv = 0.02, corr = 0.7, return_drop=True, exclude=ex_lis)
print("keep:",dev_slct1.shape[1],
"drop empty:",len(drop_lst['empty']),
"drop iv:",len(drop_lst['iv']),
"drop corr:",len(drop_lst['corr']))
keep: 584
drop empty: 637
drop iv: 1961
drop corr: 2043
dev_slct2, drop_lst= toad.selection.select(dev_slct1,dev_slct1['bad_ind'], empty = 0.6,
iv = 0.02, corr = 0.7, return_drop=True, exclude=ex_lis)
print("keep:",dev_slct2.shape[1],
"drop empty:",len(drop_lst['empty']),
"drop iv:",len(drop_lst['iv']),
"drop corr:",len(drop_lst['corr']))
keep: 560
drop empty: 24
drop iv: 0
drop corr: 0
help(toad.selection.select)
Help on function select in module toad.selection:
select(frame, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None)
select features by rate of empty, iv and correlation
Args:
frame (DataFrame)
target (str): target's name in dataframe
empty (number): drop the features which empty num is greater than threshold. if threshold is float, it will be use as percentage
iv (float): drop the features whose IV is less than threshold
corr (float): drop features that has the smallest IV in each groups which correlation is greater than threshold
return_drop (bool): if need to return features' name who has been dropped
exclude (array-like): list of feature name that will not be dropped
Returns:
DataFrame: selected dataframe
dict: list of dropped feature names in each step
六、分箱
先找到分箱的阈值
分箱阈值的⽅法（method）包括：'chi','dt','quantile','step','kmeans'
然后利⽤分箱阈值进⾏粗分箱。

pc.obj_info(toad.transform)
ObjInfo object of :
模块：['copy', 'math', 'np', 'pd']
类/对象：['BinsMixin', 'Combiner', 'GBDTTransformer', 'GradientBoostingClassifier', 'OneHotEncoder', 'RulesMixin', 'Transformer', 'TransformerMixin', 'WOETransformer', 'frame_exclude', 'select_dtypes']函数/⽅法：['WOE', 'bin_by_splits', 'np_count', 'probability', 'split_target', 'to_ndarray', 'wraps']
属性：['merge']
#得到切分节点
combiner = biner()
combiner.fit(dev_slct2,dev_slct2['bad_ind'],method='chi',min_samples = 0.05,
exclude=ex_lis)
#导出箱的节点
bins = combiner.export()
#根据节点实施分箱
dev_slct3 = combiner.transform(dev_slct2)
off3 = combiner.transform(off[dev_slct2.columns])
#分箱后通过画图观察,x分箱后某个字段
from toad.plot import bin_plot,badrate_plot
bin_plot(dev_slct3,x='p_ovpromise_6mth',target='bad_ind')
bin_plot(off3,x='p_ovpromise_6mth',target='bad_ind')
pc.obj_info(toad.plot)
ObjInfo object of :
模块：['np', 'pd']
函数/⽅法：['AUC', 'add_annotate', 'add_text', 'badrate_plot', 'bin_plot', 'corr_plot', 'generate_str', 'proportion_plot', 'reset_ylim', 'roc_plot', 'unpack_tuple']属性：['HEATMAP_CMAP', 'IV', 'MAX_STYLE', 'tadpole']
后2箱不单调
#查看单箱节点 [0.0, 24.0, 60.0, 100.0]
bins['p_ovpromise_6mth']
合并最后两箱
adj_bin = {'p_ovpromise_6mth': [0.0, 24.0, 60.0]}
combiner.set_rules(adj_bin)
dev_slct3 = combiner.transform(dev_slct2)
off3 = combiner.transform(off[dev_slct2.columns])
bin_plot(dev_slct3,x='p_ovpromise_6mth',target='bad_ind')
bin_plot(off3,x='p_ovpromise_6mth',target='bad_ind')
对⽐不同数据集上特征的badrate图是否有交叉
data = pd.concat([dev_slct3,off3],join='inner')
badrate_plot(data, x='samp_type', target='bad_ind', by='p_ovpromise_6mth')
没有交叉，因此该特征的分组不需要再进⾏合并。

篇幅有限，不对所有特征的精细化调整做展⽰。

接下来进⾏WOE映射#
t=toad.transform.WOETransformer()
dev_slct2_woe = t.fit_transform(dev_slct3,dev_slct3['bad_ind'], exclude=ex_lis)
off_woe = t.transform(off3[dev_slct3.columns])
data = pd.concat([dev_slct2_woe,off_woe])
通过稳定性筛选特征。

计算训练集与跨时间验证集的PSI。

删除PSI⼤于0.05的特征
#(41199, 476)
psi_df = toad.metrics.PSI(dev_slct2_woe, off_woe).sort_values(0)
psi_df = psi_df.reset_index()
psi_df = psi_df.rename(columns = {'index' : 'feature',0:'psi'})
psi005 = list(psi_df[psi_df.psi<0.05].feature)
for i in ex_lis:
if i in psi005:
pass
else:
psi005.append(i)
data = data[psi005]
dev_woe_psi = dev_slct2_woe[psi005]
off_woe_psi = off_woe[psi005]
print(data.shape)
查看这个对象
pc.obj_info(toad.metrics)
ObjInfo object of :
模块：['np', 'pd']
类/对象：['Combiner']
函数/⽅法：['AIC', 'AUC', 'BIC', 'F1', 'KS', 'KS_bucket', 'KS_by_col', 'MSE', 'PSI', 'SSE', 'bin_by_splits', 'f1_score', 'feature_splits', 'iter_df', 'ks_2samp', 'matrix', 'roc_auc_score', 'roc_curve', 'unpack_tuple']属性：['merge']
由于分箱后变量之间的共线性会变强，通过相关性再次筛选特征
#
dev_woe_psi2, drop_lst= toad.selection.select(dev_woe_psi,dev_woe_psi['bad_ind'], empty = 0.6,
iv = 0.02, corr = 0.5, return_drop=True, exclude=ex_lis)
print("keep:",dev_woe_psi2.shape[1],
"drop empty:",len(drop_lst['empty']),
"drop iv:",len(drop_lst['iv']),
"drop corr:",len(drop_lst['corr']))
keep: 85
drop empty: 0
drop iv: 56
drop corr: 335
接下来通过逐步回归进⾏最终的特征筛选。

检验⽅法（criterion）：
'aic'
'bic'
七、检验模型（estimator）
'ols': LinearRegression,
'lr': LogisticRegression,
'lasso': Lasso,
'ridge': Ridge,
#(41199, 33)
dev_woe_psi_stp = toad.selection.stepwise(dev_woe_psi2,
dev_woe_psi2['bad_ind'],
exclude = ex_lis,
direction = 'both',
criterion = 'aic',
estimator = 'ols',
intercept = False)
off_woe_psi_stp = off_woe_psi[dev_woe_psi_stp.columns]
data = pd.concat([dev_woe_psi_stp,off_woe_psi_stp])
data.shape
接下来定义双向逻辑回归和检验模型XGBoost
#定义逻辑回归
def lr_model(x,y,offx,offy,C):
model = LogisticRegression(C=C,class_weight='balanced')
model.fit(x,y)
y_pred = model.predict_proba(x)[:,1]
fpr_dev,tpr_dev,_ = roc_curve(y,y_pred)
train_ks = abs(fpr_dev - tpr_dev).max()
print('train_ks : ',train_ks)
y_pred = model.predict_proba(offx)[:,1]
fpr_off,tpr_off,_ = roc_curve(offy,y_pred)
off_ks = abs(fpr_off - tpr_off).max()
print('off_ks : ',off_ks)
from matplotlib import pyplot as plt
plt.plot(fpr_dev,tpr_dev,label = 'train')
plt.plot(fpr_off,tpr_off,label = 'off')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc = 'best')
plt.show()
#定义xgboost辅助判断盘⽛鞥特征交叉是否有必要
def xgb_model(x,y,offx,offy):
model = xgb.XGBClassifier(learning_rate=0.05,
n_estimators=400,
max_depth=3,
class_weight='balanced',
min_child_weight=1,
subsample=1,
objective="binary:logistic",
nthread=-1,
scale_pos_weight=1,
random_state=1,
n_jobs=-1,
reg_lambda=300)
model.fit(x,y)
print('>>>>>>>>>')
y_pred = model.predict_proba(x)[:,1]
fpr_dev,tpr_dev,_ = roc_curve(y,y_pred)
train_ks = abs(fpr_dev - tpr_dev).max()
print('train_ks : ',train_ks)
y_pred = model.predict_proba(offx)[:,1]
fpr_off,tpr_off,_ = roc_curve(offy,y_pred)
off_ks = abs(fpr_off - tpr_off).max()
print('off_ks : ',off_ks)
from matplotlib import pyplot as plt
plt.plot(fpr_dev,tpr_dev,label = 'train')
plt.plot(fpr_off,tpr_off,label = 'off')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc = 'best')
plt.show()
#模型训练
def c_train(data,dep='bg_result_compensate',exclude=None):
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
#变量名
lis = list(data.columns)
for i in exclude:
lis.remove(i)
data[lis] = std_scaler.fit_transform(data[lis])
devv = data[(data['samp_type']=='dev') | (data['samp_type']=='val')]
offf = data[(data['samp_type']=='off1') | (data['samp_type']=='off2') ]
x,y = devv[lis],devv[dep]
offx,offy = offf[lis],offf[dep]
#逻辑回归正向
lr_model(x,y,offx,offy,0.1)
#逻辑回归反向
lr_model(offx,offy,x,y,0.1)
#XGBoost正向
xgb_model(x,y,offx,offy)
#XGBoost反向
xgb_model(offx,offy,x,y)
在特征精细化分箱后，xgboost模型的KS明显⾼于LR，则特征交叉是有必要的。

需要返回特征⼯程过程进⾏特征交叉衍⽣。

两模型KS接近代表特征交叉对模型没有明显提升。

反向模型KS代表模型最⾼可能达到的结果。

如果反向训练集效果较差，说明跨时间验证集本⾝分布较为特殊，应当重新划分数据。

#
c_train(data,dep='bad_ind',exclude=ex_lis)
#模型训练
dep = 'bad_ind'
lis = list(data.columns)
for i in ex_lis:
lis.remove(i)
devv = data[(data['samp_type']=='dev') | (data['samp_type']=='val')]
offf = data[(data['samp_type']=='off1') | (data['samp_type']=='off2') ]
x,y = devv[lis],devv[dep]
offx,offy = offf[lis],offf[dep]
lr = LogisticRegression()
lr.fit(x,y)
分别计算：F1分数 KS值 AUC值
from toad.metrics import KS, F1, AUC
prob_dev = lr.predict_proba(x)[:,1]
print('训练集')
print('F1:', F1(prob_dev,y))
print('KS:', KS(prob_dev,y))
print('AUC:', AUC(prob_dev,y))
prob_off = lr.predict_proba(offx)[:,1]
print('跨时间')
print('F1:', F1(prob_off,offy))
print('KS:', KS(prob_off,offy))
print('AUC:', AUC(prob_off,offy))
训练集
F1: 0.30815569972196477
KS: 0.2819389063516508
AUC: 0.6908879633467695
跨时间
F1: 0.2848354792560801
KS: 0.23181102640650808
AUC: 0.6522823050763138
计算模型PSI和变量PSI，两个⾓度衡量稳定性
print('模型PSI:',toad.metrics.PSI(prob_dev,prob_off))
print('特征PSI:','\n',toad.metrics.PSI(x,offx).sort_values(0))
模型PSI: 0.022260098554531284
特征PSI:
⽣产模型KS报告
off_bucket = toad.metrics.KS_bucket(prob_off,offy,bucket=10,method='quantile')
off_bucket
⽣产评分卡。

⽀持传⼊所有的模型参数，以及Fico分数校准的基础分与pdo（point of double odds），我⼀直管pdo叫步长...orz pc.obj_info(toad.scorecard)
ObjInfo object of :
模块：['np', 'pd', 're']
类/对象：['BaseEstimator', 'BinsMixin', 'Combiner', 'LogisticRegression', 'RulesMixin', 'ScoreCard', 'WOETransformer']
函数/⽅法：['bin_by_splits', 'read_json', 'save_json', 'to_ndarray']
属性：['FACTOR_EMPTY', 'FACTOR_UNKNOWN', 'NUMBER_EMPTY', 'NUMBER_INF']
from toad.scorecard import ScoreCard
card = ScoreCard(combiner = combiner, transer = t,class_weight = 'balanced',C=0.1,base_score = 600,base_odds = 35 ,pdo = 60,rate = 2) card.fit(x,y)
final_card = card.export(to_frame = True)
final_card.head(8)。