Predicting Personal Loan Default Risk for a Bank with Logistic Regression
This post describes a logistic-regression system for predicting personal loan default risk at a bank. To address slow, subjective manual review and its high miss rate, it builds an automated assessment pipeline: multi-source data (loan applications, credit bureau records, social security, etc.) is cleaned, features are engineered along three axes (repayment ability, repayment willingness, stability), and SMOTE handles class imbalance. Models are trained with Scikit-learn and evaluated with AUC-ROC and the KS statistic. A Flask API serves predictions, automating the full flow from application to risk rating, with the goals of cutting default losses and speeding up review.
Business pain points: a city commercial bank originates over 50 billion CNY in personal loans per year. Manual review relies on individual judgment and suffers from three problems:
- Slow: a single review takes 2-3 days, and applications pile up in peak season
- Subjective: reviewers judge criteria such as "income stability" differently, leading to inconsistent risk appetite
- High miss rate: historically, loans passing manual review still show a 3.2% first-year default rate, costing over 50 million CNY per year
Project goal: build a logistic-regression default-risk model that automates the mapping from application data to default probability, with targets:
- Model performance: AUC-ROC ≥ 0.85, KS ≥ 0.4 (separating defaulting from performing customers)
- Business efficiency: review time cut from 3 days to 2 hours, manual-intervention rate halved
- Risk control: first-year default rate below 2.8%, annual loss reduction ≥ 30 million CNY
Development environment and toolchain
- Language: Python 3.9
- Data processing: Pandas 1.5+, NumPy 1.23+, Imbalanced-learn (SMOTE)
- Model training: Scikit-learn 1.2+ (logistic regression), SHAP (feature-importance explanations)
- Experiment tracking: MLflow (parameters / metrics / models)
- Serving: Flask 2.3+, Gunicorn (WSGI server), Docker 24.0+
- Version control: Git + DVC (data versioning)
- Monitoring: Prometheus (metrics collection) + Grafana (dashboards)
Data preparation and feature engineering
(1) Raw data layout (examples)
① Loan application table (loan_applications.csv)
[Figure: loan application table schema]
② Central-bank credit records (credit_records.csv)
[Figure: credit records schema]
③ Historical loan database (historical_loans.csv, contains the label)
[Figure: historical loan schema]
④ Third-party data (social security / housing fund, social_security.csv)
[Figure: social security schema]
(2) Data cleaning and feature engineering
Missing values:
- income_monthly: for freelancers, fill with social-security contribution base × an industry multiplier (e.g. 1.2 for the internet industry)
- overdue_times_1y: applicants with no credit record get 0 (treated as thin-file customers)
Outliers:
- debt_ratio > 1 (insolvent) is treated as invalid and replaced with the median of the same occupation group
- age < 22 or > 65 (outside the usual working range): flagged as high risk and handled as a separate group
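The two outlier rules above can be sketched in pandas (a minimal illustration on made-up rows; column names follow the tables above):

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["互联网", "互联网", "公务员", "互联网"],
    "debt_ratio": [0.4, 1.6, 0.3, 0.5],   # 1.6 > 1: insolvent, treated as invalid
    "age": [30, 70, 45, 21],
})

# debt_ratio > 1 → replace with the occupation-group median of the valid rows
valid_median = df[df["debt_ratio"] <= 1].groupby("occupation")["debt_ratio"].median()
bad = df["debt_ratio"] > 1
df.loc[bad, "debt_ratio"] = df.loc[bad, "occupation"].map(valid_median)

# age outside 22-65 → flag as high risk and keep as a separate group
df["age_high_risk"] = (~df["age"].between(22, 65)).astype(int)

print(df["debt_ratio"].tolist())      # → [0.4, 0.45, 0.3, 0.5]
print(df["age_high_risk"].tolist())   # → [0, 1, 0, 1]
```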
Feature extraction:
[Figure: feature-extraction flow]
Feature encoding and standardization
- Occupation categories (internet, civil servant, freelancer, ...) via LabelEncoder
- Loan purposes (home renovation, education, startup funding, ...) via One-Hot encoding
- Income and social-security years standardized with StandardScaler
- SMOTE oversampling for class imbalance
Processed data (feature matrix)
[Figure: processed feature matrix]
Code structure
credit_risk_prediction/
├── data/
│ ├── raw/
│ │ ├── loan_applications.csv
│ │ ├── credit_records.csv
│ │ ├── historical_loans.csv
│ │ └── social_security.csv
│ ├── processed/
│ │ └── features_train.parquet
│ └── external/
│ └── occupation_stability_map.json
├── src/
│ ├── data_processing/
│ │ ├── __init__.py
│ │ ├── clean_data.py
│ │ └── feature_engineering.py
│ ├── model/
│ │ ├── __init__.py
│ │ ├── train.py
│ │ ├── evaluate.py
│ │ ├── explain.py
│ │ └── predict.py
│ ├── api/
│ │ ├── app.py
│ │ └── schemas.py
│ └── utils/
│ ├── logger.py
│ ├── config.py
│ └── metrics.py
├── tests/
│ ├── test_feature_engineering.py
│ └── test_model.py
├── docker/
│ ├── Dockerfile
│ └── requirements.txt
├── mlruns/
├── README.md
└── requirements.txt
Data cleaning and feature engineering (src/data_processing/feature_engineering.py)
Repayment ability: measures the borrower's repayment pressure; the higher the ratio, the higher the risk
- loan_amount / (loan_term * 12): base monthly payment (principal only)
- divided by income_monthly: the debt-service ratio
- debt_ratio bins: 0-30% (0), 30-50% (1), 50-70% (2), 70-100% (3)
- max_overdue_days bins: none (0), 1-30 days (1), 31-90 days (2), over 90 days (3)
- occupation mapped to a stability score (occ_map is a preset mapping dictionary)
- unmapped occupations default to 0 (least stable)
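The binning rules can be verified directly with pd.cut (toy values, one per bucket):

```python
import numpy as np
import pandas as pd

debt_ratio = pd.Series([0.1, 0.35, 0.6, 0.9])
debt_bin = pd.cut(debt_ratio, bins=[0, 0.3, 0.5, 0.7, 1], labels=[0, 1, 2, 3])
print(debt_bin.tolist())  # → [0, 1, 2, 3]

max_overdue_days = pd.Series([0, 15, 45, 120])
overdue_bin = pd.cut(max_overdue_days, bins=[-1, 0, 30, 90, np.inf], labels=[0, 1, 2, 3])
print(overdue_bin.tolist())  # → [0, 1, 2, 3]
```

Note that pd.cut uses right-closed intervals by default, so the -1 lower edge is what lets "0 overdue days" fall into the first bucket.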
Merge order: apps_df (application data) → joined with ss_df (social-security / employment data) → joined with credit_df (credit data) → joined with hist_df (historical performance, contains the label)
import pandas as pd
import numpy as np
import json
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from imblearn.over_sampling import SMOTE
from src.utils.logger import get_logger
logger = get_logger(__name__)
def load_occupation_stability_map(path: str) -> dict:
    """Load the occupation-stability map (external config)."""
    with open(path, "r") as f:
        return json.load(f)
def feature_engineering(raw_data_dir: str, external_dir: str, output_path: str):
    """
    Feature-engineering entry point: merge raw sources → clean → derive
    features → encode → scale → oversample → save.

    Args:
        raw_data_dir: directory with the four raw CSVs
        external_dir: directory with occupation_stability_map.json
        output_path: parquet path for the processed feature matrix
    Returns:
        (processed_df, scaler, smote): feature matrix with label, fitted
        StandardScaler (reused at prediction time), fitted SMOTE object
    """
    apps_df = pd.read_csv(f"{raw_data_dir}/loan_applications.csv")
    credit_df = pd.read_csv(f"{raw_data_dir}/credit_records.csv")
    hist_df = pd.read_csv(f"{raw_data_dir}/historical_loans.csv")
    ss_df = pd.read_csv(f"{raw_data_dir}/social_security.csv")
    occ_map = load_occupation_stability_map(f"{external_dir}/occupation_stability_map.json")

    # Missing income: freelancers with a social-security record get base * 1.2
    # (internet-industry multiplier); everyone else falls back to the median.
    ss_base = ss_df.set_index("applicant_id")["social_security_base"]
    income_median = apps_df["income_monthly"].median()

    def fill_income(row):
        if not pd.isna(row["income_monthly"]):
            return row["income_monthly"]
        if row["occupation"] == "自由职业" and row["applicant_id"] in ss_base.index:
            return ss_base.loc[row["applicant_id"]] * 1.2
        return income_median

    apps_df["income_monthly"] = apps_df.apply(fill_income, axis=1)

    # Invalid debt ratios (>1, insolvent): replace with the occupation-group median.
    credit_df["occupation"] = credit_df["applicant_id"].map(apps_df.set_index("applicant_id")["occupation"])
    occ_median_debt = credit_df.loc[credit_df["debt_ratio"] <= 1].groupby("occupation")["debt_ratio"].median()
    invalid = credit_df["debt_ratio"] > 1
    credit_df.loc[invalid, "debt_ratio"] = credit_df.loc[invalid, "occupation"].map(occ_median_debt)
    credit_df.drop(columns=["occupation"], inplace=True)

    # Repayment ability: rough monthly payment (principal * 1.1) over income.
    apps_df["debt_service_ratio"] = (apps_df["loan_amount"] / (apps_df["loan_term"] * 12)) * 1.1 / apps_df["income_monthly"]
    # Repayment willingness: debt-ratio and overdue bins.
    credit_df["debt_ratio_bin"] = pd.cut(credit_df["debt_ratio"], bins=[0, 0.3, 0.5, 0.7, 1], labels=[0, 1, 2, 3])
    credit_df["overdue_flag"] = (credit_df["overdue_times_1y"] > 0).astype(int)
    credit_df["max_overdue_bin"] = pd.cut(credit_df["max_overdue_days"], bins=[-1, 0, 30, 90, np.inf], labels=[0, 1, 2, 3])
    # Stability: map occupation to a score; unknown occupations get 0 (least stable).
    apps_df["occupation_stability"] = apps_df["occupation"].map(occ_map).fillna(0)

    merged_df = (
        apps_df.merge(ss_df, on="applicant_id", how="left")
        .merge(credit_df, on="applicant_id", how="left")
        .merge(hist_df[["applicant_id", "default_label"]], on="applicant_id", how="left")
    )

    le = LabelEncoder()
    merged_df["occupation_stability_enc"] = le.fit_transform(merged_df["occupation_stability"])
    ohe = OneHotEncoder(sparse_output=False, drop="first")
    purpose_ohe = ohe.fit_transform(merged_df[["purpose"]])
    purpose_cols = [f"purpose_{cat}" for cat in ohe.categories_[0][1:]]
    # Attach the one-hot columns so the feature selection below can see them.
    merged_df[purpose_cols] = pd.DataFrame(purpose_ohe, columns=purpose_cols, index=merged_df.index)

    continuous_features = ["income_monthly", "social_security_years"]
    scaler = StandardScaler()
    merged_df[continuous_features] = scaler.fit_transform(merged_df[continuous_features])
    merged_df.rename(columns={"income_monthly": "income_norm", "social_security_years": "social_security_years_norm"}, inplace=True)

    features = ["income_norm", "debt_service_ratio", "debt_ratio_bin", "overdue_flag",
                "occupation_stability_enc", "social_security_years_norm", "historical_default_flag"] + purpose_cols
    X = merged_df[features]
    y = merged_df["default_label"]

    smote = SMOTE(random_state=42, sampling_strategy=1.0)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    processed_df = pd.concat([X_resampled.reset_index(drop=True), y_resampled.reset_index(drop=True)], axis=1)
    processed_df.to_parquet(output_path, index=False)
    logger.info(f"Feature engineering done; saved to {output_path}; {len(processed_df)} rows ({int(y.sum())} positives before SMOTE, {int(y_resampled.sum())} after)")
    return processed_df, scaler, smote
Model training and evaluation (src/model/train.py & src/model/evaluate.py)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from src.utils.logger import get_logger
import mlflow
import joblib
logger = get_logger(__name__)
def train_logistic_regression(features_path: str, test_size: float = 0.2, random_state: int = 42, C: float = 1.0):
    """
    Train a logistic-regression model with L2 regularization.

    Args:
        features_path: path to the feature matrix (parquet)
        test_size: test-set fraction
        random_state: random seed
        C: inverse regularization strength (smaller C = stronger regularization)
    Returns:
        model: trained logistic-regression model
        X_test, y_test: test-set features and labels
    """
    df = pd.read_parquet(features_path)
    X = df.drop(columns=["default_label"])
    y = df["default_label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    model = LogisticRegression(
        penalty="l2",
        C=C,
        solver="liblinear",
        class_weight="balanced",
        random_state=random_state,
    )
    model.fit(X_train, y_train)
    with mlflow.start_run():
        mlflow.log_param("model", "LogisticRegression")
        mlflow.log_param("penalty", "l2")
        mlflow.log_param("C", C)
        mlflow.log_metric("train_accuracy", model.score(X_train, y_train))
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
        mlflow.sklearn.log_model(model, "logistic_regression_model")
    logger.info(f"Training done; C={C}, test accuracy={model.score(X_test, y_test):.4f}")
    # The fitted StandardScaler is persisted by the feature-engineering step
    # (model/scaler.pkl); only the model is saved here.
    joblib.dump(model, "model/logistic_regression_model.pkl")
    return model, X_test, y_test
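The same training configuration runs end to end on synthetic data (a toy stand-in for the project's feature matrix, not its real pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem (~10% positives), like the default label.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(penalty="l2", C=1.0, solver="liblinear",
                           class_weight="balanced", random_state=42)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc, 3))  # separable synthetic data typically scores well above 0.85
```

class_weight="balanced" reweights the loss by inverse class frequency, which is why the recipe tolerates the residual imbalance even after SMOTE.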
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
from src.utils.metrics import calculate_ks
from src.utils.logger import get_logger
logger = get_logger(__name__)
def evaluate_model(model, X_test, y_test):
    """
    Evaluate model performance (AUC-ROC / KS / confusion matrix).

    Args:
        model: trained model
        X_test, y_test: test-set features and labels
    Returns:
        metrics: dict of evaluation metrics
    Acceptance criteria: test AUC-ROC ≥ 0.85, KS ≥ 0.4, precision ≥ 0.75, recall ≥ 0.60.
    """
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    auc = roc_auc_score(y_test, y_pred_proba)
    ks = calculate_ks(y_test, y_pred_proba)
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    metrics = {"AUC-ROC": auc, "KS": ks, "Precision": precision, "Recall": recall,
               "ConfusionMatrix": {"TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp)}}
    logger.info(f"Evaluation results: {metrics}")
    return metrics
def calculate_ks(y_true, y_pred_proba):
    """KS statistic (src/utils/metrics.py): max gap between the cumulative bad and good capture curves."""
    df = pd.DataFrame({"y_true": y_true, "y_pred_proba": y_pred_proba}).sort_values("y_pred_proba", ascending=False)
    df["cum_good"] = (1 - df["y_true"]).cumsum() / (1 - df["y_true"]).sum()
    df["cum_bad"] = df["y_true"].cumsum() / df["y_true"].sum()
    return (df["cum_bad"] - df["cum_good"]).max()
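When scores are all distinct, the KS value equals the maximum gap between TPR and FPR along the ROC curve, which gives a quick cross-check of calculate_ks (synthetic scores; the helper is restated here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

def calculate_ks(y_true, y_pred_proba):
    # Same definition as in src/utils/metrics.py above.
    df = pd.DataFrame({"y_true": y_true, "y_pred_proba": y_pred_proba}).sort_values("y_pred_proba", ascending=False)
    cum_good = (1 - df["y_true"]).cumsum() / (1 - df["y_true"]).sum()
    cum_bad = df["y_true"].cumsum() / df["y_true"].sum()
    return (cum_bad - cum_good).max()

rng = np.random.default_rng(0)
y_true = np.array([0] * 80 + [1] * 20)
# Defaulters get higher scores on average, so the two curves separate.
scores = np.where(y_true == 1, rng.uniform(0.4, 1.0, 100), rng.uniform(0.0, 0.7, 100))

fpr, tpr, _ = roc_curve(y_true, scores)
ks_roc = (tpr - fpr).max()
print(round(ks_roc, 4), round(calculate_ks(y_true, scores), 4))
```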
Serving the model (Flask API, src/api/app.py)
from flask import Flask, request, jsonify
import joblib
import pandas as pd
from src.utils.logger import get_logger
from pydantic import BaseModel, Field
logger = get_logger(__name__)
app = Flask(__name__)
model = joblib.load("model/logistic_regression_model.pkl")
scaler = joblib.load("model/scaler.pkl")
occupation_stability_map = {"公务员": 3, "事业单位": 3, "国企": 2, "民企": 1, "自由职业": 0}
class PredictionRequest(BaseModel):
    applicant_id: str = Field(..., description="applicant ID")
    age: int = Field(..., ge=22, le=65, description="age (22-65)")
    occupation: str = Field(..., description="occupation")
    income_monthly: float = Field(..., gt=0, description="monthly income (CNY)")
    loan_amount: float = Field(..., gt=0, description="loan amount (CNY)")
    loan_term: int = Field(..., ge=12, le=60, description="loan term (months, 12-60)")
    purpose: str = Field(..., description="loan purpose")
    overdue_times_1y: int = Field(default=0, ge=0, description="overdue count in the past year")
    debt_ratio: float = Field(default=0.0, ge=0, lt=1, description="debt ratio")
    social_security_years: float = Field(default=0.0, ge=0, description="years of social-security contributions")
    historical_default_flag: int = Field(default=0, ge=0, le=1, description="historical default flag (0/1)")
@app.route("/predict_risk", methods=["POST"])
def predict_risk():
    """
    Default-risk prediction endpoint.
    Request: JSON matching PredictionRequest.
    Response: JSON with default probability, risk level, and key factors.
    """
    try:
        req = PredictionRequest(**request.get_json())
        sample = {
            "applicant_id": req.applicant_id,
            "age": req.age,
            "occupation": req.occupation,
            "income_monthly": req.income_monthly,
            "loan_amount": req.loan_amount,
            "loan_term": req.loan_term,
            "purpose": req.purpose,
            "overdue_times_1y": req.overdue_times_1y,
            "debt_ratio": req.debt_ratio,
            "social_security_years": req.social_security_years,
            "historical_default_flag": req.historical_default_flag,
        }
        sample_df = pd.DataFrame([sample])
        # Recompute the derived features used by the online model.
        sample_df["debt_service_ratio"] = (sample_df["loan_amount"] / (sample_df["loan_term"] * 12)) * 1.1 / sample_df["income_monthly"]
        sample_df["occupation_stability"] = sample_df["occupation"].map(occupation_stability_map).fillna(0)
        sample_df[["income_norm", "social_security_years_norm"]] = scaler.transform(sample_df[["income_monthly", "social_security_years"]])
        sample_df["purpose_dummy"] = int(sample_df["purpose"].iloc[0] == "创业资金")
        # Simplified online feature list; must stay aligned with the served model.
        features = ["income_norm", "debt_service_ratio", "occupation_stability",
                    "social_security_years_norm", "purpose_dummy", "overdue_times_1y",
                    "debt_ratio", "historical_default_flag"]
        X_sample = sample_df[features].values
        # float() keeps the numpy scalar out of the JSON response.
        default_prob = float(model.predict_proba(X_sample)[0][1])
        risk_level = "高风险" if default_prob > 0.6 else "中风险" if default_prob > 0.3 else "低风险"  # high / medium / low risk
        response = {
            "applicant_id": req.applicant_id,
            "default_probability": round(default_prob, 4),
            "risk_level": risk_level,
            "key_factors": [
                {"factor": "debt_ratio", "impact": "positive", "coefficient": 0.85},
                {"factor": "social_security_years", "impact": "negative", "coefficient": -0.62},
            ],
            "timestamp": pd.Timestamp.now().isoformat(),
        }
        return jsonify(response), 200
    except Exception as e:
        logger.error(f"Prediction failed: {e}", exc_info=True)
        return jsonify({"error": str(e)}), 400

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=False)
Containerized deployment (docker/Dockerfile)
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
COPY model/ ./model/
EXPOSE 5000
# 4 Gunicorn worker processes
CMD ["gunicorn", "--workers", "4", "--bind", "0.0.0.0:5000", "src.api.app:app"]
Load balancing across containers (nginx.conf)
upstream credit_risk_api {
    server 10.0.0.1:5000;  # container 1
    server 10.0.0.2:5000;  # container 2
}
server {
listen 80;
location /predict_risk {
proxy_pass http://credit_risk_api;
proxy_set_header Host $host;
}
}
Sample request:
curl -X POST http://credit-risk-api/predict_risk \
  -H "Content-Type: application/json" \
  -d '{"applicant_id": "APP005", "age": 32, "occupation": "互联网", "income_monthly": 18000, "loan_amount": 250000, "loan_term": 48, "purpose": "购房装修", "overdue_times_1y": 1, "debt_ratio": 0.55, "social_security_years": 6, "historical_default_flag": 0}'
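For completeness, the same call can be built from Python with the standard library; the host name is the hypothetical one from the curl example, and the actual send is commented out since it needs a running service:

```python
import json
import urllib.request

payload = {
    "applicant_id": "APP005", "age": 32, "occupation": "互联网",
    "income_monthly": 18000, "loan_amount": 250000, "loan_term": 48,
    "purpose": "购房装修", "overdue_times_1y": 1, "debt_ratio": 0.55,
    "social_security_years": 6, "historical_default_flag": 0,
}
req = urllib.request.Request(
    "http://credit-risk-api/predict_risk",
    data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# with urllib.request.urlopen(req, timeout=5) as resp:  # requires the service
#     print(json.load(resp))
print(req.get_method(), req.full_url)
```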