Python Machine Learning: Predicting Life Insurance Renewal with Logistic Regression and Decision Trees | 极客日志

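This project uses Python machine learning to build a life insurance renewal prediction model. Logistic regression and decision tree models are fitted to demographic and policy features. The reported accuracy is 92%, with an AUC of about 0.97. The key drivers are income level, marital status, age, and occupation. On the business side, the recommendation is to improve service for high-value customers (middle-to-high income, married) and to design differentiated strategies for young unmarried and elderly single customers in order to raise the renewal rate.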

字节跳动 · Published 2026/2/4 · Updated 2026/6/2295 views

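1. Project overview

This project uses data analysis and machine learning to identify the key factors behind insurance renewal and to build a renewal prediction model. The model helps an insurer predict whether a customer will renew, understand what drives renewal behavior, and identify both high-value customers and customers at risk of churning, so that targeted marketing and service can reduce churn and improve retention.

2. Dataset

The dataset contains 1,000 records with the following main features:

Demographic features: age, gender, region of birth, income level, education level, occupation, marital status, number of family members
Policy features: policy type, policy term, premium amount, policy start date, policy end date, claim history (whether a claim was made, 'Yes' or 'No')
Target variable: renewal (whether the policy was renewed, 'Yes' or 'No')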

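3. Exploratory data analysis (EDA)

3.1 Customer features vs. renewal

![image]

3.2 Policy features vs. renewal

3.3 Renewal rate by policy region

![image]

3.4 Correlation analysis

Correlation among numeric features

The correlation coefficients show that age, number of family members, and premium amount are positively correlated with one another.

# The Pearson correlation coefficient measures the strength of the linear relationship between two variables; it ranges from -1 to 1. Values near 1 indicate positive correlation, values near -1 indicate negative correlation, and values near 0 indicate no linear correlation.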
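To make the definition concrete, here is a minimal sketch that computes Pearson's r by hand and checks it against `numpy.corrcoef`. The two arrays are made-up illustrative values, not rows from the project's dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values for illustration only (not taken from policy_data.xlsx)
age = [25, 32, 41, 48, 56, 63]
premium = [1200, 1500, 2100, 2300, 3000, 3400]

r_manual = pearson_r(age, premium)
r_numpy = float(np.corrcoef(age, premium)[0, 1])
print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

`DataFrame.corr()` uses the same Pearson definition by default, applied to every pair of numeric columns.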

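![image]

Association between categorical features and renewal

Income level, marital status, education level, claim history, and gender are all significantly associated with renewal (p < 0.05 in the chi-square tests below).

# The chi-square test checks whether two categorical variables are independent. It compares the observed frequencies with the frequencies expected under independence.
# Judging significance: compare the chi-square statistic with the critical value of the chi-square distribution for the given degrees of freedom and significance level (usually 0.05). If the statistic exceeds the critical value, reject the null hypothesis and conclude that the two variables are significantly associated.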

Variable	Chi2	P_value	Significant
income_level	255.2955	3.66E-56	TRUE
marital_status	125.3968	5.89E-28	TRUE
education_level	18.05732	0.000428	TRUE
claim_history	11.72501	0.000617	TRUE
gender	9.345081	0.002236	TRUE
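As a rough illustration of where a chi-square statistic like those in the table comes from, the sketch below builds the expected counts by hand and compares the result with `scipy.stats.chi2_contingency`. The 2x2 counts are hypothetical and are not the project's contingency table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (rows: claim_history No/Yes, columns: renewal No/Yes)
observed = np.array([[120, 480],
                     [110, 290]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic, without continuity correction
chi2_manual = float(((observed - expected_manual) ** 2 / expected_manual).sum())

# correction=False disables Yates' correction so SciPy matches the manual formula
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```

With SciPy's default `correction=True`, Yates' continuity correction is applied to 2x2 tables, so the statistic for two-level variables comes out slightly smaller than the uncorrected value computed here.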

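4. Logistic regression analysis - predicting which customers renew

A logistic regression model is used to predict whether a life insurance customer will renew; the model coefficients are then used to explain how different customer features influence the renewal decision.

4.1 Model performance metrics

Classification report and confusion matrix

![image]

Accuracy: 92.00% - 92% of all renew / not-renew predictions are correct
Precision: 94% - of the customers predicted to renew, 94% actually renewed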
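The two metrics above can be read directly off a confusion matrix. The sketch below shows the arithmetic; the counts are hypothetical and chosen only to keep the numbers easy to follow, they are not the project's confusion matrix.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts: 100 test customers, "renew" is the positive class
accuracy, precision, recall = classification_metrics(tn=30, fp=10, fn=5, tp=55)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy counts correct predictions for both classes, while precision looks only at the customers the model predicted would renew.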

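import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Use a CJK-capable font so Chinese labels render, and keep the minus sign displayable
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Load the data, then inspect column names and dtypes
df_data = pd.read_excel("policy_data.xlsx")
df_data.info()
# Count missing values per column
print(df_data.isnull().sum())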

# Customer features vs. renewal
fig,ax = plt.subplots(2,3,figsize=(20,8))
plt.subplots_adjust(left=0.05,right=0.95,bottom=0.05,top=0.95,hspace=0.5,wspace=0.3)
# Age vs. renewal
sns.histplot(data=df_data,x='age',hue='renewal',kde=True,element='step',multiple='layer',ax=ax[0,0])
ax[0,0].set_title("Age vs. renewal")
# Gender vs. renewal
sns.countplot(data=df_data,x='renewal',hue='gender',width=0.4,ax=ax[0,1])
ax[0,1].set_title("Gender vs. renewal")
# Income level vs. renewal
relate_income_renewal = df_data.groupby(['income_level','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_income_renewal,x='income_level',y='count',hue='renewal',width=0.6,ax=ax[0,2])
ax[0,2].set_title("Income level vs. renewal")
# Occupation vs. renewal
relate_occupation_renewal = df_data.groupby(['occupation','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_occupation_renewal,x='count',y='occupation',hue='renewal',width=0.6,ax=ax[1,0])
ax[1,0].set_title("Occupation vs. renewal")
# Education level vs. renewal
relate_edu_renewal = df_data.groupby(['education_level','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_edu_renewal,x='education_level',y='count',hue='renewal',width=0.6,ax=ax[1,1])
ax[1,1].set_title("Education level vs. renewal")
# Marital status vs. renewal
relate_marital_renewal = df_data.groupby(['marital_status','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_marital_renewal,x='marital_status',y='count',hue='renewal',width=0.6,ax=ax[1,2])
ax[1,2].set_title("Marital status vs. renewal")
plt.savefig("客户特征与续保.png")

# Policy features vs. renewal
fig,ax = plt.subplots(2,2,figsize=(12,6))
plt.subplots_adjust(left=0.05,right=0.95,bottom=0.05,top=0.95,hspace=0.5,wspace=0.3)
# Policy type vs. renewal
relate_policy_renewal = df_data.groupby(['policy_type','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_policy_renewal,x='count',y='policy_type',hue='renewal',width=0.6,ax=ax[0,0])
ax[0,0].set_title("Policy type vs. renewal")
# Policy term vs. renewal
relate_term_renewal = df_data.groupby(['policy_term','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_term_renewal,y='count',x='policy_term',hue='renewal',width=0.6,ax=ax[0,1])
ax[0,1].set_title("Policy term vs. renewal")
# Premium amount vs. renewal
sns.histplot(data=df_data,x='premium_amount',hue='renewal',element='step',multiple='layer',kde=True,ax=ax[1,0])
ax[1,0].set_title("Premium amount vs. renewal")
# Claim history vs. renewal
relate_claim_renewal = df_data.groupby(['claim_history','renewal']).size().reset_index(name='count')
sns.barplot(data=relate_claim_renewal,x='claim_history',y='count',hue='renewal',width=0.3,ax=ax[1,1])
ax[1,1].set_title("Claim history vs. renewal")
plt.savefig('保单特征与续保.png')
# Read the China map data (source: DataV.GeoAtlas) and project it to EPSG:4573
gdf = gpd.read_file('https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json').to_crs('EPSG:4573')
# Store the area of each province-level region, in units of 10,000 square kilometers
gdf['area'] = gdf.area/1e6/1e4
# Drop the last record, which is the South China Sea nine-dash line
gdf = gdf[:-1]
# Renewal rate per region: share of policies with renewal == 'Yes'
data_rr = df_data.groupby('insurance_region')['renewal'].apply(lambda x:(x=='Yes').mean()).to_frame()
# Merge the renewal rates into the map data by province name
gdf = gdf.join(data_rr,on='name')
# Draw the renewal-rate map
fig,ax = plt.subplots(figsize=(15,15))
fontdict = {'family':'SimHei', 'size':8, 'color': "black",'weight': 'bold'}
gdf.plot(ax=ax,column='renewal',cmap='coolwarm',legend=True,legend_kwds={'label': "Renewal rate", 'shrink':0.5})
# Label each province with its name and renewal rate (regions without data are skipped)
for _, row in gdf.iterrows():
    if pd.isna(row['renewal']):
        continue
    centroid = row.geometry.centroid
    t = f"{row['name']}:{row['renewal']*100:.0f}%"
    ax.text(centroid.x,centroid.y,t,ha='center',va='center',fontdict=fontdict)
# Hide the axes
ax.axis('off')
ax.set_title('Renewal rate by policy region')
plt.savefig('保单地区续保率.png')
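The per-region renewal rate used for the map is just the mean of a boolean "renewed" flag within each group. A minimal sketch on a hypothetical mini-sample (the region names `A`, `B`, `C` are placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical mini-sample; the real data comes from policy_data.xlsx
df_demo = pd.DataFrame({
    "insurance_region": ["A", "A", "A", "B", "B", "C"],
    "renewal": ["Yes", "No", "Yes", "Yes", "Yes", "No"],
})

# The mean of a boolean Series is the share of True values, i.e. the renewal rate
renewal_rate = df_demo["renewal"].eq("Yes").groupby(df_demo["insurance_region"]).mean()
print(renewal_rate)
```

Regions with very few policies will produce noisy rates, which is worth keeping in mind when reading the map.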

# Correlation analysis
# The Pearson correlation coefficient measures the strength of the linear relationship between two variables; it ranges from -1 to 1. Values near 1 indicate positive correlation, values near -1 negative correlation, and values near 0 no linear correlation.
numeric_df = df_data.select_dtypes(include=['int64','float64'])
correlation = numeric_df.corr()
# Start a new figure so the heatmap is not drawn on the previous axes
plt.figure(figsize=(6,5))
sns.heatmap(data=correlation,annot=True,cmap='coolwarm',linewidths=.5)
plt.savefig('correlation.png')
# Chi-square test of each categorical variable against renewal
var_with_renewal = ['gender', 'marital_status', 'claim_history', 'income_level', 'education_level']
results = []
for var in var_with_renewal:
    # Build the contingency table
    contingency_table = pd.crosstab(df_data[var], df_data['renewal'])
    # Run the chi-square test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    results.append({
        'Variable': var,
        'Chi2': chi2,
        'P_value': p_value,
        'Significant': p_value < 0.05
    })
# Collect the results in a DataFrame, most significant first
results_df = pd.DataFrame(results).sort_values('P_value')
print(results_df)

from sklearn.preprocessing import LabelEncoder

# Encode the target variable
# LabelEncoder maps each class to an integer from 0 to n_classes - 1; it is typically used to encode the target y
le = LabelEncoder().fit(df_data["renewal"])
df_data["renewal_encode"] = le.transform(df_data["renewal"])
# Show the class-to-code mapping
class_to_index = {target:index for index, target in enumerate(le.classes_)}
print(class_to_index)
# Build the feature matrix and the target
X = df_data.drop(["policy_id","policy_end_date","renewal","renewal_encode"], axis=1)
y = df_data["renewal_encode"].astype('int')
# Keep only the year of the policy start date, as a categorical string
X["policy_start_date"] = X["policy_start_date"].dt.year.astype(str)
# Split the features into numeric and categorical
numeric_features = X.select_dtypes(exclude=["object"]).columns.to_list()
categorical_features = X.select_dtypes(include=["object"]).columns.to_list()
print(f"Numeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")

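from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing: standardize numeric features, one-hot encode categorical features
transformers = [('num', StandardScaler(), numeric_features), ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)]
preprocessor = ColumnTransformer(transformers=transformers)
# Build the pipeline
# liblinear supports both the l1 and l2 penalties searched below (the default lbfgs solver does not support l1)
pipeline = Pipeline([('preprocessor',preprocessor),('classifier',LogisticRegression(solver='liblinear',random_state=42))])
# Define the parameter grid
param_grid = [{'classifier__C':[0.001,0.01,0.1,1.0,10,100], 'classifier__penalty':['l1','l2']}]
# Create the grid search object
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, verbose=1)
# Split into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
# Run the grid search
grid_search.fit(X_train,y_train)
# Report the best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")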

# Get feature names and coefficients
def get_feature_names_from_pipeline(grid_search):
    # Feature names after preprocessing
    feature_names = grid_search.best_estimator_.named_steps['preprocessor'].get_feature_names_out()
    # Fitted coefficients of the logistic regression
    coefficients = grid_search.best_estimator_.named_steps['classifier'].coef_[0]
    # Build a feature/coefficient DataFrame
    df = pd.DataFrame({'feature':feature_names,'coefficient':coefficients})
    # Sort by absolute value, largest first
    df = df.sort_values(by='coefficient',key=abs,ascending=False)
    return df
# Build the coefficient table and save it to Excel
df_coef = get_feature_names_from_pipeline(grid_search)
df_coef.to_excel('coefficient.xlsx')
# Visualize the logistic regression coefficients
# Keep the 20 coefficients with the largest absolute value
df_coef_top = df_coef.head(20)
# Draw the bar chart (red: negative, blue: positive)
plt.figure(figsize = (10,5))
ax = sns.barplot(data=df_coef_top,y='feature',x='coefficient',palette=['red' if x<0 else 'blue' for x in df_coef_top['coefficient']])
# Add a zero line for reference
plt.axvline(x=0,color='black',linestyle='-',alpha=0.3)
# Add value labels
for i,value in enumerate(df_coef_top['coefficient']):
    ax.text(value,i,f'{value:.2f}',ha='left' if value>0 else 'right',va='center')
plt.title("Logistic regression coefficients (top 20 by absolute value)")
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig("coef_logistic.png")
plt.show()
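Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to discuss with business stakeholders. The sketch below uses hypothetical coefficient values; the feature names follow the `num__` / `cat__` prefix that `ColumnTransformer` adds by default, and category values such as `Married` and `Low` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration; real values would come from df_coef
df_example = pd.DataFrame({
    "feature": ["cat__marital_status_Married", "num__age", "cat__income_level_Low"],
    "coefficient": [1.2, 0.3, -0.9],
})

# exp(coefficient) is the multiplicative change in the odds of renewal
df_example["odds_ratio"] = np.exp(df_example["coefficient"])
print(df_example)
```

Because the numeric features are standardized in the pipeline, a numeric odds ratio would describe a one-standard-deviation increase rather than a one-unit increase; regularization also shrinks the coefficients, so these ratios are best read as rankings rather than exact effect sizes.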

from sklearn.metrics import auc, classification_report, confusion_matrix, precision_recall_curve

# Predict with the best model
best_model = grid_search.best_estimator_
pred_logic = best_model.predict(X_test)
pred_proba_logic = best_model.predict_proba(X_test)[:,1]

# Detailed classification report
print(f"Classification report - logistic regression:\n{classification_report(y_test,pred_logic)}")
# Confusion matrix (label 0 = not renewed, label 1 = renewed)
cm_logic = confusion_matrix(y_test,pred_logic,labels=[0,1])
def plot_confusion_matrix(cm):
    TN,FP,FN,TP = cm.ravel()
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    #f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
    plt.figure(figsize = (4.5,4))
    sns.heatmap(cm,fmt='d',annot=True,cmap='Blues',xticklabels=["Not renewed","Renewed"],yticklabels=["Not renewed","Renewed"])
    plt.title(f"recall:{recall:.3f}\nprecision:{precision:.3f}")
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.tight_layout()
    plt.savefig("confusion_matrix_logistic.png")
    plt.show()
# Plot the confusion matrix for the logistic regression predictions
print("Confusion matrix - logistic regression:")
plot_confusion_matrix(cm_logic)
# PR curve
# The PR curve shows the trade-off between recall and precision; it focuses on how well the positive class is identified and is more sensitive to class imbalance
# Compute the PR curve values
precision_lg,recall_lg,thresholds_lg = precision_recall_curve(y_test, pred_proba_logic)
pr_auc_lg = auc(recall_lg, precision_lg)
# Plot the PR curve
plt.figure(figsize = (5,5))
plt.plot(recall_lg,precision_lg,label=f'PR Curve (AUC={pr_auc_lg:.3f})')
# Mark the point whose threshold is closest to 0.5
close_idx = np.argmin(np.abs(thresholds_lg-0.5))
close_recall_lg = recall_lg[close_idx]
close_precision_lg = precision_lg[close_idx]
plt.plot(close_recall_lg,close_precision_lg,marker='o',markersize=8,fillstyle='none', label=f'Threshold=0.5 recall={close_recall_lg:.3f} precision={close_precision_lg:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.savefig("precision_recall_logistic.png")
plt.show()
# With a threshold of 0.5, 93 of every 100 customers predicted to renew actually renew, and the model identifies 95% of the customers who really renew
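The 0.5 threshold marked on the curve is only one possible operating point. As a hedged sketch of how another threshold could be picked from the same `precision_recall_curve` output, the example below scans all thresholds for the highest F1 score. The labels and scores are synthetic, and maximizing F1 is just one possible criterion, not the article's method.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and scores for illustration (not the project's predictions)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
scores = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=200), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more element than thresholds; drop the final (1, 0) point
p, r = precision[:-1], recall[:-1]
f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)

best = int(np.argmax(f1))
print(f"best threshold={thresholds[best]:.3f} precision={p[best]:.3f} "
      f"recall={r[best]:.3f} F1={f1[best]:.3f}")
```

In a retention setting, the costs of a missed churner and of an unnecessary outreach are rarely equal, so a cost-weighted criterion may suit the business better than F1.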

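from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Preprocessing: pass numeric features through, ordinal-encode categorical features
# Categories unseen during fitting are encoded as -1 instead of raising an error
transformers = [('num', 'passthrough', numeric_features), ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_features)]
preprocessor = ColumnTransformer(transformers=transformers)
# Build the pipeline
pipeline = Pipeline([('preprocessor',preprocessor),('classifier',DecisionTreeClassifier(random_state=42))])
# Define the parameter grid
param_grid = [{'classifier__max_depth':[1,3,5,7]}]
# Create the grid search object
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, verbose=1)
# Split into training and test sets (same split as for the logistic regression)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
# Run the grid search
grid_search.fit(X_train,y_train)
# Report the best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")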

def get_feature_names_from_pipeline(grid_search):
    # Feature names after preprocessing
    feature_names = grid_search.best_estimator_.named_steps['preprocessor'].get_feature_names_out()
    # Feature importances of the fitted tree
    importance = grid_search.best_estimator_.named_steps['classifier'].feature_importances_
    # Build a feature/importance DataFrame
    df = pd.DataFrame({'feature':feature_names,'importance':importance})
    # Sort by importance, largest first
    df = df.sort_values(by='importance',ascending=False)
    return df
# Compute the decision tree feature importances
df_importance = get_feature_names_from_pipeline(grid_search)
# Visualize the 15 most important features
plt.figure(figsize = (8,5))
ax = sns.barplot(data=df_importance[:15],x="importance",y="feature")
for i,imp in enumerate(df_importance[:15]['importance']):
    plt.text(imp,i,f'{imp:.2f}',ha='left',va='center')
plt.title('Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.show()

# Keep the numeric features plus the 3 most important categorical features
cat_feature_select = ['marital_status','education_level','occupation']
X_select = X[numeric_features+cat_feature_select]
# Split the selected features into training and test sets
X_select_train,X_select_test,y_select_train,y_select_test = train_test_split(X_select,y,test_size=0.2,random_state=42,stratify=y)
# Preprocessing: pass numeric features through, one-hot encode the selected categorical features
transformers = [('num', 'passthrough', numeric_features), ('cat', OneHotEncoder(handle_unknown='ignore'), cat_feature_select)]
preprocessor = ColumnTransformer(transformers=transformers)
# Build the pipeline
pipeline = Pipeline([('preprocessor',preprocessor),('classifier',DecisionTreeClassifier(random_state=42))])
# Define the parameter grid
param_grid = [{'classifier__max_depth':[1,3,5,7]}]
# Create the grid search object
grid_search_tree = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, verbose=1)
# Run the grid search
grid_search_tree.fit(X_select_train,y_select_train)
# Report the best parameters and score
print(f"Best parameters: {grid_search_tree.best_params_}")
print(f"Best cross-validation score: {grid_search_tree.best_score_}")

from sklearn.tree import export_graphviz, plot_tree

# Visualize the decision tree
best_model_tree = grid_search_tree.best_estimator_
feature_names_select = best_model_tree.named_steps['preprocessor'].get_feature_names_out().tolist()
# Export the full tree in Graphviz format
export_graphviz(best_model_tree.named_steps['classifier'],out_file='tree.dot',class_names=['Not renewed','Renewed'], feature_names=feature_names_select, impurity=False,filled=True)
# Plot the top three levels of the tree
plt.figure(figsize = (10,5))
plot_tree(best_model_tree.named_steps['classifier'], feature_names=feature_names_select, class_names=['Not renewed','Renewed'], max_depth=3, filled=True, rounded=True, precision=2, fontsize=8, proportion=True)
plt.title('Decision tree model')
plt.tight_layout()
plt.savefig('decision_tree.png')
plt.show()
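Besides the plot, the tree's decision rules can be printed as plain text with `sklearn.tree.export_text`, which is convenient for copying rules into a report. The sketch below runs on synthetic data; `X_demo`, `y_demo`, and the feature names are stand-ins, not the project's preprocessed matrix.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the preprocessed training matrix
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
# Hypothetical feature names, used only to label the printed rules
feature_names_demo = ["age", "premium_amount", "family_members", "policy_term"]

tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# export_text prints one line per node, indented by depth
rules = export_text(tree_demo, feature_names=feature_names_demo, decimals=2)
print(rules)
```

For the pipeline above, the analogous call would presumably pass `best_model_tree.named_steps['classifier']` together with `feature_names_select`.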

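# Predict with the best tree fitted on the selected features
best_model_tree = grid_search_tree.best_estimator_
X_select_test = X_test[X_select.columns]
pred_dt = best_model_tree.predict(X_select_test)
pred_proba_dt = best_model_tree.predict_proba(X_select_test)[:,1]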

# Detailed classification report
print(f"Classification report - decision tree:\n{classification_report(y_test,pred_dt)}")
# Confusion matrix (label 0 = not renewed, label 1 = renewed)
cm_dt = confusion_matrix(y_test,pred_dt,labels=[0,1])
def plot_confusion_matrix(cm):
    TN,FP,FN,TP = cm.ravel()
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    #f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
    plt.figure(figsize = (4.5,4))
    sns.heatmap(cm,fmt='d',annot=True,cmap='Blues',xticklabels=["Not renewed","Renewed"],yticklabels=["Not renewed","Renewed"])
    plt.title(f"recall:{recall:.3f}\nprecision:{precision:.3f}")
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.tight_layout()
    plt.savefig("confusion_matrix_decision_tree.png")
    plt.show()
# Plot the confusion matrix for the decision tree predictions
print("Confusion matrix - decision tree:")
plot_confusion_matrix(cm_dt)
# Compute the PR curve values
precision_dt,recall_dt,thresholds_dt = precision_recall_curve(y_test, pred_proba_dt)
pr_auc_dt = auc(recall_dt, precision_dt)
# Plot the PR curve
plt.figure(figsize = (5,5))
plt.plot(recall_dt,precision_dt,label=f'PR Curve (AUC={pr_auc_dt:.3f})')
# Mark the point whose threshold is closest to 0.5
close_idx = np.argmin(np.abs(thresholds_dt-0.5))
close_recall_dt = recall_dt[close_idx]
close_precision_dt = precision_dt[close_idx]
plt.plot(close_recall_dt,close_precision_dt,marker='o',markersize=8,fillstyle='none', label=f'Threshold=0.5 recall={close_recall_dt:.3f} precision={close_precision_dt:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.savefig("precision_recall_decision_tree.png")
plt.show()
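The summary quotes an AUC of about 0.97, while the code above computes the area under the PR curve. If a ROC AUC is also wanted for comparing the two models, a sketch along the following lines could be used. It runs on synthetic data from `make_classification`, so the printed scores say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed policy features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    # Probability of the positive class on the held-out set
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    scores[name] = {
        "roc_auc": roc_auc_score(y_te, proba),
        "average_precision": average_precision_score(y_te, proba),
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})
```

Scoring both models on the same held-out split keeps the comparison like for like.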

