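Visual Analysis of 2025 Salary Data for Data Science, AI, and Machine Learning Jobs

1. Project Background

As global digital transformation accelerates, data science, artificial intelligence, and machine learning have by 2025 become core drivers of economic growth and technological innovation. Demand for people with advanced data analysis skills and AI expertise has grown explosively, and that demand shows up directly in salaries, making the field one of the most closely watched high-value tracks in the job market. In recent years, as large models, edge intelligence, and federated learning have reached commercial use, the industry's demand for hybrid technical talent has shifted markedly: practitioners are expected not only to master traditional programming and statistics, but also to solve problems across disciplines and understand business scenarios. Against this backdrop, technology hubs such as Silicon Valley, Beijing, and Bangalore compete for top talent with competitive compensation packages, which produces new patterns of salary differences across regions and specialized roles. At the same time, the spread of remote work has weakened geographic limits on talent mobility and intensified competition among employers, and the salary data begin to show notable patterns such as pay converging between traditional industries and technology companies, and salary growth in emerging markets outpacing that in developed countries. A visual analysis of the latest 2025 salary data can therefore reveal how value is distributed in the current technical talent market and provide data support for practitioners' career planning, companies' hiring strategies, and educational institutions' training priorities.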

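2. Dataset Description

The dataset used in this experiment comes from Kaggle. The raw dataset contains 66,063 records and 11 variables. The variables are described below: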

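work_year -> The year the salary was reported. All entries reflect 2025 data.
job_title -> The specific role or title, such as Data Scientist, Machine Learning Engineer, or AI Researcher.
job_category -> The broader category or specialization of the job (for example, data engineering, NLP, computer vision).
salary_currency -> The original currency in which the salary is paid (for example, USD, EUR, INR).
salary -> The reported annual salary in the original currency (before conversion).
salary_in_usd -> The annual salary converted to US dollars using the average 2025 exchange rate, to allow global comparison.
employee_residence -> The country where the employee lives or primarily works.
experience_level -> The level of professional experience required for or held in the role. Common values: entry-level, mid-level, senior, executive.
employment_type -> The type of employment contract. Examples: full-time, part-time, contract, freelance.
work_setting -> The work arrangement: Remote, Hybrid, or On-site.
company_location -> The country of the employer's headquarters or main office.
company_size -> The size of the employer's organization, classified as follows:
- Small (1-50 employees)
- Medium (51-500 employees)
- Large (more than 500 employees)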
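
The relationship between salary, salary_currency, and salary_in_usd described above can be illustrated with a small sketch. The rate table and the helper names `to_usd` and `implied_rate` are hypothetical and used only for illustration; the rates are placeholders, not the averages actually used to build the dataset.

```python
# Hypothetical average rates, expressed as units of local currency per 1 USD.
# These numbers are placeholders and are NOT taken from the dataset.
AVG_RATE_PER_USD = {"USD": 1.0, "EUR": 0.9, "INR": 85.0}


def to_usd(salary: float, currency: str) -> float:
    """Convert an annual salary in its original currency to USD."""
    return salary / AVG_RATE_PER_USD[currency]


def implied_rate(salary: float, salary_in_usd: float) -> float:
    """Recover the exchange rate implied by a (salary, salary_in_usd) pair."""
    return salary / salary_in_usd


usd = to_usd(1_700_000, "INR")
print(round(usd))                     # salary expressed in USD
print(implied_rate(1_700_000, usd))   # rate recovered from the two columns
```

Dividing salary by salary_in_usd in this way is the same idea the analysis code uses later to derive an implied exchange rate per currency.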

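The dataset was compiled from a combination of market research and public data sources and is described as reflecting real-world global compensation patterns. It is intended to support salary prediction and machine learning modeling, global market benchmarking, career decisions and negotiation, remote work trend analysis, and business intelligence dashboards and visualizations.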

3. Tools

Python version: 3.9
Code editor: Jupyter Notebook

4. Importing the Data

Import the visualization libraries and load the dataset
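
A minimal sketch of the loading step follows. To keep it self-contained it reads a few made-up rows from an in-memory string; with the real file, the call would be `pd.read_csv('salaries.csv')`, the path used in the article's source code. The sample rows and their values are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up rows that mimic part of the column layout described above; not real data.
sample_csv = io.StringIO(
    "work_year,job_title,experience_level,salary_currency,salary,salary_in_usd,company_size\n"
    "2025,Data Scientist,SE,USD,150000,150000,M\n"
    "2025,ML Engineer,MI,EUR,90000,100000,L\n"
    "2025,Data Analyst,EN,USD,70000,70000,S\n"
)

# With the real dataset this would be: df = pd.read_csv('salaries.csv')
df = pd.read_csv(sample_csv)

print(df.head())   # first rows, to confirm the columns parsed as expected
print(df.shape)    # (number of rows, number of columns)
```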

Check the size of the dataset

Inspect basic information about the data
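
As a rough sketch of this inspection step, the snippet below prints column types and counts missing values and duplicate rows. The small DataFrame is invented for illustration; on the real data the same calls would be applied to the loaded `df`.

```python
import pandas as pd

# Toy frame (assumed values) with one missing title and one duplicated row.
df = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "ML Engineer", None],
    "salary_in_usd": [150000, 150000, 120000, 95000],
})

df.info()                              # dtypes and non-null counts per column
missing = df.isnull().sum()            # missing values per column
n_dupes = int(df.duplicated().sum())   # rows identical to an earlier row

print(missing)
print(n_dupes)
```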

View descriptive statistics for the numeric variables
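
The sketch below shows what the descriptive statistics step might look like on a single salary column, and also compares skewness before and after a log transform, which the analysis code relies on when it plots the log-transformed distribution. The six salary values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented salaries with one very high value, to mimic a right-skewed column.
salaries = pd.Series(
    [60000, 80000, 95000, 110000, 130000, 450000], name="salary_in_usd"
)

summary = salaries.describe()          # count, mean, std, min, quartiles, max
raw_skew = salaries.skew()             # sample skewness of the raw values
log_skew = np.log1p(salaries).skew()   # skewness after log(1 + x)

print(summary)
print(raw_skew, log_skew)
```

On data shaped like this, the log transform pulls the long right tail in, so the second skewness value comes out smaller than the first.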

import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

warnings.filterwarnings('ignore')
sns.set(style="whitegrid")
plt.style.use('fivethirtyeight')

# Note: the code below treats work_year as spanning several years and uses a
# numeric remote_ratio column (0 / 50 / 100). The dataset description above
# lists work_setting and a single year (2025), so column names may need to be
# adapted to the file that is actually loaded.
df = pd.read_csv('salaries.csv')

print(df.head())
print(df.shape)
print(df.info())
print(df.describe())
print(df.describe(include='O'))
print(df.isnull().sum())
print(df.duplicated().sum())

# Salary distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['salary_in_usd'], kde=True, bins=50)
plt.title('Distribution of Salaries (USD)', fontsize=16)
plt.xlabel('Salary (USD)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.axvline(df['salary_in_usd'].median(), color='red', linestyle='--',
            label=f'Median: ${df["salary_in_usd"].median():,.0f}')
plt.axvline(df['salary_in_usd'].mean(), color='green', linestyle='--',
            label=f'Mean: ${df["salary_in_usd"].mean():,.0f}')
plt.legend()
plt.show()

# Salary distribution after a log transform (to handle skewness)
plt.figure(figsize=(12, 6))
sns.histplot(np.log1p(df['salary_in_usd']), kde=True, bins=50)
plt.title('Log-Transformed Salary Distribution', fontsize=16)
plt.xlabel('Log(Salary+1)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

# The salary distribution is right-skewed with many high-income outliers,
# which is typical of salary data. The log transform gives a distribution
# closer to normal, which is useful for modeling.

# Yearly salary trends
yearly_stats = (
    df.groupby('work_year')['salary_in_usd']
    .agg(['mean', 'median', 'std'])
    .reset_index()
)

fig = px.line(yearly_stats, x='work_year', y=['mean', 'median'],
              title='Data Science Salary Trends (2020-2025)',
              labels={'value': 'Salary (USD)', 'work_year': 'Year', 'variable': 'Metric'},
              template='plotly_white')
fig.update_layout(legend_title_text='', hovermode='x unified', width=900, height=500)

# Shaded band: mean plus/minus one standard deviation
fig.add_trace(go.Scatter(
    x=np.concatenate([yearly_stats['work_year'], yearly_stats['work_year'][::-1]]),
    y=np.concatenate([yearly_stats['mean'] + yearly_stats['std'],
                      (yearly_stats['mean'] - yearly_stats['std'])[::-1]]),
    fill='toself',
    fillcolor='rgba(0,100,80,0.2)',
    line=dict(color='rgba(255,255,255,0)'),
    name='Standard Deviation'
))
fig.show()

# From 2020 to 2025, both the mean and the median salary rise steadily, and
# the gap between them widens slightly, suggesting growing inequality in the field.

# Salary comparison by experience level
plt.figure(figsize=(12, 6))
sns.boxplot(x='experience_level', y='salary_in_usd', data=df,
            order=['EN', 'MI', 'SE', 'EX'])
plt.title('Salary Distribution by Experience Level', fontsize=16)
plt.xlabel('Experience Level', fontsize=12)
plt.ylabel('Salary (USD)', fontsize=12)
plt.xticks(ticks=[0, 1, 2, 3],
           labels=['Entry-level', 'Mid-level', 'Senior', 'Executive'])
plt.show()

# Average salary by experience level over time
exp_time = (
    df.groupby(['work_year', 'experience_level'])['salary_in_usd']
    .mean()
    .reset_index()
)
exp_time['experience_level'] = exp_time['experience_level'].replace(
    {'EN': 'Entry-level', 'MI': 'Mid-level', 'SE': 'Senior', 'EX': 'Executive'})

fig = px.line(exp_time, x='work_year', y='salary_in_usd', color='experience_level',
              title='Salary Trends by Experience Level (2020-2025)',
              labels={'salary_in_usd': 'Average Salary (USD)', 'work_year': 'Year'},
              template='plotly_white')
fig.update_layout(width=900, height=500, hovermode='x unified')
fig.show()

# As expected, experience level has a major effect on salary, and executives
# earn far more than the other levels. Executive pay also appears to grow
# faster, suggesting a rising premium for leadership roles.

# Top 15 job titles
top_jobs = df['job_title'].value_counts().head(15)
print("Distribution of top job titles:")
print(top_jobs)

# Average salary for the most common job titles
top_jobs_salary = (
    df[df['job_title'].isin(top_jobs.index)]
    .groupby('job_title')['salary_in_usd']
    .mean()
    .sort_values(ascending=False)
)

fig = px.bar(x=top_jobs_salary.index, y=top_jobs_salary.values,
             labels={'x': 'Job Title', 'y': 'Average Salary (USD)'},
             title='Average Salary by Top Job Titles',
             color=top_jobs_salary.values,
             color_continuous_scale='Viridis')
fig.update_layout(xaxis_tickangle=-45, width=1000, height=600)
fig.show()

# Salary distribution for the top 5 job titles
plt.figure(figsize=(14, 8))
top5_jobs = top_jobs.index[:5]
sns.violinplot(x='job_title', y='salary_in_usd',
               data=df[df['job_title'].isin(top5_jobs)])
plt.title('Salary Distribution for Top 5 Job Titles', fontsize=16)
plt.xlabel('Job Title', fontsize=12)
plt.ylabel('Salary (USD)', fontsize=12)
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()

# Among the most common titles, Machine Learning Engineers and Research
# Scientists tend to command the highest salaries, while Data Analysts
# usually earn less. Software Engineers show the widest salary spread,
# reflecting the variety of roles under that title.

# Geographic analysis
df['adjusted_salary'] = df['salary_in_usd']

# ISO 3166-1 alpha-2 country codes mapped to the names plotly expects
iso2_to_name = {
    'US': 'United States', 'GB': 'United Kingdom', 'DE': 'Germany',
    'FR': 'France', 'CA': 'Canada', 'IN': 'India', 'AU': 'Australia',
    'ES': 'Spain', 'BR': 'Brazil', 'NL': 'Netherlands', 'JP': 'Japan',
    'CH': 'Switzerland', 'IT': 'Italy', 'SG': 'Singapore', 'SE': 'Sweden',
    'MX': 'Mexico', 'FI': 'Finland', 'DK': 'Denmark', 'PL': 'Poland',
    'PT': 'Portugal', 'NZ': 'New Zealand', 'IE': 'Ireland', 'HK': 'Hong Kong',
    'RU': 'Russia', 'BE': 'Belgium', 'IL': 'Israel', 'UA': 'Ukraine',
    'TR': 'Turkey', 'AE': 'United Arab Emirates', 'ZA': 'South Africa',
    'CO': 'Colombia', 'AR': 'Argentina', 'CL': 'Chile', 'AT': 'Austria',
    'MY': 'Malaysia', 'NG': 'Nigeria', 'VN': 'Vietnam', 'KR': 'South Korea',
    'TH': 'Thailand',
}

avg_salary_by_residence = (
    df.groupby('employee_residence')['adjusted_salary'].mean().reset_index()
)
avg_salary_by_residence['country_name'] = (
    avg_salary_by_residence['employee_residence'].map(iso2_to_name)
)
# Countries missing from the mapping are dropped from the map and ranking
avg_salary_by_residence = avg_salary_by_residence.dropna(subset=['country_name'])

fig2 = px.choropleth(avg_salary_by_residence,
                     locations='country_name',
                     locationmode='country names',
                     color='adjusted_salary',
                     hover_name='country_name',
                     hover_data={'employee_residence': True, 'adjusted_salary': ':,.0f'},
                     color_continuous_scale=px.colors.sequential.Plasma,
                     title='Average Salary by Employee Residence',
                     labels={'adjusted_salary': 'Average Adjusted Salary'},
                     projection='natural earth')
fig2.update_layout(width=1000, height=600)
fig2.show()

top_countries = (
    avg_salary_by_residence.sort_values('adjusted_salary', ascending=False).head(20)
)

plt.figure(figsize=(14, 8))
chart = sns.barplot(x='country_name', y='adjusted_salary', data=top_countries,
                    palette='viridis', order=top_countries['country_name'])
plt.title('Top 20 Countries by Average Data Science Salary', fontsize=16)
plt.xlabel('Employee Residence', fontsize=12)
plt.ylabel('Average Adjusted Salary (USD)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Label each bar with its average salary
for i, bar in enumerate(chart.patches):
    chart.text(bar.get_x() + bar.get_width() / 2.,
               bar.get_height() + 2000,
               f'${top_countries["adjusted_salary"].iloc[i]:,.0f}',
               ha='center', fontsize=9)
plt.tight_layout()
plt.show()

# The United States, Switzerland, and Israel lead in data science salaries.

# Salary by remote ratio
remote_salary = (
    df.groupby('remote_ratio')['salary_in_usd']
    .agg(['mean', 'median', 'count'])
    .reset_index()
)
remote_salary['remote_ratio'] = remote_salary['remote_ratio'].map(
    {0: 'On-site', 50: 'Hybrid', 100: 'Remote'})

fig = px.bar(remote_salary, x='remote_ratio', y=['mean', 'median'],
             barmode='group',
             title='Salary by Remote Work Ratio',
             labels={'value': 'Salary (USD)', 'remote_ratio': 'Work Setting',
                     'variable': 'Metric'},
             color_discrete_sequence=['#2a9d8f', '#e76f51'])
fig.update_layout(width=800, height=500)
fig.show()

# Remote ratio trend over time (share of jobs per year)
remote_time = df.groupby(['work_year', 'remote_ratio']).size().reset_index(name='count')
total_per_year = remote_time.groupby('work_year')['count'].sum().reset_index()
remote_time = remote_time.merge(total_per_year, on='work_year', suffixes=('', '_total'))
remote_time['percentage'] = (remote_time['count'] / remote_time['count_total']) * 100
remote_time['remote_ratio'] = remote_time['remote_ratio'].map(
    {0: 'On-site', 50: 'Hybrid', 100: 'Remote'})

fig = px.line(remote_time, x='work_year', y='percentage', color='remote_ratio',
              title='Remote Work Trends (2020-2025)',
              labels={'percentage': 'Percentage of Jobs', 'work_year': 'Year'},
              template='plotly_white')
fig.update_layout(width=900, height=500, hovermode='x unified')
fig.show()

# Fully remote jobs tend to have higher average salaries. There is also a
# major shift toward remote work after 2020 (probably due to the COVID-19
# pandemic). The trend levels off around 2023-2024, but the share of remote
# work remains higher than before 2020.

# Salary by company size
company_salary = (
    df.groupby(['company_size', 'experience_level'])['salary_in_usd']
    .median()
    .reset_index()
)
company_salary['company_size'] = company_salary['company_size'].map(
    {'S': 'Small', 'M': 'Medium', 'L': 'Large'})
company_salary['experience_level'] = company_salary['experience_level'].map(
    {'EN': 'Entry-level', 'MI': 'Mid-level', 'SE': 'Senior', 'EX': 'Executive'})

fig = px.bar(company_salary, x='company_size', y='salary_in_usd',
             color='experience_level', barmode='group',
             title='Median Salary by Company Size and Experience Level',
             labels={'salary_in_usd': 'Median Salary (USD)', 'company_size': 'Company Size'},
             template='plotly_white')
fig.update_layout(width=900, height=500)
fig.show()

# Large companies generally pay more at every experience level, and the gap
# is most pronounced for executives. This reflects the greater resources and
# revenue of larger organizations.

# Most common salary currencies
currency_counts = df['salary_currency'].value_counts().head(10)

fig = px.pie(values=currency_counts.values, names=currency_counts.index,
             title='Distribution of Salary Currencies',
             template='plotly_white')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(width=700, height=500)
fig.show()

# Exchange rate analysis (derived from salary and salary_in_usd)
df['implied_exchange_rate'] = df['salary'] / df['salary_in_usd']
top_currencies = df['salary_currency'].value_counts().head(10).index.tolist()
exchange_rates = (
    df[df['salary_currency'].isin(top_currencies)]
    .groupby(['work_year', 'salary_currency'])['implied_exchange_rate']
    .median()
    .reset_index()
)

fig = px.line(exchange_rates, x='work_year', y='implied_exchange_rate',
              color='salary_currency',
              title='Implied Exchange Rate Trends (2020-2025)',
              labels={'implied_exchange_rate': 'Rate vs USD', 'work_year': 'Year'},
              template='plotly_white')
fig.update_layout(width=900, height=500, hovermode='x unified')
fig.show()

# The US dollar is the dominant currency for data science salaries worldwide.
# The exchange rate analysis shows how relative currency strength shifts over
# time, with some currencies depreciating against the dollar.

# Create dummy variables for the categorical features
categorical_cols = ['experience_level', 'employment_type', 'job_title',
                    'salary_currency', 'employee_residence',
                    'company_location', 'company_size']
numerical_cols = ['work_year', 'salary', 'salary_in_usd', 'remote_ratio']

# Restrict to the 5 most common titles to keep the matrix readable
top_job_titles = df['job_title'].value_counts().head(5).index.tolist()
df_corr = df[df['job_title'].isin(top_job_titles)].copy()

# dtype=float keeps the dummy columns numeric so that corr() can use them
df_dummies = pd.get_dummies(df_corr, columns=categorical_cols,
                            drop_first=True, dtype=float)
corr_matrix = df_dummies.corr()

plt.figure(figsize=(20, 16))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Correlation Matrix of Features', fontsize=18)
plt.xticks(fontsize=10, rotation=90)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()

salary_corr = corr_matrix['salary_in_usd'].sort_values(ascending=False)
print("Top 10 features positively correlated with salary:")
print(salary_corr.head(10))
print("\nTop 10 features negatively correlated with salary:")
print(salary_corr.tail(10))

# The correlation analysis shows strong relationships between salary and
# factors such as experience level, company size, and certain job titles.
# These insights can be used for feature selection in a model.
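
The code above groups by a numeric remote_ratio column, while the dataset description lists a work_setting column instead. If only work_setting is present in the loaded file, one possible adaptation is sketched below. The label strings ('Remote', 'Hybrid', 'On-site'), the 0/50/100 mapping, and the toy rows are all assumptions and should be checked against the actual values in the data.

```python
import pandas as pd

# Toy rows (assumed values) standing in for the real dataset.
df = pd.DataFrame({
    "work_setting": ["Remote", "Hybrid", "On-site", "Remote"],
    "salary_in_usd": [140000, 120000, 110000, 160000],
})

# Assumed mapping from work_setting labels to the remote_ratio codes
# that the analysis code expects.
setting_to_ratio = {"On-site": 0, "Hybrid": 50, "Remote": 100}
df["remote_ratio"] = df["work_setting"].map(setting_to_ratio)

# Same aggregation as the remote-work section, keyed on work_setting directly.
setting_stats = (
    df.groupby("work_setting")["salary_in_usd"]
    .agg(["mean", "median", "count"])
)
print(setting_stats)
```

Labels missing from the mapping would become NaN in remote_ratio, so checking `df['work_setting'].unique()` first is a sensible precaution.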
