Python Data Analysis in Practice: Four Classic Pandas Case Studies | 极客日志

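Overview (AI-generated): This article covers the fundamentals of data analysis with Python's Pandas library together with hands-on case studies. It covers installing and importing NumPy, Matplotlib, and related tools, and explains the Series and DataFrame data structures and core functions such as read_table, merge, iloc, pivot_table, and groupby. Using three real datasets (MovieLens movie ratings, US baby names, and the USDA food database), it demonstrates the full workflow of reading, cleaning, merging, pivot-table analysis, and visualization. It is aimed at developers who want to build Python data-processing skills.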

随缘 | Published 2025/2/7 | Updated 2026/5/2817 views

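1. Preparation

Data analysis in Python relies mainly on three core libraries: NumPy, Pandas, and Matplotlib. Jupyter Notebook is the recommended development environment. Start by importing the packages; Matplotlib is included here because the plotting calls in the case studies use plt.

# Import NumPy
import numpy as np
# Import Pandas
import pandas as pd
# Import Matplotlib's pyplot interface (used for the figures in the case studies)
import matplotlib.pyplot as plt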

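2. Fundamentals

Pandas has historically offered three main data structures: Series, DataFrame, and Panel. A Series is a one-dimensional labeled array; a DataFrame is a two-dimensional, table-like structure; a Panel can be thought of as a multi-sheet Excel workbook. Note that Panel was deprecated and has been removed from current pandas releases; a DataFrame with a MultiIndex is the recommended replacement.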
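
The following is a minimal sketch of the two structures still in use, plus one possible MultiIndex stand-in for a Panel. The titles, years, and ratings are made-up illustration data, not taken from any dataset in this article.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ratings = pd.Series([4.5, 3.0, 5.0], index=["a", "b", "c"], name="rating")

# A DataFrame is a two-dimensional table whose columns are Series.
movies = pd.DataFrame(
    {"title": ["Alpha", "Beta", "Gamma"], "year": [1995, 1999, 2001]},
    index=["a", "b", "c"],
)

# Assigning a Series to a DataFrame aligns the values on the shared index.
movies["rating"] = ratings

# Stand-in for a Panel: stack several tables under an outer index level.
# The level names "sheet" and "id" are arbitrary labels chosen for this sketch.
by_sheet = pd.concat({"sheet1": movies, "sheet2": movies}, names=["sheet", "id"])

print(movies)
print(by_sheet.loc["sheet2"])
```

Selecting by_sheet.loc["sheet2"] returns one "sheet" as an ordinary DataFrame, which is roughly how a Panel item used to be accessed.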

read_table

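Reads delimited text files such as .csv, .tsv, and .dat into a DataFrame. It does not read Excel workbooks, which need pd.read_excel. Commonly used parameters include filepath_or_buffer (file path), sep (delimiter), header (row number of the header), and names (list of column names).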

The signature below comes from an older pandas release, so several of these parameters (for example squeeze, prefix, mangle_dupe_cols, error_bad_lines, and warn_bad_lines) are not available in pandas 2.x. By default read_table splits fields on tabs. The default values of the trailing parameters were lost in the source, so those parameters are listed by name only.

read_table(filepath_or_buffer, sep=False, delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, ...)

Remaining parameters: chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision.
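
To show how sep, header, names, and engine fit together, here is a small sketch that parses an in-memory string instead of a file. The three sample rows are illustrative stand-ins written in the "::"-delimited layout of the MovieLens .dat files; the dtype argument is an addition for this sketch and keeps zip codes as text.

```python
import io

import pandas as pd

# Illustrative rows in the "::"-delimited layout of the MovieLens .dat files.
raw = "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"

columns = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table(
    io.StringIO(raw),      # any file-like object works in place of a path
    sep="::",              # separators longer than one character are read as regex
    header=None,           # the data has no header row
    names=columns,         # so column names are supplied explicitly
    engine="python",       # the C engine does not support multi-character separators
    dtype={"zip": str},    # keep zip codes as strings so leading zeros survive
)

print(users)
```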

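merge

Joins two DataFrames on one or more key columns, much like a SQL join. When on is not given, the columns that both frames have in common are used as the join keys.

merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)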
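
The how, indicator, and validate parameters are easiest to see on a tiny example. The sketch below uses two made-up frames; user 4 has ratings but no profile, and user 3 has a profile but no ratings.

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 4],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
users = pd.DataFrame({"user_id": [1, 2, 3], "gender": ["F", "M", "F"]})

# Default inner join on the shared column user_id: unmatched rows are dropped.
inner = pd.merge(ratings, users)

# Outer join keeps unmatched rows from both sides; indicator=True adds a
# _merge column saying where each row came from.
outer = pd.merge(ratings, users, how="outer", on="user_id", indicator=True)

# validate raises MergeError if the key relationship is not as expected.
checked = pd.merge(ratings, users, on="user_id", validate="many_to_one")

print(inner)
print(outer)
```

Using validate is optional, but it can catch accidental row duplication when a key that should be unique on one side is not.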

3. Case Studies

3.1 MovieLens 1M Dataset

# Read users.dat
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("datasets/movielens/users.dat", sep="::",
                      header=None, names=unames, engine="python")
# Read ratings.dat
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::",
                        header=None, names=rnames, engine="python")
# Read movies.dat
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("datasets/movielens/movies.dat", sep="::",
                       header=None, names=mnames, engine="python")

users.head(5)
ratings.head(5)
movies.head(5)

data = pd.merge(pd.merge(ratings, users), movies)
data.iloc[0]

mean_ratings = data.pivot_table("rating", index="title",
                                columns="gender", aggfunc="mean")
mean_ratings.head(5)
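
As a rough illustration of what the pivot_table call computes, the sketch below builds the same kind of table on a few made-up ratings and compares it with an equivalent groupby followed by unstack. Title "B" has no male ratings, so that cell comes out as NaN.

```python
import pandas as pd

# Made-up ratings: title A has F and M ratings, title B only F ratings.
data = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "F"],
    "rating": [4, 2, 4, 5, 3],
})

# Rows are titles, columns are genders, cells are mean ratings.
pivoted = data.pivot_table("rating", index="title",
                           columns="gender", aggfunc="mean")

# The same table built step by step: group, aggregate, then move gender
# from the row index into the columns.
grouped = data.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(pivoted)
```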

ratings_by_title = data.groupby("title").size()
ratings_by_title.head()
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings

mean_ratings = mean_ratings.rename(index={"Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)":
                           "Seven Samurai (Shichinin no samurai) (1954)"})

top_female_ratings = mean_ratings.sort_values("F", ascending=False)
top_female_ratings.head()

mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

sorted_by_diff = mean_ratings.sort_values("diff")
sorted_by_diff.head()

sorted_by_diff[::-1].head()

rating_std_by_title = data.groupby("title")["rating"].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.head()

rating_std_by_title.sort_values(ascending=False)[:10]

movies["genres"].head()
movies["genres"].head().str.split("|")
movies["genre"] = movies.pop("genres").str.split("|")
movies.head()

movies_exploded = movies.explode("genre")
movies_exploded[:10]
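
The split-then-explode step can be sketched on two sample rows. Note that explode repeats the original index label for every generated row, which matters if the result is later aligned by index.

```python
import pandas as pd

# Two sample rows in the movies.dat layout (pipe-separated genres).
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Turn the pipe-separated string into a list per row.
movies["genre"] = movies.pop("genres").str.split("|")

# One output row per list element; the other columns are repeated.
exploded = movies.explode("genre")

# Optional: replace the repeated index labels with a fresh 0..n-1 index.
exploded_reset = exploded.reset_index(drop=True)

print(exploded)
```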

ratings_with_genre = pd.merge(pd.merge(movies_exploded, ratings), users)
ratings_with_genre.iloc[0]
genre_ratings = (ratings_with_genre.groupby(["genre", "age"])
                 ["rating"].mean()
                 .unstack("age"))
genre_ratings[:10]

3.2 US Baby Names, 1880-2010

names1880 = pd.read_csv("datasets/babynames/yob1880.txt",
                        names=["name", "sex", "births"])
names1880

names1880.groupby("sex")["births"].sum()

pieces = []
for year in range(1880, 2011):
    path = f"datasets/babynames/yob{year}.txt"
    frame = pd.read_csv(path, names=["name", "sex", "births"])

    # Add a column for the year
    frame["year"] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
names

total_births = names.pivot_table("births", index="year",
                                 columns="sex", aggfunc="sum")
total_births.tail()
total_births.plot(title="Total births by sex and year")

def add_prop(group):
    group["prop"] = group["births"] / group["births"].sum()
    return group
names = names.groupby(["year", "sex"], group_keys=False).apply(add_prop)
names

names.groupby(["year", "sex"])["prop"].sum()
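
The per-group proportion can also be computed without groupby.apply. The sketch below is an alternative formulation, not the article's method: it uses transform("sum") to broadcast each group's total back to its rows, then runs the same sanity check that the proportions sum to 1 within every group. The five rows are made-up data.

```python
import pandas as pd

# Made-up rows in the same layout as the combined names table.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [70, 30, 100, 60, 40],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# transform returns a Series aligned with the original rows, holding the
# total births of the (year, sex) group each row belongs to.
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")
names["prop"] = names["births"] / group_totals

# Sanity check: within each group the proportions should add up to 1.
check = names.groupby(["year", "sex"])["prop"].sum()

print(names)
print(check)
```

Because transform keeps the original row order and index, it avoids the index handling that apply needs via group_keys=False.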

def get_top1000(group):
    return group.sort_values("births", ascending=False)[:1000]
grouped = names.groupby(["year", "sex"])
top1000 = grouped.apply(get_top1000)
top1000.head()

top1000 = top1000.reset_index(drop=True)
top1000.head()

boys = top1000[top1000["sex"] == "M"]
girls = top1000[top1000["sex"] == "F"]
total_births = top1000.pivot_table("births", index="year",
                                   columns="name",
                                   aggfunc="sum")
total_births.info()
subset = total_births[["John", "Harry", "Mary", "Marilyn"]]
subset.plot(subplots=True, figsize=(12, 10),
            title="Number of births per year")

plt.figure()
table = top1000.pivot_table("prop", index="year",
                            columns="sex", aggfunc="sum")
table.plot(title="Sum of table1000.prop by year and sex",
           yticks=np.linspace(0, 1.2, 13))

df = boys[boys["year"] == 2010]
df

prop_cumsum = df["prop"].sort_values(ascending=False).cumsum()
prop_cumsum[:10]
prop_cumsum.searchsorted(0.5)

df = boys[boys.year == 1900]
in1900 = df.sort_values("prop", ascending=False).prop.cumsum()
in1900.searchsorted(0.5) + 1
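
The reason for the "+ 1" is that searchsorted returns a zero-based insertion position. A small sketch with made-up proportions shows how the position turns into a count of names needed to reach 50% of births.

```python
import pandas as pd

# Made-up name proportions for a single year and sex.
prop = pd.Series([0.05, 0.30, 0.10, 0.25, 0.20, 0.10],
                 index=["a", "b", "c", "d", "e", "f"])

# Sort from most to least popular and accumulate: 0.30, 0.55, 0.75, ...
cumulative = prop.sort_values(ascending=False).cumsum()

# Zero-based position where 0.5 would be inserted to keep the order.
position = cumulative.searchsorted(0.5)

# Positions start at 0, so the number of names needed is position + 1.
count = position + 1

print(cumulative)
print(position, count)
```

Here the two most popular names together cover 55% of births, so the position is 1 and the count is 2.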

def get_quantile_count(group, q=0.5):
    group = group.sort_values("prop", ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(["year", "sex"]).apply(get_quantile_count)
diversity = diversity.unstack()
fig = plt.figure()
diversity.head()
diversity.plot(title="Number of popular names in top 50%")

def get_last_letter(x):
    return x[-1]

last_letters = names["name"].map(get_last_letter)
last_letters.name = "last_letter"

table = names.pivot_table("births", index=last_letters,
                          columns=["sex", "year"], aggfunc="sum")
subtable = table.reindex(columns=[1910, 1960, 2010], level="year")
subtable.head()

subtable.sum()
letter_prop = subtable / subtable.sum()
letter_prop

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop["M"].plot(kind="bar", rot=0, ax=axes[0], title="Male")
letter_prop["F"].plot(kind="bar", rot=0, ax=axes[1], title="Female",
                      legend=False)

letter_prop = table / table.sum()

dny_ts = letter_prop.loc[["d", "n", "y"], "M"].T
dny_ts.head()

all_names = pd.Series(top1000["name"].unique())
lesley_like = all_names[all_names.str.contains("Lesl")]
lesley_like

filtered = top1000[top1000["name"].isin(lesley_like)]
filtered.groupby("name")["births"].sum()

table = filtered.pivot_table("births", index="year",
                             columns="sex", aggfunc="sum")
table = table.div(table.sum(axis="columns"), axis="index")
table.tail()
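
The div call with axis="index" normalizes each row by its own total. A minimal sketch on made-up counts, where every row ends up summing to 1:

```python
import pandas as pd

# Made-up births by sex for three years.
table = pd.DataFrame(
    {"F": [10, 30, 90], "M": [90, 30, 10]},
    index=pd.Index([1900, 1950, 2000], name="year"),
)

# Row totals, indexed by year.
row_totals = table.sum(axis="columns")

# axis="index" matches the divisor to the row labels, so every row is
# divided by its own total rather than by a per-column value.
shares = table.div(row_totals, axis="index")

print(shares)
```

The explicit axis matters because arithmetic between a DataFrame and a Series aligns the Series with the column labels by default.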

fig = plt.figure()
table.plot(style={"M": "k-", "F": "k--"})

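3.3 USDA Food Database

import json
# Open the file in a with-block so the handle is closed after loading
with open("datasets/usda_food/database.json") as f:
    db = json.load(f)
len(db)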

db[0].keys()
db[0]["nutrients"][0]
nutrients = pd.DataFrame(db[0]["nutrients"])
nutrients.head(7)

info_keys = ["description", "group", "id", "manufacturer"]
info = pd.DataFrame(db, columns=info_keys)
info.head()
info.info()

info["group"].value_counts()[:10]

nutrients = []

for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
nutrients
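
As an alternative to the explicit loop, pd.json_normalize can flatten nested records in a single call. This is a sketch on two made-up records shaped like the database entries, not the article's approach; the final cast is there because the carried-over meta column is not guaranteed to keep an integer dtype.

```python
import pandas as pd

# Made-up records shaped like the entries in the food database.
db = [
    {"id": 1, "description": "Food A", "group": "Dairy",
     "nutrients": [
         {"value": 25.0, "units": "g", "description": "Protein",
          "group": "Composition"},
         {"value": 3.0, "units": "g", "description": "Carbohydrate",
          "group": "Composition"},
     ]},
    {"id": 2, "description": "Food B", "group": "Snacks",
     "nutrients": [
         {"value": 1.5, "units": "mg", "description": "Zinc, Zn",
          "group": "Elements"},
     ]},
]

# One row per nutrient record, with the parent food id carried along.
flat = pd.json_normalize(db, record_path="nutrients", meta=["id"])

# Cast the carried-over id explicitly so later merges use an integer key.
flat["id"] = flat["id"].astype(int)

print(flat)
```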

nutrients.duplicated().sum()  # number of duplicates
nutrients = nutrients.drop_duplicates()

col_mapping = {"description" : "food",
               "group"       : "fgroup"}
info = info.rename(columns=col_mapping)
info.info()
col_mapping = {"description" : "nutrient",
               "group" : "nutgroup"}
nutrients = nutrients.rename(columns=col_mapping)
nutrients

ndata = pd.merge(nutrients, info, on="id")
ndata.info()
ndata.iloc[30000]

fig = plt.figure()
result = ndata.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)
result["Zinc, Zn"].sort_values().plot(kind="barh")

by_nutrient = ndata.groupby(["nutgroup", "nutrient"])

def get_maximum(x):
    return x.loc[x.value.idxmax()]

max_foods = by_nutrient.apply(get_maximum)[["value", "food"]]

# Truncate the food names to 50 characters to keep the output readable
max_foods["food"] = max_foods["food"].str[:50]
max_foods.loc["Amino Acids"]["food"]
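
Finding the food with the highest value per nutrient can also be done with idxmax directly, without groupby.apply. The sketch below shows that alternative on made-up values; the food names and numbers are illustrative only.

```python
import pandas as pd

# Made-up nutrient values for illustration.
ndata = pd.DataFrame({
    "nutrient": ["Zinc", "Zinc", "Iron", "Iron"],
    "food": ["Oyster", "Beef", "Spinach", "Lentils"],
    "value": [39.3, 4.8, 2.7, 3.3],
})

# For each nutrient, the row label of the largest value.
idx = ndata.groupby("nutrient")["value"].idxmax()

# Look those rows up in the original frame and index them by nutrient.
max_foods = ndata.loc[idx, ["nutrient", "food", "value"]].set_index("nutrient")

print(max_foods)
```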
