0. Environment Setup (currently recommended)
pip install --upgrade pandas pyarrow numpy
pip install fastparquet   # optional: an alternative Parquet engine
import pandas as pd
import numpy as np
print(pd.__version__)
print(pd.get_option("mode.string_storage"))
Important changes in pandas 3.0+ (must know):
- Default string type → a dedicated string dtype backed by PyArrow, with a Python fallback (no longer object)
- Many long-deprecated APIs were removed outright (e.g. fillna(method="ffill") → .ffill())
- Major performance gains (especially groupby, join, and string operations)
1. Core Data Structures at a Glance (the essentials in 5 minutes)
| Structure | Excel analogue | Essence in one line | Main use cases | Memory tag (pandas 3.0+) |
|---|---|---|---|---|
| Series | a single column | labeled 1-D array | single-column data, time series | dtype can be Arrow-backed |
| DataFrame | a worksheet | 2-D table with row/column indexes | almost all tabular data | — |
| Index | row/column labels | immutable array (can be multi-level) | indexing, alignment, merge keys | — |
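To make the table concrete, a minimal toy example (data invented here) showing how the three structures relate:

```python
import pandas as pd

# A Series is a labeled 1-D array; its labels form an Index
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# A DataFrame is a 2-D table; every column is a Series, and all
# columns share the same row Index
df = pd.DataFrame({"score": [10, 20, 30], "grade": ["C", "B", "A"]},
                  index=["a", "b", "c"])

print(type(df["score"]).__name__)  # Series
print(df.index)                    # the shared, immutable row Index
print(df.columns)                  # column labels are an Index too
```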
2. Reading Data (the most common methods)
df = pd.read_csv("data.csv", encoding="utf-8-sig", dtype_backend="pyarrow")
df = pd.read_excel("sales.xlsx", sheet_name="2025Q4", engine="openpyxl")
df = pd.read_parquet("bigfile.parquet")
df = pd.read_json("api_response.json", lines=True)
df = pd.read_sql("SELECT * FROM orders WHERE year=2025", con=engine)
df = pd.read_csv("huge.csv", nrows=10000)            # quick preview: first 10,000 rows only
reader = pd.read_csv("huge.csv", chunksize=500_000)  # an iterator of DataFrames, not a DataFrame
pandas 3.0+ recommendation: always pass dtype_backend="pyarrow" (less memory, faster)
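One point worth stressing from the list above: with chunksize, read_csv returns an iterator rather than a DataFrame. A minimal sketch, using an in-memory CSV in place of huge.csv:

```python
import io
import pandas as pd

# Simulate a large CSV file with an in-memory buffer
csv_buf = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv_buf, chunksize=4):  # yields DataFrames of <= 4 rows
    total += chunk["x"].sum()                    # aggregate chunk by chunk

print(total)  # 45 (sum of 0..9), computed without loading everything at once
```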
3. Exploring Data (a standard EDA checklist)
df.shape
df.info(memory_usage="deep")
df.describe(percentiles=[.01,.05,.25,.5,.75,.95,.99])
df.nunique()
df.isna().sum()/len(df)*100
for col in df.select_dtypes("string").columns:
    print(f"\n{col} top 3 values:")
    print(df[col].value_counts(dropna=False).head(3))
df.sample(8)
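A toy run of the missing-rate and frequency checks above (data invented here), to show what they return:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":   ["BJ", "SH", "BJ", None],
    "amount": [100.0, np.nan, 250.0, 80.0],
})

# Missing rate per column, in percent
na_pct = df.isna().sum() / len(df) * 100
print(na_pct)  # city 25.0, amount 25.0

# Top frequencies, counting NaN as a value of its own
print(df["city"].value_counts(dropna=False).head(3))  # BJ appears twice
```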
4. Data Cleaning – the highest-frequency operations in one pass
df = df.assign(
    age=df["age"].fillna(df["age"].median()),
    income=df["income"].ffill(),  # fillna(method=...) was removed in pandas 3.0
    gender=df["gender"].fillna("Unknown"),
).dropna(subset=["user_id", "order_date"])
df = df.drop_duplicates(subset=["user_id","event"], keep="last")
df = df.astype({
    "order_date": "datetime64[ns]",
    "price": "float64[pyarrow]",
    "user_id": "string[pyarrow]",
    "category": "category",
})
df["product_name"] = (
    df["product_name"].str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .replace({r"iphone 1[456] pro max": "iphone pro max"}, regex=True)
)
clip_lower, clip_upper = df["amount"].quantile([0.005,0.995])
df["amount_clip"] = df["amount"].clip(clip_lower, clip_upper)
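The quantile-clipping (winsorization) step in action, on invented data:

```python
import pandas as pd

amount = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])  # 1000 is an outlier

lo, hi = amount.quantile([0.005, 0.995])  # a 2-element Series unpacks cleanly
clipped = amount.clip(lo, hi)

print(clipped.max())  # well below 1000: the outlier is pulled toward the bulk
```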
5. Core Operations – the groupby + agg + transform + pivot combo
sales_by_month = (
    df.query("year == 2025")
    .assign(month=lambda x: x["order_date"].dt.to_period("M"))
    .groupby(["category", "month"], observed=True, dropna=False)
    .agg(
        revenue=("amount", "sum"),
        orders=("order_id", "nunique"),
        avg_price=("amount", "mean"),
        active_users=("user_id", "nunique"),
    )
    .round(2)
    .reset_index()
)
df["rank_in_category"] = df.groupby("category")["amount"].rank(ascending=False)
df["pct_of_category"] = df.groupby("category")["amount"].transform(lambda x: x / x.sum())
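How rank and transform behave, on a four-row toy frame. The key point: transform returns a result the same length as its input, so group-level values broadcast back to every row.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "amount":   [10.0, 30.0, 20.0, 20.0],
})

# rank(ascending=False): 1 = largest within each category; ties share the average rank
df["rank_in_category"] = df.groupby("category")["amount"].rank(ascending=False)

# transform keeps the original shape: each row is divided by its own group's sum
df["pct_of_category"] = df.groupby("category")["amount"].transform(lambda x: x / x.sum())

print(df)
# A rows: ranks 2, 1 and shares 0.25, 0.75; B rows: tied ranks 1.5, shares 0.5
```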
df = df.merge(
    user_df[["user_id", "reg_date", "city"]],
    on="user_id",
    how="left",
    suffixes=("", "_user"),
)
pd.pivot_table(
    df,
    values="amount",
    index="month",
    columns="category",
    aggfunc=["sum", "count"],
    margins=True,
    margins_name="Total",
)
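What pivot_table produces on a three-row toy frame, with margins adding a "Total" row and column (a single aggfunc here to keep the output flat):

```python
import pandas as pd

df = pd.DataFrame({
    "month":    ["2025-01", "2025-01", "2025-02"],
    "category": ["food", "tech", "food"],
    "amount":   [100, 200, 300],
})

pt = pd.pivot_table(df, values="amount", index="month", columns="category",
                    aggfunc="sum", margins=True, margins_name="Total")

print(pt.loc["Total", "Total"])  # 600, the grand total of all amounts
```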
6. Time Series (essential for e-commerce and finance)
df["order_date"] = pd.to_datetime(df["order_date"])
df["year"] = df["order_date"].dt.year
df["quarter"] = df["order_date"].dt.quarter
df["month"] = df["order_date"].dt.month
df["week"] = df["order_date"].dt.isocalendar().week
df["dayofweek"] = df["order_date"].dt.dayofweek
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
7. Performance Optimization – must-know for the big-data era (2025-2026 focus)
| Scenario | Recommended approach | Rough speedup |
|---|---|---|
| CSV > 1 GB | read_parquet + dtype_backend="pyarrow" | 3-10× |
| slow groupby | named-aggregation tuples in .agg() / pyarrow strings | 2-5× |
| loops / iterrows | switch to vectorized ops / apply / map | 10-1000× |
| memory blow-up | category dtype + downcast + gc.collect | 30-70% ↓ |
| huge-table joins | merge with sort=False, or pd.concat | — |
| repeated groupby | save the groupby object once, then call agg repeatedly | 2-4× |
df["category"] = df["category"].astype("category")
df["user_id"] = df["user_id"].astype("string[pyarrow]")
df["price"] = pd.to_numeric(df["price"], downcast="float")
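The category trick from the table above, measured on invented low-cardinality data:

```python
import pandas as pd

# 100,000 strings but only 2 distinct values: ideal for the category dtype
s_raw = pd.Series(["electronics", "clothing"] * 50_000)
s_cat = s_raw.astype("category")  # stored as small integer codes + one lookup table

raw_bytes = s_raw.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f"{raw_bytes:,} -> {cat_bytes:,} bytes")  # typically a large reduction
```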
8. Real-Project Template (recommended structure)
import pandas as pd
from pathlib import Path

def load_raw():
    return pd.read_parquet("data/raw/orders_2025.parquet")

def clean(df):
    return df  # TODO: cleaning steps

def feature_engineering(df):
    return df  # TODO: feature steps

def aggregate(df):
    return ...  # TODO: aggregation

if __name__ == "__main__":
    df = load_raw()
    df_clean = clean(df)
    df_feat = feature_engineering(df_clean)
    result = aggregate(df_feat)
    result.to_parquet("data/processed/monthly_report.parquet")
    print("Done.")
Suggested Path for Further Study (current edition)
| Stage | Focus | Suggested resources (as of now) |
|---|---|---|
| Beginner | read_xxx, select_dtypes, groupby + agg | official "10 minutes to pandas" + Corey Schafer videos |
| Intermediate | merge/join, pivot_table, time series, apply/map | Real Python pandas learning path |
| Advanced | pyarrow backend, MultiIndex, custom agg, style | pandas 3.0 "What's new" + user guide |
| Expert | query/numexpr, Styler, pandas vs. polars, dask | Towards Data Science 2025 advanced articles + GitHub source |