Python + AI Learning Roadmap: From Fundamentals to Practice | 极客日志
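A complete learning roadmap for combining Python with artificial intelligence, covering Python fundamentals, data science, machine learning, deep learning, and NLP/LLM applications. It includes a stage-by-stage study guide, core code examples (Scikit-learn, PyTorch, Transformers), recommended hands-on projects, and a quick reference for environment setup and common commands, with the aim of taking readers from zero to working AI development skills.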
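修罗 · Published 2026/4/5 · Updated 2026/5/27
Why Python + AI?
Python has become the most widely used programming language in artificial intelligence. According to figures attributed to the Stack Overflow 2024 Developer Survey, Python is used in more than 85% of AI/ML work.
Python's advantages for AI
Advantage | Description
🐍 Concise syntax | Quick to pick up; lets you focus on the algorithm rather than syntax details
📦 Rich ecosystem | Mature libraries such as NumPy, Pandas, and PyTorch
👥 Active community | Abundant tutorials, open-source projects, and Q&A
🔧 Solid tooling | Strong development environments such as Jupyter and Colab
🚀 Easy deployment | Flask/FastAPI for quickly building AI services
Distribution of AI technology areas
Knowing the approximate share of each area helps you decide where to focus your study:
35% Machine learning
30% Deep learning
15% Natural language processing
12% Computer vision
5% Reinforcement learning
3% Other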
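The complete learning path
The diagram below shows the full route from no experience to AI engineer:
graph TD
A[Start learning Python + AI] --> B{Programming experience?}
B -- No --> C[Stage 0: Python fundamentals]
B -- Yes --> D[Stage 1: Data science fundamentals]
C --> D
D --> E[Stage 2: Machine learning]
E --> F{Choose a direction}
F --> G[Stage 3a: NLP]
F --> H[Stage 3b: Computer vision]
F --> I[Stage 3c: General deep learning]
G --> J[Stage 4: Hands-on projects]
H --> J
I --> J
J --> K[🎉 AI engineer]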
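Stage-by-stage study guide
🟢 Stage 0: Python Fundamentals (2-4 weeks)
Learning goal: master core Python syntax and a programming mindset
Core topics
Python basics: data types, control flow, functions, object-oriented programming, file I/O, exception handling
Built-in types and containers: int, float, str, list, dict, tuple, set
Control flow: if/else, for/while loops
Functions in depth: function definitions, lambda expressions, decorator basics
OOP: classes and objects, inheritance and polymorphism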
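Essential code examples
Example 1: List comprehensions
# Traditional loop
squares = []
for i in range(10):
    squares.append(i ** 2)

# Equivalent list comprehension
squares = [i ** 2 for i in range(10)]

# With a filter: squares of the even numbers only
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]

# Dict comprehension: count word frequencies
word_count = "hello world hello python"
counts = {word: word_count.split().count(word) for word in set(word_count.split())}
print(counts)

# Same count with defaultdict: a single pass over the words
from collections import defaultdict
counts = defaultdict(int)
for word in word_count.split():
    counts[word] += 1

# File reading: the with statement closes the file automatically
# (data.txt must exist in the working directory)
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Custom context manager that times the enclosed block
import time
from contextlib import contextmanager

@contextmanager
def timer():
    start = time.time()
    yield
    print(f"Elapsed: {time.time() - start:.2f} s")

with timer():
    sum(range(1000000))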
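The topic list for this stage names lambda expressions and decorator basics. The following is a minimal sketch of both; `count_calls`, `square`, and `add_one` are hypothetical names chosen for illustration and do not come from the original article.

```python
import functools


def count_calls(func):
    """Hypothetical decorator: records how many times func has been called."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def square(x):
    return x ** 2


# A lambda is an anonymous function limited to a single expression
add_one = lambda x: x + 1

results = [add_one(square(n)) for n in range(3)]
print(results)          # [1, 2, 5]
print(square.calls)     # 3
print(square.__name__)  # square
```

Writing `@count_calls` above `square` is shorthand for `square = count_calls(square)`, which is why the call counter lives on the wrapper function.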
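🔵 Stage 1: Data Science Fundamentals (4-6 weeks)
Core skill tree
NumPy: array creation, array arithmetic, broadcasting, linear algebra
Pandas: Series operations, DataFrame operations, data cleaning, grouping, merging
Visualization: Matplotlib, Seaborn, Plotly
NumPy in practice
import numpy as np

# Array creation
arr = np.array([[1, 2, 3], [4, 5, 6]])
zeros = np.zeros((3, 4))
rand_arr = np.random.randn(3, 3)  # samples from the standard normal distribution

# Element-wise arithmetic and matrix products
print(arr * 2)
print(arr @ arr.T)          # (2, 3) @ (3, 2) -> (2, 2)
print(np.dot(arr, arr.T))   # same result as the @ operator

# Broadcasting: b has shape (3,) and is added to every row of a
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(a + b)

# Reductions along an axis
print(np.mean(arr, axis=1))    # per-row mean: [2. 5.]
print(np.argmax(arr, axis=1))  # per-row index of the maximum: [2 2]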
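Broadcasting, used in the `a + b` line of the NumPy example, follows a simple shape rule: shapes are aligned from the right, and two dimensions are compatible when they are equal or one of them is 1. The sketch below reimplements that rule in plain Python to make it explicit; `broadcast_shape` is a hypothetical helper, not a NumPy function.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the shape two arrays broadcast to, or raise ValueError."""
    result = []
    # Walk both shapes from the rightmost dimension; a missing dimension counts as 1
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"incompatible dimensions: {dim_a} vs {dim_b}")
        # A dimension of size 1 is stretched to match the other one
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))


print(broadcast_shape((2, 3), (3,)))    # (2, 3), the a + b case
print(broadcast_shape((4, 1), (1, 5)))  # (4, 5)
```

A pair such as `(2, 3)` and `(2,)` fails because the rightmost dimensions, 3 and 2, are neither equal nor 1.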
Pandas data processing in practice
import pandas as pd

# Build a DataFrame from a dict of columns
data = {
    'name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
    'age': [25, 30, 35, 28],
    'city': ['Beijing', 'Shanghai', 'Shenzhen', 'Hangzhou'],
    'salary': [15000, 20000, 25000, 18000]
}
df = pd.DataFrame(data)

# Boolean filtering
high_salary = df[df['salary'] > 18000]
beijing = df[df['city'] == 'Beijing']

# Group by city and aggregate salary
city_stats = df.groupby('city').agg({'salary': ['mean', 'max', 'count']})

# Sort by salary, highest first
df_sorted = df.sort_values('salary', ascending=False)

# Left join: rows of df without a match get NaN in 'department'
df2 = pd.DataFrame({'name': ['Zhang San', 'Li Si'], 'department': ['Engineering', 'Product']})
merged = pd.merge(df, df2, on='name', how='left')
print(city_stats)

Data visualization example
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Bar chart
axes[0, 0].bar(df['name'], df['salary'])
axes[0, 0].set_title('Salary comparison')
axes[0, 0].set_xlabel('Name')
axes[0, 0].set_ylabel('Salary')

# Scatter plot
axes[0, 1].scatter(df['age'], df['salary'], s=100, alpha=0.6)
axes[0, 1].set_title('Age vs. salary')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Salary')

# Pie chart
city_counts = df['city'].value_counts()
axes[1, 0].pie(city_counts, labels=city_counts.index, autopct='%1.1f%%')
axes[1, 0].set_title('City distribution')

# Box plot
axes[1, 1].boxplot(df['salary'])
axes[1, 1].set_title('Salary distribution')

plt.tight_layout()
plt.savefig('visualization.png', dpi=300)
plt.show()
🟡 Stage 2: Machine Learning (6-8 weeks)
Learning goal: understand ML principles and get hands-on with Scikit-learn
ML algorithm taxonomy
Supervised learning: regression (linear regression, decision tree regression, random forest regression, gradient boosting regression), classification (logistic regression, SVM, decision trees, random forests, XGBoost, neural networks)
Unsupervised learning: clustering (K-Means, DBSCAN, hierarchical clustering), dimensionality reduction (PCA, t-SNE, UMAP)
Reinforcement learning: Q-Learning, Policy Gradient, and others
Classic algorithm implementations
# Linear regression on synthetic house-price data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n_samples = 1000
# Three synthetic features, rescaled to different ranges
X = np.random.randn(n_samples, 3)
X[:, 0] = X[:, 0] * 50 + 100
X[:, 1] = np.abs(X[:, 1]) * 2 + 1
X[:, 2] = np.abs(X[:, 2]) * 10 + 1
# Target: a linear combination of the features plus Gaussian noise
y = (X[:, 0] * 1000 + X[:, 1] * 50000 - X[:, 2] * 2000 + np.random.randn(n_samples) * 50000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean squared error (MSE): {mse:.2f}")
print(f"Coefficient of determination (R²): {r2:.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")

# Predicted vs. actual; points on the dashed line are perfect predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('House price prediction: actual vs. predicted')
plt.grid(True, alpha=0.3)
plt.show()
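The regression example reports MSE and R² through Scikit-learn. As a rough sketch of what those two numbers mean, they can be computed by hand in a few lines; the functions `mse` and `r2` and the sample values below are illustrative and not part of the original example.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
print(mse(y_true, y_pred))  # 0.125
print(r2(y_true, y_pred))   # 0.975
```

An R² of 0 corresponds to always predicting the mean of `y_true`, so values close to 1 indicate that the model explains most of the variance.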
# Comparing classifiers with 5-fold cross-validation
# (numpy and matplotlib.pyplot are imported as np and plt in the previous example)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15,
    n_redundant=5, n_classes=3, random_state=42
)

models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(),
    'Decision tree': DecisionTreeClassifier(),
    'Random forest': RandomForestClassifier(n_estimators=100),
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier()
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = {'mean': scores.mean(), 'std': scores.std()}
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Bar chart of mean accuracy with the standard deviation as error bars
plt.figure(figsize=(12, 6))
names = list(results.keys())
means = [results[name]['mean'] for name in names]
stds = [results[name]['std'] for name in names]
plt.bar(names, means, yerr=stds, alpha=0.8, capsize=5)
plt.ylabel('Accuracy')
plt.title('Classifier comparison (5-fold cross-validation)')
plt.ylim(0.7, 1.0)
plt.grid(axis='y', alpha=0.3)
plt.xticks(rotation=15)
plt.show()
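The comparison above relies on `cv=5`. As a simplified sketch of what k-fold splitting does, the generator below partitions sample indices into k contiguous folds and uses each fold once as the test set. `k_fold_indices` is a hypothetical helper; note that for classifiers Scikit-learn uses stratified folds, which this sketch deliberately ignores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```

Every sample lands in exactly one test fold, so each model is evaluated on all of the data without ever being scored on a sample it was trained on.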
# Clustering: K-Means vs. DBSCAN
# (matplotlib.pyplot is imported as plt in the earlier examples)
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(
    n_samples=500, centers=4, cluster_std=1.5, random_state=42
)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)
kmeans_silhouette = silhouette_score(X, kmeans_labels)

dbscan = DBSCAN(eps=1.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X)
# DBSCAN labels noise points as -1; do not count that as a cluster
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)

print(f"K-Means: 4 clusters, silhouette score = {kmeans_silhouette:.3f}")
print(f"DBSCAN: found {n_clusters} clusters")

# Elbow method: inertia for K = 2..10.
# A separate name (km) keeps the fitted 4-cluster model above intact,
# so the centroids plotted below match kmeans_labels.
inertias = []
K_range = range(2, 11)
for K in K_range:
    km = KMeans(n_clusters=K, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow method for choosing K')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title('K-Means clustering result')
plt.legend()
plt.tight_layout()
plt.show()
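To show what K-Means does internally, here is a small sketch of Lloyd's algorithm in plain Python: alternate between assigning each point to its nearest center and moving each center to the mean of its points. The function names, the fixed initial centers, and the six sample points are assumptions for illustration; Scikit-learn's `KMeans` adds smarter initialization and convergence checks.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: dist2(p, centers[c])) for p in points]


def lloyd_kmeans(points, centers, n_iter=10):
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        centers = new_centers
    labels = assign(points, centers)
    # Inertia: sum of squared distances to the assigned center
    inertia = sum(dist2(p, centers[label]) for p, label in zip(points, labels))
    return centers, labels, inertia


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, inertia = lloyd_kmeans(points, centers=[(0, 0), (10, 10)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(inertia)
```

The inertia returned here is the same quantity the elbow plot tracks: it always shrinks as K grows, which is why one looks for the bend rather than the minimum.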
🟠 Stage 3: Deep Learning (8-12 weeks)
Choosing a deep learning framework
PyTorch: the first choice for research; dynamic graphs and a native Python style
TensorFlow: industrial deployment; TF-Serving, and TFLite for mobile
JAX: functional programming, automatic differentiation, high-performance computing
PyTorch in practice
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# A three-layer fully connected network
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size // 2)
        self.layer3 = nn.Linear(hidden_size // 2, num_classes)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer3(out)  # raw logits; CrossEntropyLoss applies softmax itself
        return out

# Hyperparameters
input_size = 784
hidden_size = 256
num_classes = 10
num_epochs = 10
batch_size = 100
learning_rate = 0.001

model = NeuralNetwork(input_size, hidden_size, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Random stand-in data (784 features, 10 classes); replace with a real dataset
X_train = torch.randn(1000, input_size).to(device)
y_train = torch.randint(0, num_classes, (1000,)).to(device)
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Training loop
train_losses = []
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for i, (images, labels) in enumerate(train_loader):
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()   # clear gradients from the previous step
        loss.backward()         # backpropagate
        optimizer.step()        # update the weights
        epoch_loss += loss.item()
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')

plt.figure(figsize=(10, 5))
plt.plot(train_losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training loss curve')
plt.grid(True, alpha=0.3)
plt.show()
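The training loop passes raw logits to `nn.CrossEntropyLoss`, which applies a softmax internally and then takes the negative log-probability of the target class. The sketch below reproduces that computation for a single sample in plain Python; it is illustrative only and omits batching and the averaging over a batch that PyTorch performs by default.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


# Two equal logits -> probability 0.5 each -> loss = ln 2
print(cross_entropy([0.0, 0.0], 0))        # 0.6931...
# A confident, correct prediction gives a loss close to 0
print(cross_entropy([10.0, 0.0, 0.0], 0))
```

With 10 classes and random labels, as in the stand-in data above, an untrained model sits near ln 10 ≈ 2.30, which is a useful sanity check for the first epoch's loss.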
# Convolutional neural network (CNN)
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: two 3x3 convolutions with batch norm, then 2x2 max pooling
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2)
        # Block 2: same pattern with 64 channels
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2, 2)
        # Classifier head
        self.fc1 = nn.Linear(64 * 7 * 7, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool1(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool2(x)
        x = x.view(-1, 64 * 7 * 7)  # flatten to (batch, features)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = CNN()
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
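The `fc1` layer in the CNN expects `64 * 7 * 7` inputs, which only works for a particular input size. The sketch below traces the spatial size through the layers using the standard output-size formula, assuming a 1x28x28 input (MNIST-sized); the article does not state the input size, so that value is an assumption, and `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                    # assumed input height/width
size = conv_out(size, kernel=3, padding=1)   # conv1 -> 28
size = conv_out(size, kernel=3, padding=1)   # conv2 -> 28
size = conv_out(size, kernel=2, stride=2)    # pool1 -> 14
size = conv_out(size, kernel=3, padding=1)   # conv3 -> 14
size = conv_out(size, kernel=3, padding=1)   # conv4 -> 14
size = conv_out(size, kernel=2, stride=2)    # pool2 -> 7

flat_features = 64 * size * size             # 64 channels after conv4
print(size, flat_features)                   # 7 3136
```

A 3x3 convolution with padding 1 preserves the spatial size, so only the two pooling layers shrink it, each by a factor of 2. Feeding a different input size would require changing the `fc1` input dimension accordingly.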
# Multi-head attention, the core building block of the Transformer
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # scores: (batch, heads, seq_q, seq_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Masked positions get a large negative score, so softmax gives them ~0 weight
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project, then split into heads: (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        x, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge the heads back: (batch, seq, d_model)
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(x)
        return output, attention_weights

# Self-attention on random input
d_model = 512
num_heads = 8
seq_length = 10
batch_size = 4
x = torch.randn(batch_size, seq_length, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, attention = mha(x, x, x)
print(f"Input shape: {x.shape}")                  # (4, 10, 512)
print(f"Output shape: {output.shape}")            # (4, 10, 512)
print(f"Attention weights shape: {attention.shape}")  # (4, 8, 10, 10)
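The core of the module above is scaled dot-product attention, softmax(QKᵀ / √d_k)·V. To make the arithmetic visible, here is a single-head sketch on tiny nested lists in plain Python, without projections, masking, or batching; the function names and sample matrices are illustrative assumptions.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head; rows are token vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
              for row in weights]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
print(weights)  # each row sums to 1; the matching key gets the larger weight
print(output)
```

Because every row of `weights` sums to 1, each output vector stays inside the range spanned by the value vectors, pulled toward the values whose keys best match the query.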
🔴 Stage 4: NLP and LLM Applications (6-8 weeks)
Learning goal: master modern NLP techniques and become fluent with large language models
NLP milestones
2017: Transformer
2018: BERT/GPT
2019: GPT-2
2020: GPT-3
2022: ChatGPT
2023: GPT-4/Llama 2
2024: Multimodal large models
Transformers in practice
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch

# 1. Sentiment analysis with a ready-made pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
texts = [
    "I love this product! It's amazing!",
    "This is the worst experience ever.",
    "It's okay, nothing special."
]
results = sentiment_pipeline(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, confidence: {result['score']:.4f}\n")

# 2. Loading a Chinese BERT for 3-way classification.
# Note: bert-base-chinese has no fine-tuned classification head, so the head
# created here is randomly initialized. The prediction below only demonstrates
# the API; it is not meaningful until the model is fine-tuned on labeled data.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)

# "The food at this restaurant is delicious and the service is attentive!"
text = "这家餐厅的菜品味道很好,服务也很周到!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()
labels = ["negative", "neutral", "positive"]
print(f"Predicted class: {labels[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class]:.4f}")
# 3. Retrieval-augmented question answering (RAG) with LangChain
# Note: these import paths follow older LangChain releases. Newer releases moved
# the loaders, embeddings, and vector stores into the langchain_community package,
# so adjust the imports to the version you have installed.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Load the knowledge base and split it into overlapping chunks
loader = TextLoader('knowledge_base.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50, length_function=len
)
splits = text_splitter.split_documents(documents)

# Embed the chunks and index them in FAISS
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(
    search_type="similarity", search_kwargs={"k": 3}
)

# Prompt that restricts the model to the retrieved context
prompt_template = """Use the following context to answer the question. If you do not know the answer, say so; do not make one up.
Context: {context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# The OpenAI LLM reads its key from the OPENAI_API_KEY environment variable
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

query = "How do I apply for a social security card?"
result = qa_chain({"query": query})
print(f"Question: {query}")
print(f"Answer: {result['result']}")
print("\nSource documents:")
for doc in result['source_documents']:
    print(f"- {doc.page_content[:100]}...")
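The retrieval step of the RAG chain can be sketched without any external library: embed the query and every chunk, rank chunks by cosine similarity, and paste the top k into the prompt. In the sketch below a bag-of-words count stands in for the sentence-transformer embedding, so `embed`, `retrieve`, `build_prompt`, and the sample chunks are all illustrative assumptions rather than LangChain APIs.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """'Stuff' the retrieved chunks into a single prompt string."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "apply for a social security card at the service hall",
    "library opening hours are nine to five",
    "bring your id card when you apply for a card",
]
query = "how to apply for a social security card"
top = retrieve(query, chunks, k=2)
print(top)
print(build_prompt(query, chunks))
```

A real system replaces `embed` with a dense embedding model and the linear scan with a vector index such as FAISS, but the ranking logic is the same.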
# 4. A local chat assistant with ChatGLM3 (requires a CUDA GPU)
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = model.eval()

history = []
response = "Hello! I am Xiaozheng, your intelligent assistant. How can I help you?"
print(f"Assistant: {response}")

while True:
    user_input = input("\nUser: ")
    if user_input.lower() in ['退出', 'exit', 'quit']:
        break
    system_prompt = "You are an intelligent guide at a government service hall, named 'Xiaozheng'."
    # model.chat returns the reply together with the updated history,
    # so the history must not be appended to manually afterwards
    response, history = model.chat(
        tokenizer,
        f"{system_prompt}\nUser: {user_input}",
        history=history,
        max_length=2048,
        temperature=0.7
    )
    print(f"Xiaozheng: {response}")
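🟣 Stage 5: Hands-on Projects (ongoing)
Recommended projects
Project | Difficulty | Technologies | Estimated time
House price prediction | ⭐⭐ | Pandas, Scikit-learn | 1 week
Image classification | ⭐⭐⭐ | PyTorch, CNN | 2 weeks
Sentiment analysis | ⭐⭐⭐ | Transformers, NLP | 2 weeks
Intelligent customer service | ⭐⭐⭐⭐ | LangChain, LLM | 3 weeks
RAG system | ⭐⭐⭐⭐⭐ | Vector database, Agent | 4 weeks
End-to-end project example: an intelligent document Q&A system
"""
Intelligent document Q&A system
│
├── data/                  # Data directory
│   ├── documents/         # Source documents
│   └── vectorstore/       # Vector store
│
├── src/                   # Source code
│   ├── config.py          # Configuration
│   ├── loader.py          # Document loading
│   ├── embeddings.py      # Embedding
│   ├── retriever.py       # Retriever
│   ├── generator.py       # Generator
│   └── api.py             # API endpoints
│
├── app.py                 # Main application
├── requirements.txt       # Dependencies
└── README.md              # Documentation
"""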
# src/config.py
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    # Read the API key from the environment rather than hard-coding it
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    EMBEDDING_MODEL: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    LLM_MODEL: str = "gpt-3.5-turbo"
    LLM_TEMPERATURE: float = 0.7
    LLM_MAX_TOKENS: int = 1000
    CHUNK_SIZE: int = 500
    CHUNK_OVERLAP: int = 50
    VECTOR_DB_PATH: str = "data/vectorstore"
    TOP_K: int = 3
    SIMILARITY_THRESHOLD: float = 0.7
    API_HOST: str = "0.0.0.0"
    API_PORT: int = 8000

# src/loader.py
# (import paths follow older LangChain releases; newer ones use langchain_community)
from typing import List
from langchain.document_loaders import (
    TextLoader, PyPDFLoader, DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class DocumentLoader:
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        # Separators are tried in order: paragraphs, lines, then Chinese and
        # Western sentence punctuation, then spaces, then single characters
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", "。", "！", "？", ".", "!", "?", " ", ""]
        )

    def load_text(self, file_path: str) -> List[Document]:
        """Load a plain-text file and split it into chunks."""
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()
        return self.text_splitter.split_documents(documents)

    def load_pdf(self, file_path: str) -> List[Document]:
        """Load a PDF file and split it into chunks."""
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        return self.text_splitter.split_documents(documents)

    def load_directory(self, directory: str, glob: str = "**/*.txt") -> List[Document]:
        """Load every matching document under a directory."""
        loader = DirectoryLoader(directory, glob=glob)
        documents = loader.load()
        return self.text_splitter.split_documents(documents)
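The `chunk_size` and `chunk_overlap` parameters control how the loader slices documents. The sketch below shows the basic sliding-window idea at the character level. It is a simplification: `RecursiveCharacterTextSplitter` also tries to cut at the listed separators, which this hypothetical `split_with_overlap` helper ignores.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size chunks whose ends overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.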
# src/api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import uvicorn

app = FastAPI(title="Intelligent Document Q&A System", version="1.0.0")

class QueryRequest(BaseModel):
    question: str
    top_k: Optional[int] = 3

class QueryResponse(BaseModel):
    answer: str
    sources: List[str]
    confidence: float

@app.get("/")
async def root():
    return {"message": "Intelligent Document Q&A System API", "version": "1.0.0", "endpoints": {"/query": "POST - question answering", "/health": "GET - health check"}}

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    try:
        # Placeholder response; plug the retriever and generator in here
        return QueryResponse(
            answer="This is a sample answer",
            sources=["Source 1", "Source 2"],
            confidence=0.95
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000, reload=True)
Hands-on code example
Example 1: A complete machine learning project
"""
Project: predict whether a customer will buy a wealth-management product
Dataset: simulated bank customer data
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# 1. Generate simulated data
np.random.seed(42)
n_samples = 5000
data = {
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.randint(3000, 50000, n_samples),
    'savings': np.random.randint(0, 1000000, n_samples),
    'debt': np.random.randint(0, 500000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'products_owned': np.random.randint(0, 10, n_samples),
    'days_since_last_purchase': np.random.randint(30, 3650, n_samples),
    'occupation': np.random.choice(['student', 'employee', 'self_employed', 'retired', 'freelancer'], n_samples),
    'marital_status': np.random.choice(['single', 'married', 'divorced'], n_samples),
    'education': np.random.choice(['high_school', 'bachelor', 'master', 'doctorate'], n_samples),
}
df = pd.DataFrame(data)

# Rule-based purchase probability with a little random noise
def calc_purchase_prob(row):
    score = 0
    if 25 <= row['age'] <= 55: score += 20
    if row['income'] > 15000: score += 20
    if row['savings'] > 100000: score += 20
    if row['credit_score'] > 650: score += 15
    if row['occupation'] in ['employee', 'self_employed']: score += 15
    return min(score + np.random.randint(-10, 10), 100) / 100

df['purchase_prob'] = df.apply(calc_purchase_prob, axis=1)
df['purchased'] = (df['purchase_prob'] > 0.5).astype(int)

# 2. Explore the data
print("=" * 50)
print("Dataset overview")
print("=" * 50)
df.info()  # prints directly; wrapping it in print() would also print None
print("\nTarget distribution:")
print(df['purchased'].value_counts())
print(f"Purchase rate: {df['purchased'].mean():.2%}")

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes[0, 0].hist(df[df['purchased'] == 0]['age'], bins=30, alpha=0.5, label='Not purchased')
axes[0, 0].hist(df[df['purchased'] == 1]['age'], bins=30, alpha=0.5, label='Purchased')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Count')
axes[0, 0].legend()
axes[0, 0].set_title('Age vs. purchase')
axes[0, 1].hist(df[df['purchased'] == 0]['income'], bins=30, alpha=0.5, label='Not purchased')
axes[0, 1].hist(df[df['purchased'] == 1]['income'], bins=30, alpha=0.5, label='Purchased')
axes[0, 1].set_xlabel('Income')
axes[0, 1].legend()
axes[0, 1].set_title('Income vs. purchase')
axes[0, 2].hist(df[df['purchased'] == 0]['credit_score'], bins=30, alpha=0.5, label='Not purchased')
axes[0, 2].hist(df[df['purchased'] == 1]['credit_score'], bins=30, alpha=0.5, label='Purchased')
axes[0, 2].set_xlabel('Credit score')
axes[0, 2].legend()
axes[0, 2].set_title('Credit score vs. purchase')
career_purchase = df.groupby('occupation')['purchased'].mean()
axes[1, 0].bar(career_purchase.index, career_purchase.values)
axes[1, 0].set_ylabel('Purchase rate')
axes[1, 0].set_title('Purchase rate by occupation')
edu_purchase = df.groupby('education')['purchased'].mean()
axes[1, 1].bar(edu_purchase.index, edu_purchase.values)
axes[1, 1].set_ylabel('Purchase rate')
axes[1, 1].set_title('Purchase rate by education')
numeric_cols = ['age', 'income', 'savings', 'debt', 'credit_score', 'products_owned', 'days_since_last_purchase', 'purchased']
correlation = df[numeric_cols].corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[1, 2])
axes[1, 2].set_title('Feature correlation heatmap')
plt.tight_layout()
plt.savefig('data_exploration.png', dpi=300)
plt.show()

# 3. Feature engineering
# One encoder per column, so no fitted mapping is overwritten by the next column
df['occupation_code'] = LabelEncoder().fit_transform(df['occupation'])
df['marital_code'] = LabelEncoder().fit_transform(df['marital_status'])
df['education_code'] = LabelEncoder().fit_transform(df['education'])
# Ratios against annual income; the +1 avoids division by zero
df['debt_to_income'] = df['debt'] / (df['income'] * 12 + 1)
df['savings_to_income'] = df['savings'] / (df['income'] * 12 + 1)
df['net_worth'] = df['savings'] - df['debt']

feature_cols = ['age', 'income', 'savings', 'debt', 'credit_score', 'products_owned', 'days_since_last_purchase', 'occupation_code', 'marital_code', 'education_code', 'debt_to_income', 'savings_to_income', 'net_worth']
X = df[feature_cols]
y = df['purchased']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4. Train and compare models
models = {
    'Logistic regression': Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression(max_iter=1000, random_state=42))]),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient boosting': GradientBoostingClassifier(random_state=42)
}
results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'accuracy': model.score(X_test, y_test),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    print(f"Accuracy: {results[name]['accuracy']:.4f}")
    print(f"AUC: {results[name]['roc_auc']:.4f}")

# 5. Evaluate the best model (highest AUC)
best_model_name = max(results, key=lambda x: results[x]['roc_auc'])
best_model = results[best_model_name]['model']
print(f"\nBest model: {best_model_name}")
print("=" * 50)
print("\nClassification report:")
print(classification_report(y_test, results[best_model_name]['predictions']))

cm = confusion_matrix(y_test, results[best_model_name]['predictions'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title(f'{best_model_name} - confusion matrix')
plt.show()

plt.figure(figsize=(10, 6))
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['probabilities'])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {result['roc_auc']:.3f})")
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve comparison')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# 6. Feature importance (tree-based models only; a Pipeline has no such attribute)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['feature'], feature_importance['importance'])
    plt.xlabel('Importance')
    plt.title('Feature importance')
    plt.tight_layout()
    plt.show()
    print("\nFeature importance ranking:")
    print(feature_importance)

print("\n" + "=" * 50)
print("Project complete!")
print("=" * 50)
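The project reports a confusion matrix and a classification report. As a small sketch of how those numbers relate, the function below derives precision, recall, and F1 for the positive class from raw labels. `binary_metrics` and the sample labels are illustrative assumptions; the matrix layout follows the Scikit-learn convention of true labels as rows and predicted labels as columns.

```python
def binary_metrics(y_true, y_pred):
    """Confusion matrix plus precision, recall, and F1 for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted buyers, how many bought
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual buyers, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "confusion": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics["confusion"])  # [[3, 1], [1, 3]]
print(metrics["precision"], metrics["recall"], metrics["f1"])  # 0.75 0.75 0.75
```

For a marketing use case like this one, recall matters when missing a likely buyer is costly, while precision matters when each contact attempt is expensive.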
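Recommended learning resources
📚 Online courses
Platform | Course | Suited for | Difficulty
Coursera | Machine Learning (Andrew Ng) | Beginners | ⭐⭐⭐
Andrew Ng's deep learning course | Deep Learning Specialization | Stages 2-3 | ⭐⭐⭐⭐
Hung-yi Lee's machine learning course | Machine Learning | Intermediate | ⭐⭐⭐⭐
Fast.ai | Practical Deep Learning for Coders | Practice-oriented | ⭐⭐⭐⭐
极客时间 (Geek Time) | Advanced Python | Stages 0-1 | ⭐⭐
📖 Recommended books
Python programming: Python Crash Course (《Python 编程:从入门到实践》), Fluent Python (《流畅的 Python》)
Data science: Python for Data Analysis (《利用 Python 进行数据分析》), Python Data Science Handbook (《Python 数据科学手册》)
Machine learning: Machine Learning by Zhou Zhihua (《机器学习》, known as the "watermelon book"), Statistical Learning Methods by Li Hang (《统计学习方法》)
Deep learning: Deep Learning (《深度学习》, known as the "flower book"), Dive into Deep Learning (《动手学深度学习》)
NLP and LLM: Speech and Language Processing (《自然语言处理综论》), 《注意力机制》 (Attention Mechanisms)
🛠️ Useful tools and libraries
Development environments: Jupyter, VS Code, PyCharm, Google Colab
Data processing: NumPy, Pandas, Polars, Apache Arrow
Machine learning: Scikit-learn, XGBoost, LightGBM, CatBoost
Deep learning: PyTorch, Transformers, LangChain, Accelerate
Deployment: FastAPI, Docker, MLflow, Gradio
🔗 Key resource links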
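Frequently asked questions
Q1: Can I learn Python + AI with no programming background?
A: Absolutely. Python is widely regarded as one of the most beginner-friendly languages. A suggested path:
Spend 2-3 weeks building a solid Python foundation
Start with simple data analysis projects
Move on to machine learning step by step
Learn by doing, and keep practicing
Q2: Can I learn AI if my math is weak?
Relevant math topics:
Calculus: derivatives, partial derivatives, gradients
Linear algebra: matrix operations, eigenvalues and eigenvectors
Probability and statistics: probability distributions, Bayes' theorem, hypothesis testing
Optimization: gradient descent, convex optimization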
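Gradient descent, listed under optimization, ties the calculus and optimization topics together: repeatedly step against the derivative until the function stops decreasing. A minimal one-variable sketch follows, with an example function, learning rate, and step count chosen purely for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function of one variable given its derivative grad."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move against the slope
    return x


# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```

Training a neural network applies the same update to millions of parameters at once, with the gradient supplied by backpropagation instead of a hand-written derivative.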
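Q3: How long until I can find a job?
Time invested | Study period | Attainable level
1 hour/day | 12-18 months | Junior AI engineer
2-3 hours/day | 8-12 months | Mid-level AI engineer
Full-time study | 4-6 months | Hands-on project ability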
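Q4: What if I don't have enough GPU capacity?
Cloud platforms: Google Colab (free GPU), Kaggle Notebooks (free), AutoDL (inexpensive)
Model compression: quantization, pruning, knowledge distillation
Training-side settings:
# Use a smaller batch size
batch_size = 16
# Mixed-precision training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
# Gradient accumulation: update the weights once every 4 mini-batches
accumulation_steps = 4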
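Gradient accumulation with `accumulation_steps = 4` lets a small batch size imitate a larger one: gradients from several mini-batches are summed before a single optimizer step. The sketch below checks that idea numerically for a one-parameter linear model in plain Python; the model, data, and `grad_mse` helper are assumptions made for illustration, and no PyTorch is involved.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
accumulation_steps = 4
micro = len(xs) // accumulation_steps  # 2 samples per mini-batch

# Gradient computed on the full batch of 8 samples
full_grad = grad_mse(w, xs, ys)

# Same gradient, accumulated from 4 mini-batches of 2 samples each.
# Dividing by accumulation_steps mirrors scaling the loss before backward().
accumulated = 0.0
for i in range(accumulation_steps):
    batch_x = xs[i * micro:(i + 1) * micro]
    batch_y = ys[i * micro:(i + 1) * micro]
    accumulated += grad_mse(w, batch_x, batch_y) / accumulation_steps

print(full_grad, accumulated)
print(abs(full_grad - accumulated) < 1e-9)  # True
```

The equality holds because the mini-batches have equal size, so the mean of their mean gradients is the full-batch mean. Only one mini-batch needs to be in memory at a time, which is where the GPU memory saving comes from.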
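Q5: How do I keep my skills up to date?
Follow the frontier: arXiv.org (preprints), Hugging Face (new models), Twitter/X (updates from leading practitioners)
Stay practice-oriented: build one small project a month, enter Kaggle competitions, contribute to open-source projects
Engage with the community: join technical Discord or WeChat groups, attend local meetups, write a technical blog
Learning timeline at a glance
2025-01-01 ~ 2025-02-01: Python syntax, data structures, OOP
2025-03-01 ~ 2025-04-01: NumPy/Pandas, data visualization, hands-on project 1
2025-05-01 ~ 2025-06-01: ML theory basics, Scikit-learn, hands-on project 2
2025-07-01 ~ 2025-08-01: PyTorch basics, CNN/RNN/Transformer, hands-on project 3
2025-09-01 ~ 2025-10-01: NLP basics, Transformers, RAG and Agents, hands-on project 4
Summary
Key takeaways
Progress step by step: do not rush; work through the stages in order
Let projects drive you: combine theory with practice and build a lot
Keep learning: AI moves fast, so stay curious
Join the community: learn alongside others
Review regularly: consolidate what you learn into your own body of knowledge
Study tips
✅ Write code for at least 30 minutes every day
✅ Learn one new concept every week
✅ Finish one small project every month
✅ Do a technical review every quarter
✅ Stay curious and keep exploring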
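Appendix
A. Setting up the Python environment
# Create and activate a dedicated environment
conda create -n ai_env python=3.10
conda activate ai_env
# Data science
pip install numpy pandas matplotlib seaborn
# Machine learning
pip install scikit-learn xgboost lightgbm
# Deep learning
pip install torch torchvision torchaudio
# NLP and LLM
pip install transformers langchain
# JupyterLab (the package name is jupyterlab, one word)
pip install jupyterlab
jupyter lab
B. Common command reference
# Jupyter
jupyter notebook
jupyter lab
jupyter nbconvert
# Git
git clone <url>
git add .
git commit -m "msg"
git push
# Conda
conda env list
conda install <pkg>
conda env remove -n <env>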
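C. Learning checklists
Stage 0 checklist
Know basic Python syntax
Understand data types and data structures
Can write functions and classes
Familiar with commonly used modules
Stage 1 checklist
Can perform array operations with NumPy
Fluent in processing data with Pandas
Can draw charts with Matplotlib
Completed at least 3 data analysis projects
Stage 2 checklist
Understand the principles of common ML algorithms
Can build models with Scikit-learn
Can do feature engineering
Completed at least 2 ML projects
Stage 3 checklist
Know the PyTorch basics
Can build neural networks
Understand CNN/RNN/Transformer
Completed at least 2 DL projects
Stage 4 checklist
Fluent with Transformers
Can call pretrained models
Understand how RAG works
Completed at least 1 NLP project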