The 2026 Python+AI Learning Roadmap: From Zero to Real-World Projects
The Python+AI stack spans core syntax, data processing, machine learning, and deep-learning frameworks. This article lays out a learning path from scratch, covering hands-on NumPy/Pandas work, Scikit-learn modeling, building neural networks in PyTorch, and LLM application development. Through staged practice projects and resource recommendations, it helps developers systematically build AI engineering skills and work through common obstacles such as environment setup, math prerequisites, and GPU resources. It is aimed at engineers looking to move into AI or level up within it.
Python has become the dominant programming language in artificial intelligence. According to the Stack Overflow 2024 Developer Survey, Python's usage rate in AI/ML work exceeds 85%.
Python's Advantages in AI

| Advantage | Why it matters |
|---|---|
| 🐍 Concise syntax | Quick to pick up; focus on the algorithm, not language trivia |
| 📦 Rich ecosystem | Mature libraries such as NumPy, Pandas, and PyTorch |
| 👥 Active community | A wealth of tutorials, open-source projects, and Q&A |
| 🔧 Polished tooling | Excellent environments such as Jupyter and Colab |
| 🚀 Easy deployment | Spin up AI services quickly with Flask/FastAPI |
Where AI Work Concentrates

Knowing the rough share of each AI subfield helps you plan where to focus:

| Subfield | Share |
|---|---|
| Machine learning | 35% |
| Deep learning | 30% |
| Natural language processing | 15% |
| Computer vision | 12% |
| Reinforcement learning | 5% |
| Other | 3% |
The Complete Learning Path

The route from zero basics to AI engineer, step by step:

- Start: assess whether you have programming experience. If not, begin at Stage 0.
- Stage 0 (Python basics): master core syntax and programming fundamentals.
- Stage 1 (Data science): learn data processing, analysis, and visualization.
- Stage 2 (Machine learning): understand ML principles and master Scikit-learn.
- Choose a direction (Stages 3-4):
  - NLP: Transformers, LLM applications.
  - CV: CNN architectures, object detection.
  - General: PyTorch fundamentals, neural-network optimization.
- Stage 5 (Hands-on projects): end-to-end projects, model deployment, performance tuning.
- Goal: become an AI engineer.
Stage-by-Stage Guide

🟢 Stage 0: Python Basics (2-4 weeks)

Goal: master core Python syntax and programming fundamentals

Core topics

- Data types: int, float, str, list, dict, tuple, set
- Control flow: if/else, for/while loops
- Functions: definitions, lambda expressions, decorator basics
- Object orientation: classes and objects, inheritance and polymorphism
- File handling: exception handling, context managers
Essential code examples

```python
# Building a list of squares: explicit loop vs. comprehension
squares = []
for i in range(10):
    squares.append(i ** 2)

squares = [i ** 2 for i in range(10)]  # same result, one line
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]

# Word counting: dict comprehension vs. defaultdict
text = "hello world hello python"
counts = {word: text.split().count(word) for word in set(text.split())}
print(counts)

from collections import defaultdict
counts = defaultdict(int)
for word in text.split():
    counts[word] += 1

# Reading a file safely with a context manager
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Writing your own context manager
from contextlib import contextmanager
import time

@contextmanager
def timer():
    start = time.time()
    yield
    print(f"elapsed: {time.time() - start:.2f}s")

with timer():
    sum(range(1000000))
```
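The topic list above mentions decorator basics but shows no decorator of its own. Here is a minimal sketch of a hand-rolled timing decorator; the `timed` and `slow_sum` names are illustrative, not from the original:

```python
import functools
import time

def timed(func):
    """Decorator that reports how long the wrapped function takes."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

print(slow_sum(1_000_000))
```

Note how `functools.wraps` keeps `slow_sum.__name__` intact, which matters for debugging and introspection.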
🔵 Stage 1: Data Science Fundamentals (4-6 weeks)

Core skill tree

- NumPy: array creation, array math, broadcasting, linear algebra
- Pandas: Series operations, DataFrame operations, data cleaning, grouping, merging
- Visualization: Matplotlib, Seaborn, Plotly
Hands-on NumPy

```python
import numpy as np

# Array creation
arr = np.array([[1, 2, 3], [4, 5, 6]])
zeros = np.zeros((3, 4))
random = np.random.randn(3, 3)

# Element-wise and matrix operations
print(arr * 2)
print(arr @ arr.T)          # matrix product; equivalent to np.dot
print(np.dot(arr, arr.T))

# Broadcasting: the (3,) vector b is stretched across each row of a
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(a + b)

# Aggregation along an axis
print(np.mean(arr, axis=1))
print(np.argmax(arr, axis=1))
```
Hands-on Pandas

```python
import pandas as pd

data = {
    'name': ['张三', '李四', '王五', '赵六'],
    'age': [25, 30, 35, 28],
    'city': ['北京', '上海', '深圳', '杭州'],
    'salary': [15000, 20000, 25000, 18000]
}
df = pd.DataFrame(data)

# Boolean filtering
high_salary = df[df['salary'] > 18000]
beijing = df[df['city'] == '北京']

# Group, aggregate, sort
city_stats = df.groupby('city').agg({'salary': ['mean', 'max', 'count']})
df_sorted = df.sort_values('salary', ascending=False)

# Left join on a shared key
df2 = pd.DataFrame({'name': ['张三', '李四'], 'department': ['技术', '产品']})
merged = pd.merge(df, df2, on='name', how='left')
print(city_stats)
```
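Data cleaning appears in the skill tree but not in the code above. A minimal sketch on a hypothetical messy frame (the `raw` frame and its contents are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical messy frame: a duplicate row, a missing name, a missing salary
raw = pd.DataFrame({
    'name': ['张三', '李四', '李四', None],
    'salary': [15000, np.nan, np.nan, 18000],
})

cleaned = (
    raw.drop_duplicates()                      # remove the repeated 李四 row
       .dropna(subset=['name'])                # drop rows with no name
       .assign(salary=lambda d: d['salary'].fillna(d['salary'].mean()))
)
print(cleaned)
```

Chaining `drop_duplicates`, `dropna`, and `fillna` this way keeps each cleaning step visible and the original frame untouched.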
Visualization example

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Enable CJK glyphs (only needed if you keep Chinese labels)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Bar chart: salary by person
axes[0, 0].bar(df['name'], df['salary'])
axes[0, 0].set_title('Salary comparison')
axes[0, 0].set_xlabel('Name')
axes[0, 0].set_ylabel('Salary')

# Scatter: age vs. salary
axes[0, 1].scatter(df['age'], df['salary'], s=100, alpha=0.6)
axes[0, 1].set_title('Age vs. salary')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Salary')

# Pie chart: city distribution
city_counts = df['city'].value_counts()
axes[1, 0].pie(city_counts, labels=city_counts.index, autopct='%1.1f%%')
axes[1, 0].set_title('City distribution')

# Box plot: salary spread
axes[1, 1].boxplot(df['salary'])
axes[1, 1].set_title('Salary distribution')

plt.tight_layout()
plt.savefig('visualization.png', dpi=300)
plt.show()
```
🟡 Stage 2: Machine Learning (6-8 weeks)

Goal: understand ML principles and master Scikit-learn in practice

ML algorithm taxonomy

- Supervised learning: regression (linear regression, decision-tree regression, random-forest regression); classification (logistic regression, SVM, decision trees, random forests, XGBoost, neural networks)
- Unsupervised learning: clustering (K-Means, DBSCAN, hierarchical clustering); dimensionality reduction (PCA, t-SNE, UMAP)
- Reinforcement learning: policy gradients, Q-Learning
Classic algorithms in practice

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Simulated housing data: three numeric features plus noise
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 3)
X[:, 0] = X[:, 0] * 50 + 100
X[:, 1] = np.abs(X[:, 1]) * 2 + 1
X[:, 2] = np.abs(X[:, 2]) * 10 + 1
y = (X[:, 0] * 1000 + X[:, 1] * 50000 - X[:, 2] * 2000
     + np.random.randn(n_samples) * 50000)

# Split, then standardize — fit the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")

# Predicted vs. actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('House-price prediction: actual vs. predicted')
plt.grid(True, alpha=0.3)
plt.show()
```
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15,
    n_redundant=5, n_classes=3, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier()
}

# 5-fold cross-validation for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = {'mean': scores.mean(), 'std': scores.std()}
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Bar chart with error bars
plt.figure(figsize=(12, 6))
names = list(results.keys())
means = [results[name]['mean'] for name in names]
stds = [results[name]['std'] for name in names]
plt.bar(names, means, yerr=stds, alpha=0.8, capsize=5)
plt.ylabel('Accuracy')
plt.title('Classifier comparison (5-fold cross-validation)')
plt.ylim(0.7, 1.0)
plt.grid(axis='y', alpha=0.3)
plt.xticks(rotation=15)
plt.show()
```
```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=42)

# K-Means with a known cluster count
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)
kmeans_silhouette = silhouette_score(X, kmeans_labels)

# DBSCAN discovers the cluster count itself; label -1 marks noise points
dbscan = DBSCAN(eps=1.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(f"K-Means: 4 clusters, silhouette={kmeans_silhouette:.3f}")
print(f"DBSCAN: found {n_clusters} clusters")

# Elbow method: inertia vs. K (note the separate loop variable `km`,
# so the 4-cluster `kmeans` fit above is not overwritten)
inertias = []
K_range = range(2, 11)
for K in K_range:
    km = KMeans(n_clusters=K, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow method for picking K')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centroids')
plt.title('K-Means clustering result')
plt.legend()
plt.tight_layout()
plt.show()
```
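Dimensionality reduction (PCA, t-SNE, UMAP) appears in the taxonomy above but not in the code. A minimal PCA sketch on synthetic data; the shapes and cluster count here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# 10-dimensional data with 4 latent clusters
X, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=42)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (500, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```

`explained_variance_ratio_` tells you how much information the projection retains, which is the usual sanity check before plotting or feeding the reduced data into a model.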
🟠 Stage 3: Deep Learning (8-12 weeks)

Choosing a framework

- PyTorch: the research favorite; dynamic graphs, Pythonic style
- TensorFlow: industrial deployment; TF-Serving, TFLite for mobile
- JAX: functional programming, automatic differentiation, high-performance computing

Hands-on PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# A small fully connected network with dropout
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size // 2)
        self.layer3 = nn.Linear(hidden_size // 2, num_classes)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer3(out)
        return out

# Hyperparameters
input_size = 784
hidden_size = 256
num_classes = 10
num_epochs = 10
batch_size = 100
learning_rate = 0.001

model = NeuralNetwork(input_size, hidden_size, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Random stand-in data (swap in MNIST or your own dataset)
X_train = torch.randn(1000, input_size).to(device)
y_train = torch.randint(0, num_classes, (1000,)).to(device)
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Training loop: forward, loss, backward, step
train_losses = []
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for images, labels in train_loader:
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')

plt.figure(figsize=(10, 5))
plt.plot(train_losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training loss curve')
plt.grid(True, alpha=0.3)
plt.show()
```
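The loop above only trains; the matching evaluation pass is the natural next step. A self-contained sketch — the model and validation tensors here are stand-ins mirroring the shapes used above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and held-out data with the same shapes as the training example
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
X_val = torch.randn(200, 784)
y_val = torch.randint(0, 10, (200,))
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=100)

model.eval()                      # switch off dropout / batch-norm updates
correct = total = 0
with torch.no_grad():             # no gradient tracking during evaluation
    for images, labels in val_loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"accuracy: {correct / total:.2%}")
```

The `model.eval()` / `torch.no_grad()` pair is easy to forget but matters: the first changes layer behavior, the second saves memory and compute.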
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small CNN for 28×28 single-channel images (e.g. MNIST)
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 256)  # 28 → 14 → 7 after two poolings
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool1(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool2(x)
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = CNN()
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
```
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project, then split into heads: (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        x, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge the heads back and apply the output projection
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(x)
        return output, attention_weights

d_model = 512
num_heads = 8
seq_length = 10
batch_size = 4
x = torch.randn(batch_size, seq_length, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, attention = mha(x, x, x)  # self-attention: Q = K = V
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention.shape}")
```
🔴 Stage 4: NLP & LLM Applications (6-8 weeks)

Goal: master modern NLP techniques and become fluent with large language models

NLP milestones

- 2017: Transformer
- 2018: BERT/GPT
- 2019: GPT-2
- 2020: GPT-3
- 2022: ChatGPT
- 2023: GPT-4/Llama2
- 2024: multimodal large models
Hands-on Transformers

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch

# Off-the-shelf sentiment analysis with a fine-tuned DistilBERT
sentiment_pipeline = pipeline("sentiment-analysis",
                              model="distilbert-base-uncased-finetuned-sst-2-english")
texts = [
    "I love this product! It's amazing!",
    "This is the worst experience ever.",
    "It's okay, nothing special."
]
results = sentiment_pipeline(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, confidence: {result['score']:.4f}\n")

# Chinese BERT with a 3-class head; note the classification head is
# randomly initialized here and would need fine-tuning before real use
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)

text = "这家餐厅的菜品味道很好,服务也很周到!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions).item()

labels = ["negative", "neutral", "positive"]
print(f"Predicted class: {labels[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class]:.4f}")
```
```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# 1. Load the knowledge base and split it into overlapping chunks
loader = TextLoader('knowledge_base.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50, length_function=len)
splits = text_splitter.split_documents(documents)

# 2. Embed the chunks and index them in a FAISS vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(
    search_type="similarity", search_kwargs={"k": 3})

# 3. Prompt that grounds the LLM in the retrieved context
prompt_template = """Answer the question using the context below. If you do not know the answer, say so — do not make one up.
Context: {context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"])

# 4. Wire retriever and LLM into a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0), chain_type="stuff",
    retriever=retriever, return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT})

query = "How do I apply for a social security card?"
result = qa_chain({"query": query})
print(f"Question: {query}")
print(f"Answer: {result['result']}")
print("\nSources:")
for doc in result['source_documents']:
    print(f"- {doc.page_content[:100]}...")
```
```python
from transformers import AutoTokenizer, AutoModel

# Load ChatGLM3-6B in half precision on the GPU
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = model.eval()

history = []
print("Assistant: 你好!我是智能助手小政,有什么可以帮助您的吗?")

# Simple REPL-style chat loop with a fixed system prompt
while True:
    user_input = input("\nUser: ")
    if user_input.lower() in ['退出', 'exit', 'quit']:
        break
    system_prompt = "你是一个政务服务大厅的智能引导员,名叫'小政'。"
    # model.chat already returns the updated history, so there is no
    # need to append the turn manually
    response, history = model.chat(
        tokenizer, f"{system_prompt}\n用户:{user_input}", history=history,
        max_length=2048, temperature=0.7)
    print(f"小政: {response}")
```
🟣 Stage 5: Hands-On Projects (ongoing)

Suggested projects

| Project | Difficulty | Tech involved | Time estimate |
|---|---|---|---|
| House-price prediction | ⭐⭐ | Pandas, Scikit-learn | 1 week |
| Image classification | ⭐⭐⭐ | PyTorch, CNN | 2 weeks |
| Sentiment analysis | ⭐⭐⭐ | Transformers, NLP | 2 weeks |
| Smart customer service | ⭐⭐⭐⭐ | LangChain, LLM | 3 weeks |
| RAG system | ⭐⭐⭐⭐⭐ | Vector databases, agents | 4 weeks |
End-to-end project example: a document Q&A system

```
document-qa-system
│
├── data/                  # data directory
│   ├── documents/         # raw documents
│   └── vectorstore/       # vector store
│
├── src/                   # source code
│   ├── config.py          # configuration
│   ├── loader.py          # document loading
│   ├── embeddings.py      # vectorization
│   ├── retriever.py       # retriever
│   ├── generator.py       # generator
│   └── api.py             # API layer
│
├── app.py                 # main application
├── requirements.txt       # dependencies
└── README.md              # docs
```
```python
# src/config.py
import os
from dataclasses import dataclass

@dataclass
class Config:
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    EMBEDDING_MODEL: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    LLM_MODEL: str = "gpt-3.5-turbo"
    LLM_TEMPERATURE: float = 0.7
    LLM_MAX_TOKENS: int = 1000
    CHUNK_SIZE: int = 500
    CHUNK_OVERLAP: int = 50
    VECTOR_DB_PATH: str = "data/vectorstore"
    TOP_K: int = 3
    SIMILARITY_THRESHOLD: float = 0.7
    API_HOST: str = "0.0.0.0"
    API_PORT: int = 8000
```
```python
# src/loader.py
from typing import List
from langchain.document_loaders import (
    TextLoader, PyPDFLoader, DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class DocumentLoader:
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        # Splitter falls back through CJK and Latin sentence boundaries
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", "。", "!", "?", ".", "!", "?", " ", ""])

    def load_text(self, file_path: str) -> List[Document]:
        """Load a plain-text file and split it into chunks."""
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()
        return self.text_splitter.split_documents(documents)

    def load_pdf(self, file_path: str) -> List[Document]:
        """Load a PDF file."""
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        return self.text_splitter.split_documents(documents)

    def load_directory(self, directory: str, glob: str = "**/*.txt") -> List[Document]:
        """Load every matching document under a directory."""
        loader = DirectoryLoader(directory, glob=glob)
        documents = loader.load()
        return self.text_splitter.split_documents(documents)
```
```python
# src/api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import uvicorn

app = FastAPI(title="Document Q&A System", version="1.0.0")

class QueryRequest(BaseModel):
    question: str
    top_k: Optional[int] = 3

class QueryResponse(BaseModel):
    answer: str
    sources: List[str]
    confidence: float

@app.get("/")
async def root():
    return {"message": "Document Q&A System API", "version": "1.0.0",
            "endpoints": {"/query": "POST - ask a question", "/health": "GET - health check"}}

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    try:
        # Placeholder response; plug the retriever + generator in here
        return QueryResponse(
            answer="This is a sample answer",
            sources=["source 1", "source 2"],
            confidence=0.95)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000, reload=True)
```
More Worked Examples

Example 1: a complete machine-learning project
"""
项目:预测客户是否会购买理财产品
数据集:模拟银行客户数据
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
np.random.seed(42)
n_samples = 5000
data = {
'年龄': np.random.randint(18, 70, n_samples),
'收入': np.random.randint(3000, 50000, n_samples),
'存款': np.random.randint(0, 1000000, n_samples),
'债务': np.random.randint(0, 500000, n_samples),
'信用评分': np.random.randint(300, 850, n_samples),
'已购买产品数': np.random.randint(0, 10, n_samples),
'上次购买天数': np.random.randint(30, 3650, n_samples),
'职业': np.random.choice(['学生', '上班族', '个体户', '退休', '自由职业'], n_samples),
'婚姻状况': np.random.choice(['单身', '已婚', '离异'], n_samples),
'学历': np.random.choice(['高中', '本科', '硕士', '博士'], n_samples),
}
df = pd.DataFrame(data)
def calc_purchase_prob(row):
score = 0
if 25 <= row['年龄'] <= 55: score += 20
if row['收入'] > 15000: score += 20
if row['存款'] > 100000: score += 20
if row['信用评分'] > 650: score += 15
if row['职业'] in ['上班族', '个体户']: score += 15
return min(score + np.random.randint(-10, 10), 100) / 100
df['购买概率'] = df.apply(calc_purchase_prob, axis=1)
df['是否购买'] = (df['购买概率'] > 0.5).astype(int)
print("=" * 50)
print("数据集基本信息")
print("=" * 50)
print(df.info())
print("\n目标变量分布:")
print(df['是否购买'].value_counts())
print(f"购买率:{df['是否购买'].mean():.2%}")
```python
# Exploratory plots: how each feature relates to the purchase label
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

axes[0, 0].hist(df[df['是否购买']==0]['年龄'], bins=30, alpha=0.5, label='No purchase')
axes[0, 0].hist(df[df['是否购买']==1]['年龄'], bins=30, alpha=0.5, label='Purchase')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Count')
axes[0, 0].legend()
axes[0, 0].set_title('Age vs. purchase')

axes[0, 1].hist(df[df['是否购买']==0]['收入'], bins=30, alpha=0.5, label='No purchase')
axes[0, 1].hist(df[df['是否购买']==1]['收入'], bins=30, alpha=0.5, label='Purchase')
axes[0, 1].set_xlabel('Income')
axes[0, 1].legend()
axes[0, 1].set_title('Income vs. purchase')

axes[0, 2].hist(df[df['是否购买']==0]['信用评分'], bins=30, alpha=0.5, label='No purchase')
axes[0, 2].hist(df[df['是否购买']==1]['信用评分'], bins=30, alpha=0.5, label='Purchase')
axes[0, 2].set_xlabel('Credit score')
axes[0, 2].legend()
axes[0, 2].set_title('Credit score vs. purchase')

career_purchase = df.groupby('职业')['是否购买'].mean()
axes[1, 0].bar(career_purchase.index, career_purchase.values)
axes[1, 0].set_ylabel('Purchase rate')
axes[1, 0].set_title('Purchase rate by occupation')

edu_purchase = df.groupby('学历')['是否购买'].mean()
axes[1, 1].bar(edu_purchase.index, edu_purchase.values)
axes[1, 1].set_ylabel('Purchase rate')
axes[1, 1].set_title('Purchase rate by education')

numeric_cols = ['年龄', '收入', '存款', '债务', '信用评分', '已购买产品数', '上次购买天数', '是否购买']
correlation = df[numeric_cols].corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[1, 2])
axes[1, 2].set_title('Feature correlation heatmap')

plt.tight_layout()
plt.savefig('data_exploration.png', dpi=300)
plt.show()
```
```python
# Encode categoricals and engineer a few ratio features
le = LabelEncoder()
df['职业编码'] = le.fit_transform(df['职业'])
df['婚姻编码'] = le.fit_transform(df['婚姻状况'])
df['学历编码'] = le.fit_transform(df['学历'])
df['债务收入比'] = df['债务'] / (df['收入'] * 12 + 1)  # debt-to-annual-income
df['存款收入比'] = df['存款'] / (df['收入'] * 12 + 1)  # savings-to-annual-income
df['净资产'] = df['存款'] - df['债务']                  # net worth

feature_cols = ['年龄', '收入', '存款', '债务', '信用评分', '已购买产品数', '上次购买天数',
                '职业编码', '婚姻编码', '学历编码', '债务收入比', '存款收入比', '净资产']
X = df[feature_cols]
y = df['是否购买']

# Stratified split keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    'Logistic Regression': Pipeline([('scaler', StandardScaler()),
                                     ('model', LogisticRegression(max_iter=1000, random_state=42))]),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'accuracy': model.score(X_test, y_test),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    print(f"Accuracy: {results[name]['accuracy']:.4f}")
    print(f"AUC: {results[name]['roc_auc']:.4f}")

# Pick the model with the best AUC
best_model_name = max(results, key=lambda x: results[x]['roc_auc'])
best_model = results[best_model_name]['model']
print(f"\nBest model: {best_model_name}")
print("=" * 50)
print("\nClassification report:")
print(classification_report(y_test, results[best_model_name]['predictions']))

# Confusion matrix
cm = confusion_matrix(y_test, results[best_model_name]['predictions'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title(f'{best_model_name} - confusion matrix')
plt.show()

# ROC curves for all three models
plt.figure(figsize=(10, 6))
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['probabilities'])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {result['roc_auc']:.3f})")
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve comparison')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Feature importance (tree-based models only)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['feature'], feature_importance['importance'])
    plt.xlabel('Importance')
    plt.title('Feature importance')
    plt.tight_layout()
    plt.show()
    print("\nFeature importance ranking:")
    print(feature_importance)

print("\n" + "=" * 50)
print("Project complete!")
print("=" * 50)
```
Recommended Learning Resources

📚 Online courses

| Platform | Course | Stage | Difficulty |
|---|---|---|---|
| Coursera | Machine Learning (Andrew Ng) | Beginners | ⭐⭐⭐ |
| DeepLearning.AI | Deep Learning Specialization | Stages 2-3 | ⭐⭐⭐⭐ |
| Hung-yi Lee (NTU) | Machine Learning | Intermediate | ⭐⭐⭐⭐ |
| Fast.ai | Practical Deep Learning for Coders | Practice-oriented | ⭐⭐⭐⭐ |
| GeekTime (极客时间) | Advanced Python | Stages 0-1 | ⭐⭐ |
📖 Books

- Python: Python Crash Course; Fluent Python
- Data science: Python for Data Analysis; Python Data Science Handbook
- Machine learning: Machine Learning by Zhou Zhihua (the "Watermelon Book"); Statistical Learning Methods by Li Hang
- Deep learning: Deep Learning (the "Flower Book"); Dive into Deep Learning
- NLP and LLMs: Speech and Language Processing; 《注意力机制》
🛠️ Tools and libraries

- Dev environments: Jupyter, VS Code, PyCharm, Google Colab
- Data processing: NumPy, Pandas, Polars, Apache Arrow
- Machine learning: Scikit-learn, XGBoost, LightGBM, CatBoost
- Deep learning: PyTorch, Transformers, LangChain, Accelerate
- Deployment: FastAPI, Docker, MLflow, Gradio
FAQ

Q1: Can I learn Python+AI with no programming background?

A: Absolutely. Python is widely regarded as the most beginner-friendly language. A suggested path:
- Spend 2-3 weeks building a solid Python foundation
- Start with simple data-analysis projects
- Move gradually into machine learning
- Learn by doing and keep practicing
Q2: Can I learn AI with a weak math background?

A: Yes — you can pick the math up alongside the code. The areas that matter most:
- Calculus: derivatives/gradients, partial derivatives
- Linear algebra: matrix operations, eigenvalues/eigenvectors
- Probability and statistics: probability distributions, Bayes' theorem, hypothesis testing
- Optimization: gradient descent, convex optimization
Q3: How long until I'm job-ready?

| Time invested | Duration | Expected level |
|---|---|---|
| 1 hour/day | 12-18 months | Junior AI engineer |
| 2-3 hours/day | 8-12 months | Mid-level AI engineer |
| Full-time study | 4-6 months | Job-ready practical skills |
Q4: What if I don't have enough GPU?

- Cloud platforms: Google Colab (free GPU), Kaggle Notebooks (free), AutoDL (cheap)
- Model compression: quantization, pruning, knowledge distillation
- Memory-saving training tricks:

```python
batch_size = 16                 # shrink the batch size

# Mixed-precision training roughly halves activation memory
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

# Gradient accumulation: simulate a 4x larger effective batch
accumulation_steps = 4
```
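Of the compression techniques just listed, dynamic quantization is the easiest to try in PyTorch: weights are stored as int8 and activations are quantized on the fly at inference time. A minimal sketch on a stand-in model (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in float32 model
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Quantize only the Linear layers to int8 weights
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
print(quantized(x).shape)  # torch.Size([1, 10])
```

Dynamic quantization runs on CPU and needs no calibration data, which makes it a good first experiment before trying pruning or distillation.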
Q5: How do I keep my skills current?

- Follow the frontier: arXiv.org (preprints), Hugging Face (new models), Twitter/X (researcher updates)
- Stay hands-on: build a small project every month, enter Kaggle competitions, contribute to open source
- Join communities: technical Discord/WeChat groups, local meetups, write a technical blog
Timeline at a Glance

- 2025-01~02: Python syntax, data structures, OOP
- 2025-03~04: NumPy/Pandas, data visualization, hands-on project 1
- 2025-05~06: ML theory, Scikit-learn, hands-on project 2
- 2025-07~08: PyTorch basics, CNN/RNN/Transformer, hands-on project 3
- 2025-09~10: NLP basics, Transformers, RAG and agents, hands-on project 4
Summary

Key takeaways

- Go step by step: don't rush; learn stage by stage
- Project-driven: pair theory with practice and build often
- Keep learning: AI moves fast, so stay motivated
- Join the community: learn with and from others
- Review regularly: consolidate experience into your own knowledge system

Study habits

- ✅ Write code for at least 30 minutes a day
- ✅ Learn one new concept a week
- ✅ Finish a small project every month
- ✅ Do a technical retrospective every quarter
- ✅ Stay curious and keep exploring
Appendix

A. Setting up the Python environment

```shell
# Create and activate an isolated conda environment
conda create -n ai_env python=3.10
conda activate ai_env

# Core libraries
pip install numpy pandas matplotlib seaborn
pip install scikit-learn xgboost lightgbm
pip install torch torchvision torchaudio
pip install transformers langchain

# JupyterLab (note: one package name, not "jupyter lab")
pip install jupyterlab
jupyter lab
```
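After installation, a quick sanity check confirms the core libraries import and whether a GPU is visible (it prints whatever versions happen to be installed):

```python
import numpy
import pandas
import sklearn
import torch

print("NumPy:", numpy.__version__)
print("Pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```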
B. Command cheat sheet

```shell
# Jupyter
jupyter notebook
jupyter lab
jupyter nbconvert

# Git
git clone <url>
git add .
git commit -m "msg"
git push

# Conda
conda env list
conda install <pkg>
conda env remove -n <env>
```
C. Learning checklists

Stage 0 checklist
- Master basic Python syntax
- Understand data types and structures
- Write functions and classes
- Know the common standard-library modules

Stage 1 checklist
- Perform array math with NumPy
- Process data fluently with Pandas
- Plot charts with Matplotlib
- Complete at least 3 data-analysis projects

Stage 2 checklist
- Understand how common ML algorithms work
- Build models with Scikit-learn
- Do feature engineering
- Complete at least 2 ML projects

Stage 3 checklist
- Master PyTorch basics
- Build neural networks
- Understand CNN/RNN/Transformer
- Complete at least 2 DL projects

Stage 4 checklist
- Use Transformers fluently
- Call pretrained models
- Understand how RAG works
- Complete at least 1 NLP project