Python+AI 学习路线:从基础到实战
介绍 Python 结合人工智能的完整学习路线,涵盖 Python 基础、数据科学、机器学习、深度学习及 NLP 大模型应用。内容包含分阶段学习指南、核心代码示例(如 Scikit-learn、PyTorch、Transformers)及实战项目推荐,并提供环境搭建与常用命令速查,旨在帮助读者从零掌握 AI 开发技能。

Python 已成为人工智能领域最主流的编程语言,根据 Stack Overflow 2024 年开发者调查,Python 在 AI/ML 领域的使用率超过85%。

| 优势 | 说明 |
|---|---|
| 🐍 语法简洁 | 上手快,专注算法本身而非语法细节 |
| 📦 生态丰富 | NumPy、Pandas、PyTorch 等成熟库 |
| 👥 社区活跃 | 海量教程、开源项目和问题解答 |
| 🔧 工具完善 | Jupyter、Colab 等优秀开发环境 |
| 🚀 部署便捷 | Flask/FastAPI 快速构建 AI 服务 |
了解 AI 各领域的占比,帮助你更好地规划学习重点:
- 机器学习:35%
- 深度学习:30%
- 自然语言处理:15%
- 计算机视觉:12%
- 强化学习:5%
- 其他:3%
下图展示了从零基础到 AI 专家的完整学习路线:
graph TD
A[开始学习 Python+AI] --> B{有编程基础?}
B -- 否 --> C[阶段 0: Python 基础]
B -- 是 --> D[阶段 1: 数据科学基础]
C --> D
D --> E[阶段 2: 机器学习]
E --> F{选择方向}
F --> G[阶段 3a: NLP]
F --> H[阶段 3b: 计算机视觉]
F --> I[阶段 3c: 深度学习通用]
G --> J[阶段 4: 实战项目]
H --> J
I --> J
J --> K[🎉 AI 工程师]
学习目标:掌握 Python 核心语法和编程思维
示例 1:列表推导式
# 传统方式
squares = []
for i in range(10):
    squares.append(i ** 2)
# Pythonic 方式
squares = [i ** 2 for i in range(10)]
# 带条件的列表推导式
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]
print(even_squares)
# [0, 4, 16, 36, 64]
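数据量很大时,可以把列表推导式换成生成器表达式:按需惰性求值,不会一次性把整个列表放进内存(补充示意):

```python
# 生成器表达式:用圆括号代替方括号,逐个产生元素而不构造完整列表
squares_gen = (i ** 2 for i in range(10))
print(sum(squares_gen))  # 285
```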
示例 2:字典操作
# 字典推导式
word_count = "hello world hello python"
counts = {word: word_count.split().count(word) for word in set(word_count.split())}
print(counts)
# {'hello': 2, 'world': 1, 'python': 1}
# defaultdict 使用
from collections import defaultdict
counts = defaultdict(int)
for word in word_count.split():
    counts[word] += 1
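统计词频还有更简洁的标准库写法:collections.Counter 本身就是为计数设计的(补充示意):

```python
from collections import Counter

word_count = "hello world hello python"
counts = Counter(word_count.split())  # 直接对可迭代对象计数
print(counts.most_common(2))  # [('hello', 2), ('world', 1)]
```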
示例 3:上下文管理器
# 正确的文件操作方式
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()
# 自动关闭文件,即使发生异常
# 自定义上下文管理器
from contextlib import contextmanager
@contextmanager
def timer():
    import time
    start = time.time()
    yield
    print(f"耗时:{time.time() - start:.2f}秒")

with timer():
    sum(range(1000000))
学习目标:掌握数据处理、分析和可视化技能
import numpy as np
# 数组创建
arr = np.array([[1, 2, 3], [4, 5, 6]])
zeros = np.zeros((3, 4))
random = np.random.randn(3, 3)
# 数组运算
print(arr * 2) # 元素级乘法
print(arr @ arr.T) # 矩阵乘法
print(np.dot(arr, arr.T)) # 矩阵乘法
# 广播机制
a = np.array([[1, 2, 3], [4, 5, 6]]) # (2, 3)
b = np.array([10, 20, 30]) # (3,)
print(a + b) # b 广播到 (2, 3)
# 实用函数
print(np.mean(arr, axis=1)) # 按行求均值
print(np.argmax(arr, axis=1)) # 按行找最大值索引
import pandas as pd
# 创建 DataFrame
data = {
'name': ['张三', '李四', '王五', '赵六'],
'age': [25, 30, 35, 28],
'city': ['北京', '上海', '深圳', '杭州'],
'salary': [15000, 20000, 25000, 18000]
}
df = pd.DataFrame(data)
# 数据筛选
high_salary = df[df['salary'] > 18000]
beijing = df[df['city'] == '北京']
# 数据分组
city_stats = df.groupby('city').agg({'salary': ['mean', 'max', 'count']})
# 数据排序
df_sorted = df.sort_values('salary', ascending=False)
# 数据合并
df2 = pd.DataFrame({'name': ['张三', '李四'], 'department': ['技术', '产品']})
merged = pd.merge(df, df2, on='name', how='left')
print(city_stats)
import matplotlib.pyplot as plt
import seaborn as sns
# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# 创建子图
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. 柱状图
axes[0, 0].bar(df['name'], df['salary'])
axes[0, 0].set_title('薪资对比')
axes[0, 0].set_xlabel('姓名')
axes[0, 0].set_ylabel('薪资')
# 2. 散点图
axes[0, 1].scatter(df['age'], df['salary'], s=100, alpha=0.6)
axes[0, 1].set_title('年龄与薪资关系')
axes[0, 1].set_xlabel('年龄')
axes[0, 1].set_ylabel('薪资')
# 3. 饼图(城市分布)
city_counts = df['city'].value_counts()
axes[1, 0].pie(city_counts, labels=city_counts.index, autopct='%1.1f%%')
axes[1, 0].set_title('城市分布')
# 4. 箱线图
axes[1, 1].boxplot(df['salary'])
axes[1, 1].set_title('薪资箱线图')
plt.tight_layout()
plt.savefig('data_analysis.png', dpi=150)
plt.show()
学习目标:理解 ML 原理,掌握 Scikit-learn 实战
1. 线性回归完整流程
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# 生成模拟数据
np.random.seed(42)
n_samples = 1000
# 特征:面积、房间数、房龄
X = np.random.randn(n_samples, 3)
X[:, 0] = X[:, 0] * 50 + 100 # 面积:50-150㎡
X[:, 1] = np.abs(X[:, 1]) * 2 + 1 # 房间数:1-5 间
X[:, 2] = np.abs(X[:, 2]) * 10 + 1 # 房龄:1-30 年
# 真实价格 = 面积*1000 + 房间数*50000 - 房龄*2000 + 噪声
y = (X[:, 0] * 1000 + X[:, 1] * 50000 - X[:, 2] * 2000 + np.random.randn(n_samples) * 50000)
# 划分数据集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 特征标准化
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"均方误差 MSE: {mse:.2f}")
print(f"决定系数 R²: {r2:.4f}")
print(f"模型系数: {model.coef_}")
print(f"截距: {model.intercept_:.2f}")
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('真实价格')
plt.ylabel('预测价格')
plt.title('线性回归:预测值 vs 真实值')
plt.grid(True, alpha=0.3)
plt.show()
2. 分类算法对比
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
# 生成分类数据
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=15,
n_redundant=5, n_classes=3, random_state=42
)
# 定义模型
models = {
'逻辑回归': LogisticRegression(max_iter=1000),
'SVM': SVC(),
'决策树': DecisionTreeClassifier(),
'随机森林': RandomForestClassifier(n_estimators=100),
'朴素贝叶斯': GaussianNB(),
'KNN': KNeighborsClassifier()
}
# 交叉验证评估
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = {'mean': scores.mean(), 'std': scores.std()}
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
plt.figure(figsize=(10, 6))
names = list(results.keys())
means = [results[name]['mean'] for name in names]
stds = [results[name]['std'] for name in names]
plt.bar(names, means, yerr=stds, alpha=0.7, capsize=5)
plt.ylabel('准确率')
plt.title('分类算法交叉验证对比')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)
plt.xticks(rotation=15)
plt.show()
3. 聚类算法实现
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
# 生成聚类数据
X, _ = make_blobs(
n_samples=500, centers=4, cluster_std=1.5, random_state=42
)
# K-Means 聚类(需要指定簇数)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)
kmeans_silhouette = silhouette_score(X, kmeans_labels)
# DBSCAN 聚类(自动发现簇数)
dbscan = DBSCAN(eps=1.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(f"K-Means: 发现 4 个簇,轮廓系数={kmeans_silhouette:.3f}")
print(f"DBSCAN: 发现{n_clusters}个簇")
# 肘部法则确定最佳 K 值
inertias = []
K_range = range(2, 11)
for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
plt.figure(figsize=(12, 5))
# 肘部图
plt.subplot(1, 2, 1)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('簇数 K')
plt.ylabel('簇内平方和 (Inertia)')
plt.title('肘部法则')
plt.grid(True, alpha=0.3)
# 聚类结果图(重新用 K=4 拟合,避免沿用循环中最后一个 K=10 的模型)
plt.subplot(1, 2, 2)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, marker='X', label='簇中心')
plt.title('K-Means 聚类结果 (K=4)')
plt.legend()
plt.tight_layout()
plt.show()
学习目标:掌握 PyTorch,理解深度学习原理
1. 构建神经网络
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt
# 检查 CUDA 可用性
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"使用设备:{device}")
# 定义神经网络
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size // 2)
        self.layer3 = nn.Linear(hidden_size // 2, num_classes)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer3(out)
        return out

# 超参数
input_size = 784  # MNIST 图像 28x28
hidden_size = 256
num_classes = 10
num_epochs = 10
batch_size = 64
learning_rate = 0.001

model = NeuralNetwork(input_size, hidden_size, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# 构造随机数据演示训练流程(实际使用时应加载 MNIST 等真实数据集)
X_train = torch.randn(1000, input_size).to(device)
y_train = torch.randint(0, num_classes, (1000,)).to(device)
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

train_losses = []
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {avg_loss:.4f}")

plt.figure(figsize=(8, 5))
plt.plot(train_losses, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('训练损失曲线')
plt.grid(True, alpha=0.3)
plt.show()
2. CNN 图像分类
import torch
import torch.nn as nn
import torch.nn.functional as F
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        # 第一个卷积块
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2)
        # 第二个卷积块
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2, 2)
        # 全连接层(28x28 输入经两次 2x2 池化后为 7x7)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool1(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool2(x)
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = CNN()
print(model)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"总参数量:{total_params:,}")
print(f"可训练参数量:{trainable_params:,}")
3. Transformer 注意力机制
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # 线性变换并分割成多头
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        x, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # 拼接多头并做输出投影
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(x)
        return output, attention_weights

# 测试多头注意力
d_model = 512
num_heads = 8
seq_length = 10
batch_size = 2
x = torch.randn(batch_size, seq_length, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, attention = mha(x, x, x)
print(f"输入形状:{x.shape}")
print(f"输出形状:{output.shape}")
print(f"注意力权重形状:{attention.shape}")
学习目标:掌握现代 NLP 技术,熟练使用大语言模型
1. 使用预训练模型
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
# 情感分析
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
texts = [
"I love this product! It's amazing!",
"This is the worst experience ever.",
"It's okay, nothing special."
]
results = sentiment_pipeline(texts)
for text, result in zip(texts, results):
    print(f"文本:{text}")
    print(f"情感:{result['label']}, 置信度:{result['score']:.4f}\n")
# 文本分类
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3) # 正面、中性、负面
# 编码文本
text = "这家餐厅的菜品味道很好,服务也很周到!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# 预测
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions).item()

labels = ["负面", "中性", "正面"]
print(f"预测类别:{labels[predicted_class]}")
print(f"置信度:{predictions[0][predicted_class]:.4f}")
2. RAG 检索增强生成
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
# 1. 加载文档
loader = TextLoader('knowledge_base.txt')
documents = loader.load()
# 2. 文本切分
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=50, length_function=len
)
splits = text_splitter.split_documents(documents)
# 3. 创建向量存储
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
vectorstore = FAISS.from_documents(splits, embeddings)
# 4. 创建检索器
retriever = vectorstore.as_retriever(
search_type="similarity", search_kwargs={"k": 3}
)
# 5. 自定义提示模板
prompt_template = """使用以下上下文信息来回答问题。如果不知道答案,就说不知道,不要编造答案。
上下文信息:{context}
问题:{question}
答案:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
# 6. 创建 QA 链
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(temperature=0),
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT}
)
# 7. 查询
query = "如何办理社保卡?"
result = qa_chain({"query": query})
print(f"问题:{query}")
print(f"答案:{result['result']}")
print("\n参考来源:")
for doc in result['source_documents']:
    print(f"- {doc.page_content[:100]}...")
3. 简单的 ChatGLM 对话示例
from transformers import AutoTokenizer, AutoModel
import torch
# 加载模型
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = model.eval()
# 对话历史
history = []
response = "你好!我是智能助手小政,有什么可以帮助您的吗?"
print(f"助手:{response}")
while True:
    user_input = input("\n用户:")
    if user_input.lower() in ['退出', 'exit', 'quit']:
        break
    # 添加政务系统人设
    system_prompt = "你是一个政务服务大厅的智能引导员,名叫'小政'。"
    # model.chat 返回的 history 已包含本轮对话,无需再手动 append
    response, history = model.chat(
        tokenizer,
        f"{system_prompt}\n用户:{user_input}",
        history=history,
        max_length=2048,
        temperature=0.7
    )
    print(f"小政:{response}")
| 项目类型 | 难度 | 涉及技术 | 预计时间 |
|---|---|---|---|
| 房价预测 | ⭐⭐ | Pandas, Scikit-learn | 1 周 |
| 图像分类 | ⭐⭐⭐ | PyTorch, CNN | 2 周 |
| 情感分析 | ⭐⭐⭐ | Transformers, NLP | 2 周 |
| 智能客服 | ⭐⭐⭐⭐ | LangChain, LLM | 3 周 |
| RAG 系统 | ⭐⭐⭐⭐⭐ | 向量数据库,Agent | 4 周 |
# project_structure.txt
"""
智能文档问答系统
│
├── data/ # 数据目录
│ ├── documents/ # 原始文档
│ └── vectorstore/ # 向量存储
│
├── src/ # 源代码
│ ├── config.py # 配置文件
│ ├── loader.py # 文档加载
│ ├── embeddings.py # 向量化
│ ├── retriever.py # 检索器
│ ├── generator.py # 生成器
│ └── api.py # API 接口
│
├── app.py # 主应用
├── requirements.txt # 依赖
└── README.md # 说明文档
"""
# config.py
import os
from dataclasses import dataclass
from typing import Optional
@dataclass
class Config:
    # API 密钥
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    # 模型配置
    EMBEDDING_MODEL: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    LLM_MODEL: str = "gpt-3.5-turbo"
    LLM_TEMPERATURE: float = 0.7
    LLM_MAX_TOKENS: int = 1000
    # 向量存储配置
    CHUNK_SIZE: int = 500
    CHUNK_OVERLAP: int = 50
    VECTOR_DB_PATH: str = "data/vectorstore"
    # 检索配置
    TOP_K: int = 3
    SIMILARITY_THRESHOLD: float = 0.7
    # API 配置
    API_HOST: str = "0.0.0.0"
    API_PORT: int = 8000
# loader.py
from typing import List
from langchain.document_loaders import (
    TextLoader, PyPDFLoader, DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class DocumentProcessor:
    """文档加载与切分(类名为示意,可按项目习惯命名)"""
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""]
        )

    def load_text(self, file_path: str) -> List[Document]:
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()
        return self.text_splitter.split_documents(documents)

    def load_pdf(self, file_path: str) -> List[Document]:
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        return self.text_splitter.split_documents(documents)

    def load_directory(self, directory: str, glob: str = "**/*.txt") -> List[Document]:
        loader = DirectoryLoader(directory, glob=glob)
        documents = loader.load()
        return self.text_splitter.split_documents(documents)
# api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import uvicorn

app = FastAPI(title="智能文档问答系统", version="1.0.0")

class QueryRequest(BaseModel):
    question: str
    top_k: Optional[int] = 3

class QueryResponse(BaseModel):
    answer: str
    sources: List[str]
    confidence: float

@app.get("/")
def root():
    return {"name": "智能文档问答系统", "version": "1.0.0",
            "endpoints": {"query": "/query", "health": "/health"}}

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest):
    try:
        # 此处应调用检索器与生成器模块,以下返回值仅为占位
        return QueryResponse(
            answer="...",
            sources=["...", "..."],
            confidence=0.0
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000, reload=True)
"""
项目:预测客户是否会购买理财产品
数据集:模拟银行客户数据
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# ========== 1. 数据生成 ==========
np.random.seed(42)
n_samples = 5000
data = {
'年龄': np.random.randint(18, 70, n_samples),
'收入': np.random.randint(3000, 50000, n_samples),
'存款': np.random.randint(0, 1000000, n_samples),
'债务': np.random.randint(0, 500000, n_samples),
'信用评分': np.random.randint(300, 850, n_samples),
    # 以下类别特征的名称与取值为模拟数据示意
    '工作年限': np.random.randint(0, 40, n_samples),
    '家庭人数': np.random.randint(1, 6, n_samples),
    '职业': np.random.choice(['公务员', '教师', '工程师', '医生', '自由职业'], n_samples),
    '婚姻状况': np.random.choice(['未婚', '已婚', '离异'], n_samples),
    '学历': np.random.choice(['高中', '本科', '硕士', '博士'], n_samples),
}
df = pd.DataFrame(data)
def calc_purchase_prob(row):
    """按简单规则打分,生成模拟的购买概率(规则为示意)"""
    score = 0
    if 30 <= row['年龄'] <= 55: score += 20
    if row['收入'] > 15000: score += 25
    if row['存款'] > 200000: score += 25
    if row['信用评分'] > 650: score += 15
    if row['学历'] in ['硕士', '博士']: score += 15
    return min(score + np.random.randint(-10, 10), 100) / 100

df['购买概率'] = df.apply(calc_purchase_prob, axis=1)
df['是否购买'] = (df['购买概率'] > 0.5).astype(int)

print("=" * 50)
print("数据概览")
print("=" * 50)
df.info()
print("\n目标变量分布:")
print(df['是否购买'].value_counts())
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

axes[0, 0].hist(df[df['是否购买'] == 0]['年龄'], bins=30, alpha=0.5, label='未购买')
axes[0, 0].hist(df[df['是否购买'] == 1]['年龄'], bins=30, alpha=0.5, label='购买')
axes[0, 0].set_xlabel('年龄')
axes[0, 0].set_ylabel('人数')
axes[0, 0].legend()
axes[0, 0].set_title('年龄分布')

axes[0, 1].hist(df[df['是否购买'] == 0]['收入'], bins=30, alpha=0.5, label='未购买')
axes[0, 1].hist(df[df['是否购买'] == 1]['收入'], bins=30, alpha=0.5, label='购买')
axes[0, 1].set_xlabel('收入')
axes[0, 1].legend()
axes[0, 1].set_title('收入分布')

axes[0, 2].hist(df[df['是否购买'] == 0]['存款'], bins=30, alpha=0.5, label='未购买')
axes[0, 2].hist(df[df['是否购买'] == 1]['存款'], bins=30, alpha=0.5, label='购买')
axes[0, 2].set_xlabel('存款')
axes[0, 2].legend()
axes[0, 2].set_title('存款分布')

career_purchase = df.groupby('职业')['是否购买'].mean()
axes[1, 0].bar(career_purchase.index, career_purchase.values)
axes[1, 0].set_ylabel('购买率')
axes[1, 0].set_title('各职业购买率')

edu_purchase = df.groupby('学历')['是否购买'].mean()
axes[1, 1].bar(edu_purchase.index, edu_purchase.values)
axes[1, 1].set_ylabel('购买率')
axes[1, 1].set_title('各学历购买率')

numeric_cols = ['年龄', '收入', '存款', '债务', '信用评分', '工作年限', '家庭人数', '是否购买']
correlation = df[numeric_cols].corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[1, 2])
axes[1, 2].set_title('特征相关性热力图')

plt.tight_layout()
plt.savefig('eda_analysis.png', dpi=150)
plt.show()
le = LabelEncoder()
df['职业编码'] = le.fit_transform(df['职业'])
df['婚姻编码'] = le.fit_transform(df['婚姻状况'])
df['学历编码'] = le.fit_transform(df['学历'])
df['债务收入比'] = df['债务'] / (df['收入'] * 12 + 1)
df['存款收入比'] = df['存款'] / (df['收入'] * 12 + 1)
df['净资产'] = df['存款'] - df['债务']
feature_cols = ['年龄', '收入', '存款', '债务', '信用评分', '工作年限', '家庭人数',
                '职业编码', '婚姻编码', '学历编码', '债务收入比', '存款收入比', '净资产']
X = df[feature_cols]
y = df['是否购买']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
models = {
    '逻辑回归': Pipeline([('scaler', StandardScaler()),
                      ('clf', LogisticRegression(max_iter=1000, random_state=42))]),
    '随机森林': RandomForestClassifier(n_estimators=100, random_state=42),
    '梯度提升': GradientBoostingClassifier(random_state=42)
}
results = {}
for name, model in models.items():
    print(f"\n训练模型:{name}")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'accuracy': model.score(X_test, y_test),
        'auc': roc_auc_score(y_test, y_pred_proba)
    }
    print(f"准确率:{results[name]['accuracy']:.4f}")
    print(f"AUC:{results[name]['auc']:.4f}")

best_model_name = max(results, key=lambda x: results[x]['auc'])
best_model = results[best_model_name]['model']
print(f"\n最佳模型:{best_model_name}")
print("=" * 50)
print("分类报告:")
print(classification_report(y_test, results[best_model_name]['y_pred']))
cm = confusion_matrix(y_test, results[best_model_name]['y_pred'])
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('预测标签')
plt.ylabel('真实标签')
plt.title('混淆矩阵')
plt.show()

plt.figure(figsize=(8, 6))
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
    plt.plot(fpr, tpr, label=f"{name} (AUC={result['auc']:.3f})")
plt.plot([0, 1], [0, 1], 'k--', label='随机猜测')
plt.xlabel('假正率 (FPR)')
plt.ylabel('真正率 (TPR)')
plt.title('ROC 曲线对比')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        '特征': feature_cols,
        '重要性': best_model.feature_importances_
    }).sort_values('重要性', ascending=False)
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['特征'], feature_importance['重要性'])
    plt.xlabel('重要性')
    plt.title('特征重要性排序')
    plt.tight_layout()
    plt.show()
    print("\n特征重要性:")
    print(feature_importance)

print("\n" + "=" * 50)
print("项目完成!")
print("=" * 50)
| 平台 | 课程 | 适合阶段 | 难度 |
|---|---|---|---|
| Coursera | Machine Learning (Andrew Ng) | 初学者 | ⭐⭐⭐ |
| Coursera | Deep Learning Specialization(吴恩达) | 阶段 2-3 | ⭐⭐⭐⭐ |
| 台大(李宏毅) | Machine Learning | 中级 | ⭐⭐⭐⭐ |
| Fast.ai | Practical Deep Learning for Coders | 实战导向 | ⭐⭐⭐⭐ |
| 极客时间 | Python 进阶 | 阶段 0-1 | ⭐⭐ |
答: 完全可以!Python 是公认最适合初学者的语言,建议从阶段 0 的 Python 基础学起,循序渐进。
答: 可以,但需要补充必要的数学知识,主要是线性代数、概率统计和微积分的基础部分。
建议:边做项目边补数学,遇到不懂的再学。
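例如,机器学习里最核心的微积分概念(导数、梯度下降)完全可以边写代码边理解。下面是一个极简的一维梯度下降示意(目标函数取 f(x) = (x - 3)²,纯属演示):

```python
# 用梯度下降最小化 f(x) = (x - 3)^2,其导数为 f'(x) = 2(x - 3)
def gradient_descent(lr=0.1, steps=100):
    x = 0.0  # 初始点
    for _ in range(steps):
        grad = 2 * (x - 3)  # 手动求导得到梯度
        x -= lr * grad      # 沿负梯度方向更新
    return x

print(gradient_descent())  # 收敛到最小值点附近,x ≈ 3
```

跑一遍这几行代码,再回头看教材里的梯度下降公式,理解会直观很多。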
答: 因人而异,大致时间线:
| 投入时间 | 学习周期 | 可达到水平 |
|---|---|---|
| 1 小时/天 | 12-18 个月 | 初级 AI 工程师 |
| 2-3 小时/天 | 8-12 个月 | 中级 AI 工程师 |
| 全职学习 | 4-6 个月 | 实战能力 |
关键:项目经验 > 理论知识,一定要做项目!
答: 有多种解决方案:可以使用 Google Colab、Kaggle 等免费 GPU 资源,或按需租用云 GPU。本地显存不足时,还可以尝试以下优化技巧:
# 减小 batch size
batch_size = 16 # 而不是 64
# 使用混合精度训练
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
# 梯度累积
accumulation_steps = 4
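上面的混合精度与梯度累积可以组合成一个最小训练循环。下面是一个示意性写法(模型与数据均为随机构造,仅演示流程;CPU 环境下 autocast 会自动禁用):

```python
# 混合精度 + 梯度累积的最小训练循环示意
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_amp = torch.cuda.is_available()  # CPU 上退化为普通精度

model = nn.Linear(32, 4).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler(enabled=use_amp)

accumulation_steps = 4  # 每累积 4 个小 batch 再更新一次参数
optimizer.zero_grad()
for step in range(8):
    x = torch.randn(16, 32, device=device)
    y = torch.randint(0, 4, (16,), device=device)
    with autocast(enabled=use_amp):
        loss = criterion(model(x), y) / accumulation_steps  # 损失按累积步数缩放
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)   # 近似等效于 batch=16*4 的一次更新
        scaler.update()
        optimizer.zero_grad()
```

这样小显存也能模拟大 batch 的训练效果,代价是每次参数更新需要多个前向/反向步。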
答: AI 技术迭代快,建议保持持续学习:跟进开源社区与论文动态,定期动手复现新技术,并在项目中沉淀经验。
# 1. 安装 Anaconda(推荐)
# 下载:https://www.anaconda.com/
# 2. 创建虚拟环境
conda create -n ai_env python=3.10
# 3. 激活环境
conda activate ai_env
# 4. 安装核心库
pip install numpy pandas matplotlib seaborn
pip install scikit-learn xgboost lightgbm
pip install torch torchvision torchaudio
pip install transformers langchain
pip install jupyterlab
# 5. 启动 Jupyter
jupyter lab
# Jupyter 相关
jupyter notebook # 启动 notebook
jupyter lab # 启动 lab
jupyter nbconvert # 转换 notebook 格式
# Git 相关
git clone <url> # 克隆仓库
git add . # 添加更改
git commit -m "msg" # 提交
git push # 推送
# Conda 相关
conda env list # 列出环境
conda install <pkg> # 安装包
conda env remove -n <env> # 删除环境
