30 Essential Python Packages for Data Engineering | 极客日志

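This article presents 30 essential Python packages for data engineering, covering task notification, progress monitoring, logging, text processing, statistical computation, and model validation. It gives install commands, code examples, and use cases for tools such as Knockknock, tqdm, pandas-log, Emoji, and TheFuzz. These tools can speed up data processing, simplify debugging, and strengthen model evaluation, and they suit data analysis, machine learning, and automated operations work.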

NodeJser, published 2025/2/6, updated 2026/6/3, 22 views

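Python is among the most approachable programming languages and has a rich ecosystem; it is widely used in data processing, machine learning, and automation. Built on foundational libraries such as NumPy and SciPy and sustained by a large community, it offers many efficient packages that can noticeably streamline a data engineer's workflow.

This article selects 30 Python packages that are particularly useful in data engineering, data analysis, and machine learning. They cover task notification, progress monitoring, logging, text processing, statistical computation, model validation, and more. Each is introduced below with its core function, install command, and a typical use case.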

1. Knockknock

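Knockknock is a package that sends notifications about the status of machine learning model training. When training finishes or crashes, it can notify you through channels such as e-mail, Slack, and Microsoft Teams, so that you learn the outcome promptly.

Install:

pip install knockknock

Usage example:

from knockknock import email_sender
from sklearn.linear_model import LinearRegression
import numpy as np

# Replace both placeholder addresses with real ones before running
@email_sender(recipient_emails=["recipient@example.com"], sender_email="sender@example.com")
def train_linear_model(your_nicest_parameters):
    x = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(x, np.array([1, 2])) + 3
    regression = LinearRegression().fit(x, y)
    # The returned score is included in the notification
    return regression.score(x, y)

Use case: long-running training jobs, where a failure should be noticed when it happens rather than only after waiting for the run to end.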
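To make the idea of a notifying decorator concrete, here is a minimal standard-library sketch. It is not Knockknock's implementation: `notify_on_finish` and the `send` callback are hypothetical names, and a plain list stands in for the e-mail or chat channel.

```python
import functools
import traceback


def notify_on_finish(send):
    """Hypothetical decorator factory: report a function's outcome via `send`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Report the crash, then re-raise so the caller still sees it.
                send(f"{func.__name__} crashed:\n{traceback.format_exc()}")
                raise
            send(f"{func.__name__} finished with result {result!r}")
            return result
        return wrapper
    return decorator


messages = []  # stands in for an e-mail or chat channel


@notify_on_finish(messages.append)
def train():
    return 0.97


@notify_on_finish(messages.append)
def broken_train():
    raise RuntimeError("out of memory")


train()
try:
    broken_train()
except RuntimeError:
    pass  # the crash was reported before the exception propagated
```

After this runs, `messages` holds one success report and one crash report. Re-raising inside the wrapper keeps the decorated function's normal error behaviour intact.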

2. tqdm

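tqdm is one of the most popular progress bar libraries for Python. Wrapped around an iterable in a loop, it shows visual progress feedback, and it works in both Jupyter Notebook and the command-line terminal.

Install:

pip install tqdm

Usage example:

from tqdm import tqdm

q = 0
# Wrap any iterable in tqdm() to show a progress bar while looping
for i in tqdm(range(1000000)):  # illustrative loop size
    q = i + 1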
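The core idea behind a progress wrapper can be sketched with the standard library alone. The `progress` generator below is a hypothetical helper, not tqdm's implementation: it yields items unchanged and writes a running count to a stream as a side effect.

```python
import sys


def progress(iterable, total, every=1, out=sys.stderr):
    """Hypothetical helper: yield items unchanged while reporting progress."""
    for count, item in enumerate(iterable, start=1):
        # Refresh the status line every `every` items and on the last item.
        if count % every == 0 or count == total:
            out.write(f"\r{count}/{total} ({100 * count // total}%)")
            out.flush()
        yield item
    out.write("\n")


running_sum = 0
for i in progress(range(1000), total=1000, every=100):
    running_sum += i
```

Because the wrapper is a generator, the loop body stays the same as without it; only the `for` line changes.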


3. pandas-log

Install:

pip install pandas-log

Usage example:

import pandas as pd
import numpy as np
import pandas_log

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Inside this context, pandas-log reports what each chained step did
with pandas_log.enable():
    res = df.drop("born", axis=1).groupby('name')

4. Emoji

Install:

pip install emoji

Usage example:

import emoji
# Convert the :thumbs_up: shortcode into the emoji character
print(emoji.emojize('Python is :thumbs_up:'))

5. TheFuzz

Install:

pip install thefuzz

Usage example:

from thefuzz import fuzz, process

# Similarity score between two sentences
score = fuzz.ratio("Test the word", "test the Word!")

# Pick the closest matches from several candidates
choices = ["Atlanta Falcons", "New York Jets", "New York Giants"]
result = process.extract("new york jets", choices, limit=2)
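For intuition about what a similarity score and a "best matches" lookup involve, here is a sketch using the standard library's `difflib`. It swaps in `SequenceMatcher` and is not TheFuzz's scoring, so its numbers may differ from `fuzz.ratio`. The helper names `similarity` and `best_matches` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib, not TheFuzz)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_matches(query, choices, limit=2):
    """Return the `limit` highest-scoring (choice, score) pairs."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants"]
top = best_matches("new york jets", choices)
```

Lower-casing both strings before comparison is a design choice of this sketch; drop it if case differences should count against the score.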

6. Numerizer

Install:

pip install numerizer

Usage example:

from numerizer import numerize
numerize('forty two')                # returns '42'
numerize('nine and three quarters')  # returns '9.75'

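7. PyAutoGUI

Install:

pip install pyautogui

Usage example:

import pyautogui
pyautogui.moveTo(10, 15)   # move the mouse pointer to screen position (10, 15)
pyautogui.click()          # single left click
pyautogui.doubleClick()    # double click
pyautogui.press('enter')   # press the Enter key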

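8. Weightedcalcs

Install:

pip install weightedcalcs

Usage example:

import seaborn as sns
import weightedcalcs as wc

df = sns.load_dataset('mpg')
# Use the "mpg" column as the weight variable
calc = wc.Calculator("mpg")
# Weighted distribution of the "origin" categories
calc.distribution(df, "origin")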

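9. scikit-posthocs

Install:

pip install scikit-posthocs

Usage example:

import statsmodels.api as sa
import statsmodels.formula.api as sfa
import scikit_posthocs as sp

df = sa.datasets.get_rdataset('iris').data
# regex=False treats '.' as a literal dot, so 'Sepal.Width' becomes 'SepalWidth'
df.columns = df.columns.str.replace('.', '', regex=False)
lm = sfa.ols('SepalWidth ~ C(Species)', data=df).fit()
anova = sa.stats.anova_lm(lm)
# Pairwise t-tests between species with Holm correction
sp.posthoc_ttest(df, val_col='SepalWidth', group_col='Species', p_adjust='holm')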

10. Cerberus

Install:

pip install cerberus

Usage example:

from cerberus import Validator
schema = {'name': {'type': 'string'}, 'age': {'type': 'integer'}}
v = Validator(schema)
document = {'name': 'john doe', 'age': 15}
v.validate(document)  # returns True
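What a type-only schema check amounts to can be sketched in a few lines of plain Python. This is an illustration of the idea, not Cerberus's implementation; `TYPE_MAP` and `type_errors` are hypothetical names, and only the two type names used above are handled.

```python
# Hypothetical mapping from schema type names to Python types
TYPE_MAP = {"string": str, "integer": int}


def type_errors(document, schema):
    """Return {field: message} for every field whose value has the wrong type."""
    errors = {}
    for field, rule in schema.items():
        if field not in document:
            continue  # fields are treated as optional in this sketch
        expected = TYPE_MAP[rule["type"]]
        value = document[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            errors[field] = f"must be of {rule['type']} type"
    return errors


schema = {"name": {"type": "string"}, "age": {"type": "integer"}}
ok = type_errors({"name": "john doe", "age": 15}, schema)
bad = type_errors({"name": "john doe", "age": "15"}, schema)
```

An empty result means the document passed; a non-empty one names each offending field.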

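11. ppscore

Install:

pip install ppscore

Usage example:

import seaborn as sns
import ppscore as pps
df = sns.load_dataset('mpg')
# Predictive Power Score of every other column with respect to the target 'mpg'
pps.predictors(df, 'mpg')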

12. Maya

Install:

pip install maya

Usage example:

import maya
now = maya.now()
# Parse a human-readable date expression
tomorrow = maya.when('tomorrow')
print(tomorrow.datetime())

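13. Pendulum

Install:

pip install pendulum

Usage example:

import pendulum
now = pendulum.now("Europe/Berlin")
# Pendulum datetimes are immutable, so keep the returned values
tokyo_now = now.in_timezone("Asia/Tokyo")
in_two_days = now.add(days=2)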

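14. category_encoders

Install:

pip install category_encoders

Usage example:

import seaborn as sns
from category_encoders import BinaryEncoder
df = sns.load_dataset('mpg')  # sample data containing the categorical column 'origin'
enc = BinaryEncoder(cols=['origin']).fit(df)
numeric_dataset = enc.transform(df)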

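15. scikit-multilearn

Install:

pip install scikit-multilearn

Usage example:

from skmultilearn.dataset import load_dataset
from skmultilearn.adapt import MLkNN
X_train, y_train, _, _ = load_dataset('emotions', 'train')
# Load the matching test split so that X_test is defined
X_test, y_test, _, _ = load_dataset('emotions', 'test')
classifier = MLkNN(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)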

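16. Multiset

Install:

pip install multiset

Usage example:

from multiset import Multiset
# A multiset keeps duplicates: here 'a' has multiplicity 2 and 'b' has 1
set1 = Multiset('aab')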
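For comparison, the standard library's `collections.Counter` already supports basic multiset arithmetic. The sketch below uses only `Counter`, not the `multiset` package, to show what multiplicity-aware union, intersection, and sum look like.

```python
from collections import Counter

bag1 = Counter("aab")  # multiplicities: a -> 2, b -> 1
bag2 = Counter("abc")  # multiplicities: a -> 1, b -> 1, c -> 1

union = bag1 | bag2         # per element, the larger multiplicity
intersection = bag1 & bag2  # per element, the smaller multiplicity
combined = bag1 + bag2      # per element, the sum of multiplicities
```

`Counter` is enough for counting and these three operations; a dedicated multiset type is mainly a matter of a richer set-like interface.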

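17. Jazzit

Install:

pip install jazzit

Usage example:

from jazzit import error_track

# Play music.mp3 when the decorated function raises an error
@error_track("music.mp3", wait=5)
def run():
    for num in reversed(range(10)):
        print(10/num)  # the last iteration divides by zero on purpose

run()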

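18. handcalcs

Install:

pip install handcalcs

Usage example (Jupyter Notebook; %%render is a cell magic and must be the first line of its own cell):

Cell 1:

import handcalcs.render
from math import sqrt

Cell 2:

%%render
a = 4
b = 6
c = sqrt(3*a + b/7)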

19. NeatText

Install:

pip install neattext

Usage example:

import neattext as nt
mytext = "This is the word sample but ,our WEBSITE is https://example.com ."
docx = nt.TextFrame(text=mytext)
docx.normalize()

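20. Combo

Install:

pip install combo

Usage example:

from combo.models.classifier_stacking import Stacking
# `classifiers` is a list of scikit-learn estimators; X_train and y_train are the
# training features and labels. All three must be defined beforehand.
clf = Stacking(classifiers, n_folds=4, shuffle_data=False)
clf.fit(X_train, y_train)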

21. PyAztro

Install:

pip install pyaztro

Usage example:

import pyaztro
# Horoscope text for the given zodiac sign
pyaztro.Aztro(sign='gemini').description

22. Faker

Install:

pip install Faker

Usage example:

from faker import Faker
fake = Faker()
# Generate a random fake person name
print(fake.name())

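23. Fairlearn

Install:

pip install fairlearn

Usage example:

from fairlearn.metrics import MetricFrame, selection_rate
from fairlearn.datasets import fetch_adult
data = fetch_adult(as_frame=True)
y_true = (data.target == '>50K') * 1  # 1 if income is above 50K, else 0
sex = data.data['sex']                # sensitive feature
# selection_rate needs y_pred; passing the true labels measures the data itself
selection_rates = MetricFrame(metrics=selection_rate, y_true=y_true, y_pred=y_true, sensitive_features=sex)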

24. tiobeindexpy

Install:

pip install tiobeindexpy

Usage example:

from tiobeindexpy import tiobeindexpy as tb
# Top 20 programming languages from the TIOBE index
df = tb.top_20()

25. pytrends

Install:

pip install pytrends

Usage example:

from pytrends.request import TrendReq
pytrend = TrendReq()
# Keyword suggestions from Google Trends
keywords = pytrend.suggestions(keyword='Present Gift')

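26. visions

Install:

pip install visions

Usage example:

import seaborn as sns
from visions.functional import detect_type, infer_type
from visions.typesets import CompleteSet
df = sns.load_dataset('titanic')
# Detect the semantic type of each column
print(detect_type(df, CompleteSet()))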

27. Schedule

Install:

pip install schedule

Usage example:

import schedule
import time

def job():
    print("I'm working...")

# Run job() every 10 seconds
schedule.every(10).seconds.do(job)

while True:
    schedule.run_pending()  # run any job that is due
    time.sleep(1)

28. autocorrect

Install:

pip install autocorrect

Usage example:

from autocorrect import Speller
spell = Speller()
# Returns the sentence with misspelled words corrected
spell("I'm not sleaspy and tehre is no place I'm giong to.")

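29. funcy

Install:

pip install funcy

Usage example:

from funcy import select, even
# Keep only the even numbers; select() preserves the collection type (a set here)
select(even, set(range(20)))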

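30. IceCream

Install:

pip install icecream

Usage example:

from icecream import ic

def some_function(i):
    return i + 35

# Prints the expression together with its value: ic| some_function(121): 156
ic(some_function(121))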

