A Guide to DIAMOND, the Fast Sequence Alignment Tool, and Parallel Computing on HPC Clusters | 极客日志


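DIAMOND is a fast sequence alignment tool for gene sequence analysis. This guide covers installation (including compiling from source and Docker), basic blastx/blastp command examples, and a SLURM-based setup for multi-node parallel runs on HPC clusters, including the shared and temporary directories it requires. The results section covers filtering by score, E-value, and identity threshold, with AWK, Python, and R examples for extracting the best hit per query. DIAMOND also offers tunable parameters and sensitivity levels, which makes it suitable for large-scale data.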

监控大屏 · Published 2025/1/17 · Updated 2026/6/1220 views

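Using DIAMOND, the Fast Sequence Alignment Tool, in Detail

DIAMOND is a fast sequence alignment tool. Its usage is described below.

1. Install DIAMOND

Download the release package from the official site and install it locally. Docker, Conda, and building from source are also available, though they are rarely needed. Note that recent versions require a newer GCC; if you hit GCC errors, either download an older DIAMOND release or upgrade GCC to at least the required version.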

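# Download the DIAMOND release archive
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
# Extracting it produces a single executable named diamond
tar -xzvf diamond-linux64.tar.gz
# Then call it in one of three ways:
# 1) after moving it to a directory on PATH, or adding the current directory to PATH
diamond blastx
# 2) by relative path from the current directory
./diamond blastx
# 3) by full path, for example if the executable was placed at /opt/diamond
/opt/diamond blastx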
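The three ways of calling the executable listed above can also be checked from a script. The following is a minimal sketch, assuming Python 3 is available; the helper name `locate_diamond` is hypothetical, and the candidate paths are simply the ones mentioned in the text.

```python
import shutil

def locate_diamond(candidates=("diamond", "./diamond", "/opt/diamond")):
    """Return the first candidate that resolves to an executable, or None.

    Hypothetical helper: a bare name is looked up on PATH, while a name
    containing a directory component is checked directly.
    """
    for candidate in candidates:
        path = shutil.which(candidate)
        if path is not None:
            return path
    return None

path = locate_diamond()
if path is None:
    print("diamond not found; add it to PATH or call it by its full path")
else:
    print(f"using {path}")
```

If the function returns `None`, none of the three call styles would work as written, which usually means the archive was extracted somewhere else.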

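2. Prepare the data set

First prepare the sequence data set to align against, for example a sequence file in FASTA format.

# Download the nr database, or whichever database you need
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz
# Build a DIAMOND-format database with diamond makedb
diamond makedb --in nr --db nr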

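3. Run DIAMOND

Basic usage

Enter a command of the following form in a terminal to start DIAMOND and run an alignment: diamond blastx -d [reference database] -q [query sequence file] -o [output file name]

# Download the nr database, or whichever database you need
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz
# Build a DIAMOND-format database with diamond makedb
diamond makedb --in nr --db nr
# Run the searches
diamond blastx --db nr -q reads.fna -o dna_matches_fmt6.txt
diamond blastp --db nr -q reads.faa -o protein_matches_fmt6.txt

Here blastx translates DNA query sequences and aligns them against a protein reference database, while blastp aligns protein queries directly. -d (--db) and -q specify the reference database and the query file, and -o names the output file.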
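When many searches have to be launched, it can help to assemble the command line in a script first. This is a minimal sketch rather than an official wrapper: the helper name `build_diamond_cmd` and the default thread count are assumptions, and only the flags `--db`, `--query`, `--out`, `--threads`, and `--outfmt` are taken from DIAMOND's own help output.

```python
import shlex

def build_diamond_cmd(mode, db, query, out, threads=4, outfmt=6, extra=()):
    """Return the argument list for a DIAMOND search (hypothetical helper)."""
    if mode not in ("blastp", "blastx"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = [
        "diamond", mode,
        "--db", db,             # DIAMOND-format database from makedb
        "--query", query,       # FASTA file with the query sequences
        "--out", out,           # where the alignments are written
        "--threads", str(threads),
        "--outfmt", str(outfmt),
    ]
    cmd.extend(extra)           # any further flags, passed through unchanged
    return cmd

cmd = build_diamond_cmd("blastx", "nr", "reads.fna", "dna_matches_fmt6.txt")
# Print the command instead of running it; pass cmd to subprocess.run to execute
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems when file names contain spaces.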

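Multi-node parallel computing on HPC clusters (Distributed computing)

DIAMOND is fast, but aligning large inputs can still take a long time: a file larger than 1 GB may need several days on a single 40-core node. If more nodes are available, the work can be spread across them.

Preparation:

Share the DIAMOND program directory across all nodes
Share the sample sequence directory across all nodes
Use the same temporary directory on all nodes, and share it across the nodes as well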

# Diamond distributed-memory parallel processing
# Diamond supports the parallel processing of large alignments on HPC clusters and supercomputers, spanning numerous compute nodes.
# Work distribution is orchestrated by Diamond via a shared file system typically available on such clusters, using lightweight file-based stacks and POSIX functionality.

# Initialize the multiprocessing run (--mp-init)
diamond blastp --db DATABASE.dmnd --query QUERY.fasta --multiprocessing --mp-init --tmpdir [temporary directory] --parallel-tmpdir [shared temporary directory]

# Run the alignment with the same directories, without --mp-init and with an output file
diamond blastp --db DATABASE.dmnd --query QUERY.fasta -o OUTPUT_FILE --multiprocessing --tmpdir [temporary directory] --parallel-tmpdir [shared temporary directory]

# SLURM batch script fragment
module purge
module load gcc impi
export SLURM_HINT=multithread

srun diamond FLAGS

FLAGS refers to the aforementioned parallel flags for Diamond. Note that the actual configuration of the nodes varies between machines, and therefore the parameters shown here are not of general applicability. It is recommended to start with few nodes on small problems first.

Parallel runs can be aborted and later resumed, and unfinished work packages from a previous run can be recovered and resubmitted in a subsequent run. Using the option --multiprocessing --mp-recover with the same value of --parallel-tmpdir will scan the working directory and configure a new parallel run including only the work packages that have not been completed in the previous run. Placing a file named stop in the working directory causes DIAMOND processes to shut down in a controlled way after finishing the current work package. After removing the stop file, the multiprocessing run can be continued.

The granularity of the size of the work packages can be adjusted via the --block-size option, which at the same time affects the memory requirements at runtime. Parallel runs on more than 512 nodes of a supercomputer have been performed successfully.
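The initialization step and the batch job described above can be generated from one place so that both always use the same `--parallel-tmpdir`. The sketch below is an illustration under stated assumptions: the helper names, the `#SBATCH` values, and the directories `/scratch/tmp` and `/shared/ptmp` are hypothetical placeholders, and the `module load` line is copied from the text and will differ between clusters.

```python
import shlex

def parallel_flags(db, query, tmpdir, parallel_tmpdir,
                   out=None, init=False, recover=False):
    """Assemble the multiprocessing flags quoted in the text (hypothetical helper)."""
    flags = ["blastp", "--db", db, "--query", query, "--multiprocessing",
             "--tmpdir", tmpdir, "--parallel-tmpdir", parallel_tmpdir]
    if init:
        flags.append("--mp-init")      # initialize the multiprocessing run
    if recover:
        flags.append("--mp-recover")   # continue an interrupted run
    if out is not None:
        flags += ["-o", out]
    return flags

def slurm_script(flags, nodes=2, time_limit="24:00:00"):
    """Render a batch script; every #SBATCH value is a placeholder to adapt."""
    lines = [
        "#!/bin/bash -l",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --time={time_limit}",
        "module purge",
        "module load gcc impi",
        "export SLURM_HINT=multithread",
        "srun diamond " + shlex.join(flags),
    ]
    return "\n".join(lines) + "\n"

init_flags = parallel_flags("DATABASE.dmnd", "QUERY.fasta",
                            "/scratch/tmp", "/shared/ptmp", init=True)
run_flags = parallel_flags("DATABASE.dmnd", "QUERY.fasta",
                           "/scratch/tmp", "/shared/ptmp", out="OUTPUT_FILE")

print("diamond " + shlex.join(init_flags))   # initialization command
print(slurm_script(run_flags))               # script text to save and submit
```

To resume an interrupted run, the same helper can be called with `recover=True` and the unchanged `parallel_tmpdir`, which mirrors the `--mp-recover` behaviour described in the text.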

# Download the tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
# Create a DIAMOND-formatted database file
./diamond makedb --in reference.fasta -d reference
# Run a search in blastp mode
./diamond blastp -d reference -q queries.fasta -o matches.tsv
# Run a search in blastx mode
./diamond blastx -d reference -q reads.fasta -o matches.tsv
# Download and use a BLAST database
update_blastdb.pl --decompress --blastdb_version 5 swissprot
./diamond prepdb -d swissprot
./diamond blastp -d swissprot -q queries.fasta -o matches.tsv

Some important points to consider:

Repeat masking is applied to the query and reference sequences by default. To disable it, use --masking 0. Masking concerns repetitive regions inside the sequences, not repeated hits in the output; to limit how many targets are reported per query, use --max-target-seqs (-k) instead.

DIAMOND is optimized for large input files of >1 million proteins. Naturally the tool can be used for smaller files as well, but the algorithm will not reach its full efficiency.

The program may use quite a lot of memory and also temporary disk space. Should the program fail due to running out of either one, you need to set a lower value for the block size parameter -b.

The sensitivity can be adjusted using the options --fast, --mid-sensitive, --sensitive, --more-sensitive, --very-sensitive and --ultra-sensitive. The later an option appears in this list, the closer the results are to native BLAST, but the slower the search. --more-sensitive is usually a reasonable middle ground; use --fast when computing resources are limited.
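The sensitivity and block size options discussed above are easy to mistype, so a small validation step can be useful before a long job is submitted. This is a sketch only: the helper name `tuning_flags` is hypothetical, and the list of modes is copied from the options named in the text.

```python
# Sensitivity modes in the order given in the text, fastest first
SENSITIVITY_FLAGS = [
    "--fast", "--mid-sensitive", "--sensitive",
    "--more-sensitive", "--very-sensitive", "--ultra-sensitive",
]

def tuning_flags(sensitivity=None, block_size=None):
    """Return extra DIAMOND flags for a sensitivity mode and a block size."""
    flags = []
    if sensitivity is not None:
        flag = "--" + sensitivity
        if flag not in SENSITIVITY_FLAGS:
            raise ValueError(f"unknown sensitivity mode: {sensitivity}")
        flags.append(flag)
    if block_size is not None:
        if block_size <= 0:
            raise ValueError("block size must be positive")
        # A lower block size reduces memory and temporary disk usage
        flags += ["--block-size", str(block_size)]
    return flags

print(tuning_flags("more-sensitive", block_size=1.0))
```

The returned list can be appended to any blastp or blastx command line.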

5. Help reference

Full parameter help

diamond --help
diamond v2.0.11.149 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
Syntax: diamond COMMAND [OPTIONS]
Commands:
makedb Build DIAMOND database from a FASTA file # creates a DIAMOND-format database from a FASTA file
blastp Align amino acid query sequences against a protein reference database # same function as native BLAST blastp
blastx Align DNA query sequences against a protein reference database # same function as native BLAST blastx
view View DIAMOND alignment archive (DAA) formatted file
help Produce help message
version Display version information
getseq Retrieve sequences from a DIAMOND database file
dbinfo Print information about a DIAMOND database file
test Run regression tests
makeidx Make database index
General options:
--threads (-p) number of CPU threads # number of threads to run; can be set as high as the machine allows
--db (-d) database file # a DIAMOND-format database produced by diamond makedb
--out (-o) output file # name of the alignment output file
--outfmt (-f) output format # usually 6, the tabular format, the same as native BLAST
0 = BLAST pairwise
5 = BLAST XML
6 = BLAST tabular
100 = DIAMOND alignment archive (DAA)
101 = SAM
Value 6 may be followed by a space-separated list of these keywords:
qseqid means Query Seq - id
qlen means Query sequence length
sseqid means Subject Seq - id
sallseqid means All subject Seq - id(s), separated by a ';'
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
qseq_translated means Aligned part of query sequence (translated)
full_qseq means Query sequence full
qseq_mate means Query sequence of the mate
sseq means Aligned part of subject sequence
full_sseq means Subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive - scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive - scoring matches
qframe means Query frame
btop means Blast traceback operations(BTOP)
cigar means CIGAR string
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
skingdoms means unique Subject Kingdom(s), separated by a ';'
sphylums means unique Subject Phylum(s), separated by a ';'
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
qcovhsp means Query Coverage Per HSP
scovhsp means Subject Coverage Per HSP
qtitle means Query title
qqual means Query quality values for the aligned part of the query
full_qqual means Query quality values
qstrand means Query strand
Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
--verbose (-v) verbose console output
--log enable debug log
--quiet disable console output
--header Write header lines to blast tabular format.
Makedb options:
--in input reference file in FASTA format
--taxonmap protein accession to taxid mapping file
--taxonnodes taxonomy nodes.dmp from NCBI
--taxonnames taxonomy names.dmp from NCBI
Aligner options:
--query (-q) input query file
--strand query strands to search (both/minus/plus)
--un file for unaligned queries
--al file or aligned queries
--unfmt format of unaligned query file (fasta/fastq)
--alfmt format of aligned query file (fasta/fastq)
--unal report unaligned queries (0=no, 1=yes)
--max-target-seqs (-k) maximum number of target sequences to report alignments for (default=25)
--top report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
--max-hsps maximum number of HSPs per target sequence to report for each query (default=1)
--range-culling restrict hit culling to overlapping query ranges
--compress compression for output files (0=none, 1=gzip, zstd)
--evalue (-e) maximum e-value to report alignments (default=0.001)
--min-score minimum bit score to report alignments (overrides e-value setting)
--id minimum identity% to report an alignment
--query-cover minimum query cover% to report an alignment
--subject-cover minimum subject cover% to report an alignment
--fast enable fast mode
--mid-sensitive enable mid-sensitive mode
--sensitive enable sensitive mode)
--more-sensitive enable more sensitive mode
--very-sensitive enable very sensitive mode
--ultra-sensitive enable ultra sensitive mode
--iterate iterated search with increasing sensitivity
--global-ranking (-g) number of targets for global ranking
--block-size (-b) sequence block size in billions of letters (default=2.0)
--index-chunks (-c) number of chunks for index processing (default=4)
--tmpdir (-t) directory for temporary files
--parallel-tmpdir directory for temporary files used by multiprocessing
--gapopen gap open penalty
--gapextend gap extension penalty
--frameshift (-F) frame shift penalty (default=disabled)
--long-reads short for --range-culling --top 10 -F 15
--matrix score matrix for protein alignment (default=BLOSUM62)
--custom-matrix file containing custom scoring matrix
--comp-based-stats composition based statistics mode (0-4)
--masking enable tantan masking of repeat regions (0/1=default)
--query-gencode genetic code to use to translate query (see user manual)
--salltitles include full subject titles in DAA file
--sallseqid include all subject ids in DAA file
--no-self-hits suppress reporting of identical self hits
--taxonlist restrict search to list of taxon ids (comma-separated)
--taxon-exclude exclude list of taxon ids (comma-separated)
--seqidlist filter the database by list of accessions
--skip-missing-seqids ignore accessions missing in the database
Advanced options:
--algo Seed search algorithm (0=double-indexed/1=query-indexed/ctg=contiguous-seed)
--bin number of query bins for seed search
--min-orf (-l) ignore translated sequences without an open reading frame of at least this length
--freq-sd number of standard deviations for ignoring frequent seeds
--id2 minimum number of identities for stage 1 hit
--xdrop (-x) xdrop for ungapped alignment
--gapped-filter-evalue E-value threshold for gapped filter (auto)
--band band for dynamic programming computation
--shapes (-s) number of seed shapes (default=all available)
--shape-mask seed shapes
--multiprocessing enable distributed-memory parallel processing
--mp-init initialize multiprocessing run
--mp-recover enable continuation of interrupted multiprocessing run
--mp-query-chunk process only a single query chunk as specified
--ext-chunk-size chunk size for adaptive ranking (default=auto)
--no-ranking disable ranking heuristic
--ext Extension mode (banded-fast/banded-slow/full)
--culling-overlap minimum range overlap with higher scoring hit to delete a hit (default=50%)
--taxon-k maximum number of targets to report per species
--range-cover percentage of query range to be covered for range culling (default=50%)
--dbsize effective database size (in letters)
--no-auto-append disable auto appending of DAA and DMND file extensions
--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output
--stop-match-score Set the match score of stop codons against each other.
--tantan-minMaskProb minimum repeat probability for masking (default=0.9)
--file-buffer-size file buffer size in bytes (default=67108864)
--memory-limit (-M) Memory limit for extension stage in GB
--no-unlink Do not unlink temporary files.
--target-indexed Enable target-indexed mode
--ignore-warnings Ignore warnings
View options:
--daa (-a) DIAMOND alignment archive (DAA) file
--forwardonly only show alignments of forward strand
Getseq options:
--seq Sequence numbers to display.
Online documentation at http://www.diamondsearch.org
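The default column list shown in the help output (qseqid through bitscore) is what a script needs in order to read tabular results. Below is a minimal sketch of parsing one line of default `--outfmt 6` output into a dictionary; the function name `parse_hit` is hypothetical and the example line is made up for illustration.

```python
# Default columns of --outfmt 6, as listed in the help output
DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}

def parse_hit(line, columns=DEFAULT_COLUMNS):
    """Turn one tab-separated result line into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    hit = {}
    for name, value in zip(columns, values):
        if name in INT_COLUMNS:
            hit[name] = int(value)
        elif name in FLOAT_COLUMNS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit

# Made-up example line, not real DIAMOND output
example = "read1\tprotA\t98.5\t120\t2\t0\t1\t360\t5\t124\t1.2e-60\t230.0"
hit = parse_hit(example)
print(hit["qseqid"], hit["sseqid"], hit["bitscore"])
```

If a custom keyword list is passed to `--outfmt 6`, the same list would have to be passed as `columns`, and any additional numeric fields would stay as strings unless the two sets are extended.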

Help for the newer version

diamond --help
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
Syntax: diamond COMMAND [OPTIONS]
Commands:
makedb Build DIAMOND database from a FASTA file
prepdb Prepare BLAST database for use with Diamond
blastp Align amino acid query sequences against a protein reference database
blastx Align DNA query sequences against a protein reference database
cluster Cluster protein sequences
linclust Cluster protein sequences in linear time
realign Realign clustered sequences against their centroids
recluster Recompute clustering to fix errors
reassign Reassign clustered sequences to the closest centroid
view View DIAMOND alignment archive (DAA) formatted file
merge-daa Merge DAA files
help Produce help message
version Display version information
getseq Retrieve sequences from a DIAMOND database file
dbinfo Print information about a DIAMOND database file
test Run regression tests
makeidx Make database index
greedy-vertex-cover Compute greedy vertex cover
Possible [OPTIONS] for COMMAND can be seen with syntax: diamond COMMAND
Online documentation at http://www.diamondsearch.org

diamond makedb
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
Options:
--threads number of CPU threads
--verbose verbose console output
--log enable debug log
--quiet disable console output
--tmpdir directory for temporary files
--db database file
--in input reference file in FASTA format/input DAA files for merge-daa
--taxonmap protein accession to taxid mapping file
--taxonnodes taxonomy nodes.dmp from NCBI
--taxonnames taxonomy names.dmp from NCBI
--file-buffer-size file buffer size in bytes (default=67108864)
--no-unlink Do not unlink temporary files.
--ignore-warnings Ignore warnings
--no-parse-seqids Print raw seqids without parsing
Error: Missing parameter: database file (--db/-d)

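6. Filtering results

1. Filter by alignment score

diamond blastp -d database.fasta -q query.fasta -o output.m8 --min-score 100
# --min-score 100 keeps only alignments with a bit score of at least 100; it overrides the E-value setting.

2. Filter by expect value (E-value)

diamond blastp -d database.fasta -q query.fasta -o output.m8 --evalue 1e-5
# --evalue 1e-5 keeps only alignments with an E-value of 1e-5 or lower.

3. Filter by identity threshold

diamond blastp -d database.fasta -q query.fasta -o output.m8 --id 97
# --id 97 keeps only alignments with at least 97% identity.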
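The same three thresholds can also be applied after the search, to an output file that already exists, which avoids rerunning DIAMOND when only the cutoffs change. The sketch below assumes the default 12-column tabular layout; the function name `filter_hits` is hypothetical and the sample rows are made up. Unlike `--min-score`, which overrides the E-value setting inside DIAMOND, this version requires a row to pass every threshold that is given.

```python
import csv
import io

def filter_hits(rows, min_bitscore=None, max_evalue=None, min_pident=None):
    """Yield rows of default --outfmt 6 output that pass every given threshold."""
    for row in rows:
        # Columns 3, 11 and 12 (1-based) are pident, evalue and bitscore
        pident, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if min_bitscore is not None and bitscore < min_bitscore:
            continue
        if max_evalue is not None and evalue > max_evalue:
            continue
        if min_pident is not None and pident < min_pident:
            continue
        yield row

# Made-up example rows in the default 12-column layout
sample = (
    "q1\ts1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200.0\n"
    "q1\ts2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-3\t60.0\n"
    "q2\ts3\t97.5\t50\t1\t0\t1\t50\t1\t50\t1e-20\t110.0\n"
)
rows = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
kept = list(filter_hits(rows, min_bitscore=100, max_evalue=1e-5, min_pident=97))
print([row[1] for row in kept])
```

For a real file, `io.StringIO(sample)` would be replaced by an open file handle for the tabular output.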

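4. Keep one result per query

# Keep one line per query ID (column 1); the line that sort -u keeps is not guaranteed to be the best hit
sort -k1,1 -u output.m8 > unique_output.m8
# AWK alternative: keep the first line seen for each query ID, which is the top hit when hits are listed best-first
awk -F'\t' '!seen[$1]++' output.m8 > unique_output.m8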

5. Extract the best hit with a Python script

# Column layout of the default tabular output (--outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

best_hit = {}
# Open the DIAMOND blastp output file and read it line by line
with open('output.m8', 'r') as file:
    for line in file:
        if not line.strip():
            continue  # skip blank lines
        # Split on tabs and pair each value with its column name
        hit = dict(zip(COLUMNS, line.rstrip('\n').split('\t')))
        query_id = hit['qseqid']
        bit_score = float(hit['bitscore'])
        # Keep this hit if the query is new or the bit score beats the stored one
        if query_id not in best_hit or bit_score > best_hit[query_id]['bit_score']:
            best_hit[query_id] = {
                'subject_id': hit['sseqid'],
                'percent_identity': float(hit['pident']),
                'alignment_length': int(hit['length']),
                'e_value': float(hit['evalue']),
                'bit_score': bit_score
            }

# Print the best hit for each query
for query_id, hit_info in best_hit.items():
    print(f"Query ID: {query_id}")
    print(f"Subject ID: {hit_info['subject_id']}")
    print(f"Percent Identity: {hit_info['percent_identity']}")
    print(f"Alignment Length: {hit_info['alignment_length']}")
    print(f"E-value: {hit_info['e_value']}")
    print(f"Bit Score: {hit_info['bit_score']}")
    print("-------------")

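6. Extract the best hit with R

library(dplyr)
# Read the DIAMOND blastp output file (12 tab-separated columns by default)
data <- read.table("output.m8", header = FALSE, sep = "\t", quote = "", comment.char = "")
# Name the columns following the default --outfmt 6 layout
colnames(data) <- c("query_id", "subject_id", "percent_identity", "alignment_length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "e_value", "bit_score")
# For each query ID keep the row with the highest bit score; replace bit_score to rank by another criterion
best_hits <- data %>% group_by(query_id) %>% slice(which.max(bit_score))
# Show the best hits
print(best_hits)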
