Bulk-RNA-seq流程——从测序数据到count文件(AGSdata)|江阴雨辰互联

2023年7月3日发(作者：)

Bulk-RNA-seq流程——从测序数据到count⽂件（AGSdata）2022-2-11软件安装conda的安装和配置wget -c /miniconda/Miniconda3-latest-Linux-x86_d 777 Miniconda3-latest-Linux-x86_ #给执⾏权限bash Miniconda3-latest-Linux-x86_ #运⾏当命令⾏前⾯出现(base)的时候说明现在已经在conda的环境中了conda config --add channels biocondaconda config --add channels conda-forge 清华镜像，哈⼯⼤镜像conda config --add channels /anaconda/pkgs/free/conda config --add channels /anaconda/pkgs/main/conda config --add channels /anaconda/cloud/conda-forge/conda config --add channels /anaconda/cloud/bioconda/conda config --add channels /anaconda/cloud/conda config --add channels /anaconda/pkgs/free/conda config --add channels /anaconda/pkgs/main/创建名为rna的环境变量：conda create -n rna python=2（许多软件依赖python2环境）环境退出：conda deactivate#创建名为rna的环境变量,许多软件依赖python2环境conda create -n rna python=2To activate this environment, use## $ conda activate rna## To deactivate an active environment, use## $ conda deactivateconda安装⽣信软件conda activate rna# 很多软件需要python版本较低conda install multiqc

conda install trim-galoreconda install hisat2conda install samtoolsconda install RSeQC测序数据处理数据下载axel -a "下载链接"质控QCfastqc⽣成质控报告，multiqc将各个样本的质控报告整合为⼀个。#! /bin/bash# Used for RNA-seq data by FastQC and multiQC#Usage# fastqc -o [output dir] --(no)extract -f [fastq|bam|sam] -c [contaminant file] seqfile1 .. seqfileN#简写代码：fastqc -t 8 -o seqfile1cd /mnt/AGS_RNA-seq/raw_datals *gz | xargs fastqc -t 64multiqc ./

质控报告 Basic Statistics 从read⽔平来总览，判断测序质量。Encoding ：测序平台的版本，因为不同版本的 error p的计算⽅法不⼀样。Total sequence：测序深度。⼀共测序的read数。是质量分析的主要参数。Sequence length：测序长度。%GC：GC碱基含量⽐，⼀般是物种特异性，⽐如⼈类是42%左右。

Perbase sequence quality 横坐标：第1-100个测序得到的碱基纵坐标：测序质量评估。这⾥的Q=-10*lg10(error P),即20%代表1%的错误读取率，30%代表0.1%的错误读取率箱型图：红⾊线，是某个顺序下测序碱基所有测序质量的中位数。黄⾊块，是测序质量在25%-75%区域。蓝⾊线，平均数。⼀般要求：测序箱型图10%的线⼤于Q=20。Q20过滤法。per tail sequence quality 横坐标：同上。纵坐标：tail的index编号。⽬的：防⽌测序过程中某些tail受不可控因素测序质量低。标准：蓝⾊表⽰质量⾼，浅⾊或暖⾊表⽰质量低，后续的分析可以去除低质量tail

trim_galore去除低质量readstrim_galore的参数trim_galore [options] --quality #设定phred quality阈值。默认20（99%的read质量），如果测序深度较深，可以设定25--phred33 #设定记分⽅式，代表Q+33=ASCII码的⽅式来记分⽅式--paired # 对于双端结果，⼀对reads中若⼀个read因为质量或其他原因被抛弃，则对应的另⼀个read也抛弃。--output_dir #输出⽬录--length #设定长度阈值，⼩于此长度会被抛弃--strency #设定可以忍受的前后adapter重叠的碱基数，默认是1-e #设定默认质量控制数，默认是0.1，即ERROR rate⼤于10%的read 会被舍弃，如果添加来--paired参数则会舍弃⼀对reads这⼩东西还挺贼，安装是连字符，运⾏就成了下划线#! /bin/bashcd /mnt/AGS_RNA-seq/clean_datals /mnt/AGS_RNA-seq/raw_data/*. >1ls /mnt/AGS_RNA-seq/raw_data/*. >2paste 1 2 > configdir='/mnt/AGS_RNA-seq/clean_data'cat config |while read iddo arr=(${id}) fq1=${arr[0]} fq2=${arr[1]}

trim_galore -q 20 --phred33 --length 36 --stringency 3 --paired $fq1 $fq2 -o $dir

done

数据⽐较⼤，2.5G的⼤概要⼗来分钟

再次QC检验cleandata#! /bin/bashls /mnt/AGS_RNA-seq/clean_data/*gz | xargs fastqc -t 64multiqc ./

mapping

常⽤软件有star和HISAT2下载索引⽂件cd /mnt/AGS_RNA-seq/index_hisat2axel /hisat/grch38_ zxvf grch38_38_tran/grch38_tran/genome_2grch38_tran/genome_2grch38_tran/genome_2grch38_tran/genome_2grch38_tran/make_grch38_38_tran/genome_2grch38_tran/genome_2grch38_tran/genome_2grch38_tran/genome_2HISAT2⽤法HISAT2 version 2.1.0 by Daehwan Kim (infphilo@, /people/infphilo)Usage: hisat2 [options]* -x {-1 -2 | -U | --sra-acc } [-S ] Index filename prefix (minus trailing .2). Files with #1 mates, paired with files in . Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). Files with #2 mates, paired with files in . Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). Files with unpaired reads. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654. File for SAM output (default: stdout) , , can be comma-separated lists (no whitespace) and can be specified many times. E.g. '-U , -U '.

HISAT2参数 -x 参考基因组索引⽂件的前缀（⽬录及⽂件前缀） -1 双端测序结果的第⼀个⽂件。若有多组数据，使⽤逗号将⽂件分隔。Reads的长度可以不⼀致。 -2 双端测序结果的第⼆个⽂件。若有多组数据，使⽤逗号将⽂件分隔，并且⽂件顺序要和-1参数对应。Reads的长度可以不⼀致。 -U 单端数据⽂件。若有多组数据，使⽤逗号将⽂件分隔。可以和-1、-2参数同时使⽤。Reads的长度可以不⼀致。 –sra-acc 输⼊SRA登录号，⽐如SRR353653，SRR353654。多组数据之间使⽤逗号分隔。HISAT将⾃动下载并识别数据类型，进⾏⽐对。 -S 指定输出的SAM⽂件（⽬录及名称）。 -t/--time print wall-clock time taken by search phases -p/--threads number of alignment threads to launch 指定进程数⽬（8/16）

2022-2-15HISAT2对⽐#使⽤代码：hisat2 -p 6 -x

-1 seq_val_ -2 seq_val_ -S #-p #多线程数#-x #参考基因组索引⽂件⽬录和前缀#-1 #双端测序中⼀端测序⽂件#-2 #同上#-S #输出的sam⽂件hisat#fastq⽂件名称最好统⼀，id_R1/，前缀id可⽤于批量for循环使⽤#! /bin/bashfor i in {61,62,63,191,193,194}dotimetimedonehisat2 -p 16 -x /mnt/AGS_RNA-seq/index_hisat2/grch38_tran/genome_tran -1 /mnt/AGS_RNA-seq/clean_data/${i}_ -2 /mnt/AGS_RNA-seq/clean_data/${HISAT2结果6-1：28870408 reads; of these: 28870408 (100.00%) were paired; of these: 28870408 (100.00%) were paired; of these: 2684635 (9.30%) aligned concordantly 0 times 25375575 (87.89%) aligned concordantly exactly 1 time 810198 (2.81%) aligned concordantly >1 times ---- 2684635 pairs aligned concordantly 0 times; of these: 118501 (4.41%) aligned discordantly 1 time ---- 2566134 pairs aligned 0 times concordantly or discordantly; of these: 5132268 mates make up the pairs; of these: 4280639 (83.41%) aligned 0 times 791002 (15.41%) aligned exactly 1 time 60627 (1.18%) aligned >1 times92.59% overall alignment rate6-2：25730480 reads; of these: 25730480 (100.00%) were paired; of these: 2367386 (9.20%) aligned concordantly 0 times 22654037 (88.04%) aligned concordantly exactly 1 time 709057 (2.76%) aligned concordantly >1 times ---- 2367386 pairs aligned concordantly 0 times; of these: 97370 (4.11%) aligned discordantly 1 time ---- 2270016 pairs aligned 0 times concordantly or discordantly; of these: 4540032 mates make up the pairs; of these: 3716500 (81.86%) aligned 0 times 766389 (16.88%) aligned exactly 1 time 57143 (1.26%) aligned >1 times92.78% overall alignment rate6-3：25872315 reads; of these: 25872315 (100.00%) were paired; of these: 5751525 (22.23%) aligned concordantly 0 times 19520929 (75.45%) aligned concordantly exactly 1 time 599861 (2.32%) aligned concordantly >1 times ---- 5751525 pairs aligned concordantly 0 times; of these: 83123 (1.45%) aligned discordantly 1 time ---- 5668402 pairs aligned 0 times concordantly or discordantly; of these: 11336804 mates make up the pairs; of these: 10683171 (94.23%) aligned 0 times 604057 (5.33%) aligned exactly 1 time 49576 (0.44%) aligned >1 times79.35% overall alignment rate19-1：32784004 reads; of these: 32784004 (100.00%) were paired; of these: 7432679 (22.67%) aligned concordantly 0 times 24487883 (74.69%) aligned concordantly exactly 1 time 863442 (2.63%) aligned concordantly >1 times ---- 7432679 pairs aligned concordantly 0 times; of these: 173136 (2.33%) aligned discordantly 1 time ---- 7259543 pairs aligned 0 times concordantly or discordantly; of these: 14519086 mates make up the pairs; of these: 13602303 (93.69%) aligned 0 times 815692 (5.62%) aligned exactly 1 time 101091 (0.70%) aligned >1 times79.25% overall alignment rate19-3：36814687 reads; of these: 36814687 (100.00%) were paired; of these: 11301595 (30.70%) aligned concordantly 0 times 24658957 (66.98%) aligned concordantly exactly 1 time 854135 (2.32%) aligned concordantly >1 times ---- 11301595 pairs aligned concordantly 0 times; of these: 141457 (1.25%) aligned discordantly 1 time ---- 11160138 pairs aligned 0 times concordantly or discordantly; of these: 22320276 mates make up the pairs; of these: 21538096 (96.50%) aligned 0 times 709870 (3.18%) aligned exactly 1 time 72310 (0.32%) aligned >1 times70.75% overall alignment rate19-4：35366354 reads; of these: 35366354 (100.00%) were paired; of these: 3803396 (10.75%) aligned concordantly 0 times 30536937 (86.34%) aligned concordantly exactly 1 time 1026021 (2.90%) aligned concordantly >1 times ---- 3803396 pairs aligned concordantly 0 times; of these: 133396 (3.51%) aligned discordantly 1 time ---- 3670000 pairs aligned 0 times concordantly or discordantly; of these: 7340000 mates make up the pairs; of these: 6186276 (84.28%) aligned 0 times 1068713 (14.56%) aligned exactly 1 time 85011 (1.16%) aligned >1 times91.25% overall alignment rate⽐对率6-192.59%6-292.78%6-379.35%19-179.25%19-370.75%19-491.25% ⽐对率：⼈类基因组的⽐对率期望值是70-90%SAM转BAMbam格式更利于存储，占⽐⼩，sam格式体积的1/6软件：SAMtoolsProgram: samtools (Tools for alignments in the SAM format)Version: 1.7 (using htslib 1.7)Usage: samtools [options]Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA index index alignment -- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags markdup mark duplicates -- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA -- Statistics bedcov read depth per BED region depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck) -- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM使⽤参数：#samtools view、sort、index和flagstat，对sam⽂件批量处理#samtools view：格式转换#Usage: samtools view [options] <> | <> | <> []#-@：线程数， -S输⼊sam⽂件，-1b使⽤快速压缩⽐⽣成bam⽂件，-o设定标准输出⽂件名#简写代码：samtools view -@ 8 -S {$i}.sam -1b -o {$i}.bam#samtools sort：排序#Usage: samtools sort [options..] []#-@线程数，-l设置压缩⽐为5，-o设定标准输出⽂件名，最后⼀个为输⼊.bam⽂件，该命令默认为染⾊体位置排序，不影响后期操作。#简写代码：samtools sort -@ 8 -l 5 -o {$i}. {$i}.bam#index建⽴索引#Usage: samtools index [-bc] [-m INT] <> <>#-@线程数，<>输⼊bam⽂件，<>输出index⽂件。#简写代码：samtools index -@ 8 {$i}. {$i}.#flagstat查看⽐对情况#Usage: samtools flagstat [options] <>#-@线程数8，对于输出内容重定向写⼊>指定⽂件#简写代码：samtools flagstat -@ 8 {$i}. > {$i}.at————————————————版权声明：本⽂为CSDN博主「垚垚爸爱学习」的原创⽂章，遵循CC 4.0 BY-SA版权协议，转载请附上原⽂出处链接及本声明。原⽂链接：/xiaomotong123/article/details/106856233具体代码：#! /bin/bashfor i in {61,62,63,191,193,194}dosamtools view -@ 16 -S /mnt/AGS_RNA-seq/sam_aligned/${i}.sam -1b -o /mnt/AGS_RNA-seq/bam/${i}.bamsamtools sort -@ 16 -l 9 -o /mnt/AGS_RNA-seq/bam_sort/${i}. /mnt/AGS_RNA-seq/bam/${i}.bamsamtools index -@ 16 /mnt/AGS_RNA-seq/bam_sort/${i}. /mnt/AGS_RNA-seq/bam_index/${i}.amtools flagstat -@ 16 /mnt/AGS_RNA-seq/bam_sort/${i}. > /mnt/AGS_RNA-seq/scripts_log/${i}.attimedone注意view参数-1b，是数字1，不是字母l

耗时：⼀个25G的sam⽂件整个流程⼤概15mins，我这⾥6个sam⼤概160G，总耗时约2hrBAM⽂件质控（可选）2022-2-16软件RSeQC RNA-Seq质控⼯具，多个python包，⽤于评估RNA-seq的数据：检查序列质量、核酸组分偏性、PCR偏性、GC含量偏性，还有RNA-seq特异性模块：评估测序饱和度、映射读数分布、覆盖均匀性、链特异性、转录⽔平RNA完整性等参数#⽐对情况Usage: bam_ [options]Summarizing mapping statistics of a BAM or SAM s: --version show program's version number and exit -h, --help show this help message and exit -i INPUT_FILE, --input-file=INPUT_FILE Alignment file in BAM or SAM format. -q MAP_QUAL, --mapq=MAP_QUAL Minimum mapping quality (phred scaled) to determine "uniquely mapped" reads. default=30#基因组覆盖率Usage: read_ [options]Check reads distribution over exon, intron, UTR, intergenic ... etcThe following reads will be skipped: qc_failed PCR duplicate Unmapped Non-primary (or secondary)Options: --version show program's version number and exit -h, --help show this help message and exit -i INPUT_FILE, --input-file=INPUT_FILE Alignment file in BAM or SAM format. -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL Reference gene model in bed format.#需要bed⽂件#下载bed⽂件/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_/download#本地下载解压缩，将bed⽂件上传服务器#这个下载链接有时候没反应，科学上⽹后就能正常⽤法for i in {61,62,63,191,193,194}dobam_ -i /mnt/AGS_RNA-seq/bam_sort/${i}. > /mnt/AGS_RNA-seq/scripts_log/${i}_eread_ -i /mnt/AGS_RNA-seq/bam_sort/${i}. -r /mnt/AGS_RNA-seq/BED_file/hg38_ > /mnt/AGS_RNA-seq/scripts_log/${i}_distribCount：FeaturecountsRNA-seq的count有许多可选的软件，HTseq、Stringtie、featurecounts，看教程⾥HTseq和featurecounts的结果差异不⼤，但HTseq耗时太久，所以选择featurecounts进⾏下⼀步安装conda install subread⽂件准备输⼊⽂件需要⼀个注释⽂件和bam_sorted#下载注释⽂件cd /mnt/AGS_RNA-seq/GTFaxel /pub/current_gtf/homo_sapiens/Homo_ -d Homo_使⽤for i in {61,62,63,191,193,194}dodone# -T 使⽤的线程数# -p 如果是paird end 就⽤# -t 将exon作为⼀个feature# -g 将gene_id作为⼀个feature# -a 参考的gtf/gff# -o 输出⽂件# 最后加上bam⽂件，注意bam⽂件要为sorted#全部合在⼀个counts⽂件featureCounts -T 16 -p -t exon -g gene_id -a /mnt/AGS_RNA-seq/GTF/Homo_ -o /mnt/AGS_RNA-seq/featureCounts/${i}. /mnt/featureCounts -T 16 -p -t exon -g gene_id -a /mnt/AGS_RNA-seq/GTF/Homo_ -o /mnt/AGS_RNA-seq/featureCounts/ /mnt/A其他参数输出结果 ========== _____ _ _ ____ _____ ______ _____

===== / ____| | | | _ | __ | ____| / | __

===== | (___ | | | | |_) | |__) | |__ / | | | | ==== ___ | | | | _ <| _ /| __| / / | | | | ==== ____) | |__| | |_) | | | |____ / ____ | |__| | ========== |_____/ ____/|____/|_| _______/_/ ______/ v2.0.1//========================== featureCounts setting ===========================|| |||| Input files : 1 BAM file |||| o |||| |||| Output file : |||| Summary : y |||| Annotation : Homo_ (GTF) |||| Dir for temp files : /mnt/AGS_RNA-seq/featureCounts |||| |||| Threads : 16 |||| Level : meta-feature level |||| Paired-end : yes |||| Multimapping reads : not counted |||| Multi-overlapping reads : not counted |||| Min overlapping bases : 1 |||| |||| Chimeric reads : counted |||| Both ends mapped : not required |||| ||============================================================================////================================= Running ==================================|| |||| Load annotation file Homo_ ... |||| Features : 1552026 |||| Meta-features : 61541 |||| Chromosomes/contigs : 47 |||| |||| Process BAM file ... |||| Paired-end reads are included. |||| Total alignments : 30720085 |||| Successfully assigned alignments : 22645654 (73.7%) |||| Running time : 0.52 minutes |||| |||| Write the final count table. |||| Write the read assignment summary. |||| |||| Summary of counting results can be found in file "/mnt/AGS_RNA-seq/featur |||| eCounts/y" |||| ||============================================================================////================================= Running ==================================|| |||| Load annotation file Homo_ ... |||| Features : 1552026 |||| Meta-features : 61541 |||| Chromosomes/contigs : 47 |||| |||| Process BAM file ... |||| Paired-end reads are included. |||| Total alignments : 27504748 |||| Successfully assigned alignments : 19565202 (71.1%) |||| Running time : 0.40 minutes |||| |||| |||| Write the final count table. |||| Write the read assignment summary. |||| |||| Summary of counting results can be found in file "/mnt/AGS_RNA-seq/featur |||| eCounts/y" |||| ||============================================================================////================================= Running ==================================|| |||| Load annotation file Homo_ ... |||| Features : 1552026 |||| Meta-features : 61541 |||| Chromosomes/contigs : 47 |||| |||| Process BAM file ... |||| Paired-end reads are included. |||| Total alignments : 27341548 |||| Successfully assigned alignments : 17090457 (62.5%) |||| Running time : 0.37 minutes |||| |||| Write the final count table. |||| Write the read assignment summary. |||| |||| Summary of counting results can be found in file "/mnt/AGS_RNA-seq/featur |||| eCounts/y" |||| ||============================================================================////================================= Running ==================================|| |||| Load annotation file Homo_ ... |||| Features : 1552026 |||| Meta-features : 61541 |||| Chromosomes/contigs : 47 |||| |||| Process BAM file ... |||| Paired-end reads are included. |||| Total alignments : 34769594 |||| Successfully assigned alignments : 21749916 (62.6%) |||| Running time : 0.89 minutes |||| |||| Write the final count table. |||| Write the read assignment summary. |||| |||| Summary of counting results can be found in file "/mnt/AGS_RNA-seq/featur |||| eCounts/y" |||| ||============================================================================////================================= Running ==================================|| |||| Load annotation file Homo_ ... |||| Features : 1552026 |||| Meta-features : 61541 |||| Chromosomes/contigs : 47 |||| |||| Process BAM file ... |||| Paired-end reads are included. |||| Total alignments : 38580695 |||| Successfully assigned alignments : 22749538 (59.0%) |||| Running time : 0.78 minutes |||| Running time : 0.78 minutes |||| |||| Write the final count table. |||| Write the read assignment summary. |||| |||| Summary of counting results can be found in file "/mnt/AGS_RNA-seq/featur |||| eCounts/y" |||| ||============================================================================////================================= Running ==================================|| |||| Load annotation file Homo_ ... |||| Features : 1552026 |||| Meta-features : 61541 |||| Chromosomes/contigs : 47 |||| |||| Process BAM file ... |||| Paired-end reads are included. |||| Total alignments : 37682170 |||| Successfully assigned alignments : 27515658 (73.0%) |||| Running time : 0.56 minutes |||| |||| Write the final count table. |||| Write the read assignment summary. |||| |||| Summary of counting results can be found in file "/mnt/AGS_RNA-seq/featur |||| eCounts/y" |||| ||============================================================================//Stringtie#stringtie为转录本拼接和定量的软件#利⽤stringtie分别输出gene count数⽬和FPKM值，gene count数⽤户基因差异分析，FPKM值为基因表达值，⽤于heatmap等绘制#Usage: stringtie [options]#aligned_ 是输⼊⽂件，该输⼊⽂件要求必须按基因组位置排序#参数-o []<>，设置StringTie组装转录本的输出GTF⽂件的路径和⽂件名。#参数-p 指定组装转录本的线程数（CPU）#参数-G 使⽤参考注释基因⽂件指导组装过程，格式GTF/GFF3。输出⽂件中既包含已知表达的转录本，也包含新的转录本。#参数-B 应⽤该选项，则会输出Ballgown输⼊表⽂件（* .ctab），其中包含⽤-G选项给出的参考转录本的覆盖率数据。#参数-A 输出基因丰度的⽂件（制表符分隔格式）#简写代码：stringtie {$i}. -o path/ -p 16 -G -eB -A path/#此次主要输出两个⽂件：gtf⽂件⽤于转换gene count⽂件；tab⽂件包含FPKM数值。————————————————版权声明：本⽂为CSDN博主「垚垚爸爱学习」的原创⽂章，遵循CC 4.0 BY-SA版权协议，转载请附上原⽂出处链接及本声明。原⽂链接：/xiaomotong123/article/details/106856233#参数-e 限制reads⽐对的处理，仅估计和输出与⽤-G选项给出的参考转录本匹配的组装转录本。使⽤该选项，则会跳过处理与参考转录本不匹配的组装转录本，这将⼤for i in {61,62,63,191,193,194}domkdir /mnt/AGS_RNA-seq/Stringtie/${i}donestringtie /mnt/AGS_RNA-seq/bam_sort/${i}. -o /mnt/AGS_RNA-seq/Stringtie/${i}/${i}.gtf -p 16 -G /mnt/AGS_RNA-seq/GTF/Homo_38.105.g

发布者：admin，转转请注明出处：http://www.yc00.com/news/1688383655a129862.html