Manual

Pipeline Manual
Source file path: ../INSTALL/
1 Preparation
============================================================
Data preparation
a. Assembly result(all the files):
Sequence files: .scafSeq and .contig
Note: For the sequence started with C in the ./scafSeq, we provide SV detection but without pair-end information it will be validated by split read.
b. Reference: (including the lastz indexed files, if not see below)
Note: When the reference sequence is large than 300 M base pairs, it is suggested to be indexed for reference in lastz. Otherwise, for bacteria genome, run lastz directly.
c. Alignment result (gap-align, bam format)

Lastz indexing:
In the directory of reference:
for i in `ls prefix*`; do srcPATH/lastz $i[unmask][multi] –writecapsule=$i.capsule –seed=12of19 –word=31 –step=5; done
ls /FULLPATH/ref/*capsule >capsule.list
Note:

srcPATH is the path of the INSTALL directory of the scripts file.
If the prefix of chromosome from reference is not ‘prefix’, please modify the command.
Lastz’s parameter target[unmask][multi],unmask means changing all the lowercase to uppercase in case of ignoring the repeat sequence and multi means multiple sequence as target. Seed: 12of19, 14of22, match<length>, half<length>, <pattern>, the default is 12of19.  Word: <=31, the large number in the hash string, default is 28,the smaller number the smaller memory.  Step: default is 1
For more information about lastz:

http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.01.50/README.lastz-1.01.50.html

 

Programs and scripts preparation
Download from: http://yh.genomics.org.cn/download.jsp

Storage preparation
Creat a directory such as prj_sv_renal

Note: the final directory would be
Prj_sv_renal/
|_ref/
|_asm/
|_ seperate
|_blat/
|_blat/
|_list/
|_fa/
|_ …
|_result/
|_prefix01/
|_prefix02/
|_……
|_novel/
|_process/
|_sv/

2 Blat alignment: input the reference sequence and the assembly sequence ,output is all.psl

============================================================
Creat the ref directory , ln or ln –s or cp the ref sequence and generate a file name ref.fa.list
(ls *fa > ref.fa.list) and the lastz indexed files and capsule.list
at prj_sv_renal
mkdir ref
cp or ln –s the files…
Creat the asm directory,for assembly result and preparation of blat step
at prj_sv_renal
mkdir asm
cd asm/
ln –s assembly-result/human* . #files from assembly result
srcPATH/fa_stat.pl human.scafSeq >human.scafSeq.fastat
rm human.newContigIndex #rebuilt this file
srcPATH/Read2Scaff scaff -g human >run.log #a lot of memory

at asm
mkdir seperate
srcPATH/fasta_seperator human.scafSeq human seperate/ 20000000
cd seperate
ls `pwd`/* >fa.list

Generate the blat script
at prj_sv_renal
mkdir blat
cd blat
mkdir blat
cd blat

cp srcPATH /blat_scriptmaker.* ./
./blat_scriptmaker /FULLPATH/../asm/separate/fa.list
/FULLPATH/…/ref/ref.fa.list `pwd` >run.sh

#Seperate the run.sh’s line into many files and run them on multiple computer
nodes.That will be faster.
#After running:
cat chr*psl >all.psl
#do not use cat * > all.psl
# blat is finished

3. Lastz alignment: run it for two parts and combine the result together, input is assembly results and all.psl form blat, output is all.axt
============================================================
For the aligned scaffolds in blat:
at prj_sv_renal/blat

#cp srcPATH/lastz_scriptmaker.cpp ./
#open lastz_scriptmaker.cpp ,modify ‘-ydrop=50000’,and g++ again
#Open .cpp file to modify targetcapsule value for your capsule.list,if needed.
g++ lastz_scriptmaker.cpp –o lastz_scriptmaker
#the two parameter of the scripts are assembly sequence , lastz output directory.
./ lastz_scriptmaker /FULLPATH/../asm/separate/fa.list `pwd` >run.sh
run run.sh
#Seperate the run.sh’s line into many files and run them on multiple computer
nodes.That will be faster.
#memory is about 1.9 G

#Check the result of lastz: if markend parameter is used ,then inside the lastz output files .axt ,the last line should be ‘# lastz end-of-file’ if successfully finished, otherwise another text.
at result/
tail -1 */*axt | perl -e ‘while ($i=<>){chomp $i; next if $i=~/^$/;chomp($j=<>); next
if $j=~/^$/; $i=~ /==> (.*) <==/;print “$1\n” unless $j=~/^#/;}’

For rerun them:

grep –f `tail -1 */*axt | perl -e ‘while ($i=<>){chomp $i; next if $i=~/^$/;chomp($j=<>); next if
$j=~/^$/; $i=~ /==> (.*) <==/;print “$1\n” unless $j=~/^#/;}’
` run.sh > rerun.sh

For the un-aligned scaffolds in blat
at prj_sv_renal
mkdir novel
cd novel
cp srcPATH/pipeline.sh .

sh pipeline.sh /FULLPATH/../asm/uhman.scafSeq /FULLPATH/../blat
/FULLPATH/../novel/ |sh

#open scriptmaker_new.cpp ,modify ‘-ydrop=50000’,and g++ again
g++ novel_scriptmaker.cpp –o novel_scriptmaker

./novel_scriptmaker /FULLPATH/../asm/human.scafSeq /FULLPATH/../result/
human >run.sh
#the three parameter of the scripts are assembly sequence, lastz output directory and job name.

run run.sh
#Seperate the run.sh’s line into many files and run them on multiple computer nodes.That will be faster.

#memory is about 1.9 G

4. SV Calling: input all.axt, assembly results, output is SV files(indel, inversion)
============================================================

Merge two parts lastz result
at prj_sv_renal
mkdir process
cat /FULLPATH/../blat/*axt >all.axt
cat /FULLPATH/../novel/result/*axt >>all.axt

call sv : key steps
at prj_sv_renal/process
cp srcPATH/iter.sh .
sh iter.sh all.axt /FULLPATH/../asm/human.scafSeq.fastat human
/FULLPATH/../asm/human.scafSeq > run.sh

# memory is 30G for human and should be finish within an hours
# sv calling is finished

5. Validation
============================================================

soap.coverage : using soap aligner to align the reads for short libraries and soap.coverage to get
the depth result
at prj_sv_renal
mkdir cvg; cd cvg; mkdir ref; mkdir scaf

SOAP pipeline:
reads vs ref
soap -p 8 -l 32 -v 1 -a 1.fq -b 2.fq -D ref.index -o out.soap -2 out.single -m min_ins -x
max_ins

reads vs scaf
soap -p 8 -l 32 -v 1 -a 1.fq -b 2.fq -D scaf.index -o out.soap -2 out.single -m min_ins
-x max_ins

in scaf:
single:
srcPATH/soap.coverage -cvg -il single.list -o single.o -depthsingle single.ds -plot
single.plot 0 250 -precise -onlyuniq -p 8 -nowarning -refsingle ../../asm/*scafSeq

soap:
srcPATH/soap.coverage -cvg -il soap.list -o soap.o -depthsingle soap.ds -plot
soap.plot 0 250 -precise -onlyuniq -p 8 -nowarning -refsingle ../../asm/*scafSeq

srcPATH/singlespratio single.ds soap.ds > spratio

in ref:
# similar to the steps of scaf,modify -refsingle to -reffastat
single :
srcPATH/soap.coverage -cvg -il single.list -o single.o -depthsingle single.ds -plot
single.plot 0 250 -precise -onlyuniq -p 8 -nowarning -reffastat
/FULLPATH/../ref/human.fastat

#human.fastat can be generate by srcPATH/fa_stat.pl

soap:
srcPATH/soap.coverage -cvg -il soap.list -o soap.o -depthsingle soap.ds -plot
soap.plot 0 250 -precise -onlyuniq -p 8 -nowarning –reffastat
/FULLPATH/../ref/human.fastat

srcPATH/singlespratio single.ds soap.ds > spratio

Note: parameter of soap.covrage
-cvg choose the coverage mode
-il list files of SOAP result
-o output file name
-depthsingle coverage file
-plot [filename] [x-axis lower] [x-axis upper] output the number of postion of certain
depth
-precise precise counting — ignore mismatch in SOAP
-onlyuniq using the uniq mapping reads
-nowarning nowaring information
-reffastat fasta files of assembly result
-refsingle fasta files of reference

Note : the single.list and soap.list are coming from SOAP align result for the sample
separately. In the single.list, each line for one absolute path of one out.single file.

More detail seeing:srcPATH/soap.coverage

SV validation:
at prj_sv_renal
mkdir sv
cd sv
cat ../process/11.intro_indels/*. intro_indels >prefix.intro_indels.
perl –ne ‘chomp;split;$len=length($_[7]);if($len>50){print “$_\n”;}’
prefix.intro_indels > prefix.intro_indels.long;
awk ‘{print $1,$3,$4}’ prefix.intro_indels.long > prefix.intro_indels.long.scaflist
nohup srcPATH/scafcomplex805 /FULLPATH/../asm/human.scafSeq.fastat
prefix.intro_indels.long.scaflist /FULLPATH/../asm/readonscaf/ >scresult 2>scerror
&

copy five shell :
at sv/
cp srcPATH/indel_validation/sh/* .
#before you run these commands , open the .sh file and modify the srcPATH into
your full path.
#perl module for chisquare is needed and modify the full path in
srcPATH/indel_validation/indel_validation.pl
sh 1.pick_range.sh 50 prefix.intro_indels.long | sh #files of indel and inversion
sh 2.readspratio.sh /FULLPATH/../cvg/hg18/spratio
/FULLPATH/../cvg/scaf/spratio
sh 3.process_picked.sh
sh 4.match_files.sh

at cvg/
for i in `ls */*plot`; do awk ‘{if($1!=0&&$2>total){mark=$1;total=$2}}END{print
file”\t”mark”\t”total}’ file=$i $i; done (four lines will be outputted and the second
column will be used in 5.validated.sh, the four number we called: a,b,c,d responding
to ref single, ref soap, scaf single, scaf soap)

sh 5.validate.sh a b c d depth_limit

srcPATH/sv_validate_combinator prefix.intro_indels.long insertion.validated
deletion.validated scresult 0 all.validated

sv_validate_combinator parameter:
rgv1 SV list
scaffold81447 Deletion 43 43 chr1 141814174 141814175 G +
scaf_name type s_start s_end ref_name r_start r_end
segment strand
argv2 S/P ratio insertion validated
insertion0 chr17 33150964 33150964 scaffold93066 387 390
type ref_name r_start r_end scaf_name s_start s_end
argv3 S/P ratio deletion validated
deletion0 chrX 152679503 15267950 C119122584 42 42
type ref_name r_start r_end scaf_name s_start s_end
argv4 Scaffolding Complexity output
1 scaffold1 3341 3341 5730 3041 3641 6 6705
0.25 1.72 1.05 0.17 0.71
avai_mark scaf_name s_start s_end s_max_len ext_s ext_e ext_type acc. ener. entr.
contr. corr. loc.
argv5 Huge_SV Scaffolding Complexity output (0 to bypass)
1 scaffold1 3341 3341 5730 3041 3641 6 6705
0.25 1.72 1.05 0.17 0.71
avai_mark scaf_name s_start s_end s_max_len ext_s ext_e ext_type acc. ener. entr.
contr. corr. loc.
argv6 Results output

 

mkdir validated
cd validated
ln ../all.validated ./
cp srcPATH/callSV/mk_list.sh .
sh mk_list.sh all.validated prefix (Indels which pass the SP ratio will be consider to be
confident.)
perl –ne ‘chomp;print “$_\tNULL\tNULL\n”
BGI_prefix_confident_indels >../prefix.confident.indels

# validation is finish for long SV

For the SV shorter than 50 bps:
We first aligned the short libraries reads to the human reference by bwa allowing gap
opened and then generated a sample level bam file(sample.bam).
cd prj_sv_renal/sv/
perl –ne ‘chomp;split; $len=length($_[7]);if($len<=50){print
“$_[1]\t$_[0]\t$_[2]\t$_[3]\t$_[4]\t$_[5]\t$_[6]\t$len\t$_[7]\n”;}’ prefix.intro_indels >
prefix.intro_indels.short;
perl srcPATH/indel_validation/indel_validationForShort_v2.pl
prefix.intro_indels.short sample.bam ref.fa prefix.intro_indels.short.validated.
(Splitting the prefix. intro_indels.short into pieces can accelerate the procedure. Two
columns will be added to the end of each line: # of reads supporting the indel and # of
proper pair reads supporting the indel) awk ‘$10>=4 || $11>=1’ prefix.intro_indels.short.validated >> prefix.confident.indels;
srcPATH/msort -k m5 -k n6 prefix.confident.indels > prefix.confident.indels.tmp;
mv prefix.confident.indels.tmp prefix.confident.indels.

# End of the document

 

 

No responses yet