Genome assembly workflows for NGS

September 28, 2015, 7:24 am

Hello,
I'm trying to assemble a genome and I'd like to know whether the following workflows are correct:
1) fastqc - velveth - velvetg - mauve ordering with a reference genome - mauve metrics
2) fastqc - velveth - velvetg - reapr to assess quality and close gaps - mauve ordering with a reference genome - mauve metrics
3) fastqc - abyss - mummer - reapr.

Can you give me any suggestions about the above workflows?
Do you know any other efficient workflow to assemble and evaluate a bacterial genome?

Thanks in advance

↧

Miseq Error: No usable signal found in the images; it is possible that clustering has

September 28, 2015, 7:36 am

Hello all,

I wanted to reach out to see if anyone has experienced the error described below. The run appears to have failed shortly after it started and did not take any images of clustering. I was worried that the sample may not have denatured optimally before I added the HT1 buffer to neutralize the solution, but even if that was the case I feel SOME clustering could have occurred.

Error:
"No usable signal found in the images; it is possible that clustering has failed"

↧

correlation in per base sequence quality profiles of multiplexed samples.

September 28, 2015, 8:11 am

I have a question related to the "per base sequence quality" profiles.
In an Illumina HiSeq 150 bp paired-end run of multiplexed samples, I found that the "per base mean sequence quality" profiles of all the samples correlate with each other. Please see the attached file for figures of the quality profiles of all samples, for forward (read1) and reverse (read2) reads.

Though the variation in quality is very small (within 1 quality score), I was expecting that the variation should be rather random for each sample. Is this normal and a common occurrence, and a characteristic of the machine, with something to do with the base calling at each cycle?

* There is a prominent dip in quality at around position 105 in both forward and reverse reads. I was advised by the sequencing provider that this is a common occurrence, associated with an increase in laser intensity around this position. Is this true for all HiSeq runs?

Thanks for your suggestions.

Attached Files

PBSQ.pdf (265.7 KB)

↧

combine cells with same length?

September 28, 2015, 8:52 am

Hi all,

I am new to PacBio and have recently started working with Iso-Seq datasets. Taking MCF7 as an example, I find there are 7 cells sequenced for the 3-5 kb size fraction. I analysed each of them separately, then combined the SAM files after mapping the high-quality cluster sequences to the reference.

When I checked the cDNA primer wiki page, it seemed that another tool had been developed for chaining GTF files.

So I wonder whether it is possible to provide all 7 cells to ConsensusTools and generate one big CCS file for them. Or maybe the better way is to feed all 28 cells to tofu_warp and let it handle the sizing automatically.

So, I am confused about the strategy for analysing samples with multiple cells per size fraction. It seems that I have four options for constructing the FL cDNA:
1. combine all cells -> pb_warp
2. combine cells with same length -> ConsensusTools -> classify -> cluster -> collapse -> chain
3. do not combine -> ConsensusTools -> classify -> cluster -> collapse -> chain
4. do not combine -> ConsensusTools -> classify -> cluster -> merge sam -> collapse -> chain

Which one is better?

Thanks a lot

↧

cummeRbund error

September 28, 2015, 9:31 am

Hi guys,
For some reason cummeRbund in R cannot read my diffout folder.
This is the command that I used (after installing the library, of course):
cuff<-readCufflinks()
I got the following errors:
No records found in Eutrema_STAR_Cuffdiff/tss_groups.fpkm_tracking
TSS FPKM tracking file was empty.
Reading Eutrema_STAR_Cuffdiff/tss_group_exp.diff
No records found in Eutrema_STAR_Cuffdiff/tss_group_exp.diff
Reading Eutrema_STAR_Cuffdiff/splicing.diff
No records found in Eutrema_STAR_Cuffdiff/splicing.diff
Reading Eutrema_STAR_Cuffdiff/tss_groups.count_tracking
No records found in Eutrema_STAR_Cuffdiff/tss_groups.count_tracking
Reading read group info in Eutrema_STAR_Cuffdiff/tss_groups.read_group_tracking
No records found in Eutrema_STAR_Cuffdiff/tss_groups.read_group_tracking
Reading Eutrema_STAR_Cuffdiff/cds.fpkm_tracking

Then, when I ran the cuff command, I got the wrong number of samples and zero TSS, CDS, etc.

Help plz!

↧

Searching for a sequence in WGS / RNA-seq data

September 28, 2015, 10:16 am

I am interested in searching for a specific sequence in both my RNA-seq and WGS data, and the sequence is considerably longer than the read length of either experiment. I have access to all BAM files, some VCF files for WGS, raw fastq files, and everything else you can imagine coming from the sequencing. I want to see if the sequence is present in the data and, if it is, whether it's present in the aligned or unaligned BAM files.

The background to my question would be that the sequence in question is a sequence that I believe would not be successfully mapped to the reference, but might still exist in the data/reads. I am unsure of how to go about this, or if it's even something that can be done.

My initial idea was to create some kind of consensus sequence from the RNA-seq BAM files (both unaligned and aligned), and simply search the resulting sequence against my sequence of interest. This, however, has proven to be hard, as there seem to be numerous ways of doing it according to Google, with none clearly the best (the "best" of which involves vcftools, which for the life of me I cannot get to install on my Mac; no make files, although the documentation says there should be!)

In essence, I just want to find my sequence in my data. How do I do this?
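
One simple first check, sketched here under some assumptions, is to skip the consensus step and look at the raw reads directly: take a probe from the sequence of interest that is shorter than the read length and count the reads that contain it or its reverse complement. The function names `revcomp` and `count_probe_hits` are hypothetical, the FASTQ file is assumed to be uncompressed with strict 4-line records, and only exact matches are found, so reads carrying sequencing errors or variants in the probe region will be missed.

```shell
# Reverse-complement an upper-case DNA string (A/C/G/T only).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# Count reads in a 4-line-per-record FASTQ file that contain a probe
# or its reverse complement. The probe must be shorter than the reads.
# $1: probe sequence, $2: FASTQ file
count_probe_hits() {
    probe_rc=$(revcomp "$1")
    # Line 2 of every record holds the bases; grep -c counts matching lines.
    awk 'NR % 4 == 2' "$2" | grep -c -e "$1" -e "$probe_rc" || true
}
```

Calling `count_probe_hits ACGTAAGG reads.fastq` (probe and file name are placeholders) for several probes tiled along the sequence gives a rough idea of whether any part of it is covered. Aligning the reads against the sequence as its own small reference would be a more sensitive follow-up.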

↧

bedtools sort by faidx

September 28, 2015, 12:15 pm

The sort order of my bam file is:

Code:

cmccabe@DTV-A5211QLM:~/Desktop/NGS/pool_I_090215$ samtools view -H IonXpress_008_150902_newheader.bam | grep SQ | cut -f 2 | awk '{ sub(/^SN:/, ""); print;}'
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM

So I created a names.txt to do a sortBed in bedtools but it appears that the option I need is not there.

Code:

Tool:    bedtools sort (aka sortBed)
Version: v2.25.0
Summary: Sorts a feature file in various and useful ways.

Usage:   bedtools sort [OPTIONS] -i <bed/gff/vcf>

Options:
        -sizeA              Sort by feature size in ascending order.
        -sizeD              Sort by feature size in descending order.
        -chrThenSizeA       Sort by chrom (asc), then feature size (asc).
        -chrThenSizeD       Sort by chrom (asc), then feature size (desc).
        -chrThenScoreA      Sort by chrom (asc), then score (asc).
        -chrThenScoreD      Sort by chrom (asc), then score (desc).
        -faidx (names.txt)  Sort according to the chromosomes declared in "names.txt"
        -header             Print the header from the A file prior to results.

cmccabe@DTV-A5211QLM:~/Desktop/NGS$ sortBed faidx -i /home/cmccabe/Desktop/NGS/bed/bedtools/xgen_targets.bed > /home/cmccabe/Desktop/NGS/bed/bedtools/xgen_targets_sorted.bed

*****ERROR: Unrecognized parameter: faidx *****

Basically, since the sort order of my bam is in "human ordering" I wanted to sort my bed in the same way. Thank you :).
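
Going by the help text above, the option is written with a leading dash and takes the names file as its argument, so the call would presumably look like `sortBed -faidx names.txt -i xgen_targets.bed`. As a fallback that does not depend on that option, the same ordering can be sketched with plain awk and sort. The function name `sort_bed_by_names` is hypothetical, and the sketch assumes a tab-delimited BED file and a names.txt with one chromosome name per line in the desired order.

```shell
# Sort a BED file by a custom chromosome order, then by start coordinate.
# $1: names file, one chromosome name per line in the desired order
# $2: tab-delimited BED file
# Records on chromosomes missing from the names file are dropped.
sort_bed_by_names() {
    tab=$(printf '\t')
    awk -F '\t' '
        NR == FNR { rank[$1] = FNR; next }      # first file: remember each chromosome rank
        ($1 in rank) { print rank[$1] "\t" $0 } # second file: prefix records with the rank
    ' "$1" "$2" |
        sort -t "$tab" -k1,1n -k3,3n |          # rank first, then BED start (now column 3)
        cut -f2-                                # remove the helper rank column
}
```

Usage would be along the lines of `sort_bed_by_names names.txt xgen_targets.bed > xgen_targets_sorted.bed`.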

↧

Complete bioinformatics newbie here...hello

September 28, 2015, 12:18 pm

So, as the title implies, I am a complete greenhorn when it comes to bioinformatics and have much to learn. I am also just learning how to use the terminal on a Mac. I am a brand new graduate student here at the University of Kentucky and am working on sea lamprey genomics and development.

I will probably be posting a lot of softball questions in these forums out of desperation so please keep that in mind when I make posts that seem "too easy". Other than that I am determined to master everything I need to know to do bioinformatics and to learn all that I can. Hello!

↧

DESeq2 multivariate analysis - retrieving certain stats

September 28, 2015, 12:23 pm

Dear All,

I am running a DESeq2 within-subject treatment response analysis. I have RNA-seq data on 75 individuals; for most (but not all) of these I have data before and after treatment. I also have a variable indicating successful treatment response across all individuals. I want to model gene expression as dependent on treatment and treatment response (res01) within each subject. counts is an R object with counts across all samples and genes.

The design matrix:

Code:

> head(A_design)
     sampleID subject treatment res01
A10a     A10a     A10         0     0
A10b     A10b     A10         1     1
A11a     A11a     A11         0     0
A11b     A11b     A11         1     0
A12a     A12a     A12         0     0
A12b     A12b     A12         1     1

The analysis is run as follows:

Code:

dds <- DESeqDataSetFromMatrix(countData = counts[,A_design[,1]], colData = as.data.frame(A_design), design = ~ subject + treatment + res01)
Atreat.dds <- DESeq(dds)

Originally I thought that I could extract the effect of res01 on gene expression as:

Code:

> res <- results(Atreat.dds,name='res01',pAdjustMethod='BH')
Error in results(Atreat.dds, name = "res01", pAdjustMethod = "BH") :
  cannot find appropriate results in the DESeqDataSet.
possibly nbinomWaldTest or nbinomLRT has not yet been run.

At this point I realized that the results looked different from what I expected:

Code:

> resultsNames(Atreat.dds)
 [1] "Intercept"  "subjectA01" "subjectA02" "subjectA03" "subjectA04"
 [6] "subjectA05" "subjectA06" "subjectA07" "subjectA08" "subjectA09"
.....
[76] "subjectA92" "treatment0" "treatment1" "res010"     "res011"

I have two questions regarding this:
1. Why do I get two res01 results and two treatment results? This seems to be due to the within-subject design. I don't understand which model has actually been tested here.
2. Should I run the following code to extract the effect that res01 has on gene expression given all other covariates?

Code:

Ares_res01=results(Atreat.dds, contrast=list("res010",'res011'))

Thanks in advance for any help,
Boel

My sessionInfo:

Code:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux release 6.7 (Carbon)

locale:
 [1] LC_CTYPE=sv_SE.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=sv_SE.UTF-8        LC_COLLATE=sv_SE.UTF-8
 [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=sv_SE.UTF-8
 [7] LC_PAPER=sv_SE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] DESeq2_1.8.1              RcppArmadillo_0.5.400.2.0
[3] Rcpp_0.12.0               GenomicRanges_1.20.5
[5] GenomeInfoDb_1.4.2        IRanges_2.2.7
[7] S4Vectors_0.6.3           BiocGenerics_0.14.0

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3
 [4] XVector_0.8.0        futile.options_1.0.0 tools_3.2.1
 [7] rpart_4.1-10         digest_0.6.8         RSQLite_1.0.0
[10] annotate_1.46.1      gtable_0.1.2         lattice_0.20-33
[13] DBI_0.3.1            proto_0.3-10         gridExtra_2.0.0
[16] genefilter_1.50.0    stringr_1.0.0        cluster_2.0.3
[19] locfit_1.5-9.1       nnet_7.3-10          grid_3.2.1
[22] Biobase_2.28.0       AnnotationDbi_1.30.1 XML_3.98-1.3
[25] survival_2.38-3      BiocParallel_1.2.20  foreign_0.8-66
[28] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.46.0
[31] ggplot2_1.0.1        reshape2_1.4.1       lambda.r_1.1.7
[34] magrittr_1.5         scales_0.2.5         Hmisc_3.16-0
[37] MASS_7.3-43          splines_3.2.1        xtable_1.7-4
[40] colorspace_1.2-6     stringi_0.5-5        acepack_1.3-3.3
[43] munsell_0.4.2

↧

Error in normalizeDoubleBracketSubscript

September 28, 2015, 1:03 pm

Hello,

I am getting an error while trying to use a function to count reads for every chromosome across the genome in bins of 50 bp (the windowAnalysis function from the groHMM package). The error occurs for only one chromosome; the function runs fine for the rest of the chromosomes. After some reading, I found that it is related to the IRanges package. The error is as follows:

$chrM
[1] "Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) : \n subscript is out of bounds\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE): subscript is out of bounds>

Example of how it works fine for other chromosomes:

$chr1
integer-Rle of length 4985012 with 64 runs
Lengths: 935331 1 1 478754 1 ... 228032 1 1 1650
Values : 0 47 4 0 1 ... 0 1 50 0

Any suggestions to fix this are greatly appreciated.

Thanks,
Anusha

↧

Senior Financial Analyst

September 28, 2015, 4:46 pm

Senior Financial Analyst

Responsibilities: Job will evolve over time depending on the candidate's capabilities and could include the following:
Responsible for monthly P&L reporting
Responsibility for weekly and monthly sales reporting to commercial team
Support accounting team with analysis of monthly general ledger entries
Responsible for monthly commission and bonus accrual and payout calculations
Responsible for quarterly P&L flux analysis for outside auditors
Support monthly product demand and supply-side forecasting process
Administration of corporate bonus plans
Support of annual budget process
Support long range forecasting process and modeling
Support annual insurance renewal process and audits
Various ad-hoc financial modeling requirements

Requirements
4-6 yrs related experience in financial analysis role
Understanding of general accounting process and general accounting principles
Effective communication skills (written & verbal)
General understanding of ERP systems and data extraction and analysis
Ability to work and succeed in a team environment
Ability to adapt quickly and learn new tasks independently
High level of curiosity and creativity
Strong analytical skills
Excellent organization skills
Ability to manage competing priorities
Finance/Accounting degree or undergraduate degree which requires high level of analytics

Nice to Have
MBA preferred but not necessary

All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, national origin, protected veteran status, or on the basis of disability, gender identity, and sexual orientation.

Application Instructions:

For immediate consideration, please follow this link to submit your resume: Senior Financial Analyst

Pacific Biosciences is an Equal Opportunity Employer

↧

Hey There!!

September 28, 2015, 5:49 pm

A quick hello to the SEQanswers community!

I am currently working with the HiSeq, MiSeq, & soon the NextSeq in the cancer genomics world. I am looking forward to learning all that I can from this group of intelligent people!!

↧

Upcoming: Epigenomics Hands-On Workshop

September 29, 2015, 12:54 am

DNA Methylation Data Analysis
How to use bisulfite-treated sequencing to study DNA methylation

Link to workshop page

When?
15 - 17 December 2015

Where?
iad Pc-Pool, Rosa-Luxemburg-Straße 23, Leipzig, Germany

Scope and Topics
The purpose of this workshop is to get a deeper understanding of the use of bisulfite-treated DNA in order to analyze the epigenetic layer of DNA methylation. Advantages and disadvantages of the so-called 'bisulfite sequencing' and its implications on data analyses will be covered. The participants will be trained to understand bisulfite-treated NGS data, to detect potential problems/errors and finally to implement their own pipelines. After this course they will be able to analyze DNA methylation and create ready-to-publish graphics.

By the end of this workshop the participants will:

be familiar with the sequencing method of Illumina
understand how bisulfite sequencing works
be aware of the mapping problem of bisulfite-treated data
understand how bisulfite-treated reads are mapped to a reference genome
be familiar with common data formats and standards
know relevant tools for data processing
automate tasks with shell scripting to create reusable data pipelines
perform basic analyses (call methylated regions, perform basic downstream analyses)
plot and visualize results (ready-to-publish)
be able to reuse all analyses

Target Audience

biologists or data analysts with no or little experience in analyzing bisulfite sequencing data

Requirements

basic understanding of molecular biology (DNA, RNA, gene expression, PCR, ...)
the data analysis will partly take place on the Linux command line. It is therefore beneficial to be familiar with the command line and in particular the commands covered in the Learning the Shell Tutorial

Included in the Course

Course materials
Catering
Conference Dinner

Trainers

Helene Kretzmer (University of Leipzig) has been working on DNA methylation analyses using high-throughput sequencing since 2011. She is responsible for the bioinformatic analysis of the MMML-Seq study of the International Cancer Genome Consortium (ICGC).
Dr. Christian Otto (CCR-BioIT) is one of the developers of the bisulfite read mapping tool segemehl and is an expert on implementing efficient algorithms for HTS data analyses.
Dr. David Langenberger (ecSeq Bioinformatics) started working with small non-coding RNAs in 2006. Since 2009 he has used HTS technologies to investigate these short regulatory RNAs as well as other targets. He has been part of several large HTS projects, for example the International Cancer Genome Consortium (ICGC).
Dr. Mario Fasold (ecSeq Bioinformatics) has developed several bioinformatics tools such as the Bioconductor package AffyRNADegradation and the Larpack program package. Since 2011 he has specialized in the field of HTS data analysis and has helped analyse sequencing data for several large consortium projects.

Key Dates
Opening Date of Registration: 1 June 2015
Closing Date of Registration: 15 November 2015
Workshop: 15 - 17 December 2015 (8 am - 5 pm)

Attendance
Location: iad Pc-Pool, Rosa-Luxemburg-Straße 23, Leipzig, Germany
Language: English
Available seats: 24 (first-come, first-served)

Registration fees:

998 EUR (without VAT)

Travel expenses and accommodation are not covered by the registration fee.

Contact
ecSeq Bioinformatics
Brandvorwerkstr.43
04275 Leipzig
Germany
Email: events@ecSeq.com

Visit: http://www.ecseq.com/workshops/workshop_2015-02

↧

NGS beginner

September 29, 2015, 6:36 am

Hello all!

A couple of months ago we decided to try Illumina NGS for metagenomics purposes. Data analysis is completely new for me and also for the lab where I work. I registered to Seqanswers because I expect to have some questions along the way.

Kind regards,

Karel

↧

read number in fastq does not match bwa-mem produced sam file

September 29, 2015, 11:39 am

Hi all,

I used bwa mem to align my quality-trimmed reads to a reference. I have 5M reads to align, but only 3M of them are included in the SAM file (I grep for the initial part of the read names and count).

Many of the reads that are present in the SAM file carry the "4" flag, i.e. they are reported as unaligned, so I do not understand where the rest of the reads are. Does bwa mem have a preference for which reads to align and report? Any ideas on this?

bwa mem index reads.fastq > align.sam

Thank y'all!
Melis
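
A minimal sketch for cross-checking the two counts without relying on grep for part of the read name, which can over- or under-count (a FASTQ quality line may also start with "@", and a SAM file may list the same read on several lines). The function names are hypothetical, and the sketch assumes an uncompressed single-end FASTQ file with strict 4-line records.

```shell
# Count reads in a FASTQ file, assuming strict 4-line records.
# Counting lines is safer than grepping for "@", which can also start a quality line.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Count distinct read names among the alignment records of a SAM file.
# Header lines start with "@"; a read name (column 1) never does.
count_sam_read_names() {
    awk -F '\t' '!/^@/ && !seen[$1]++ { n++ } END { print n + 0 }' "$1"
}
```

Comparing `count_fastq_reads reads.fastq` with `count_sam_read_names align.sam` shows whether reads are really missing. With samtools installed, `samtools view -c -F 0x900 align.sam` should give a comparable number by counting only primary records. If the SAM count is still lower, it may be worth checking whether the bwa run finished and the file was written completely.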

↧

Best assembly

September 29, 2015, 11:41 am

I've run QUAST to assess the quality of a genome assembled with 3 different tools (ABySS, Velvet, SOAPdenovo); see the attachment.
In your opinion, which is the best?
Why are there 513 contigs in the last assembly (SOAPdenovo) but only 225/228 in the others?

Kind regards

Attached Images

assembly.jpg (34.4 KB)

↧

novoalign parameters_alignment scoring options

September 29, 2015, 12:28 pm

Dear All,

I am confused about the novoalign parameters -t, -g and -x. I read the manual, and it seems that these three parameters can be used to control mismatches when mapping reads to the reference. However, how should I set them? Is there any way to calculate them? How do I know which values to use for -t, -g and -x? I checked the forum, but I did not understand why, e.g., -t=60 corresponds to around 3 mismatches... Could anyone help me?

Thanks in advance!

Cheers,

Sadiexiaoyu

↧

Bowtie alignments and --local function

September 29, 2015, 2:07 pm

I've just started sequencing on a MiSeq and am stumbling a bit in the analysis of my output reads, specifically using bowtie2 to align the reads to the reference genome. I'm using bowtie to get a rough estimate of what my amplification looks like and it's generally intuitive and quick. However:

When aligning reads to a genome sequence, what do "% reads aligned exactly 1 time" and "% reads aligned >1 times" in the output mean? From my understanding, bowtie only records the best possible match (by default).
Is this saying that, to use an example from one of my samples, 40.26% of my reads align equally well to several places in the genome, suggesting a repeat region, while only 17.35% of my reads align to a unique region?
Or should this be interpreted as 40% of the reads aligning somewhere in the genome that already has one or more reads assembled to it, with 17% of my reads being unique?
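
One way to check which reading matches the data is to tally the SAM records directly. The sketch below relies on the bowtie2 manual's description of the optional XS:i: field, which is only written for an aligned read when more than one alignment was found for that read. That corresponds to the first interpretation: the counts are per read and say nothing about how many other reads share a locus. The function name `tally_unique_multi` is hypothetical, and single-end reads are assumed.

```shell
# Tally aligned reads in a bowtie2 SAM file by whether a second alignment was found.
# bowtie2 adds the optional XS:i: field only when it found more than one alignment.
# Prints "unique<TAB>N" and "multi<TAB>N"; unaligned reads (flag bit 4) are skipped.
tally_unique_multi() {
    awk -F '\t' '
        /^@/ { next }                         # skip header lines
        int($2 / 4) % 2 == 1 { next }         # flag bit 4 set: read is unaligned
        {
            multi_hit = 0
            for (i = 12; i <= NF; i++) if ($i ~ /^XS:i:/) multi_hit = 1
            if (multi_hit) multi++; else unique++
        }
        END { printf "unique\t%d\nmulti\t%d\n", unique, multi }
    ' "$1"
}
```

Running `tally_unique_multi aligned.sam` (file name is a placeholder) should give counts in the same proportions as the "exactly 1 time" and ">1 times" lines of the alignment summary.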

Also, my --local alignments are substantially different from those where I do not specify --local (99% of reads align with --local, while fewer than 70% align without it), even though I trim adapters and low-quality reads before aligning (using Trimmomatic). From my understanding, --local should give a slightly more liberal alignment, but the two should be closer, especially with trimming beforehand.

Forgive my ignorance and thanks for any advice.

Alex

↧

coverageBed -g option error

September 29, 2015, 2:46 pm

I am using:

Code:

cmccabe@DTV-A5211QLM:~/Desktop/NGS$ coverageBed -d -sorted -g /home/cmccabe/Desktop/NGS/bedtools2-25.0/genomes/human.hg19.genome -a /home/cmccabe/Desktop/NGS/bed/bedtools/xgen_targets_sorted.bed -b /home/cmccabe/Desktop/NGS/pool_I_090215/IonXpress_008_150902_newheader.bam > /home/cmccabe/Desktop/NGS/pool_I_090215/IonXpress_008_150902_output.txt

Error: Sorted input specified, but the file /home/cmccabe/Desktop/NGS/bed/bedtools/xgen_targets_sorted.bed has the following record with a different sort order than the genomeFile /home/cmccabe/Desktop/NGS/bedtools2-25.0/genomes/human.hg19.genome

chr20        126045        126343        +        DEFB126:exon.2;DEFB126:exon.3

The newheader.bam is sorted like so:

Code:

cmccabe@DTV-A5211QLM:~/Desktop/NGS/pool_I_090215$ samtools view -H IonXpress_008_150902_newheader.bam | grep SQ | cut -f 2 | awk '{ sub(/^SN:/, ""); print;}'
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM

Since the BAM file uses "human ordering", I sorted the BED file in the same way using the -faidx option in bedtools.

Code:

cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ cut -f1 xgen_targets_sorted.bed | awk '!seen[$1]++'
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY

The resulting output file stops after chr19. I guess my question is: if there had been an error in sorting, wouldn't all the records be a problem? And if I made my own genome file using the coordinates in the bedtools genome file, but re-ordered them to match mine, would that work? Or is there another problem I am overlooking? Thank you :).
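
The idea of building a genome file that matches the BAM order can be sketched by deriving it from the BAM header itself, so that names, lengths and order all come from one source. The function name `header_to_genome` and the output file name below are hypothetical. The sketch assumes the usual tab-delimited @SQ header lines with SN: and LN: fields.

```shell
# Turn SAM header text (stdin) into a bedtools-style genome file (stdout):
# one "name<TAB>length" line per @SQ record, in header order.
header_to_genome() {
    awk -F '\t' '$1 == "@SQ" {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4)   # sequence name
            if ($i ~ /^LN:/) len = substr($i, 4)    # sequence length
        }
        if (name != "" && len != "") print name "\t" len
    }'
}
```

For example, `samtools view -H IonXpress_008_150902_newheader.bam | header_to_genome > bam_order.genome` would write a file that could then be passed to coverageBed with -g in place of human.hg19.genome.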

↧

DESeq2 plotMA: data appears incorrect, patterned

September 29, 2015, 3:24 pm

Hi,
I'm trying to learn DESeq2 using a simplified data set with 3 controls ("ctr"), and 3 treatments ("koh"). I used the summary from an earlier thread (MDonlin;120724) to get started. I am not getting any error messages, but the output from plotMA does not appear as it does in the DESeq2 "airway" vignette. It looks patterned in a non-random way suggesting something is incorrect (plot attached).

If anyone has any suggestions, I'd much appreciate the help.

Best,
Byron

code:
> library("DESeq2")
>
> #generate count table from text file
> GeneCountTable <- read.table("KM272.d3.ctr.koh.txt", header=TRUE, row.names=1)
>
> head(GeneCountTable)
                                          ctr1d3 ctr2d3 ctr3d3 koh1d3 koh2d3 koh3d3
gi|10000000001|loc|edl|EDL_NS211000002.1|      0      3      0      0      0      0
gi|10000000002|loc|edl|EDL_NS211000003.1|      0      0      0      0      0      0
gi|10000000003|loc|edl|EDL_NS211000004.1|      0      0      0      0      0      0
gi|10000000008|loc|edl|EDL_NS211000009.1|      0      0      0      0      0      0
gi|10000000011|loc|edl|EDL_NS211000012.1|      0      0      0      0      0      0
gi|10000000018|loc|edl|EDL_NS211000019.1|      0      0      0      0      0      0
>
> #define samples
> samples <- data.frame(row.names=c("ctr1d3","ctr2d3","ctr3d3","koh1d3","koh2d3","koh3d3"), condition=as.factor(c(rep("ctr",3),rep("koh",3))))
> samples
       condition
ctr1d3       ctr
ctr2d3       ctr
ctr3d3       ctr
koh1d3       koh
koh2d3       koh
koh3d3       koh
>
> #generate DESeq dataset
> KM272dds <- DESeqDataSetFromMatrix(countData = GeneCountTable, colData=samples, design=~condition)
> KM272dds
class: DESeqDataSet
dim: 559312 6
exptData(0):
assays(1): counts
rownames(559312): gi|10000000001|loc|edl|EDL_NS211000002.1|
gi|10000000002|loc|edl|EDL_NS211000003.1| ... gi|9964628|ref|NP_064758.1|
gi|99878752|ref|YP_615055.1|
rowRanges metadata column names(0):
colnames(6): ctr1d3 ctr2d3 ... koh2d3 koh3d3
colData names(1): condition
>
> #run DESeq on dataset
> KM272dds_1 <- DESeq(KM272dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>
> #generate results table
> KM272_res <- results(KM272dds_1)
>
> #reorder results table by lowest adjusted P-value
> KM272_resOrdered <- KM272_res[order(KM272_res$padj),]
> head(KM272_resOrdered)
log2 fold change (MAP): condition koh vs ctr
Wald test p-value: condition koh vs ctr
DataFrame with 6 rows and 6 columns
                                  baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                                 <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gi|426411412|ref|YP_007031511.1| 1257.5658       8.121059 0.8434747  9.628101 6.084362e-22 1.696016e-17
gi|537453526|ref|YP_008487251.1|  967.6043       7.914203 0.9057204  8.738021 2.372297e-18 3.306389e-14
gi|152995336|ref|YP_001340171.1|  294.0563      11.018376 1.3738295  8.020192 1.055802e-15 7.357622e-12
gi|224584543|ref|YP_002638341.1|  236.0105       9.143821 1.1382139  8.033482 9.474453e-16 7.357622e-12
gi|333907837|ref|YP_004481423.1|  291.8945      11.013159 1.3848704  7.952484 1.828092e-15 1.019161e-11
gi|285019583|ref|YP_003377294.1|  147.3368       6.864467 0.8663821  7.923140 2.315875e-15 1.075917e-11
>
> #write CSV file
> write.csv(KM272_resOrdered,file="KM272_RNA_results.csv")
>
> #summarize results
> summary(KM272_res)

out of 139894 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 1927, 1.4%
LFC < 0 (down) : 17, 0.012%
outliers [1] : 272, 0.19%
low counts [2] : 111747, 80%
(mean count < 0.5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> #VISUALIZE
> #In DESeq2, the function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.
> plotMA(KM272_res, main="DESeq2", ylim=c(-15,15))

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.8.1 RcppArmadillo_0.5.600.2.0 Rcpp_0.12.1
[4] GenomicRanges_1.20.8 GenomeInfoDb_1.4.3 IRanges_2.2.7
[7] S4Vectors_0.6.6 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 XVector_0.8.0
[5] futile.options_1.0.0 tools_3.2.2 rpart_4.1-10 digest_0.6.8
[9] RSQLite_1.0.0 annotate_1.46.1 gtable_0.1.2 lattice_0.20-33
[13] DBI_0.3.1 proto_0.3-10 gridExtra_2.0.0 genefilter_1.50.0
[17] stringr_1.0.0 cluster_2.0.3 locfit_1.5-9.1 nnet_7.3-11
[21] grid_3.2.2 Biobase_2.28.0 AnnotationDbi_1.30.1 XML_3.98-1.3
[25] survival_2.38-3 BiocParallel_1.2.21 foreign_0.8-66 latticeExtra_0.6-26
[29] Formula_1.2-1 geneplotter_1.46.0 ggplot2_1.0.1 reshape2_1.4.1
[33] lambda.r_1.1.7 magrittr_1.5 scales_0.3.0 Hmisc_3.17-0
[37] MASS_7.3-44 splines_3.2.2 xtable_1.7-4 colorspace_1.2-6
[41] stringi_0.5-5 acepack_1.3-3.3 munsell_0.4.2
>

Attached Files

KM272_RplotMA_1.pdf (2.26 MB)

↧