We added two new sets of charts to our charts gallery.
The first one shows the global distribution of transcriptomic datasets reported in the NCBI GEO database. The number of measurements submitted by US-based researchers far exceeds that of any other country. I do not know whether some European researchers submit their data only to ArrayExpress (a European database similar to GEO), which would bias the following chart.
Continue here
Thursday, August 18, 2011
Wednesday, August 17, 2011
Using Mate Pair Information in de Bruijn Graphs
This is our last post in the de Bruijn graph series. Although we did not mention it explicitly, each of the previous posts presented a conceptual building block for the operations of de Bruijn graph-based assemblers. The only thing we have not discussed yet is how the assembly is really done.
For argument’s sake, imagine you are sequencing a very short gene using the shotgun approach. The sequencer gave you fragments for various parts of the gene, and you have to stitch them together to reconstruct the entire gene.
For a small gene, the stitching together is usually done by aligning all reads and taking the consensus sequence. Sequencing errors are expected to be removed by the consensus generating process.
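As a rough illustration of the consensus idea, here is a minimal sketch in Python. It assumes the reads have already been aligned and padded with `-` to a common length; the function name `consensus` and the toy reads are made up for this example, and real assemblers do considerably more.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over reads padded to a common length.

    '-' marks positions that a read does not cover.
    """
    length = len(aligned_reads[0])
    result = []
    for i in range(length):
        # Collect the bases that cover this column of the alignment.
        column = [read[i] for read in aligned_reads if read[i] != "-"]
        if not column:
            result.append("N")  # no read covers this position
            continue
        base, _ = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)

# Hypothetical pre-aligned reads; the third one has an error (C -> A).
reads = [
    "ATGGCGT-----",
    "--GGCGTACC--",
    "--GGAGTACC--",
    "-----GTACCTA",
    "ATGGCGTAC---",
]
print(consensus(reads))
```

The single erroneous base is outvoted by the other reads covering the same column, which is the sense in which the consensus step removes sequencing errors.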
Continue here
A Bird's Eye View of NCBI GEO Database - II
We presented the structure of the NCBI GEO database in our earlier commentary, A Bird’s Eye View of NCBI GEO Database. Today we will inspect the contents of the GEO database more closely.
As we explained earlier, GEO data sets are organized in terms of both GPLs (platforms, i.e. array designs) and GSEs (collections of many measurements on one or more array designs). As an example, GPL570 is the human gene array designed by Affymetrix. At the NCBI GEO database, all experiments using that array can be downloaded together from the GPL570 link. A GSE ID, on the other hand, typically represents all data from a researcher related to a publication. A GSE may include any number of platforms (GPLs), depending on how the experiment was designed.
In the following chart, we show the most popular GPLs, i.e. the ones used by the highest number of GSEs. Please click on the chart to see it in a larger form. GPL570 is clearly the winner, closely followed by GPL1261 (the Affymetrix mouse array). Each of those arrays was used by over 1,000 publications. GEO also assigned a single GPL ID to all Illumina short read submissions for each organism. Those platforms (GPL9052, GPL9058, etc.) are catching up fast given their limited history.
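The ranking behind such a chart can be sketched as a simple count of how many GSEs reference each GPL. The mapping below is invented purely for illustration (the GSE names are placeholders, not real accessions), and `gpl_popularity` is a hypothetical helper, not part of any GEO tool.

```python
from collections import Counter

# Hypothetical GSE -> GPL mapping, made up for illustration only.
gse_to_gpls = {
    "GSE_A": ["GPL570"],
    "GSE_B": ["GPL570", "GPL1261"],  # one series can span several platforms
    "GSE_C": ["GPL1261"],
    "GSE_D": ["GPL570"],
    "GSE_E": ["GPL9052"],
}

def gpl_popularity(mapping):
    """Count, for each platform, the number of series that use it."""
    counts = Counter()
    for gpls in mapping.values():
        counts.update(set(gpls))  # count each platform once per series
    return counts

ranking = gpl_popularity(gse_to_gpls).most_common()
for gpl, n_series in ranking:
    print(gpl, n_series)
```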
Continue here
Monday, August 15, 2011
De Novo Transcriptome Assemblers - Oases, Trinity, etc.
We are working on an article on de novo transcriptome assemblers, and it will be released in the following four parts this week. Stay tuned!
Part 1 – Larger Perspective
Part 2 – Trinity and Oases – Quick Start Guide
Part 3 – Assembly Algorithms of Trinity and Oases
Part 4 – Tricks and Tips for Better Results
If you are in a hurry, I also posted an example of our experience with the Trinity assembler in the forum. It discusses how to run the program, gives some estimates of run time and RAM requirements, explains the directory structure, etc.
Thursday, August 11, 2011
A Drawback of de Bruijn Graph Approach
In our first commentary on de Bruijn graphs, we explained how a de Bruijn graph can be constructed for any genome.
In the second commentary, we argued that a de Bruijn graph created from millions of short reads is identical to the de Bruijn graph of the underlying genome, if coverage is perfect. So, the assembly problem reduces to figuring out the genome from its de Bruijn graph, i.e. the inverse problem of graph creation.
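The claim can be checked on a toy example. The sketch below is a simplification under stated assumptions: edges are stored without multiplicity, reverse complements are ignored, reads are error-free, and "perfect coverage" means one read starting at every position. The genome string, `k`, and `read_len` are arbitrary values chosen for illustration.

```python
from collections import defaultdict

def de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)

genome = "ATGGCGTGCAATG"  # toy genome, made up for this example
k = 4
read_len = 6

# Perfect coverage: one error-free read starting at every position.
reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

same = de_bruijn([genome], k) == de_bruijn(reads, k)
print(same)
```

Because every k-mer of the genome appears in at least one read, and every k-mer of a read comes from the genome, the two graphs contain exactly the same edges.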
At this point, you may be wondering why de Bruijn graphs were not widely used in the prehistoric era of sequencing (the Sanger sequencing era), given that they are the greatest thing since sliced bread. The answer is simple. De Bruijn graphs do not preserve positional information.
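One way to see the loss of positional information is that two different sequences can produce exactly the same k-mers. In the small sketch below, the two strings are constructed for illustration around a repeated `CG`, and `kmer_counts` is a hypothetical helper.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every k-mer in seq, discarding where each one occurred."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Two different toy sequences: the segments between the repeats are swapped.
a = "CGACGTCG"
b = "CGTCGACG"
k = 3

print(a != b)                                # the sequences differ
print(kmer_counts(a, k) == kmer_counts(b, k))  # but their k-mers do not
```

Since the graph is built only from the k-mers, it cannot tell these two sequences apart without additional information.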
Continue here
Wednesday, August 10, 2011
Why no comment section?
Some of you asked why we do not activate the comment section of our blog unlike 99.7% of other blogs. Short answer – 99.7% of those blogs are not as popular as ours. Long answer follows.
In February 2010, we wrote on The Mathematics of Color Space Sequencing and took a small break for 18 months. After coming back, we encountered 1000 comments, among which 997 were of the form – ‘A very well written article. Will you buy Viag** from us?’ We did not enjoy this surge of popularity, and the only way to stop our fans from following us was to shut down the comments section.
Now that we have set up the forum, we find that it adds many benefits that the comments section would not have provided.
Firstly, the format of a blog's comment section gives the impression that the blog author is more knowledgeable than the commenters. The only thing we know for sure is that the depth of our ignorance is limitless.
Secondly, the discussion forum allows users to initiate discussions on topics that the blogger did not cover. Our forum is divided into ‘General Category’ and ‘Daily blogs’ sections. In the ‘General Category’ section, you are free to initiate a conversation on any topic that is of relevance to our community.
Large Computer, Distributed Cluster or Amazon Cloud?
A very popular comment in the SEQanswers forum starts with:
Hello – I use to think I was good with a computer
I wonder how many people are in the same boat as me.
1) Institute bought a couple of GAIIs
2) No one has money to use them
3) Institute has internal competition to pay for a couple of runs (makes the donors feel better about their donation if someone uses the machines), and you are lucky enough to get funded
4) You send a couple of samples off to never-never land and someone sends back a terabyte drive or two with “next-gen sequencing data”
5) You quickly realize people that use to do survival curves in your bioinformatics core don’t really know that Illumina fastq is different from Sanger fastq and the analysis they provide is limited at best
5) Now what do you do?
Please continue here
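One well-known difference between the two FASTQ flavors mentioned in the quote is the quality-score encoding: Sanger FASTQ stores Phred scores with an ASCII offset of 33, while Illumina pipeline versions 1.3 through 1.7 used an offset of 64. The sketch below assumes the input is in that Phred+64 format (older Solexa files use a different scale and are not handled); the function name and the sample quality string are made up.

```python
def illumina64_to_sanger33(quality):
    """Convert a Phred+64 quality string to a Phred+33 quality string."""
    converted = []
    for char in quality:
        phred = ord(char) - 64  # decode assuming the Phred+64 offset
        if phred < 0:
            raise ValueError("character %r is not valid Phred+64" % char)
        converted.append(chr(phred + 33))  # re-encode with the Sanger offset
    return "".join(converted)

illumina_qual = "hhhgfdB"  # hypothetical quality line from a Phred+64 file
sanger_qual = illumina64_to_sanger33(illumina_qual)
print(sanger_qual)
```

Reading a Phred+64 file as if it were Phred+33 makes every base look far more reliable than it is, which is one reason the distinction matters in downstream analysis.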