Next Generation Sequencing: August 2011

Thursday, August 18, 2011

Transcriptomic Research around the Globe

We added two new sets of charts to our charts gallery.

The first one shows the global distribution of reported transcriptomic datasets in the NCBI GEO database. The number of measurements submitted by US-based researchers far exceeds that of any other country. I do not know whether some European researchers submit their data only to ArrayExpress (a European database similar to GEO), which would introduce some bias in the following chart.

Continue here

Wednesday, August 17, 2011

Using Mate Pair Information in de Bruijn Graphs

This is our last post in the de Bruijn graph series. Although we did not mention it explicitly, each of the previous posts was written to present a conceptual building block for the operations of de Bruijn graph-based assemblers. The only thing we have not yet discussed is how the assembly is actually done.

For argument’s sake, imagine you are sequencing a very short gene using the shotgun approach. The sequencer gave you fragments for various parts of the gene, and you have to stitch them together to reconstruct the entire gene.

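For a small gene, the stitching is usually done by aligning all reads and taking the consensus sequence. Sequencing errors are expected to be removed by the consensus-generating process, because an error in one read is outvoted by the correct bases in the other reads covering the same position.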
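As a rough illustration of the consensus idea, the sketch below assumes the reads have already been aligned, i.e. each read comes with a known start offset on the gene. The toy reads, their offsets and the function name `consensus` are invented for this example and are not taken from any particular assembler.

```python
from collections import Counter

def consensus(aligned_reads, length):
    """Majority-vote consensus from (offset, read) pairs.

    Assumes every read is already placed at its correct offset;
    positions with no coverage are reported as 'N'.
    """
    columns = [Counter() for _ in range(length)]
    for offset, read in aligned_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )

# Toy data: the third read carries one sequencing error (T instead of G).
reads = [
    (0, "ATGGCGTAC"),
    (3, "GCGTACGTT"),
    (3, "GCTTACGTT"),
    (6, "TACGTTAGC"),
    (2, "GGCGTACGT"),
]
gene = consensus(reads, 15)
print(gene)
```

The erroneous T is outvoted three to one by the other reads covering that position, so it does not appear in the consensus.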

Continue here

A Bird's Eye View of NCBI GEO Database - II

We presented the structure of the NCBI GEO database in our earlier commentary, A Bird’s Eye View of NCBI GEO Database. Today we will inspect the contents of the GEO database more closely.

As we explained earlier, GEO data sets are organized in terms of both GPLs (platforms, i.e. array designs) and GSEs (collections of many measurements on one or more array designs). As an example, GPL570 is the human gene array designed by Affymetrix. In the NCBI GEO database, all experiments using that array can be downloaded together from the GPL570 link. On the other hand, a GSE ID typically represents all data from a researcher related to one publication. That GSE file may include any number of platforms (GPLs), depending on how the experiment was designed.

In the following chart, we show the most popular GPLs, i.e. the ones used by the highest number of GSEs. Please click on the chart to see it in a larger form. GPL570 is clearly the winner, closely followed by GPL1261 (Affymetrix mouse array). Each of those arrays was used by over 1,000 publications. GEO also assigned a single GPL ID to all Illumina short read submissions for each organism. Those sets (GPL9052, GPL9058, etc.) are catching up fast given their limited history.
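The ranking behind such a chart amounts to counting, for each GPL, how many GSEs reference it. Here is a minimal sketch of that counting step. The GSE-to-GPL mapping is invented for illustration (the GSE IDs are placeholders, not real GEO records), and `rank_platforms` is a name we made up.

```python
from collections import Counter

# Hypothetical mapping: each GSE lists the platforms (GPLs) it used.
gse_to_gpls = {
    "GSE_A": ["GPL570"],
    "GSE_B": ["GPL570", "GPL1261"],
    "GSE_C": ["GPL1261"],
    "GSE_D": ["GPL570", "GPL9052"],
}

def rank_platforms(mapping):
    """Return (GPL, number of GSEs using it) pairs, most used first."""
    counts = Counter()
    for gpls in mapping.values():
        counts.update(set(gpls))  # count each GPL at most once per GSE
    # Sort by descending count, then by GPL name to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = rank_platforms(gse_to_gpls)
print(ranking)
```

Because one GSE may span several GPLs, the counts can add up to more than the number of GSEs.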

Continue here

Monday, August 15, 2011

De Novo Transcriptome Assemblers - Oases, Trinity, etc.

We are working on an article on de novo transcriptome assemblers, and it will be released in the following four parts this week. Stay tuned!

Part 1 – Larger Perspective
Part 2 – Trinity and Oases – Quick Start Guide
Part 3 – Assembly Algorithms of Trinity and Oases
Part 4 – Tricks and Tips for Better Results

If you are in a hurry, I also posted an account of our experience with the Trinity assembler in the forum. It discusses how to run the program, gives some estimates of run time and RAM usage, explains the directory structure, etc.

Thursday, August 11, 2011

A Drawback of de Bruijn Graph Approach

In our first commentary on de Bruijn graphs, we explained how a de Bruijn graph can be constructed for any genome.

In the second commentary, we argued that a de Bruijn graph created from millions of short reads is identical to the de Bruijn graph of the underlying genome, if coverage is perfect. So, the assembly problem reduces to figuring out the genome from its de Bruijn graph, i.e. the inverse problem of graph creation.
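The claim can be checked on a toy example. The sketch below makes several simplifying assumptions of our own: the graph is stored as a set of edges between (k-1)-mers (so k-mer multiplicity is ignored), reverse complements are not considered, and "perfect coverage" means one error-free read starting at every position of the genome. The genome string and the function name `de_bruijn_edges` are invented for illustration.

```python
def de_bruijn_edges(sequences, k):
    """Set of edges (prefix (k-1)-mer, suffix (k-1)-mer), one per distinct k-mer."""
    edges = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges

genome = "ATGGCGTGCAATGCC"
k = 4
read_len = 8

# "Perfect coverage": an error-free read starting at every genome position.
reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

genome_graph = de_bruijn_edges([genome], k)
read_graph = de_bruijn_edges(reads, k)
print(genome_graph == read_graph)
```

The two edge sets coincide because every k-mer of the genome falls inside at least one read, and error-free reads cannot contribute any k-mer that is absent from the genome.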

At this point, you may be wondering why de Bruijn graphs were not widely used in the prehistoric era of sequencing (Sanger sequencing era) given that they are the greatest things invented since sliced bread. The answer is simple. De Bruijn graphs do not preserve positional information.

Continue here

Wednesday, August 10, 2011

Why no comment section?

Some of you asked why we do not activate the comment section of our blog unlike 99.7% of other blogs. Short answer – 99.7% of those blogs are not as popular as ours. Long answer follows.

In February 2010, we wrote on The Mathematics of Color Space Sequencing and took a small break for 18 months. After coming back, we encountered 1000 comments, among which 997 were of the form – ‘A very well written article. Will you buy Viag** from us?’ We did not enjoy this surge of popularity, and the only way to stop our fans from following us was to shut down the comments section.

Now that we have set up the forum, we find that it adds many benefits that the comments section would not have provided.

Firstly, the format of a blog's comment section gives the impression that the blog author is more knowledgeable than the commenters. The only thing we know for sure is that the depth of our ignorance is limitless.

Secondly, the discussion forum allows users to initiate discussions on topics that the blogger did not cover. Our forum is divided into ‘General Category’ and ‘Daily blogs’ sections. In the ‘General Category’ section, you are free to start a conversation on any topic that is relevant to our community.

Large Computer, Distributed Cluster or Amazon Cloud?

A very popular comment in the SEQanswers forum starts with:

Hello – I use to think I was good with a computer
I wonder how many people are in the same boat as me.

1) Institute bought a couple of GAIIs
2) No one has money to use them
3) Institute has internal competition to pay for a couple of runs (makes the donors feel better about their donation if someone uses the machines), and you are lucky enough to get funded
4) You send a couple of samples off to never-never land and someone sends back a terabyte drive or two with “next-gen sequencing data”
5) You quickly realize people that use to do survival curves in your bioinformatics core don’t really know that Illumina fastq is different from Sanger fastq and the analysis they provide is limited at best
5) Now what do you do?
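
One concrete difference behind the fastq remark in the quoted list is the quality-score encoding: Sanger FASTQ stores Phred scores as ASCII characters with an offset of 33, whereas Illumina pipeline versions 1.3 through 1.7 used an offset of 64. A minimal sketch of the re-encoding follows. The function name is ours, and the sketch assumes the input really is Phred+64 (it does not handle the older Solexa-scaled scores).

```python
def illumina64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for ch in quality:
        phred = ord(ch) - 64  # decode with the Illumina 1.3-1.7 offset
        if phred < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(phred + 33))  # encode with the Sanger offset
    return "".join(converted)

sanger_quality = illumina64_to_sanger("hhhgfB")
print(sanger_quality)
```

Feeding Phred+64 data to a tool that expects Phred+33 makes every base look far more reliable than it is, which is why the distinction matters in practice.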

Please continue here

Tuesday, August 9, 2011

De Bruijn Graphs for Alternative Splicing and Repetitive Regions

Today we shall examine de Bruijn graphs for two structures that occur frequently in genomes or transcriptomes. The reason for studying them together will be apparent by the end of this post.

Let us first construct a graph for two alternatively spliced transcripts A and B for a gene. The regions shown in yellow and red are transcribed in both isoforms, whereas the green region is present only in A.

The de Bruijn graph is shown in circle-and-arrow format, and the paths for the two transcripts are marked by dotted lines. We shall explain the graph construction qualitatively instead of going into nucleotide-level detail. We recommend that you pick your favorite gene and do the detailed construction yourself by following the rules explained earlier.
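For readers who prefer to let a computer do the detailed construction, here is a small sketch. The three region sequences are invented toy strings standing in for the yellow, green and red regions, k = 5 is an arbitrary choice, and the graph is stored as a dictionary from each (k-1)-mer to its successors. None of this comes from a specific assembler.

```python
from collections import defaultdict

def de_bruijn(sequences, k):
    """Map each (k-1)-mer node to the set of nodes it points to."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy stand-ins for the yellow, green and red regions.
yellow, green, red = "ACGTTGCA", "GGATCCTA", "TTCAGACG"
transcript_a = yellow + green + red   # isoform A keeps the green region
transcript_b = yellow + red           # isoform B skips it
k = 5

graph = de_bruijn([transcript_a, transcript_b], k)

# Nodes with more than one successor: where the two paths split.
branches = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}

# Nodes with more than one predecessor: where the two paths rejoin.
preds = defaultdict(set)
for node, nxt in graph.items():
    for n in nxt:
        preds[n].add(node)
merges = {node: sorted(p) for node, p in preds.items() if len(p) > 1}

print("split at:", branches)
print("rejoin at:", merges)
```

With these toy sequences the graph has exactly one split near the end of the yellow region and one rejoin near the start of the red region, so the two isoforms appear as two paths through a single bubble.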

Please continue here

Monday, August 8, 2011

A Road Map for our Journey through the Transcriptomic Maze

It has not escaped our mind that, in our writing so far, we did not explain what we intend to achieve here. Today we will take a brief pause from scientific topics to reflect on our limited goals.

As you are well aware, biology has changed rapidly over the last few decades, and the pace of change seems to be accelerating in recent years. The sequencing and assembly of the human genome, hailed as one of the greatest achievements of mankind, was done at a staggering cost of ~$3B, and the project needed a decade-long collaboration of the brightest minds in the world. Today, only ten short years later, a small group of researchers with limited funds can think of assembling a comparable-sized genome.

This change has been brought forth by technological progress on two fronts. Firstly, innovations in chemistry and semiconductor processing continue to drive down the costs of sequencing, arraying and other large-scale biochemical experiments. Secondly, an equally fast-paced set of innovations in the computational arena allows proper utilization of data derived from large-scale experiments. For example, it would not have been possible to quickly align billions of short reads to a reference genome had algorithms incorporating the Burrows-Wheeler transform not been developed for bioinformatics. Similarly, de novo assembly of genomes from short reads has become possible through innovative graph-theoretic concepts such as de Bruijn graphs.
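To give a flavor of why the Burrows-Wheeler transform helps with read alignment, here is a deliberately naive sketch: it builds the transform by sorting all rotations and then counts exact matches of a short read with backward search. Real aligners use far more compact index structures. The function names, the `$` sentinel convention and the toy reference string are our own assumptions for illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive and slow)."""
    text += "$"  # sentinel: assumed absent from text and smaller than any base
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_text, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first_pos[c] = number of characters in the text strictly smaller than c
    first_pos, total = {}, 0
    for c in sorted(set(bwt_text)):
        first_pos[c] = total
        total += bwt_text.count(c)

    def rank(c, i):
        # Occurrences of c in bwt_text[:i]; real tools precompute this table.
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)
    for c in reversed(pattern):  # extend the match one base to the left
        if c not in first_pos:
            return 0
        lo = first_pos[c] + rank(c, lo)
        hi = first_pos[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

reference = "ACGTACGTGACG"
index = bwt(reference)
print(count_occurrences(index, "ACG"))
print(count_occurrences(index, "TT"))
```

The search loop runs once per base of the read, regardless of the reference length, which is the property that makes this family of methods attractive for aligning very large numbers of short reads.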

Continue here

Using Hadoop for Transcriptomics

Those of us trying to analyze next-gen sequencing data often feel constrained by the availability of computing power. Buying a very large computer (‘large’ measured by RAM size, not body mass index) is the most conventional solution, but that solution comes with a hefty price tag. Many institutions have already invested heavily in distributed computing centers, and they encourage users to take full advantage of the existing resources.

Typically, distributed systems in computing clusters and supercomputing centers implement an MPI-based architecture for parallel computing. Another type of distributed architecture, named Hadoop/MapReduce, has become popular among internet companies processing terabytes of data. Hadoop is accessible to bioinformaticians through the Amazon cloud (Elastic MapReduce), but many researchers do not understand what advantage Hadoop would provide over a conventional parallel architecture. Here we explain the difference with simple examples.
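As a taste of the MapReduce programming model, the sketch below counts k-mers in a handful of reads using separate map, shuffle and reduce steps. It runs in a single Python process and does not use Hadoop itself. The function names and the toy reads are ours, and the point is only to show the shape of the computation that a framework would distribute across machines.

```python
from collections import defaultdict
from itertools import chain

def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)

reads = ["ACGTAC", "GTACGT", "TACGTA"]

# Each read is mapped independently; no mapper needs to see another read.
mapped = chain.from_iterable(map_read(r) for r in reads)

# Each key is reduced independently; no reducer needs to see another key.
kmer_counts = dict(
    reduce_counts(key, values) for key, values in shuffle(mapped).items()
)
print(kmer_counts)
```

The design choice worth noticing is that the programmer writes only the per-read and per-key functions. In a real MapReduce deployment the framework takes care of grouping by key and moving data between machines, whereas in an MPI program that communication is typically written explicitly.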

Continue here

Friday, August 5, 2011

Comparison of data analysis packages

I came across this post comparing R, Matlab, SciPy, SAS, SPSS, Stata and biologists' greatest tool Excel, and thought you would find it useful.

Also, please read this comment comparing Matlab, Octave and R. Quoting the author,

Continue here

Thursday, August 4, 2011

Using R for Transcriptome Analysis - I

R is a very powerful software tool for analysis of biological data. It is an excellent choice for biologists reaching the limits of Excel, because learning R is very easy. The learning curve is minimal for Matlab or Mathematica users, and R comes with the added benefit of costing less than a cup of coffee. In fact, it costs less than the paper napkin used to wipe the coffee table: R is free. Primarily for this reason, many users have contributed powerful libraries to R. Those libraries can make statistical analysis of bioinformatic data straightforward. We are particularly attracted to the Bioconductor suite of packages.

Continue here

Wednesday, August 3, 2011

A Bird's Eye View of NCBI GEO Database

The NCBI GEO database is the world’s largest public online repository for transcriptome datasets. It includes transcriptome data from several types of experiments (arrays, next-gen sequencing, MPSS, SAGE, RT-PCR, etc.), although the major share of the data comes from array measurements. For short read datasets (NGS), some researchers prefer to use the NCBI SRA database as a repository. The SRA database includes both transcriptomic and genomic sequences, and we will cover its transcriptomic component in a later post.

The majority of GEO users download and analyze only one or two measurement sets related to their own research. Here we plan to look at the entire collection of measurements stored in GEO. This post is introductory, but over the next few days, we will present various interesting charts of GEO data to show trends in transcriptomics.

For non-users, let me first explain the structure of GEO. If you go to the GEO website, you will notice the following stats in the right-hand corner near the top. They show the current contents of the GEO database.

Continue here

Monday, August 1, 2011

How do sequencing errors affect de Bruijn graphs?

Today's commentary is the fourth in our de Bruijn graph series, but I did not want Roman numerals to pile up in titles as in the Rocky movies ('de Bruijn graphs - IV', 'V', 'Balboa'), so I chose a more descriptive title instead. For your convenience, our earlier discussions are linked here:

de Bruijn graphs - I
de Bruijn graphs - II
de Bruijn graphs - III

Continue here
