Next Generation Sequencing: July 2011

Sunday, July 31, 2011

De Bruijn Graphs - III

In earlier commentaries, we introduced the concept of de Bruijn graphs and showed how they are used for de novo assembly of short read sequences. If you read those posts, you likely left with the impression that de Bruijn graphs are useful weapons for every bioinformatician’s arsenal. However, none of the posts clearly explained why they became the primary weapons of all popular short read assemblers. Let us do that at the outset here.

We will borrow the following figure from our previous post. It shows the de Bruijn graph of a genome, and a few short reads aligned to the genome and to the de Bruijn graph. As before, we will restrict our discussion to a perfect world with no countries, no religion, no greed or hunger and, most importantly, no sequencing errors.
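In that error-free world, every k-mer found in a read is also a k-mer of the genome, so each read traces a path through the genome's de Bruijn graph. The following is only a minimal Python sketch of that idea. The toy genome, the three reads, the choice of k = 4 and the helper name `kmers` are all assumptions made for illustration and do not come from the figure.

```python
def kmers(sequence, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Hypothetical toy data: three error-free reads copied from the genome,
# each overlapping its neighbour by at least k - 1 bases.
genome = "ATGGCGTGCAATG"
reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
k = 4

genome_kmers = kmers(genome, k)
read_kmers = set().union(*(kmers(read, k) for read in reads))

# Without sequencing errors, reads never introduce a k-mer absent from the genome.
print("reads add no new k-mers:", read_kmers <= genome_kmers)
# With enough overlapping reads, the reads recover every k-mer of the genome.
print("reads cover all genome k-mers:", read_kmers == genome_kmers)
```

Because the k-mer sets agree, a graph built from these reads alone would have the same nodes and edges as one built from the genome itself.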

Continue reading De Bruijn Graphs – III

Highlights - 7/30/2011

In the highlights section, I plan to add papers and news that I feel compelled to bookmark. When I manage to read those papers or try out their algorithms, there will be a longer post. Please feel free to suggest any interesting links for our highlights section. Even an interesting blog post or SEQanswers thread will be good for us.

1. BGI plans to offer bioinformatics cloud service

“The world’s largest DNA-sequencing outfit is looking to the clouds. The BGI (formerly the Beijing Genomics Institute) this month announced plans to roll out its cloud computing capabilities, which it hopes will help it to dominate bioinformatics in the same way it does the world of sequencing.”

This is an impressive move by BGI. The day I started to use the Amazon cloud, I felt there would be a need for a cloud service customized for bioinformatics applications.

Continue here

Friday, July 29, 2011

De Bruijn graphs - II

In the previous post, we discussed how de Bruijn graphs can be constructed for a genome or a large sequence. Today we will explain why this method is so popular for genome or transcriptome assembly using short reads. We will also explain why traditional short-read genome assemblers, such as Velvet or SOAPdenovo, cannot be directly applied to transcriptomes.

Any genome can be converted into a de Bruijn graph, as shown in the previous post. The graph may be large or small depending on how big the genome is, but its essential features are similar for all genomes.

Let’s say we have a genome whose de Bruijn graph looks like the following figure -

Rest here

Thursday, July 28, 2011

De Bruijn graphs - I

New algorithms for short read assembly (categories B and D) often use de Bruijn graphs to store and represent sequence data. What is a de Bruijn graph and why is it so popular for analyzing short read sequences? We will explain the concept here.

A de Bruijn graph is an efficient way to represent a sequence in terms of its k-mer components. Although de Bruijn graphs can be used for a broad range of problems, our discussion will be limited to nucleotide sequences. Most papers talk about constructing de Bruijn graphs from short reads and deriving the genome sequence from the graph. For simplicity, we will first introduce the de Bruijn graph of a genome, and then explain how short reads fit into the picture.
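To make the k-mer decomposition concrete, here is a minimal Python sketch. It assumes one common convention, in which each node is a (k-1)-mer and each k-mer in the sequence contributes one edge from its prefix to its suffix. The posts themselves may draw the graph differently, for example with k-mers as nodes. The function name `de_bruijn_graph` and the toy sequence are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Build a de Bruijn graph as {(k-1)-mer: [successor (k-1)-mers]}.

    Assumed convention: every k-mer in the sequence adds one edge from
    its first k-1 bases to its last k-1 bases.
    """
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy sequence, chosen so that two nodes have more than one successor.
graph = de_bruijn_graph("ATGGCGTGCA", k=3)
for node, successors in graph.items():
    print(node, "->", ", ".join(successors))
```

Nodes with more than one outgoing edge, such as `TG` and `GC` here, are where a (k-1)-mer occurs more than once in the sequence.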

A de Bruijn graph can be constructed for any sequence, short or long. Here is a simple example -

Continue reading De Bruijn graphs – I

Monday, July 25, 2011

Highlights - 7/25/2011

Each day, we will present a short summary of interesting papers. Some of these will be considered in further detail later, when time permits. Please feel free to suggest any good paper you come across.

Trinity – Full-length transcriptome assembly from RNA-Seq data without a reference genome - This is an excellent Category D algorithm that we have been using for the last few months on our RNA-Seq data, with very satisfactory results. The manuscript just came out in the print version of Nature Biotech, but the electronic version had been available on their website for a few months. Previously we had been struggling with Velvet+Oases, which consumed a large amount of memory. The memory handling of Trinity is excellent.

Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads

More here

Algorithms for Next-gen Sequence Analysis

The field of next-gen sequence analysis is advancing so rapidly that new algorithms come out almost every day. Here we provide a broad categorization for such algorithms and describe critical challenges for each category. This will help us understand the approaches presented in various algorithms, when we look into each one in more detail.

Usually a scientist submits DNA samples for an organism to a core facility for sequencing and receives a large disk full of sequence data. The first categorization comes from whether the organism has a reference sequence. The second categorization comes from whether the sample is genomic or transcriptomic.

As a 2×2 matrix, the categories are

More here

A beginner's guide to bioinformatics - part II

In part I, I described Layers 1-3 of learning bioinformatics for solving biological problems. Let me cover two expert levels here.

Layer 4 – High level coding in C/C++, Java for implementing existing algorithms or modifying existing codes for new functionality

Those who try to be on the cutting edge of solving biological problems often reach the limits of layers 1-3 very quickly. This is even more important these days, because biology is getting very data intensive. Let me give a few examples -

i) Let’s say a lab invested in an ABI SOLiD sequencing machine, but a new paper on aligning sequences came out that does not cover SOLiD color-space data yet. Should the lab throw away its sequencing instrument and buy something new?

More here

A beginner's guide to bioinformatics - part I

“What topics should I study to learn bioinformatics?” – I often get asked this question by biology students and sometimes even biology professors. The answer depends on what you want to do, but how would someone who is new to using computers for biology know what can be done? To add to our difficulties, technology is changing so rapidly that even experts do not know what software tools will remain useful next year.

Keeping the above complications in mind, I created a general framework to describe one’s abilities in bioinformatics that should be valid even with changing technologies. Another reason for creating this framework is to fit our new posts into one category or another for easy description.

From a top level, I prefer to divide the levels of expertise in bioinformatics into five layers with 5 being the most difficult.

Layer 1 – Using web to analyze biological data
Layer 2 – Ability to install and run new programs
Layer 3 – Writing own scripts for analysis in Perl, Python or R
Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java

More here
