Now that we have contigs assembled from short Illumina reads aligned onto long PacBio reads, the question of which one to trust often pops up in our minds. Let us explain the issues more clearly.

-——————————————————————-

A. Often we come across regions where three quarters of a long Illumina contig matches very well with the PacBio read (after allowing for the PacBio error rate, i.e. roughly 85% accuracy), but the remainder of the contig is not seen anywhere nearby.

Possibilities:

Illumina is correct.

One can make a case that the Illumina contig is built from hundreds of short overlapping regions, and therefore the Illumina contig is more accurate.

PacBio is correct.

One can also argue that the particular genomic region differs between the two chromosomes, and the PacBio read is capturing a different chromosome from the one assembled from Illumina. Perhaps the chromosomal region has a large insertion/deletion.
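To make case A concrete, here is a minimal sketch of how one might flag contigs that align only partially to a PacBio read. The function names and the 0.5 and 0.9 thresholds are our own assumptions for illustration, not part of any aligner; the intervals are the aligned stretches on the contig, taken from whatever alignment tool is in use.

```python
def aligned_fraction(contig_len, intervals):
    """Fraction of contig positions covered by at least one aligned interval.

    intervals: list of (start, end) pairs, 1-based inclusive, on the contig.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        start = max(start, last_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / contig_len


def is_partial_match(contig_len, intervals, low=0.5, high=0.9):
    """True when a sizeable part of the contig aligns but a sizeable part does not.

    The low/high cutoffs are arbitrary illustrative values.
    """
    frac = aligned_fraction(contig_len, intervals)
    return low <= frac < high
```

For example, a 1000 nt contig with only positions 1 to 750 aligned gives a fraction of 0.75 and would be flagged as a case A candidate.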

-——————————————————————-

B. We also come across regions where the Illumina contig matches the PacBio read closely, but has a large gap inside. The gap is usually filled with homopolymers.

Possibilities:

Once again, one can argue for both possibilities mentioned in A.
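The homopolymer-filled gaps described in case B are easy to look for in a read. Below is a small sketch that reports long single-base runs; the function name and the default cutoff of 10 bases are assumptions chosen only for illustration.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=10):
    """Return (base, start, length) for each run of one base at least min_len long.

    start is a 0-based offset into seq. min_len=10 is an arbitrary cutoff.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((base, pos, length))
        pos += length
    return runs


# Toy sequence, not real data: a 12-base T run and a 15-base C run.
toy_read = "ACG" + "T" * 12 + "GA" + "C" * 15 + "A"
toy_runs = homopolymer_runs(toy_read)
```

Running it on the PacBio side of a gapped alignment would show whether the extra sequence is essentially one long run of a single base.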

-——————————————————————-

C. The third case of ambiguity is multiple copies of the same Illumina contig matching a single PacBio read.

Possibilities:

PacBio is correct.

By its design, k-mer based de Bruijn graph assembly compresses duplicated regions into one block. Therefore, the contig assembly method used for Illumina reads is incapable of resolving tandem repeat regions.
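The collapsing of tandem repeats can be seen with a toy example. The sketch below is not an assembler; it only compares the sets of distinct k-mers (the nodes of a k-mer graph) for two made-up sequences that differ in repeat copy number, and it ignores coverage information. The repeat unit, the flanks and k = 6 are all hypothetical.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in seq, i.e. the nodes of a k-mer graph."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


unit = "ACGTTGCA"                      # hypothetical 8-nt repeat unit
left, right = "GGATCCAT", "TTAGGCCA"   # hypothetical unique flanks

three_copies = left + unit * 3 + right
five_copies = left + unit * 5 + right

k = 6
# Both sequences produce exactly the same k-mers, so the k-mer content
# alone cannot tell three copies from five.
same_nodes = kmer_set(three_copies, k) == kmer_set(five_copies, k)
```

Here `same_nodes` is true even though the two sequences differ by 16 nt, which is why a long read spanning the whole repeat carries information that the k-mer set alone does not.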

Illumina is correct.

PacBio technology circularizes the chromosomal fragments and then reads them over and over again. Therefore, the raw PacBio reads have multiple copies of the same chromosomal region, but the initial processing step splits them into different reads. It is possible that the processing step may have missed a few circularized junctions.

-—————————————————————————-

Yesterday, we took time to meticulously work through a few cases to understand what is going on, and we will share the examples here. They are anecdotal cases rather than a systematic analysis of the entire data set, but they illustrate the points mentioned in A, B and C and should help you appreciate the issues.

Error Case B.

Here are two examples of a large homopolymer difference between long PacBio reads and assembled Illumina contigs. In both alignments, the sequence with 'len' in the name is the assembled contig and the other sequence is the PacBio read.

case1

case2

Notes.

1. In both cases, PacBio had large insertions, not deletions, compared to the Illumina contig.

2. We manually checked about 15 alignments and found only two examples of such homopolymer expansion.

3. The genome is highly repetitive, and therefore such long stretches of Ts and Cs are not unexpected.

4. We BLASTed both the Illumina contig and the PacBio read against all other PacBio reads. We did not see a single match of the PacBio region with any other read, whereas we found several hits with the assembled Illumina contig. That makes us believe that the assembly from Illumina is correct and PacBio is wrong here. We are not seeing an effect of differences between the two copies of the chromosome.
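The check in note 4 boils down to counting how many other sequences each query hits. A minimal sketch of that count is given below, assuming BLAST was run with tabular output (`-outfmt 6`) and its default 12 columns. The function name, the identity and length cutoffs, and the sequence names in the toy input are assumptions for illustration, not values we used.

```python
import csv
from collections import defaultdict


def hits_to_other_sequences(blast_tsv_lines, min_identity=80.0, min_length=200):
    """Count distinct subject sequences hit by each query, ignoring self-hits.

    Expects BLAST tabular output (-outfmt 6) with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    The cutoffs are arbitrary illustrative values.
    """
    subjects = defaultdict(set)
    for row in csv.reader(blast_tsv_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if qseqid == sseqid:
            continue  # a sequence matching itself tells us nothing
        if pident >= min_identity and length >= min_length:
            subjects[qseqid].add(sseqid)
    return {q: len(s) for q, s in subjects.items()}


# Toy input with made-up names and numbers, only to show the shape of the data.
toy_lines = [
    "contigA\tread1\t88.5\t900\t50\t20\t1\t900\t100\t1000\t0.0\t800",
    "contigA\tread2\t86.0\t850\t60\t25\t1\t850\t200\t1050\t0.0\t700",
    "contigA\tread1\t85.0\t300\t30\t10\t950\t1250\t10\t310\t1e-50\t200",
    "readX\treadX\t100.0\t5000\t0\t0\t1\t5000\t1\t5000\t0.0\t9000",
    "readX\tread3\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-5\t60",
]
toy_counts = hits_to_other_sequences(toy_lines)
```

Queries with no qualifying hit do not appear in the result, so `toy_counts.get("readX", 0)` is the safe way to read a count. In practice the lines would come from an open file handle on the BLAST output.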

We checked with Jason Mentum whether he sees similar differences while doing genome assembly.

Capture

Capture2

Edit. Here is the beauty of Twitter. We can use our collective brains to solve problems.

Capture3

-—————————————————————

Error Case A.

The contig '7117__len__1384' assembled from Illumina is 1384 nt long and it matches nine PacBio reads longer than 3 kb (IDs: 10, 11, 112003, 491553, 655988, 772698, 1069707, 1744092, 2033153). We tried to align those PacBio reads on top of each other to build a consensus and observed something puzzling.

1. Three PacBio reads (772698, 11(R), 491553(R)) aligned well with each other over their entire length. What about the rest?

2. To find out what is going on with the remaining five reads, we aligned each of them individually with the longest PacBio read (772698) from the above group of four.

i) Half of read #10 aligned well with #772698 and the remaining half did not. You can see the 'good alignment-bad alignment' boundary in the following image.

Capture

What does it mean? Perhaps we have a chimeric PacBio read, or heterozygosity, or something else altogether. We took the part of read #10 (~2500 nt) that did not align well and compared it against all PacBio reads and all assembled Illumina contigs using BLAST (blastn). This comparison did not give a single hit, which suggests that the 2500 nt non-matching half is not part of the genome. Clearly, this was an instance of sequencing error.

ii) Two other PacBio reads, #1069707 and #655988, showed a similar pattern of partial matching with #772698, and we initially assumed that they were also bad sequences. Interestingly, those two reads aligned well with each other. So, you see a nice clustering of three reads (772698, 11(R), 491553(R)) aligning with each other and two others (1069707 and 655988) aligning with each other, but the two groups have only 50% overlap.

We also found that the non-matching halves of #1069707 and #655988 aligned well with an Illumina-assembled contig, 122456__len__2371. Therefore, unlike #10, these two are legitimate reads and represent some kind of complex structure within the genome.

iii) The remaining two PacBio reads aligned well with #772698 at their two ends, but had large gaps or mismatches in the middle. An example is presented below.

Capture

Does the above case represent sequencing error, or genomic complexity? We are yet to figure it out.
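The grouping we did by hand in i) and ii) amounts to finding connected groups of reads that align with each other over their full length. Here is a small sketch of that bookkeeping using a union-find structure. The function name is ours, and the pairs listed are only the full-length agreements described above; deciding what counts as "aligned well" is left to the alignment step and is not shown.

```python
def cluster_reads(read_ids, aligned_pairs):
    """Group reads into connected components of the 'aligns over full length' relation."""
    parent = {r: r for r in read_ids}

    def find(r):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in aligned_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for r in read_ids:
        groups.setdefault(find(r), []).append(r)
    # Largest group first, each group sorted by read ID.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Read IDs from the post; pairs are the full-length agreements noted in 1 and ii).
reads = [10, 11, 491553, 655988, 772698, 1069707]
pairs = [(772698, 11), (772698, 491553), (1069707, 655988)]
clusters = cluster_reads(reads, pairs)
```

With these inputs the result is one group of three reads, one group of two, and read #10 on its own, which matches the picture described above.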

As you can see, a good hybrid genome assembler needs to resolve many complex situations like the ones above to properly assemble the genome. We did not pick 7117__len__1384 through any systematic process to demonstrate assembly complexity. Rather, it was the first contig we randomly selected to investigate in detail, and it may represent an average instance of the complex problem at hand.

Many large eukaryotic genomes were assembled from Illumina-only reads, and it is unlikely that they had complex situations like the ones above properly resolved. Will it be necessary to revisit all those genome projects, or is the assembly we are currently involved in unusually complex?