Molecular biologists have reported the complete sequence of the human genome in a special issue of the journal Science. In an earlier version of the genome, which appeared in 2001, about 8 percent of the sequence remained undeciphered, mainly non-coding regions and the central and terminal regions of chromosomes.
Six articles (1, 2, 3, 4, 5, 6) are devoted to the results of the project. The complete genome allows individual genetic differences between people to be identified more accurately and could become a new standard in genetics, even though it still lacks an entire chromosome.
Sequencing Human DNA: The Foundations
In 2000, the Human Genome Project and Craig Venter’s company Celera Genomics announced that they had finished sequencing human DNA. In 2001, they published draft versions of the assembly, and by 2003 they had combined their efforts and results to assemble a single finished copy.
It became the first standard, or reference genome, against which everyone who sequenced new human genomes or searched for the genetic causes of diseases checked their results. However, the work of reading human DNA did not end there.
The first version of the human genome was not complete
The authors of the first version of the human genome made no secret of the fact that it was far from complete. For example, it still contained 341 gaps. In addition, the researchers focused on euchromatin, the fraction of DNA that is usually loosely packed in the cell and from which information can be read.
Thus, the first version of the genome did not include many regions of heterochromatin, the tightly packed fraction of DNA. It consists mainly of sequences that do not code for proteins but perform various auxiliary and structural (and often not fully understood) functions, so they can also affect the life and work of the cell.
In the first version of the genome, it was also not entirely clear which genes and non-coding regions were responsible for what. Projects such as ENCODE are working to clarify this.
Finally, the reference genome did not fully capture human genetic diversity, even though it was pieced together from DNA fragments of several dozen people. Other projects, such as the 1000 Genomes Project, have been undertaken to fill these gaps.
Since then, the genome has been refined repeatedly, and several updated references have appeared. The latest, GRCh38.p13, was published in 2019. But even it contained many blank spots: regions where the letter N stood in place of nucleotides, or where placeholder sequences were substituted.
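For readers who work with sequence files, the idea of a blank spot can be illustrated with a minimal Python sketch. It scans a sequence for runs of the letter N and reports where each run starts and ends. The toy sequence and the function name find_gaps are made up for illustration and are not taken from GRCh38.

```python
import re

def find_gaps(sequence):
    """Return (start, end) positions of every run of N, with end exclusive."""
    return [m.span() for m in re.finditer(r"N+", sequence.upper())]

# Toy sequence for illustration only, not real genomic data.
toy = "ACGTNNNNACGTNNAC"
gaps = find_gaps(toy)

# Total number of positions where the base is unknown.
unknown_bases = sum(end - start for start, end in gaps)
print(gaps, unknown_bases)
```

In a gapless assembly, a scan like this would come back with an empty list.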
Scientists sequenced the complete human genome
The Telomere-to-Telomere (T2T) Consortium, named after the telomere, the terminal section of a chromosome, set out to fill in the missing parts of the human genome.
It brought together scientists from 54 institutes and laboratories in different countries, and the result of their work was the first truly complete assembly of the genome, which they described in six articles in the journal Science.
Everything we know about the CHM13 genome
The first article presents the new assembly: the authors describe the methods they used and summarize their work. The new genome was named CHM13 after the cell line that served as the DNA donor.
This cell line comes from a hydatidiform mole, an unusual human tumor that arises when a fertilized egg for some reason loses its maternal chromosomes.
A hydatidiform mole is convenient because its genome often consists of a doubled set of the chromosomes brought in by the sperm. This means that both copies of each chromosome should be almost identical (apart from point mutations and accidental breaks), so during sequencing there is no need to work out which copy a given region belongs to.
The CHM13 assembly differs from its predecessors in sequencing technology. Previous versions of the genome were assembled from many short sequences: the DNA was first broken into small fragments, each fragment was read separately, and the reads were then overlapped to reconstruct the whole.
But this method is not suitable for heterochromatin, which contains many repetitive sections whose location and copy number are easy to get wrong (for example, some ribosomal RNA genes in humans can have 300-400 copies).
Therefore, the T2T Consortium used long-read sequencing: they broke the DNA into long fragments and read each one in its entirety.
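Why repeats defeat short reads can be shown with a small Python sketch. It is a deliberately idealized toy, not the consortium's pipeline: the sequences, read lengths, and the helper name reads are all invented for illustration, the reads are error-free, and coverage depth (how many times each read is seen) is ignored. Two toy genomes differ only in how many copies of a 4-letter repeat they carry.

```python
def reads(genome, length):
    """Return the set of all distinct substrings of the given length
    (an idealized, error-free stand-in for sequencing reads)."""
    return {genome[i:i + length] for i in range(len(genome) - length + 1)}

# Hypothetical toy sequences: unique flanks around a tandem repeat.
LEFT, RIGHT, UNIT = "TTGCATGCCA", "GGATCCTAGA", "ACGT"
genome_a = LEFT + UNIT * 3 + RIGHT  # 3 copies of the repeat
genome_b = LEFT + UNIT * 5 + RIGHT  # 5 copies of the repeat

# Reads shorter than the repeat array never see both flanks at once,
# so both genomes produce exactly the same set of short reads.
short_same = reads(genome_a, 6) == reads(genome_b, 6)

# Reads long enough to span the whole array differ between the genomes.
long_same = reads(genome_a, 30) == reads(genome_b, 30)

print("short reads identical:", short_same)
print("long reads identical:", long_same)
```

With 6-letter reads the two genomes are indistinguishable in this toy, while 30-letter reads tell them apart. Real assemblers also use read counts and paired reads, so the picture here is simplified.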
As a result, CHM13 comprises 3,054,815,472 bp of nuclear DNA and 16,569 bp of mitochondrial DNA. Of these, 182 million base pairs are entirely new: they were absent from the previous 2019 genome assembly. The authors note that this genome has no gaps and no nucleotides left unplaced: it is complete.
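To put these figures in proportion, here is a short Python calculation using only the numbers quoted above. The value for the new sequence is the rounded "182 million" from the text, so the result is approximate; the variable names are ours.

```python
nuclear_bp = 3_054_815_472  # nuclear DNA in CHM13, as quoted above
mito_bp = 16_569            # mitochondrial DNA, as quoted above
new_bp = 182_000_000        # "182 million", a rounded figure from the text

total_bp = nuclear_bp + mito_bp
new_share_percent = 100 * new_bp / total_bp

print(f"total: {total_bp:,} bp")
print(f"new sequence: {new_share_percent:.2f}% of the assembly")
```

By this estimate, the newly added sequence makes up just under 6 percent of the assembly.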
The vast majority of the new sections are non-coding DNA, mostly centromeric (that is, from the middle of the chromosomes, where sister chromatids are joined in the characteristic X shape during cell division).
The researchers also found new genes, 1,956 in total. Of these, they estimate, about a hundred code for proteins (the rest may code for various types of RNA or not function at all).
What did the other five articles include?
The remaining five articles in this issue are devoted to individual in-depth studies within the project. One of them describes centromeres: their diversity, structure, and evolution.
Another deals with repeats in the genome: the authors searched among them for retrotransposons (mobile genetic elements that can move around the genome or insert new copies of themselves into it), including active ones.
A third covers segmental duplications, long stretches present in a few copies, which likely played a role in primate evolution. A fourth presents a methylation map of the newly sequenced regions.
Finally, another article is devoted to practical applications of the new genome. Its authors tested how well the CHM13 assembly works as a reference for comparing individual genomes and finding sequence variants.
To do this, they used the database of the 1000 Genomes Project and, by comparing its sequences with CHM13, found more than a million genetic variants that had not shown up in comparison with the GRCh38 assembly.
Therefore, the members of the consortium proposed designating CHM13 as the new standard for genetic and genomic research.
But the decoding of the human genome will not end there either. CHM13 has its flaws: for example, the assembly has no Y chromosome.
This is because hydatidiform mole cells carry two identical copies of each chromosome, and the YY genotype is not viable. The Y chromosome will therefore have to be assembled separately.
In addition, CHM13 is not a composite genome built from the cells of different people, as previous assemblies were, but the genome of a single cell line.
Therefore, the Consortium will have to assemble other genomes so that the standard reflects not only the complete DNA sequence but also its different variants.
*All sources are linked at the start of the article.