How Big and Vast is ‘Omics’?
The first omics technology, genomics, appeared in the literature in 1987. However, approximately 15 years elapsed before the full human genome sequence was published, marking the beginning of the so-called “post-genomic era,” which inspired the development of other omics technologies. Continued research has since produced further omics technologies, including transcriptomics, proteomics, metabolomics, and, most recently, spatial transcriptomics. These technologies, along with advances in bioinformatics, computational approaches, and artificial intelligence, enable detailed investigation of complex biological processes with greater sensitivity and resolution.
What Is the Data Size We Are Looking At?
A single human genome is roughly 3 billion base pairs, which translates to around 30 GB of data when sequenced and stored. An estimated 40 exabytes will be needed to store the genome-sequence data generated worldwide by 2025.
Here is a visualization of 40 exabytes of data, represented as a sea with 40 cargo ships, each symbolizing 1 billion gigabytes (1 exabyte) of data.
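To make the scale concrete, the figures quoted above can be put through a quick back-of-the-envelope calculation. The sketch below is only illustrative: it assumes decimal (SI) units, takes the 30 GB per genome and 40 exabyte figures at face value, and the function names are made up for this example.

```python
# Back-of-the-envelope storage arithmetic using the figures quoted in the text.
# Assumption: decimal (SI) units, so 1 exabyte = 1 billion gigabytes.

GB_PER_EXABYTE = 10**9      # one "cargo ship" in the visualization
GB_PER_GENOME = 30          # per-genome storage figure quoted in the text
PROJECTED_EXABYTES = 40     # worldwide estimate quoted in the text


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert exabytes to gigabytes (decimal units)."""
    return exabytes * GB_PER_EXABYTE


def genomes_that_fit(total_gb: int, gb_per_genome: int = GB_PER_GENOME) -> int:
    """Number of whole genomes that fit in total_gb of storage."""
    return total_gb // gb_per_genome


total_gb = exabytes_to_gigabytes(PROJECTED_EXABYTES)
n_genomes = genomes_that_fit(total_gb)

print(f"{PROJECTED_EXABYTES} EB = {total_gb:,} GB")
print(f"Roughly {n_genomes:,} genomes at {GB_PER_GENOME} GB each")
```

Under these assumptions, 40 exabytes corresponds to a little over 1.3 billion genomes stored at 30 GB each. This is only an equivalence of scale, since real repositories hold data in many formats and sizes.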
Year of Coinage for Omics Terms
Genomics: Coined in 1986 by Tom Roderick at the Jackson Laboratory during discussions about the Human Genome Project.
Proteomics: Introduced in 1997 by Marc Wilkins to describe the study of the proteome, which includes all proteins produced by an organism.
Transcriptomics: The term was coined by Charles Auffray in 1996. It refers to the study of transcriptomes, the complete set of RNA transcripts produced by the genome.
Metabolomics: The term “metabolome” was introduced in 1998, creating the basis for the field of metabolomics, which pertains to the study of metabolites and metabolic processes within organisms.
Lipidomics: As a branch of metabolomics, lipidomics was first introduced in 2003.
Microbiomics: Knowledge of the human microbiome expanded appreciably after 2007, the launch year of the Human Microbiome Project (HMP), a five-year international effort to characterize the microbial communities found in the human body.
Phosphoproteomics: The term phosphoproteomics was coined by Larsen et al. in 2001. However, Ficarro et al. pioneered the large-scale analysis of protein phosphorylation sites in 2002.
Epigenomics: No specific date of coinage is recorded for the term “epigenomics,” but it emerged alongside the field of epigenetics, a term first introduced by biologist Conrad Waddington in 1942. Epigenomics as we know it today dates from about 2007 onwards, driven by the advent of tools and technologies such as DNA microarrays and ChIP-Seq.
Spatial Transcriptomics: The modern concept of spatial omics, particularly spatial transcriptomics, was first introduced by Ståhl et al. in 2016, but became widely available after the launch of the Visium platform by 10x Genomics in 2022.
Multiomics: Multi-omics was first referenced in 2002. The number of scientific publications in the field more than doubled between 2022 and 2023.
The Rise of Biobanks
In the past decade, biobanks, defined as large collections of biological, medical, and genetic data on the same individuals, have revolutionized human genomics research by deepening our understanding of the complex relationships between genomes and phenomes at different organizational levels (e.g. tissues, individuals, and populations). Although the biobank model is becoming the new standard for data collection, substantial heterogeneity persists between biobanks across the world, creating both new challenges and new opportunities for data analysis.
Some of the top genomic biobanks globally include the UK Biobank, Estonian Biobank, FinnGen Research Project, BioBank Japan (BBJ), China Kadoorie Biobank (CKB), Taiwan Biobank, Tohoku Medical Megabank Project, “All of Us” biobank, Shanghai Zhangjiang Biobank, and the Mayo Clinic Biobank. The UK Biobank, which is often considered the most widely used and prominent in genomic research, holds 500,000 samples, and its data size is expected to increase by 40 petabytes by 2025.
Omics Literature Explosion
Since 2000, the number of articles published across all omics domains has increased significantly. In 2024, 102,000 articles about genomics were published in PubMed alone. Overall, PubMed lists 3,146,507 papers about omics.
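Counts like these can be checked against PubMed through the NCBI E-utilities `esearch` endpoint. The following is a minimal sketch, not the method used to obtain the numbers above: the search term, the publication-date filter, and the helper names are assumptions for illustration, and a different query will return different counts.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term: str, year: Optional[int] = None) -> str:
    """Build an esearch URL that asks PubMed how many records match `term`.

    retmax=0 requests no record IDs, only the summary that includes the count.
    If `year` is given, results are limited to that publication year.
    """
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": "0"}
    if year is not None:
        params.update(
            {"datetype": "pdat", "mindate": str(year), "maxdate": str(year)}
        )
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(payload: str) -> int:
    """Extract the hit count from an esearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


def pubmed_count(term: str, year: Optional[int] = None) -> int:
    """Query PubMed (requires network access) and return the number of hits."""
    with urlopen(build_count_url(term, year), timeout=30) as response:
        return parse_count(response.read().decode("utf-8"))


# Example usage (hypothetical query; needs network access):
# print(pubmed_count("genomics", year=2024))
```

The URL building and response parsing are kept separate from the network call so that the query logic can be inspected or tested offline. For anything beyond occasional use, NCBI's usage guidelines on request rates and API keys apply.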
The Impact of Public Omics Data
The amount of omics data in the public domain is increasing every year. Innovative solutions for data management, data sharing, and the discovery of novel datasets are therefore increasingly required. A growing number of biological databases archive, integrate, and share different types of biological data, often with value-added curation. Database Commons is a manually curated catalog of worldwide biological databases that has been frequently updated and enriched since its inception in 2015. It lists a total of 6,933 biological databases from 10,417 publications, developed by 2,309 institutions across 80 countries/regions.
In the coming years, omics data will grow even more quickly. As the application of data science and artificial intelligence in biology develops, these data could be of even greater value in their secondary use. One such success is AlphaFold, an artificial intelligence program that demonstrates the revolutionary potential of reusing existing data and metadata: protein sequences from UniProt and experimentally confirmed protein structures from the Protein Data Bank. Because omics data is generated at high throughput, it is well suited to machine learning (ML) applications, allowing complex patterns to be studied and models to be trained reliably.