Bioinformatics Glossary
Your essential dictionary for navigating the complex terminology of computational biology.
Accession Number
A unique identifier assigned to a sequence record in databases like GenBank or RefSeq (e.g., M12345, AC123456).
Algorithm
A step-by-step procedure or set of rules for solving a problem or making a calculation, commonly used in bioinformatics for sequence alignment and data analysis.
Alignment
The process of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships.
Annotation
The process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.
Assembly
The process of putting together small fragments of DNA sequences into a long, continuous sequence or a complete genome.
Base Pair (bp)
Two nitrogenous bases joined by hydrogen bonds; the unit of DNA length. Adenine pairs with Thymine (A-T) and Guanine pairs with Cytosine (G-C).
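The pairing rule above is what makes it possible to derive one DNA strand from the other. A minimal sketch, assuming an uppercase or lowercase DNA string containing only A, C, G, and T (the function name `reverse_complement` is illustrative, not taken from any library):

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    # Complement every base, then reverse the string.
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```

Characters other than A, C, G, and T are passed through unchanged in this sketch; a production tool would need to decide how to handle ambiguity codes such as N.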
Bioinformatics
An interdisciplinary field that develops methods and software tools for understanding biological data, particularly when the data sets are large and complex.
BLAST (Basic Local Alignment Search Tool)
A popular algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences.
Central Dogma
The framework for understanding the transfer of sequence information. It states that DNA makes RNA, and RNA makes protein.
Clustering
The grouping of similar data points (e.g., gene expression profiles) into clusters, often used to find patterns in high-throughput data.
Codon
A sequence of three nucleotides which together form a unit of genetic code in a DNA or RNA molecule.
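As a small illustration, a sequence can be read as codons by cutting it into consecutive triplets. This sketch assumes the reading frame starts at the first character, and the helper name `split_codons` is hypothetical:

```python
def split_codons(seq: str) -> list[str]:
    """Split a sequence into non-overlapping triplets, dropping an incomplete tail."""
    usable = len(seq) - len(seq) % 3  # ignore 1 or 2 leftover bases at the end
    return [seq[i:i + 3] for i in range(0, usable, 3)]

print(split_codons("ATGGCCTA"))  # ['ATG', 'GCC']
```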
Contig
A set of overlapping DNA segments that together represent a consensus region of DNA.
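The idea of overlapping segments can be sketched by joining two reads where the end of one matches the start of the other. This is a toy example under strong assumptions (exact matches only, no sequencing errors, hypothetical function names); real assemblers use far more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no overlap of at least min_len bases

def merge(a: str, b: str) -> str:
    """Join two reads, writing the overlapping bases only once."""
    return a + b[overlap(a, b):]

print(merge("ACGTAC", "TACGGA"))  # ACGTACGGA
```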
Coverage
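The number of times a specific nucleotide is sequenced in a sequencing experiment, usually reported as an average across the genome (e.g., 30x coverage means each position is covered by about 30 reads on average). Also called sequencing depth.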
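Coverage is commonly estimated as total sequenced bases divided by genome size. The sketch below shows that arithmetic, plus a direct count of reads over one position; the function names and the (start, end) read representation are assumptions made for illustration.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected average coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

def depth_at(position: int, reads: list[tuple[int, int]]) -> int:
    """Count reads covering a 0-based position; each read is (start, end), end exclusive."""
    return sum(1 for start, end in reads if start <= position < end)

# 600 million reads of 150 bp over a 3 Gb genome
print(mean_coverage(600_000_000, 150, 3_000_000_000))  # 30.0
print(depth_at(5, [(0, 10), (3, 8), (6, 12)]))          # 2
```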
Data Mining
The computational process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
De Novo Assembly
Assembling short reads to create full-length sequences without using a reference genome.
DNA (Deoxyribonucleic Acid)
The molecule that carries the genetic instructions used in the growth, development, functioning, and reproduction of all known living organisms.
E-value (Expect Value)
A parameter in BLAST searches that describes the number of hits one can expect to see by chance when searching a database of a particular size.
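The dependence on database size can be made concrete with the standard relationship between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n is the database length. The sketch below is a simplification: it uses raw lengths, whereas BLAST itself applies corrected "effective" lengths, and the function name is hypothetical.

```python
def e_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score.

    Simplified: uses raw lengths rather than BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# The same score is less significant in a larger database.
small_db = e_value(40, 300, 1_000_000)
large_db = e_value(40, 300, 1_000_000_000)
print(small_db, large_db)
```

In this model, each additional bit of score halves the E-value, and a database 1,000 times larger yields an E-value 1,000 times higher for the same hit.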
Exon
A segment of a DNA or RNA molecule containing information coding for a protein or peptide sequence.
Expression
The process by which information from a gene is used in the synthesis of a functional gene product (RNA or protein).
Gene
A distinct sequence of nucleotides forming part of a chromosome, the order of which determines the order of monomers in a polypeptide or nucleic acid molecule.
Genome
The haploid set of chromosomes in a gamete or microorganism, or in each cell of a multicellular organism; the complete set of genes or genetic material present in a cell or organism.
Genomics
The branch of molecular biology concerned with the structure, function, evolution, and mapping of genomes.
Genotype
The genetic constitution of an individual organism.
Hidden Markov Model (HMM)
A statistical model used to describe the evolution of observable events that depend on internal factors, which are not directly observable. Widely used for sequence analysis (e.g., gene finding).
High-Throughput Sequencing
Technologies that sequence DNA and RNA much more quickly and cheaply than the previously used Sanger sequencing, revolutionizing the study of genomics.
Homology
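Shared evolutionary ancestry between protein or nucleic acid sequences, whether in the same or different species. Homology is usually inferred from sequence similarity, but the two are not the same: sequences either are or are not homologous, whereas similarity is a matter of degree.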
Indel
A molecular biology term for an insertion or deletion of bases in the genome of an organism.
Intron
A segment of a DNA or RNA molecule which does not code for proteins and interrupts the sequence of genes.
Machine Learning
The study of computer algorithms that improve automatically through experience and by the use of data. Essential for predictive modeling in bioinformatics.
Metagenomics
The study of genetic material recovered directly from environmental samples.
Motif
A nucleotide or amino-acid sequence pattern that is widespread and has, or is presumed to have, biological significance.
Mutation
A change in the nucleotide sequence of a gene or genome, resulting in a variant form that may be transmitted to subsequent generations.
Next-Generation Sequencing (NGS)
Also known as high-throughput sequencing, the catch-all term used to describe a number of different modern sequencing technologies.
Nucleotide
The basic building block of nucleic acids (DNA and RNA).
Omics
An informal term for the fields of biology whose names end in -omics, such as genomics, proteomics, or metabolomics. Each studies the complete set of some class of molecules in a cell or organism.
Open Reading Frame (ORF)
A continuous stretch of codons in a reading frame, typically beginning with a start codon and ending at a stop codon, that has the potential to be translated into protein.
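One common way to look for candidate ORFs is to scan each reading frame for a start codon (ATG) followed in-frame by a stop codon. The sketch below does this for the forward strand only, under the assumption of a plain DNA string and the standard stop codons; `find_orfs` and `min_codons` are illustrative names, not part of any particular tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) positions of ATG...stop stretches on the forward strand.

    Positions are 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts codons before the stop (the ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGTGA"))  # [(2, 8)]
```

A fuller search would also scan the reverse complement, which covers the other three reading frames.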
Orthologs
Genes in different species that evolved from a common ancestral gene by speciation.
Paralogs
Genes related by duplication within a genome.
Phylogenetics
The study of the evolutionary history and relationships among or within groups of organisms.
Proteomics
The large-scale study of proteins.
Read
A sequence of nucleotides generated from a single DNA fragment in a sequencing experiment.
Reference Genome
A digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes.
RNA-Seq
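A technique that uses high-throughput sequencing to identify and quantify the RNA present in a sample at a given moment, used to analyze the continuously changing cellular transcriptome.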
Sequence Alignment
A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity.
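A classic way to score such an arrangement is dynamic programming over all possible pairings of positions, as in global (Needleman-Wunsch style) alignment. The sketch below computes only the best score, not the alignment itself, and its scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of `a` and `b` with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                   # gap in b
            left = curr[j - 1] + gap             # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "ACT"))  # 2 (three matches, one gap)
```

Only two rows of the scoring matrix are kept in memory here; recovering the aligned sequences themselves would require storing the full matrix or traceback pointers.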
SNP (Single Nucleotide Polymorphism)
A variation in a single nucleotide that occurs at a specific position in the genome.
Transcriptomics
The study of the transcriptome, the complete set of RNA transcripts that are produced by the genome.
Translation
The process of translating the sequence of a messenger RNA (mRNA) molecule to a sequence of amino acids during protein synthesis.
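The mapping from codons to amino acids can be sketched with the standard genetic code. This example assumes the reading frame starts at the first base and stops at the first stop codon; the function name `translate` and the use of "X" for unrecognized codons are choices made here for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna: str) -> str:
    """Translate an mRNA (or coding DNA) string into one-letter amino acid codes."""
    seq = mrna.upper().replace("U", "T")         # accept RNA or DNA alphabets
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for unknown codons
        if amino_acid == "*":                    # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCUAA"))  # MA
```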
Variant Calling
The process of identifying variants, such as SNPs and indels, from sequence data, typically by comparing sequenced reads with a reference genome.
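At its simplest, the comparison amounts to listing positions where a sample differs from a reference. The toy sketch below assumes two already-aligned sequences of equal length and reports only single-base substitutions; real variant callers work from many overlapping reads and weigh base and mapping quality, and the name `call_substitutions` is hypothetical.

```python
def call_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]

print(call_substitutions("ACGT", "ACCT"))  # [(3, 'G', 'C')]
```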