Bioinformatics Glossary

Your essential dictionary for navigating the complex terminology of computational biology.

Accession Number

A unique identifier assigned to a sequence record in databases like GenBank or RefSeq (e.g., M12345, AC123456).

Algorithm

A step-by-step procedure or set of rules for solving a problem or making a calculation, commonly used in bioinformatics for sequence alignment and data analysis.

Alignment

The process of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships.
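
As an illustration of how an alignment can be scored, here is a minimal sketch of global alignment scoring in the style of Needleman-Wunsch. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not standard defaults.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the best global alignment score of a and b (dynamic programming)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))  # 4 (six matches, one gap)
```

Only the score is returned here; recovering the aligned sequences themselves would need a traceback through the full score matrix.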

Annotation

The process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.

Assembly

The process of putting together small fragments of DNA sequences into a long, continuous sequence or a complete genome.

Base Pair (bp)

Two nitrogenous bases joined by hydrogen bonds; the unit of DNA length. Adenine pairs with Thymine (A-T) and Guanine pairs with Cytosine (G-C).
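
The pairing rules can be expressed as a small lookup table. The sketch below (function name is illustrative) uses them to compute the reverse complement of a DNA strand, assuming uppercase input containing only A, C, G, and T.

```python
# Watson-Crick pairing: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("ATGC"))  # GCAT
```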

Bioinformatics

An interdisciplinary field that develops methods and software tools for understanding biological data, particularly when the data sets are large and complex.

BLAST (Basic Local Alignment Search Tool)

A popular algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences.

Central Dogma

The framework for understanding the transfer of sequence information. It states that DNA makes RNA, and RNA makes protein.
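
The "DNA makes RNA" step can be sketched in code. This toy example assumes the input is the coding (sense) strand, so the mRNA has the same sequence with uracil (U) in place of thymine (T); the function name is illustrative.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding (sense) strand."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```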

Clustering

The grouping of similar data points (e.g., gene expression profiles) into clusters, often used to find patterns in high-throughput data.

Codon

A sequence of three nucleotides which together form a unit of genetic code in a DNA or RNA molecule.
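
A short sketch of reading a sequence codon by codon, assuming the reading frame starts at the first base. The function name is illustrative, and any trailing partial codon is simply dropped.

```python
def split_codons(seq: str) -> list[str]:
    """Split a sequence into consecutive triplets, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


print(split_codons("ATGGCCTAAGT"))  # ['ATG', 'GCC', 'TAA']
```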

Contig

A set of overlapping DNA segments that together represent a consensus region of DNA.
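
To make the idea of overlap concrete, the sketch below merges two fragments when a suffix of one exactly matches a prefix of the other. This is a toy, assuming error-free fragments on the same strand; real assemblers tolerate sequencing errors and use far more efficient data structures. The function name and the `min_overlap` parameter are illustrative.

```python
from typing import Optional


def merge_overlap(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two fragments if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first; returns None if no overlap
    of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


print(merge_overlap("ATGCGTAC", "GTACCTGA"))  # ATGCGTACCTGA
```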

Coverage

The average number of times a given nucleotide position is sequenced in a sequencing experiment (e.g., 30x coverage means each base is covered by about 30 reads on average).
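
As a rough sketch, expected mean coverage can be estimated as (number of reads × read length) / genome size. The run parameters below are hypothetical, and the estimate ignores unmapped reads and uneven coverage across the genome.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate mean coverage as total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


# Hypothetical run: 600 million reads of 150 bp against a 3 Gbp genome.
print(mean_coverage(600_000_000, 150, 3_000_000_000))  # 30.0
```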

Data Mining

The computational process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

De Novo Assembly

Assembling short reads to create full-length sequences without using a reference genome.

DNA (Deoxyribonucleic Acid)

The molecule that carries the genetic instructions used in the growth, development, functioning, and reproduction of all known living organisms.

E-value (Expect Value)

A parameter in BLAST searches that describes the number of hits one can expect to see by chance when searching a database of a particular size.
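
For local alignments, the E-value follows the Karlin-Altschul relation E = K · m · n · e^(−λS), where S is the raw alignment score, m and n are the (effective) query and database lengths, and K and λ depend on the scoring system. The sketch below just evaluates that formula; the K and λ values in the usage example are placeholders for illustration, not parameters taken from any particular BLAST configuration.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 k: float, lam: float) -> float:
    """Evaluate E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder K and lambda, chosen only to show the trend.
small_db = expect_value(raw_score=50, query_len=300, db_len=1_000_000, k=0.1, lam=0.3)
large_db = expect_value(raw_score=50, query_len=300, db_len=2_000_000, k=0.1, lam=0.3)
print(small_db, large_db)
```

Doubling the database size doubles the E-value for the same score, which is why the same hit looks less significant in a larger database.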

Exon

A segment of a DNA or RNA molecule containing information coding for a protein or peptide sequence.

Expression

The process by which information from a gene is used in the synthesis of a functional gene product (RNA or protein).

Gene

A distinct sequence of nucleotides forming part of a chromosome, the order of which determines the order of monomers in a polypeptide or nucleic acid molecule.

Genome

The haploid set of chromosomes in a gamete or microorganism, or in each cell of a multicellular organism; the complete set of genes or genetic material present in a cell or organism.

Genomics

The branch of molecular biology concerned with the structure, function, evolution, and mapping of genomes.

Genotype

The genetic constitution of an individual organism.

Hidden Markov Model (HMM)

A statistical model used to describe the evolution of observable events that depend on internal factors, which are not directly observable. Widely used for sequence analysis (e.g., gene finding).

High-Throughput Sequencing

Technologies that sequence DNA and RNA much more quickly and cheaply than the previously used Sanger sequencing, revolutionizing the study of genomics.

Homology

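Shared evolutionary ancestry between protein or nucleic acid sequences, in the same or different species. Homology is usually inferred from sequence similarity, but the two are not the same: sequences either are homologous or are not, whereas similarity is a matter of degree.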

Indel

A molecular biology term for an insertion or deletion of bases in the genome of an organism.

Intron

A segment of a DNA or RNA molecule which does not code for proteins and interrupts the sequence of genes.

Machine Learning

The study of computer algorithms that improve automatically through experience and by the use of data. Essential for predictive modeling in bioinformatics.

Metagenomics

The study of genetic material recovered directly from environmental samples.

Motif

A short, recurring nucleotide or amino-acid sequence pattern that has, or is presumed to have, biological significance.

Mutation

A change in the nucleotide sequence of a gene or genome, resulting in a variant form that may be transmitted to subsequent generations.

Next-Generation Sequencing (NGS)

Also known as high-throughput sequencing, the catch-all term used to describe a number of different modern sequencing technologies.

Nucleotide

The basic building block of nucleic acids (DNA and RNA).

Omics

An informal collective term for fields of biology whose names end in the suffix -omics, such as genomics, proteomics, or metabolomics.

Open Reading Frame (ORF)

A stretch of a reading frame that runs from a start codon to a stop codon without interruption, and so has the potential to be translated.
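
A minimal sketch of ORF detection, assuming the standard start codon ATG and the stop codons TAA, TAG, and TGA. It scans only the three forward reading frames of a DNA sequence; a fuller search would also scan the reverse complement. The function name and the `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[str]:
    """Return ORFs (ATG through the stop codon, inclusive) in the forward frames.

    `min_codons` counts every codon in the ORF, including start and stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append(seq[start:i + 3])
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGATT"))  # ['ATGAAATGA']
```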

Orthologs

Genes in different species that evolved from a common ancestral gene by speciation.

Paralogs

Genes related by duplication within a genome.

Phylogenetics

The study of the evolutionary history and relationships among or within groups of organisms.

Proteomics

The large-scale study of proteins.

Read

A sequence of nucleotides generated from a single DNA fragment in a sequencing experiment.

Reference Genome

A digital nucleic acid sequence, assembled by scientists as a representative example of the genome of a species.

RNA-Seq

A technique that uses high-throughput sequencing to identify and quantify the RNA present in a sample, giving a snapshot of the continuously changing cellular transcriptome.

Sequence Alignment

A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity.

SNP (Single Nucleotide Polymorphism)

A variation in a single nucleotide that occurs at a specific position in the genome.

Transcriptomics

The study of the transcriptome, the complete set of RNA transcripts that are produced by the genome.

Translation

The process of translating the sequence of a messenger RNA (mRNA) molecule to a sequence of amino acids during protein synthesis.
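
The mapping from codons to amino acids can be sketched with the standard genetic code. The example below builds the codon table from a compact 64-letter string (bases ordered U, C, A, G; `*` marks a stop codon) and translates from the first base until a stop codon. Function and variable names are illustrative, and the input is assumed to be uppercase mRNA.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order (UUU, UUC, UUA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUGGUAA"))  # MAW
```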

Variant Calling

The process of identifying variants, such as SNPs and indels, by comparing sequence data from a sample against a reference genome.
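
As a toy illustration only, the sketch below reports single-base differences between a sample sequence and a reference of the same length. Real variant callers work from many aligned reads per position and weigh base qualities and error rates; this example assumes a single, already-aligned sample sequence with no indels, and the function name is illustrative.

```python
def call_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length (indels are not handled)")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


print(call_snvs("ACGTACGT", "ACGAACGG"))  # [(4, 'T', 'A'), (8, 'T', 'G')]
```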