**Hultman numbers: measuring genetic distance**If two genomes contain the same genes, but they have been rearranged by a sequence of reversals, translocations, fusions and fissions, how can we measure how closely related the two genomes are? It turns out that a key parameter in this problem is the number of cycles in the graph shown on the right of the picture. These cycles are enumerated by the

**Hultman numbers**.

It is possible for a genome to experience

*genome rearrangements*, which are evolutionary events that change the order of the genes in a genome. However, a single genome rearrangement will not scramble the genes in a random permutation, but instead it will typically cut the genome in a small number of places and reattach the pieces in a different order. A genome rearrangement that cuts the genome in exactly k places is known as a

**k-break**.

The most common types of genome rearrangement can be modelled as

**2-breaks**, (also known as

*Double-Cut-and-Join*, or DCJ) which means that the genome is cut in exactly two places and the resulting pieces reattached. Examples of such rearrangements include

*reversals*, which flip segments of a chromosome;

*translocations*, which exchange segments of two chromosomes;

*fusions*, which merge two chromosomes into one; and

*fissions*, which split a single chromosome into two. The

**2-break distance** (or DCJ distance) between two genomes is the smallest number of 2-breaks required to transform one of the genomes into the other.

An easier special case of the problem treats the case where the genome consists of one chromosome that is linear (as opposed to circular); in other words, the genome consists of a one unbroken string of genes. Each gene in the genome is considered to have a head and a tail, which gives it an orientation. It is also convenient to include a

*virtual gene* numbered 0 which, if it actually existed, would link the last gene in the genome to the first gene.

The diagram on the left of the picture shows the

**genome graph** of a toy model of a genome called

*Q*, which has five genes numbered 1 to 5, together with a virtual gene, numbered 0. The genes in

*Q* are denoted by solid line segments, with an arrow pointing from the tail of the gene to the head. (The tail and head of a gene are indicated by the superscripts

*t* and

*h*, respectively.) The dashed lines show the order in which the genes are connected in the genome. This information can be represented more concisely by the

**signed permutation** (1, –3, 5, 2, –4). If we ignore the signs, this gives (1, 3, 5, 2, 4), which is the order in which the genes appear in the genome, reading clockwise from the virtual gene 0. The negative signs appear in front of 3 and 4 to signify that the orientation of those genes is opposite to that of the other genes.

If we define the genome

*P* to be the one corresponding to the signed permutation (1, 2, 3, 4, 5), then we see that

*P* and

*Q* have the same genes, but in a different order and with different orientations. The diagram on the right of the picture shows that

**breakpoint graph**,

*G(P, Q)*, which can be used to visualise the evolutionary closeness of

*P* and

*Q*. The numbers with superscripts that appear around the perimeter of the breakpoint graph are the same ones that would appear in the genome graph of

*P* if we drew it, and the solid lines in the breakpoint graph correspond to the dashed lines in the genome graph of

*P*. On the other hand, the dashed lines in the breakpoint graph correspond to the dashed lines in the genome graph of

*Q*. For example, there is a dashed line from 2^t to 5^h in the breakpoint graph of

*G(P, Q)* because there is also a dashed line from 2^t to 5^h in

*Q*, even though the line itself appears in a different position.

**The key feature of the breakpoint graph is the way it decomposes into cycles.** For example, the graph

*G(P, Q)* shown on the right of the picture splits up into three cycles, each of which contains some dashed edges and an equal number of solid edges. A cycle containing r dashed and r solid edges is called an

*r-cycle*. In our example, the breakpoint graph contains one 1-cycle (involving the points 0^h and 1^t), one 2-cycle, and one 3-cycle. It is a theorem about breakpoint graphs that

*the 2-break distance between genomes P and Q is equal to n+1–c*, where n is the number of genes (5 in our case) and c is the number of cycles in the breakpoint graph (3 in our case). In other words, the 2-break distance between

*P* and

*Q* is 3 (because 3=5+1–3).

The (signed)

**Hultman number** H(n,d) is the number of unichromosomal genomes with n genes whose breakpoint graph (relative to a fixed genome) contains a total of d cycles. There is a closed formula for these numbers, but it is rather complicated. The recent paper

*Generalized Hultman Numbers and the Distribution of Multi-break Distances* by Nikita Alexeev, Anna Pologova, and Max A. Alekseyev (

http://arxiv.org/abs/1503.05285) generalizes the notion of Hultman numbers from the context of 2-break distance to the context of k-break distance. Although k-break distances have not been observed in evolution, they are used in

**cancer genomics** to measure a catastrophic event called

*chromothripsis* in which multiple breakages happen simultaneously. The authors give recurrence relations for the generalized Hultman numbers in the unichromosomal case, but no closed formula. They intend to treat the multichromosomal case in a future paper. Asymptotically, the 2-break distance is known to be normally distributed, but it is not known what happens for general values of k.

**Relevant links**The paper

*Efficient sorting of genomic permutations by translocation, inversion and block interchange* by S. Yancopoulos, O. Attie, and R. Friedberg contains many interesting results in this area. It is freely downloadable from

http://bioinformatics.oxfordjournals.org/content/21/16/3340.full.pdf+htmlThe unsigned Hultman numbers deal with the case where the orientations of the genes are ignored. The unsigned and signed numbers appear in the

*On-Line Encyclopedia of Integer Sequences* at

https://oeis.org/A164652 and

https://oeis.org/A189507 respectively.

Wikipedia on chromothripsis:

http://en.wikipedia.org/wiki/ChromothripsisThis post is about the applications to genomics that I mentioned in passing in a previous post about pancake sorting. You can find the earlier post here:

https://plus.google.com/101584889282878921052/posts/GyEPJkT5769The picture comes from page 3 of the arXiv paper by Alexeev, Pologova and Alekseyev discussed above.

#mathematics #genetics #sciencesunday #spnetwork arXiv:1503.05285