A Universe of Sorts

§ Burrows Wheeler

We aim to get the

O(n)

algorithm for burrows wheeler, by starting from the naive

O(n^2)

implementation and then slowly chipping away to get to the fastest algorithm

§ String rotations

Given a string

s

of length

n

, we can index it as

s[0]

s[1]

, upto

s[n-1]

. We can now build a table consisting of rotations of the string. We'll define:

lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]

We can build a table of left-rotations:

foobar
oobarf
obarfo
barfoo
arfoob
rfooba

We can immediately notice that it's a symmetric matrix . We can prove this as follows. We write

lrot(rot, s)

as an array such that:

lrot(rot, s)[i] = s[(rot + i) % n] \text{where $n = |s|$}

Now, we note that on row

r

of our array we have the string

lrot(r, s)

. So the value at row

r

, column

c

lrot(r, s)[c] = s[(r + c)%n]

. But this is symmetric in

c, r

so can be written as

s[(c + r)%n]

, which is equal to

lrot(c, s)[r]

. Formally:

lrot(r, s)[c] = s[(r + c) %n] = s[(c + r)%n] = lrot(c, s)[r]

§ Sorts of string rotations

We're next interested in considering sorted order of the string rotations. For example, in the case of the string "here, there", all the rotations are:

here-there  0
ere-thereh  1
re-therehe  2
e-thereher  3
-therehere  4
therehere-  5
herehere-t  6
erehere-th  7
rehere-the  8
ehere-ther  9

which is calculated with the help of the following haskell code:

lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)

main =  putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> "  " <> show i)
           (lrots "here-there") [0,1..])

If we now sort these rotations, we get:

-therehere  0
e-thereher  1
ehere-ther  2
ere-thereh  3
erehere-th  4
here-there  5
herehere-t  6
re-therehe  7
rehere-the  8
therehere-  9

we produce this by chainging the above definition of main to:

main =  putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> "  " <> show i)
           -- | extra `sort` here
           (sort $ lrots "here-there") [0,1..])

Let's look at the final column of that table. We have:

-thereher|e|  0
e-therehe|r|  1
ehere-the|r|  2
ere-there|h|  3
erehere-t|h|  4
here-ther|e|  5 <- ORIGINAL STRING
herehere-|t|  6
re-thereh|e|  7
rehere-th|e|  8
therehere|-|  9

0123456789
errhhetee-

Now, we'll show how to write a really cool function call bwt such that:

bwtinv ("errhhetee-",5) = "here-there"

The 5 is the index of the original string in the list. That is, given the jumbled up last-column and the index of the original string, we're able to reconstruct the original string. The reason this is useful is that the jumbled string "errhhetee-" is easier to compress: it has long runs of r, h, and e which makes it easier to compress. The process we performed of rotating, sorting the rotations and taking the final column is the Burrows-Wheeler transform. in code:

import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)

findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)

lastcol :: [[a]] -> [a]; lastcol = map last

bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')

Now we need to understand how bwtinv is defined and what it does. The definition is:

import Control.Arrow (&&&)

bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k

-- recreate · lastcol · sort · rots = sort · rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as

recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))


hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
         in sortBy cmp

consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))

OK, so much for the code. what does it do ? The idea is that:

we recreate the entire matrix from the last column using recreateand take the kth element.

recreate apprents a copy of the initial last column to a matrix of columns, and then sorts this.

§ Alternate explanation of why this is invertible.

Consider the string banana and its cyclic permutations:

banana$
$banana
a$banan
na$bana
ana$ban
nana$ba
anana$b

Then sorted (where $ is considered smallest letter):

$banan|a|
a$bana|n|
ana$ba|n|
anana$|b|
banana|$|
na$ban|a|
nana$b|a|

The BWT will be annb$aa. Notice that we also know the FIRST column of the BWT table, since it's just the string with its letters sorted! So given the last column, we really know:

|$|?????|a|
|a|?????|n|
|a|?????|n|
|a|?????|b|
|b|?????|$|
|n|?????|a|
|n|?????|a|

This means we know all "2-mers" of our string, which are: a$, na, na, ba, $b, an, an.
If we sort these, then we get the first two letters of our BWT matrix! That is, the sorted 2-mers:

$b
a$
an
an
ba
na
na

However, we see that we also know the last column of the BWT matrix, so now we know:

|$b|????|a|
|a$|????|n|
|an|????|n|
|an|????|b|
|ba|????|$|
|na|????|a|
|na|????|a|

We now know 3-mers! Sort again, find 4-mers, etc. till we reconstruct the whole string :) This is memory intensive. Can we do better?

§ First last property of BWT

Consider the cyclic permutations:

b1 a1 n1 a1 n2 a3 $1
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a1 n2 a2
a2 n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1 a1
a1 n1 a2 n2 a3 $1 b1

Now let's sort them:

$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n1
a1 n1 a2 n2 a3 $1 b1
b1 a1 n1 a1 n2 a3 $1
n2 a3 $1 b1 a1 n2 a2
n1 a2 n2 a3 $1 b1 a1

We see that the order of occurrence of a in the first column is [a3 a2 a1]. Similarly in the last column, it occurs in the order [a3 a2 a1]. Furthermore, this holds for n as well: we have the order [n2 n1] in both the first and last column.
Claim: this always occurs! Proof: consider the strings with [a3, a2, a1] in the first column:

a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n2
a1 n1 a2 n2 a3 $1 b1

These strings are sorted, since they come from BWT. Thus, I can chop off the a from all of them, and they remain sorted (we sort by lex):

$1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1

Furthermore, since we sort by lex, I can place the as at the end of the strings, still retaining lex ordering:

$1 b1 a1 n1 a2 n2 a3
n2 a3 $1 b1 a2 n2 a2
n1 a2 n2 a3 $1 b1 a1

These new strings are (a) still sorted, and (b) cyclic shifts of the original strings! Thus, these occur in the BWT table, and are exactly the strings in the last column which have {a1, a2, a3}. Since they're sorted, they'll occur in the order [a3 a2 a1] just like the did in the first column!
This shows that if the first column has a in the order [a3 a2 a1], then so too does the last column have a in the order [a3, a2, a1].

So now, given the last column of the BWT, we add the first column of the BWT and the numbers, since we know that we can always add the numbers by the first-last property:

1 $1 ? ? ? ? ? a3
2 a3 ? ? ? ? ? n2
3 a2 ? ? ? ? ? n1
4 a1 ? ? ? ? ? b1
5 b1 ? ? ? ? ? $1
6 n2 ? ? ? ? ? a2
7 n1 ? ? ? ? ? a1

In the first column, we have $1...a3, which we treat as the string $1 a3.
We see that a3 corresponds to 2nd column. We have a3...n2, which we combine with $1 a3 to get $1 a3 n2.
We see that n2 corresonds to the 6th column , where we have n2...a2. We combine this with $1 a3 n2 to get $1 a3 n2 a2.
We see that a2 corresonds to the 3r column , where we have a2...n1. We combine this with $1 a3 n2 a2 to get $1 a3 n2 a2 n1.
We see that n1 corresonds to the 7th column , where we have n1...a1. We combine this with $1 a3 n2 a2 n1 to get $1 a3 n2 a2 n1 a1.
We see that a1 corresonds to the 4th column , where we have a1...b1. We combine this with $1 a3 n2 a2 n1 a1 to get $1 a3 n2 a2 n1 a1 b1.
See that $1 a3 n2 a2 n1 a1 b1 is the reverse of banana, the string we encoded!
This lets us rediscover the BWT'd string in O(n) time using the neat property of the BWT!

§ Finding substrings with BWT

Since the BWT list is sorted, can keep top and bottom pointers which tell us where we are looking for current letter. Can find next letter using BWT matix.

§ Why BWT gives good run length encoding

Suppose we have a book with many words.
The book will have many occurrences of the word the.
In the BWT, we will have many strings of the form e.......thto reflect substrings that have a the.
This will give us many occurences of h in the last column!.

§ References

Richard Bird: Pearls of functional algorithm design
Algorithms on strings UCSD