§ Burrows Wheeler


We aim to get the O(n)O(n) algorithm for burrows wheeler, by starting from the naive O(n2)O(n^2) implementation and then slowly chipping away to get to the fastest algorithm

§ String rotations


Given a string ss of length nn, we can index it as s[0]s[0], s[1]s[1], upto s[n1]s[n-1]. We can now build a table consisting of rotations of the string. We'll define:
lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]

We can build a table of left-rotations:
foobar
oobarf
obarfo
barfoo
arfoob
rfooba

We can immediately notice that it's a symmetric matrix . We can prove this as follows. We write lrot(rot,s)lrot(rot, s) as an array such that:
lrot(rot,s)[i]=s[(rot+i) lrot(rot, s)[i] = s[(rot + i) % n] \text{where $n = |s|$}

Now, we note that on row rr of our array we have the string lrot(r,s)lrot(r, s). So the value at row rr, column cc is lrot(r,s)[c]=s[(r+c)lrot(r, s)[c] = s[(r + c)%n]. But this is symmetric in c,rc, r so can be written as s[(c+r)s[(c + r)%n], which is equal to lrot(c,s)[r]lrot(c, s)[r]. Formally:
lrot(r,s)[c]=s[(r+c) lrot(r, s)[c] = s[(r + c) %n] = s[(c + r)%n] = lrot(c, s)[r]

§ Sorts of string rotations


We're next interested in considering sorted order of the string rotations. For example, in the case of the string "here, there", all the rotations are:
here-there  0
ere-thereh  1
re-therehe  2
e-thereher  3
-therehere  4
therehere-  5
herehere-t  6
erehere-th  7
rehere-the  8
ehere-ther  9

which is calculated with the help of the following haskell code:
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)

main =  putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> "  " <> show i)
           (lrots "here-there") [0,1..])

If we now sort these rotations, we get:
-therehere  0
e-thereher  1
ehere-ther  2
ere-thereh  3
erehere-th  4
here-there  5
herehere-t  6
re-therehe  7
rehere-the  8
therehere-  9

we produce this by chainging the above definition of main to:
main =  putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> "  " <> show i)
           -- | extra `sort` here
           (sort $ lrots "here-there") [0,1..])

Let's look at the final column of that table. We have:
-thereher|e|  0
e-therehe|r|  1
ehere-the|r|  2
ere-there|h|  3
erehere-t|h|  4
here-ther|e|  5 <- ORIGINAL STRING
herehere-|t|  6
re-thereh|e|  7
rehere-th|e|  8
therehere|-|  9

0123456789
errhhetee-

Now, we'll show how to write a really cool function call bwt such that:
bwtinv ("errhhetee-",5) = "here-there"

The 5 is the index of the original string in the list. That is, given the jumbled up last-column and the index of the original string, we're able to reconstruct the original string. The reason this is useful is that the jumbled string "errhhetee-" is easier to compress: it has long runs of r, h, and e which makes it easier to compress.
The process we performed of rotating, sorting the rotations and taking the final column is the Burrows-Wheeler transform. in code:
import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)

findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)

lastcol :: [[a]] -> [a]; lastcol = map last

bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')

Now we need to understand how bwtinv is defined and what it does. The definition is:
import Control.Arrow (&&&)

bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k

-- recreate · lastcol · sort · rots = sort · rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as

recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))


hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
         in sortBy cmp

consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))

OK, so much for the code. what does it do ?
The idea is that:

§ Alternate explanation of why this is invertible.


Consider the string banana and its cyclic permutations:
banana$
$banana
a$banan
na$bana
ana$ban
nana$ba
anana$b

Then sorted (where $ is considered smallest letter):
$banan|a|
a$bana|n|
ana$ba|n|
anana$|b|
banana|$|
na$ban|a|
nana$b|a|

The BWT will be annb$aa. Notice that we also know the FIRST column of the BWT table, since it's just the string with its letters sorted! So given the last column, we really know:
|$|?????|a|
|a|?????|n|
|a|?????|n|
|a|?????|b|
|b|?????|$|
|n|?????|a|
|n|?????|a|


$b
a$
an
an
ba
na
na

However, we see that we also know the last column of the BWT matrix, so now we know:
|$b|????|a|
|a$|????|n|
|an|????|n|
|an|????|b|
|ba|????|$|
|na|????|a|
|na|????|a|

We now know 3-mers! Sort again, find 4-mers, etc. till we reconstruct the whole string :) This is memory intensive. Can we do better?

§ First last property of BWT


Consider the cyclic permutations:
b1 a1 n1 a1 n2 a3 $1
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a1 n2 a2
a2 n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1 a1
a1 n1 a2 n2 a3 $1 b1

Now let's sort them:
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n1
a1 n1 a2 n2 a3 $1 b1
b1 a1 n1 a1 n2 a3 $1
n2 a3 $1 b1 a1 n2 a2
n1 a2 n2 a3 $1 b1 a1


a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n2
a1 n1 a2 n2 a3 $1 b1


$1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1


$1 b1 a1 n1 a2 n2 a3
n2 a3 $1 b1 a2 n2 a2
n1 a2 n2 a3 $1 b1 a1



1 $1 ? ? ? ? ? a3
2 a3 ? ? ? ? ? n2
3 a2 ? ? ? ? ? n1
4 a1 ? ? ? ? ? b1
5 b1 ? ? ? ? ? $1
6 n2 ? ? ? ? ? a2
7 n1 ? ? ? ? ? a1


§ Finding substrings with BWT


Since the BWT list is sorted, can keep top and bottom pointers which tell us where we are looking for current letter. Can find next letter using BWT matix.

§ Why BWT gives good run length encoding



§ References