§ Burrows Wheeler
We aim to get the O(n) algorithm for burrows wheeler, by starting from the
naive O(n2) implementation and then slowly chipping away to get to the
fastest algorithm
§ String rotations
Given a string s of length n, we can index it as s[0], s[1], upto
s[n−1]. We can now build a table consisting of rotations of the string.
We'll define:
lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]
We can build a table of left-rotations:
foobar
oobarf
obarfo
barfoo
arfoob
rfooba
We can immediately notice that it's a symmetric matrix . We can prove
this as follows. We write lrot(rot,s) as an array such that:
lrot(rot,s)[i]=s[(rot+i)
Now, we note that on row r of our array we have the string lrot(r,s).
So the value at row r, column c is lrot(r,s)[c]=s[(r+c).
But this is symmetric in c,r so can be written as s[(c+r),
which is equal to lrot(c,s)[r]. Formally:
lrot(r,s)[c]=s[(r+c)
§ Sorts of string rotations
We're next interested in considering sorted order of the string
rotations. For example, in the case of the string "here, there",
all the rotations are:
here-there 0
ere-thereh 1
re-therehe 2
e-thereher 3
-therehere 4
therehere- 5
herehere-t 6
erehere-th 7
rehere-the 8
ehere-ther 9
which is calculated with the help of the following haskell code:
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]
lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)
main = putStrLn $ intercalate "\n"
(zipWith (\s i -> s <> " " <> show i)
(lrots "here-there") [0,1..])
If we now sort these rotations, we get:
-therehere 0
e-thereher 1
ehere-ther 2
ere-thereh 3
erehere-th 4
here-there 5
herehere-t 6
re-therehe 7
rehere-the 8
therehere- 9
we produce this by chainging the above definition of main
to:
main = putStrLn $ intercalate "\n"
(zipWith (\s i -> s <> " " <> show i)
-- | extra `sort` here
(sort $ lrots "here-there") [0,1..])
Let's look at the final column of that table. We have:
-thereher|e| 0
e-therehe|r| 1
ehere-the|r| 2
ere-there|h| 3
erehere-t|h| 4
here-ther|e| 5 <- ORIGINAL STRING
herehere-|t| 6
re-thereh|e| 7
rehere-th|e| 8
therehere|-| 9
0123456789
errhhetee-
Now, we'll show how to write a really cool function call bwt
such that:
bwtinv ("errhhetee-",5) = "here-there"
The 5
is the index of the original string in the list. That is, given
the jumbled up last-column and the index of the original string, we're
able to reconstruct the original string. The reason this is useful is
that the jumbled string "errhhetee-"
is easier to compress: it has
long runs of r
, h
, and e
which makes it easier to compress.
The process we performed of rotating, sorting the rotations and taking
the final column is the Burrows-Wheeler transform. in code:
import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]
lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)
findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)
lastcol :: [[a]] -> [a]; lastcol = map last
bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')
Now we need to understand how bwtinv
is defined and what it does.
The definition is:
import Control.Arrow (&&&)
bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k
-- recreate · lastcol · sort · rots = sort · rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as
recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))
hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
in sortBy cmp
consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))
OK, so much for the code. what does it do ?
The idea is that:
- we recreate the entire matrix from the last column using
recreate
and take the k
th element.
-
recreate
apprents a copy of the initial last column to a matrix of columns, and then sorts this.
§ Alternate explanation of why this is invertible.
Consider the string banana
and its cyclic permutations:
banana$
$banana
a$banan
na$bana
ana$ban
nana$ba
anana$b
Then sorted (where $
is considered smallest letter):
$banan|a|
a$bana|n|
ana$ba|n|
anana$|b|
banana|$|
na$ban|a|
nana$b|a|
The BWT will be annb$aa
. Notice that we also know the FIRST column of the BWT table,
since it's just the string with its letters sorted! So given the last column, we really know:
|$|?????|a|
|a|?????|n|
|a|?????|n|
|a|?????|b|
|b|?????|$|
|n|?????|a|
|n|?????|a|
- This means we know all "2-mers" of our string, which are:
a$
, na
, na
, ba
, $b
, an
, an
. - If we sort these, then we get the first two letters of our BWT matrix! That is, the sorted 2-mers:
$b
a$
an
an
ba
na
na
However, we see that we also know the last column of the BWT matrix, so now we know:
|$b|????|a|
|a$|????|n|
|an|????|n|
|an|????|b|
|ba|????|$|
|na|????|a|
|na|????|a|
We now know 3-mers! Sort again, find 4-mers, etc. till we reconstruct the whole string :)
This is memory intensive. Can we do better?
§ First last property of BWT
Consider the cyclic permutations:
b1 a1 n1 a1 n2 a3 $1
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a1 n2 a2
a2 n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1 a1
a1 n1 a2 n2 a3 $1 b1
Now let's sort them:
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n1
a1 n1 a2 n2 a3 $1 b1
b1 a1 n1 a1 n2 a3 $1
n2 a3 $1 b1 a1 n2 a2
n1 a2 n2 a3 $1 b1 a1
- We see that the order of occurrence of
a
in the first column is [a3 a2 a1]
. Similarly in the last column, it occurs in the order [a3 a2 a1]
. Furthermore, this holds for n
as well: we have the order [n2 n1]
in both the first and last column. - Claim: this always occurs! Proof: consider the strings with
[a3, a2, a1]
in the first column:
a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n2
a1 n1 a2 n2 a3 $1 b1
- These strings are sorted, since they come from BWT. Thus, I can chop off the
a
from all of them, and they remain sorted (we sort by lex):
$1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1
- Furthermore, since we sort by lex, I can place the
a
s at the end of the strings, still retaining lex ordering:
$1 b1 a1 n1 a2 n2 a3
n2 a3 $1 b1 a2 n2 a2
n1 a2 n2 a3 $1 b1 a1
- These new strings are (a) still sorted, and (b) cyclic shifts of the original strings! Thus, these occur in the BWT table, and are exactly the strings in the last column which have
{a1, a2, a3}
. Since they're sorted, they'll occur in the order [a3 a2 a1]
just like the did in the first column! - This shows that if the first column has
a
in the order [a3 a2 a1]
, then so too does the last column have a
in the order [a3, a2, a1]
.
- So now, given the last column of the BWT, we add the first column of the BWT and the numbers, since we know that we can always add the numbers by the first-last property:
1 $1 ? ? ? ? ? a3
2 a3 ? ? ? ? ? n2
3 a2 ? ? ? ? ? n1
4 a1 ? ? ? ? ? b1
5 b1 ? ? ? ? ? $1
6 n2 ? ? ? ? ? a2
7 n1 ? ? ? ? ? a1
- In the first column, we have
$1...a3
, which we treat as the string $1 a3
. - We see that
a3
corresponds to 2nd column. We have a3...n2
, which we combine with $1 a3
to get $1 a3 n2
. - We see that
n2
corresonds to the 6th column , where we have n2...a2
. We combine this with $1 a3 n2
to get $1 a3 n2 a2
. - We see that
a2
corresonds to the 3r column , where we have a2...n1
. We combine this with $1 a3 n2 a2
to get $1 a3 n2 a2 n1
. - We see that
n1
corresonds to the 7th column , where we have n1...a1
. We combine this with $1 a3 n2 a2 n1
to get $1 a3 n2 a2 n1 a1
. - We see that
a1
corresonds to the 4th column , where we have a1...b1
. We combine this with $1 a3 n2 a2 n1 a1
to get $1 a3 n2 a2 n1 a1 b1
. - See that
$1 a3 n2 a2 n1 a1 b1
is the reverse of banana
, the string we encoded! - This lets us rediscover the BWT'd string in O(n) time using the neat property of the BWT!
§ Finding substrings with BWT
Since the BWT list is sorted, can keep top and bottom pointers which tell us where we are looking
for current letter. Can find next letter using BWT matix.
§ Why BWT gives good run length encoding
- Suppose we have a book with many words.
- The book will have many occurrences of the word
the
. - In the BWT, we will have many strings of the form
e.......th
to reflect substrings that have a the
. - This will give us many occurences of
h
in the last column!.
§ References