lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]
We can build a table of left-rotations:
foobar
oobarf
obarfo
barfoo
arfoob
rfooba
We can immediately notice that it's a symmetric matrix . We can prove
this as follows. We write as an array such that:
here-there 0
ere-thereh 1
re-therehe 2
e-thereher 3
-therehere 4
therehere- 5
herehere-t 6
erehere-th 7
rehere-the 8
ehere-ther 9
which is calculated with the help of the following haskell code:
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]
lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)
main = putStrLn $ intercalate "\n"
(zipWith (\s i -> s <> " " <> show i)
(lrots "here-there") [0,1..])
If we now sort these rotations, we get:
-therehere 0
e-thereher 1
ehere-ther 2
ere-thereh 3
erehere-th 4
here-there 5
herehere-t 6
re-therehe 7
rehere-the 8
therehere- 9
we produce this by chainging the above definition of main
to:
main = putStrLn $ intercalate "\n"
(zipWith (\s i -> s <> " " <> show i)
-- | extra `sort` here
(sort $ lrots "here-there") [0,1..])
Let's look at the final column of that table. We have:
-thereher|e| 0
e-therehe|r| 1
ehere-the|r| 2
ere-there|h| 3
erehere-t|h| 4
here-ther|e| 5 <- ORIGINAL STRING
herehere-|t| 6
re-thereh|e| 7
rehere-th|e| 8
therehere|-| 9
0123456789
errhhetee-
Now, we'll show how to write a really cool function call bwt
such that:
bwtinv ("errhhetee-",5) = "here-there"
The 5
is the index of the original string in the list. That is, given
the jumbled up last-column and the index of the original string, we're
able to reconstruct the original string. The reason this is useful is
that the jumbled string "errhhetee-"
is easier to compress: it has
long runs of r
, h
, and e
which makes it easier to compress.
The process we performed of rotating, sorting the rotations and taking
the final column is the Burrows-Wheeler transform. in code:
import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]
lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)
findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)
lastcol :: [[a]] -> [a]; lastcol = map last
bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')
Now we need to understand how bwtinv
is defined and what it does.
The definition is:
import Control.Arrow (&&&)
bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k
-- recreate · lastcol · sort · rots = sort · rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as
recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))
hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
in sortBy cmp
consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))
OK, so much for the code. what does it do ?
The idea is that:
recreate
and take the k
th element. recreate
apprents a copy of the initial last column to a matrix of columns, and then sorts this. banana
and its cyclic permutations:
banana$
$banana
a$banan
na$bana
ana$ban
nana$ba
anana$b
Then sorted (where $
is considered smallest letter):
$banan|a|
a$bana|n|
ana$ba|n|
anana$|b|
banana|$|
na$ban|a|
nana$b|a|
The BWT will be annb$aa
. Notice that we also know the FIRST column of the BWT table,
since it's just the string with its letters sorted! So given the last column, we really know:
|$|?????|a|
|a|?????|n|
|a|?????|n|
|a|?????|b|
|b|?????|$|
|n|?????|a|
|n|?????|a|
a$
, na
, na
, ba
, $b
, an
, an
. $b
a$
an
an
ba
na
na
However, we see that we also know the last column of the BWT matrix, so now we know:
|$b|????|a|
|a$|????|n|
|an|????|n|
|an|????|b|
|ba|????|$|
|na|????|a|
|na|????|a|
We now know 3-mers! Sort again, find 4-mers, etc. till we reconstruct the whole string :)
This is memory intensive. Can we do better?
b1 a1 n1 a1 n2 a3 $1
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a1 n2 a2
a2 n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1 a1
a1 n1 a2 n2 a3 $1 b1
Now let's sort them:
$1 b1 a1 n1 a2 n2 a3
a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n1
a1 n1 a2 n2 a3 $1 b1
b1 a1 n1 a1 n2 a3 $1
n2 a3 $1 b1 a1 n2 a2
n1 a2 n2 a3 $1 b1 a1
a
in the first column is [a3 a2 a1]
. Similarly in the last column, it occurs in the order [a3 a2 a1]
. Furthermore, this holds for n
as well: we have the order [n2 n1]
in both the first and last column. [a3, a2, a1]
in the first column: a3 $1 b1 a1 n1 a2 n2
a2 n2 a3 $1 b1 a2 n2
a1 n1 a2 n2 a3 $1 b1
a
from all of them, and they remain sorted (we sort by lex): $1 b1 a1 n1 a2 n2
n2 a3 $1 b1 a2 n2
n1 a2 n2 a3 $1 b1
a
s at the end of the strings, still retaining lex ordering: $1 b1 a1 n1 a2 n2 a3
n2 a3 $1 b1 a2 n2 a2
n1 a2 n2 a3 $1 b1 a1
{a1, a2, a3}
. Since they're sorted, they'll occur in the order [a3 a2 a1]
just like the did in the first column! a
in the order [a3 a2 a1]
, then so too does the last column have a
in the order [a3, a2, a1]
. 1 $1 ? ? ? ? ? a3
2 a3 ? ? ? ? ? n2
3 a2 ? ? ? ? ? n1
4 a1 ? ? ? ? ? b1
5 b1 ? ? ? ? ? $1
6 n2 ? ? ? ? ? a2
7 n1 ? ? ? ? ? a1
$1...a3
, which we treat as the string $1 a3
. a3
corresponds to 2nd column. We have a3...n2
, which we combine with $1 a3
to get $1 a3 n2
. n2
corresonds to the 6th column , where we have n2...a2
. We combine this with $1 a3 n2
to get $1 a3 n2 a2
. a2
corresonds to the 3r column , where we have a2...n1
. We combine this with $1 a3 n2 a2
to get $1 a3 n2 a2 n1
. n1
corresonds to the 7th column , where we have n1...a1
. We combine this with $1 a3 n2 a2 n1
to get $1 a3 n2 a2 n1 a1
. a1
corresonds to the 4th column , where we have a1...b1
. We combine this with $1 a3 n2 a2 n1 a1
to get $1 a3 n2 a2 n1 a1 b1
. $1 a3 n2 a2 n1 a1 b1
is the reverse of banana
, the string we encoded! the
. e.......th
to reflect substrings that have a the
. h
in the last column!.