In toto, the prefix function is:
ababcaab
01234012
§ s[0..i]
has a border of length at pr(i+1)-1
- That is, given the substring
s[0..i]
, we can predict that s[0..i]
will have some border (perhaps not the longest border) of length pr(i+1)-1
. - Suppose
s[0..i+1]
has longest border of length L=pr(i+1)
(by definition of pr
). Suppose pr(i+1) >= 1
. Then I can write s[0..i+1]
as:
s[0..i+1] = p[0]p[1]...p[L]|s[L+1]...s[i+1-L-1]|p[0]p[1]...p[L]
^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
If I now want to consider s[0..i]
, I need to drop the last letter p[L]
, which leaves me with:
s[0..i] = p[0]p[1]...p[L-1]p[L]|s[L+1]...s[i+1-L-1]|p[0]p[1]...p[L-1]
^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^
- where we have a border of
p[0]...p[L-1]
. - There maybe a longer border, involving other terms of
s[0..i]
. But we know that the border is at least pr(i) >= pr(i+1)-1
. - Upon re-arranging, we see that
pr(i+1) <= pr(i) + 1
. - This tells us that the border can increase by at most 1 (it can drop to zero, no lower bound!). So we have:
0 <= pr(i+1) <= pr(i) + 1
. - So if we think of borders of
s[0..i+1]
, we know that the longest border can be of length of border pr(i) + 1
. All other borders will be of length <= pr(i)
, so these other borders will be borders of s[0..i]
! - Thus, to move from
s[0..i]
to s[0..(i+1)]
, we simply need to be able to find the longest border of s[0..(i+1)]
. All other borders will come from s[0..i]
.
§ Lemma: Enumerating borders (what is a good name for this lemma?)
- Think of the longest border
123456
:
123456----123456
^ ^
L N
- Now consider a shorter border
ab
:
ab3456----1234ab
^ ^
L N
- But we must have
a~1
and b~2
since it's the same string! So the border is really ab34ab
, and the string is:
ab34ab----ab34ab
^ ^
L N
- This shows that given the longest border
123456
of s[0:N]
which has length L
, any other border of s[0:N]
(such as ab
) is also a border of s[0:L]
. - Generalizing, given the longest border of
s[0..n]
of length L
, any smaller border of s[0:N]
is a border of s[0:L]
.
§ Algorithm to enumerate borders.
All borders of s[0:N]
can be enumerated by:
- Taking the longest border of length
L
of s[0:N]
(given by pr(N)
). - considering the border as being the substring
s[0:L]
(given by pr(L)
). - Taking the longest border of
s[0:L]
- ... recurse till we hit empty string.
§ Computing pr(i+1)
- We know that
pr(0) = 0
. - At
pr(i+1)
, we know that s[0:pr(i)] = s[-pr(i):-1]
§ Border function is a fractal
Consider a general string which looks like:
abcdef--abcdef
If we think of the second longest border, say 12
, we will get that
the string must be:
12cdef--abcd12
^^ &&
But this implies that the occurrence of ef
(marked with &&
) and the occurrence
of ab
(marked with ##
) must be 12, 12
respectively. This means that the
string is:
12cd12--12cd12
So we started with a full string:
------------------
Then matched the left and right (very 1/3 cantor):
------- ------
----
Then matched the left and right of these again and unified the leftovers,
finding that it looks like:
-- -- -- --
-- --
---
and so on. Isn't this so cool? Borders of a string are a fractal-like object!