§ Efficient tree transformations on GPUs (TODO)


All material lifted straight from Aaron Hsu's PhD thesis . I'll be converting APL notation to C++-like notation.

§ Tree repsentation as multi-dimensional ragged nested arrays


We're interested in this tree:
      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z

I'll be writing APL commands in front of a $ to mimic bash, and I'll write some arrays as multi-line. To run them, collapse them into a single line. The ast object is represented in memory as:
$ ast ← ('∘'
           ('a' ('p'))
           ('b'
             ('q' ('v'))
             ('r'))
           ('c'
             ('s' ('w' 'x'))
             ('t' ('y' 'z'))
             ('u')))
$ ]disp ast
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │  ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │  │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │  │        ││ │└─┴─→┘│└─┴─→┘│ ││
│ │  │        │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘

Here's how read the array representation. Look at the top level of the tree. we have a root node with three children:
      ∘
┌──┬──┴────┐
a  b       c

┌→┬──┬────────┬─────────────┐
│∘│  │        │             │
│ │ a│   b    │     c       │
│ │  │        │             │
└─┴─→┴───────→┴────────────→┘

With the first being the root node, and the three adjacent cells being the children.
Next, we look at how x is represented. This is predictably recursive. Let's see the subtree under x:
      ∘
┌──┬──┴────┐
a  b       c
│
p

┌→┬──┬────────┬─────────────┐
│∘│ap│        │             │
│ │  │  b     │   c         │
│ │  │        │             │
└─┴─→┴───────→┴────────────→┘


Similarly for y:
      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐
p q r

┌→┬──┬────────┬─────────────┐
│∘│ap│┌→┬──┬─┐│             │
│ │  ││b│q │r││   c         │
│ │  │└─┴─→┴─┘│             │
└─┴─→┴───────→┴────────────→┘

And so on, leading to the final representation:
      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │  ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │  │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │  │        ││ │└─┴─→┘│└─┴─→┘│ ││
│ │  │        │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘

Note that for this representation to work, we need to be able to:

I don't understand the memory layout of this, to be honest. I feel like to represent this in memory would still rely on pointer-chasing, since we need to box all the arrays. This is possibly optimised by APL to not be too bad.

§ The depth vector representation


      ∘             0
┌──┬──┴────┐
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z       3

If we visit this tree and record depths in pre-order (node left right), we arrive at the list:
(∘:0
  (a:1 (p:2)) (b:1 (q:2 (v:3)) (r:2))
  (c:1 (s:2 (w:3 x:3)) (t:2 (y:3 z:3)) (u:2)))

formatted as:
(∘:0
  (a:1
    (p:2))
  (b:1
    (q:2 (v:3))
    (r:2)
  )
  (c:1 (s:2 (w:3 x:3))
       (t:2 (y:3 z:3))
       (u:2))
)

This linearlized is the list:
    (∘ a p b q v r c s w x t y z u)
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)

      ∘             0
┌──┬──┴────┐
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z       3

To convert the ast object into a depth vector representation, we can use the following call:
$ ast ← ('∘' ('a' ('p')) ('b' ('q' ('v')) ('r')) ('c' ('s' ('w' 'x')) ('t' ('y' 'z')) ('u')))
$ d ← ∊0{(⊢,(⍺+1)∇⊣)/⌽⍺,1↓⍵}ast
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2

Let's break this down:
TODO

§ Inverted tables


We represent data associated with our nodes as follows:
$ data ← ⍪ ¨d(15⍴'T')(↑15⍴⊂'n.')
$ ]disp data
┌→┬─┬──┐
│0│T│n.│
│1│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│4│T│n.│
│2│T│n.│
│3│T│n.│
│4│T│n.│
│2↓T↓n.↓
└→┴→┴─→┘

This is the same thing as a structure of arrays (SOA) representation , where each array of information (eg, the depth at data[1], the T information at data[2]) are each arrays which can be accessed well on SIMD instructions.

§ AST representation


TODO

§ Path matrices


We want information of how to go up and down the tree in ideally constant time. We store this information in what is known as a path matrix .
For our recurring example, the path matrix is:
∘ a p b q v r c s w x t y z u | preorder traversal
──────────────────────────────────────────────────
∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ | depth=0
- a a b b b b c c c c c c c c | depth=1
- - p - q q r - s s s t t t u | depth=2
- - - - - v - - - w x - y z - | depth=3

      ∘             0
┌──┬──┴────┐
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z       3

To efficiently compute this, we first replace every value in our tree with its preorder traversal visit time. This changes the tree to:
              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

The values we store in the tree are the integers. The old labels are represented for clarity.
The path matrix for this tree is:
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14  | preorder traversal
────────────────────────────────────────────────────────────
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  | depth=0
-  1  1  3  3  3  3  7  7  7  7  7  7  7  7  | depth=1
-  -  2  -  4  4  6  -  8  8  8 11 11 11 14  | depth=2
-  -  -  -  -  5  -  -  -  9 10  - 12 13  -  | depth=3

              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

§ Rendering the depth information in 2D


We use the incantation:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ ((⍳≢d)@(d,¨⍳≢d)) ((⌈/d) (≢d))⍴'-'
0 - - - - - - - - -  -  -  -  -  -
- 1 - 3 - - - 7 - -  -  -  -  -  -
- - 2 - 4 - 6 - 8 -  - 11  -  - 14
- - - - - 5 - - - 9  -  - 12  -  -
- - - - - - - - - - 10  -  - 13  -

Let's break this down (the symbol means a lamp, for commenting/illumination)
$ ⍳ 3 ⍝ iota: make a list of n elements:.
1 2 3

$ d
0 1 2 1 2 3 2 1 2 3 4 2 3 4 2

$ ≢d ⍝ tally: ≢`. count no. of elements in d:
15

⍳≢d  ⍝ list of elements of len (no. of elements in d).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

$ ]disp (1 2 3),(4 5 6) ⍝ ,:concatenate
┌→────┬─────┐
│1 2 3│4 5 6│
└~───→┴~───→┘

]disp (1 2 3) ,¨ (4 5 6)
┌→──┬───┬───┐
│1 4│2 5│3 6│
└~─→┴~─→┴~─→┘

The use of ¨ needs some explanation. ¨ is a higher order function which takes a function and makes it a mapped version of the original function. So, is a function which attemps to map the concatenation operator. Now, given two arrays (1 2 3) and (4 5 6), (1 2 3) ,¨ 4 5 6 attemps to run , on each pair 1 and 4, 2 and 5, 3 and 6. This gives us tuples ((1 4) (2 5) (3 6)). So, for our purposes, zip ← ,¨.
]disp (d,¨⍳≢d) ⍝ zip d with [1..len d].
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│4 10│2 11│3 12│4 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

$ ((⌈/d) (≢d))⍴'-' ⍝ array of dim (max val in d) x (no. of elem in d)
---------------
---------------
---------------
---------------


TODO: explain @ and its use

§ Creating the path matrix


$ ⎕IO ← 0 ⍝ (inform APL that we wish to use 0-indexing.)
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ PM ← ⌈\((⍳≢d)@(d,¨⍳≢d))(((⌈/d+1)(≢d))⍴0)

0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 1 3 3 3 3 7 7 7  7  7  7  7  7
0 0 2 2 4 4 6 6 8 8  8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13

      0               0
┌──┬──┴───────┐
1  3          7       1
│ ┌┴┐  ┌──────┼───┐
2 4 6  8     11   14  2
  │    │      |
  │   ┌┴─┐   ┌┴──┐
  5   9  10  12  13   3

The incantation can be broken down into:

$ grid←(⌈/d+1) (≢d) ⍴ 0
$ grid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


$ writeixs ← (d,¨⍳≢d)
$ ]disp writeixs
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘


$ ixgrid ← ((⍳≢d)@writeixs) grid
$ ixgrid
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 0 3 0 0 0 7 0 0  0  0  0  0  0
0 0 2 0 4 0 6 0 8 0  0 11  0  0 14
0 0 0 0 0 5 0 0 0 9 10  0 12 13  0


$ PM ← ⌈\ixgrid
$ PM
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 1 3 3 3 3 7 7 7  7  7  7  7  7
0 0 2 2 4 4 6 6 8 8  8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13

§ Using the path matrix: distance of a node from every other node.


Note that the maximum distance between two nodes is to climb all the way to the top node, and then climb down:
dmax ← depth(a) + depth(b)

If we know the lowest common ancestor of two nodes, then the distance of one node to another is:
dcorrect ← dist(a, lca(a, b)) + dist(b, lca(a, b))

So, we can compute the depth as:
dcorrect ← dist(a, lca(a, b)) + dist(lca(a, b), b)
 = dist(a, lca(a, b)) + depth(lca(a, b)) +
   dist(b, lca(a, b)) + depth(lca(a, b)) +
   -2 * depth(lca(a, b))
 = depth(a) +
   depth(b) +
   -2 * depth (lca(a, b))

[TODO: picture ] [TODO: finish writing this ]

§ Parent vector representation


A parent vector is a vector of length n where Parent[i] denotes an index into Parent. Hence, the following condition will return 1 if V is a parent vector.
For example, for our given example, here is the parent vector:
d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) │ depths
    (∘  a  p  b  q  v  r  c  s  w  x  t  y  z  u) │ values
p ← (∘  ∘  a  ∘  b  q  b  ∘  c  s  s  c  t  t  c) │ parents
    (0  1  2  3  4  5  6  7  8  9 10 11 12 13 14) | indexes
P ← (0  0  1  0  3  4  3  0  7  8  8  7 11 11  7) │ parent indices



              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

The condition a parent vector must satisfy is:
∧/V ∊(⍳≢V) ⍝ [All elements of V belong in the list [1..len(V)] ]


The root node (parent of all elements) can be found using the fixpoint operator ( ):
I←{(⊂⍵)⌷⍺} ⍝ index into the left hand side param using right hand side param
I⍣≡⍨p ⍝ compute the fixpoint of the I operator using ⍨ and apply it to p

§ Converting from depth vector to parent vector, Take 1


As usual, let's consider our example:
d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) │ depths
    (∘  a  p  b  q  v  r  c  s  w  x  t  y  z  u) │ values
p ← (∘  ∘  a  ∘  b  q  b  ∘  c  s  s  c  t  t  c) │ parents
    (0  1  2  3  4  5  6  7  8  9 10 11 12 13 14) | indexes
P ← (0  0  1  0  3  4  3  0  7  8  8  7 11 11  7) │ parent indices



              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

Note that the depth vector already encodes parent-child information.
For example, to compute the parent of t:11, notice that it's at depth 2. So we should find all the nodes from d[0..11] which have depths equal to 2, and then pick the rightmost one. This translates to the expression:
$ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2)
$ t ← 11   ⍝ target node
$ ixs ← ⍳t   ⍝ array indexes upto this node
  0 1 2 3 4 5 6 7 8 9 10
$ d[ixs]   ⍝ depths of nodes to the left of the given node t
  0 1 2 1 2 3 2 1 2 3 3
$ d[ixs]  = d[t]-1  ⍝ boolean vector of nodes whose depth is that of t's parent
  0 1 0 1 0 0 0 1 0 0 0
$ eqds ← ⍸ (d[ixs] = d[t]-1)  ⍝ array indexes of nodes whose depth is that of t's parent
  1 3 7
$ ⌽ eqds ⍝ reverse of array indexes to extract `7`
  7 3 1
$ ⊃ ⌽ eqds ⍝ first of the reverse of the array indexes to extract `7`
  7
$ (⌽⍸(d[⍳t] = d[t]-1))[0]  ⍝ APL style one-liner of the above

While this is intuitive, this does not scale: It does not permit us to find the parent of all the nodes at once --- ie, it is not parallelisable over choices of t.

§ Converting from depth vector to parent vector, Take 2 (Or scan idiom)


Imagine we have a list of 0s and 1s, and we want to find the index of the rightmost 1 value. For example, given:
       0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0  0  0  0)

we want the answer to be f a = 8. We saw an implementation in terms of f←{(⌽⍸⍵)[0]} in Take 1. (recall that is the symbol for the right-hand-side argument of a function).
We're going to perform the same operation slightly differently. Let's consider the series of transformations:
⍝      0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0  0  0  0) ⍝ original array

$ ⌽a  ⍝ reverse of a
  0 0 0 0 1 0 1 0 0 0 1 0 0

$ ∨\ ⌽a ⍝ prefix scan(\) using the OR(∨) operator. Turn all
        ⍝ entries after the first 1 into a 1
  0 0 0 0 1 1 1 1 1 1 1 1 1

$ +/ (∨\ ⌽a)  ⍝ sum over the previous list, counting number of 1s
  9

$ ¯1 +  (+/ (∨\ ⌽a))  ⍝ subtract 1 from the previous number
  8

Why the hell does this work? Well, here's the proof:

This technique will work for every row of a matrix . This is paramount, since we can now repeat this for the depth vector we were previously interested in for each row, and thereby compute the parent index!

§ Converting from depth vector to parent vector, Take 3 (full matrix)


We want to extend the previous method we hit upon to compute the parents of all nodes in parallel. To perform this, we need to run the moral equivalent of the following:
$ ⎕IO ← 0 ⍝ 0 indexing
$ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depth vector
$ t ← 11 ⍝ node we are interested in
$ a←d[⍳t]=d[t]-1  ⍝ boolean vector of nodes whose depth is that of t's parent
  0 1 0 1 0 0 0 1 0 0 0
$ ¯1 +  (+/ (∨\ ⌽a)) ⍝ index of last 0 of boolean vector
7

for every single choice of t . To perform this, we can build a 2D matrix of d[⍳t]=d[t]-1 where t ranges over [0..len(d)-1] (ie, it ranges over all the nodes in the graph).
We begin by using:
$ ⎕IO ← 0 ⋄ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depths
$ ]display ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
┌→────────────────────────────┐
↓0 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
└~────────────────────────────┘



$ ⎕IO ← 0 ⋄ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depths
$ ]display left ←  (⍳3) ∘.> (⍳3) ⍝ find `i > j` for all i, j.
┌→────┐
↓0 0 0│
│1 0 0│
│1 1 0│
└~────┘

Combining the three techniques, we can arrive at:
$ ⎕IO ← 0 ⋄ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depths
$ ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
$ preds ←  (⍳≢d) ∘.> (⍳≢d) ⍝ predecessors: find `i > j` for all i, j.
$ pred_higher ←  ltdepth ∧ left   ⍝ predecessors tht are higher in the tree
$  parents_take_3 ← ¯1 +  +/∨\⌽pred_higher  ⍝ previous idiom for finding last 1.
¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7

For comparison, the actual value is:
    (0   1  2  3  4  5  6  7  8  9 10 11 12 13 14)  | indexes
d ← (0   1  2  1  2  3  2  1  2  3  3  2  3  3  2)  │ depths
P ← (0   0  1  0  3  4  3  0  7  8  8  7 11 11  7)  │ parent indices
    (¯1  0  1  0  3  4  3  0  7  8  8  7 11 11  7) | parents, take 3

              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
We have an off-by-one error for the 0 node! That's easily fixed, we simply perform a maximum with 0 to move ¯1 -> 0:
$  parents_take_3 ← 0⌈  ¯1 +  +/∨\⌽pred_higher
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

So, that's our function:
parents_take_3 ← 0⌈  ¯1 +  +/∨\⌽ ((d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d))
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

Note that the time complexity for this is dominated by having to calculate the outer products, which even given infinite parallelism, take O(n)O(n) time. We will slowly chip away at this, to be far better.

§ Converting from depth vector to parent vector, Take 4 (log critial depth)

We will use the Key( ) operator which allows us to create key value pairs.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]disp (⍳≢d) ,¨ d ⍝ zip d with indexes
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│3 1│4 2│5 3│6 2│7 1│8 2│9 3│10 3│11 2│12 3│13 3│14 2│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍺ ⍵}⌸d  ⍝ each row i has tuple (i, js): d[js] = i
┌→──────────────────┐
↓   ┌→┐             │
│ 0 │0│             │
│   └~┘             │
│   ┌→────┐         │
│ 1 │1 3 7│         │
│   └~────┘         │
│   ┌→────────────┐ │
│ 2 │2 4 6 8 11 14│ │
│   └~────────────┘ │
│   ┌→───────────┐  │
│ 3 │5 9 10 12 13│  │
│   └~───────────┘  │
└∊──────────────────┘

In fact, it allows us to apply an arbitrary function to combine keys and values. We will use a function that simply returns all the values for each key.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍵}⌸d ⍝ each row i contains values j such that d[j] = i.
┌→──────────────┐
↓0 0  0  0  0  0│
│1 3  7  0  0  0│
│2 4  6  8 11 14│
│5 9 10 12 13  0│
└~──────────────┘

Our first try doesn't quite work: it winds up trying to create a numeric matrix, which means that we can't have different rows of different sizes. So, the information that only index 0 is such that d[0] = 0 is lost. What we can to is to wrap the keys to arrive at:
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⊂⍵}⌸d ⍝ d[b[i]] = i
┌→───────────────────────────────────────────┐
│ ┌→┐ ┌→────┐ ┌→────────────┐ ┌→───────────┐ │
│ │0│ │1 3 7│ │2 4 6 8 11 14│ │5 9 10 12 13│ │
│ └~┘ └~────┘ └~────────────┘ └~───────────┘ │
└∊───────────────────────────────────────────┘
Consider the groups b[2] = (2 4 6 8 11 14) and b[3] = (5 9 10 12 13). All of 3's parents are present in 2. Every element in 3 fits at some location in 2. Here is what the fit would look like:
b[2]  2 4 _ 6 8 _  _ 11 __ __ 14   (nodes of depth 2)
b[3]      5     9  10   12 13      (leaf nodes)
          4     8   8   11 11      (parents: predecessor of b[3] in b[2])

              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

We use the Interval Index( ) operator to solve the problem of finding the parent / where we should sqeeze a node from b[3] into b[2] (This is formally known as the predecessor problem )
⍝ left[a[i]] is closest number < right[i]
⍝ left[a[i]] is the predecessor of right[i] in left[i].
$ a ← (1 10 100 1000) ⍸ (1 2000 300 50 2 )
0 3 2 1 0

Now, we can use the technology of predecessor to find parents of depth 3 nodes among the depth 2 nodes:
$ depth2 ← 2 4 6 8 11 14
$ depth3 ← 5 9 10 12 13 ⍝ parents (from chart): 4 8 8 11 11
$ depth3parentixs ← depth2 ⍸ depth3
$ depth3parents  ← depth2[depth3parentixs]
4 8 8 11 11

              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

We need to know one-more APL-ism: the 2-scan. When we write a usual scan operation, we have:
$ ⍳5
1 2 3 4 5
$ +/⍳5 ⍝ reduce
15

$ 2+/⍳5 ⍝ apply + to _pairs_ (2 = pairs)
3 5 7 9 ⍝ (1+2) (2+3) (3+4) (4+5)

$ 3+/⍳5 ⍝  apply + to 3-tuples
6 9 12 ⍝ (1+2+3) (2+3+4) (3+4+5)

We begin by assuming the parent of i is i by using p←⍳≢d.
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ d2nodes ← {⊂⍵}⌸d
┌→┬─────┬─────────────┬─────────────┐
│1│2 4 8│3 5 7 9 12 15│6 10 11 13 14│
└→┴~───→┴~───────────→┴~───────────→┘
$ p←⍳≢d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Now comes the biggie:
$ findparent ← {parentixs ← ⍺⍸⍵ ⋄ p[⍵]←⍺[parentixs]}

This finally culminates in:
$ d←0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7


    (0   1  2  3  4  5  6  7  8  9 10 11 12 13 14)  | indexes
d ← (0   1  2  1  2  3  2  1  2  3  3  2  3  3  2)  │ depths
P ← (0   0  1  0  3  4  3  0  7  8  8  7 11 11  7)  │ parent indices
              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

Which can be further golfed to:
$ p⊣2{p[⍵]←⍺[⍺⍸⍵]}⌿⊢∘⊂⌸d⊣p←⍳≢d
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

The total time complexity of this method assuming infinite parallelism is as follows:
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p

§ 3.4: Computing nearest Parent by predicate


I'm going to simplify the original presentation by quite a bit.
     a b c d e f g h i  | names
     0 1 2 3 4 5 6 7 8  | indexes
P ← (0 0 1 2 0 4 5 6 7) | parents
X ← (0 1 0 0 1 1 0 0 0) | marked nodes

     a:0
┌────┴───┐
b:1(X)   e:4(X)
|        |
c:2      f:5(X)
|        |
d:3      g:6
         │
         h:7
         |
         i:8

We want to find nodes marked as X that are the closest parents to a given node. The X vector is a boolean vector that has a 1 at the index of each X node: (b, e, f). So, the indexes (1, 4, 5) are 1 in the X vector.
The output we want is the vector:
      0 1 2 3 4 5 6 7 8  | indexes
      a b c d e f g h i  | names
PX ← (0 0 1 1 0 4 5 5 5) | closest X parent index
      a a b b a e f f f  | closest X parent name

    a:0
┌────┴───┐
b:1(X)   e:4(X)
|        |
c:2      f:5(X)
|        |
d:3      g:6
         │
         h:7
         |
         i:8

The incantation is:
$ I←{(⊂⍵)⌷⍺} ⍝ index LHS by RHS | (100 101 102 103)[(3 1 2)] := 103 101 102
$ PX ← P I@{X[⍵]≠1} ⍣ ≡ P
0 0 1 1 0 4 5 5 5

TODO. At any rate, since this does not require any writes and purely reads, and nor does it need any synchronization, this is fairly straightforward to implement on the GPU.

§ 3.5: Lifting subtrees to the root


Once we have marked our X nodes, we now wish to lift entire subtrees of X up to the root.
There is some good advice in the thesis:
When using APL primitives this way, it may be good to map
their names and definitions to the domain of trees. For example,
the primitive ⍸Predicate is read as "the nodes where Predicate holds"
and not as "the indexes where Predicate is 1".

For example, given the tree:
      0 1 2 3 4 5  | indexes
      a b c d e f  | names
P  ← (0 0 1 0 3 4) | parents
X  ← (0 1 0 1 1 0) | X nodes
PX ← (0 0 1 0 3 4) | closest X parent index

    a:0
┌────┴───┐
b:1(X)   d:3(X)
|        |
c:2      e:4(X)
         |
         f:5

we want the transformed tree to be:
    a:0
┌────┴───┐
bp:1(X)   ep:4(X)
---------
b:1(X)
|
c:2
---------
e:4
|
fp:5
---------
f:5(X)
|
g:6

We first look for nodes that need to be lifted. There are:
nodes←⍸(X ∧ p≠(⍳≢p))  ⍝ ⍸:pick indexes.

§ 3.6: Wrapping Expressions

§ 3.7: Lifting Guard Test Exprsessions

§ 3.8: Couting rank of index operators

§ 3.9: Flattening Expressions

§ 3.10: Associating Frame slots and variables

§ 3.11: Placing frames into a lexical stack

§ 3.12: Recording Exported names

§ 3.13: Lexical Resolution


§ 5.2.1 Traversal Idioms

§ 5.2.2 Edge Mutation Idioms

§ 5.2.3 Node Mutation Idioms