class: center, middle
name:opening

## Lifelong Learning Forests

.center[
Joshua T. Vogelstein
[jovo@jhu.edu](mailto:jovo@jhu.edu) |
]

---

## Goals of this Talk

- Motivation
- Background
- Intuition
- Applications
- Everyone in the room understands

Please ask questions!

---

class: center, middle

# Motivation
---

### A motivating example: psychopathy

- .blue[truth]: this guy is a psychopath
- .blue[sample]: psychopathy index is high
- .blue[action]: kill him, keep him in jail, or release him
- .blue[loss]: cost of jailing
- .blue[risk]: false positives are quite bad!

Goal: release people if they are not "too" dangerous

--

But then....

- .blue[truth]: cured of psychopathy
- .blue[sample]: new index, now says low
- .blue[action]: can start treatment
- .blue[loss]: cost of treatment
- .blue[risk]: treating people who don't get better is not so bad

Goal: Can the judge (.blue[learner]) update her ruling?

---

class: center, middle

# Background

---

### Statistical Decision Theory

| object | space | definition |
|:--- |:--- |:--- |
| $\xi \sim P_\xi$ | $\Xi$ | .blue[true] latent state |
| $z \sim P_{z \vert \xi} = h(\xi)$ | $\mathcal{Z}$ | noisily .blue[sampled] state |
| $a$ | $\mathcal{A}$ | possible .blue[actions] |
| $g: \mathcal{Z} \to \mathcal{A}$ | $\mathcal{G}$ | decision rule |
| $\ell(g(z), a) \in \mathbb{R}_+$ | $\mathcal{L}$ | .blue[loss] function |
| $f_n: \mathcal{Z}^n \times \mathcal{L} \to \mathcal{G}$ | $\mathcal{F}$ | .blue[learner] |
| $R_P[f_n] = \mathbb{E}_P [ \ell (\hat{g}(z; f_n), a)]$ | $\mathcal{R}$ | .blue[risk] functional |

- Let $\varepsilon_n = R[f_n] - R^{*}$ be the .blue[residual]
- Goal: choose $f_n$ such that for many $P \in \mathcal{P}$, $\varepsilon_n \rightarrow 0$

---

## Lifelong Learning Theory

- At $t=t'$, any of the following may change: $P\_{\xi}, h, \mathcal{Z}, \mathcal{A}, \ell, R$
- We know when $\mathcal{Z}, \mathcal{A}, \ell, R$ changes
- But not necessarily whether $P=P\_{z,\xi}$ changes
- Goal: choose $f_n$ such that for many $P \in \mathcal{P}$

| constraint | definition |
|:--- |:--- |
| $\varepsilon_n(t) \rightarrow 0$ | consistency |
| $\varepsilon_n(t>t') = \varepsilon_n(t)$ | no catastrophic forgetting |
| $\varepsilon_{n+m}(t>t') \rightarrow 0$ | continual learning |
| $\\#(f_{n+m}) - \\#(f_n) \in \mathcal{O}(\log m)$ | space efficient |

What class of $f_n$ might have these properties?

---

## Random Forests Might!

- Excellent empirical performance
  - Caruana et al. 2006 (ICML): "With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. .blue[Random forests] are close second."
  - Caruana et al. 2008 (ICML): "the method that performs consistently well across all dimensions is .blue[random forests]."
  - Delgado et al. 2014 (JMLR): "The classifiers most likely to be the bests are the .blue[random forest]."

--

- Strong theoretical performance
  - .blue[consistent]: $\varepsilon_n(RF_n) \rightarrow 0$ for any $P \in \mathcal{P}$
  - .blue[space efficient]: $\\#(RF_n) \in \mathcal{O}(T n)$

---

### Brief Introduction to Random Forests

- Let $z_i=(x_i,y_i)$, where $x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- Given $z_i$ for $i \in [n]$,
- For T trees:
  - Subsample $n' < n$ samples
  - At each node $j$, select feature $d_j$ and threshold $\tau_j$ to split
  - For each child of $j$, repeat until a stopping criterion is met
- Each tree is encoded by $(d_j, \tau_j)_j$

All the magic is in choosing the $d_j$'s and $\tau_j$'s

---

### A Simple Example: XOR
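A minimal sketch of the recipe above applied to XOR, using an off-the-shelf forest; the sample size, subsampling fraction, and tree count are illustrative assumptions, not the talk's settings.

```python
# Illustrative sketch: an off-the-shelf random forest on the XOR problem.
# Data generation and hyperparameters are assumptions, not the talk's code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-1, 1, size=(n, 2))               # x_i in R^2
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # label = XOR of the two signs

# T trees, each grown on a subsample; every node stores a (d_j, tau_j) pair
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, max_samples=0.63)
rf.fit(X, y)
print("training accuracy:", rf.score(X, y))
```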
---

class: center, middle

# Intuition

---

### Scenario 1: Semisupervised

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given $x_i$ for $i = n+1, \ldots, n+m$
- Assume (for now) that $P_x(t < t') = P_x(t>t')$

--

### What would Tukey do?

---

### Strategy \# 1: Update $\tau_j$'s

- ".blue[Thm]" (Telgarsky-Vattani):
  - The set of local optima of Hartigan's method is a (possibly strict) subset of the local optima of Lloyd's method.
- .blue[Algorithm]:
  - Use Hartigan's method to recursively update the $\tau_j$'s

--

- ".blue[Conjecture 1]":
  - Updating the $\tau\_j$'s this way yields $\varepsilon'\_{n+m} < \varepsilon\_{n}$

---

### Strategy \# 2: Make trees deeper

- ".blue[Thm]" (DGL): For RF to be consistent:
  - Let $A(x)$ denote the "cell" containing $x$
  - diam$(A(x)) \rightarrow 0$ in probability
  - \# of points in $A(x) \rightarrow \infty$ in probability
- .blue[Algorithm]
  - Pass each new point down the tree
  - At the leaf node, decide whether to continue splitting
  - If so, split as per usual
- ".blue[Conjecture 2]"
  - Making trees deeper this way yields $\varepsilon'\_{n+m} < \varepsilon\_{n}$

---

### Strategy \# 3: Update $d_j$'s

- Some algorithm parameters can be estimated
- Let's think of the process at each node as a .blue[random projection]
- At each node, sample $A \sim F_A$
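A minimal sketch of this random-projection view of a split, assuming a sparse $A$ with $d$ nonzero $\pm 1$ entries and a Gini criterion; the function names and defaults are illustrative, not the talk's implementation.

```python
# Illustrative sketch of a split as a random projection: sample a sparse
# matrix A ~ F_A (d nonzero +/-1 entries), project the node's data, and pick
# the projected dimension and threshold tau with the best Gini score.
import numpy as np

def sample_A(p, k, d, rng):
    """Sparse p-by-k projection matrix with d nonzero +/-1 entries."""
    A = np.zeros((p, k))
    rows = rng.integers(0, p, size=d)
    cols = rng.integers(0, k, size=d)
    A[rows, cols] = rng.choice([-1.0, 1.0], size=d)
    return A

def gini(y):
    """Gini impurity of a binary label vector."""
    if len(y) == 0:
        return 0.0
    q = y.mean()
    return 2.0 * q * (1.0 - q)

def best_split(X, y, A):
    """Return (projected dimension j, threshold tau) minimizing weighted Gini."""
    Z = X @ A
    best_j, best_tau, best_score = None, None, np.inf
    for j in range(Z.shape[1]):
        for tau in np.unique(Z[:, j])[:-1]:     # never split past the max value
            left, right = y[Z[:, j] <= tau], y[Z[:, j] > tau]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_j, best_tau, best_score = j, tau, score
    return best_j, best_tau
```

When each column of $A$ has a single $+1$ entry this reduces to the axis-aligned splits of a standard RF; allowing more nonzeros is the relaxation discussed on the following slides.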
---

#### Strategy \# 3: Update $d_j$'s

- ".blue[Thm]" (Bayes): posterior = likelihood $\times$ prior
- $F_A$ has a low-dimensional parameterization
  - $d$ is the \# of non-zeros in the matrix
  - Option A: estimate .blue[which $d$'s] are doing well
  - Option B: estimate .blue[which dimensions] are doing well
- .blue[Algorithm]:
  - build T trees using a prior distribution over $d$
  - sample different $d$'s for each node
  - estimate the posterior over $d$/dimensions
  - build T' more trees using the posterior (sketched below)
- ".blue[Conjecture 3]":
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$

---

#### Strategy \# 3: Make trees stronger & less correlated

- ".blue[Thm]" (Breiman): RF error is bounded by a function of tree "strength" and "correlation"
- .blue[Algorithm]
  - Add more parameters to $F_A$
  - e.g., relax the constraint of 1 non-zero per row
- ".blue[Conjecture 4]" - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$ --- #### $\hat{R}\_{RerF}(n) < \hat{R}\_{RF}(n)$ - ~100 benchmark datasets from Delgado et al. - For continuous data, RerF is significantly better
---

#### Strategy \# 3: Make trees learn dictionaries

- Putting an identity matrix in there changes nothing
---

#### Strategy \# 3: Make trees learn dictionaries

- Changing that matrix enables RF to choose a dictionary
---

#### Strategy \# 3: Make trees learn dictionaries

- Choose the dictionary based on prior information
---

#### Strategy \# 3: Make trees learn dictionaries

- ".blue[Thm]" (Sulam):
  - Dictionary learning pursuit achieves the global optimum of the biconvex problem
- .blue[Algorithm]
  - Choose a low-dimensional dictionary based on prior knowledge
  - Update the posterior over the dictionary after training
- ".blue[Conjecture 5]":
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$

---

#### $\hat{R}\_{S-RerF}(n) < \hat{R}\_{RerF}(n)$
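One way to read "choose the dictionary based on prior knowledge" for structured data, in the spirit of S-RerF: sample projections over contiguous patches of an image rather than arbitrary coordinates. A minimal sketch; the patch parameterization (max\_patch, uniform placement) is an illustrative assumption.

```python
# Illustrative sketch: a structured projection for a flattened H x W image
# that sums one random contiguous patch of pixels.
import numpy as np

def sample_patch_projection(H, W, max_patch=3, rng=None):
    """One projection vector (length H*W) that sums a random contiguous patch."""
    rng = np.random.default_rng() if rng is None else rng
    h = int(rng.integers(1, max_patch + 1))
    w = int(rng.integers(1, max_patch + 1))
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    a = np.zeros((H, W))
    a[top:top + h, left:left + w] = 1.0
    return a.reshape(-1)
```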
---

### Scenario 2: "Covariate Shift" Learning

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given another $x_i$ for $i = n+1, \ldots, n+m$
- Assume that we do not know whether
$$P_x(t < t') \neq P_x(t>t')$$

---

### Strategy 4: Weight trees

- Learn $S$ trees under $P_x(t< t')$
- Learn $S'$ trees under $P_x(t>t')$
- For a new sample $x$, let $\hat{y}= \alpha RF\_{\mathcal{S}}(x) + (1-\alpha) RF\_{\mathcal{S}'}(x)$

The magic is in how to choose $\alpha$, which depends on

- the difference between $n$ and $m$
- the difference between $P_x(t < t')$ and $P_x(t>t')$

---

### Strategy 4: Weight trees

- ".blue[Thm]" (Shen): Our multiscale graph correlation (MGC) test has
$$\varepsilon_n(MGC) \leq \varepsilon_n(global)$$
- .blue[Algorithm]
  - Hypothesis test: $H_0: P_x(t < t') = P_x(t>t')$
  - Effect size estimate: $|| P_x(t < t') - P_x(t>t')||$
  - Estimate $\alpha$
- ".blue[Conjecture 6]"
  - $\forall \, P\_x(t < t'), P\_x(t>t')$, with high probability
$$\varepsilon\_{n}^{\mathcal{S}, \mathcal{S'}}(\hat{\alpha}) < \varepsilon\_{n}^{\mathcal{S'}}$$

---

#### $\hat{R}\_{MGC}(n) < \hat{R}\_{global}(n)$
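A minimal sketch of the Strategy 4 weighting above: one forest per regime, blended with a hand-supplied $\alpha$. For the sketch, labels are assumed available in both regimes and both forests are assumed to see the same classes; the slides instead estimate $\alpha$ from an MGC-style comparison of $P_x(t < t')$ and $P_x(t>t')$.

```python
# Illustrative sketch: blend the class-probability estimates of a forest
# trained before t' with one trained after t', using a fixed weight alpha.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def blended_forest(X_old, y_old, X_new, y_new, alpha=0.5):
    rf_old = RandomForestClassifier(n_estimators=100).fit(X_old, y_old)
    rf_new = RandomForestClassifier(n_estimators=100).fit(X_new, y_new)

    def predict_proba(X):
        # convex combination of the two forests' probability estimates
        return alpha * rf_old.predict_proba(X) + (1 - alpha) * rf_new.predict_proba(X)

    return predict_proba
```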
---

#### $\hat{R}\_{MGC}(n) < \hat{R}\_{global}(n)$
---

### Scenario 3: The Whole Megillah

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given another $x_i$ for $i = n+1, \ldots, n+m$
- .blue[everything] might be changing dynamically; define the setting at any time as
$$\mathcal{S}\_t=(P\_{\xi}, h, \mathcal{Z}, \mathcal{A}, \ell, R)$$
- What to do?

---

### Strategy 5: Forest Daemon

- ".blue[Thm]": we got nothing here :)
- .blue[Algorithm]
  - A .y[Forest Daemon] stores "meta-data" for each (tree, sample, setting) triple
  - The daemon performs "random forestry", that is, chooses weights for each tree and sample based on the setting
$$\hat{y} = \sum_{k=1}^K \alpha^{\mathcal{S}}_k T_k(x)$$
- ".blue[Conjecture 7]"
  - The forest daemon solves L2M!

---

### Phase 1: Demonstrate L2Forests Can Satisfy the Criteria

1. Continual learning (updating old trees)
2. Avoids catastrophic forgetting (building new trees)
3. Goal-driven perception (the daemon is given a new setting)
4. Selective plasticity (the daemon determines the relationship between current and past contexts)
5. Monitoring and safety (the daemon keeps track of everything required)

### Phase 2: Extend to Multimodal Data

- Requires learning multimodal trees
- Requires building a daemon to integrate cross-modal information

---

class: center, middle

# Applications

---

class: center, middle
---

### \#1: Characterizing Psychopathy with Changing Sensors

- 10 years of scanning psychopaths (multi-modality)
- Frequent scanner updates meticulously documented (.blue[$h$])
- Estimate recidivism using the first $n$ samples of $(x_i, y_i)$
- Improve accuracy by including $m$ additional $x_i$'s?
---

### \#2: Characterizing Personality in Non-Stationary Lifespan Data

- Bringing it back to 🦁
- Lifespan data: ages 8-80 (.blue[$P$])
- Certain phenotypes are fixed (e.g., ethnicity, gender, IQ)
- Multimodal brain measurements are dynamic with age
- Learn the function using $(x_i,y_i)$ of kids
- Continue improving using only the $x_i$ of adults

---

### Batch Effects are a Mess in these data
---

class: top, left

.pull-left[

Theory Collaborators:
- Carey E. Priebe
- Cencheng Shen
- Minh Tang
- Tyler Tomita
- James Browne
- Randal Burns
- Mincent K. Tanzynski?
- Karl Kumbier?

Data Collaborators:
- Kent Kiehl (MRN)
- Bruce Rosen (MGH)

]

.pull-right[

Love Collaborators:
- 🦁
- 😋
- 👪
- 🌍
- 🌠

Questions?
- [jovo@jhu.edu](mailto:jovo@jhu.edu)
- [neurodata.io](http://neurodata.io)

We are hiring!

]