class: center, middle
name:opening

## Lifelong Learning Forests

.center[
Joshua T. Vogelstein
[jovo@jhu.edu](mailto:jovo@jhu.edu) |
]

---

## Goals of this Talk

- Motivation
- Background
- Intuition
- Applications
- Everyone in the room understands

Please ask questions!

---

class: center, middle

# Motivation
---

### A motivating example: psychopathy

- .blue[truth]: this guy is a psychopath
- .blue[sample]: psychopathy index is high
- .blue[action]: kill him, keep him in jail, or release him
- .blue[loss]: cost of jailing
- .blue[risk]: false positives are quite bad!

Goal: release people if they are not "too" dangerous

--

But then....

- .blue[truth]: cured of psychopathy
- .blue[sample]: new index, now says low
- .blue[action]: can start treatment
- .blue[loss]: cost of treatment
- .blue[risk]: treating people who don't get better is not so bad

Goal: Can the judge (.blue[learner]) update her ruling?

---

class: center, middle

# Background

---

### Statistical Decision Theory

| object | space | definition |
|:--- |:--- |:--- |
| $\xi \sim P_\xi$ | $\Xi$ | .blue[true] latent state |
| $z \sim P_{z \vert \xi} = h(\xi)$ | $\mathcal{Z}$ | noisily .blue[sampled] state |
| $a$ | $\mathcal{A}$ | possible .blue[actions] |
| $g: \mathcal{Z} \to \mathcal{A}$ | $\mathcal{G}$ | decision rule |
| $\ell(g(z), a) \in \mathbb{R}_+$ | $\mathcal{L}$ | .blue[loss] function |
| $f_n: \mathcal{Z}^n \times \mathcal{L} \to \mathcal{G}$ | $\mathcal{F}$ | .blue[learner] |
| $R_P[f_n] = \mathbb{E}_P [ \ell (\hat{g}(z; f_n), a)]$ | $\mathcal{R}$ | .blue[risk] functional |

- Let $\varepsilon_n = R[f_n] - R^{*}$ be the .blue[residual]
- Goal: choose $f_n$ such that for many $P \in \mathcal{P}$, $\varepsilon_n \rightarrow 0$

---

## Lifelong Learning Theory

- At $t=t'$, any of the following may change: $P\_{\xi}, h, \mathcal{Z}, \mathcal{A}, \ell, R$
- We know when $\mathcal{Z}, \mathcal{A}, \ell, R$ changes
- But not necessarily whether $P=P\_{z,\xi}$ changes
- Goal: choose $f_n$ such that for many $P \in \mathcal{P}$

| constraint | definition |
|:--- |:--- |
| $\varepsilon_n(t) \rightarrow 0$ | consistency |
| $\varepsilon_n(t>t') = \varepsilon_n(t)$ | no catastrophic forgetting |
| $\varepsilon_{n+m}(t>t') \rightarrow 0$ | continual learning |
| $\\#(f_{n+m}) - \\#(f_n) \in \mathcal{O}(\log m)$ | space efficient |

What class of $f_n$ might have these properties?

---

## Random Forests Might!

- Excellent empirical performance
  - Caruana et al. 2006 (ICML): "With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. .blue[Random forests] are close second."
  - Caruana et al. 2008 (ICML): "the method that performs consistently well across all dimensions is .blue[random forests]."
  - Delgado et al. 2014 (JMLR): "The classifiers most likely to be the bests are the .blue[random forest]."

--

- Strong theoretical performance
  - .blue[consistent]: $\varepsilon_n(RF_n) \rightarrow 0$ for any $P \in \mathcal{P}$
  - .blue[space efficient]: $\\#(RF_n) \in \mathcal{O}(T n)$

---

### Brief Introduction to Random Forests

- Let $z_i=(x_i,y_i)$, where $x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- Given $z_i$ for $i \in [n]$,
- For T trees:
  - Subsample $n' < n$ samples
  - At each node $j$, select feature $d_j$ and threshold $\tau_j$ to split
  - For each child of $j$, repeat until a stopping criterion is met
- Each tree is encoded by $(d_j, \tau_j)_j$

All the magic is in choosing the $d_j$'s and $\tau_j$'s

---

### A Simple Example: XOR
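A minimal sketch of the recipe above applied to XOR, using an off-the-shelf forest; the sample size, subsampling fraction, and tree count are illustrative assumptions, not the talk's settings.

```python
# Illustrative sketch: an off-the-shelf random forest on the XOR problem.
# Data generation and hyperparameters are assumptions, not the talk's code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-1, 1, size=(n, 2))               # x_i in R^2
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # label = XOR of the two signs

# T trees, each grown on a subsample; every node stores a (d_j, tau_j) pair
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, max_samples=0.63)
rf.fit(X, y)
print("training accuracy:", rf.score(X, y))
```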
---

class: center, middle

# Intuition

---

### Scenario 1: Semisupervised

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given $x_i$ for $i = n+1, \ldots, n+m$
- Assume (for now) that $P_x(t < t') = P_x(t>t')$

--

### What would Tukey do?

---

### Strategy \# 1: Update $\tau_j$'s

- ".blue[Thm]" (Telgarsky-Vattani):
  - The set of local optima of Hartigan's method is a (possibly strict) subset of the local optima of Lloyd's method.
- .blue[Algorithm]:
  - Use Hartigan's method to recursively update the $\tau_j$'s

--

- ".blue[Conjecture 1]":
  - Updating the $\tau\_j$'s this way yields $\varepsilon'\_{n+m} < \varepsilon\_{n}$

---

### Strategy \# 2: Make trees deeper

- ".blue[Thm]" (DGL): For RF to be consistent:
  - Let $A(x)$ denote the "cell" containing $x$
  - diam$(A(x)) \rightarrow 0$ in probability
  - \# of points in $A(x) \rightarrow \infty$ in probability
- .blue[Algorithm]
  - Pass each new point down the tree
  - At the leaf node, decide whether to continue splitting
  - If so, split as per usual
- ".blue[Conjecture 2]"
  - Making trees deeper this way yields $\varepsilon'\_{n+m} < \varepsilon\_{n}$

---

### Strategy \# 3: Update $d_j$'s

- Some algorithm parameters can be estimated
- Let's think of the process at each node as a .blue[random projection]
- At each node, sample $A \sim F_A$
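A minimal sketch of this random-projection view of a split, assuming a sparse $A$ with $d$ nonzero $\pm 1$ entries and a Gini criterion; the function names and defaults are illustrative, not the talk's implementation.

```python
# Illustrative sketch of a split as a random projection: sample a sparse
# matrix A ~ F_A (d nonzero +/-1 entries), project the node's data, and pick
# the projected dimension and threshold tau with the best Gini score.
import numpy as np

def sample_A(p, k, d, rng):
    """Sparse p-by-k projection matrix with d nonzero +/-1 entries."""
    A = np.zeros((p, k))
    rows = rng.integers(0, p, size=d)
    cols = rng.integers(0, k, size=d)
    A[rows, cols] = rng.choice([-1.0, 1.0], size=d)
    return A

def gini(y):
    """Gini impurity of a binary label vector."""
    if len(y) == 0:
        return 0.0
    q = y.mean()
    return 2.0 * q * (1.0 - q)

def best_split(X, y, A):
    """Return (projected dimension j, threshold tau) minimizing weighted Gini."""
    Z = X @ A
    best_j, best_tau, best_score = None, None, np.inf
    for j in range(Z.shape[1]):
        for tau in np.unique(Z[:, j])[:-1]:     # never split past the max value
            left, right = y[Z[:, j] <= tau], y[Z[:, j] > tau]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_j, best_tau, best_score = j, tau, score
    return best_j, best_tau
```

When each column of $A$ has a single $+1$ entry this reduces to the axis-aligned splits of a standard RF; allowing more nonzeros is the relaxation discussed on the following slides.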
---

#### Strategy \# 3: Update $d_j$'s

- ".blue[Thm]" (Bayes): posterior = likelihood $\times$ prior
- $F_A$ has a low-dimensional parameterization
  - $d$ is the \# of non-zeros in the matrix
  - Option A: estimate .blue[which $d$'s] are doing well
  - Option B: estimate .blue[which dimensions] are doing well
- .blue[Algorithm]:
  - build T trees using a prior distribution over $d$
  - sample different $d$'s for each node
  - estimate the posterior over $d$/dimensions
  - build T' more trees using the posterior (sketched below)
- ".blue[Conjecture 3]":
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$

---

#### Strategy \# 3: Make trees stronger & less correlated

- ".blue[Thm]" (Breiman): RF error is bounded by a function of tree "strength" and "correlation"
- .blue[Algorithm]
  - Add more parameters to $F_A$
  - e.g., relax the constraint of 1 non-zero per row
- ".blue[Conjecture 4]" - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$ --- #### $\hat{R}\_{RerF}(n) < \hat{R}\_{RF}(n)$ - ~100 benchmark datasets from Delgado et al. - For continuous data, RerF is significantly better
---

#### Strategy \# 3: Make trees learn dictionaries

- Putting an identity matrix in there changes nothing
---

#### Strategy \# 3: Make trees learn dictionaries

- Changing that matrix enables RF to choose a dictionary
---

#### Strategy \# 3: Make trees learn dictionaries

- Choose the dictionary based on prior information
---

#### Strategy \# 3: Make trees learn dictionaries

- ".blue[Thm]" (Sulam):
  - Dictionary learning pursuit achieves the global optimum of the biconvex problem
- .blue[Algorithm]
  - Choose a low-dimensional dictionary based on prior knowledge
  - Update the posterior over the dictionary after training
- ".blue[Conjecture 5]":
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$

---

#### $\hat{R}\_{S-RerF}(n) < \hat{R}\_{RerF}(n)$
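One way to read "choose the dictionary based on prior knowledge" for structured data, in the spirit of S-RerF: sample projections over contiguous patches of an image rather than arbitrary coordinates. A minimal sketch; the patch parameterization (max\_patch, uniform placement) is an illustrative assumption.

```python
# Illustrative sketch: a structured projection for a flattened H x W image
# that sums one random contiguous patch of pixels.
import numpy as np

def sample_patch_projection(H, W, max_patch=3, rng=None):
    """One projection vector (length H*W) that sums a random contiguous patch."""
    rng = np.random.default_rng() if rng is None else rng
    h = int(rng.integers(1, max_patch + 1))
    w = int(rng.integers(1, max_patch + 1))
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    a = np.zeros((H, W))
    a[top:top + h, left:left + w] = 1.0
    return a.reshape(-1)
```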
---

### Scenario 2: "Covariate Shift" Learning

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given another $x_i$ for $i = n+1, \ldots, n+m$
- Assume that we do not know whether
$$P_x(t < t') \neq P_x(t>t')$$

---

### Strategy 4: Weight trees

- Learn $S$ trees under $P_x(t< t')$
- Learn $S'$ trees under $P_x(t>t')$
- For a new sample $x$, let $\hat{y}= \alpha RF\_{\mathcal{S}}(x) + (1-\alpha) RF\_{\mathcal{S}'}(x)$

The magic is in how to choose $\alpha$, which depends on

- the difference between $n$ and $m$
- the difference between $P_x(t < t')$ and $P_x(t>t')$

---

### Strategy 4: Weight trees

- ".blue[Thm]" (Shen): Our multiscale graph correlation (MGC) test has
$$\varepsilon_n(MGC) \leq \varepsilon_n(global)$$
- .blue[Algorithm]
  - Hypothesis test: $H_0: P_x(t < t') = P_x(t>t')$
  - Effect size estimate: $|| P_x(t < t') - P_x(t>t')||$
  - Estimate $\alpha$
- ".blue[Conjecture 6]"
  - $\forall \, P\_x(t < t'), P\_x(t>t')$, with high probability
$$\varepsilon\_{n}^{\mathcal{S}, \mathcal{S'}}(\hat{\alpha}) < \varepsilon\_{n}^{\mathcal{S'}}$$

---

#### $\hat{R}\_{MGC}(n) < \hat{R}\_{global}(n)$
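A minimal sketch of the Strategy 4 weighting above: one forest per regime, blended with a hand-supplied $\alpha$. For the sketch, labels are assumed available in both regimes and both forests are assumed to see the same classes; the slides instead estimate $\alpha$ from an MGC-style comparison of $P_x(t < t')$ and $P_x(t>t')$.

```python
# Illustrative sketch: blend the class-probability estimates of a forest
# trained before t' with one trained after t', using a fixed weight alpha.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def blended_forest(X_old, y_old, X_new, y_new, alpha=0.5):
    rf_old = RandomForestClassifier(n_estimators=100).fit(X_old, y_old)
    rf_new = RandomForestClassifier(n_estimators=100).fit(X_new, y_new)

    def predict_proba(X):
        # convex combination of the two forests' probability estimates
        return alpha * rf_old.predict_proba(X) + (1 - alpha) * rf_new.predict_proba(X)

    return predict_proba
```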
---

#### $\hat{R}\_{MGC}(n) < \hat{R}\_{global}(n)$
---

### Scenario 3: The Whole Megillah

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given another $x_i$ for $i = n+1, \ldots, n+m$
- .blue[everything] might be changing dynamically; define the setting at any time as
$$\mathcal{S}\_t=(P\_{\xi}, h, \mathcal{Z}, \mathcal{A}, \ell, R)$$
- What to do?

---

### Strategy 5: Forest Daemon

- ".blue[Thm]": we got nothing here :)
- .blue[Algorithm]
  - A .y[Forest Daemon] stores "meta-data" for each (tree, sample, setting) triple
  - The daemon performs "random forestry", that is, chooses weights for each tree and sample based on the setting
$$\hat{y} = \sum_{k=1}^K \alpha^{\mathcal{S}}_k T_k(x)$$
- ".blue[Conjecture 7]"
  - The forest daemon solves L2M!

---

### Phase 1: Demonstrate L2Forests Can Satisfy the Criteria

1. Continual learning (updating old trees)
2. Avoids catastrophic forgetting (building new trees)
3. Goal-driven perception (the daemon is given a new setting)
4. Selective plasticity (the daemon determines the relationship between current and past contexts)
5. Monitoring and safety (the daemon keeps track of everything required)

### Phase 2: Extend to Multimodal Data

- Requires learning multimodal trees
- Requires building a daemon to integrate cross-modal information

---

class: center, middle

# Applications

---

class: center, middle
---

### \#1: Characterizing Psychopathy with Changing Sensors

- 10 years of scanning psychopaths (multi-modality)
- Frequent scanner updates meticulously documented (.blue[$h$])
- Estimate recidivism using the first $n$ samples of $(x_i, y_i)$
- Improve accuracy by including $m$ additional $x_i$'s?
---

### \#2: Characterizing Personality in Non-Stationary Lifespan Data

- Bringing it back to 🦁
- Lifespan data: ages 8-80 (.blue[$P$])
- Certain phenotypes are fixed (e.g., ethnicity, gender, IQ)
- Multimodal brain measurements are dynamic with age
- Learn the function using $(x_i,y_i)$ of kids
- Continue improving using only the $x_i$ of adults

---

### Batch Effects are a Mess in these data
---

class: top, left

.pull-left[

Theory Collaborators:
- Carey E. Priebe
- Cencheng Shen
- Minh Tang
- Tyler Tomita
- James Browne
- Randal Burns
- Mincent K. Tanzynski?
- Karl Kumbier?

Data Collaborators:
- Kent Kiehl (MRN)
- Bruce Rosen (MGH)

]

.pull-right[

Love Collaborators:
- 🦁
- 😋
- 👪
- 🌍
- 🌠

Questions?
- [jovo@jhu.edu](mailto:jovo@jhu.edu)
- [neurodata.io](http://neurodata.io)

We are hiring!

]