### Lifelong Learning: Theory and Practice

PI: Joshua T. Vogelstein, [JHU](https://www.jhu.edu/)
Jayanta Dey, Ali Geisa, Hayden Helm, Ronak Mehta, Will LeVine, Carey E. Priebe
Co-PI: Vova Braverman, [JHU](https://www.jhu.edu/)
Haoran Li, Aditya Krishnan, Jingfeng Wu
SGs: SRI, Argonne, HRL
![:scale 35%](images/neurodata_blue.png)
---

### Summary

.ye[Research Question:] Why is LL difficult, and how can we design algorithms/datasets to solve it?

.ye[Approach:]
- Introduced an out-of-distribution (OOD) learning framework for theoretical analysis of lifelong learning
- Introduced ensembling representations

.ye[Accomplishments:]
- Proved various OOD weak learner theorems
- Achieved consistent positive forward and backward transfer (synergistic learning) in practice

.ye[Key Take-Away:] LL is fundamentally harder than classical ML, and ensembling representations enables synergistic learning

---

### Result 1: OOD Learning Theory

We uncouple the evaluation distribution from the training data distributions

![:scale 100%](images/learning-schematics.png)

---

### Putting LL within the OOD Framework

![:scale 100%](images/learning-table.png)

---

### Defining/Quantifying Learning & Forgetting

![:scale 100%](images/learning-efficiency.png)

Using non-task data to improve performance beyond what could be achieved with task data alone:
- Learning: $\mathbf{S}^A=\mathbf{S}\_0$ and $\mathbf{S}^B=\mathbf{S}\_n$.
- Transfer learning: $\mathbf{S}^A=\mathbf{S}^1$ and $\mathbf{S}^B=\mathbf{S}\_n$.
- Multitask learning: for each $t$, $\mathbf{S}^A=\mathbf{S}^t$ and $\mathbf{S}^B=\mathbf{S}\_n$.
- Forward learning: $\mathbf{S}^A=\mathbf{S}^t$ and $\mathbf{S}^B=\mathbf{S}^{< t}$.
- Backward learning: $\mathbf{S}^A=\mathbf{S}^{< t}$ and $\mathbf{S}^B=\mathbf{S}\_n$.

---

### Result 2: Proving novel properties of OOD learning

Classical theory:
- Weak learning: can do better than chance on some task with sufficient data
- Strong learning: can get arbitrarily close to optimal on some task with sufficient data
- Weak Learner Theorem: if a problem is weakly learnable, it is also strongly learnable

OOD learning theory:
- The training distribution is uncoupled from the evaluation distribution

---

### More data is inadequate for LL

Theorem 1: With *only* out-of-distribution data, there exist problems that are weakly, but not strongly, learnable.

- This implies that OOD learning differs *in kind* from in-distribution learning.
- Lifelong learning is a special case of OOD learning.
- Getting .ye[more] data is *not* guaranteed to improve performance arbitrarily in LL; we need .ye[better] data.

---

### Learning efficiency is a fundamental notion of learning

Theorem 2: Weak OOD learnability implies transfer learnability (i.e., learning efficiency > 1). That is, if one can weakly learn, one can also transfer learn, but not necessarily vice versa.

- This implies that transfer learnability is a fundamental property of learning problems.
- In other words, the inability to transfer is equivalent to the inability to learn at all.
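
As a hedged sketch of the quantity behind these statements (notation as on the Defining/Quantifying slide; the precise definition in the OOD paper may differ in detail), learning efficiency compares the expected risk of a learner $f$ trained on $\mathbf{S}^A$ with the same learner trained on $\mathbf{S}^B$:

$$\mathrm{LE}(f) = \frac{\mathbb{E}[R(f(\mathbf{S}^A))]}{\mathbb{E}[R(f(\mathbf{S}^B))]}$$

so $\mathrm{LE}(f) > 1$ means training on $\mathbf{S}^B$ achieves lower expected risk than training on $\mathbf{S}^A$, which is the "learning efficiency > 1" criterion in Theorem 2.
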
---

### Result 3: Ensembling representations achieves synergistic learning

![:scale 100%](images/learning_schema_new.png)

---

### Omnidirectional Algorithms Show Forward Transfer

CIFAR 10x10

![:scale 100%](images/cifar_exp_fte.png)

---

### Omnidirectional Algorithms Uniquely Show Backward Transfer for Each Task

![:scale 100%](images/cifar_exp_bte.png)

---

### Future Directions / Transitions

- Omnidirectional algorithm code continues to improve: [http://proglearn.neurodata.io/](http://proglearn.neurodata.io/)
- Streaming forests for the streaming lifelong learning setup: [https://sdtf.neurodata.io](https://sdtf.neurodata.io)

![:scale 80%](images/streaming_forest.png)

---

### Kernel Density Networks/Forests generate well-calibrated posteriors

- [https://github.com/neurodata/kdg](https://github.com/neurodata/kdg)
- KDG on Gaussian XOR simulation data

![:scale 100%](images/kdn_kdf.png)
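
A minimal sketch of the Gaussian XOR setup referenced above, assuming the common construction (four spherical Gaussians at the corners of $[-1, 1]^2$, labeled by the XOR of the corner signs). The variance, sample sizes, and the plain random forest used as a stand-in posterior estimator are illustrative assumptions, not the kdg implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sample_gaussian_xor(n, sigma=0.25, seed=None):
    """Four spherical Gaussians at the corners of [-1, 1]^2,
    labeled by the XOR of the corner signs."""
    rng = np.random.default_rng(seed)
    corners = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
    idx = rng.integers(0, 4, size=n)
    X = corners[idx] + sigma * rng.standard_normal((n, 2))
    y = (corners[idx, 0] * corners[idx, 1] < 0).astype(int)  # XOR label
    return X, y

X_train, y_train = sample_gaussian_xor(1000, seed=0)
X_test, y_test = sample_gaussian_xor(1000, seed=1)

# Stand-in baseline: a vanilla random forest; its predict_proba outputs are
# the kind of posteriors whose calibration KDN/KDF aim to improve.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
posteriors = rf.predict_proba(X_test)
print("stand-in forest accuracy:", rf.score(X_test, y_test))
```
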
---

### Deep Networks are the worst model of the mind
---

### Acknowledgements
yummy
lion
baby girl
family
earth
milkyway
##### JHU
Carey Priebe
Jesse Patsolic
Meghana Madhya
Hayden Helm
Richard Guo
Ronak Mehta
Jayanta Dey
Will LeVine
##### Microsoft Research
Chris White
Weiwei Yang
Jonathan Larson
Bryan Tower
##### DARPA L2M

{[BME](https://www.bme.jhu.edu/), [CIS](http://cis.jhu.edu/), [ICM](https://icm.jhu.edu/), [KNDI](http://kavlijhu.org/)}@[JHU](https://www.jhu.edu/) | [neurodata](https://neurodata.io)
[jovo@jhu.edu](mailto:j1c@jhu.edu) | [@neuro_data](https://twitter.com/neuro_data)

---

background-image: url(images/l_and_v.jpeg)

.footnote[Questions?]

---

class: middle

# .center[Appendix]

---

.small[

### Publications

1. A. Geisa et al. [Towards a theory of out-of-distribution learning](https://arxiv.org/abs/2109.14501), arXiv, 2021.
1. J. T. Vogelstein et al. [Omnidirectional Transfer for Quasilinear Lifelong Learning](https://arxiv.org/abs/2004.12908), arXiv, 2021.
1. H. Xu et al. [Streaming Decision Trees and Forests](https://arxiv.org/abs/2110.08483), arXiv, 2021.
1. C. E. Priebe et al. [Modern Machine Learning: Partition and Vote](https://doi.org/10.1101/2020.04.29.068460), 2020.
1. R. Guo et al. [Estimating Information-Theoretic Quantities with Uncertainty Forests](https://arxiv.org/abs/1907.00325), arXiv, 2019.
1. R. Perry et al. [Manifold Forests: Closing the Gap on Neural Networks](https://openreview.net/forum?id=B1xewR4KvH), arXiv, 2019.
1. C. Shen and J. T. Vogelstein. [Decision Forests Induce Characteristic Kernels](https://arxiv.org/abs/1812.00029), arXiv, 2019.
1. M. Madhya et al. [Geodesic Learning via Unsupervised Decision Forests](https://arxiv.org/abs/1907.02844), arXiv, 2019.
1. M. Madhya et al. [PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment](https://arxiv.org/abs/2011.05383), arXiv, 2020.

### Conferences

1. J. T. Vogelstein et al. A biological implementation of lifelong learning in the pursuit of artificial general intelligence. NAISys, 2020.
2. B. Pedigo et al. A quantitative comparison of a complete connectome to artificial intelligence architectures. NAISys, 2020.

]

---

### Biological learning is on top

![:scale 100%](images/learning-table.png)

---

### Omnidirectional Algorithms can Transfer Between XOR and XNOR

![:scale 100%](images/xor_xnor_exp.png)

---

### Spoken Digit dataset

.pull-left[
- *Spoken Digit* contains recordings from 6 different speakers.
- Each speaker recorded each digit 50 times (3,000 recordings in total).
- For each recording, a spectrogram was extracted using Hanning windows of 16 ms duration with a 4 ms overlap (see the preprocessing sketch after the results slide).
- The spectrograms were resized down to 28×28.
]

.pull-right[
]

---

### Omnidirectional Algorithms on Spoken Digit Task

![:scale 105%](images/spoken_digit.png)
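
A minimal sketch of the spectrogram preprocessing described on the Spoken Digit slide, assuming SciPy and scikit-image; the file path, sample-rate handling, and log scaling are illustrative assumptions and may differ from the pipeline actually used:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from skimage.transform import resize

def digit_spectrogram(wav_path, out_shape=(28, 28)):
    """Log-spectrogram with 16 ms Hanning windows and 4 ms overlap, resized to 28x28."""
    fs, x = wavfile.read(wav_path)      # Free Spoken Digit recordings are mono wavs (8 kHz)
    x = x.astype(float)
    nperseg = int(0.016 * fs)           # 16 ms window
    noverlap = int(0.004 * fs)          # 4 ms overlap
    _, _, spec = spectrogram(x, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
    spec = np.log(spec + 1e-10)         # log power for dynamic range (assumed, not stated on the slide)
    return resize(spec, out_shape)      # downsample to 28x28 as described on the slide

# Hypothetical file path following the dataset's naming convention:
# img = digit_spectrogram("recordings/0_jackson_0.wav")
```
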