## Lifelong Learning: challenges and practice

Joshua T. Vogelstein ([jovo@jhu.edu](mailto:jovo@jhu.edu)) | [Johns Hopkins University](https://www.jhu.edu/)

---

### What is Lifelong Learning

- Similar to multitask learning
- Sequential rather than batch
- Requires computational complexity constraints on the hypothesis and learner spaces: $o(n)$ space and/or $o(n^2)$ time as upper bounds
- Everything is streaming: data, queries, actions, error, and tasks. Anything about a task can change over time.
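To make these constraints concrete, here is a minimal sketch of a streaming learner interface under the assumptions above; the class and method names are illustrative, not from any particular library.

```python
# Minimal sketch: a streaming lifelong learner interface.
# Data, tasks, and queries arrive one batch at a time; the learner may not
# store the full stream (o(n) space) or retrain from scratch (o(n^2) time).
# Class and method names are illustrative.
class StreamingLifelongLearner:
    def update(self, X_batch, y_batch, task_id):
        """Incorporate a new batch for `task_id` without revisiting past raw data."""
        raise NotImplementedError

    def predict(self, X, task_id):
        """Answer queries for any task seen so far (or transfer to a new one)."""
        raise NotImplementedError
```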
---

### Lifelong Learning in Biology: Honey Bees Transfer Learn

--
- honey bees can also transfer to a different sensory modality (smell)
- honey bees do not forget how to do the first task
- this is called "forward transfer"
- bees learn the concept of "sameness"

---

### Lifelong learning is hard: catastrophic forgetting

---

### 30 years later...
And the struggle to not forget continues...

---

### Defining/Quantifying Learning & Forgetting

Using non-task data to improve performance beyond what could be achieved using only task data.

The key is measuring improvement in performance, rather than raw accuracy.

---

### What is forward learning?

- Let $n\_t$ be the last occurrence of task $t$ in $\mathbf{D}\_n$
- Let $\mathbf{D}\_n^{< t} = \lbrace S\_1, S\_2, \ldots, S\_{n_t} \rbrace$
- .ye[Forward] learning efficiency is the improvement on task $t$ resulting from all data .ye[preceding] task $t$

$$ FLE^s\_{\mathbf{n}}(f) := \frac{\mathcal{E}_f^s(\mathbf{D}^{t}\_n)}{\mathcal{E}_f^s(\mathbf{D}^{< t}\_n)} $$
$f$ .ye[forward learns] if $FLE_{\mathbf{n}}(f) > 1$.

---

### What is backward learning?

.ye[Backward] learning efficiency is the improvement on task $t$ resulting from all data .ye[after] task $t$

$$ BLE^s\_{\mathbf{n}}(f) := \frac{\mathcal{E}_f^s(\mathbf{D}^{< t}\_n)}{\mathcal{E}_f^s(\mathbf{D}\_n)} $$
$f$ .ye[backward learns] if $BLE_{\mathbf{n}}(f) > 1$.

---

### Learning efficiency factorizes

$$ LE^s\_{\mathbf{n}}(f) := FLE^s\_{\mathbf{n}}(f) \times BLE^s\_{\mathbf{n}}(f) $$

$$ \frac{\mathcal{E}_f^s(\mathbf{D}^{t}\_n)}{\mathcal{E}_f^s(\mathbf{D}\_n)} = \frac{\mathcal{E}_f^s(\mathbf{D}^{t}\_n)}{\mathcal{E}_f^s(\mathbf{D}^{< t}\_n)} \times \frac{\mathcal{E}_f^s(\mathbf{D}^{< t}\_n)}{\mathcal{E}_f^s(\mathbf{D}\_n)} $$
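Putting the three definitions together, a minimal numeric sketch, assuming the generalization errors $\mathcal{E}_f^s(\cdot)$ have already been estimated (e.g., on held-out task-$t$ data); the function and variable names are illustrative.

```python
# Minimal sketch: computing learning-efficiency ratios from estimated errors.
# err_task_only ~ E_f(D^t_n)   : error when trained on task-t data only
# err_preceding ~ E_f(D^{<t}_n): error when trained on all data up to task t
# err_all       ~ E_f(D_n)     : error when trained on the full data sequence

def forward_le(err_task_only, err_preceding):
    """FLE > 1 means data preceding task t improved performance on task t."""
    return err_task_only / err_preceding

def backward_le(err_preceding, err_all):
    """BLE > 1 means data arriving after task t improved performance on task t."""
    return err_preceding / err_all

def learning_efficiency(err_task_only, err_preceding, err_all):
    """LE factorizes as FLE * BLE."""
    return forward_le(err_task_only, err_preceding) * backward_le(err_preceding, err_all)

# Example with made-up error estimates:
fle = forward_le(0.30, 0.25)                  # 1.2  -> forward learning
ble = backward_le(0.25, 0.20)                 # 1.25 -> backward learning
le = learning_efficiency(0.30, 0.25, 0.20)    # 1.5  == fle * ble
```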
---

### Our claim

A lifelong learning agent should improve on

- past tasks, i.e., $BLE_{\mathbf{n}}(f) > 1$
- current tasks, i.e., $LE^s_{\mathbf{n}}(f) > 1$
- future or yet unseen tasks, i.e., $FLE_{\mathbf{n}}(f) > 1$
---

### Our approach: ensembling representations

---

### Synergistic Algorithms can Transfer Between XOR and XNOR

---

## CIFAR 10x10

.pull-left[
- *CIFAR 100* is a popular image classification dataset with 100 classes of images.
- 500 training images and 100 testing images per class.
- All images are 32x32 color images.
- CIFAR 10x10 breaks the 100-class problem into 10 tasks, each with 10 classes.
]

.pull-right[
]
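A minimal sketch of constructing the CIFAR 10x10 task sequence, assuming torchvision's CIFAR-100 loader; the contiguous class-to-task assignment below is illustrative (tasks could also be formed from shuffled classes).

```python
# Minimal sketch: splitting CIFAR-100 into the CIFAR 10x10 task sequence.
import numpy as np
from torchvision.datasets import CIFAR100

train = CIFAR100(root="./data", train=True, download=True)
x = np.asarray(train.data)       # (50000, 32, 32, 3) color images
y = np.asarray(train.targets)    # (50000,) labels in 0..99

tasks = []
for t in range(10):
    classes = np.arange(10 * t, 10 * (t + 1))   # 10 classes per task
    idx = np.isin(y, classes)
    tasks.append((x[idx], y[idx]))               # 5000 training images per task
```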
---

### Synergistic Algorithms Show Forward Transfer

---

### Synergistic Algorithms Uniquely Show Backward Transfer for Each Task

---

### Acknowledgements
##### JHU
Carey Priebe
Jesse Patsolic
Meghana Madhya
Hayden Helm
Richard Gou
Ronak Mehta
Jayanta Dey
Will LeVine
##### Microsoft Research
Chris White
Weiwei Yang
Jonathan Larson
Bryan Tower
##### DARPA L2M {[BME](https://www.bme.jhu.edu/),[CIS](http://cis.jhu.edu/), [ICM](https://icm.jhu.edu/), [KNDI](http://kavlijhu.org/)}@[JHU](https://www.jhu.edu/) | [neurodata](https://neurodata.io)
[jovo@jhu.edu](mailto:jovo@jhu.edu) |
| [@neuro_data](https://twitter.com/neuro_data)

---

background-image: url(images/l_and_v.jpeg)

.footnote[Questions?]

---

class: middle

# .center[Appendix]

---

.small[

### Publications

1. A. Geisa et al. [Towards a theory of out-of-distribution learning](https://arxiv.org/abs/2109.14501), arXiv, 2021.
1. J. T. Vogelstein et al. [Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity](https://arxiv.org/abs/2004.12908), arXiv, 2022.
1. H. Xu et al. [Simplest Streaming Trees](https://arxiv.org/abs/2110.08483), arXiv, 2022.
1. J. Dey et al. [Out-of-distribution and in-distribution posterior calibration using Kernel Density Polytopes](https://arxiv.org/abs/2201.13001), arXiv, 2022.
1. C. E. Priebe et al. [Modern Machine Learning: Partition and Vote](https://doi.org/10.1101/2020.04.29.068460), bioRxiv, 2020.
1. R. Guo et al. [Estimating Information-Theoretic Quantities with Uncertainty Forests](https://arxiv.org/abs/1907.00325), arXiv, 2019.
1. R. Perry et al. [Manifold Forests: Closing the Gap on Neural Networks](https://openreview.net/forum?id=B1xewR4KvH), arXiv, 2019.
1. C. Shen and J. T. Vogelstein. [Decision Forests Induce Characteristic Kernels](https://arxiv.org/abs/1812.00029), arXiv, 2019.
1. M. Madhya et al. [Geodesic Learning via Unsupervised Decision Forests](https://arxiv.org/abs/1907.02844), arXiv, 2019.
1. M. Madhya et al. [PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment](https://arxiv.org/abs/2011.05383), arXiv, 2020.

### Conferences

1. J. T. Vogelstein et al. A biological implementation of lifelong learning in the pursuit of artificial general intelligence. NAISys, 2020.
2. B. Pedigo et al. A quantitative comparison of a complete connectome to artificial intelligence architectures. NAISys, 2020.

]

---

### Biological learning is on top

---

### Spoken Digit dataset

.pull-left[
- *Spoken Digit* contains recordings from 6 different speakers.
- Each speaker recorded 50 examples of each digit (3,000 recordings in total).
- For each recording, a spectrogram was extracted using Hanning windows of duration 16 ms with an overlap of 4 ms.
- The spectrograms were resized down to 28×28.
]

.pull-right[
]
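A minimal preprocessing sketch, assuming 8 kHz mono recordings (as in the Free Spoken Digit Dataset) and SciPy/scikit-image; the log scaling and resize call are illustrative choices, not necessarily the exact pipeline used here.

```python
# Minimal sketch: spectrogram features for the Spoken Digit recordings.
# Assumes 8 kHz mono audio; window/overlap follow the 16 ms / 4 ms description above.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from skimage.transform import resize

def spoken_digit_features(path):
    fs, audio = wavfile.read(path)                 # e.g., fs = 8000
    nperseg = int(0.016 * fs)                      # 16 ms Hanning window
    noverlap = int(0.004 * fs)                     # 4 ms overlap
    _, _, spec = spectrogram(audio.astype(float), fs=fs,
                             window="hann", nperseg=nperseg, noverlap=noverlap)
    spec = np.log1p(spec)                          # compress dynamic range (illustrative)
    return resize(spec, (28, 28))                  # downsample to 28x28
```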
---

### Synergistic Algorithms on Spoken Digit Task