### Lifelong Learning: Theory and Context

PI: Joshua T. Vogelstein, [JHU](https://www.jhu.edu/)

Co-PI: Vova Braverman, [JHU](https://www.jhu.edu/)
Ali Geisa, Jayanta Dey, Will LeVine, Hayden Helm, Ronak Mehta, Carey E. Priebe

![:scale 30%](images/neurodata_blue.png)

---

### Outline

- background
- theoretically motivate lifelong learning metrics
- properly situate lifelong learning within a hierarchy of learning paradigms

---

class: middle

# .center[Background]

---

### What is learning (Valiant)?

![:scale 100%](images/weak-learning.png)

basically, doing better than chance with enough data

![:scale 100%](images/strong-learning.png)

basically, doing arbitrarily well with enough data

.ye[The weak learning theorem states that if a problem is weakly learnable, then it is also strongly learnable.]

---

### Limitations of this formal definition

- there is only 1 task
- requires large sample sizes for the theory to be relevant
- all data are from the same fixed distribution
- evaluation is with respect to the data distribution

---

### What is learning (Mitchell)?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

-- Tom Mitchell, 1997

- Pros
  - multiple tasks
  - uncouples experience (data) from tasks
  - explicitly mentions improving due to data
  - implicitly requires transfer
- Cons
  - not formalized

---

class: middle

# .center[Jovo Framework]

---

##### In-Distribution vs Out-of-Distribution Learning

![:scale 100%](images/learning-schematics.png)

- the key differences:
  - the evaluation distribution is uncoupled from the data distributions
  - multiple datasets & distributions

---

### Formalizing OOD Learnability

![:scale 100%](images/weak-ood-learnability.png)

basically, using non-task data to improve performance at all

![:scale 100%](images/strong-ood-learnability.png)

basically, using non-task data to perform arbitrarily well

---

### Quantifying learning

The above two definitions enable one to assess .ye[whether] an agent $f$ has learned, but not .ye[how much] it has learned.

![:scale 100%](images/learning-efficiency.png)

basically, using non-task data to improve performance over what could be achieved using only task data

---

### Weak OOD Learner Theorem

Theorem 1: With *only* out-of-distribution data, there exist problems that are weakly, but not strongly, learnable.

This implies that OOD learning differs *in kind* from in-distribution learning (and is .ye[harder]).

---

### Transfer Learning Theorem

Theorem 2: Weak OOD learnability implies transfer learnability (i.e., learning efficiency > 1).

That is, if one can weakly learn, one can also transfer learn, but not necessarily vice versa.

- This implies that transfer learnability is a fundamental property of learning problems
- In other words, the inability to transfer implies the inability to learn at all: if one cannot transfer, one cannot learn in any meaningful sense.

---

### Learning Efficiency Applications

Each of the previous definitions is a special case of $LE^t_f(\mathbf{S}^A, \mathbf{S}^B)$ for specific choices of $\mathbf{S}^A$ and $\mathbf{S}^B$ (a sketch of estimating these ratios follows below):

- Learning: $\mathbf{S}^A=\mathbf{S}\_0$ and $\mathbf{S}^B=\mathbf{S}\_n$.
- Transfer learning: $\mathbf{S}^A=\mathbf{S}\_n^t$ and $\mathbf{S}^B=\mathbf{S}\_n$.
- Multitask learning: for each $t$, $\mathbf{S}^A=\mathbf{S}\_n^t$ and $\mathbf{S}^B=\mathbf{S}\_n$.
- Forward learning: $\mathbf{S}^A=\mathbf{S}\_n^t$ and $\mathbf{S}^B=\mathbf{S}\_n^{< t}$.
- Backward learning: $\mathbf{S}^A=\mathbf{S}\_n^{< t}$ and $\mathbf{S}^B=\mathbf{S}\_n$.
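---

### Estimating Learning Efficiency (sketch)

A minimal sketch, assuming learning efficiency $LE^t_f(\mathbf{S}^A, \mathbf{S}^B)$ is (roughly) the ratio of the learner's estimated risk on task $t$ when trained on $\mathbf{S}^A$ versus on $\mathbf{S}^B$, so values above 1 mean the additional data helped. The learner, datasets, and function names below are placeholders, not code from this project.

```python
# Hedged sketch: estimate a learning-efficiency ratio with hold-out risk.
# The random forest stands in for an arbitrary learner f; the datasets are
# hypothetical (X, y) tuples supplied by the caller.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def estimated_risk(train_X, train_y, test_X, test_y):
    """Hold-out estimate of the 0-1 risk of a learner trained on (train_X, train_y)."""
    f = RandomForestClassifier(n_estimators=100).fit(train_X, train_y)
    return np.mean(f.predict(test_X) != test_y)

def learning_efficiency(S_A, S_B, task_t_test):
    """Risk on task t when trained on S^A, divided by risk when trained on S^B."""
    X_test, y_test = task_t_test
    risk_A = estimated_risk(*S_A, X_test, y_test)
    risk_B = estimated_risk(*S_B, X_test, y_test)
    return risk_A / risk_B

# Transfer learning efficiency for task t (per the list above) would use
#   S^A = data from task t only, S^B = data pooled across all tasks:
# le = learning_efficiency(S_task_t, S_all_tasks, task_t_test)
# le > 1  =>  the out-of-task data improved performance on task t.
```

Forward and backward learning efficiencies follow the same recipe with the corresponding choices of $\mathbf{S}^A$ and $\mathbf{S}^B$ from the list above.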
---

### Lifelong Learning $\subsetneq$ OOD learning

![:scale 65%](images/nested-learning-schematic.png)

---

### Biological learning is on top

![:scale 100%](images/learning-table.png)

---

### Discussion

- unified definition and quantification of learning
- presented a hierarchy of learning paradigms
- limitation of the current framework: in biology, there are no tasks

---

### Transition Opportunities

### [http://proglearn.neurodata.io/](http://proglearn.neurodata.io/)

![:scale 80%](images/proglearn_webpage.png)

- the code continues to improve (no time to discuss here)
- ensembling representations (rather than decision rules) continues to be a promising path to solving OOD (including lifelong) and eventually biological learning (a toy sketch follows on the next slide)
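---

### Ensembling Representations (sketch)

A toy sketch of the representation-ensembling idea from the previous slide, not the proglearn implementation; the class, method, and variable names are assumptions, and scikit-learn supplies the building blocks. Each task trains its own transformer (a forest used purely as a feature map via its leaf indices), every task gets one lightweight voter per representation, and prediction averages the voters' posteriors rather than combining hard decision rules.

```python
# Illustrative only: per-task transformers (representations), cross-task voters,
# and posterior averaging. Not the proglearn API.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

class RepresentationEnsemble:
    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators
        self.transformers = {}  # task_id -> (forest, leaf one-hot encoder)
        self.voters = {}        # (repr_id, task_id) -> classifier on that representation
        self.task_data = {}     # task_id -> (X, y), kept so old tasks gain new voters

    def _fit_transformer(self, X, y):
        forest = RandomForestClassifier(n_estimators=self.n_estimators).fit(X, y)
        encoder = OneHotEncoder(handle_unknown="ignore").fit(forest.apply(X))
        return forest, encoder

    def _transform(self, repr_id, X):
        forest, encoder = self.transformers[repr_id]
        return encoder.transform(forest.apply(X))  # which leaf of which tree each sample hits

    def add_task(self, X, y, task_id):
        self.task_data[task_id] = (X, y)
        self.transformers[task_id] = self._fit_transformer(X, y)
        # The new task gets a voter under every representation (forward transfer) ...
        for repr_id in self.transformers:
            self.voters[(repr_id, task_id)] = LogisticRegression(max_iter=1000).fit(
                self._transform(repr_id, X), y)
        # ... and every earlier task gets a voter under the new representation (backward transfer).
        for t, (Xt, yt) in self.task_data.items():
            if t != task_id:
                self.voters[(task_id, t)] = LogisticRegression(max_iter=1000).fit(
                    self._transform(task_id, Xt), yt)

    def predict(self, X, task_id):
        # Ensemble the representations: average per-representation posteriors
        # instead of combining hard decision rules.
        posteriors = [self.voters[(r, task_id)].predict_proba(self._transform(r, X))
                      for r in self.transformers]
        return self.voters[(task_id, task_id)].classes_[np.mean(posteriors, axis=0).argmax(axis=1)]
```

Storing each task's data to refit voters is a simplification; the point is only that new tasks reuse old representations (forward transfer) and old tasks gain voters on new representations (backward transfer).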
---

### Acknowledgements
##### JHU
Carey Priebe
Meghana Madhya
Ronak Mehta
Jayanta Dey
Will LeVine
Hayden Helm
Richard Gou
Ali Geisa
##### Microsoft Research
Chris White
Weiwei Yang
Jonathan Larson
Bryan Tower
##### DARPA L2M

All code open source and reproducible from [proglearn.neurodata.io/](http://proglearn.neurodata.io/)

{[BME](https://www.bme.jhu.edu/), [CIS](http://cis.jhu.edu/), [ICM](https://icm.jhu.edu/), [KNDI](http://kavlijhu.org/)}@[JHU](https://www.jhu.edu/) | [neurodata](https://neurodata.io)
[jovo@jhu.edu](mailto:jovo@jhu.edu) | [@neuro_data](https://twitter.com/neuro_data)

---

background-image: url(images/l_and_v.jpeg)

.footnote[Questions?]