### Lifelong Learning: Theory and Practice and Coresets

PI: Joshua T. Vogelstein, [JHU](https://www.jhu.edu/)
Co-PI: Vova Braverman, [JHU](https://www.jhu.edu/)
Jayanta Dey, Will LeVine, Hayden Helm, Ali Geisa, Ronak Mehta, Carey E. Priebe



---

Biological agents progressively build representations to transfer both forward & backward.



1. We learn representations independently for each task, and ensemble the representations from both past and future tasks
2. We illustrate that it achieves SOTA forward transfer, unique backward transfer, and uniquely monotonic transfer on CIFAR 10x10
3. It can be applied to any sequential task-aware classification task

---

### Potential Collaborations

### [http://proglearn.neurodata.io/](http://proglearn.neurodata.io/)



---

### Key Innovations

1. Modernized statistical decision theory to explicitly incorporate a .ye[learning task]
2. Introduced .ye[learning efficiency]
3. Formalized and unified .ye[learning metrics]
4. Identified .ye[partition & vote] equivalence of Deep Networks and Decision Forests
5. Developed .ye[Progressive Learning Forests and Networks]
6. Illustrated the value of .ye[coresets] for lifelong learning (Braverman)

---

### Learning Task Definition

| Component | Notation | Examples |
| :--- | :--- | :--- |
| Query Space | $\mathcal{Q}$ | is this a cat? |
| Action Space | $\mathcal{A}$ | A, B, ←, →, ↑, ↓ |
| Measurement Space | $\mathcal{Z}$ | 8-bit images, 256 x 256 |
| Statistical Model | $\mathcal{P}$ | Gaussian |
| Hypotheses | $\mathcal{H}$ | linear classifiers |
| Risk | $R$ | expected loss |
| Algorithm Space | $\mathcal{F}$ | Random Forests |
| True & Unknown Distribution | $P$ | $\mu=0$, $\sigma=1$ |

---

### What is learning?

.ye[$f$] learns from .ye[data] $\mathbf{Z}_n$ with respect to .ye[task] $t$ when its .ye[performance] at $t$ improves due to $\mathbf{Z}_n$.

- Define .ye[generalization error] $\mathcal{E}_n(f) := \mathbb{E}_P[R(f(\mathbf{Z}_n))]$
- $\mathbf{Z}_0$ corresponds to no data.
- Define .ye[learning efficiency]:

$$LE_n(f) := \frac{\mathbb{E}_P[R(f(\mathbf{Z}_0))]}{\mathbb{E}_P[R(f(\mathbf{Z}_n))]} = \frac{\mathcal{E}_0(f)}{\mathcal{E}_n(f)}$$

$f$ learns from $\mathbf{Z}_n$ with respect to task $t$ when $LE_n(f) > 1$.
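---

Not from the original deck: a minimal sketch of estimating $LE_n(f)$ under 0-1 loss, approximating $\mathcal{E}_n(f)$ by held-out error after training on $n$ samples and $\mathcal{E}_0(f)$ by the error of uniform guessing over the classes. The function name and the scikit-learn choices are illustrative assumptions, not part of the framework.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def estimate_learning_efficiency(X, y, n_train, seed=0):
    """Estimate LE_n(f) = E_0(f) / E_n(f) under 0-1 loss."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=n_train, random_state=seed, stratify=y
    )
    err_0 = 1.0 - 1.0 / len(np.unique(y))       # chance error: f(Z_0) guesses uniformly
    f = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    err_n = np.mean(f.predict(X_te) != y_te)    # held-out estimate of E_n(f)
    return err_0 / err_n                        # > 1  <=>  f learned from Z_n


X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
print(estimate_learning_efficiency(X, y, n_train=500))   # typically well above 1
```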
---

### What is forward learning?

- Let $n\_t$ be the last occurrence of task $t$ in $\mathbf{Z}\_n$
- Let $\mathbf{Z}\_n^{< t} = \lbrace Z\_1, Z\_2, \ldots, Z\_{n_t} \rbrace$

.ye[Forward] learning efficiency is the improvement on task $t$ resulting from all data .ye[preceding] task $t$

$$FL^t\_{\mathbf{n}}(f) := \frac{\mathbb{E}[R^t(f(\mathbf{Z}^{t}\_n))]}{\mathbb{E}[R^t(f(\mathbf{Z}^{< t}\_n))]} =\frac{\mathcal{E}\_{Z\_n^t}(f)}{\mathcal{E}\_{Z\_n^{< t}}(f)}.$$

$f$ .ye[forward learns] if $FL_{\mathbf{n}}(f) > 1$ (see the sketch after the backward-learning slide).
---

### What is backward learning?

.ye[Backward] learning efficiency is the improvement on task $t$ resulting from all data .ye[after] task $t$

$$ BL^t\_{\mathbf{n}}(f) := \frac{\mathbb{E}[R^t(f(\mathbf{Z}^{< t}\_n))]}{\mathbb{E}[R^t(f(\mathbf{Z}\_n))]} =\frac{\mathcal{E}\_{Z\_n^{< t}}(f)}{\mathcal{E}\_{Z\_n}(f)}. $$

$f$ .ye[backward learns] if $BL_{\mathbf{n}}(f) > 1$.
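---

Not from the original deck: a minimal numeric sketch of the forward and backward ratios above, assuming Monte Carlo estimates of the task-$t$ error under three data regimes are already available. The function names and the example error values are hypothetical.

```python
def forward_learning_efficiency(err_task_only, err_up_to_task):
    """FL^t: error with only task-t data / error with all data up to (and including) task t."""
    return err_task_only / err_up_to_task


def backward_learning_efficiency(err_up_to_task, err_all):
    """BL^t: error with data up to task t / error with all data, including later tasks."""
    return err_up_to_task / err_all


def learning_efficiency(err_A, err_B):
    """Generic LE^t(Z_A, Z_B; f): each metric above is this ratio for some choice of A and B."""
    return err_A / err_B


# Hypothetical error estimates for one task: 0.40 using only that task's data,
# 0.32 after also seeing earlier tasks, 0.30 after later tasks arrive as well.
print(forward_learning_efficiency(0.40, 0.32))    # 1.25  -> forward transfer occurred
print(backward_learning_efficiency(0.32, 0.30))   # ~1.07 -> backward transfer occurred
```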
---

### Unification of Learning Metrics

Each of the previous definitions is a special case of $LE^t(\mathbf{Z}\_A, \mathbf{Z}\_B; f)$, for specific choices of $\mathbf{Z}\_A$ and $\mathbf{Z}\_B$:

- Learning: $\mathbf{Z}\_A=\mathbf{Z}\_0$ and $\mathbf{Z}\_B=\mathbf{Z}\_n$.
- Transfer learning: $\mathbf{Z}\_A=\mathbf{Z}\_n^t$ and $\mathbf{Z}\_B=\mathbf{Z}\_n$.
- Multitask learning: for each $t$, $\mathbf{Z}\_A=\mathbf{Z}\_n^t$ and $\mathbf{Z}\_B=\mathbf{Z}\_n$.
- Forward learning: $\mathbf{Z}\_A=\mathbf{Z}\_n^t$ and $\mathbf{Z}\_B=\mathbf{Z}\_n^{< t}$.
- Backward learning: $\mathbf{Z}\_A=\mathbf{Z}\_n^{< t}$ and $\mathbf{Z}\_B=\mathbf{Z}\_n$.

Conjecture: All learning metrics we care about are functions of learning efficiency for a specific $\mathbf{Z}\_A$ and $\mathbf{Z}\_B$.

---

### Deep Nets and Decision Forests



- Both learn convex polytope partitions of feature space, with affine activation functions
- We can easily swap between the two empirically and theoretically

---

#### Progressive Learning Forests and Networks



1. Representers can be forests, networks, etc.
2. Separate representers are learned for each task
3. Voters leverage both past and future representers

(A simplified code sketch of this scheme appears just before the acknowledgements.)

---

### Key Strengths



- SOTA forward transfer
- Unique backward transfer using 500 training samples (SOTA)
- Monotonically increasing backward transfer (SOTA)

---

### Key Limitations

1. Only works in task-aware settings (not task-unaware)
2. Only works for classification (not regression)
3. Only handles batched data (not streaming, not RL)
4. Theorems not yet with dotted i's and crossed t's

---



---


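---

The sketch referenced earlier: a simplified, self-contained rendition of the representation-ensembling idea (one forest representer per task; per-task voters over every representer's leaf partition), written with scikit-learn. This is not the proglearn implementation; the class and method names (`SimpleProgressiveForest`, `add_task`, `predict`) are invented for illustration, and the reference code is at [proglearn.neurodata.io/](http://proglearn.neurodata.io/).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class SimpleProgressiveForest:
    """Separate representer (forest) per task; each task votes over ALL representers."""

    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators
        self.representers = {}   # task_id -> fitted forest (a partition of feature space)
        self.voters = {}         # (representer_id, task_id) -> per-tree leaf posteriors
        self.classes_ = {}       # task_id -> sorted label set
        self.task_data = {}      # task_id -> (X, y), kept to build voters for new representers

    def add_task(self, X, y, task_id):
        # 1) Learn a new representer from this task's data only.
        self.representers[task_id] = RandomForestClassifier(
            n_estimators=self.n_estimators
        ).fit(X, y)
        self.classes_[task_id] = np.unique(y)
        self.task_data[task_id] = (X, y)
        # 2) Build the missing voters: old tasks gain the new representer (backward
        #    transfer) and the new task uses all old representers (forward transfer).
        for rep_id in self.representers:
            for t_id in self.task_data:
                if (rep_id, t_id) not in self.voters:
                    self.voters[(rep_id, t_id)] = self._fit_voter(rep_id, t_id)

    def _fit_voter(self, rep_id, t_id):
        # Per-tree, per-leaf class frequencies of task t_id's labels under representer rep_id.
        X, y = self.task_data[t_id]
        classes = self.classes_[t_id]
        leaves = self.representers[rep_id].apply(X)          # shape: (n_samples, n_trees)
        voter = []
        for tree in range(leaves.shape[1]):
            counts = {}
            for leaf, label in zip(leaves[:, tree], y):
                counts.setdefault(leaf, np.zeros(len(classes)))
                counts[leaf][np.searchsorted(classes, label)] += 1
            voter.append({leaf: c / c.sum() for leaf, c in counts.items()})
        return voter

    def predict(self, X, task_id):
        # Average posteriors over every tree of every representer, then take the argmax.
        classes = self.classes_[task_id]
        uniform = np.full(len(classes), 1.0 / len(classes))  # fallback for unseen leaves
        votes = np.zeros((len(X), len(classes)))
        for rep_id, forest in self.representers.items():
            leaves = forest.apply(X)
            voter = self.voters[(rep_id, task_id)]
            for tree in range(leaves.shape[1]):
                for i, leaf in enumerate(leaves[:, tree]):
                    votes[i] += voter[tree].get(leaf, uniform)
        return classes[votes.argmax(axis=1)]
```

Here each forest's leaf partition plays the role of the .ye[partition] and the per-task leaf posteriors play the role of the .ye[vote], so earlier tasks can reuse partitions learned later (backward transfer) and new tasks reuse earlier partitions (forward transfer).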
---

### Acknowledgements

##### JHU
Carey Priebe
Meghana Madhya
Ronak Mehta
Jayanta Dey
Will LeVine
Hayden Helm
Richard Gou
Ali Geisa
##### Microsoft Research
Chris White
Weiwei Yang
Jonathan Larson
Bryan Tower
##### DARPA L2M

All code is open source and reproducible from [proglearn.neurodata.io/](http://proglearn.neurodata.io/)

{[BME](https://www.bme.jhu.edu/), [CIS](http://cis.jhu.edu/), [ICM](https://icm.jhu.edu/), [KNDI](http://kavlijhu.org/)}@[JHU](https://www.jhu.edu/) | [neurodata](https://neurodata.io) | [jovo@jhu.edu](mailto:jovo@jhu.edu) | [@neuro_data](https://twitter.com/neuro_data)

---

background-image: url(images/l_and_v.jpeg)

.footnote[Questions?]