### Towards a theory of out-of-distribution learning

Ali | Ronak | Jovo | CEP | Hayden | Jayanta

![:scale 40%](images/neurodata_blue.png)

---

### Background and Motivation

- Why do we need a theory paper?
- Classical machine learning is: you are given data, and the goal is to train a system for the task at hand.
- The task could be something like image classification (cat or dog) or clustering, and you train a neural network to do the classification for you.
- A lot of theory has been done on this sort of learning (see VC theory and Valiant's PAC framework).

---

### Background and Motivation

- There are many other machine learning paradigms that haven't been theoretically formalized well yet.
- In the past it didn't matter, because we couldn't even do classical machine learning that well.
- GPUs changed all that, and now we want to do things like transfer learning, multitask learning, lifelong learning, etc.

---

### Learning

- This can get pretty philosophical and complicated, so let's not go there. We use a practical definition instead.
- A learner learns with respect to a task from data if the learner's performance on the task improves because of the data.
- In plain English: if we are doing image classification (the task), and our network classifies better (performance) after training it on the data, then the network has learned.

---

### Performance

- Performance is just the risk. In a familiar scenario like linear regression, it is simply the expected squared loss,
$$ \mathbb{E}\_{P_{X, Y}}[\left(h(X) - Y\right)^2] $$
- $ X $ is the query: the pattern/feature, the image to be classified, the thing to which we assign a label.
- $ Y $ is the action: the label, cat or dog, the actual cluster assigned to the pattern.
- $ h $ is the hypothesis: a function which, when given a query, outputs its guess for the corresponding correct action.
- $ P_{X, Y} $ is a distribution over queries and actions.

---

### Performance

- Performance, in its general form, is then
$$ R\_{P_{X, Y}}(h) $$

---

### Data

- The data can be query-action pairs (i.e., $ (X, Y) $ pairs), or, if we have a Gaussian $\mathcal{N}(\mu, 1)$ whose mean we want to estimate, the data can just be $n$ observations from the distribution.

---

### Performance with Data

- Some function of the expected risk, or generalization error,
$$ \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R(\hat{h})] $$
- $ \mathbf{S} $ is the data set.
- $ \hat{h} $ is the hypothesis estimated using the data.
- From where do we get $ \hat{h} $?

---

### Learner

- The learner $ f $ is what produces $ \hat{h} $.
- The learner takes in the data set $ \mathbf{S} $ and outputs an estimated hypothesis,
$$ f(\mathbf{S}) = \hat{h} $$
- The performance then is really some function of
$$ \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)] $$

---

### Task

- After we have specified all the previous stuff (queries, actions, data, etc.), the task is to minimize the generalization error
$$ \text{min } \qquad \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)] $$
$$ \text{such that } \qquad f \in \mathcal{F} $$
- $ \mathcal{F} $ is the space of learners (we could, for example, have computational complexity constraints on the learners used).
- A small simulated sketch of this objective follows on the next slide.
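---

### Generalization Error: A Toy Sketch

As a concrete, purely illustrative sketch of the objective above, the Python below estimates $ \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)] $ by Monte Carlo for one made-up choice of $ P_{X, Y} $ and one simple learner (least-squares regression); both choices are assumptions of this sketch, not anything specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def learner(S_x, S_y):
    """A simple learner f: least-squares fit of a line, returning the hypothesis h-hat."""
    slope, intercept = np.polyfit(S_x, S_y, deg=1)
    return lambda x: slope * x + intercept

def generalization_error(n_train, n_reps=200, n_test=10_000):
    """Monte Carlo estimate of E_{P_{X,Y,S}}[ R(f(S)) ] under a toy P_{X,Y}."""
    risks = []
    for _ in range(n_reps):
        # Toy distribution (an assumption of this sketch): X ~ N(0, 1), Y = 2X + noise.
        S_x = rng.normal(size=n_train)
        S_y = 2.0 * S_x + rng.normal(scale=0.5, size=n_train)
        h_hat = learner(S_x, S_y)                    # h-hat = f(S)
        X = rng.normal(size=n_test)
        Y = 2.0 * X + rng.normal(scale=0.5, size=n_test)
        risks.append(np.mean((h_hat(X) - Y) ** 2))   # R(h-hat), estimated on fresh draws
    return float(np.mean(risks))

print(generalization_error(n_train=20))  # roughly the noise variance (0.25) plus estimation error
```

The learning-efficiency quantities on the following slides are ratios of generalization errors of exactly this kind.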
---

### Improving Performance

- Recall our definition of learning: "A learner learns with respect to a task from data if the learner's performance on the task improves because of the data."
- The final piece is improving performance because of the data.
- We introduce learning efficiency to measure this,
$$ \text{LE}_f^t(\mathbf{S}_0, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}_0)}{\mathcal{E}_f^t(\mathbf{S})} $$
- $ \mathcal{E}_f^t(\mathbf{S}) $ is the generalization error of learner $ f $ on task $ t $ when given the data $ \mathbf{S} $.
- $ \mathbf{S}_0 = \varnothing $ is the empty data set, meaning no data.
- If $ \text{LE} > 1 $, then performance has improved because of the data $ \mathbf{S} $.

---

### Generalized Task

- A task is the object of study in classical machine learning.
- There we are given a data set $\mathbf{S}$ and a task $t$.
- The break we make with this framework is that we now allow for multiple data sets and multiple tasks.
- A super task is when we have multiple data sets $ \lbrace \mathbf{S}^1, ... , \mathbf{S}^m \rbrace $ and multiple tasks $ \lbrace t_1, ..., t_n \rbrace $, $m \geq n$.
- Note that we can have empty data sets.
- Multitask learning, transfer learning, etc. can now be framed as generalized tasks.

---

### Transfer Learning

- In transfer learning, we have a single task and multiple data sets.
- We want to measure whether the out-of-task data has helped our performance more than using the task data alone.
- We use learning efficiency to measure this,
$$ \text{LE}_f^t(\mathbf{S}^1, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^1)}{\mathcal{E}_f^t(\mathbf{S})} $$
- $ \mathbf{S}^1 $ is the task data.
- $ \mathbf{S} $ is all of the data (i.e., all of the data sets put together).
- We have transfer learned if $ \text{LE}_f^t > 1 $.

---

### Multitask Learning

- In multitask learning, we have multiple tasks and multiple data sets.
- We want to measure how much we have transfer learned for each task,
$$ \text{LE}_f^t(\mathbf{S}^t, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^t)}{\mathcal{E}_f^t(\mathbf{S})} $$
- $ \mathbf{S}^t $ is the task $t$ data.
- $ \mathbf{S} $ is all of the data.
- We have transfer learned for that task if $ \text{LE}_f^t > 1 $.
- We have multitask learned if some weighted average of the learning efficiencies $\text{LE}_f^{t_1}, ..., \text{LE}_f^{t_n}$ is greater than $1$.

---

### Continual Learning

- Continual learning is closely akin to multitask learning.
- The primary difference is that the learning happens sequentially in time.
- This leads us to explicitly require computational complexity constraints on the hypothesis and learner spaces. Namely, we require $ o(n) $ space and/or $ o(n^2) $ time as upper bounds on the complexity.
- The other aspect of continual learning is that everything is streaming: data, queries, actions, error, and tasks. Hence anything about the task the learner is faced with can change over time (without the learner necessarily knowing that a change has occurred).

---

### Quantifying Continual Learning

With continual learning, we want to incorporate the time-dependent, streaming nature of the problem into our performance metrics. Given a task $ t $, let $ \mathbf{S}^{< t} $ be the set of data points up to and including the last data point from task $ t $.

We define forward transfer to be
$$ \text{LE}_f^t(\mathbf{S}^t, \mathbf{S}^{< t}) = \frac{\mathcal{E}_f^t(\mathbf{S}^t)}{\mathcal{E}_f^t(\mathbf{S}^{< t})} $$

and we define backward transfer to be
$$ \text{LE}_f^t(\mathbf{S}^{< t}, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^{< t})}{\mathcal{E}_f^t(\mathbf{S})} $$

where $ \mathbf{S} $ is all of the data and $ \mathbf{S}^t $ is the task $ t $ data. A small numerical example of these ratios is given on the next slide.
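---

### Forward and Backward Transfer: A Worked Example

A minimal sketch of how these ratios compose, using made-up error values (placeholders, not results from the paper); in practice each $ \mathcal{E}_f^t(\cdot) $ would be estimated on held-out task-$t$ queries.

```python
def learning_efficiency(err_reference, err_augmented):
    """Ratio of generalization errors: > 1 means the extra data helped on task t."""
    return err_reference / err_augmented

# Hypothetical estimated generalization errors for a fixed task t (placeholders):
err_task_only = 0.30   # E_f^t(S^t)    : learner trained on task-t data only
err_up_to_t   = 0.24   # E_f^t(S^{<t}) : trained on all data seen up to and including task t
err_all       = 0.20   # E_f^t(S)      : trained on all data, including later tasks

forward_transfer  = learning_efficiency(err_task_only, err_up_to_t)  # 0.30 / 0.24 = 1.25
backward_transfer = learning_efficiency(err_up_to_t, err_all)        # 0.24 / 0.20 = 1.20
overall_transfer  = learning_efficiency(err_task_only, err_all)      # 0.30 / 0.20 = 1.50

print(forward_transfer, backward_transfer, overall_transfer)
```

Note that overall transfer factors into forward transfer times backward transfer, since the intermediate error cancels when the two ratios are multiplied.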
---

### Other definitions of learning for OOD scenarios

- We used learning efficiency as the base metric with which to measure performance and transfer in the various OOD scenarios.
- There are many other ways to measure transfer and performance as well.
- Analogous to traditional learning, we will introduce and examine here the two notions of weak OOD learning and strong OOD learning.

---

### Weak OOD

Assume we have some model of distributions $ \mathcal{P} $ and source distribution $ P $. Loosely, we say $ \mathcal{P} $ is weakly OOD learnable with target data of size $ n $ if, given enough source data, we can perform better than the baseline performance (i.e., with just the target data) with arbitrarily high probability.

---

### Strong OOD

Assume we have some model of distributions $ \mathcal{P} $ and source distribution $ P $. We say $ \mathcal{P} $ is strongly OOD learnable with target data of size $ n $ if, given enough source data, we can perform arbitrarily well (i.e., arbitrarily close to the Bayes risk) with arbitrarily high probability.

---

### Theoretical Results

- Strong OOD learning implies weak OOD learning.
- Weak OOD meta-learning does not imply strong OOD meta-learning (meaning we cannot boost).
- Weak OOD learning implies positive transfer (meaning learning efficiency is greater than $ 1 $), but not vice versa.
- A toy simulation of positive transfer follows on the next slide.
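---

### Positive Transfer: A Toy Simulation

To connect positive transfer back to the Gaussian mean-estimation example from earlier, here is a small Monte Carlo sketch. It is our own illustration, not a construction from the paper: the target task is estimating the mean of $ \mathcal{N}(0, 1) $ from five samples, and the source data comes from a nearby distribution $ \mathcal{N}(0.1, 1) $, a choice made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_of_mean_estimate(mu_target, mu_source, n_target, n_source, trials=10_000):
    """Monte Carlo estimate of E[(mu_hat - mu_target)^2] for the pooled sample mean."""
    errs = np.empty(trials)
    for i in range(trials):
        target = rng.normal(mu_target, 1.0, n_target)
        source = rng.normal(mu_source, 1.0, n_source)   # n_source = 0 gives the target-only baseline
        mu_hat = np.concatenate([target, source]).mean()
        errs[i] = (mu_hat - mu_target) ** 2
    return errs.mean()

# Target-only baseline vs. pooling 50 source samples from a nearby distribution.
err_target_only = mse_of_mean_estimate(0.0, 0.1, n_target=5, n_source=0)
err_pooled      = mse_of_mean_estimate(0.0, 0.1, n_target=5, n_source=50)

learning_efficiency = err_target_only / err_pooled
print(f"LE = {learning_efficiency:.2f}")  # > 1 here: the variance reduction from the
                                          # source data outweighs its small bias
```

With the source mean moved far from the target mean, the bias dominates and the same computation gives $ \text{LE} < 1 $; whether enough source data can always help (weak OOD) or drive the error toward the Bayes risk (strong OOD) depends on the model of distributions $ \mathcal{P} $.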