
Towards a theory of out-of-distribution learning

Ali | Ronak | Jovo | CEP | Hayden | Jayanta

1 / 20

Background and Motivation

  • Why do we need a theory paper?
  • Classical machine learning is: you are given data, and the goal is to train a system for the task at hand.
  • The task could be things like image classification (cat or dog) or clustering, and you train a neural network to do the classification for you.
  • A lot of theory exists for this kind of learning (see Vapnik-Chervonenkis theory and Valiant's PAC framework).
2 / 20

Background and Motivation

  • There are many other machine learning paradigms that haven't been theoretically formalized well yet.
  • In the past it didn't matter because we couldn't even do classical machine learning that well.
  • GPUs changed all that, and now we want to do things like transfer learning, multitask learning, lifelong learning, and so on.
3 / 20

Learning

  • This can get pretty philosophical and complicated, so let's not go there. We use a practical definition instead.
  • A learner learns with respect to a task from data if the learner's performance on the task improves because of the data.
  • In English, if we are doing image classification (the task), and our network classifies better (performance) after training it on the data, then the network has learned.
4 / 20

Performance

  • Performance is just the risk. In a familiar scenario like linear regression, it is simply the expected squared loss,

\mathbb{E}_{P_{X, Y}}[\left(h(X) - Y\right)^2]

  • X is the query: the pattern/feature, the image to be classified, the thing to which we assign a label.
  • Y is the action: the label (cat or dog), the cluster assigned to the pattern.
  • h is the hypothesis: a function that, given a query, outputs its guess for the corresponding correct action.
  • P_{X, Y} is a distribution over queries and actions.
5 / 20

Performance

  • Performance, then, in its general form is

R_{P_{X, Y}}(h)
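To make this concrete, here is a minimal Python sketch (the Gaussian data model and every name in it are our own toy assumptions, not from the paper) that Monte Carlo estimates the squared-loss risk of a fixed hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(h, n_samples=100_000):
    # Monte Carlo estimate of E_{P_{X,Y}}[(h(X) - Y)^2] under an assumed
    # toy distribution: X ~ N(0, 1) and Y = 2X + N(0, 0.25) noise.
    x = rng.normal(size=n_samples)
    y = 2 * x + rng.normal(scale=0.5, size=n_samples)
    return np.mean((h(x) - y) ** 2)

print(risk(lambda x: 2 * x))             # ~0.25: the Bayes risk here
print(risk(lambda x: np.zeros_like(x)))  # ~4.25: a poor hypothesis
```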

6 / 20

Data

  • Data can be query-action pairs (i.e., (X, Y) pairs), or, if we have a Gaussian \mathcal{N}(\mu, 1) whose mean we want to estimate, simply n observations from the distribution. For example:
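A toy sketch of both forms of data (our own example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data as query-action pairs (X, Y), e.g. features with binary labels.
S_pairs = list(zip(rng.normal(size=5), rng.integers(0, 2, size=5)))

# Data as plain observations from N(mu, 1) when the task is estimating mu.
S_obs = rng.normal(loc=3.0, scale=1.0, size=100)
print(S_obs.mean())  # a natural estimate of mu, close to 3.0
```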
7 / 20

Performance with Data

  • Some function of the expected risk, or generalization error

\mathbb{E}_{P_{X, Y, \mathbf{S}}}[R(\hat{h})]

  • \mathbf{S} is the data set.
  • \hat{h} is the hypothesis estimated from the data.
  • Where does \hat{h} come from?
8 / 20

Learner

  • The learner f is what produces \hat{h}.
  • The learner takes in the data set \mathbf{S} and outputs an estimated hypothesis: f(\mathbf{S}) = \hat{h}

  • Performance, then, is really some function of \mathbb{E}_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)]
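Continuing the toy regression sketch from the risk slide (again our own construction), a learner is just a map from data sets to hypotheses, for instance one-dimensional least squares:

```python
def f(S):
    # The learner: map a data set S = (x, y) to an estimated hypothesis
    # h-hat via one-dimensional least squares.
    x, y = S
    slope = np.dot(x, y) / np.dot(x, x)
    return lambda q: slope * q

x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)
h_hat = f((x, y))
print(risk(h_hat))  # one draw of R(f(S)); average over data sets for the expectation
```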

9 / 20

Task

  • Once we have specified all of the above (queries, actions, data, etc.), the task is to minimize the generalization error

\text{min} \qquad \mathbb{E}_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)]

\text{such that} \qquad f \in \mathcal{F}

  • \mathcal{F} is the space of learners (for example, we could place computational complexity constraints on the learners we allow)
10 / 20

Improving Performance

  • Recall our definition of learning, "A learner learns with respect to a task from data if the learner's performance on the task improves because of the data."
  • The final piece is improving performance because of the data. To measure this, we introduce learning efficiency, where \mathcal{E}_f^t(\mathbf{S}) denotes the generalization error of learner f on task t when trained on data \mathbf{S}:

\text{LE}_f^t(\mathbf{S}_0, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}_0)}{\mathcal{E}_f^t(\mathbf{S})}

  • \mathbf{S}_0 = \varnothing is the empty data set, meaning no data.
  • If \text{LE} > 1, then performance has improved because of the data \mathbf{S} (see the sketch below).
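In code, learning efficiency is just a ratio of generalization errors; a sketch with made-up error values:

```python
def learning_efficiency(err_baseline, err_with_data):
    # LE_f^t(S_0, S) = E_f^t(S_0) / E_f^t(S): how much the data S shrinks
    # the generalization error relative to the baseline data set S_0.
    return err_baseline / err_with_data

# Hypothetical errors: 4.25 with no data (S_0), 0.27 after training on S.
print(learning_efficiency(4.25, 0.27) > 1)  # True: f learned from S
```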
11 / 20

Generalized Task

  • A task is the object of study in classical machine learning.
  • There we are given a data set S\mathbf{S} and a task tt.
  • The break we make with this framework is that we now allow for multiple data sets, and multiple tasks.
  • A super task, then, is when we have multiple data sets \lbrace \mathbf{S}^1, \ldots, \mathbf{S}^m \rbrace and multiple tasks \lbrace t_1, \ldots, t_n \rbrace, with m \geq n.
  • Note we can have empty data sets.
  • Multitask learning, transfer learning, etc. can now be framed as generalized tasks.
12 / 20

Transfer Learning

  • In transfer learning, we have a single task and multiple data sets.
  • With transfer learning, we want to measure whether the out-of-task data has helped our performance more than using the task data alone.
  • We use learning efficiency to measure this: \text{LE}_f^t(\mathbf{S}^1, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^1)}{\mathcal{E}_f^t(\mathbf{S})}
  • \mathbf{S}^1 is the task data.
  • \mathbf{S} is all of the data (i.e., we pool all of the data sets).
  • We have transfer learned if \text{LE}_f^t > 1, as the sketch below illustrates.
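Here is a self-contained simulation in the same toy regression setup (our construction, not the paper's): a small task data set S^1 is pooled with a larger out-of-task data set drawn from a closely related distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, slope=2.0):
    x = rng.normal(size=n)
    return x, slope * x + rng.normal(scale=0.5, size=n)

def fit_and_risk(S):
    x, y = S
    slope = np.dot(x, y) / np.dot(x, x)  # least-squares learner f
    xt, yt = make_data(100_000)          # fresh draw to estimate the task risk
    return np.mean((slope * xt - yt) ** 2)

x1, y1 = make_data(5)                # small task data set S^1
x2, y2 = make_data(500, slope=1.9)   # larger, related out-of-task data set

err_task = fit_and_risk((x1, y1))
err_all = fit_and_risk((np.r_[x1, x2], np.r_[y1, y2]))
print(err_task / err_all)  # LE_f^t(S^1, S); > 1 means transfer helped
```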
13 / 20

Multitask Learning

  • In multitask learning, we have multiple tasks and multiple data sets.
  • With multitask learning, we want to measure how much we have transfer learned for each task: \text{LE}_f^t(\mathbf{S}^t, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^t)}{\mathcal{E}_f^t(\mathbf{S})}
  • \mathbf{S}^t is the task t data.
  • \mathbf{S} is all of the data.
  • We have transfer learned for that task if \text{LE}_f^t > 1.
  • We have multitask learned if some weighted average of the learning efficiencies \text{LE}_f^{t_1}, \ldots, \text{LE}_f^{t_n} is greater than 1 (one such computation is sketched below).
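One way to compute such a weighted average (both the learning efficiencies and the weights below are hypothetical):

```python
import numpy as np

# Per-task learning efficiencies LE_f^{t_1}, ..., LE_f^{t_n} (made up).
les = np.array([1.8, 1.2, 0.9])
weights = np.array([0.5, 0.3, 0.2])  # e.g. task importances, summing to 1

print(np.average(les, weights=weights) > 1)  # True here: we multitask learned
```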
14 / 20

Continual Learning

  • Continual learning is closely akin to multitask learning.
  • The primary difference is that continual learning happens sequentially in time.
  • This leads us to explicitly require computational complexity constraints on the hypothesis and learner spaces. Namely, we require o(n) space and/or o(n^2) time as upper bounds on the complexity.
  • The other aspect to continual learning is that everything is streaming: data, queries, actions, error, and tasks. Hence anything about the task the learner is faced with can change over time (without the learner necessarily knowing that a change has occurred).
15 / 20

Quantifying Continual Learning

With continual learning, we want to incorporate the time-dependent, streaming nature of the problem in our performance metrics. Given a task t, let \mathbf{S}^{< t} be the set of data points up to and including the last data point from task t.

We define forward transfer to be \text{LE}_f^t(\mathbf{S}^t, \mathbf{S}^{< t}) = \frac{\mathcal{E}_f^t(\mathbf{S}^t)}{\mathcal{E}_f^t(\mathbf{S}^{< t})}

And we define backward transfer to be \text{LE}_f^t(\mathbf{S}^{< t}, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^{< t})}{\mathcal{E}_f^t(\mathbf{S})}

Here \mathbf{S} is all of the data and \mathbf{S}^t is the task t data.
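Both quantities are again just ratios of generalization errors; a bookkeeping sketch with hypothetical values:

```python
# Hypothetical generalization errors on task t for a learner f trained on:
err_task_only = 0.40  # E_f^t(S^t): task-t data alone
err_up_to_t = 0.33    # E_f^t(S^{<t}): everything observed up through task t
err_all = 0.30        # E_f^t(S): the full stream, including later tasks

forward_transfer = err_task_only / err_up_to_t  # > 1: earlier tasks helped
backward_transfer = err_up_to_t / err_all       # > 1: later tasks helped too
print(forward_transfer > 1, backward_transfer > 1)
```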

16 / 20

Other definitions of learning for OOD scenarios

  • We used learning efficiency as the base metric for measuring performance and transfer in the various OOD scenarios.
  • There are many other ways as well to measure transfer and performance.
  • Analogously to traditional learning, we introduce and examine two notions: weak OOD learning and strong OOD learning.
17 / 20

Weak OOD

Assume we have some model of distributions \mathcal{P} and a source distribution P. Loosely, we say \mathcal{P} is weakly OOD learnable with target data of size n if, given enough source data, we can perform better than the baseline of using just the target data, with arbitrarily high probability.
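One plausible way to formalize this loose statement (the quantifier structure below is our assumption, not a quote from the paper; \mathbf{S} pools the m source samples with the n target samples \mathbf{S}^t):

```latex
% Weak OOD learnability (sketch, our formalization): with n target samples
% fixed, for any failure probability delta there is a source-sample size m_0
% beyond which training on source + target beats the target-only baseline.
\forall \delta \in (0, 1),\ \exists m_0 :\ m \geq m_0 \implies
  \Pr\left[ R_P\big(f(\mathbf{S})\big) < R_P\big(f(\mathbf{S}^t)\big) \right] \geq 1 - \delta
```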

18 / 20

Strong OOD

Assume we have some model of distributions \mathcal{P} and a source distribution P. We say \mathcal{P} is strongly OOD learnable with target data of size n if, given enough source data, we can perform arbitrarily well (i.e., arbitrarily close to the Bayes risk) with arbitrarily high probability.
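A matching sketch for the strong notion (same caveat: this formalization is our assumption), with R^* denoting the Bayes risk:

```latex
% Strong OOD learnability (sketch, our formalization): the risk of the
% learner trained on source + target data gets within any epsilon of the
% Bayes risk R^* once the source-sample size m is large enough.
\forall \varepsilon > 0,\ \forall \delta \in (0, 1),\ \exists m_0 :\ m \geq m_0 \implies
  \Pr\left[ R_P\big(f(\mathbf{S})\big) - R^* \leq \varepsilon \right] \geq 1 - \delta
```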

19 / 20

Theoretical Results

  • Strong OOD learning implies weak OOD learning
  • Weak OOD meta-learning does not imply strong OOD meta-learning (meaning we cannot boost)
  • Weak OOD learning implies positive transfer (meaning learning efficiency is greater than 1), but not vice versa
20 / 20
