### Towards a theory of out-of-distribution learning

Ali | Ronak | Jovo | CEP | Hayden | Jayanta

![:scale 40%](images/neurodata_blue.png)

---

### Background and Motivation

- Why do we need a theory paper?
- Classical machine learning is: you are given data, and the goal is to train a system for the task at hand.
- The task could be something like image classification (cat or dog) or clustering, and you train a neural network to do the classification for you.
- A lot of theory has been done on this sort of learning (see VC theory and Valiant's PAC framework).

---

### Background and Motivation

- There are many other machine learning paradigms that haven't been theoretically formalized well yet.
- In the past it didn't matter, because we couldn't even do classical machine learning that well.
- GPUs changed all that, and now we want to do things like transfer learning, multitask learning, lifelong learning, etc.

---

### Learning

- This can get pretty philosophical and complicated, so let's not go there. We use a practical definition instead.
- A learner learns with respect to a task from data if the learner's performance on the task improves because of the data.
- In plain English: if we are doing image classification (the task), and our network classifies better (performance) after training it on the data, then the network has learned.

---

### Performance

- Performance is just the risk. In a familiar scenario like linear regression, it is simply the expected squared loss,
$$ \mathbb{E}\_{P_{X, Y}}[\left(h(X) - Y\right)^2] $$
- $ X $ is the query: the pattern/feature, the image to be classified, the thing to which we assign a label.
- $ Y $ is the action: the label, cat or dog, the actual cluster assigned to the pattern.
- $ h $ is the hypothesis: a function which, when given a query, outputs its guess for the corresponding correct action.
- $ P_{X, Y} $ is a distribution over queries and actions.

---

### Performance

- Performance, in its general form, is then
$$ R\_{P_{X, Y}}(h) $$

---

### Data

- The data can be query-action pairs (i.e., $ (X, Y) $ pairs), or, if we have a Gaussian $\mathcal{N}(\mu, 1)$ whose mean we want to estimate, the data can just be $n$ observations from the distribution.

---

### Performance with Data

- Some function of the expected risk, or generalization error,
$$ \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R(\hat{h})] $$
- $ \mathbf{S} $ is the data set.
- $ \hat{h} $ is the hypothesis estimated using the data.
- From where do we get $ \hat{h} $?

---

### Learner

- The learner $ f $ is what produces $ \hat{h} $.
- The learner takes in the data set $ \mathbf{S} $ and outputs an estimated hypothesis,
$$ f(\mathbf{S}) = \hat{h} $$
- The performance then is really some function of
$$ \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)] $$

---

### Task

- After we have specified all the previous stuff (queries, actions, data, etc.), the task is to minimize the generalization error
$$ \text{min } \qquad \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)] $$
$$ \text{such that } \qquad f \in \mathcal{F} $$
- $ \mathcal{F} $ is the space of learners (we could, for example, have computational complexity constraints on the learners used).
- A small simulated sketch of this objective follows on the next slide.
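---

### Generalization Error: A Toy Sketch

As a concrete, purely illustrative sketch of the objective above, the Python below estimates $ \mathbb{E}\_{P_{X, Y, \mathbf{S}}}[R\left(f(\mathbf{S})\right)] $ by Monte Carlo for one made-up choice of $ P_{X, Y} $ and one simple learner (least-squares regression); both choices are assumptions of this sketch, not anything specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def learner(S_x, S_y):
    """A simple learner f: least-squares fit of a line, returning the hypothesis h-hat."""
    slope, intercept = np.polyfit(S_x, S_y, deg=1)
    return lambda x: slope * x + intercept

def generalization_error(n_train, n_reps=200, n_test=10_000):
    """Monte Carlo estimate of E_{P_{X,Y,S}}[ R(f(S)) ] under a toy P_{X,Y}."""
    risks = []
    for _ in range(n_reps):
        # Toy distribution (an assumption of this sketch): X ~ N(0, 1), Y = 2X + noise.
        S_x = rng.normal(size=n_train)
        S_y = 2.0 * S_x + rng.normal(scale=0.5, size=n_train)
        h_hat = learner(S_x, S_y)                    # h-hat = f(S)
        X = rng.normal(size=n_test)
        Y = 2.0 * X + rng.normal(scale=0.5, size=n_test)
        risks.append(np.mean((h_hat(X) - Y) ** 2))   # R(h-hat), estimated on fresh draws
    return float(np.mean(risks))

print(generalization_error(n_train=20))  # roughly the noise variance (0.25) plus estimation error
```

The learning-efficiency quantities on the following slides are ratios of generalization errors of exactly this kind.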
---

### Improving Performance

- Recall our definition of learning: "A learner learns with respect to a task from data if the learner's performance on the task improves because of the data."
- The final piece is improving performance because of the data.
- We introduce learning efficiency to measure this,
$$ \text{LE}_f^t(\mathbf{S}_0, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}_0)}{\mathcal{E}_f^t(\mathbf{S})} $$
- $ \mathcal{E}_f^t(\mathbf{S}) $ is the generalization error of learner $ f $ on task $ t $ when given the data $ \mathbf{S} $.
- $ \mathbf{S}_0 = \varnothing $ is the empty data set, meaning no data.
- If $ \text{LE} > 1 $, then performance has improved because of the data $ \mathbf{S} $.

---

### Generalized Task

- A task is the object of study in classical machine learning.
- There we are given a data set $\mathbf{S}$ and a task $t$.
- The break we make with this framework is that we now allow for multiple data sets and multiple tasks.
- A super task is when we have multiple data sets $ \lbrace \mathbf{S}^1, ... , \mathbf{S}^m \rbrace $ and multiple tasks $ \lbrace t_1, ..., t_n \rbrace $, $m \geq n$.
- Note that we can have empty data sets.
- Multitask learning, transfer learning, etc. can now be framed as generalized tasks.

---

### Transfer Learning

- In transfer learning, we have a single task and multiple data sets.
- We want to measure whether the out-of-task data has helped our performance more than using the task data alone.
- We use learning efficiency to measure this,
$$ \text{LE}_f^t(\mathbf{S}^1, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^1)}{\mathcal{E}_f^t(\mathbf{S})} $$
- $ \mathbf{S}^1 $ is the task data.
- $ \mathbf{S} $ is all of the data (i.e., all of the data sets put together).
- We have transfer learned if $ \text{LE}_f^t > 1 $.

---

### Multitask Learning

- In multitask learning, we have multiple tasks and multiple data sets.
- We want to measure how much we have transfer learned for each task,
$$ \text{LE}_f^t(\mathbf{S}^t, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^t)}{\mathcal{E}_f^t(\mathbf{S})} $$
- $ \mathbf{S}^t $ is the task $t$ data.
- $ \mathbf{S} $ is all of the data.
- We have transfer learned for that task if $ \text{LE}_f^t > 1 $.
- We have multitask learned if some weighted average of the learning efficiencies $\text{LE}_f^{t_1}, ..., \text{LE}_f^{t_n}$ is greater than $1$.

---

### Continual Learning

- Continual learning is closely akin to multitask learning.
- The primary difference is that the learning happens sequentially in time.
- This leads us to explicitly require computational complexity constraints on the hypothesis and learner spaces. Namely, we require $ o(n) $ space and/or $ o(n^2) $ time as upper bounds on the complexity.
- The other aspect of continual learning is that everything is streaming: data, queries, actions, error, and tasks. Hence anything about the task the learner is faced with can change over time (without the learner necessarily knowing that a change has occurred).

---

### Quantifying Continual Learning

With continual learning, we want to incorporate the time-dependent, streaming nature of the problem into our performance metrics. Given a task $ t $, let $ \mathbf{S}^{< t} $ be the set of data points up to and including the last data point from task $ t $.

We define forward transfer to be
$$ \text{LE}_f^t(\mathbf{S}^t, \mathbf{S}^{< t}) = \frac{\mathcal{E}_f^t(\mathbf{S}^t)}{\mathcal{E}_f^t(\mathbf{S}^{< t})} $$

and we define backward transfer to be
$$ \text{LE}_f^t(\mathbf{S}^{< t}, \mathbf{S}) = \frac{\mathcal{E}_f^t(\mathbf{S}^{< t})}{\mathcal{E}_f^t(\mathbf{S})} $$

where $ \mathbf{S} $ is all of the data and $ \mathbf{S}^t $ is the task $ t $ data. A small numerical example of these ratios is given on the next slide.
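---

### Forward and Backward Transfer: A Worked Example

A minimal sketch of how these ratios compose, using made-up error values (placeholders, not results from the paper); in practice each $ \mathcal{E}_f^t(\cdot) $ would be estimated on held-out task-$t$ queries.

```python
def learning_efficiency(err_reference, err_augmented):
    """Ratio of generalization errors: > 1 means the extra data helped on task t."""
    return err_reference / err_augmented

# Hypothetical estimated generalization errors for a fixed task t (placeholders):
err_task_only = 0.30   # E_f^t(S^t)    : learner trained on task-t data only
err_up_to_t   = 0.24   # E_f^t(S^{<t}) : trained on all data seen up to and including task t
err_all       = 0.20   # E_f^t(S)      : trained on all data, including later tasks

forward_transfer  = learning_efficiency(err_task_only, err_up_to_t)  # 0.30 / 0.24 = 1.25
backward_transfer = learning_efficiency(err_up_to_t, err_all)        # 0.24 / 0.20 = 1.20
overall_transfer  = learning_efficiency(err_task_only, err_all)      # 0.30 / 0.20 = 1.50

print(forward_transfer, backward_transfer, overall_transfer)
```

Note that overall transfer factors into forward transfer times backward transfer, since the intermediate error cancels when the two ratios are multiplied.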
---

### Other definitions of learning for OOD scenarios

- We used learning efficiency as the base metric with which to measure performance and transfer in the various OOD scenarios.
- There are many other ways to measure transfer and performance as well.
- Analogous to traditional learning, we will introduce and examine here the two notions of weak OOD learning and strong OOD learning.

---

### Weak OOD

Assume we have some model of distributions $ \mathcal{P} $ and source distribution $ P $. Loosely, we say $ \mathcal{P} $ is weakly OOD learnable with target data of size $ n $ if, given enough source data, we can perform better than the baseline performance (i.e., with just the target data) with arbitrarily high probability.

---

### Strong OOD

Assume we have some model of distributions $ \mathcal{P} $ and source distribution $ P $. We say $ \mathcal{P} $ is strongly OOD learnable with target data of size $ n $ if, given enough source data, we can perform arbitrarily well (i.e., arbitrarily close to the Bayes risk) with arbitrarily high probability.

---

### Theoretical Results

- Strong OOD learning implies weak OOD learning.
- Weak OOD meta-learning does not imply strong OOD meta-learning (meaning we cannot boost).
- Weak OOD learning implies positive transfer (meaning learning efficiency is greater than $ 1 $), but not vice versa.
- A toy simulation of positive transfer follows on the next slide.
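---

### Positive Transfer: A Toy Simulation

To connect positive transfer back to the Gaussian mean-estimation example from earlier, here is a small Monte Carlo sketch. It is our own illustration, not a construction from the paper: the target task is estimating the mean of $ \mathcal{N}(0, 1) $ from five samples, and the source data comes from a nearby distribution $ \mathcal{N}(0.1, 1) $, a choice made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_of_mean_estimate(mu_target, mu_source, n_target, n_source, trials=10_000):
    """Monte Carlo estimate of E[(mu_hat - mu_target)^2] for the pooled sample mean."""
    errs = np.empty(trials)
    for i in range(trials):
        target = rng.normal(mu_target, 1.0, n_target)
        source = rng.normal(mu_source, 1.0, n_source)   # n_source = 0 gives the target-only baseline
        mu_hat = np.concatenate([target, source]).mean()
        errs[i] = (mu_hat - mu_target) ** 2
    return errs.mean()

# Target-only baseline vs. pooling 50 source samples from a nearby distribution.
err_target_only = mse_of_mean_estimate(0.0, 0.1, n_target=5, n_source=0)
err_pooled      = mse_of_mean_estimate(0.0, 0.1, n_target=5, n_source=50)

learning_efficiency = err_target_only / err_pooled
print(f"LE = {learning_efficiency:.2f}")  # > 1 here: the variance reduction from the
                                          # source data outweighs its small bias
```

With the source mean moved far from the target mean, the bias dominates and the same computation gives $ \text{LE} < 1 $; whether enough source data can always help (weak OOD) or drive the error toward the Bayes risk (strong OOD) depends on the model of distributions $ \mathcal{P} $.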