class: center, middle
name: opening

### Supervised Manifold Learning Outperforms PCA for Subsequent Inference

Joshua T. Vogelstein*, Minh Tang, Da Zheng, Randal Burns, Mauro Maggioni
.center[
{[bme](http://www.bme.jhu.edu/), [icm](http://icm.jhu.edu/), [cis](http://cis.jhu.edu/), [idies](http://idies.jhu.edu/), [kavli](http://kavlijhu.org/), [cs](http://engineering.jhu.edu/computer-science/), [ams](http://engineering.jhu.edu/ams/), [neuro](http://neuroscience.jhu.edu/)} | [jhu](https://www.jhu.edu/)
questions: [jovo@jhu.edu](mailto:jovo@jhu.edu)
slides:
Co-Founder: [NeuroData](http://neurodata.io) & [gigantum](http://gigantum.io)
]

---
class: center, middle

# Interrupt!

---

### What is supervised learning?

Given (X_i, Y_i) pairs, with neither F_X nor F_Y degenerate, *supervised learning* is the estimation of any given functional of F_{X|Y}

--

#### Examples

- Classification
- Regression
- 2-sample testing
- K-sample testing

---

### Classification and Fisher's LDA

- Given F_{X|Y} = N(μ_y, Σ) and F_Y = B(π),
- where X ∈ R^p
- Bayes optimal classifier is x' Σ^{-1} δ > t, where δ = μ_0 - μ_1
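A minimal plug-in sketch of this rule in numpy (variable names and the threshold convention are ours, for illustration only):

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Plug-in Fisher LDA for the model F_{X|Y=y} = N(mu_y, Sigma), F_Y = B(pi).

    Returns (w, t) for the rule: predict class 0 when x' w > t,
    where w = Sigma^{-1} delta and delta = mu_0 - mu_1.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled (within-class) covariance estimate
    Z = np.vstack([X0 - mu0, X1 - mu1])
    Sigma = Z.T @ Z / (len(X) - 2)
    delta = mu0 - mu1
    w = np.linalg.solve(Sigma, delta)      # Sigma^{-1} delta; breaks when Sigma is singular
    pi0 = len(X0) / len(X)
    t = 0.5 * (mu0 + mu1) @ w - np.log(pi0 / (1 - pi0))
    return w, t

def fisher_lda_predict(X, w, t):
    return np.where(X @ w > t, 0, 1)
```

With equal priors (π = 1/2), the threshold reduces to t = 0.5 (μ_0 + μ_1)' Σ^{-1} δ.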
--

#### Properties

- simple
- multiclass generalizations
- plug-in estimate converges to Bayes optimal
- algorithmic efficiency

---

### But...

- When n < p, our estimate of Σ is singular (see the check below)
- Cannot use Fisher's LDA
- What to do?
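A quick check of the problem (the sizes below are arbitrary): the sample covariance of n < p points has rank at most n - 1, so it cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000                          # wide data: n < p
X = rng.normal(size=(n, p))
Sigma_hat = np.cov(X, rowvar=False)      # p x p sample covariance
print(np.linalg.matrix_rank(Sigma_hat))  # at most n - 1 = 49, far below p = 1000
```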
--

- Manifold learning
- Sparse modeling (is secretly also manifold learning)
---

### Manifold Learning for Subsequent Inference
---

### Limitations of existing approaches

- Manifold learning
  - is typically unsupervised
  - out-of-sample embedding is icky
  - does not scale to terabytes (often requires n^3 operations)
  - who says directions of variance are near directions of discrimination?
--

- Sparse modeling (is supervised, but...)
  - NP-hard (feature screening), or
  - approximations do not scale to terabytes (Lasso), or
  - non-convex (dictionary learning),
  - with icky hyperparameters (elastic net & dictionary learning)

---

## Linear Projections

- PCA: eig({x_i - μ})
- PCA': eig({x_i - μ_j})
- LOL: [δ, eig({x_i - μ_j})] "Linear Optimal Low-Rank" (sketched below)
--

### Notes

- Fisher's LDA uses δ and {x_i - μ_j}
- PCA' removes δ
- PCA kind of accidentally includes δ (it decomposes Σ + π(1-π) δ δ'), but weights it suboptimally
- LOL uses both terms explicitly, weighting δ more
- For each, we compose with LDA on the low-d estimates

---

## LOL Gaussian Intuition
---

## LOL > PCA Theory
---

## Unpacking the theory: Chernoff Information

- C(F,G) = sup_t [ -log ∫ f^t(x) g^{1-t}(x) dx ], for 0 < t < 1
- the *exponential rate* at which the Bayes error decreases
- it is the tightest possible bound on performance
- if F = N(μ_0, Σ_0) and G = N(μ_1, Σ_1),
- C(F,G) = 0.5 sup_t [ t(1-t) δ' Σ_t^{-1} δ + log |Σ_t| / (|Σ_0|^t |Σ_1|^{1-t}) ]

--

### Chernoff on Projected Data

- let F & G be Gaussian with the same covariance (LDA model); then C(F,G) = 1/8 δ' Σ^{-1} δ
- let A be any linear transformation
- C(F_A, G_A) = 1/8 || P_Z Σ^{-1/2} δ ||_F^2
- P_Z = Z (Z' Z)^{-1} Z', and Z = Σ^{1/2} A'
- let C_A := C(F_A, G_A)

---

## "Thm A: LOL > PCA'"

- let F & G be Gaussian with the same covariance
- C_LOL ≥ C_PCA'
- the inequality is strict whenever δ' (I - U_d U_d') δ > 0, where U_d holds the top d eigenvectors of Σ

---

## "Thm B: LOL > PCA"

- C_PCA = 4K / (4 - K), where
- K = δ' Σ_t^d δ
- so when δ is in the space spanned by the smaller eigenvectors, PCA discards nearly all the info
--

### Simple Example

- if Σ is diagonal with decreasing λ's,
- δ = (0, ..., 0, s),
- λ_p + s^2/4 < λ_d
- C_PCA = 0
- C_LOL = s^2 / λ_p
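A quick numerical check of this example with the projected-Chernoff formula C_A = 1/8 || P_Z Σ^{-1/2} δ ||_F^2 from above (the particular p, d, s, and spectrum below are ours):

```python
import numpy as np

def chernoff_projected(A, Sigma, delta):
    """C_A = 1/8 * || P_Z Sigma^{-1/2} delta ||^2, with Z = Sigma^{1/2} A'."""
    lam, U = np.linalg.eigh(Sigma)
    S_half = U @ np.diag(np.sqrt(lam)) @ U.T
    S_neg_half = U @ np.diag(lam ** -0.5) @ U.T
    Z = S_half @ A.T                         # p x d
    P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)  # projection onto col(Z)
    return 0.125 * np.linalg.norm(P_Z @ S_neg_half @ delta) ** 2

p, d, s = 10, 3, 1.0
lam = np.linspace(2.0, 1.0, p)               # decreasing eigenvalues, lambda_p = 1
# lam[-1] + s**2/4 = 1.25 < lam[d-1] ~ 1.78, so the condition above holds
Sigma = np.diag(lam)
delta = np.zeros(p)
delta[-1] = s                                # delta = (0, ..., 0, s)

A_pca = np.eye(p)[:d]                        # rows e_1..e_d: top-d eigenvectors of Sigma
A_lol = np.vstack([delta / s, np.eye(p)[:d - 1]])   # rows: delta direction, then e_1..e_{d-1}

print(chernoff_projected(A_pca, Sigma, delta))   # ~ 0: PCA discards delta entirely
print(chernoff_projected(A_lol, Sigma, delta))   # 0.125 = s**2 / (8 * lam[-1]) > 0
```

Under that formula the LOL value comes out as s^2 / (8 λ_p), i.e., the s^2 / λ_p rate above up to the constant.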
--- ## "Thm C: [δ, U
d
] > U
d
" - let U
d
be any matrix in R
p x d
with U
d
U
d
' = I - arthimetic is messier - nearly the same result as PCA - basically, when δ and U
d
are nearly orthogonal, adding delta helps --- ### "Thm D: LOL > PCA as p increases" - let γ = λ
d
- λ
d+1
- let δ be sparse with probabilty ε an element is 0, - o.w. it is Gaussian - let p(1-ε) → θ - then with probability at least ε
d
, C
PCA
= 0 < C
LOL
- and this probability can be made arbitrarily close to 1 --- ## "Thm E: LOL > PCA as n & p increases" - when n/p → 0, all results trivially hold
- Suppose Σ is low rank + σ^2 I
- Suppose that: M log p ≤ log n ≤ M' log λ,
- provided M & M' are large enough
- estimates of C converge, so
- E[C_LOL] > E[C_PCA]

---

## LOL > PCA Simulations
---

## LOL is fast
- utilize FlashX semi-external memory (SEM) computing
- optimal (linear) scale up and out
- SEM speed ≈ internal memory speed
- replace SVD with random projections (RP) for a 10x speed improvement
- RP error rate ≈ SVD error rate
- ~3 minutes for a 2 terabyte dataset

---

## LOL > PCA Data
---

### LOL Hypothesis Testing & Regression
---

## Discussion

- simple supervised inference for wide data
- big data tools (e.g., Spark, H2O, VW) typically focus on large n
- algorithmic and theoretical generalizations are straightforward
- open source implementations: http://neurodata.io/tools/LOL/
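For concreteness, a minimal end-to-end sketch of the project-then-LDA pipeline from the earlier slides, written in plain numpy + scikit-learn on simulated data; this is our illustration, not the packaged implementation linked above:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n, p, d = 100, 500, 10                         # wide data: n < p (sizes are arbitrary)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[y == 1, -1] += 3.0                           # classes differ only in the last coordinate

# LOL basis: [delta, top class-centered singular vectors], orthonormalized
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
delta = mu0 - mu1
Xcc = X - np.where(y[:, None] == 0, mu0, mu1)  # class-conditional centering
_, _, Vt = np.linalg.svd(Xcc, full_matrices=False)
A_lol, _ = np.linalg.qr(np.column_stack([delta, Vt[:d - 1].T]))   # p x d

# compose with LDA on the d-dimensional projected data
clf = LinearDiscriminantAnalysis().fit(X @ A_lol, y)
print("training accuracy:", clf.score(X @ A_lol, y))
```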
---
class: center

# Questions?

## Hiring Postdocs & Software Engineers Now!

e: [jovo@jhu.edu](mailto:jovo@jhu.edu) | w: [neurodata.io](http://neurodata.io), [gigantum.io](http://gigantum.io)