class: center, middle
name: opening

### Supervised Manifold Learning Outperforms PCA for Subsequent Inference

Joshua T. Vogelstein*, Minh Tang, Da Zheng, Randal Burns, Mauro Maggioni
.center[
{[bme](http://www.bme.jhu.edu/), [icm](http://icm.jhu.edu/), [cis](http://cis.jhu.edu/), [idies](http://idies.jhu.edu/), [kavli](http://kavlijhu.org/), [cs](http://engineering.jhu.edu/computer-science/), [ams](http://engineering.jhu.edu/ams/), [neuro](http://neuroscience.jhu.edu/)} | [jhu](https://www.jhu.edu/)
questions: [jovo@jhu.edu](mailto:jovo@jhu.edu)
slides:
Co-Founder: [NeuroData](http://neurodata.io) & [gigantum](http://gigantum.io)
]

---
class: center, middle

# Interrupt!

---

### What is supervised learning?

Given (X_i, Y_i) pairs, with neither F_X nor F_Y degenerate, *supervised learning* is the estimation of any given functional of F_{X|Y}

--

#### Examples

- Classification
- Regression
- 2-sample testing
- K-sample testing

---

### Classification and Fisher's LDA

- Given F_{X|Y} = N(μ_y, Σ) and F_Y = B(π),
- where X ∈ R^p
- Bayes optimal classifier is x' Σ^{-1} δ > t, where δ = μ_0 - μ_1
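A minimal plug-in sketch of this rule in numpy (variable names and the threshold convention are ours, for illustration only):

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Plug-in Fisher LDA for the model F_{X|Y=y} = N(mu_y, Sigma), F_Y = B(pi).

    Returns (w, t) for the rule: predict class 0 when x' w > t,
    where w = Sigma^{-1} delta and delta = mu_0 - mu_1.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled (within-class) covariance estimate
    Z = np.vstack([X0 - mu0, X1 - mu1])
    Sigma = Z.T @ Z / (len(X) - 2)
    delta = mu0 - mu1
    w = np.linalg.solve(Sigma, delta)      # Sigma^{-1} delta; breaks when Sigma is singular
    pi0 = len(X0) / len(X)
    t = 0.5 * (mu0 + mu1) @ w - np.log(pi0 / (1 - pi0))
    return w, t

def fisher_lda_predict(X, w, t):
    return np.where(X @ w > t, 0, 1)
```

With equal priors (π = 1/2), the threshold reduces to t = 0.5 (μ_0 + μ_1)' Σ^{-1} δ.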
--

#### Properties

- simple
- multiclass generalizations
- plug-in estimate converges to Bayes optimal
- algorithmic efficiency

---

### But...

- When n < p, our estimate of Σ is singular (see the check below)
- Cannot use Fisher's LDA
- What to do?
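A quick check of the problem (the sizes below are arbitrary): the sample covariance of n < p points has rank at most n - 1, so it cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000                          # wide data: n < p
X = rng.normal(size=(n, p))
Sigma_hat = np.cov(X, rowvar=False)      # p x p sample covariance
print(np.linalg.matrix_rank(Sigma_hat))  # at most n - 1 = 49, far below p = 1000
```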
--

- Manifold learning
- Sparse modeling (is secretly also manifold learning)
---

### Manifold Learning for Subsequent Inference
---

### Limitations of existing approaches

- Manifold learning
  - is typically unsupervised
  - out-of-sample embedding is icky
  - does not scale to terabytes (often requires n^3 operations)
  - who says directions of variance are near directions of discrimination?
--

- Sparse modeling (is supervised, but...)
  - NP-hard (feature screening), or
  - approximations do not scale to terabytes (Lasso), or
  - non-convex (dictionary learning),
  - with icky hyperparameters (elastic net & dictionary learning)

---

## Linear Projections

- PCA: eig({x_i - μ})
- PCA': eig({x_i - μ_j})
- LOL: [δ, eig({x_i - μ_j})] "Linear Optimal Low-Rank" (sketched below)
--

### Notes

- Fisher's LDA uses δ and {x_i - μ_j}
- PCA' removes δ
- PCA kind of accidentally includes δ (it decomposes Σ + π(1-π) δ δ'), but weights it suboptimally
- LOL uses both terms explicitly, weighting δ more
- For each, we compose with LDA on the low-d estimates

---

## LOL Gaussian Intuition
---

## LOL > PCA Theory
---

## Unpacking the theory: Chernoff Information

- C(F,G) = sup_t [ -log ∫ f^t(x) g^{1-t}(x) dx ], for 0 < t < 1
- the *exponential rate* at which the Bayes error decreases
- it is the tightest possible bound on performance
- if F = N(μ_0, Σ_0) and G = N(μ_1, Σ_1),
- C(F,G) = 0.5 sup_t [ t(1-t) δ' Σ_t^{-1} δ + log |Σ_t| / (|Σ_0|^t |Σ_1|^{1-t}) ]

--

### Chernoff on Projected Data

- let F & G be Gaussian with the same covariance (LDA model); then C(F,G) = 1/8 δ' Σ^{-1} δ
- let A be any linear transformation
- C(F_A, G_A) = 1/8 || P_Z Σ^{-1/2} δ ||_F^2
- P_Z = Z (Z' Z)^{-1} Z', and Z = Σ^{1/2} A'
- let C_A := C(F_A, G_A)

---

## "Thm A: LOL > PCA'"

- let F & G be Gaussian with the same covariance
- C_LOL ≥ C_PCA'
- the inequality is strict whenever δ' (I - U_d U_d') δ > 0, where U_d holds the top d eigenvectors of Σ

---

## "Thm B: LOL > PCA"

- C_PCA = 4K / (4 - K), where
- K = δ' Σ_t^d δ
- so when δ is in the space spanned by the smaller eigenvectors, PCA discards nearly all the info
--

### Simple Example

- if Σ is diagonal with decreasing λ's,
- δ = (0, ..., 0, s),
- λ_p + s^2/4 < λ_d
- C_PCA = 0
- C_LOL = s^2 / λ_p
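A quick numerical check of this example with the projected-Chernoff formula C_A = 1/8 || P_Z Σ^{-1/2} δ ||_F^2 from above (the particular p, d, s, and spectrum below are ours):

```python
import numpy as np

def chernoff_projected(A, Sigma, delta):
    """C_A = 1/8 * || P_Z Sigma^{-1/2} delta ||^2, with Z = Sigma^{1/2} A'."""
    lam, U = np.linalg.eigh(Sigma)
    S_half = U @ np.diag(np.sqrt(lam)) @ U.T
    S_neg_half = U @ np.diag(lam ** -0.5) @ U.T
    Z = S_half @ A.T                         # p x d
    P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)  # projection onto col(Z)
    return 0.125 * np.linalg.norm(P_Z @ S_neg_half @ delta) ** 2

p, d, s = 10, 3, 1.0
lam = np.linspace(2.0, 1.0, p)               # decreasing eigenvalues, lambda_p = 1
# lam[-1] + s**2/4 = 1.25 < lam[d-1] ~ 1.78, so the condition above holds
Sigma = np.diag(lam)
delta = np.zeros(p)
delta[-1] = s                                # delta = (0, ..., 0, s)

A_pca = np.eye(p)[:d]                        # rows e_1..e_d: top-d eigenvectors of Sigma
A_lol = np.vstack([delta / s, np.eye(p)[:d - 1]])   # rows: delta direction, then e_1..e_{d-1}

print(chernoff_projected(A_pca, Sigma, delta))   # ~ 0: PCA discards delta entirely
print(chernoff_projected(A_lol, Sigma, delta))   # 0.125 = s**2 / (8 * lam[-1]) > 0
```

Under that formula the LOL value comes out as s^2 / (8 λ_p), i.e., the s^2 / λ_p rate above up to the constant.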
--- ## "Thm C: [δ, U
d
] > U
d
" - let U
d
be any matrix in R
p x d
with U
d
U
d
' = I - arthimetic is messier - nearly the same result as PCA - basically, when δ and U
d
are nearly orthogonal, adding delta helps --- ### "Thm D: LOL > PCA as p increases" - let γ = λ
d
- λ
d+1
- let δ be sparse with probabilty ε an element is 0, - o.w. it is Gaussian - let p(1-ε) → θ - then with probability at least ε
d
, C
PCA
= 0 < C
LOL
- and this probability can be made arbitrarily close to 1 --- ## "Thm E: LOL > PCA as n & p increases" - when n/p → 0, all results trivially hold
- Suppose Σ is low rank + σ^2 I
- Suppose that: M log p ≤ log n ≤ M' log λ,
- provided M & M' are large enough
- estimates of C converge, so
- E[C_LOL] > E[C_PCA]

---

## LOL > PCA Simulations
---

## LOL is fast
- utilize FlashX semi-external memory (SEM) computing
- optimal (linear) scale up and out
- SEM speed ≈ internal memory speed
- replace SVD with random projections (RP) for a 10x speed improvement
- RP error rate ≈ SVD error rate
- ~3 minutes for a 2 terabyte dataset

---

## LOL > PCA Data
---

### LOL Hypothesis Testing & Regression
---

## Discussion

- simple supervised inference for wide data
- big data tools (e.g., Spark, H2O, VW) typically focus on large n
- algorithmic and theoretical generalizations are straightforward
- open source implementations: http://neurodata.io/tools/LOL/
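For concreteness, a minimal end-to-end sketch of the project-then-LDA pipeline from the earlier slides, written in plain numpy + scikit-learn on simulated data; this is our illustration, not the packaged implementation linked above:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n, p, d = 100, 500, 10                         # wide data: n < p (sizes are arbitrary)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[y == 1, -1] += 3.0                           # classes differ only in the last coordinate

# LOL basis: [delta, top class-centered singular vectors], orthonormalized
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
delta = mu0 - mu1
Xcc = X - np.where(y[:, None] == 0, mu0, mu1)  # class-conditional centering
_, _, Vt = np.linalg.svd(Xcc, full_matrices=False)
A_lol, _ = np.linalg.qr(np.column_stack([delta, Vt[:d - 1].T]))   # p x d

# compose with LDA on the d-dimensional projected data
clf = LinearDiscriminantAnalysis().fit(X @ A_lol, y)
print("training accuracy:", clf.score(X @ A_lol, y))
```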
---
class: center

# Questions?

## Hiring Postdocs & Software Engineers Now!

e: [jovo@jhu.edu](mailto:jovo@jhu.edu) | w: [neurodata.io](http://neurodata.io), [gigantum.io](http://gigantum.io)