
Supervised Manifold Learning Outperforms PCA for Subsequent Inference

Joshua T. Vogelstein*, Minh Tang, Da Zheng, Randal Burns, Mauro Maggioni


1 / 29

Interrupt!

2 / 29

What is supervised learning?

Given (X_i, Y_i) pairs with neither F_X nor F_Y degenerate, supervised learning is the estimation of any given functional of F_{X|Y}

3 / 29

What is supervised learning?

Given (X_i, Y_i) pairs with neither F_X nor F_Y degenerate, supervised learning is the estimation of any given functional of F_{X|Y}

Examples

  • Classification
  • Regression
  • 2-sample testing
  • K-sample testing
4 / 29

Classification and Fisher's LDA

  • Given F_{X|Y=y} = N(μ_y, Σ) and F_Y = B(π),
  • where X ∈ R^p,
  • the Bayes optimal classifier is x′ Σ^{-1} δ > t, where δ = μ_0 − μ_1
5 / 29

Classification and Fisher's LDA

  • Given F_{X|Y=y} = N(μ_y, Σ) and F_Y = B(π),
  • where X ∈ R^p,
  • the Bayes optimal classifier is x′ Σ^{-1} δ > t, where δ = μ_0 − μ_1

Properties

  • simple
  • multiclass generalizations
  • plug-in estimate converges to Bayes optimal
  • algorithmic efficiency
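
A minimal NumPy sketch of this plug-in rule, assuming binary labels y ∈ {0, 1} (an illustration, not the authors' released code):

    import numpy as np

    def fit_lda(X, y):
        # class means and pooled covariance from the (X_i, Y_i) pairs
        X0, X1 = X[y == 0], X[y == 1]
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        delta = mu0 - mu1                          # delta = mu_0 - mu_1
        Sigma = np.cov(np.vstack([X0 - mu0, X1 - mu1]), rowvar=False)
        w = np.linalg.solve(Sigma, delta)          # Sigma^{-1} delta (needs n > p)
        t = w @ (mu0 + mu1) / 2                    # midpoint threshold (pi = 1/2)
        return w, t

    def predict(X, w, t):
        # x' Sigma^{-1} delta > t  ->  class 0, else class 1
        return np.where(X @ w > t, 0, 1)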
6 / 29

But...

  • When n < p, our estimate of Σ is singular
  • Cannot use Fisher's LDA
  • What to do?


7 / 29

But...

  • When n < p, our estimate of Σ is singular
  • Cannot use Fisher's LDA
  • What to do?


  • Manifold learning
  • Sparse modeling (is secretly also manifold learning)
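
The n < p failure mode above is easy to see numerically; a quick check with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 200                               # wide data: n < p
    X = rng.standard_normal((n, p))
    Sigma_hat = np.cov(X, rowvar=False)          # p x p sample covariance
    print(np.linalg.matrix_rank(Sigma_hat))      # at most n - 1 = 49, far below p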


8 / 29

Manifold Learning for Subsequent Inference

9 / 29

Limitations of existing approaches

  • Manifold learning
    • is typically unsupervised
    • out-of-sample embedding is icky
    • does not scale to terabytes (often requires n^3 operations)
    • who says directions of variance are near directions of discrimination?


10 / 29

Limitations of existing approaches

  • Manifold learning
    • is typically unsupervised
    • out-of-sample embedding is icky
    • does not scale to terabytes (often requires n^3 operations)
    • who says directions of variance are near directions of discrimination?


  • Sparse modeling (is supervised, but...)
    • NP-hard (feature screening), or
    • approximations do not scale to terabytes (Lasso), or
    • non-convex (Dictionary learning),
    • with icky hyperparameters (elastic net & dictionary learning)
11 / 29

Linear Projections

  • PCA: eig({x_i − μ})
  • PCA′: eig({x_i − μ_{y_i}})
  • LOL: [δ, eig({x_i − μ_{y_i}})]

"Linear Optimal Low-Rank"


12 / 29

Linear Projections

  • PCA: eig({x_i − μ})
  • PCA′: eig({x_i − μ_{y_i}})
  • LOL: [δ, eig({x_i − μ_{y_i}})]

"Linear Optimal Low-Rank"


Notes

  • Fisher's LDA uses δ and {x_i − μ_{y_i}}
  • PCA′ removes δ
  • PCA accidentally includes δ, since the total covariance is Σ + π(1−π) δδ′, but weights it suboptimally
  • LOL uses both terms explicitly, weighting δ more
  • For each, we compose with LDA on the low-dimensional estimates
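
A sketch of the LOL projection as listed above, assuming binary labels (the released implementations are linked at the end of the deck):

    import numpy as np

    def lol_project(X, y, d):
        # delta = difference of class means
        mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        delta = mu0 - mu1
        # class-conditionally centered data {x_i - mu_{y_i}}
        Xc = np.where((y == 0)[:, None], X - mu0, X - mu1)
        # top d-1 right singular vectors = leading eigenvectors of the
        # class-centered covariance (without forming the p x p matrix)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        A = np.column_stack([delta, Vt[: d - 1].T])    # [delta, eig({x_i - mu_{y_i}})]
        Q, _ = np.linalg.qr(A)                         # orthonormalize the columns
        return Q                                       # p x d; embed as X @ Q, then run LDA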
13 / 29

LOL Gaussian Intuition

14 / 29

LOL > PCA Theory

15 / 29

Unpacking the theory: Chernoff Information

  • C(F,G) = sup_{0<t<1} [ −log ∫ f^t(x) g^{1−t}(x) dx ]
  • the exponential rate at which the Bayes error decreases
  • it is the tightest possible bound on performance
  • if F = N(μ_0, Σ_0) and G = N(μ_1, Σ_1), then
  • C(F,G) = sup_t (1/2) [ t(1−t) δ′ Σ_t^{-1} δ + log( |Σ_t| / (|Σ_0|^t |Σ_1|^{1−t}) ) ], where Σ_t = t Σ_0 + (1−t) Σ_1
16 / 29

Unpacking the theory: Chernoff Information

  • C(F,G) = sup_{0<t<1} [ −log ∫ f^t(x) g^{1−t}(x) dx ]
  • the exponential rate at which the Bayes error decreases
  • it is the tightest possible bound on performance
  • if F = N(μ_0, Σ_0) and G = N(μ_1, Σ_1), then
  • C(F,G) = sup_t (1/2) [ t(1−t) δ′ Σ_t^{-1} δ + log( |Σ_t| / (|Σ_0|^t |Σ_1|^{1−t}) ) ], where Σ_t = t Σ_0 + (1−t) Σ_1

Chernoff on Projected Data

  • let F & G be Gaussian with same covariance (LDA model)
  • let A be any linear transformation
  • C(FA, GA) = 1/8 * || PZ Σ-1/2 δ ||F2
  • PZ = Z (Z' Z)-1 Z', and Z = Σ1/2A'
  • let CA := C(FA, GA)
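
For the equal-covariance (LDA) model used in the results below, the general expression collapses; a standard one-line derivation (not shown on the slides):

    % with Sigma_0 = Sigma_1 = Sigma, we get Sigma_t = Sigma and the
    % log-determinant term vanishes, so
    C(F,G) = \sup_{0<t<1} \tfrac{1}{2}\, t(1-t)\, \delta' \Sigma^{-1} \delta
           = \tfrac{1}{8}\, \delta' \Sigma^{-1} \delta,
    % since t(1-t) attains its maximum 1/4 at t = 1/2. Taking A = I (so that
    % P_Z = I) in the projected formula recovers the same value.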
17 / 29

"Thm A: LOL > PCA'"

  • let F & G be Gaussian with same covariance
  • CLOL ≥ CPCA'
  • inequality is strict whenever δ' (I - UdUd' ) δ ≥ 0.
18 / 29

"Thm B: LOL > PCA"

  • C_PCA = 4K / (4 − K), where
  • K = δ′ Σ_d^† δ (pseudo-inverse of the rank-d truncation)
  • so when δ lies in the span of the smaller eigenvectors, PCA discards nearly all the information


19 / 29

"Thm B: LOL > PCA"

  • C_PCA = 4K / (4 − K), where
  • K = δ′ Σ_d^† δ (pseudo-inverse of the rank-d truncation)
  • so when δ lies in the span of the smaller eigenvectors, PCA discards nearly all the information


Simple Example

  • if Σ is diagonal with decreasing λ's,
  • δ = (0, …, 0, s),
  • and λ_p + s²/4 < λ_d, then
  • C_PCA = 0
  • C_LOL = s² / λ_p
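
A numeric check of this example using the projected-Chernoff formula from slide 17 (illustrative values; note that formula carries a 1/8 factor):

    import numpy as np

    lam = np.array([10.0, 5.0, 1.0, 0.1])        # decreasing lambdas, p = 4
    s, d = 2.0, 2                                # lam[-1] + s^2/4 = 1.1 < lam[d-1] = 5
    white_delta = np.array([0, 0, 0, s]) / np.sqrt(lam)   # Sigma^{-1/2} delta

    def chernoff(A):
        # C_A = (1/8) || P_Z Sigma^{-1/2} delta ||^2, with Z = Sigma^{1/2} A'
        Z = np.diag(np.sqrt(lam)) @ A.T
        P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
        return np.linalg.norm(P @ white_delta) ** 2 / 8

    A_pca = np.eye(4)[:d]                        # top-d eigenvectors of Sigma
    A_lol = np.vstack([np.eye(4)[3], np.eye(4)[:d - 1]])  # [delta direction, top d-1]
    print(chernoff(A_pca), chernoff(A_lol))      # 0.0 vs s^2 / (8 lam[-1]) = 5.0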
20 / 29

"Thm C: [δ, Ud] > Ud"

  • let Ud be any matrix in Rp x d with Ud Ud' = I
  • arthimetic is messier
  • nearly the same result as PCA
  • basically, when δ and Ud are nearly orthogonal, adding delta helps
21 / 29

"Thm D: LOL > PCA as p increases"

  • let γ = λ_d − λ_{d+1}
  • let δ be sparse: each element is 0 with probability ε,
  • otherwise it is Gaussian
  • let p(1−ε) → θ
  • then with probability at least ε^d, C_PCA = 0 < C_LOL
  • and this probability can be made arbitrarily close to 1
22 / 29

"Thm E: LOL > PCA as n & p increases"

  • when n/p → 0, all results trivially hold


  • suppose Σ is low-rank + σ² I
  • suppose that M log p ≤ log n ≤ M′ log λ,
  • provided M & M′ are large enough;
  • then the estimates of C converge, so
  • E[C_LOL] > E[C_PCA]
23 / 29

LOL > PCA Simulations

24 / 29

LOL is fast

  • utilizes FlashX semi-external memory (SEM) computing
  • optimal (linear) scale-up and scale-out
  • SEM speed ≈ internal-memory speed
  • swapping in random projections (RP) for the SVD yields a 10× speed improvement
  • RP error rate ≈ SVD error rate
  • ~3 minutes for a 2-terabyte dataset
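
A generic sketch of the RP-for-SVD swap (a plain Gaussian random projection, not the FlashX implementation): replace the d leading singular vectors with a random p × d matrix, trading a little accuracy for a single O(npd) pass over the data:

    import numpy as np

    def rp_embed(X, d, seed=0):
        # random p x d projection in place of the top-d SVD basis
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
        return X @ R                             # one pass, no eigendecomposition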
25 / 29

LOL > PCA Data

26 / 29

LOL Hypothesis Testing & Regression

27 / 29

Discussion

  • simple supervised inference for wide data
  • big-data tools (e.g., Spark, H2O, VW) typically focus on large n
  • algorithmic and theoretical generalizations are straightforward
  • open-source implementations


http://neurodata.io/tools/LOL/
28 / 29


Questions?

Hiring Postdocs & Software Engineers Now!

e: jovo@jhu.edu | w: neurodata.io, gigantum.io

29 / 29
