name:opening #### Sparse Projection Oblique Randomer Forests Joshua T. Vogelstein |
.footnote[[jovo@jhu.edu](mailto:jovo@jhu.edu) | [neurodata.io](https://neurodata.io) | [@neuro_data](https://twitter.com/neuro_data)]
--- #### Motivation
-- - Discover predictable patterns in high-dimensional data - predict disease status vs wild-type - predict age - find "similar" individuals --- #### Formal Definition of Supervised Machine Learning
- Let $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$ - Let $y_i \in \mathcal{Y} \subseteq \mathbb{R}^q$ - Given $\mathcal{D}_n = (x_i, y_i)$ pairs, for $i \in 1,\ldots, n$ - Assume each pair is sampled independently and identically from some joint distribution, $F_{XY}$ - Use the data $\mathcal{D}\_n$ to obtain an estimate of a function of $F\_{Y|X}$ --- ### Examples - Example 1 (classification): $y \in \{0,1\}$, we desire to estimate $E[Y|X]$ -- - Example 2 (regression): $y \in \mathbb{R}$, we desire to estimate $E[Y|X]$ -- - Example 3 (multivariate regression): $y \in \mathbb{R}^q$, we desire to estimate $E[Y|X]$ -- - Example 4 (canonical correlation analysis): we desire to find low-dimensional representations of $X$ and $Y$ that preserve certain properties -- - Example 5 (quantile regression): $y \in \mathbb{R}$, we desire to estimate $P[Y < y | X]$ -- - Example 6 (mutual information): we desire to estimate the amount of information $X$ and $Y$ have about one another, $MI(X,Y)= \sum_x P[x] \sum_y P[y|x] \log \frac{P[y|x]}{P[y]}$ -- - Example 7 ($k$-Sample testing): we wonder whether $U$ is different from $V$, $H\_0: F\_U = F\_V$, $H\_A: F\_U \neq F\_V$ -- .r[!!!!] --- class: top, left ## Outline
- intuition - simulations - real data - extensions - theory - discussion --- class: middle # .center[intuition] --- class: top, left #### Intuitive Desiderata of Supervised Learning Procedures 1. Performant under *any* joint distribution - low- and high-dimensional - Euclidean and structured data (e.g., sequences, images, networks, shapes) - linear and nonlinear relationships 2. Is interpretable 3. Is computationally efficient --- ### linear 2-way classification in 1 dimension
- build a classifier on $\mathcal{D}_n$ - try all possible splits, compute the "score" for each - split on the best choice (highest score) - predict the class of a new $x$ - $g(x) = 1$ if $ x>$ threshold - $g(x) = 0$ if $x < $ threshold
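A minimal sketch of this exhaustive split search (hedged: it assumes Gini impurity as the "score"; the function and variable names are illustrative, not from any particular implementation):

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try every candidate split of a 1-D feature and return the one with
    the highest score (largest decrease in Gini impurity)."""
    best_t, best_score = None, -np.inf
    parent = gini(y)
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        # score = parent impurity minus the weighted impurity of the daughters
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best_score:
            best_t, best_score = t, parent - child
    return best_t, best_score

# toy 1-D, 2-class example
x = np.array([0.1, 0.35, 0.4, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))  # -> (0.4, 0.5): a perfect split
```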
-- - limitations? --- ### decision tree in 1 dimension
- build a tree on $\mathcal{D}_n$ - try all possible splits, compute the "score" for each - split on the best choice (highest score) to create two daughter nodes - repeat on the daughter nodes - predict the class of a new $x$ - push down the tree - select the plurality class for the node $x$ lands in -- - limitations? --- ### random forest in 1 dimension
- build a forest on $\mathcal{D}_n$ - subsample the data to select $m < n$ points - build a tree on each - predict the class of a new $x$ - push down each tree - select the plurality vote of the trees -- - limitations? --- ### random forest in 1D - what score function should I use? purity - how deep should each tree be? as deep as possible - how many trees? about 1000 seems fine - how does it scale? linearly in $n$, number of trees, dimension of data - any problems in 1D? not really --- ### linear 2-way classification in 2 dimensions
- build a classifier on $\mathcal{D}_n$ - try all possible .r[angles (?)], compute the "score" for each - split on the best choice (highest score) - predict the class of a new $x$ - $g(x) = 1$ if $x$ is above the line - $g(x) = 0$ if $x$ is below the line
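A hedged sketch of what "trying all possible angles" could look like (the angle grid and names are illustrative, not a prescribed procedure): project the 2-D points onto each candidate direction, then reuse the 1-D threshold search along that projection.

```python
import numpy as np

def gini(y):
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_oblique_split(X, y, n_angles=90):
    """Grid over directions ("angles"); for each, project the 2-D points
    onto that direction and run the usual 1-D threshold search."""
    best = (-np.inf, None, None)        # (score, direction, threshold)
    parent = gini(y)
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])
        proj = X @ w                    # 1-D view of the data along direction w
        for t in np.unique(proj):
            left, right = y[proj <= t], y[proj > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent - child > best[0]:
                best = (parent - child, w, t)
    return best

# toy data: two classes separated by the oblique line x1 + x2 = 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X.sum(axis=1) > 1).astype(int)
score, w, t = best_oblique_split(X, y)  # best direction should be near 45 degrees
```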
-- - limitations? --- ### decision tree in 2 dimensions .pull-left30[
] .pull-right70[ - build a tree on $\mathcal{D}_n$ - .r[for each dimension] try all possible splits, compute the "score" for each - split on the best choice (highest score) to create two daughter nodes - repeat on the daughter nodes - predict the class of a new $x$ - push down the tree - select the plurality class for the node $x$ lands in ] -- .pull-right70[ - limitations? ] --- ### random forest in 2 dimensions .pull-left30[
] .pull-right70[ - build a forest on $\mathcal{D}_n$ - subsample the data to select $m < n$ points - build a tree on each - predict the class of a new $x$ - push down each tree - select the plurality vote of the trees ] -- .pull-right70[ - limitations? ] --- ### random forest in 2D - what score function should I use? purity - how deep should each tree be? as deep as possible - how many trees? about 1000 seems fine - how does it scale? linearly in $n$, number of trees, dimension of data - any problems in 2D? only considers .r[axis-aligned splits] ---
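### sparse oblique projections (sketch)

One way past the axis-aligned limitation: at each split node, sample a sparse random projection matrix and run the ordinary split search in the projected space. A hedged sketch (the ±1 nonzeros and the `density` parameter here are illustrative; see the SPORF paper/package for the exact sampling scheme):

```python
import numpy as np

def sparse_random_projection(p, d, density, rng):
    """Sample a p x d sparse matrix whose few nonzero entries are +/-1.
    Each column defines one candidate sparse, oblique split direction."""
    A = np.zeros((p, d))
    nonzero = rng.random((p, d)) < density        # which entries participate
    A[nonzero] = rng.choice([-1.0, 1.0], size=nonzero.sum())
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                    # n = 100 points in p = 20 dims
A = sparse_random_projection(p=20, d=5, density=0.1, rng=rng)
X_proj = X @ A   # n x d: each column then gets the usual 1-D threshold search
# axis-aligned splitting is the special case where each column has a single 1
```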
--- ### Forest-RC in 2 dimensions .pull-left30[
] .pull-right70[ - build a forest on $\mathcal{D}_n$ - subsample the data to select $m < n$ points - select $L$, number of linear combinations per split - build a .r[F-RC] tree on each - predict the class of a new $x$ - push down each tree - select the plurality vote of the trees - basic idea 1. rather than using "axis-aligned", use .r[oblique] splits 2. use .r[sparse] oblique splits ] -- .pull-right70[ - limitations? ] --- ### (SPO) Random.r[er] Forest in 2 dimensions .pull-left30[
] .pull-right70[ - build a forest on $\mathcal{D}_n$ - subsample the data to select $m < n$ points - select $\lambda$, expected number of linear combinations per split - build a random.r[er] tree on each - predict the class of a new $x$ - push down each tree - select the plurality vote of the trees - basic idea 1. robust to parameter tuning 2. scalable open source implementation ] -- .pull-right70[ - limitations? ] --- class: top, left #### Intuitive Desiderata of Supervised Machine Learning
1. Performant under *any* joint distribution - low- and high-dimensional - Euclidean and structured data (e.g., sequences, images, networks, shapes) - linear and nonlinear relationships 2. Is interpretable 3. Is computationally efficient --- class: middle, center # .center[simulations] --- class: top, left ### 2 Different Functions
- *sparse parity* 3 signal dimensions, 17 noise dimensions - *orthant* 6 signal dimensions, no noise dimensions --- ### Robustness to Hyperparameters
--- ### Feature Importance
--- class: middle, center # .center[real data] --- ### Benchmark Suite - 100+ benchmark problems from previous literature - sample sizes range from 10's to 1000's - dimensions range from 10's to 1000's - previous result: RF better than all existing classifiers --- ### Empirical Performance (Numeric)
--- ### Empirical Performance (Mixed)
--- ### Summary so far - random forest was the best classifier - now, SPORF is --- class: middle # .center[extensions] --- ## Random Projections - RF does feature selection - Feature selection is actually a sparse matrix multiplication - We can do other kinds of sparse matrix multiplications --- ### Classifying via Manifold Forests (MF)
--- ## Computational Considerations - Train time - Scalability - Test time --- ### Training Time
--- ### Scalability
--- ### Testing Time
--- ## Special Considerations - *Mixed data* some problems, "unbiased recursive partitioning" can mitigate them - *Missing data* in theory, no problem - *Outliers* in practice, no problem - *Regression* in certain implementations, no problem - *Very large sample size* in our implementation, no problem --- ### Universal Consistency
--- ## Geodesic Learning - Assume data live near a low-dimensional manifold - But we observe data in high-dimensional space - Find the nearest neighbors (nn) on the manifold #### Metrics - *Geodesic Recall @ k* fraction of estimated k-nn that are k-nn on the manifold - Compute average GR@k over all points #### Algorithms - *Euclidean* (just compute Euclidean distances in observed dimensions) - *Isomap* - *UMAP* - *Random Forest* - *URerF* --- ### Simulated Datasets
- Each has low-dimensional points - We add noise dimensions --- ### Simulated Results for Low Dimensions
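For concreteness, a minimal sketch of the GR@k metric these results report (illustrative code; it assumes the true manifold neighbors are available from the noise-free construction):

```python
import numpy as np

def geodesic_recall_at_k(true_knn, est_knn):
    """GR@k: for each point, the fraction of its estimated k nearest
    neighbors that are also among its true k-nn on the manifold,
    averaged over all points."""
    recalls = [len(set(t) & set(e)) / len(t) for t, e in zip(true_knn, est_knn)]
    return float(np.mean(recalls))

# toy example with 3 points and k = 2
true_knn = [[1, 2], [0, 2], [0, 1]]
est_knn  = [[1, 3], [0, 2], [3, 4]]
print(geodesic_recall_at_k(true_knn, est_knn))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```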
--- ### Simulated Results for High Dimensions
--- ### Real Data .pull-left[
] .pull-right[
] --- ## Mutual Information - RF learns two things: - a partition of the data - a posterior in each partition - Can do them separately - Can correct for finite sample bias --- ### Estimating Posteriors/Quantiles
--- ### Estimating Mutual Information
--- ### Real Data Mutual Information .pull-left[
] .pull-right[ - CF: 0.65 - KSG: 0.24 - Mixed KSG: 0.05] --- ### SPORF $k$-Sample Testing - have previously shown the $k$-sample testing problem can be reduced to an independence testing problem - can use RF "kernel" in MGC - can directly estimate MI ---
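### plug-in MI estimation (sketch)

A hedged sketch of the plug-in idea behind that last bullet (a generic partition-based estimator; the honest-sampling and finite-sample bias corrections are not shown): take a learned partition (e.g., a forest's leaves) and the class posterior in each cell, and plug them into the MI formula from the opening slides.

```python
import numpy as np

def plugin_mi(cells, y, n_classes):
    """Plug-in MI estimate from a partition of the feature space.
    cells[i] is the partition cell (e.g., leaf id) containing x_i;
    y[i] is the class label of x_i."""
    n = len(y)
    p_y = np.bincount(y, minlength=n_classes) / n                       # P[y]
    mi = 0.0
    for c in np.unique(cells):
        in_c = cells == c
        p_c = in_c.mean()                                               # P[cell]
        p_y_c = np.bincount(y[in_c], minlength=n_classes) / in_c.sum()  # P[y|cell]
        keep = p_y_c > 0
        mi += p_c * np.sum(p_y_c[keep] * np.log(p_y_c[keep] / p_y[keep]))
    return mi

# toy example: a perfectly informative two-cell partition of balanced labels
cells = np.array([0, 0, 1, 1])
y = np.array([0, 0, 1, 1])
print(plugin_mi(cells, y, n_classes=2))  # log(2) ~= 0.693 nats
```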
--- class: middle, center # .center[theory] --- class: top, left ### Universal Consistency Lemma: SPORF is a universally consistent estimator of the regression function -- ### Mutual Information Lemma: SPORF with honest sampling is a consistent estimator of MI -- ### URerF is Robust to Noise Dimensions Lemma: E[URerF(geodesic)] = E[true geodesic] + constant --- ### Overall Summary - SPORF empirically outperforms other classifiers on benchmarks - SPORF scales better than other implementations - SPORF provides feature importance - SPORF can work on non-Euclidean data - SPORF can estimate nearest (geodesic) neighbors - SPORF can estimate mutual information - SPORF can implement hypothesis testing --- ### References - ROFLMAO [[1]](https://doi.org/10.1137/1.9781611974973.56) - SPORF [[2]](http://arxiv.org/abs/1506.03410) - RF Packing [[3]](https://arxiv.org/abs/1806.07300) - RF Kernel [[4]](https://arxiv.org/abs/1812.00029) ### Forthcoming (drafts available upon request) - RF for MI [5] - URerF [6] --- ### Next Steps
- Get going with it ([download here](https://neurodata.io/rerf)) - Apply it to your data --- ### Acknowledgements [jovo@jhu.edu](mailto:jovo@jhu.edu) | [neurodata.io](https://neurodata.io) | [@neuro_data](https://twitter.com/neuro_data)
Carey Priebe
Randal Burns
Michael Miller
Eric Bridgeford
Drishti Mannan
Jesse Patsolic
Benjamin Falk
Alex Loftus
Brian Caffo
Minh Tang
Avanti Athreya
Vince Lyzinski
Daniel Sussman
Youngser Park
Cencheng Shen
Shangsi Wang
Tyler Tomita
James Brown
Disa Mhembere
Ben Pedigo
Jaewon Chung
Greg Kiar
Satish
Jesus
Ronak
bear
Brandon
Richard Guo
Sambit Panda
Jeremias Sulam
♥, 🦁, 👪, 🌎, 🌌
---