name:opening

## Big Biomedical Data Science

Joshua T. Vogelstein

.foot[[jovo@jhu.edu](mailto:jovo@jhu.edu) | [@neuro_data](https://twitter.com/neuro_data)]

---
class: middle, inverse

.center[please interrupt]

---

### What is Big Biomedical Data Science?

A field that develops and applies algorithms, statistical models, and (database / machine learning) systems to manage, visualize, wrangle, summarize, generalize, and control big data to improve health and healthcare.

--

data = genomics, imaging, electronic health records, etc.

---
class: middle, inverse

## .center[Why is it hard?]

---

### Reason #1: Evolution

- 1 billion years of evolution for human perceptual capacities
- 1 billion receptors at 1 kHz each is .r[1 terabit per second]
- Minimize false *negatives*: is that a tiger?
---

### Reason #2: We Don't Know What Random Looks Like
---

### Reason #3: Grazing Goat Starves

---

#### Reason #4: How many circles per square?
- ratio of the volume of a ball with diameter 1 to that of a cube with side 1

--

- volume of cube → 1
- volume of ball $= \frac{\pi^{n/2}}{2^n \, \Gamma(n/2 + 1)}$ → 0
- ball contains almost no volume in high dimensions

---

#### Reason #4: How many circles per square?
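
A quick numerical check of the ball-to-cube volume ratio: a minimal Python sketch, assuming the standard closed form $V_n(r) = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)} r^n$ for the volume of an $n$-ball, with $r = 1/2$. The function name `ball_to_cube_ratio` is illustrative and not part of the original slides.

```python
import math


def ball_to_cube_ratio(n: int) -> float:
    """Volume of the n-ball with diameter 1 divided by the volume of the
    n-cube with side 1 (the cube's volume is always 1)."""
    radius = 0.5
    # Closed form for the n-ball volume: pi^(n/2) / Gamma(n/2 + 1) * r^n
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n


if __name__ == "__main__":
    # Print the ratio for a few dimensions to see how fast it shrinks.
    for n in (1, 2, 3, 5, 10, 20, 50):
        print(f"n = {n:2d}  ratio = {ball_to_cube_ratio(n):.3e}")
```

The ratio is 1 for $n = 1$, $\pi/4$ for $n = 2$, $\pi/6$ for $n = 3$, and shrinks rapidly toward 0 as $n$ grows.
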
- every additional dimension doubles the number of corners
- cubes are pointy in high dimensions

---

### Reason #5: Estimate mean

- sample $x \sim \mathcal{N}_1(\mu,1)$, estimate $\mu$?

--

- sample $[x_1, x_2] \sim \mathcal{N}_2([\mu_1, \mu_2], I)$, estimate $\mu = [\mu_1, \mu_2]$?

--

- sample $[x_1, x_2, x_3] \sim \mathcal{N}_3(\mu, I)$, estimate $\mu$?

The usual estimator of the mean of a multivariate Gaussian is inadmissible in three or more dimensions.

.footnote[Stein, 1956]

---

### Summary

Don't trust intuition; it sucks at this.

---
class: middle, inverse

## .center[Potential Solutions?]

--

.center[(time check?)]

---

### #1: Build Better Human Brains

1. Impose selective pressures
2. Wait 1,000 generations
3. Have better human brains

---

### #2: Build Knowledge Systems

.center[
]

---

### #2: Build Knowledge Systems

.center[
]

.footnote[AI Winter #2]

---

### #3: Build Learning Systems
.footnote[AI Winter #3]

---

### Approaches that will fail

1. better human brains (evolution)
2. just parametric modeling (knowledge systems)
3. just deep learning (machine learning systems)

---

### Proposed Solution: AI Spring #4

1. Work together: AI + domain experts
2. Build probabilistic generative models using maximum amount of domain knowledge
3. Use those models to guide a search for low-dimensional latent structure

---

#### How Many Points to Describe a Line?

.center[
]

---

#### How Many Points to Describe a Plane?

.center[
]

---

### Proposed Solution

1. Work together: AI + domain experts
2. Build probabilistic generative models using maximum amount of domain knowledge
3. Use those models to guide a search for low-dimensional latent structure

- We need $\geq p$ points to describe a line in $p$ dimensions
- Our $p > n$ (more dimensions $p$ than samples $n$)
- We need to **learn** which "line" works best
- There are infinitely many candidates, so we need experts to help guide the search

---

### Good News, Bad News

- Good: you can, in theory, already do this
- Bad: no idea if this will work.

---

### A Glimmer of Hope
---

### Conclusions

- big biomedical data science is hard
- our intuitions are bad
- success requires (swallowing pride):
  - convert your knowledge into generative models
  - leverage models to guide search for "structure" in learning systems

---
class:center
---

### Acknowledgements
Carey Priebe
Randal Burns
Michael Miller
Daniel Tward
Eric Bridgeford
Vikram Chandrashekhar
Drishti Mannan
Jesse Patsolic
Benjamin Falk
Kwame Kutten
Eric Perlman
Alex Loftus
Brian Caffo
Minh Tang
Avanti Athreya
Vince Lyzinski
Daniel Sussman
Youngser Park
Cencheng Shen
Shangsi Wang
Tyler Tomita
James Brown
Disa Mhembere
Ben Pedigo
Jaewon Chung
Greg Kiar
Jeremias Sulam
♥, 🦁, 👪, 🌎, 🌌
---

### Four V's of big data

- volume
- velocity
- variety
- veracity

---

### Drosophila Brain Networks

---

### Geodesic Learning Drosophila Brain