
Big Biomedical Data Science

Joshua T. Vogelstein











jovo@jhu.edu | http://neurodata.io/talks/ | @neuro_data

1 / 36

please interrupt

2 / 36

What is Big Biomedical Data Science?

A field that develops and applies algorithms, statistical models, and (database / machine learning) systems to manage, visualize, wrangle, summarize, generalize, and control big data to improve health and healthcare.

3 / 36


What is Big Biomedical Data Science?

A field that develops and applies algorithms, statistical models, and (database / machine learning) systems to manage, visualize, wrangle, summarize, generalize, and control big data to improve health and healthcare.


data = genomics, imaging, electronic health records, etc.

5 / 36

Why is it hard?

6 / 36

Reason #1: Evolution

  • 1 billion years of evolution for human perceptual capacities
  • 1 billion receptors at 1 kHz each is 1 terabit per second
  • Minimize false negatives: is that a tiger?
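The bandwidth claim is back-of-the-envelope arithmetic; a quick sketch (assuming 1 bit per receptor per sample, an assumption not stated on the slide):

```python
# Back-of-the-envelope: 1e9 receptors, each sampled at 1 kHz,
# assuming 1 bit per receptor per sample.
receptors = 1e9
rate_hz = 1e3                        # 1 kHz per receptor
bits_per_second = receptors * rate_hz
print(bits_per_second)               # 1e12 bits/s = 1 terabit per second
```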

7 / 36

Reason #2: We Don't Know What Random Looks Like

8 / 36


Reason #3: Grazing Goat Starves

10 / 36

Reason #4: How many circles per square?

  • ratio of volume of ball to cube with diameter 1
11 / 36

Reason #4: How many circles per square?

  • ratio of volume of ball to cube with diameter 1
  • volume of cube → 1
  • volume of ball $\approx \frac{\pi^{n}}{n!}$ → 0
  • ball contains no volume in high-dimensions
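The vanishing-volume claim is easy to check numerically. A quick sketch (not from the talk) using the exact formula for the volume of an $n$-ball of radius $r$, $\pi^{n/2} r^{n} / \Gamma(n/2 + 1)$, rather than the slide's rough approximation:

```python
import math

def ball_volume(n, r=0.5):
    """Volume of an n-ball of radius r: pi^(n/2) * r^n / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

# Ratio of the inscribed ball (diameter 1) to the unit cube (volume 1).
# The ratio collapses toward zero as the dimension grows.
for n in (2, 3, 10, 20):
    print(n, ball_volume(n))
```

For $n = 2$ this gives the familiar $\pi/4 \approx 0.785$; by $n = 20$ the inscribed ball occupies less than one part in ten million of the cube.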
12 / 36

Reason #4: How many circles per square?

  • every additional dimension doubles the number of corners ($2^n$ total)
  • cubes are pointy in high dimensions
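A small sketch of the "pointy cube" picture (illustrative, not from the talk): the corner count is $2^n$, and the corners sit at distance $\sqrt{n}/2$ from the center while the inscribed ball's surface stays at distance $1/2$.

```python
import math

# A unit cube in n dimensions has 2**n corners, each at distance
# sqrt(n)/2 from the center; the inscribed ball's radius stays 1/2.
for n in (1, 2, 3, 10, 20):
    corners = 2 ** n
    corner_dist = math.sqrt(n) / 2
    print(n, corners, round(corner_dist, 3))
```

So as $n$ grows, almost all of the cube's volume migrates into corners that lie far outside the inscribed ball.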
13 / 36

Reason #5: Estimate mean

  • sample $x \sim \mathcal{N}_1(\mu, 1)$, estimate $\mu$?
14 / 36

Reason #5: Estimate mean

  • sample $x \sim \mathcal{N}_1(\mu, 1)$, estimate $\mu$?
  • sample $[x_1, x_2] \sim \mathcal{N}_2([\mu_1, \mu_2], I)$, estimate $\mu = [\mu_1, \mu_2]$?
15 / 36

Reason #5: Estimate mean

  • sample $x \sim \mathcal{N}_1(\mu, 1)$, estimate $\mu$?
  • sample $[x_1, x_2] \sim \mathcal{N}_2([\mu_1, \mu_2], I)$, estimate $\mu = [\mu_1, \mu_2]$?
  • sample $[x_1, x_2, x_3] \sim \mathcal{N}_3(\mu, I)$, estimate $\mu$?

The usual estimator of the mean of a multivariate Gaussian is inadmissible

Stein, 1956
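Stein's result can be seen in a short simulation (an illustrative sketch, not from the talk): for $p \ge 3$, the James–Stein shrinkage estimator $(1 - (p-2)/\lVert x \rVert^2)\,x$ beats the sample itself in mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 10, 20000
mu = np.ones(p)                             # arbitrary true mean

x = rng.normal(mu, 1.0, size=(trials, p))   # one observation per trial
norm_sq = np.sum(x ** 2, axis=1, keepdims=True)
js = (1 - (p - 2) / norm_sq) * x            # James-Stein shrinkage estimate

# Mean squared error of the usual estimator (mu_hat = x) vs James-Stein.
mse_mle = np.mean(np.sum((x - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))
print(mse_mle, mse_js)
```

The usual estimator's risk is $p$ (here 10); the shrinkage estimator's is strictly smaller, which is exactly what "inadmissible" means.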

16 / 36

Summary

Don't trust your intuition; it sucks at this.

17 / 36

Potential Solutions?

18 / 36

Potential Solutions?

(time check?)

19 / 36

#1: Build Better Human Brains

  1. Impose selective pressures
  2. Wait 1,000 generations
  3. Have better human brains
20 / 36

#2: Build Knowledge Systems

21 / 36

#2: Build Knowledge Systems

AI Winter #2

22 / 36

#3: Build Learning Systems

AI Winter #3

23 / 36

Approaches that will fail

  1. better human brains (evolution)
  2. just parametric modeling (knowledge systems)
  3. just deep learning (machine learning systems)
24 / 36

Proposed Solution: AI Spring #4

  1. Work together: AI + domain experts
  2. Build probabilistic generative models using maximum amount of domain knowledge
  3. Use those models to guide a search for low-dimensional latent structure
25 / 36

How Many Points to Describe a Line?

26 / 36

How Many Points to Describe a Plane?

27 / 36

Proposed Solution

  1. Work together: AI + domain experts
  2. Build probabilistic generative models using maximum amount of domain knowledge
  3. Use those models to guide a search for low-dimensional latent structure
    • We need $\geq n$ points to describe a "line" in $n$ dimensions
    • Our $p > n$
    • We need to learn which "line" works best
    • There are $\infty$ many, so we need experts to help guide the search
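The $p > n$ problem is easy to see directly (an illustrative sketch, not from the talk): with fewer samples than dimensions, a linear fit is underdetermined, so infinitely many "lines" fit the data perfectly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_dims = 5, 50                   # more dimensions than points
X = rng.normal(size=(n_samples, n_dims))
y = rng.normal(size=n_samples)

# Minimum-norm least squares: the system is rank-deficient, and the
# returned solution is just one of infinitely many exact fits.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)                                 # 5, far below 50
print(np.allclose(X @ coef, y))             # a perfect fit to the data
```

Without outside constraints, the data alone cannot pick among these solutions, which is why domain knowledge is needed to guide the search.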
28 / 36

Good News, Bad News

  • Good: you can, in theory, already do this
  • Bad: no idea if this will work.
29 / 36

A Glimmer of Hope

30 / 36

Conclusions

  • big biomedical data science is hard
  • our intuitions are bad
  • success requires (swallowing pride):
    • convert your knowledge into generative models
    • leverage models to guide search for "structure" in learning systems
31 / 36

32 / 36

Acknowledgements

Carey Priebe
Randal Burns
Michael Miller
Daniel Tward
Eric Bridgeford
Vikram Chandrashekhar
Drishti Mannan
Jesse Patsolic
Benjamin Falk
Kwame Kutten
Eric Perlman
Alex Loftus
Brian Caffo
Minh Tang
Avanti Athreya
Vince Lyzinski
Daniel Sussman
Youngser Park
Cencheng Shen
Shangsi Wang
Tyler Tomita
James Brown
Disa Mhembere
Ben Pedigo
Jaewon Chung
Greg Kiar
Jeremias Sulam
♥, 🦁, 👪, 🌎, 🌌

33 / 36

Four V's of big data

  • volume
  • velocity
  • variety
  • veracity
34 / 36

Drosophila Brain Networks

35 / 36

Geodesic Learning Drosophila Brain

36 / 36
