.footnote[[jovo@jhu.edu](mailto:jovo@jhu.edu) | [@neuro_data](https://twitter.com/neuro_data)]
---
#### Motivation
--
- "Understand" the relationship between physical (brain) and mental properties
--
- Question 1: are the two related at all?
--
- Question 2: how are they related?
--
- Note: many efforts dedicated to Q1, fewer to Q2.
- We will formally address Q1, then turn to Q2.
---
class: left
#### One often desires to test for independence
| \\(X\\) | \\(Y\\) |
| :---: | :---: |
| clouds | grass wetness |
| brain connectivity | creativity |
| brain shape | health |
| CLARITY | condition |
| gene expression | cancer |
--
| anything | anything else |
---
#### Statistics Background
- $X$ is a random variable (some measurement).
--
- $F_X$ is the distribution of $X$.
--
- This means $F_X(a) = P(X \leq a)$.
- This is denoted:
$$X \sim F_X$$
---
#### Statistics Background
- For two random variables $X$ and $Y$, $F_{XY}$ is called the joint distribution.
--
- This means that $F_{XY} (a, b) = P(X \leq a\ \mathrm{and}\ Y \leq b)$,
or,
$$(X, Y) \sim F_{XY}$$
---
#### Independence
$X$ and $Y$ are independent if neither contains information about the other.
In other words,
$$F_{XY}(a, b) = P(X \leq a\ \mathrm{and}\ Y \leq b) = P(X \leq a) \times P(Y \leq b) = F_X(a) F_Y(b)$$
for all $a$ and $b$; more compactly,
$$ F_{XY} = F_X F_Y$$
Note: These ideas and notation generalize for multivariate $X$ and $Y$.
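As a numerical illustration of the factorization (my own sketch, not part of the MGC software): with independent draws the empirical joint CDF approximately equals the product of the empirical marginals, and with dependent draws it does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cdfs(x, y, a, b):
    """Empirical F_XY(a, b) and the product F_X(a) * F_Y(b)."""
    joint = np.mean((x <= a) & (y <= b))
    product = np.mean(x <= a) * np.mean(y <= b)
    return joint, product

x = rng.normal(size=20000)
y = rng.normal(size=20000)
joint, product = empirical_cdfs(x, y, 0.5, 0.5)          # independent: close
joint_dep, product_dep = empirical_cdfs(x, x, 0.5, 0.5)  # Y = X: far apart
```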
---
#### Informal Definition of Hypothesis Testing
- **Null Hypothesis**: The conventional belief about a phenomenon of interest, written $H_0$.
- **Alternative Hypothesis**: An alternate belief about the same phenomenon, written $H_A$.
- **p-value**: The probability (under the null) of measurements more extreme than what was observed.
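The p-value above is commonly approximated by a permutation test; a minimal sketch (illustrative only, with absolute Pearson correlation standing in for the MGC statistic):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(x, y, stat, n_perm=500):
    """Approximate P(statistic >= observed) under the null by permuting y,
    which breaks any dependence on x while preserving both marginals."""
    observed = stat(x, y)
    null = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)  # +1 keeps p > 0

abs_corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
x = rng.normal(size=100)
p_dep = permutation_pvalue(x, 2 * x + rng.normal(size=100), abs_corr)
p_indep = permutation_pvalue(x, rng.normal(size=100), abs_corr)
```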
---
#### Formal Definition of Independence Testing
$$(X\_i,Y\_i) \sim F\_{XY} = F\_{X|Y} F\_Y, \quad i \in \{1,\ldots,n\}$$
$$H\_0: F\_{XY} = F\_X F\_Y $$
$$H\_A: F\_{XY} \neq F\_X F\_Y $$
---
class: top, left
## Outline
- intuition
- simulations
- real data
- extensions
- theory
- discussion
---
class: middle
# .center[intuition]
---
class: top, left
#### Intuitive Desiderata of Testing Procedure
1. Performant under *any* joint distribution
- low- and high-dimensional
- Euclidean and structured data (eg, sequences, images, networks, shapes)
- linear and nonlinear relationships
2. Reveals the "geometry" of dependence
3. Is computationally efficient
4. Provides a tractable algorithm that addresses the two motivating questions:
- Question 1: are the two related at all?
- Question 2: how are they related?
---
class: center
### correlation coefficient
$$r\_{XY}^2 = \frac{(\sum\_{i=1}^n ( x\_i - \bar{x}) (y\_i - \bar{y}))^2}{\sum\_{i=1}^n (x\_i - \bar{x})^2 \sum\_{i=1}^n (y\_i- \bar{y})^2} $$
---
class: center
### **mantel** correlation coefficient
$$r\_{XY}^2 = \frac{(\sum\_{i=1}^n ( x\_i - \bar{x}) (y\_i - \bar{y}))^2}{\sum\_{i=1}^n (x\_i - \bar{x})^2 \sum\_{i=1}^n (y\_i- \bar{y})^2} $$
$$d\_{XY}^2 = \frac{(\sum\_{i,\color{red}{j}=1}^n ( x\_i - \color{red}{x\_j}) (y\_i - \color{red}{y\_j}))^2}{\sum\_{i,\color{red}{j}=1}^n (x\_i - \color{red}{x\_j})^2 \sum\_{i,\color{red}{j}=1}^n (y\_i- \color{red}{y\_j})^2} $$
---
class: center
### **generalized** correlation coefficient
$$r\_{XY}^2 = \frac{(\sum\_{i=1}^n ( x\_i - \bar{x}) (y\_i - \bar{y}))^2}{\sum\_{i=1}^n (x\_i - \bar{x})^2 \sum\_{i=1}^n (y\_i- \bar{y})^2} $$
$$d\_{XY}^2 = \frac{(\sum\_{i,\color{red}{j}=1}^n ( x\_i - \color{red}{x\_j}) (y\_i - \color{red}{y\_j}))^2}{\sum\_{i,\color{red}{j}=1}^n (x\_i - \color{red}{x\_j})^2 \sum\_{i,\color{red}{j}=1}^n (y\_i- \color{red}{y\_j})^2} $$
$$c\_{XY}^2 = \frac{(\sum\_{i,j=1}^n \color{red}{\sigma\_x}(x\_i,x\_j) \color{red}{\sigma\_y}(y\_i,y\_j))^2}{\sum\_{i,j=1}^n \color{red}{\sigma\_x}(x\_i,x\_j)^2 \sum\_{i,j=1}^n \color{red}{\sigma\_y}(y\_i,y\_j)^2} $$
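The generalized coefficient $c_{XY}^2$ is direct to compute from any pairwise similarity functions $\sigma_x, \sigma_y$; a sketch (illustrative, not the reference implementation):

```python
import numpy as np

def generalized_corr(x, y, sigma_x, sigma_y):
    """c_XY^2 from the slide: plug any pairwise similarity sigma
    into the correlation formula."""
    sx = np.array([[sigma_x(a, b) for b in x] for a in x])
    sy = np.array([[sigma_y(a, b) for b in y] for a in y])
    return np.sum(sx * sy) ** 2 / (np.sum(sx ** 2) * np.sum(sy ** 2))

x = np.arange(10.0)
diff = lambda a, b: a - b   # sigma(a, b) = a - b recovers the Mantel d_XY^2
c_linear = generalized_corr(x, 3 * x + 1, diff, diff)   # perfect linear: 1
```

By the Cauchy-Schwarz inequality the statistic always lies in $[0, 1]$.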
---
### local distance correlation
--
--
dcorr(X,Y)=0.15, p-val < 0.001
MGC(X,Y)=0.15, p-val < 0.001
--
--
--
dcorr(X,Y)=0.01, p-val 0.3
MGC(X,Y)= .r[0.13], p-val < .r[0.001]
---
### multiscale distance correlation
- compute local dcorr **at all scales**
- find scale with **max** smoothed test statistic
- permutation test to determine p-value
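The three steps above can be sketched as follows (a simplified illustration; the published MGC uses a particular centering convention, smoothing of the statistic map, and other refinements):

```python
import numpy as np

def local_dcorr_scan(x, y):
    """Scan all neighborhood scales (k, l): correlate the centered distance
    matrices using only entries within each point's k (resp. l) nearest
    neighbors, and return the maximum local statistic."""
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    center = lambda d: d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    cx, cy = center(dx), center(dy)
    rx = np.argsort(np.argsort(dx, axis=1), axis=1)  # neighbor ranks per row
    ry = np.argsort(np.argsort(dy, axis=1), axis=1)
    best = -np.inf
    for k in range(1, n):
        for l in range(1, n):
            mask = (rx < k) & (ry < l)
            a, b = cx[mask], cy[mask]
            den = np.sqrt(np.sum(a * a) * np.sum(b * b))
            if den > 0:
                best = max(best, float(np.sum(a * b) / den))
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=30)
stat_dep = local_dcorr_scan(x, x ** 2)   # strongly nonlinear dependence
```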
---
class: top, center
### Multiscale Generalized Correlation (MGC)
--
--
---
class: top, left
#### Intuitive Desiderata of Testing Procedure
1. Performant under *any* joint distribution
- low- and high-dimensional
- Euclidean and structured data (eg, sequences, images, networks, shapes)
- linear and nonlinear relationships
2. Reveals the "geometry" of dependence
3. Is computationally efficient
4. Provides a tractable algorithm that addresses the two motivating questions:
- Question 1: are the two related at all?
- Question 2: how are they related?
---
class: middle, center
# .center[simulations]
---
class: top, left
### 20 Different Functions (1D version)
---
class: top, left
### Definitions
- **power** is the probability of rejecting the null when the alternative is true
- $\beta_n(t)$ = power of test statistic $t$ given $n$ samples
- **relative power**: the power of one approach minus the power of another
- $\beta_n(mgc)- \beta_n(t)$
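Power can be estimated by Monte Carlo: simulate many data sets and count rejections. A sketch (illustrative, not the benchmark code; a simple correlation-based permutation test stands in for MGC):

```python
import numpy as np

rng = np.random.default_rng(1)

def perm_pvalue(x, y, n_perm=100):
    """Permutation p-value for |Pearson correlation| (stand-in statistic)."""
    obs = abs(np.corrcoef(x, y)[0, 1])
    null = [abs(np.corrcoef(x, rng.permutation(y))[0, 1]) for _ in range(n_perm)]
    return (1 + sum(t >= obs for t in null)) / (1 + n_perm)

def estimate_power(make_xy, n_trials=100, alpha=0.05):
    """beta_n(t): fraction of simulated data sets on which the test rejects."""
    return float(np.mean([perm_pvalue(*make_xy()) < alpha
                          for _ in range(n_trials)]))

n = 50
def linear_pair():   # dependent: y is a noisy linear function of x
    x = rng.normal(size=n)
    return x, x + rng.normal(size=n)

def null_pair():     # independent: power should stay near alpha
    return rng.normal(size=n), rng.normal(size=n)

power_linear = estimate_power(linear_pair)
power_null = estimate_power(null_pair)
```

Relative power is then just the difference of two such estimates.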
---
class: top, left
### 1D Relative Power (MGC nearly dominates)
---
class: top, left
### HD Relative Power (MGC nearly dominates)
---
class: top, left
### MGC Requires Half as Many Samples to Achieve the Same Power
---
class: top, left
#### MGC Reveals Geometry of Dependence
---
class: middle, center
# .center[real data]
---
class: top, left
### Real Data Desiderata
1. when we believe dependence, MGC obtains a small p-value
2. MGC provides insight into the geometry of real dependence
3. when there is no dependence, MGC correctly controls FPR
---
class: top, left
### MGC Discovers Relationships between Brain & Mental Properties
---
class: top, left
### MGC Discovers Pancreatic Cancer Biomarkers
---
### MGC Does Not Make Too Many False Discoveries
---
class: middle
# .center[extensions]
---
#### Categorical Features
- All testing depends on measuring similarity/distance between observations
- Let $Y \in$ {normal, condition 1, condition 2}
- $d(y_i,y_j) = 0$ if $y_i = y_j$
- $d(y_i,y_j) = 1$ if $y_i \neq y_j$
- Then everything just works :)
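A sketch of that 0/1 distance (illustrative; any independence test built on distance matrices can consume the result):

```python
import numpy as np

def discrete_distance_matrix(labels):
    """d(y_i, y_j) = 0 if y_i == y_j, else 1, for categorical labels."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

d = discrete_distance_matrix(
    ["normal", "condition 1", "condition 1", "condition 2"])
```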
#### Mixed Data
- Define similarity/distance appropriately for each feature, then combine.
- In theory, the normalization procedure will work appropriately, though not yet tested.
#### Missing Data & Outliers
- *Missing*: does not yet work, though see the RerF talk later today for ideas
- *Outliers*: handled natively; probably best to include them, otherwise the statistics get (more) complicated. Not yet carefully tested.
---
### Computational Considerations
- $n$ is the # of samples
- $T$ is the # of threads
| Method | Complexity |
| :---: | :---: |
| Dcorr | $n^2 / T$ |
| HHG | $n^2 \log n / T$ |
| MGC | $n^2 \log n / T$ |
- Python, MATLAB and R code at [http://neurodata.io/mgc](http://neurodata.io/mgc)
- Note: none of the implementations are parallelized
---
### Fast MGC
---
### K-Sample Tests are Independence Tests
Recall this definition of independence testing
$$(X\_i,Y\_i) \sim F\_{XY} = F\_{X|Y} F\_Y, \quad i \in \{1,\ldots,n\}$$
$$H\_0: F\_{XY} = F\_X F\_Y $$
$$H\_A: F\_{XY} \neq F\_X F\_Y $$
--
Here is the formal definition of 2-sample testing
$$U\_i \sim F_U, \, i \in \{1,\ldots,n\} \quad V\_i \sim F\_V, \, i \in \{1,\ldots,m\}$$
$$H\_0: F\_{U} = F\_V $$
$$H\_A: F\_{U} \neq F\_V $$
--
Let $X = [U\_1, \ldots, U\_n \mid V\_1, \ldots, V\_m]$ and $Y=[0, \ldots, 0 \mid 1, \ldots, 1]$.
Now, testing whether $X$ and $Y$ are independent is equivalent to testing whether $F_U$ and $F_V$ are the same.
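The construction is one line of bookkeeping; a sketch (illustrative):

```python
import numpy as np

def two_sample_to_independence(u, v):
    """Stack the two samples into X and attach 0/1 group labels Y, so a
    two-sample test of F_U vs F_V becomes an independence test of X and Y."""
    x = np.concatenate([u, v])
    y = np.concatenate([np.zeros(len(u)), np.ones(len(v))])
    return x, y

x, y = two_sample_to_independence(np.array([1.0, 2.0]),
                                  np.array([5.0, 6.0, 7.0]))
```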
---
class: top, left
### 20 Different 2-Sample Functions (2D version)
---
class: top, left
### Relative Power vs Sample Size for 2D Rotations
---
class: top, left
### Relative Power vs Angle for 2D Rotations
---
class: top, left
### Relative Power vs Dimension for Rotations
---
### Graph Equality Testing
For example, are the left and right hemispheres of the larval Drosophila Mushroom Body sampled from the same distribution?
$G=(V_1,E_1) \sim F_X, \quad H = (V_2, E_2) \sim F_Y$
$H\_0: F\_{X} = F\_Y , \quad H\_A: F\_{X} \neq F\_Y $
---
### Graph Equality Testing
For example, are the left and right hemispheres of the larval Drosophila Mushroom Body sampled from the same distribution?
$G=(V_1,E_1) \sim F_X, \quad H = (V_2, E_2) \sim F_Y$
$H\_0: F\_{X} = F\_Y , \quad H\_A: F\_{X} \neq F\_Y $
---
### Graph Independence Testing
For example, in the Hermaphrodite *C. elegans*, is the chemical synapse connectome independent of the gap junction connectome?
$G=(V,E_1, Z_v) \sim F_{X|Z}, \quad H = (V, E_2, Z_v) \sim F_{Y|Z}$
$H\_0: F\_{XY|Z} = F\_{X|Z} F\_{Y|Z} , \quad H\_A: F\_{XY|Z} \neq F\_{X|Z} F\_{Y|Z} $
---
### Graph Independence Testing
For example, in the Hermaphrodite *C. elegans*, is the chemical synapse connectome independent of the gap junction connectome?
$G=(V,E_1, Z_v) \sim F_{X|Z}, \quad H = (V, E_2, Z_v) \sim F_{Y|Z}$
$H\_0: F\_{XY|Z} = F\_{X|Z} F\_{Y|Z} , \quad H\_A: F\_{XY|Z} \neq F\_{X|Z} F\_{Y|Z} $
---
### Graph Topology vs Vertex Attribute Testing
For example, in *C. elegans*, each node/neuron has an attribute $Z_v$ (e.g., cell type), and we wish to test whether connectivity and the attribute are independent.
$G=(V,E, Z\_v) \sim F\_{EZ}$
$H\_0: F\_{EZ} = F\_{E} F\_{Z} , \quad H\_A: F\_{EZ} \neq F\_{E} F\_{Z} $
---
### Time Series Testing
For example, is a given ROI independent of another ROI?
Consider a strictly stationary time series $\{(X\_t,Y\_t)\}_{t=0}^{n}$.
Choose some $M$, the "maximum lag" hyperparameter. We wish to test the following hypothesis.
$$ H\_0: F\_{X\_t,Y\_{t-j}} = F\_{X\_t} F\_{Y\_{t-j}} \text{ for each } j \in \{0, 1, ..., M\}, \ \forall t$$
$$ H\_A: F\_{X\_t,Y\_{t-j}} \neq F\_{X\_t} F\_{Y\_{t-j}} \text{ for some } j \in \{0, 1, ..., M\}, \ \forall t$$
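A sketch of the lag alignment implied by the hypotheses (illustrative; `lagged_pairs` is a made-up helper name):

```python
import numpy as np

def lagged_pairs(x, y, j):
    """Align the pairs (X_t, Y_{t-j}) tested at lag j: drop the first j
    entries of x and the last j entries of y."""
    n = len(x)
    return x[j:], y[: n - j]

x = np.arange(6.0)
a, b = lagged_pairs(x, x, 2)   # pairs (X_2, Y_0), (X_3, Y_1), ...
```

Running an independence test on each lag $j \in \{0, \ldots, M\}$ and combining (e.g., taking the maximum statistic) gives a test of the composite hypothesis.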
---
### Application: Functional Connectomics
---
class: middle, center
# .center[theory]
---
class: top, left
### Definitions
- $t$ is sample MGC
- $T$ is population MGC
- $t^\prime$ is sample Dcorr
- $t^*$ is sample oracle MGC
- $\beta_n(t)$ = power of test statistic $t$ given $n$ samples
---
### Theoretical Desiderata
| name | principle |
| --- | --- |
| boundedness | $0 \leq t,T \leq 1 $|
| symmetric | $T(X,Y) = T(Y,X)$ |
| 1-linear. | $ T= 1 \Leftrightarrow y = A x + b$ |
| 0-indep. | $T=0 \Leftrightarrow H(X \lvert Y) = H(X)$ |
| ortho. invar. | $T(X,Y) = T(a_1 + b_1 C_1 X, a_2 + b_2 C_2 Y)$ |
--
| univ. consist. | $\beta\_n(t) \to 1, \quad \forall \, F\_{XY}$ |
| dominance | $\beta\_n(t) \geq \beta\_n(t^\prime), \, \forall n,\, t^\prime \in \mathcal{T}, \forall F\_{XY}$ |
| convergence | $t \to T$ as $n \to \infty$ |
---
class: top, left
#### MGC is a Reasonable Dependence Statistic
Thm: Oracle MGC has the following properties:
- 0 ≤ MGC ≤ 1
- MGC = 0 only under independence
- MGC is symmetric
- MGC = 1 only under linear relationship
- MGC is invariant to rotation, translation, and scale of X and/or Y
---
class: left
### MGC has power 1 for all \\(F_{XY}\\)
Lemma: \\( \beta_n (t) \to 1 \\) as \\( n \to \infty \\) whenever \\( E [X], E [Y] < \infty. \\)
---
class: left
### Linear: Local = Global
Lemma: If $X$ is linearly dependent on $Y$, then it always holds that
$$ \beta_n(t^*) = \beta_n(t^\prime) $$
for any $n$.
---
class: left
### (Certain) Nonlinear: Local > Global
Lemma: There exist certain nonlinear relationships between $X$ and $Y$, and scales $k$ and $l$, such that
$$\beta_n( t^{k,l} ) > \beta_n(t^\prime). $$
---
class: left
### Oracle MGC > Dcorr
Thm: Oracle MGC statistically dominates dcorr, that is,
$$\beta_n( t^* ) \geq \beta_n(t^\prime). $$
---
class: left
### Sample MGC Converges
Thm: For any $F_{XY}$ with finite 1st moment,
$\beta_n(t) \to 1$ as $n \to \infty$
if and only if distance functions $\sigma_X$ and $\sigma_Y$ are of strong negative type.
---
class: top, left
### Theoretical Desiderata
| name | principle |
| --- | --- |
| boundedness | ✅|
| symmetric | ✅|
| 1-linear. | ✅ |
| 0-indep. | ✅ |
| ortho. invar. | L$_p$ |
| univ. consist. | ✅ |
| dominance | ✅ |
| convergence | ✅ |
---
### Overall Summary
- Oracle MGC theoretically dominates, even in finite samples
- MGC empirically nearly dominates on extensive simulations
- Visual quantitative characterization of arbitrary relationships
- MGC reveals geometry of dependence in real data
- MGC mitigates "post-selection inference" problems
---
### References
- MGC for scientists [[1]](https://elifesciences.org/articles/41690)
- Foundational theory for MGC [[2]](https://www.tandfonline.com/doi/full/10.1080/01621459.2018.1543125)
- MGC for independence between graph topology & attributes [[3]](https://arxiv.org/abs/1703.10136)
- MGC for signal subgraph detection [[4]](https://arxiv.org/abs/1801.07683)
- MGC for clustering (ish) [[5]](https://arxiv.org/abs/1710.09859)
- Energy and kernel tests are equivalent [[6]](https://arxiv.org/abs/1806.05514)
- MGC + RF [[7]](https://arxiv.org/abs/1812.00029)
- mgcpy [[8]](https://arxiv.org/abs/1907.02088)
- graph independence testing [[9]](https://arxiv.org/abs/1906.03661)
- MGCX [10]
### Forthcoming (drafts available upon request)
- fast mgc [11]
- graph two sample testing [12]
- multivariate nonpar k-sample tests [13]
---
### Next Steps
- Get you going with it ([download here](https://neurodata.io/mgc))
- Apply it to your data
---
### Acknowledgements
[jovo@jhu.edu](mailto:jovo@jhu.edu) | [neurodata.io](https://neurodata.io) | [@neuro_data](https://twitter.com/neuro_data)
Carey Priebe
Eric Bridgeford
Drishti Mannan
Jesse Patsolic
Benjamin Falk
Kwame Kutten
Eric Perlman
Alex Loftus
Brian Caffo
Minh Tang
Avanti Athreya
Vince Lyzinski
Daniel Sussman
Youngser Park
Cencheng Shen
Shangsi Wang
Tyler Tomita
James Brown
Disa Mhembere
Ben Pedigo
Jaewon Chung
Greg Kiar
Satish
Jesus
Ronak
bear
Brandon
Richard Guo
Sambit Panda
Jeremias Sulam
♥, 🦁, 👪, 🌎, 🌌
---