---
# Outline
- Theoretical Results
- Numerical Analysis
- Bleeding Edge
- Discussion
---
class: middle
# .center[Theoretical Results]
---
### Our Goal
Build a knowledge representation system with the following properties:
1. can answer various kinds of queries
2. has a rigorous mathematical and statistical framework
3. enables subject matter experts to encode domain-specific information including context
--
Requires definitions for
1. query types to support, and
2. mathematical desiderata to satisfy.
---
### Queries of Interest
Given $N$ entities, each measured with up to $M$ modalities,
as well as directly measuring $R$ relationships (e.g., Facebook friends),
discover and decipher relationships between subsets of:
1. entities,
2. modalities.
---
#### Mathematical and Statistical Desiderata
1. .r[Sufficiency]: Operate in any measurable metric space
2. .r[Uncertainty]: Represent and quantify uncertainty
3. .r[Complexity]: Can discover any complex relationship
4. .r[Consistency]: Sample estimator converges to truth
4. .r[Robustness]: Robust to model misspecifications
4. .r[Efficiency]: Computationally tractable/efficient
---
### Limitations of Existing Approaches
Every existing approach has severe limitations in $\geq 1$ of these desiderata:
1. .r[Sufficiency]: HHG, dCorr, HSIC
2. .r[Uncertainty]: Isomap, Diffusion Wavelets
3. .r[Complexity]: Kendall, Spearman
4. .r[Consistency]: Maximal Correlation, MIC
4. .r[Robustness]: Pearson
4. .r[Efficiency]: TDA
---
#### Our Key Principles
1. all knowledge from the data is encoded in sets of relationships
2. the optimal set size is a function of the (data, query) pair
---
#### Our Basic Strategy
Given $N$ entities, each measured with up to $M$ modalities,
as well as directly measuring $R$ relationships,
build the following data structures:
1. For each modality:
    1. choose a metric
    2. build the $k$-nearest-neighbor graph, for all $k = 1, \ldots, N$
    3. build a (centered) pairwise distance matrix
2. For each relationship:
    1. build the $k$-nearest-neighbor graph, for all $k = 1, \ldots, N$
    2. convert the relationship into a centered distance matrix
3. Estimate the jointly optimal set of neighborhood sizes simultaneously across all modalities
4. For each modality, sparsify/truncate/censor its pairwise distance matrix to include only those neighbors
5. Compute the Hadamard tensor product (sketched on the next slide)
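---
### Basic Strategy: Sketch
A minimal numpy sketch of the pipeline above for two modalities, under simplifying assumptions: a euclidean metric and a single shared neighborhood size $k$. The actual MGC statistic uses a modified centering, excludes the diagonal, and jointly optimizes the scales.
```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_correlation(X, Y, k, metric="euclidean"):
    """Truncated distance-correlation statistic at neighborhood size k."""
    Dx = squareform(pdist(X, metric=metric))  # choose metric, build distances
    Dy = squareform(pdist(Y, metric=metric))
    # neighbor rank of j with respect to i encodes all k-NN graphs at once
    Rx = np.argsort(np.argsort(Dx, axis=1), axis=1)
    Ry = np.argsort(np.argsort(Dy, axis=1), axis=1)

    def center(D):  # double-centering, as in dCorr
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()

    Ax = np.where(Rx < k, center(Dx), 0.0)  # censor to the k nearest neighbors
    Ay = np.where(Ry < k, center(Dy), 0.0)
    # Hadamard (elementwise) product, normalized to a correlation-like statistic
    return (Ax * Ay).sum() / np.sqrt((Ax * Ax).sum() * (Ay * Ay).sum())

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                        # modality 1
Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(n, 1))  # modality 2, nonlinear in X
print(local_correlation(X, Y, k=20))
```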
---
### Multiscale Graph Correlation
--
dCorr(X,Y) = 0.15, p-value < 0.001

MGC(X,Y) = 0.15, p-value < 0.001
--
dCorr(X,Y) = 0.01, p-value = 0.3

MGC(X,Y) = .r[0.13], p-value < .r[0.001]
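---
### Distance Correlation (sketch)
For reference, a sketch of the (biased, V-statistic) sample distance correlation that MGC generalizes; dCorr fixes a single global scale, which is why it can miss the second relationship above. The data below are illustrative, not the simulations behind the slide's numbers.
```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcorr(x, y):
    """Sample distance correlation (V-statistic version)."""
    def centered(v):
        D = squareform(pdist(np.reshape(v, (len(v), -1))))
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()  # squared sample distance covariance
    return dcov2 / np.sqrt((A * A).mean() * (B * B).mean())

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)
print(dcorr(x, x + 0.3 * rng.normal(size=n)))  # linear: large dCorr
t = rng.uniform(0, 2 * np.pi, n)
print(dcorr(np.cos(t), np.sin(t)))  # circle: dependent, yet dCorr is weak
```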
---
### Theoretical Background
.r[Non-Parametric Dependence] is the primary quantity of interest:
- bounds predictive accuracy
- bounds causal inference
- used for feature selection / screening
- mutual information is a special case
---
### Theoretical Results
- (nearly) all hypotheses can be stated in terms of dependence
- MGC .r[connects] previously disparate data science approaches
- kernel learning
- distance-based
- nearest-neighbors
- multiscale analysis
---
### Theoretical Limitations
MGC theoretically satisfies all of the desiderata, with the following caveats. Specifically, MGC requires:
1. .r[Sufficiency]: metric must be of "strong negative type"
2. .r[Uncertainty]: samples are "exchangeable" with finite variance
3. .r[Complexity]: $R=2$ or $M=2$
4. .r[Consistency]: $\mathcal{O}(1/n)$
4. .r[Robustness]: non-parametric
4. .r[Efficiency]: $\mathcal{O}(n^2 \log n)$
---
class: middle
# .center[Numerical Analysis]
---
class: top, left
### Setup
- sample $(x_i, y_i) \sim F$ iid, $i = 1, \ldots, n$, for each of 20 different functions
- in each case, there is *some* true but unknown relationship
- *power* is the probability of detecting a true relationship
- $N_\beta$ is the number of samples required to achieve power $\beta$ (estimation sketched on the next slide)
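---
### Estimating Power (sketch)
A sketch of how the power curves are estimated: Monte Carlo over simulated datasets, rejecting via a permutation test. `abs_pearson` is a deliberately weak stand-in statistic; the benchmarks plug in MGC, dCorr, HHG, HSIC, etc. $N_\beta$ is then the smallest $n$ whose estimated power reaches $\beta$.
```python
import numpy as np

def permutation_pvalue(x, y, statistic, n_perms=200, rng=None):
    """P-value of an independence test via permutations of y."""
    if rng is None:
        rng = np.random.default_rng()
    observed = statistic(x, y)
    null = [statistic(x, rng.permutation(y)) for _ in range(n_perms)]
    return (1 + sum(t >= observed for t in null)) / (1 + n_perms)

def power(sample, statistic, n=100, alpha=0.05, n_trials=100, seed=0):
    """Fraction of Monte Carlo trials in which the test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        x, y = sample(n, rng)
        rejections += permutation_pvalue(x, y, statistic, rng=rng) <= alpha
    return rejections / n_trials

def quadratic(n, rng):  # one of the function types: even, so Pearson fails
    x = rng.uniform(-1, 1, n)
    return x, x ** 2 + 0.1 * rng.normal(size=n)

def abs_pearson(x, y):  # weak stand-in statistic
    return abs(np.corrcoef(x, y)[0, 1])

print(power(quadratic, abs_pearson))  # near alpha: essentially no power
```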
---
class: top, left
### 20 Different Functions (1D version)
---
class: top, left
### MGC Outperforms Benchmarks
---
### MGC Outperforms Benchmarks
Relative sample size to achieve 85% power
Other approaches with the same theoretical properties and limitations (HHG, dCorr, HSIC)
tend to require **double** or **triple** the samples to achieve the same power
---
### MGC Biomedical Applications
- detecting dependence (human brain imaging)
- graph topology vs node attributes (worm network)
- signal subgraph detection (human brain network and personality)
- effect size estimation ("batches" of imaging)
- feature selection/screening (cancer genetics)
- machine learning / classification (proteomics)
- DARPA's SIGMA data
---
class: middle
# .center[Bleeding Edge]
---
### DARPA's SIGMA Data
- Details of the analysis will be provided after lunch
- Existing MGC theory and methods were inadequate to perform the tasks
- Computationally, MGC required $\mathcal{O}(r \times n^2 \log n)$ run time,
where $r$ is the number of permutations in the permutation test
- SIGMA data had $n=25,000$, which was too large for existing MGC to handle
---
### Accelerated MGC
- Goal: accelerate MGC without weakening theory
- Reduce to near-linear time, with no permutation test: $\mathcal{O}(r \times n^2 \log n) \rightarrow \mathcal{O}(n \log n)$
- Algorithm (sketched on the next slide):
    - for $i = 1, \ldots, \sqrt{n}$:
        - randomly subsample $\sqrt{n}$ points
        - apply MGC to obtain $t_i$
    - compute $\langle t_i \rangle$
- Variance across subsamples provides confidence intervals
- Reduces computational time in practice from hours to minutes
- But incurs a minor loss of power
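---
### Accelerated MGC (sketch)
A sketch of the subsampling scheme, where `mgc_statistic` stands in for any $\mathcal{O}(m^2 \log m)$ MGC implementation; with $\sqrt{n}$ subsamples of size $\sqrt{n}$, the total cost is roughly $\mathcal{O}(n \log n)$. The normal-theory interval below is a simplifying assumption.
```python
import numpy as np

def subsampled_mgc(x, y, mgc_statistic, rng=None):
    """Mean MGC statistic over sqrt(n) subsamples of size sqrt(n), with CI."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    m = int(np.sqrt(n))  # subsample size, and number of subsamples
    stats = []
    for _ in range(m):   # sqrt(n) independent subsamples
        idx = rng.choice(n, size=m, replace=False)
        stats.append(mgc_statistic(x[idx], y[idx]))
    stats = np.asarray(stats)
    mean = stats.mean()
    # variance across subsamples yields a rough 95% confidence interval
    half_width = 1.96 * stats.std(ddof=1) / np.sqrt(m)
    return mean, (mean - half_width, mean + half_width)
```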
---
### Extended Results
1. Proved equivalence of existing kernel and distance approaches
2. Proved random forest (RF) induces a universal kernel
3. Demonstrated empirically that RF yields improved results
---
### Next Steps
By coupling RF with MGC, we strengthen the theoretical results:
1. Sufficiency: metric must be of "strong negative type"
- .r[RF can learn metric]
2. Uncertainty: samples are "exchangeable" with finite variance
- .r[just requires exchangeability]
3. Complexity: $R=2$ or $M=2$
- .r[RF can handle $R>2$]
4. Consistency: $\mathcal{O}(1/n)$
- .r[$\mathcal{O}(1/n^2)$?]
4. Robustness: non-parametric
- .r[yes]
4. Efficiency: $\mathcal{O}(n^2 \log n)$
- .r[$\mathcal{O}(n \log n)$]
RF also adds additional queries, including entropy, conditional entropy, mutual information, classification, and regression (RF-learned distance sketched on the next slide)
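---
### RF-Learned Distance (sketch)
A sketch of the RF-induced similarity: two points are similar when they fall in the same leaf across many trees, and $1 - \text{proximity}$ is a learned distance that can replace the hand-chosen metric in MGC. The model choice and data here are illustrative assumptions.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_distance(X, y, n_trees=500, random_state=0):
    """Pairwise distance matrix induced by random-forest leaf co-occurrence."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=random_state)
    rf.fit(X, y)
    leaves = rf.apply(X)  # (n_samples, n_trees) leaf index per tree
    # proximity[i, j] = fraction of trees in which i and j share a leaf
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return 1.0 - proximity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
D = rf_distance(X, y)  # feed into MGC in place of a euclidean distance matrix
```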
---
class: middle
# .center[Discussion]
---
### References
- MGC foundational theory [[1]](https://arxiv.org/abs/1710.09768)
- MGC biomedical applications [[2]](https://arxiv.org/abs/1609.05148)
- MGC for independence between graph topology & attributes [[3]](https://arxiv.org/abs/1703.10136)
- MGC for signal subgraph detection [[4]](https://arxiv.org/abs/1801.07683)
- MGC for clustering (ish) [[5]](https://arxiv.org/abs/1710.09859)
- Equivalence of distance and kernel learning [[6]](https://arxiv.org/abs/1806.05514)
- R package: [CRAN link](https://cran.r-project.org/web/packages/mgc/index.html); ~80 monthly downloads
---
### Summary
- MGC is a knowledge representation system (KRS) that can discover latent geometric structure
- We have used it to answer various query types
- Has a rigorous mathematical and statistical foundation
- SME can encode domain specific knowledge
- Coupling with RF generalizes MGC and improves its theoretical and empirical performance
- Specifically useful for DARPA SIGMA (as we will show later)
---
class: top, left
### Acknowledgements
Carey Priebe
Cencheng Shen
Shangsi Wang
Youjin Lee
Eric Bridgeford
♥, 🦁, 👪, 🌎, 🌌
---
### Questions?
---
### DARPA SIGMA Data
Some radioactive substance is moving around, and we have
- 3 days of data
- 25 sensors
- 15 features per sensor
- each feature is sampled at a different frequency
- some features on some days include $\approx$ 25,000 samples
- 2 features of particular interest: gamma and neutron
--
### Queries
1. Do any sensors have information about radiation? If so, which?
2. Do any features have information about radiation? If so, which?
3. Are any features redundant?
---
### Data Processing Pipeline
1. Open hdf5 files
2. Convert to 2D arrays
3. Run MGC on each
4. Visualize results (pipeline sketched on the next slide)
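---
### Pipeline (sketch)
A sketch of the four steps, with illustrative file and dataset names (not the actual SIGMA schema). `pearsonr` is imported only as a runnable stand-in for the independence test; the real pipeline runs the accelerated MGC sketched earlier.
```python
import h5py
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr as independence_test  # stand-in for MGC

def load_feature(path, name):
    with h5py.File(path, "r") as f:  # 1. open the hdf5 file
        return np.asarray(f[name])   # 2. read one feature as a 1-D array

gamma = load_feature("sensor_day1.hdf5", "gamma")      # hypothetical names
neutron = load_feature("sensor_day1.hdf5", "neutron")

# features are sampled at different frequencies: align to a common length
m = min(len(gamma), len(neutron))
stat, pvalue = independence_test(gamma[:m], neutron[:m])  # 3. run the test

plt.scatter(gamma[:m], neutron[:m], s=2)                  # 4. visualize
plt.title(f"statistic = {stat:.2f}, p = {pvalue:.3g}")
plt.show()
```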
---
### What does raw data look like?
---
### Do any sensors/features have information about gamma?
---
### Do any sensors/features have information about neutron?
---
### Are any features correlated with one another?
---
### SIGMA Conclusions
1. Neutron measurements seem to be redundant
2. Several of the features are uninformative with regard to radiation (e.g., location, voltage)
3. All features that are informative about gamma are also informative about one another