Statistics¶

Initialize a linear mixed model from a matrix table. 

For each row, test an input variable for association using a linear mixed model. 

For each row, test an input variable for association with response variables using linear regression. 

For each row, test an input variable for association with a binary response variable using logistic regression. 

For each row, test an input variable for association with a count response variable using Poisson regression. 

Run principal component analysis (PCA) on numeric columns derived from a matrix table. 

Computes the correlation matrix between row vectors. 

hail.methods.
linear_mixed_model
(y, x, z_t=None, k=None, p_path=None, overwrite=False, standardize=True, mean_impute=True)[source]¶ Initialize a linear mixed model from a matrix table.
Examples
Initialize a model using three fixed effects (including intercept) and genetic marker random effects:
>>> marker_ds = dataset.filter_rows(dataset.use_as_marker) >>> model, _ = hl.linear_mixed_model( ... y=marker_ds.pheno.height, ... x=[1, marker_ds.pheno.age, marker_ds.pheno.is_female], ... z_t=marker_ds.GT.n_alt_alleles(), ... p_path='output/p.bm')
Fit the model and examine \(h^2\):
>>> model.fit() >>> model.h_sq
Sanitycheck the normalized likelihood of \(h^2\) over the percentile grid:
>>> import matplotlib.pyplot as plt >>> plt.plot(range(101), model.h_sq_normalized_lkhd())
For this value of \(h^2\), test each variant for association:
>>> result_table = hl.linear_mixed_regression_rows(dataset.GT.n_alt_alleles(), model)
Alternatively, one can define a fullrank model using a precomputed kinship matrix \(K\) in ndarray form. When \(K\) is the realized relationship matrix defined by the genetic markers, we obtain the same model as above with \(P\) written as a block matrix but returned as an ndarray:
>>> rrm = hl.realized_relationship_matrix(marker_ds.GT).to_numpy() >>> model, p = hl.linear_mixed_model( ... y=dataset.pheno.height, ... x=[1, dataset.pheno.age, dataset.pheno.is_female], ... k=rrm, ... p_path='output/p.bm', ... overwrite=True)
Notes
See
LinearMixedModel
for details on the model and notation.Exactly one of z_t and k must be set.
If z_t is set, the model is lowrank if the number of samples \(n\) exceeds the number of random effects \(m\). At least one dimension must be less than or equal to 46300. If standardize is true, each random effect is first standardized to have mean 0 and variance \(\frac{1}{m}\), so that the diagonal values of the kinship matrix \(K = ZZ^T\) are 1.0 in expectation. This kinship matrix corresponds to the
realized_relationship_matrix()
in genetics. SeeLinearMixedModel.from_random_effects()
andBlockMatrix.svd()
for more details.If k is set, the model is fullrank. For correct results, the indices of k must be aligned with columns of the source of y. Set p_path if you plan to use the model in
linear_mixed_regression_rows()
. k must be positive semidefinite; symmetry is not checked as only the lower triangle is used. SeeLinearMixedModel.from_kinship()
for more details.Missing, nan, or infinite values in y or x will raise an error. If set, z_t may only have missing values if mean_impute is true, in which case missing values of are set to the row mean. We recommend setting mean_impute to false if you expect no missing values, both for performance and as a sanity check.
Warning
If the rows of the matrix table have been filtered to a small fraction, then
MatrixTable.repartition()
before this method to improve performance. Parameters
y (
Float64Expression
) – Columnindexed expression for the observations (rows of \(y\)). Must have no missing values.x (
list
ofFloat64Expression
) – Nonempty list of columnindexed expressions for the fixed effects (rows of \(X\)). Each expression must have the same source as y or no source (e.g., the intercept1.0
). Must have no missing values.z_t (
Float64Expression
, optional) – Entryindexed expression for each mixed effect. These values are rowstandardized to variance \(1 / m\) to form the entries of \(Z^T\). If mean_impute is false, must have no missing values. Exactly one of z_t and k must be set.k (
numpy.ndarray
, optional) – Kinship matrix \(K\). Exactly one of z_t and k must be set.p_path (
str
, optional) – Path at which to write the projection \(P\) as a block matrix. Required if z_t is set.overwrite (
bool
) – IfTrue
, overwrite an existing file at p_path.standardize (
bool
) – IfTrue
, standardize z_t by row to mean 0 and variance \(\frac{1}{m}\).mean_impute (
bool
) – IfTrue
, meanimpute missing values of z_t by row.
 Returns
model (
LinearMixedModel
) – Linear mixed model ready to be fit.p (
numpy.ndarray
orBlockMatrix
) – Matrix \(P\) whose rows are the eigenvectors of \(K\). The type is block matrix if the model is low rank (i.e., if z_t is set and \(n > m\)).

hail.methods.
linear_mixed_regression_rows
(entry_expr, model, pa_t_path=None, a_t_path=None, mean_impute=True, partition_size=None, pass_through=())[source]¶ For each row, test an input variable for association using a linear mixed model.
Examples
See the example in
linear_mixed_model()
and section below on efficiently testing multiple responses or sets of fixed effects.Notes
See
LinearMixedModel
for details on the model and notation.This method packages up several steps for convenience:
Read the transformation \(P\) from
model.p_path
.Write entry_expr at a_t_path as the block matrix \(A^T\) with block size that of \(P\). The parallelism is
n_rows / block_size
.Multiply and write \(A^T P^T\) at pa_t_path. The parallelism is the number of blocks in \((PA)^T\), which equals
(n_rows / block_size) * (model.r / block_size)
.Compute regression results per row with
LinearMixedModel.fit_alternatives()
. The parallelism isn_rows / partition_size
.
If pa_t_path and a_t_path are not set, temporary files are used.
entry_expr may only have missing values if mean_impute is true, in which case missing values of are set to the row mean. We recommend setting mean_impute to false if you expect no missing values, both for performance and as a sanity check.
Efficiently varying the response or set of fixed effects
Computing \(K\), \(P\), \(S\), \(A^T\), and especially the product \((PA)^T\) may require significant compute when \(n\) and/or \(m\) is large. However these quantities are all independent of the response \(y\) or fixed effects \(X\)! And with the model diagonalized, Step 4 above is fast and scalable.
So having run linear mixed regression once, we can compute \(h^2\) and regression statistics for another response or set of fixed effects on the same samples at the roughly the speed of
linear_regression_rows()
.For example, having collected another y and x as ndarrays, one can construct a new linear mixed model directly.
Supposing the model is fullrank and p is an ndarray:
>>> model = hl.stats.LinearMixedModel(p @ y, p @ x, s) >>> model.fit() >>> result_ht = model.fit_alternatives(pa_t_path)
Supposing the model is lowrank and p is a block matrix:
>>> p = BlockMatrix.read(p_path) >>> py, px = (p @ y).to_numpy(), (p @ x).to_numpy() >>> model = LinearMixedModel(py, px, s, y, x) >>> model.fit() >>> result_ht = model.fit_alternatives(pa_t_path, a_t_path)
In either case, one can easily loop through many responses or conditional analyses. To join results back to the matrix table:
>>> dataset = dataset.add_row_index() >>> dataset = dataset.annotate_rows(lmmreg=result_ht[dataset.row_idx]])
Warning
For correct results, the columnindex of entry_expr must correspond to the sample index of the model. This will be true, for example, if model was created with
linear_mixed_model()
using (a possibly rowfiltered version of) the source of entry_expr, or if y and x were collected to arrays from this source. Hail will raise an error if the number of columns does not matchmodel.n
, but will not detect, for example, permuted samples.The warning on
BlockMatrix.write_from_entry_expr()
applies to this method when the number of samples is large.Note
Use the pass_through parameter to include additional row fields from matrix table underlying
entry_expr
. For example, to include an “rsid” field, set` pass_through=[‘rsid’]`` orpass_through=[mt.rsid]
. Parameters
entry_expr (
Float64Expression
) – Entryindexed expression for input variable. If mean_impute is false, must have no missing values.model (
LinearMixedModel
) – Fit linear mixed model withpath_p
set.pa_t_path (
str
, optional) – Path at which to store the transpose of \(PA\). If not set, a temporary file is used.a_t_path (
str
, optional) – Path at which to store the transpose of \(A\). If not set, a temporary file is used.mean_impute (
bool
) – Meanimpute missing values of entry_expr by row.partition_size (
int
) – Number of rows to process per partition. Default given by block size of \(P\).pass_through (
list
ofstr
orExpression
) – Additional row fields to include in the resulting table.
 Returns

hail.methods.
linear_regression_rows
(y, x, covariates, block_size=16, pass_through=(), *, weights=None)[source]¶ For each row, test an input variable for association with response variables using linear regression.
Examples
>>> result_ht = hl.linear_regression_rows( ... y=dataset.pheno.height, ... x=dataset.GT.n_alt_alleles(), ... covariates=[1, dataset.pheno.age, dataset.pheno.is_female])
Warning
As in the example, the intercept covariate
1
must be included explicitly if desired.Warning
If y is a single value or a list,
linear_regression_rows()
considers the same set of columns (i.e., samples, points) for every response variable and row, namely those columns for which all response variables and covariates are defined.If y is a list of lists, then each inner list is treated as an independent group, subsetting columns for missingness separately.
Notes
With the default root and y a single expression, the following rowindexed fields are added.
<row key fields> (Any) – Row key fields.
<pass_through fields> (Any) – Row fields in pass_through.
n (
tint32
) – Number of columns used.sum_x (
tfloat64
) – Sum of input values x.y_transpose_x (
tfloat64
) – Dot product of response vector y with the input vector x.beta (
tfloat64
) – Fit effect coefficient of x, \(\hat\beta_1\) below.standard_error (
tfloat64
) – Estimated standard error, \(\widehat{\mathrm{se}}_1\).t_stat (
tfloat64
) – \(t\)statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}_1\).p_value (
tfloat64
) – \(p\)value.
If y is a list of expressions, then the last five fields instead have type
tarray
oftfloat64
, with corresponding indexing of the list and each array.If y is a list of lists of expressions, then n and sum_x are of type
array<float64>
, and the last five fields are of typearray<array<float64>>
. Index into these arrays witha[index_in_outer_list, index_in_inner_list]
. For example, ify=[[a], [b, c]]
then the pvalue forb
isp_value[1][0]
.In the statistical genetics example above, the input variable x encodes genotype as the number of alternate alleles (0, 1, or 2). For each variant (row), genotype is tested for association with height controlling for age and sex, by fitting the linear regression model:
\[\mathrm{height} = \beta_0 + \beta_1 \, \mathrm{genotype} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is\_female} + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]Boolean covariates like \(\mathrm{is\_female}\) are encoded as 1 for
True
and 0 forFalse
. The null model sets \(\beta_1 = 0\).The standard leastsquares linear regression model is derived in Section 3.2 of The Elements of Statistical Learning, 2nd Edition. See equation 3.12 for the tstatistic which follows the tdistribution with \(n  k  1\) degrees of freedom, under the null hypothesis of no effect, with \(n\) samples and \(k\) covariates in addition to
x
.Note
Use the pass_through parameter to include additional row fields from matrix table underlying
x
. For example, to include an “rsid” field, setpass_through=['rsid']
orpass_through=[mt.rsid]
. Parameters
y (
Float64Expression
orlist
ofFloat64Expression
) – One or more columnindexed response expressions.x (
Float64Expression
) – Entryindexed expression for input variable.covariates (
list
ofFloat64Expression
) – List of columnindexed covariate expressions.block_size (
int
) – Number of row regressions to perform simultaneously per core. Larger blocks require more memory but may improve performance.pass_through (
list
ofstr
orExpression
) – Additional row fields to include in the resulting table.weights (
Float64Expression
orlist
ofFloat64Expression
) – Optional columnindexed weighting for doing weighted least squares regression. Specify a single weight if a single y or list of ys is specified. If a list of lists of ys is specified, specify one weight per inner list.
 Returns

hail.methods.
logistic_regression_rows
(test, y, x, covariates, pass_through=())[source]¶ For each row, test an input variable for association with a binary response variable using logistic regression.
Examples
Run the logistic regression Wald test per variant using a Boolean phenotype, intercept and two covariates stored in columnindexed fields:
>>> result_ht = hl.logistic_regression_rows( ... test='wald', ... y=dataset.pheno.is_case, ... x=dataset.GT.n_alt_alleles(), ... covariates=[1, dataset.pheno.age, dataset.pheno.is_female])
Run the logistic regression Wald test per variant using a list of binary (0/1) phenotypes, intercept and two covariates stored in columnindexed fields:
>>> result_ht = hl.logistic_regression_rows( ... test='wald', ... y=[dataset.pheno.is_case, dataset.pheno.is_case], # where pheno values are 0, 1, or missing ... x=dataset.GT.n_alt_alleles(), ... covariates=[1, dataset.pheno.age, dataset.pheno.is_female])
Warning
logistic_regression_rows()
considers the same set of columns (i.e., samples, points) for every row, namely those columns for which all response variables and covariates are defined. For each row, missing values of x are meanimputed over these columns. As in the example, the intercept covariate1
must be included explicitly if desired.Notes
This method performs, for each row, a significance test of the input variable in predicting a binary (casecontrol) response variable based on the logistic regression model. The response variable type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.
Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’). Hail only includes columns for which the response variable and all covariates are defined. For each row, Hail imputes missing input values as the mean of the nonmissing values.
The example above considers a model of the form
\[\mathrm{Prob}(\mathrm{is\_case}) = \mathrm{sigmoid}(\beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is\_female} + \varepsilon), \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]where \(\mathrm{sigmoid}\) is the sigmoid function, the genotype \(\mathrm{gt}\) is coded as 0 for HomRef, 1 for Het, and 2 for HomVar, and the Boolean covariate \(\mathrm{is\_female}\) is coded as for
True
(female) and 0 forFalse
(male). The null model sets \(\beta_1 = 0\).The structure of the emitted row field depends on the test statistic as shown in the tables below.
Test
Field
Type
Value
Wald
beta
float64
fit effect coefficient, \(\hat\beta_1\)
Wald
standard_error
float64
estimated standard error, \(\widehat{\mathrm{se}}\)
Wald
z_stat
float64
Wald \(z\)statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
Wald
p_value
float64
Wald pvalue testing \(\beta_1 = 0\)
LRT, Firth
beta
float64
fit effect coefficient, \(\hat\beta_1\)
LRT, Firth
chi_sq_stat
float64
deviance statistic
LRT, Firth
p_value
float64
LRT / Firth pvalue testing \(\beta_1 = 0\)
Score
chi_sq_stat
float64
score statistic
Score
p_value
float64
score pvalue testing \(\beta_1 = 0\)
For the Wald and likelihood ratio tests, Hail fits the logistic model for each row using Newton iteration and only emits the above fields when the maximum likelihood estimate of the coefficients converges. The Firth test uses a modified form of Newton iteration. To help diagnose convergence issues, Hail also emits three fields which summarize the iterative fitting process:
Test
Field
Type
Value
Wald, LRT, Firth
fit.n_iterations
int32
number of iterations until convergence, explosion, or reaching the max (25 for Wald, LRT; 100 for Firth)
Wald, LRT, Firth
fit.converged
bool
True
if iteration convergedWald, LRT, Firth
fit.exploded
bool
True
if iteration explodedWe consider iteration to have converged when every coordinate of \(\beta\) changes by less than \(10^{6}\). For Wald and LRT, up to 25 iterations are attempted; in testing we find 4 or 5 iterations nearly always suffice. Convergence may also fail due to explosion, which refers to lowlevel numerical linear algebra exceptions caused by manipulating illconditioned matrices. Explosion may result from (nearly) linearly dependent covariates or complete separation.
A more common situation in genetics is quasicomplete seperation, e.g. variants that are observed only in cases (or controls). Such variants inevitably arise when testing millions of variants with very low minor allele count. The maximum likelihood estimate of \(\beta\) under logistic regression is then undefined but convergence may still occur after a large number of iterations due to a very flat likelihood surface. In testing, we find that such variants produce a secondary bump from 10 to 15 iterations in the histogram of number of iterations per variant. We also find that this faux convergence produces large standard errors and large (insignificant) pvalues. To not miss such variants, consider using Firth logistic regression, linear regression, or groupbased tests.
Here’s a concrete illustration of quasicomplete seperation in R. Suppose we have 2010 samples distributed as follows for a particular variant:
Status
HomRef
Het
HomVar
Case
1000
10
0
Control
1000
0
0
The following R code fits the (standard) logistic, Firth logistic, and linear regression models to this data, where
x
is genotype,y
is phenotype, andlogistf
is from the logistf package:x < c(rep(0,1000), rep(1,1000), rep(1,10) y < c(rep(0,1000), rep(0,1000), rep(1,10)) logfit < glm(y ~ x, family=binomial()) firthfit < logistf(y ~ x) linfit < lm(y ~ x)
The resulting pvalues for the genotype coefficient are 0.991, 0.00085, and 0.0016, respectively. The erroneous value 0.991 is due to quasicomplete separation. Moving one of the 10 hets from case to control eliminates this quasicomplete separation; the pvalues from R are then 0.0373, 0.0111, and 0.0116, respectively, as expected for a less significant association.
The Firth test reduces bias from small counts and resolves the issue of separation by penalizing maximum likelihood estimation by the Jeffrey’s invariant prior. This test is slower, as both the null and full model must be fit per variant, and convergence of the modified Newton method is linear rather than quadratic. For Firth, 100 iterations are attempted for the null model and, if that is successful, for the full model as well. In testing we find 20 iterations nearly always suffices. If the null model fails to converge, then the logreg.fit fields reflect the null model; otherwise, they reflect the full model.
See Recommended joint and metaanalysis strategies for casecontrol association testing of single lowcount variants for an empirical comparison of the logistic Wald, LRT, score, and Firth tests. The theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes Statistical Theory. Firth introduced his approach in Bias reduction of maximum likelihood estimates, 1993. Heinze and Schemper further analyze Firth’s approach in A solution to the problem of separation in logistic regression, 2002.
Hail’s logistic regression tests correspond to the
b.wald
,b.lrt
, andb.score
tests in EPACTS. For each variant, Hail imputes missing input values as the mean of nonmissing input values, whereas EPACTS subsets to those samples with called genotypes. Hence, Hail and EPACTS results will currently only agree for variants with no missing genotypes.Note
Use the pass_through parameter to include additional row fields from matrix table underlying
x
. For example, to include an “rsid” field, setpass_through=['rsid']
orpass_through=[mt.rsid]
. Parameters
test ({‘wald’, ‘lrt’, ‘score’, ‘firth’}) – Statistical test.
y (
Float64Expression
orlist
ofFloat64Expression
) – One or more columnindexed response expressions. All nonmissing values must evaluate to 0 or 1. Note that aBooleanExpression
will be implicitly converted to aFloat64Expression
with this property.x (
Float64Expression
) – Entryindexed expression for input variable.covariates (
list
ofFloat64Expression
) – Nonempty list of columnindexed covariate expressions.pass_through (
list
ofstr
orExpression
) – Additional row fields to include in the resulting table.
 Returns

hail.methods.
poisson_regression_rows
(test, y, x, covariates, pass_through=())[source]¶ For each row, test an input variable for association with a count response variable using Poisson regression.
Notes
See
logistic_regression_rows()
for more info on statistical tests of general linear models.Note
Use the pass_through parameter to include additional row fields from matrix table underlying
x
. For example, to include an “rsid” field, setpass_through=['rsid']
orpass_through=[mt.rsid]
. Parameters
y (
Float64Expression
) – Columnindexed response expression. All nonmissing values must evaluate to a nonnegative integer.x (
Float64Expression
) – Entryindexed expression for input variable.covariates (
list
ofFloat64Expression
) – Nonempty list of columnindexed covariate expressions.pass_through (
list
ofstr
orExpression
) – Additional row fields to include in the resulting table.
 Returns

hail.methods.
pca
(entry_expr, k=10, compute_loadings=False)[source]¶ Run principal component analysis (PCA) on numeric columns derived from a matrix table.
Examples
For a matrix table with variant rows, sample columns, and genotype entries, compute the top 2 PC sample scores and eigenvalues of the matrix of 0s and 1s encoding missingness of genotype calls.
>>> eigenvalues, scores, _ = hl.pca(hl.int(hl.is_defined(dataset.GT)), ... k=2)
Warning
This method does not automatically meancenter or normalize each column. If desired, such transformations should be incorporated in entry_expr.
Hail will return an error if entry_expr evaluates to missing, nan, or infinity on any entry.
Notes
PCA is run on the columns of the numeric matrix obtained by evaluating entry_expr on each entry of the matrix table, or equivalently on the rows of the transposed numeric matrix \(M\) referenced below.
PCA computes the SVD
\[M = USV^T\]where columns of \(U\) are left singular vectors (orthonormal in \(\mathbb{R}^n\)), columns of \(V\) are right singular vectors (orthonormal in \(\mathbb{R}^m\)), and \(S=\mathrm{diag}(s_1, s_2, \ldots)\) with ordered singular values \(s_1 \ge s_2 \ge \cdots \ge 0\). Typically one computes only the first \(k\) singular vectors and values, yielding the best rank \(k\) approximation \(U_k S_k V_k^T\) of \(M\); the truncations \(U_k\), \(S_k\) and \(V_k\) are \(n \times k\), \(k \times k\) and \(m \times k\) respectively.
From the perspective of the rows of \(M\) as samples (data points), \(V_k\) contains the loadings for the first \(k\) PCs while \(MV_k = U_k S_k\) contains the first \(k\) PC scores of each sample. The loadings represent a new basis of features while the scores represent the projected data on those features. The eigenvalues of the Gramian \(MM^T\) are the squares of the singular values \(s_1^2, s_2^2, \ldots\), which represent the variances carried by the respective PCs. By default, Hail only computes the loadings if the
loadings
parameter is specified.Scores are stored in a
Table
with the column key of the matrix table as key and a field scores of typearray<float64>
containing the principal component scores.Loadings are stored in a
Table
with the row key of the matrix table as key and a field loadings of typearray<float64>
containing the principal component loadings.The eigenvalues are returned in descending order, with scores and loadings given the corresponding array order.

hail.methods.
row_correlation
(entry_expr, block_size=None)[source]¶ Computes the correlation matrix between row vectors.
Examples
Consider the following dataset with three variants and four samples:
>>> data = [{'v': '1:1:A:C', 's': 'a', 'GT': hl.Call([0, 0])}, ... {'v': '1:1:A:C', 's': 'b', 'GT': hl.Call([0, 0])}, ... {'v': '1:1:A:C', 's': 'c', 'GT': hl.Call([0, 1])}, ... {'v': '1:1:A:C', 's': 'd', 'GT': hl.Call([1, 1])}, ... {'v': '1:2:G:T', 's': 'a', 'GT': hl.Call([0, 1])}, ... {'v': '1:2:G:T', 's': 'b', 'GT': hl.Call([1, 1])}, ... {'v': '1:2:G:T', 's': 'c', 'GT': hl.Call([0, 1])}, ... {'v': '1:2:G:T', 's': 'd', 'GT': hl.Call([0, 0])}, ... {'v': '1:3:C:G', 's': 'a', 'GT': hl.Call([0, 1])}, ... {'v': '1:3:C:G', 's': 'b', 'GT': hl.Call([0, 0])}, ... {'v': '1:3:C:G', 's': 'c', 'GT': hl.Call([1, 1])}, ... {'v': '1:3:C:G', 's': 'd', 'GT': hl.missing(hl.tcall)}] >>> ht = hl.Table.parallelize(data, hl.dtype('struct{v: str, s: str, GT: call}')) >>> mt = ht.to_matrix_table(row_key=['v'], col_key=['s'])
Compute genotype correlation between all pairs of variants:
>>> ld = hl.row_correlation(mt.GT.n_alt_alleles()) >>> ld.to_numpy() array([[ 1. , 0.85280287, 0.42640143], [0.85280287, 1. , 0.5 ], [ 0.42640143, 0.5 , 1. ]])
Compute genotype correlation between consecutivelyindexed variants:
>>> ld.sparsify_band(lower=0, upper=1).to_numpy() array([[ 1. , 0.85280287, 0. ], [ 0. , 1. , 0.5 ], [ 0. , 0. , 1. ]])
Warning
Rows with a constant value (i.e., zero variance) will result nan correlation values. To avoid this, first check that all rows vary or filter out constant rows (for example, with the help of
aggregators.stats()
).Notes
In this method, each row of entries is regarded as a vector with elements defined by entry_expr and missing values meanimputed per row. The
(i, j)
element of the resulting block matrix is the correlation between rowsi
andj
(as 0indexed by order in the matrix table; seeadd_row_index()
).The correlation of two vectors is defined as the Pearson correlation coeffecient between the corresponding empirical distributions of elements, or equivalently as the cosine of the angle between the vectors.
This method has two stages:
writing the rownormalized block matrix to a temporary file on persistent disk with
BlockMatrix.from_entry_expr()
. The parallelism isn_rows / block_size
.reading and multiplying this block matrix by its transpose. The parallelism is
(n_rows / block_size)^2
if all blocks are computed.
Warning
See all warnings on
BlockMatrix.from_entry_expr()
. In particular, for large matrices, it may be preferable to run the two stages separately, saving the rownormalized block matrix to a file on external storage withBlockMatrix.write_from_entry_expr()
.The resulting number of matrix elements is the square of the number of rows in the matrix table, so computing the full matrix may be infeasible. For example, ten million rows would produce 800TB of float64 values. The blocksparse representation on BlockMatrix may be used to work efficiently with regions of such matrices, as in the second example above and
ld_matrix()
.To prevent excessive recomputation, be sure to write and read the (possibly blocksparsified) result before multiplication by another matrix.
 Parameters
entry_expr (
Float64Expression
) – Entryindexed numeric expression on matrix table.block_size (
int
, optional) – Block size. Default given byBlockMatrix.default_block_size()
.
 Returns
BlockMatrix
– Correlation matrix between row vectors. Row and column indices correspond to matrix table row index.