Score samples by gene signatures

Compute gene signature single sample scores in one of different ways. You can choose to apply either the original procedure or one of three single sample scoring methods: the combined z-score (Lee et al., 2008), the single sample GSEA (Barbie et al., 2009) or the singscore method (Foroutan et al., 2018).

Usage

hack_sig(
  expr_data,
  signatures = "all",
  method = "original",
  direction = "none",
  sample_norm = "raw",
  rank_norm = "none",
  alpha = 0.25
)

Arguments

expr_data

A normalized gene expression matrix (or data frame) with gene symbols as row names and samples as columns.

signatures

It can be a list of signatures or a character vector indicating keywords for a group of signatures. The default ("all") will cause the function to compute single sample scores for all the signatures implemented in hacksig.

method

A character string specifying which method to use for computing the single sample score for each signature. You can choose one of:

"original", the original method used by the authors of the signature;
"zscore", the combined z-score method;
"ssgsea", the single sample GSEA method;
"singscore", the singscore method;

direction

A character string specifying the singscore computation method depending on the direction of the signatures. Can be on of:

"none" (default), undirected signatures, that is you do not know whether the genes are up- or down-regulated;
"up", all genes in the signature are supposed to be up-regulated;
"down", all genes in the signature are supposed to be down-regulated;

sample_norm

A character string specifying the type of normalization affecting the single sample GSEA scores. Can be one of:

"raw" (default), obtain raw scores;
"separate", normalize raw scores in \([0, 1]\) across samples for each signature separately.
"all", normalize raw scores both across samples and signatures.

rank_norm

A character string specifying how gene expression ranks should be normalized in the single sample GSEA procedure. Valid choices are:

"none" (default), no rank normalization;
"rank", ranks are multiplied by 10000 / nrow(expr_data);
"logrank", normalized ranks are logged.

alpha

A numeric scalar. Exponent in the running sum of the single sample GSEA score calculation which weighs the gene ranks. Defaults to \(\alpha = 0.25\).

Value

A tibble with one row for each sample in expr_data, a column sample_id indicating sample identifiers and one column for each input signature giving single sample scores.

Details

For "original" method, it is intended the procedure used in the original publication by the authors for computing the signature score. hack_sig() can compute signature scores with the original method only if this is a relatively simple procedure (e.g weighted sum of fitted model coefficients and expression values). For more complex methods, such as CINSARC, ESTIMATE and Immunophenoscore, use the dedicated functions.

If signatures is a custom list of gene signatures, then the "ssgsea" method will be applied by default.

Algorithm

This section gives a brief explanation of how single sample scores are obtained from different methods.

Combined z-score

Gene expression values are centered by their mean value and scaled by their standard deviation across samples for each gene (z-scores). Then, for each sample and signature, corresponding z-scores are added up and divided by the square root of the signature size (i.e. the number of genes composing a signature).

The combined z-score method is also implemented in the R package GSVA (Hänzelmann et al., 2013).

Single sample GSEA

For each sample, genes are ranked by expression value in increasing order and rank normalization may follow (see argument rank_norm). Then, two probability-like vectors are computed for each sample and signature:

\(P_{in}\), the cumulative sum of weighted ranks divided by their total sum for genes in the signature;
\(P_{out}\), the cumulative sum of ones (indicating genes not in the signature) divided by the number of genes not in the signature.

The single sample GSEA score is obtained by adding up the elements of the vector difference \(P_{in} - P_{out}\). Finally, single sample scores could be normalized either across samples or across gene signatures and samples.

The single sample GSEA method is also implemented in the R package GSVA (Hänzelmann et al., 2013).

Singscore

For signatures whose genes are supposed to be up- or down-regulated, genes are ranked by expression value in increasing or decreasing order, respectively. For signatures whose direction is unknown, genes are ranked by absolute expression in increasing order and are median-centered. Enrichment scores are then computed for each sample and signature by averaging gene ranks for genes in the signature. Finally, normalized scores are obtained by subtracting the theoretical minimum mean rank from the score and dividing by the difference between the theoretical maximum and minimum mean ranks.

The hacksig implementation of this method works only with unidirectional (i.e. all genes up- or down-regulated) and undirected gene signatures. If you want to get single sample scores for bidirectional gene signatures (i.e. signatures composed of both up- and down-regulated genes), please use the R package singscore (Foroutan et al., 2018).

References

Barbie, D. A., Tamayo, P., Boehm, J. S., Kim, S. Y., Moody, S. E., Dunn, I. F., Schinzel, A. C., Sandy, P., Meylan, E., Scholl, C., Fröhling, S., Chan, E. M., Sos, M. L., Michel, K., Mermel, C., Silver, S. J., Weir, B. A., Reiling, J. H., Sheng, Q., Gupta, P. B., … Hahn, W. C. (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(7269), 108–112. doi:10.1038/nature08460 .

Foroutan, M., Bhuva, D. D., Lyu, R., Horan, K., Cursons, J., & Davis, M. J. (2018). Single sample scoring of molecular phenotypes. BMC bioinformatics, 19(1), 404. doi:10.1186/s12859-018-2435-4 .

Hänzelmann, S., Castelo, R., & Guinney, J. (2013). GSVA: gene set variation analysis for microarray and RNA-seq data. BMC bioinformatics, 14, 7. doi:10.1186/1471-2105-14-7 .

Lee, E., Chuang, H. Y., Kim, J. W., Ideker, T., & Lee, D. (2008). Inferring pathway activity toward precise disease classification. PLoS computational biology, 4(11), e1000217. doi:10.1371/journal.pcbi.1000217 .

Examples

# Raw ssGSEA scores for all implemented signatures can be obtained with:
hack_sig(test_expr, method = "ssgsea")
#> Warning: ℹ No genes are present in 'expr_data' for the following signatures:
#> ✖ zhu2021_ferroptosis
#> ✖ rooney2015_cyt
#> # A tibble: 20 × 39
#>    sample_id ayers2017_immexp bai2019_immune cinsarc dececco2014_int172
#>    <chr>                <dbl>          <dbl>   <dbl>              <dbl>
#>  1 sample1            -3914.           2316.   -13.5              1288.
#>  2 sample10            1077.            575.   801.                811.
#>  3 sample11             501.           -490.  1340.               1244.
#>  4 sample12            2315.           1034.  -151.                981.
#>  5 sample13           -2179.            327.  1737.               1288.
#>  6 sample14            1274.           2551.  1792.               1118.
#>  7 sample15             668.           4211.   734.                961.
#>  8 sample16           -1014.           1283.  1123.                867.
#>  9 sample17            -476.           2509.   384.               2175.
#> 10 sample18            1101.           -315.  1911.                985.
#> 11 sample19            3938.           3566.  -220.               1959.
#> 12 sample2            -3348.           1350. -1070.               1322.
#> 13 sample20            1261.            975.  1442.                703.
#> 14 sample3             1697.           1829.  1805.                685.
#> 15 sample4              366.           5611.   326.               1684.
#> 16 sample5              969.           1224.   290.                718.
#> 17 sample6             3820.           -507.   782.                382.
#> 18 sample7              920.           3452.  1835.               1230.
#> 19 sample8              -13.8          -229.   462.                966.
#> 20 sample9             1819.           3073.   824.                525.
#> # ℹ 34 more variables: eschrich2009_rsi <dbl>, estimate_immune <dbl>,
#> #   estimate_stromal <dbl>, eustace2013_hypoxia <dbl>,
#> #   fan2021_ferroptosis <dbl>, fang2021_irgs <dbl>, han2021_ferroptosis <dbl>,
#> #   he2021_ferroptosis_a <dbl>, he2021_ferroptosis_b <dbl>, hu2021_derbp <dbl>,
#> #   huang2022_ferroptosis <dbl>, ips_cp <dbl>, ips_ec <dbl>, ips_mhc <dbl>,
#> #   ips_sc <dbl>, li2021_ferroptosis_a <dbl>, li2021_ferroptosis_b <dbl>,
#> #   li2021_ferroptosis_c <dbl>, li2021_ferroptosis_d <dbl>, …

# To obtain 0-1 normalized ssGSEA scores, use:
hack_sig(test_expr, method = "ssgsea", sample_norm = "separate")
#> Warning: ℹ No genes are present in 'expr_data' for the following signatures:
#> ✖ zhu2021_ferroptosis
#> ✖ rooney2015_cyt
#> # A tibble: 20 × 39
#>    sample_id ayers2017_immexp bai2019_immune cinsarc dececco2014_int172
#>    <chr>                <dbl>          <dbl>   <dbl>              <dbl>
#>  1 sample1             0             0.461     0.354             0.505 
#>  2 sample10            0.636         0.177     0.628             0.239 
#>  3 sample11            0.562         0.00273   0.808             0.481 
#>  4 sample12            0.793         0.252     0.308             0.334 
#>  5 sample13            0.221         0.136     0.942             0.505 
#>  6 sample14            0.661         0.500     0.960             0.411 
#>  7 sample15            0.584         0.771     0.605             0.323 
#>  8 sample16            0.369         0.293     0.736             0.271 
#>  9 sample17            0.438         0.493     0.488             1     
#> 10 sample18            0.639         0.0314    1                 0.337 
#> 11 sample19            1             0.666     0.285             0.880 
#> 12 sample2             0.0721        0.303     0                 0.524 
#> 13 sample20            0.659         0.242     0.842             0.179 
#> 14 sample3             0.715         0.382     0.964             0.169 
#> 15 sample4             0.545         1         0.468             0.726 
#> 16 sample5             0.622         0.283     0.456             0.188 
#> 17 sample6             0.985         0         0.621             0     
#> 18 sample7             0.616         0.647     0.974             0.473 
#> 19 sample8             0.497         0.0454    0.514             0.326 
#> 20 sample9             0.730         0.585     0.635             0.0796
#> # ℹ 34 more variables: eschrich2009_rsi <dbl>, estimate_immune <dbl>,
#> #   estimate_stromal <dbl>, eustace2013_hypoxia <dbl>,
#> #   fan2021_ferroptosis <dbl>, fang2021_irgs <dbl>, han2021_ferroptosis <dbl>,
#> #   he2021_ferroptosis_a <dbl>, he2021_ferroptosis_b <dbl>, hu2021_derbp <dbl>,
#> #   huang2022_ferroptosis <dbl>, ips_cp <dbl>, ips_ec <dbl>, ips_mhc <dbl>,
#> #   ips_sc <dbl>, li2021_ferroptosis_a <dbl>, li2021_ferroptosis_b <dbl>,
#> #   li2021_ferroptosis_c <dbl>, li2021_ferroptosis_d <dbl>, …

# You can also change the exponent of the ssGSEA running sum with:
hack_sig(test_expr, method = "ssgsea", sample_norm = "separate", alpha = 0.5)
#> Warning: ℹ No genes are present in 'expr_data' for the following signatures:
#> ✖ zhu2021_ferroptosis
#> ✖ rooney2015_cyt
#> # A tibble: 20 × 39
#>    sample_id ayers2017_immexp bai2019_immune cinsarc dececco2014_int172
#>    <chr>                <dbl>          <dbl>   <dbl>              <dbl>
#>  1 sample1             0              0.684    0.263            0.487  
#>  2 sample10            0.741          0.186    0.659            0.316  
#>  3 sample11            0.574          0.0486   0.811            0.562  
#>  4 sample12            0.779          0.280    0.310            0.394  
#>  5 sample13            0.268          0.112    1                0.502  
#>  6 sample14            0.705          0.503    0.960            0.568  
#>  7 sample15            0.674          0.746    0.625            0.507  
#>  8 sample16            0.461          0.299    0.770            0.298  
#>  9 sample17            0.443          0.481    0.424            1      
#> 10 sample18            0.699          0.0223   0.987            0.403  
#> 11 sample19            1              0.716    0.318            0.890  
#> 12 sample2             0.0732         0.325    0                0.537  
#> 13 sample20            0.730          0.399    0.886            0.128  
#> 14 sample3             0.747          0.363    0.980            0.153  
#> 15 sample4             0.555          1        0.468            0.838  
#> 16 sample5             0.639          0.360    0.501            0.220  
#> 17 sample6             0.966          0        0.693            0      
#> 18 sample7             0.622          0.644    0.985            0.466  
#> 19 sample8             0.498          0.0902   0.525            0.395  
#> 20 sample9             0.825          0.584    0.557            0.00933
#> # ℹ 34 more variables: eschrich2009_rsi <dbl>, estimate_immune <dbl>,
#> #   estimate_stromal <dbl>, eustace2013_hypoxia <dbl>,
#> #   fan2021_ferroptosis <dbl>, fang2021_irgs <dbl>, han2021_ferroptosis <dbl>,
#> #   he2021_ferroptosis_a <dbl>, he2021_ferroptosis_b <dbl>, hu2021_derbp <dbl>,
#> #   huang2022_ferroptosis <dbl>, ips_cp <dbl>, ips_ec <dbl>, ips_mhc <dbl>,
#> #   ips_sc <dbl>, li2021_ferroptosis_a <dbl>, li2021_ferroptosis_b <dbl>,
#> #   li2021_ferroptosis_c <dbl>, li2021_ferroptosis_d <dbl>, …

# To obtain combined z-scores for custom gene signatures, use:
custom_list <- list(rand_sig1 = rownames(test_expr)[1:5],
                    rand_sig2 = c(rownames(test_expr)[6:8], "RANDOMGENE"))
hack_sig(test_expr, custom_list, method = "zscore")
#> # A tibble: 20 × 3
#>    sample_id rand_sig1 rand_sig2
#>    <chr>         <dbl>     <dbl>
#>  1 sample1      0.601     0.160 
#>  2 sample10    -0.907    -0.440 
#>  3 sample11    -1.06     -0.0717
#>  4 sample12    -1.46     -0.0504
#>  5 sample13    -0.177    -0.204 
#>  6 sample14    -0.484    -2.43  
#>  7 sample15     0.868     0.0577
#>  8 sample16    -0.266     1.94  
#>  9 sample17    -0.146    -0.0410
#> 10 sample18     0.456     0.895 
#> 11 sample19    -0.0522   -0.142 
#> 12 sample2     -2.03     -0.230 
#> 13 sample20     0.258     0.879 
#> 14 sample3      1.27      1.65  
#> 15 sample4      0.899    -0.498 
#> 16 sample5      0.963    -0.520 
#> 17 sample6      1.61      0.703 
#> 18 sample7     -1.11     -1.11  
#> 19 sample8      2.11     -0.190 
#> 20 sample9     -1.35     -0.364