Getting started with MuData for MultiAssayExperiment
Danila Bredikhin
European Molecular Biology Laboratory, Heidelberg, Germanydanila.bredikhin@embl.de
Ilia Kats
German Cancer Research Center, Heidelberg, Germanyi.kats@dkfz-heidelberg.de
2023-10-12
Source:vignettes/Getting-Started.Rmd
Getting-Started.Rmd
Introduction
Multimodal data format — MuData — has been introduced to address the need for cross-platform standard for sharing large-scale multimodal omics data. Importantly, it develops ideas of and is compatible with AnnData standard for storing raw and derived data for unimodal datasets.
In Bioconductor, multimodal datasets have been stored in MultiAssayExperiment
(MAE) objects. This MuData
package provides functionality
to read data from MuData files into MAE objects as well as to save MAE
objects into H5MU files.
Installation
The most recent dev build can be installed from GitHub:
Stable version of MuData
will be available in future
bioconductor versions.
Writing H5MU files
We’ll use a simple MAE object from the
MultiAssayExperiment
package that we’ll then save in a H5MU
file.
data(miniACC)
miniACC
#> A MultiAssayExperiment object of 5 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 5:
#> [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
#> [2] gistict: SummarizedExperiment with 198 rows and 90 columns
#> [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
#> [4] Mutations: matrix with 97 rows and 90 columns
#> [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
We will now write its contents into an H5MU file with
WriteH5MU
:
writeH5MU(miniACC, "miniacc.h5mu")
Reading H5MU files
We can manually check the top level of the structure of the file:
rhdf5::h5ls("miniacc.h5mu", recursive = FALSE)
#> group name otype dclass dim
#> 0 / mod H5I_GROUP
#> 1 / obs H5I_GROUP
#> 2 / obsm H5I_GROUP
#> 3 / obsmap H5I_GROUP
#> 4 / uns H5I_GROUP
#> 5 / var H5I_GROUP
Or dig deeper into the file:
h5 <- rhdf5::H5Fopen("miniacc.h5mu")
h5&'mod'
#> HDF5 GROUP
#> name /mod
#> filename
#>
#> name otype dclass dim
#> 0 Mutations H5I_GROUP
#> 1 RNASeq2GeneNorm H5I_GROUP
#> 2 RPPAArray H5I_GROUP
#> 3 gistict H5I_GROUP
#> 4 miRNASeqGene H5I_GROUP
rhdf5::H5close()
Creating MAE objects from H5MU files
This package provides ReadH5MU
to create an object with
data from an H5MU file. Since H5MU structure has been designed to
accommodate more structured information than MAE, only some data will be
read. For instance, MAE has no support for loading multimodal embeddings
or pairwise graphs.
acc <- readH5MU("miniacc.h5mu")
#> Reading as SingleCellExperiment where the original object class is matrix
#> Warning: sampleMap[['assay']] coerced with as.factor()
acc
#> A MultiAssayExperiment object of 5 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 5:
#> [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
#> [2] gistict: SummarizedExperiment with 198 rows and 90 columns
#> [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
#> [4] Mutations: SingleCellExperiment with 97 rows and 90 columns
#> [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
Importantly, we recover the information from the original MAE object:
head(colData(miniACC)[,1:4])
#> DataFrame with 6 rows and 4 columns
#> patientID years_to_birth vital_status days_to_death
#> <character> <integer> <integer> <integer>
#> TCGA-OR-A5J1 TCGA-OR-A5J1 58 1 1355
#> TCGA-OR-A5J2 TCGA-OR-A5J2 44 1 1677
#> TCGA-OR-A5J3 TCGA-OR-A5J3 23 0 NA
#> TCGA-OR-A5J4 TCGA-OR-A5J4 23 1 423
#> TCGA-OR-A5J5 TCGA-OR-A5J5 30 1 365
#> TCGA-OR-A5J6 TCGA-OR-A5J6 29 0 NA
head(colData(acc)[,1:4])
#> DataFrame with 6 rows and 4 columns
#> patientID years_to_birth vital_status days_to_death
#> <character> <integer> <integer> <array>
#> TCGA-OR-A5J1 TCGA-OR-A5J1 58 1 1355
#> TCGA-OR-A5J2 TCGA-OR-A5J2 44 1 1677
#> TCGA-OR-A5J3 TCGA-OR-A5J3 23 0 NA
#> TCGA-OR-A5J4 TCGA-OR-A5J4 23 1 423
#> TCGA-OR-A5J5 TCGA-OR-A5J5 30 1 365
#> TCGA-OR-A5J6 TCGA-OR-A5J6 29 0 NA
Features metadata is also recovered:
head(rowData(miniACC[["gistict"]]))
#> DataFrame with 6 rows and 3 columns
#> Gene.Symbol Locus.ID Cytoband
#> <character> <character> <character>
#> DIRAS3 DIRAS3 9077 1p31.3
#> MAPK14 MAPK14 1432 6p21.31
#> YAP1 YAP1 10413 11q22.1
#> CDKN1B CDKN1B 1027 12p13.1
#> ERBB2 ERBB2 2064 17q12
#> G6PD G6PD 2539 Xq28
head(rowData(acc[["gistict"]]))
#> DataFrame with 6 rows and 3 columns
#> Gene.Symbol Locus.ID Cytoband
#> <character> <character> <character>
#> DIRAS3 DIRAS3 9077 1p31.3
#> MAPK14 MAPK14 1432 6p21.31
#> YAP1 YAP1 10413 11q22.1
#> CDKN1B CDKN1B 1027 12p13.1
#> ERBB2 ERBB2 2064 17q12
#> G6PD G6PD 2539 Xq28
Backed objects
It is possible to read H5MU files while keeping matrices (both
.X
and .layers
) on disk.
acc_b <- readH5MU("miniacc.h5mu", backed = TRUE)
#> Reading as SingleCellExperiment where the original object class is matrix
#> Warning: sampleMap[['assay']] coerced with as.factor()
assay(acc_b, "RNASeq2GeneNorm")[1:5,1:3]
#> <5 x 3> DelayedMatrix object of type "double":
#> TCGA-OR-A5J1-01A-11R-A29S-07 TCGA-OR-A5J2-01A-11R-A29S-07
#> DIRAS3 1487.0317 9.6631
#> MAPK14 778.5783 2823.6469
#> YAP1 1009.6061 2305.0590
#> CDKN1B 2101.3449 4248.9584
#> ERBB2 651.2968 246.4098
#> TCGA-OR-A5J3-01A-11R-A29S-07
#> DIRAS3 18.9602
#> MAPK14 1061.7686
#> YAP1 1561.2502
#> CDKN1B 1348.5410
#> ERBB2 90.0607
The data in the assay is a DelayedMatrix
object:
class(assay(acc_b, "RNASeq2GeneNorm"))
#> [1] "DelayedMatrix"
#> attr(,"package")
#> [1] "DelayedArray"
This is in contrast to the acc
object that has matrices
in memory:
assay(acc, "RNASeq2GeneNorm")[1:5,1:3]
#> TCGA-OR-A5J1-01A-11R-A29S-07 TCGA-OR-A5J2-01A-11R-A29S-07
#> DIRAS3 1487.0317 9.6631
#> MAPK14 778.5783 2823.6469
#> YAP1 1009.6061 2305.0590
#> CDKN1B 2101.3449 4248.9584
#> ERBB2 651.2968 246.4098
#> TCGA-OR-A5J3-01A-11R-A29S-07
#> DIRAS3 18.9602
#> MAPK14 1061.7686
#> YAP1 1561.2502
#> CDKN1B 1348.5410
#> ERBB2 90.0607
class(assay(acc, "RNASeq2GeneNorm"))
#> [1] "matrix" "array"
References
mudata (Python) documentation
muon documentation and web page
Session Info
sessionInfo()
#> R Under development (unstable) (2023-10-10 r85312)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MultiAssayExperiment_1.27.5 SummarizedExperiment_1.31.1
#> [3] Biobase_2.61.0 GenomicRanges_1.53.2
#> [5] GenomeInfoDb_1.37.6 IRanges_2.35.2
#> [7] MatrixGenerics_1.13.1 matrixStats_1.0.0
#> [9] MuData_0.99.9 rhdf5_2.45.1
#> [11] S4Vectors_0.39.2 BiocGenerics_0.47.0
#> [13] Matrix_1.6-1.1 BiocStyle_2.29.2
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.7 bitops_1.0-7
#> [3] SparseArray_1.1.12 stringi_1.7.12
#> [5] lattice_0.21-9 digest_0.6.33
#> [7] magrittr_2.0.3 evaluate_0.22
#> [9] grid_4.4.0 bookdown_0.35
#> [11] fastmap_1.1.1 rprojroot_2.0.3
#> [13] jsonlite_1.8.7 BiocManager_1.30.22
#> [15] SingleCellExperiment_1.23.0 purrr_1.0.2
#> [17] HDF5Array_1.29.3 textshaping_0.3.7
#> [19] jquerylib_0.1.4 abind_1.4-5
#> [21] cli_3.6.1 rlang_1.1.1
#> [23] crayon_1.5.2 XVector_0.41.1
#> [25] cachem_1.0.8 DelayedArray_0.27.10
#> [27] yaml_2.3.7 S4Arrays_1.1.6
#> [29] tools_4.4.0 memoise_2.0.1
#> [31] Rhdf5lib_1.23.2 GenomeInfoDbData_1.2.10
#> [33] vctrs_0.6.3 R6_2.5.1
#> [35] lifecycle_1.0.3 zlibbioc_1.47.0
#> [37] stringr_1.5.0 fs_1.6.3
#> [39] ragg_1.2.6 desc_1.4.2
#> [41] pkgdown_2.0.7 bslib_0.5.1
#> [43] glue_1.6.2 systemfonts_1.0.5
#> [45] xfun_0.40 knitr_1.44
#> [47] rhdf5filters_1.13.5 htmltools_0.5.6.1
#> [49] rmarkdown_2.25 compiler_4.4.0
#> [51] RCurl_1.98-1.12