Core Concepts
Understanding the mental model of Gunz-CM is key to using it effectively for large-scale genomic analysis.
1. The Unified Matrix Facade
Genomic contact data comes in many flavors: sparse COO (CSV), binary HDF5 (Cooler), and indexed binary (.hic).
Gunz-CM abstracts these details away. Whether you load a Cooler file at 10 kb resolution or a raw CSV, you interact with the same ContactMatrix object. This object provides a consistent API for:
Genomic metadata (Chromosomes, resolution).
Data access (as DataFrames, CSR matrices, or NumPy arrays).
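To make the facade idea concrete, here is a minimal sketch of one object that carries genomic metadata and exposes the same contacts in three representations. ToyContactMatrix and its method names (to_dataframe, to_csr, to_numpy) are hypothetical stand-ins written for illustration; they are not the Gunz-CM ContactMatrix API.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd
from scipy import sparse


@dataclass
class ToyContactMatrix:
    """Hypothetical stand-in for a facade object; not the Gunz-CM class."""

    rows: np.ndarray      # bin index of the first locus
    cols: np.ndarray      # bin index of the second locus
    counts: np.ndarray    # contact count for each (row, col) pair
    n_bins: int           # number of bins along each axis
    chromosome: str       # genomic metadata
    resolution: int       # bin size in base pairs

    def to_dataframe(self) -> pd.DataFrame:
        # COO-style table: one row per non-zero contact.
        return pd.DataFrame(
            {"row": self.rows, "col": self.cols, "count": self.counts}
        )

    def to_csr(self) -> sparse.csr_matrix:
        # Build in COO form first, then convert to CSR.
        coo = sparse.coo_matrix(
            (self.counts, (self.rows, self.cols)),
            shape=(self.n_bins, self.n_bins),
        )
        return coo.tocsr()

    def to_numpy(self) -> np.ndarray:
        # Dense array; only sensible for small regions.
        return self.to_csr().toarray()


cm = ToyContactMatrix(
    rows=np.array([0, 0, 1, 2]),
    cols=np.array([0, 1, 2, 3]),
    counts=np.array([5.0, 2.0, 3.0, 1.0]),
    n_bins=4,
    chromosome="chr1",
    resolution=10_000,
)

df = cm.to_dataframe()
csr = cm.to_csr()
dense = cm.to_numpy()
```

Whatever the source file format, downstream code in this sketch only touches the three accessor methods and the metadata fields.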
2. Lazy Loading vs. In-Memory
The library is designed for memory efficiency:
Lazy Loaders: When you call load_cm_data, the library typically reads only the specific genomic range you requested.
Memory-Map (Memmap): For massive datasets, Gunz-CM supports memory-mapping, allowing you to treat a file on disk as if it were an in-memory NumPy array without consuming all your RAM.
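The memory-mapping idea can be illustrated with plain NumPy. The sketch below uses np.memmap directly rather than any Gunz-CM call; the file name, matrix size, and dtype are assumptions chosen for the example.

```python
import os
import tempfile

import numpy as np

n_bins = 1_000  # assumed matrix size for the example
path = os.path.join(tempfile.mkdtemp(), "contacts.dat")

# Write a dense matrix to disk once (stand-in for a large precomputed file).
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_bins, n_bins))
writer[100:200, 100:200] = 1.0
writer.flush()
del writer

# Reopen read-only. Pages are pulled from disk only when a slice is accessed.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n_bins, n_bins))

# Copy just the requested window into RAM; the rest of the file stays on disk.
window = np.array(mm[150:160, 150:160])
```

Only the 10 x 10 window is copied into memory here, which is the same access pattern as requesting a specific genomic range from a much larger matrix.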
3. Preprocessing Pipelines
Gunz-CM treats preprocessing as a series of functional transformations. The typical workflow is:
Load: Import raw data into a ContactMatrix.
Filter: Use optimized functions like filter_empty_rowcols to remove non-informative regions (e.g., centromeres or unalignable bins).
Normalize: Apply balancing weights (KR, ICE, or VC) to correct for sequencing bias.
Analyze: Pass the cleaned matrix to downstream metrics (Spearman R, IPR4, etc.).
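The workflow above can be sketched as a chain of plain functions on a small dense matrix. The helper names drop_empty_rowcols and vc_normalize are hypothetical and written for this example; they are not Gunz-CM functions. The sketch assumes a symmetric, non-negative matrix, uses a simple one-pass vanilla coverage (VC) normalization, and computes the Spearman correlation with scipy.stats.spearmanr.

```python
import numpy as np
from scipy.stats import spearmanr


def drop_empty_rowcols(m):
    """Remove bins with no contacts (assumes a symmetric matrix)."""
    keep = m.sum(axis=1) > 0
    return m[np.ix_(keep, keep)], keep


def vc_normalize(m):
    """Vanilla coverage: divide each entry by its row sum times its column sum."""
    row = m.sum(axis=1)
    col = m.sum(axis=0)
    return m / np.outer(row, col)


# Load: a toy symmetric contact matrix; bin 2 is empty (e.g. unalignable).
raw = np.array(
    [
        [4.0, 2.0, 0.0, 1.0],
        [2.0, 6.0, 0.0, 3.0],
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 3.0, 0.0, 5.0],
    ]
)

# Filter, then normalize.
filtered, keep = drop_empty_rowcols(raw)
normalized = vc_normalize(filtered)

# Analyze: compare against a toy "replicate" with twice the raw counts.
replicate = vc_normalize(2.0 * filtered)
iu = np.triu_indices_from(normalized)  # upper triangle, diagonal included
rho, _ = spearmanr(normalized[iu], replicate[iu])
```

Each step takes a matrix and returns a new one, so steps can be reordered or swapped (for example, replacing VC with an iterative balancing method) without touching the rest of the chain.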
4. Matrix Backends
You can choose your underlying data representation based on your task:
Pandas (COO): Best for sparse manipulation and filtering.
SciPy (CSR/CSC): Best for heavy linear algebra and loss calculations.
PyTorch/Tensor: Best for 3D reconstruction and gradient-based optimization.
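As a rough illustration of matching the backend to the task, the sketch below filters contacts in a pandas COO table, switches to SciPy CSR for a matrix-vector product, and optionally builds a PyTorch tensor if torch happens to be installed. The column names (row, col, count) and the distance cutoff are assumptions made for the example, not a Gunz-CM schema.

```python
import numpy as np
import pandas as pd
from scipy import sparse

n_bins = 4
coo_df = pd.DataFrame(
    {
        "row": [0, 0, 1, 2, 2],
        "col": [0, 3, 1, 2, 3],
        "count": [5.0, 1.0, 4.0, 6.0, 2.0],
    }
)

# Pandas (COO): row-wise filtering, here keeping contacts within 1 bin.
near = coo_df[(coo_df["col"] - coo_df["row"]).abs() <= 1]

# SciPy (CSR): fast linear algebra, here row sums via a matrix-vector product.
csr = sparse.csr_matrix(
    (near["count"].to_numpy(), (near["row"].to_numpy(), near["col"].to_numpy())),
    shape=(n_bins, n_bins),
)
row_sums = csr @ np.ones(n_bins)

# PyTorch (optional): a dense tensor that can take part in autograd.
try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    tensor = torch.tensor(csr.toarray(), requires_grad=True)
```

The dense tensor conversion is only reasonable for small regions; for whole-chromosome matrices a sparse tensor layout would be the more natural choice.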