Core Concepts

Understanding the mental model of Gunz-CM is key to using it effectively for large-scale genomic analysis.

1. The Unified Matrix Facade

Genomic contact data comes in many flavors: sparse COO (CSV), binary HDF5 (Cooler), and indexed binary (.hic).

Gunz-CM abstracts these details away. Whether you load a 10 kb resolution Cooler file or a raw CSV, you interact with the same ContactMatrix object. This object provides a consistent API for:

Genomic metadata (Chromosomes, resolution).
Data access (as DataFrames, CSR matrices, or NumPy arrays).
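
To make the facade idea concrete, here is a minimal sketch of one object that holds contacts as COO triples and exposes them in several forms. `ToyContactMatrix` and its methods are hypothetical names invented for illustration; they are not Gunz-CM's actual ContactMatrix API, which may differ.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd
from scipy import sparse


@dataclass
class ToyContactMatrix:
    """Hypothetical stand-in for a unified contact-matrix object (not Gunz-CM's class)."""

    chromosome: str
    resolution: int      # bin size in base pairs
    rows: np.ndarray     # bin index of the first locus
    cols: np.ndarray     # bin index of the second locus
    counts: np.ndarray   # contact count for each (row, col) pair
    n_bins: int          # number of bins along each axis

    def to_dataframe(self) -> pd.DataFrame:
        # Tabular COO view: one record per non-zero contact.
        return pd.DataFrame({"row": self.rows, "col": self.cols, "count": self.counts})

    def to_csr(self) -> sparse.csr_matrix:
        # Sparse view suited to linear algebra.
        coo = sparse.coo_matrix(
            (self.counts, (self.rows, self.cols)), shape=(self.n_bins, self.n_bins)
        )
        return coo.tocsr()

    def to_numpy(self) -> np.ndarray:
        # Dense view; only sensible for small regions.
        return self.to_csr().toarray()


toy = ToyContactMatrix(
    chromosome="chr1",
    resolution=10_000,
    rows=np.array([0, 0, 1, 2]),
    cols=np.array([0, 1, 1, 2]),
    counts=np.array([5.0, 2.0, 7.0, 3.0]),
    n_bins=3,
)
df = toy.to_dataframe()
csr = toy.to_csr()
dense = toy.to_numpy()
```

The point of the sketch is that the metadata and the data live together, so calling code never needs to know which file format the contacts came from.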

2. Lazy Loading vs. In-Memory

The library is designed for memory efficiency:

Lazy Loaders: When you call load_cm_data, the library typically only reads the specific genomic range you requested.
Memory-Map (Memmap): For massive datasets, Gunz-CM supports memory-mapping, allowing you to treat a file on disk as if it were an in-memory NumPy array without consuming all your RAM.
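
The memory-mapping idea can be illustrated with plain NumPy. This sketch uses `numpy.memmap` directly on a temporary file; how Gunz-CM wires memory-mapping into its own loaders is not shown here, and the file name and matrix size are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

n_bins = 1000
path = os.path.join(tempfile.mkdtemp(), "contacts.dat")

# Write a dense float32 matrix to disk once (a stand-in for a large contact map).
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_bins, n_bins))
writer[10, 20] = 4.0
writer.flush()
del writer

# Re-open read-only. No data is read until a slice is actually accessed.
reader = np.memmap(path, dtype=np.float32, mode="r", shape=(n_bins, n_bins))

# Copy only the requested window into RAM; the rest of the file stays on disk.
window = np.array(reader[0:50, 0:50])
```

Only the 50 x 50 window is materialized in memory, which mirrors the "read just the range you asked for" behaviour described above.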

3. Preprocessing Pipelines

Gunz-CM treats preprocessing as a series of functional transformations. The typical workflow is:

Load: Import raw data into a ContactMatrix.
Filter: Use optimized functions like filter_empty_rowcols to remove non-informative regions (e.g., centromeres or unalignable bins).
Normalize: Apply balancing weights (KR, ICE, or VC) to correct for systematic biases such as uneven coverage across bins.
Analyze: Pass the cleaned matrix to downstream metrics (Spearman R, IPR4, etc.).
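
The filter and normalize steps above can be sketched with SciPy alone. The helper names `drop_empty_bins` and `vanilla_coverage` are hypothetical and written for this example; they are not Gunz-CM functions, and the library's `filter_empty_rowcols` may behave differently. The normalization shown is a simple vanilla-coverage (VC) scaling on a symmetric matrix, not KR or ICE.

```python
import numpy as np
from scipy import sparse

# Toy symmetric contact matrix; bin 2 has no contacts at all.
raw = sparse.csr_matrix(np.array([
    [4.0, 2.0, 0.0, 1.0],
    [2.0, 6.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 0.0, 2.0],
]))


def drop_empty_bins(matrix):
    """Remove bins with zero total coverage (assumes a symmetric matrix)."""
    coverage = np.asarray(matrix.sum(axis=1)).ravel()
    keep = np.flatnonzero(coverage > 0)
    return matrix[keep][:, keep], keep


def vanilla_coverage(matrix):
    """Divide each entry M[i, j] by coverage[i] * coverage[j]."""
    coverage = np.asarray(matrix.sum(axis=1)).ravel()
    scale = sparse.diags(1.0 / coverage)
    return scale @ matrix @ scale


filtered, kept_bins = drop_empty_bins(raw)
normalized_dense = vanilla_coverage(filtered).toarray()
```

Filtering before normalizing matters here: an empty bin has zero coverage, so scaling by `1.0 / coverage` would divide by zero if that bin were still present.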

4. Matrix Backends

You can choose your underlying data representation based on your task:

Pandas (COO): Best for sparse manipulation and filtering.
SciPy (CSR/CSC): Best for heavy linear algebra and loss calculations.
PyTorch/Tensor: Best for 3D reconstruction and gradient-based optimization.
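
As a rough illustration of why the representation matters, the sketch below moves the same toy contacts through a pandas table, a SciPy CSR matrix, and a dense NumPy array. It uses only pandas, SciPy, and NumPy calls; none of it is Gunz-CM's own conversion API, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# COO-style table: one record per non-zero contact.
contacts = pd.DataFrame({
    "row": [0, 0, 1, 1, 2],
    "col": [0, 2, 1, 2, 2],
    "count": [5.0, 1.0, 7.0, 2.0, 3.0],
})

# Tabular form makes record-level filtering easy, e.g. dropping the diagonal.
off_diagonal = contacts[contacts["row"] != contacts["col"]]

# Sparse CSR form suits linear algebra such as matrix-vector products.
csr = sparse.coo_matrix(
    (
        off_diagonal["count"].to_numpy(),
        (off_diagonal["row"].to_numpy(), off_diagonal["col"].to_numpy()),
    ),
    shape=(3, 3),
).tocsr()
row_sums = csr @ np.ones(3)

# Dense form; an array like this could be passed to a tensor library
# (for example with torch.from_numpy) for gradient-based work.
dense = csr.toarray()
```

Each conversion has a cost, so it is usually worth filtering in the tabular form first and converting to sparse or dense only once.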