Core Concepts

Understanding the mental model of Gunz-CM is key to using it effectively for large-scale genomic analysis.

1. The Unified Matrix Facade

Genomic contact data comes in many flavors: sparse COO (CSV), binary HDF5 (Cooler), and indexed binary (.hic).

Gunz-CM abstracts these details away. Whether you load a Cooler file at 10 kb resolution or a raw CSV, you interact with the same ContactMatrix object. This object provides a consistent API for:

  • Genomic metadata (chromosomes, resolution).

  • Data access (as DataFrames, CSR matrices, or NumPy arrays).
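The facade idea can be illustrated with a small stand-in class. This is a minimal sketch, not the Gunz-CM implementation: the class name ContactMatrixSketch, its fields, and its methods are all hypothetical and chosen only to show how one object can carry metadata and expose several views of the same contacts.

```python
from dataclasses import dataclass

import numpy as np
from scipy import sparse


@dataclass
class ContactMatrixSketch:
    """Hypothetical stand-in for a facade; NOT the real ContactMatrix class."""

    chromosome: str
    resolution: int      # bin size in base pairs
    rows: np.ndarray     # bin index of the first locus of each contact
    cols: np.ndarray     # bin index of the second locus of each contact
    counts: np.ndarray   # contact counts, one per (row, col) pair
    n_bins: int

    def to_csr(self):
        # Sparse view, suited to linear algebra.
        return sparse.csr_matrix(
            (self.counts, (self.rows, self.cols)),
            shape=(self.n_bins, self.n_bins),
        )

    def to_dense(self):
        # Dense NumPy view; only sensible for small regions.
        return self.to_csr().toarray()


cm = ContactMatrixSketch(
    chromosome="chr1",
    resolution=10_000,
    rows=np.array([0, 0, 1]),
    cols=np.array([0, 1, 1]),
    counts=np.array([5.0, 2.0, 7.0]),
    n_bins=2,
)
dense = cm.to_dense()
```

Whatever the source file format, downstream code in this sketch only touches the metadata fields and the to_* accessors, which is the point of a facade.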

2. Lazy Loading vs. In-Memory

The library is designed for memory efficiency:

  • Lazy Loaders: When you call load_cm_data, the library typically reads only the specific genomic range you requested, rather than the whole file.

  • Memory-Map (Memmap): For massive datasets, Gunz-CM supports memory-mapping, allowing you to treat a file on disk as if it were an in-memory NumPy array without consuming all your RAM.
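To make the memory-mapping idea concrete, here is a minimal sketch that uses NumPy's numpy.memmap directly. It does not go through Gunz-CM; the file name, matrix size, and dtype are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np

n_bins = 1000  # assumed matrix size for this illustration
path = os.path.join(tempfile.mkdtemp(), "contacts.dat")

# Create a file-backed array on disk and write a few symmetric counts.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_bins, n_bins))
writer[10, 20] = 5.0
writer[20, 10] = 5.0
writer.flush()
del writer  # release the writable mapping

# Reopen read-only. The array is not loaded into RAM up front;
# the operating system reads pages from disk as slices are accessed.
contacts = np.memmap(path, dtype=np.float32, mode="r", shape=(n_bins, n_bins))

# Copy one small window into an ordinary in-memory array.
window = np.asarray(contacts[0:50, 0:50])
```

The full matrix here would occupy about 4 MB as float32, but the same pattern applies to files far larger than available memory, since only the accessed regions need to be resident.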

3. Preprocessing Pipelines

Gunz-CM treats preprocessing as a series of functional transformations. The typical workflow is:

  1. Load: Import raw data into a ContactMatrix.

  2. Filter: Use optimized functions like filter_empty_rowcols to remove non-informative regions (e.g., centromeres or unalignable bins).

  3. Normalize: Apply balancing weights (KR, Knight-Ruiz; ICE, iterative correction; or VC, vanilla coverage) to correct for systematic biases such as uneven coverage.

  4. Analyze: Pass the cleaned matrix to downstream metrics (Spearman R, IPR4, etc.).
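The four steps above can be sketched end to end with NumPy and SciPy. This is an illustration under stated assumptions, not the Gunz-CM code path: drop_empty_bins and balance are hypothetical helpers written for this example, and the balancing loop is a simple square-root iterative scaling in the spirit of ICE, which may differ from the algorithm the library applies.

```python
import numpy as np
from scipy import sparse
from scipy.stats import spearmanr


def drop_empty_bins(m):
    """Hypothetical helper: remove bins whose row and column are all zero."""
    m = m.tocsr()
    coverage = np.asarray(m.sum(axis=0)).ravel() + np.asarray(m.sum(axis=1)).ravel()
    keep = np.flatnonzero(coverage > 0)
    return m[keep][:, keep], keep


def balance(m, n_iter=50):
    """Hypothetical helper: iterative scaling toward equal row sums.

    Assumes a symmetric matrix with no empty rows. Returns the balanced
    matrix and a bias vector with balanced = m / outer(bias, bias).
    """
    m = m.astype(float).tocsr()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        s = np.asarray(m.sum(axis=1)).ravel()
        s = s / s.mean()              # keep the overall scale stable
        step = np.sqrt(s)             # square-root update damps oscillation
        d = sparse.diags(1.0 / step)
        m = (d @ m @ d).tocsr()
        bias *= step
    return m, bias


# 1. Load: a toy symmetric matrix; bins 1 and 3 have no contacts.
raw = sparse.csr_matrix(np.array([
    [4, 0, 2, 0, 1],
    [0, 0, 0, 0, 0],
    [2, 0, 6, 0, 3],
    [0, 0, 0, 0, 0],
    [1, 0, 3, 0, 8],
], dtype=float))

# 2. Filter: drop the empty bins.
filtered, kept = drop_empty_bins(raw)

# 3. Normalize: balance so every remaining bin has the same total.
balanced, bias = balance(filtered)

# 4. Analyze: rank correlation between raw and balanced upper triangles.
iu = np.triu_indices(filtered.shape[0])
rho, _ = spearmanr(filtered.toarray()[iu], balanced.toarray()[iu])
```

Each step takes a matrix and returns a new one, so steps can be reordered, skipped, or swapped for a different normalization without touching the others.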

4. Matrix Backends

You can choose your underlying data representation based on your task:

  • Pandas (COO): Best for sparse manipulation and filtering.

  • SciPy (CSR/CSC): Best for heavy linear algebra and loss calculations.

  • PyTorch/Tensor: Best for 3D reconstruction and gradient-based optimization.
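The trade-off between the first two backends can be seen in a short sketch that uses pandas and SciPy directly rather than any Gunz-CM accessor. The column names bin1, bin2, and count are assumptions for this example.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# COO triplets in a DataFrame (upper triangle of a 3-bin matrix).
coo = pd.DataFrame({
    "bin1": [0, 0, 1, 2],
    "bin2": [0, 2, 1, 2],
    "count": [4.0, 1.0, 6.0, 8.0],
})

# Pandas: row-wise filtering is a single boolean mask,
# here keeping only off-diagonal contacts.
off_diag = coo[coo["bin1"] != coo["bin2"]]

# SciPy: convert the same triplets to CSR for linear algebra.
n_bins = 3
csr = sparse.csr_matrix(
    (coo["count"].to_numpy(), (coo["bin1"].to_numpy(), coo["bin2"].to_numpy())),
    shape=(n_bins, n_bins),
)

# Per-bin totals of the stored entries as a sparse matrix-vector product.
row_totals = csr @ np.ones(n_bins)
```

For gradient-based work, the same index and value arrays could be handed to PyTorch, for example via torch.sparse_coo_tensor; that step is omitted here so the sketch runs without PyTorch installed.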