# Core Concepts

Understanding the mental model of **Gunz-CM** is key to using it effectively for large-scale genomic analysis.

## 1. The Unified Matrix Facade

Genomic contact data comes in many flavors: sparse COO (CSV), binary HDF5 (Cooler), and indexed binary (.hic). Gunz-CM abstracts these details away. Whether you load a 10kb resolution Cooler file or a raw CSV, you interact with the **`ContactMatrix`** object.

This object provides a consistent API for:

* Genomic metadata (chromosomes, resolution).
* Data access (as DataFrames, CSR matrices, or NumPy arrays).

## 2. Lazy Loading vs. In-Memory

The library is designed for memory efficiency:

* **Lazy Loaders:** When you call `load_cm_data`, the library typically only reads the specific genomic range you requested.
* **Memory-Map (Memmap):** For massive datasets, Gunz-CM supports memory-mapping, allowing you to treat a file on disk as if it were an in-memory NumPy array without consuming all your RAM.

## 3. Preprocessing Pipelines

Gunz-CM treats preprocessing as a series of functional transformations. The typical workflow is:

1. **Load:** Import raw data into a `ContactMatrix`.
2. **Filter:** Use optimized functions like `filter_empty_rowcols` to remove non-informative regions (e.g., centromeres or unalignable bins).
3. **Normalize:** Apply balancing weights (KR, ICE, or VC) to correct for sequencing bias.
4. **Analyze:** Pass the cleaned matrix to downstream metrics (Spearman R, IPR4, etc.).

## 4. Matrix Backends

You can choose your underlying data representation based on your task:

* **Pandas (COO):** Best for sparse manipulation and filtering.
* **SciPy (CSR/CSC):** Best for heavy linear algebra and loss calculations.
* **PyTorch/Tensor:** Best for 3D reconstruction and gradient-based optimization.
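
To make the backend hand-off concrete, here is a minimal sketch of moving a COO table into CSR, dropping empty bins, and applying a simplified vanilla-coverage (VC) scaling. It uses only pandas, NumPy, and SciPy and does not call the Gunz-CM API. The helper names (`coo_frame_to_symmetric_csr`, `drop_empty_bins`, `vc_normalize`), the column names, and the toy data are assumptions made for illustration; the library's own `filter_empty_rowcols` and normalization routines may behave differently.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def coo_frame_to_symmetric_csr(df, n_bins):
    """Build a symmetric CSR matrix from upper-triangle COO records.

    Hypothetical helper: assumes columns named 'row', 'col', 'count'.
    """
    upper = sparse.coo_matrix(
        (df["count"].to_numpy(dtype=float),
         (df["row"].to_numpy(), df["col"].to_numpy())),
        shape=(n_bins, n_bins),
    ).tocsr()
    # Mirror the upper triangle, then subtract the diagonal counted twice.
    full = upper + upper.T - sparse.diags(upper.diagonal())
    return full.tocsr()


def drop_empty_bins(matrix):
    """Remove rows/columns whose total contact count is zero."""
    totals = np.asarray(matrix.sum(axis=1)).ravel()
    keep = np.flatnonzero(totals > 0)
    return matrix[keep][:, keep], keep


def vc_normalize(matrix):
    """Simplified vanilla coverage: divide M[i, j] by cov[i] * cov[j].

    Assumes every bin has non-zero coverage (run drop_empty_bins first).
    """
    coverage = np.asarray(matrix.sum(axis=1)).ravel()
    inv = sparse.diags(1.0 / coverage)
    return (inv @ matrix @ inv).tocsr()


# Toy upper-triangle contacts for 4 bins; bin 2 has no contacts at all,
# standing in for an unalignable region.
coo_df = pd.DataFrame({
    "row":   [0, 0, 0, 1, 3],
    "col":   [0, 1, 3, 1, 3],
    "count": [4, 2, 1, 3, 5],
})

full = coo_frame_to_symmetric_csr(coo_df, n_bins=4)
filtered, kept_bins = drop_empty_bins(full)
normalized = vc_normalize(filtered)
```

The COO table is convenient for building and filtering records, while the CSR form makes the row sums and the diagonal scaling cheap, which mirrors the division of labor between the Pandas and SciPy backends listed above.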