# Tutorial: Load a real Hi-C dataset via GUNZ_CM_TUTORIAL_DATA

This tutorial walks through loading the canonical public GM12878 chr1 Hi-C dataset from 4DNucleome (accession 4DNFI1UEG1HD) using the `gunz_cm.loaders` API. The path to the file is resolved at runtime from the `GUNZ_CM_TUTORIAL_DATA` env var, so no contributor’s filesystem path is ever hardcoded.

## Setup

Before running, populate the data directory once:

    mkdir -p ~/gunz_cm_tutorial_data
    python scripts/download_tutorial_data.py --name gm12878_chr1_1mb \
        --target ~/gunz_cm_tutorial_data
    export GUNZ_CM_TUTORIAL_DATA=~/gunz_cm_tutorial_data

## Learning Objectives

* Resolve a canonical dataset via `notebooks/_tutorial_data.load_tutorial_dataset`.
* Load a `.hic` contact matrix at a specific resolution and chromosome.
* Inspect metadata (resolutions, balancing, chromosome info).
* Apply KR balancing and read the matrix back as COO.

## Estimated Time

Approximately 5 minutes after the data is downloaded.

## Prerequisites

* `gunz-cm` installed (this repo).
* `GUNZ_CM_TUTORIAL_DATA` set to a directory containing `4DNFI1UEG1HD.hic`.

import sys
from pathlib import Path

from gunz_cm.loaders import (
    load_cm_data,
    get_resolutions,
    get_chrom_infos,
    get_balancing,
)
from gunz_cm.consts import Balancing, DataStructure

# Resolve _tutorial_data.py via repo-root probe.
_probe = Path.cwd().resolve()
_root = None
while _probe != _probe.parent:
    if (_probe / "notebooks" / "_tutorial_data.py").is_file() and (_probe / "pyproject.toml").is_file():
        _root = _probe
        break
    _probe = _probe.parent
if _root is None:
    raise RuntimeError(f"cannot find gunz-cm repo root from cwd {Path.cwd()}")
sys.path.insert(0, str(_root / "notebooks"))

from _tutorial_data import load_tutorial_dataset, TutorialDataError

try:
    hic_path = load_tutorial_dataset("gm12878_chr1_1mb")
    print(f"Using real dataset: {hic_path}")
except TutorialDataError as exc:
    print(f"Skip: {exc}")
    hic_path = None  # the rest of the cells will skip cleanly

Skip: dataset 'gm12878_chr1_1mb' expected at /home/adhisant/gunz_cm_tutorial_data/4DNFI1UEG1HD.hic but file is missing; run scripts/download_tutorial_data.py --name gm12878_chr1_1mb
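
The skip message above comes from resolving the dataset path through the environment variable rather than a hardcoded location. As a rough illustration of that pattern, here is a minimal sketch. It is not the actual `_tutorial_data.py` implementation: the function name `resolve_tutorial_dataset`, the `_DATASETS` mapping, and the local `TutorialDataError` class are hypothetical stand-ins, and only the dataset name, the file name, and the env var name are taken from this tutorial.

```python
import os
from pathlib import Path


class TutorialDataError(RuntimeError):
    """Hypothetical stand-in for the error raised when a dataset cannot be resolved."""


# Hypothetical registry: dataset name -> file name expected in the data directory.
_DATASETS = {"gm12878_chr1_1mb": "4DNFI1UEG1HD.hic"}


def resolve_tutorial_dataset(name, env=None):
    """Return the path of a registered dataset under GUNZ_CM_TUTORIAL_DATA."""
    env = os.environ if env is None else env
    root = env.get("GUNZ_CM_TUTORIAL_DATA")
    if not root:
        raise TutorialDataError("GUNZ_CM_TUTORIAL_DATA is not set")
    if name not in _DATASETS:
        raise TutorialDataError(f"unknown dataset {name!r}")
    path = Path(root).expanduser() / _DATASETS[name]
    if not path.is_file():
        raise TutorialDataError(
            f"dataset {name!r} expected at {path} but file is missing"
        )
    return path
```

Passing `env` explicitly keeps the lookup easy to exercise without mutating the real process environment. Raising one dedicated exception type for every failure mode is what lets the notebook cell catch it and skip cleanly.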

## 1. Inspect the file’s metadata

The `.hic` file exposes multiple resolutions and chromosomes. We use `gunz_cm.loaders` to query them.

if hic_path is None:
    print("Skipping: dataset not downloaded")
else:
    resolutions = get_resolutions(hic_path)
    print(f"Available resolutions: {sorted(resolutions)}")
    chroms = get_chrom_infos(hic_path)
    print("Chromosomes (first 5):")
    for name in list(chroms)[:5]:
        print(f"  {name}: {chroms[name]}")
    balancing = get_balancing(hic_path, resolution=1_000_000, chrom="chr1")
    print(f"Balancing methods (chr1, 1Mb): {balancing}")

Skipping: dataset not downloaded
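
Resolution and chromosome length together determine how large the contact matrix will be, which is worth checking before loading. The sketch below shows the arithmetic only. The helper names `n_bins` and `bin_index` are illustrative and not part of `gunz_cm`, and the chr1 length used is the GRCh38 value, which is an assumption because this tutorial does not state the file's assembly.

```python
def n_bins(chrom_length_bp, bin_size_bp):
    """Number of fixed-size bins needed to cover a chromosome (last bin may be partial)."""
    if bin_size_bp <= 0:
        raise ValueError("bin_size_bp must be positive")
    # Integer ceiling division avoids floating-point rounding on large lengths.
    return (chrom_length_bp + bin_size_bp - 1) // bin_size_bp


def bin_index(position_bp, bin_size_bp):
    """Zero-based bin that contains a zero-based genomic position."""
    return position_bp // bin_size_bp


CHR1_LENGTH_GRCH38 = 248_956_422  # assumption: GRCh38 assembly

bins_1mb = n_bins(CHR1_LENGTH_GRCH38, 1_000_000)
print(bins_1mb)  # 249, so a chr1-vs-chr1 matrix at 1 Mb would be 249 x 249
```

In practice the chromosome length would come from whatever `get_chrom_infos` reports for the file rather than from a constant.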

## 2. Load the contact matrix

Use `load_cm_data` at 1 Mb resolution on chr1 with KR balancing. The result is a COO-format contact matrix.
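
For intuition about what balancing does: it rescales a symmetric contact matrix by per-bin bias factors so that every row and column ends up with the same sum. The sketch below is not the Knight-Ruiz algorithm, which uses a Newton-type solver, and it is not how `gunz_cm` or the `.hic` file computes its KR vectors. It is a simpler damped Sinkhorn-style iteration with the same goal, run on a made-up 3 x 3 matrix.

```python
import math


def balance_symmetric(matrix, n_iter=200, tol=1e-9):
    """Scale a symmetric non-negative matrix so each non-empty row sums to about 1.

    Returns (balanced, bias) with balanced[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    w = [row[:] for row in matrix]  # work on a copy
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in w]
        # Square-root damping keeps the symmetric update from oscillating.
        scale = [math.sqrt(s) if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                w[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
        if all(abs(s - 1.0) < tol for s in sums if s > 0):
            break
    return w, bias


# Toy symmetric "contact" counts, invented for illustration.
raw = [
    [10.0, 4.0, 1.0],
    [4.0, 8.0, 3.0],
    [1.0, 3.0, 6.0],
]
balanced, bias = balance_symmetric(raw)
print([round(sum(row), 6) for row in balanced])
```

Requesting `Balancing.KR` in the loader call presumably applies normalization factors of this kind rather than recomputing them, but that detail is not confirmed here.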

if hic_path is None:
    print("Skipping: dataset not downloaded")
else:
    cm_df = load_cm_data(
        fpath=hic_path,
        bin_size_bp=1_000_000,
        region1="chr1",
        region2="chr1",
        balancing=Balancing.KR,
        output_format=DataStructure.COO,
    )
    print(f"Loaded contact matrix: shape={cm_df.shape}, nnz={cm_df.nnz}")
    print(f"  min count: {cm_df.data.min()}, max count: {cm_df.data.max()}")

Skipping: dataset not downloaded
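
The cell above reads `.shape`, `.nnz`, and `.data` from the result. Those attribute names match SciPy's COO sparse matrices, though the actual return type is not shown in this tutorial. To make the COO idea concrete without the real file, here is a dependency-free sketch with invented values: a COO matrix is three parallel arrays of row indices, column indices, and values. Storing only the upper triangle and mirroring it on expansion is an assumption about Hi-C storage, not something stated above.

```python
def coo_to_dense(rows, cols, data, n, symmetric=True):
    """Expand COO triplets into an n x n list-of-lists matrix."""
    dense = [[0.0] * n for _ in range(n)]
    for r, c, v in zip(rows, cols, data):
        dense[r][c] += v  # duplicate coordinates are summed
        if symmetric and r != c:
            dense[c][r] += v  # mirror upper-triangle entries
    return dense


# Toy upper-triangle contacts for a 4-bin region (values are made up).
rows = [0, 0, 1, 2]
cols = [0, 1, 1, 3]
data = [5.0, 2.0, 7.0, 1.5]

dense = coo_to_dense(rows, cols, data, n=4)
nnz = len(data)  # number of stored entries, analogous to .nnz
print(nnz, min(data), max(data))
for row in dense:
    print(row)
```

Because Hi-C matrices are mostly zeros away from the diagonal, keeping only the non-zero triplets is far more compact than the dense form, which is why COO is a natural output format for a loader.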

## Where to go next

* Convert this `.hic` to `.gzcm` with `convert_to_gzcm` (tutorial 03).
* Compare the same region across resolutions (tutorial 02).
* Run the dataset through `HiCTileDataset` (tutorial 30).