# Tutorial: Load a real Hi-C dataset via GUNZ_CM_TUTORIAL_DATA

This tutorial walks through loading the canonical public GM12878 chr1 Hi-C dataset from 4DNucleome (accession 4DNFI1UEG1HD) using the `gunz_cm.loaders` API. The path to the file is resolved at runtime from the `GUNZ_CM_TUTORIAL_DATA` env var, so no contributor's filesystem path is ever hardcoded.

## Setup

Before running, populate the data directory once:

```bash
mkdir -p ~/gunz_cm_tutorial_data
python scripts/download_tutorial_data.py --name gm12878_chr1_1mb \
    --target ~/gunz_cm_tutorial_data
export GUNZ_CM_TUTORIAL_DATA=~/gunz_cm_tutorial_data
```

## Learning Objectives

* Resolve a canonical dataset via `notebooks/_tutorial_data.load_tutorial_dataset`.
* Load a `.hic` contact matrix at a specific resolution + chromosome.
* Inspect metadata (resolutions, balancing, chromosome info).
* Apply KR balancing and read the matrix back as COO.

## Estimated Time

Approximately 5 minutes after the data is downloaded.

## Prerequisites

* `gunz-cm` installed (this repo).
* `GUNZ_CM_TUTORIAL_DATA` set to a directory containing `4DNFI1UEG1HD.hic`.
```python
import os
import sys
from pathlib import Path

from gunz_cm.loaders import (
    load_cm_data,
    get_resolutions,
    get_chrom_infos,
    get_balancing,
)
from gunz_cm.consts import Balancing, DataStructure

# Resolve _tutorial_data.py via repo-root probe.
_probe = Path.cwd().resolve()
_root = None
while _probe != _probe.parent:
    if (_probe / "notebooks" / "_tutorial_data.py").is_file() and (_probe / "pyproject.toml").is_file():
        _root = _probe
        break
    _probe = _probe.parent
if _root is None:
    raise RuntimeError(f"cannot find gunz-cm repo root from cwd {Path.cwd()}")
sys.path.insert(0, str(_root / "notebooks"))

from _tutorial_data import load_tutorial_dataset, TutorialDataError

try:
    hic_path = load_tutorial_dataset("gm12878_chr1_1mb")
    print(f"Using real dataset: {hic_path}")
except TutorialDataError as exc:
    print(f"Skip: {exc}")
    hic_path = None  # the rest of the cells will skip cleanly
```
```text
Skip: dataset 'gm12878_chr1_1mb' expected at /home/adhisant/gunz_cm_tutorial_data/4DNFI1UEG1HD.hic but file is missing; run scripts/download_tutorial_data.py --name gm12878_chr1_1mb
```
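The environment-variable lookup described in the introduction can be sketched roughly as follows. This is not the implementation of `load_tutorial_dataset`; the function `resolve_dataset`, the error class `DatasetNotFoundError`, and the `environ` parameter are hypothetical names introduced only to illustrate the pattern of reading a directory from `GUNZ_CM_TUTORIAL_DATA` and failing with a clear message when the file is absent.

```python
import os
import tempfile
from pathlib import Path


class DatasetNotFoundError(RuntimeError):
    """Hypothetical error type; the real helper raises TutorialDataError."""


def resolve_dataset(filename, env_var="GUNZ_CM_TUTORIAL_DATA", environ=None):
    """Return the path to `filename` inside the directory named by `env_var`.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    data_dir = environ.get(env_var)
    if not data_dir:
        raise DatasetNotFoundError(f"{env_var} is not set")
    path = Path(data_dir).expanduser() / filename
    if not path.is_file():
        raise DatasetNotFoundError(f"expected {path} but file is missing")
    return path


# Demonstration against a throwaway directory, so no real data is needed.
with tempfile.TemporaryDirectory() as tmp:
    fake_env = {"GUNZ_CM_TUTORIAL_DATA": tmp}
    try:
        resolve_dataset("4DNFI1UEG1HD.hic", environ=fake_env)
        found_before = True
    except DatasetNotFoundError:
        found_before = False  # the file has not been created yet

    (Path(tmp) / "4DNFI1UEG1HD.hic").touch()  # empty placeholder file
    resolved_name = resolve_dataset("4DNFI1UEG1HD.hic", environ=fake_env).name
```

Raising a dedicated exception type, rather than returning `None`, lets the calling cell decide whether a missing dataset is fatal or merely a reason to skip, which is the choice the notebook makes above.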
## 1. Inspect the file's metadata

The `.hic` file exposes multiple resolutions and chromosomes. We use `gunz_cm.loaders` to query them.
```python
if hic_path is None:
    print("Skipping: dataset not downloaded")
else:
    resolutions = get_resolutions(hic_path)
    print(f"Available resolutions: {sorted(resolutions)}")

    chroms = get_chrom_infos(hic_path)
    print("Chromosomes (first 5):")
    for name in list(chroms)[:5]:
        print(f"  {name}: {chroms[name]}")

    balancing = get_balancing(hic_path, resolution=1_000_000, chrom="chr1")
    print(f"Balancing methods (chr1, 1Mb): {balancing}")
```
```text
Skipping: dataset not downloaded
```
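Resolution and chromosome length together determine how large the loaded matrix will be: a chromosome of length L bp binned at resolution r bp yields ceil(L / r) bins per axis. The sketch below works through that arithmetic. The chr1 length used here is the GRCh38 value and is an assumption about the assembly; when the dataset is available, the length reported by `get_chrom_infos` is the one to trust. The helper `n_bins` is a hypothetical name, not part of `gunz_cm`.

```python
def n_bins(chrom_length_bp, resolution_bp):
    """Number of bins needed to cover a chromosome (ceiling division)."""
    return -(-chrom_length_bp // resolution_bp)


# Assumed chr1 length (GRCh38), for illustration only.
CHR1_LENGTH_BP = 248_956_422

for resolution in (1_000_000, 100_000, 10_000):
    bins = n_bins(CHR1_LENGTH_BP, resolution)
    print(f"{resolution:>9} bp -> {bins} bins, dense shape {bins} x {bins}")
```

Under that assumption, a 1 Mb intrachromosomal chr1 matrix has 249 bins per side, which is small enough to inspect densely; finer resolutions grow quadratically, which is one reason sparse formats are preferred.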
## 2. Load the contact matrix

Use `load_cm_data` at 1 Mb resolution on chr1 with KR balancing. The result is a COO-format contact matrix.
```python
if hic_path is None:
    print("Skipping: dataset not downloaded")
else:
    cm_df = load_cm_data(
        fpath=hic_path,
        bin_size_bp=1_000_000,
        region1="chr1",
        region2="chr1",
        balancing=Balancing.KR,
        output_format=DataStructure.COO,
    )
    print(f"Loaded contact matrix: shape={cm_df.shape}, nnz={cm_df.nnz}")
    print(f"  min count: {cm_df.data.min()}, max count: {cm_df.data.max()}")
```
```text
Skipping: dataset not downloaded
```
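For readers unfamiliar with the COO (coordinate) layout, the following standard-library sketch shows the idea on a toy 3 x 3 matrix: three parallel sequences hold the row index, column index, and value of each stored entry. It does not use `gunz_cm`, and the values are made up for illustration. Whether `load_cm_data` stores only the upper triangle of an intrachromosomal matrix is not stated here, so the `symmetric` flag below is an assumption that can be switched off.

```python
# Toy COO data: entry k lives at (rows[k], cols[k]) with value data[k].
rows = [0, 0, 1, 2]
cols = [0, 1, 1, 2]
data = [5.0, 2.0, 7.0, 3.0]
shape = (3, 3)


def coo_to_dense(rows, cols, data, shape, symmetric=False):
    """Expand COO triplets into a dense list-of-lists matrix.

    With symmetric=True, each off-diagonal entry is mirrored across the
    diagonal, as one would do for upper-triangle-only storage.
    """
    dense = [[0.0] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(rows, cols, data):
        dense[r][c] += v
        if symmetric and r != c:
            dense[c][r] += v
    return dense


dense = coo_to_dense(rows, cols, data, shape, symmetric=True)
nnz = len(data)  # number of explicitly stored entries

for row in dense:
    print(row)
print(f"shape={shape}, nnz={nnz}, min={min(data)}, max={max(data)}")
```

The `shape`, `nnz`, and min/max summary printed at the end mirrors the quantities the loading cell above reports for the real matrix.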