# Tutorial 27: Writing GZCM v4 Files with Adaptive Codec Picker

GZCM v4 is an opt-in container format that differs from v3 in two ways: (1) tiles are enumerated in anti-diagonal order instead of row-major, which keeps near-diagonal (dense) tiles in cache longer during sequential reads; (2) the writer can use the adaptive codec picker to choose a codec per region. The file wire format is backward-compatible with v3 decoders; only the metadata structure changes (v4 stores per-tile bboxes in `meta["regions"][0]["tile_bboxes"]` instead of v3’s `meta["tiles"]` dict).

This tutorial walks through the full v4 writer API: the `adaptive_codec=True` flag, the `region_layouts` parameter, and the `codec_candidates` picker candidates. You will write a v4 file from synthetic data, inspect the resulting metadata, and round-trip the contacts back through the v4 reader.

## Learning Objectives

* Call `convert_to_gzcm(..., version=4, adaptive_codec=True)`
* Inspect the `meta["codec_picker"]` block the writer records
* Understand the `region_layouts` parameter and the `sparse-tiled-intra` layout
* Verify a v4 file round-trips end-to-end

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing `notebooks/_synthetic_data.py` helper. No external data files.

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE
def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts

rng = np.random.default_rng(42)

## 1. Build a synthetic v4-compatible test bed

v4 works with any source format the loaders support. To keep this tutorial self-contained we mock the loader to return a small synthetic contact matrix; the writer doesn’t care where the COO data came from.

n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
mat_dense = mat.data if hasattr(mat, 'data') else mat
row_ids, col_ids = np.nonzero(mat_dense)
counts = mat_dense[row_ids, col_ids].astype(np.uint32)
print(f'Synthetic chr1: {n_bins}x{n_bins} bins, {len(row_ids):,} contacts')
Synthetic chr1: 200x200 bins, 40,000 contacts

## 2. Write a v4 file with the adaptive codec picker

`convert_to_gzcm(..., version=4, adaptive_codec=True)` calls the codec picker to choose a codec per region (chromosome) based on a 5% sample. The default candidates are `('cmc', 'zstd-3', 'lz4-hc-9')`; we narrow to just `zstd-3` here so the picker always returns it.

df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})
with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic,
            output_fpath=out,
            region1='chr1',
            bin_size_bp=50_000,
            version=4,
            tile_size=256,
            compression='zstd',
            overwrite=True,
            adaptive_codec=True,
            codec_candidates=('zstd-3',),
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print(f' layout: {saved["layout"]}')
    print(f' version_gzcm: {saved["version_gzcm"]}')
    print(f' n_tiles: {saved["n_tiles"]}')
    print(f' codec_picker: {saved["codec_picker"]}')
layout: sparse-tiled-intra
version_gzcm: 4
n_tiles: 1
codec_picker: {'adaptive': True, 'candidates': ['zstd-3'], 'chosen': 'zstd-3', 'writer_codec': 'zstd'}
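
To make the "pick a codec from a 5% sample" idea concrete, here is a minimal sketch of one way such a picker could work. It is not gunz-cm's implementation: the function `pick_codec`, the fixed-width record layout, and the stand-in codecs from the Python standard library (`zlib`, `bz2`, `lzma`) are all assumptions chosen so the example runs without extra dependencies. The real candidates named in this tutorial (`cmc`, `zstd-3`, `lz4-hc-9`) are not used here.

```python
import bz2
import lzma
import random
import zlib

# Stand-in codecs from the standard library. These are NOT the gunz-cm
# candidates ('cmc', 'zstd-3', 'lz4-hc-9'); they only illustrate the idea.
CANDIDATES = {
    "zlib-6": lambda b: zlib.compress(b, 6),
    "bz2-9": lambda b: bz2.compress(b, 9),
    "lzma": lzma.compress,
}


def pick_codec(records, candidates=CANDIDATES, sample_frac=0.05, seed=0):
    """Return (best_name, sizes): the candidate with the smallest output on a sample."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * sample_frac))
    # Sample record indices, then sort them so the sample keeps the original order.
    idx = sorted(rng.sample(range(len(records)), k))
    payload = b"".join(records[i] for i in idx)
    # Compress the same sample with every candidate and compare output sizes.
    sizes = {name: len(fn(payload)) for name, fn in candidates.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes


# Hypothetical input: 10,000 fixed-width (row, col, count) records.
records = [
    i.to_bytes(4, "little") + (i + 1).to_bytes(4, "little") + (i % 7).to_bytes(4, "little")
    for i in range(10_000)
]
best, sizes = pick_codec(records)
print(best, sizes)
```

With a single candidate the sketch has only one option to return, which mirrors why passing `codec_candidates=('zstd-3',)` makes the recorded `chosen` value predictable.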

## 3. Inspect the per-tile bboxes

v4 stores per-tile bounding boxes in `meta["regions"][0]["tile_bboxes"]` (a list of dicts) instead of v3’s `meta["tiles"]` (a dict). Each bbox has `tile_name`, `row_start`, `col_start`, `row_end`, `col_end`, and `diagonal`. The `diagonal` field is the v4-specific addition: tiles are written in increasing diagonal order, which keeps near-diagonal (dense) tiles adjacent on disk for better cache locality.

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic, output_fpath=out, region1='chr1',
            bin_size_bp=50_000, version=4, tile_size=256,
            compression='zstd', overwrite=True,
            adaptive_codec=True, codec_candidates=('zstd-3',),
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    bboxes = saved['regions'][0]['tile_bboxes']
    print(f' {len(bboxes)} tiles written')
    print(f' first 3 bboxes:')
    for bb in bboxes[:3]:
        print(f' {bb}')
1 tiles written
first 3 bboxes:
{'col_end': 200, 'col_start': 0, 'diagonal': 0, 'row_end': 200, 'row_start': 0, 'tile_name': 'tile_0'}
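
With only one tile in this file, the ordering is not visible in the output above. The sketch below shows what a diagonal-first enumeration could look like on a larger grid. It rests on several assumptions the tutorial does not confirm: that `diagonal` is the tile's offset from the main diagonal (tile column index minus tile row index), that only upper-triangle tiles are stored, that `row_end` and `col_end` are exclusive and clamped to the matrix size, and that tiles are named `tile_<i>` in write order. The function `enumerate_tiles` is hypothetical and is not part of gunz-cm.

```python
def enumerate_tiles(n_bins, tile_size):
    """List upper-triangle tile bboxes sorted by distance from the main diagonal."""
    n_tiles_per_side = -(-n_bins // tile_size)  # ceiling division
    coords = [
        (tr, tc)
        for tr in range(n_tiles_per_side)
        for tc in range(tr, n_tiles_per_side)
    ]
    # Sort by diagonal offset first, then by tile row, so tiles on the
    # main diagonal come first and far off-diagonal tiles come last.
    coords.sort(key=lambda rc: (rc[1] - rc[0], rc[0]))
    bboxes = []
    for i, (tr, tc) in enumerate(coords):
        bboxes.append({
            "tile_name": f"tile_{i}",
            "row_start": tr * tile_size,
            "col_start": tc * tile_size,
            # End coordinates are exclusive and clamped to the matrix size.
            "row_end": min((tr + 1) * tile_size, n_bins),
            "col_end": min((tc + 1) * tile_size, n_bins),
            "diagonal": tc - tr,
        })
    return bboxes


# 600 bins with 256-bin tiles gives a 3x3 tile grid (6 upper-triangle tiles).
for bb in enumerate_tiles(n_bins=600, tile_size=256):
    print(bb["tile_name"], bb["diagonal"], (bb["row_start"], bb["col_start"]))
```

Under these assumptions, `enumerate_tiles(200, 256)` yields a single bbox spanning rows and columns 0 to 200 with `diagonal` 0, the same values as the bbox printed above.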

## 4. `region_layouts` and the v4 layout vocabulary

`region_layouts` is a dict mapping chromosome name to a layout identifier. v4 only writes one layout (`sparse-tiled-intra`); future versions will add `sparse-roaring` for inter-chromosomal regions. The v4.1 conversion stores the layout under the region’s `layout` key, which the reader normalizes to the same internal shape as v3’s `meta["tiles"]`.

# The default region_layouts (when None) is:
# {region1.split(":")[0]: layout or 'sparse-tiled-intra'}
with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic, output_fpath=out, region1='chr1',
            bin_size_bp=50_000, version=4, tile_size=256,
            compression='zstd', overwrite=True,
            layout='sparse-tiled-intra',
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print(f' layout: {saved["layout"]}')
    print(f' regions[0][layout]: {saved["regions"][0]["layout"]}')
layout: sparse-tiled-intra
regions[0][layout]: sparse-tiled-intra
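
The default described in the comment at the top of this cell can be written out as a small standalone function. This is a sketch of that one expression only: the name `resolve_region_layouts` is hypothetical, and the assumption that an explicit `region_layouts` mapping takes precedence over `layout` is mine, not something the tutorial states.

```python
def resolve_region_layouts(region1, layout=None, region_layouts=None):
    """Return the region -> layout mapping, filling in the documented default."""
    if region_layouts is not None:
        # Assumption: an explicit mapping wins over the single `layout` argument.
        return dict(region_layouts)
    # Strip any ":start-end" suffix so only the chromosome name is used as the key.
    chrom = region1.split(":")[0]
    return {chrom: layout or "sparse-tiled-intra"}


print(resolve_region_layouts("chr1"))
print(resolve_region_layouts("chr2:1000000-2000000"))
```

Both calls map the chromosome name to `sparse-tiled-intra`, which is consistent with the `layout` and `regions[0]["layout"]` values printed above.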