# Tutorial 27: Writing GZCM v4 Files with Adaptive Codec Picker

GZCM v4 is an opt-in container format that differs from v3 in two ways: (1) tiles are enumerated in anti-diagonal order instead of row-major, which keeps near-diagonal (dense) tiles in cache longer during sequential reads; (2) the writer can use the adaptive codec picker to choose a codec per region. The file wire format is backward-compatible with v3 decoders; only the metadata structure changes (v4 stores per-tile bboxes in meta["regions"][0]["tile_bboxes"] instead of v3’s meta["tiles"] dict).

This tutorial walks through the full v4 writer API: the adaptive_codec=True flag, the region_layouts parameter, and the codec_candidates picker candidates. You will write a v4 file from synthetic data, inspect the resulting metadata, and round-trip the contacts back through the v4 reader.

## Learning Objectives

* Call convert_to_gzcm(..., version=4, adaptive_codec=True)
* Inspect the meta["codec_picker"] block the writer records
* Understand the region_layouts parameter and the sparse-tiled-intra layout
* Verify a v4 file round-trips end-to-end

## Prerequisites

* gunz-cm installed: pip install gunz-cm
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing notebooks/_synthetic_data.py helper. No external data files.

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE

def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts

rng = np.random.default_rng(42)

## 1. Build a synthetic v4-compatible test bed

v4 works with any source format the loaders support. To keep this tutorial self-contained we mock the loader to return a small synthetic contact matrix; the writer doesn’t care where the COO data came from.

n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Reuse the helper defined above instead of repeating its body inline.
row_ids, col_ids, counts = _row_col_counts(mat)
print(f'Synthetic chr1: {n_bins}x{n_bins} bins, {len(row_ids):,} contacts')
Synthetic chr1: 200x200 bins, 40,000 contacts
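
To make the dense-to-COO step concrete, here is a minimal sketch of the same extraction on a hand-written 3x3 matrix. The matrix values are made up for illustration and are not Hi-C data; only NumPy calls are used.

```python
import numpy as np

# Tiny symmetric stand-in for a dense contact matrix (illustrative values only).
dense = np.array([
    [4, 1, 0],
    [1, 3, 0],
    [0, 0, 2],
])

# np.nonzero returns the row and column indices of the non-zero entries,
# ordered row by row.
row_ids, col_ids = np.nonzero(dense)

# Fancy indexing pulls out the matching counts; cast to the dtype the
# tutorial uses for counts.
counts = dense[row_ids, col_ids].astype(np.uint32)

for r, c, v in zip(row_ids, col_ids, counts):
    print(r, c, v)
```

Zero entries are dropped, so the three arrays together hold only the observed contacts. That is the COO triple the DataFrame in the next section is built from.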

## 2. Write a v4 file with the adaptive codec picker

convert_to_gzcm(..., version=4, adaptive_codec=True) calls the codec picker to choose a codec per region (chromosome) based on a 5% sample. The default candidates are ('cmc', 'zstd-3', 'lz4-hc-9'); we narrow to just zstd-3 here so the picker always returns it.
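
The idea of picking a codec from a small sample can be sketched with standard-library compressors. This is not the gunz-cm picker. The stand-in codec names, the chunked sampling, and the "smallest output wins" rule are all assumptions made for illustration, since the tutorial does not describe how the real picker scores candidates.

```python
import bz2
import lzma
import random
import zlib

# Stand-in codecs from the standard library. The tutorial's real candidates
# are 'cmc', 'zstd-3' and 'lz4-hc-9'.
CANDIDATES = {
    "zlib-6": lambda b: zlib.compress(b, 6),
    "bz2-9": lambda b: bz2.compress(b, 9),
    "lzma": lzma.compress,
}


def pick_codec(payload, candidates=CANDIDATES, sample_frac=0.05,
               chunk=4096, seed=0):
    """Return (name, sizes) for the candidate that compresses a sample best.

    Assumption: the smallest compressed size wins. The scoring rule of the
    actual picker is not documented in this tutorial.
    """
    if not payload:
        raise ValueError("payload is empty")
    # Split into fixed-size chunks and draw roughly sample_frac of them.
    chunks = [payload[i:i + chunk] for i in range(0, len(payload), chunk)]
    n_pick = max(1, round(len(chunks) * sample_frac))
    sample = b"".join(random.Random(seed).sample(chunks, n_pick))
    sizes = {name: len(fn(sample)) for name, fn in candidates.items()}
    return min(sizes, key=sizes.get), sizes


payload = bytes(range(256)) * 2000
chosen, sizes = pick_codec(payload)
print(chosen, sizes)

# With a single candidate the choice is forced, which mirrors passing
# codec_candidates=('zstd-3',) in the tutorial.
only, _ = pick_codec(payload, {"zlib-6": CANDIDATES["zlib-6"]})
print(only)
```

Narrowing the candidate set to one entry makes the result deterministic, which is why the tutorial does it for a reproducible run.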

df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')

    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df

    try:
        convert_to_gzcm(
            fpath=fake_hic,
            output_fpath=out,
            region1='chr1',
            bin_size_bp=50_000,
            version=4,
            tile_size=256,
            compression='zstd',
            overwrite=True,
            adaptive_codec=True,
            codec_candidates=('zstd-3',),
        )
    finally:
        loaders.load_cm_data = original

    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print(f'  layout: {saved["layout"]}')
    print(f'  version_gzcm: {saved["version_gzcm"]}')
    print(f'  n_tiles: {saved["n_tiles"]}')
    print(f'  codec_picker: {saved["codec_picker"]}')
  layout: sparse-tiled-intra
  version_gzcm: 4
  n_tiles: 1
  codec_picker: {'adaptive': True, 'candidates': ['zstd-3'], 'chosen': 'zstd-3', 'writer_codec': 'zstd'}
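
The save, replace, and restore steps around loaders.load_cm_data can also be written with unittest.mock.patch.object, which restores the original attribute even if the body raises. The sketch below uses a SimpleNamespace as a stand-in for the gunz_cm.loaders module so that it runs without gunz-cm installed. The stand-in object and the string return values are hypothetical.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the gunz_cm.loaders module (hypothetical, for illustration).
loaders = SimpleNamespace(load_cm_data=lambda *a, **kw: "real data")

fake_df = "synthetic data"

# Inside the with-block the attribute points at the fake loader;
# on exit the original is restored automatically.
with mock.patch.object(loaders, "load_cm_data", lambda *a, **kw: fake_df):
    inside = loaders.load_cm_data("fake.hic")

after = loaders.load_cm_data("fake.hic")
print(inside, "|", after)
```

Like the manual try/finally version, this only takes effect if the caller looks up load_cm_data through the module attribute at call time.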

## 3. Inspect the per-tile bboxes

v4 stores per-tile bounding boxes in meta["regions"][0]["tile_bboxes"] (a list of dicts) instead of v3’s meta["tiles"] (a dict). Each bbox has tile_name, row_start, col_start, row_end, col_end, and diagonal. The diagonal field is the v4-specific addition: tiles are written in increasing diagonal order, which keeps near-diagonal (dense) tiles adjacent on disk for better cache locality.

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic, output_fpath=out, region1='chr1',
            bin_size_bp=50_000, version=4, tile_size=256,
            compression='zstd', overwrite=True,
            adaptive_codec=True, codec_candidates=('zstd-3',),
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    bboxes = saved['regions'][0]['tile_bboxes']
    print(f'  {len(bboxes)} tiles written')
    print(f'  first 3 bboxes:')
    for bb in bboxes[:3]:
        print(f'    {bb}')
  1 tiles written
  first 3 bboxes:
    {'col_end': 200, 'col_start': 0, 'diagonal': 0, 'row_end': 200, 'row_start': 0, 'tile_name': 'tile_0'}
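
A quick sanity check on the bbox list can confirm the ordering described above. The helper below is a sketch, not part of gunz-cm. It assumes the end coordinates are exclusive (consistent with row_end of 200 for a 200-bin matrix) and that "increasing diagonal order" means the diagonal field never decreases from one entry to the next.

```python
def check_bboxes(bboxes):
    """Validate v4-style bbox dicts and return how many were checked.

    Assumptions (not confirmed by the tutorial): *_end is exclusive, and
    the 'diagonal' field is non-decreasing along the list.
    """
    prev_diag = None
    for bb in bboxes:
        if bb["row_start"] >= bb["row_end"] or bb["col_start"] >= bb["col_end"]:
            raise ValueError(f"empty or inverted bbox: {bb['tile_name']}")
        if prev_diag is not None and bb["diagonal"] < prev_diag:
            raise ValueError(f"diagonal order broken at {bb['tile_name']}")
        prev_diag = bb["diagonal"]
    return len(bboxes)


# The single bbox printed in the cell output above.
bboxes = [
    {"col_end": 200, "col_start": 0, "diagonal": 0,
     "row_end": 200, "row_start": 0, "tile_name": "tile_0"},
]
print(check_bboxes(bboxes))
```

With tile_size=256 and only 200 bins the file holds a single tile, so the ordering check is trivially satisfied here. It becomes informative once n_bins exceeds the tile size.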

## 4. region_layouts and the v4 layout vocabulary

region_layouts is a dict mapping chromosome name to a layout identifier. v4 only writes one layout (sparse-tiled-intra); future versions will add sparse-roaring for inter-chromosomal regions. The v4.1 conversion stores the layout under the region’s layout key, which the reader normalizes to the same internal shape as v3’s meta["tiles"]. The cell below passes layout rather than region_layouts, so the default mapping applies.

# The default region_layouts (when None) is:
#   {region1.split(":")[0]: layout or 'sparse-tiled-intra'}
with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic, output_fpath=out, region1='chr1',
            bin_size_bp=50_000, version=4, tile_size=256,
            compression='zstd', overwrite=True,
            layout='sparse-tiled-intra',
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print(f'  layout: {saved["layout"]}')
    print(f'  regions[0][layout]: {saved["regions"][0]["layout"]}')
  layout: sparse-tiled-intra
  regions[0][layout]: sparse-tiled-intra
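
The default mapping quoted in the comment at the top of the cell can be reproduced as a small function. The function name and the coordinate-style region string are hypothetical. The expression itself is copied from that comment.

```python
def default_region_layouts(region1, layout=None):
    # Same expression as in the cell comment: take the chromosome name
    # before any ':' and fall back to 'sparse-tiled-intra'.
    chrom = region1.split(":")[0]
    return {chrom: layout or "sparse-tiled-intra"}


print(default_region_layouts("chr1"))
# {'chr1': 'sparse-tiled-intra'}
print(default_region_layouts("chr1:0-5000000"))
# {'chr1': 'sparse-tiled-intra'}
```

Because of the `or`, passing layout=None or an empty string both fall back to sparse-tiled-intra.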

## 5. Summary

* convert_to_gzcm(..., version=4, adaptive_codec=True) produces a v4 file with the picker choosing a codec per region.
* The writer records the picker’s decision in meta["codec_picker"] and the per-tile bboxes (including diagonal) in meta["regions"][0]["tile_bboxes"].
* region_layouts and layout set the v4 layout vocabulary.
* Custom codecs (Tutorial 25) plug into the picker by being registered in the registry.

## Where to go from here

* Tutorial 28: read this v4 file with GzcmDataset.
* Tutorial 25: codec registry and the v5.1 wire-format contract.