# Tutorial 26: GZCM Codec Picker and Adaptive Codec Selection

GZCM v4 supports `adaptive_codec=True`, which means the writer asks the codec picker (`gunz_cm.compressions.scheme_picker`) to choose a codec per region based on a 5% sample of the contacts. The picker scores each candidate on three axes: decode time, compressed size, and encode time, with default weights `(0.5, 0.3, 0.2)`. The lowest-score codec wins.

This tutorial explains the picker algorithm, shows how the picker can choose different codecs for different chromosomes (chr1 is dense near the diagonal; chr17 and chr22 are sparser), and demonstrates how to retune the weights for size-prioritized vs speed-prioritized use cases. The output also shows the metadata that gets written to the `.gzcm` header for downstream inspection.

## Learning Objectives

* Understand the picker’s scoring algorithm (3-axis weighted score)
* See how the picker can choose different codecs for different chromosomes
* Retune the picker weights for size-prioritized use
* Inspect the `meta["codec_picker"]` block written to a v4 file

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing `notebooks/_synthetic_data.py` helper. We build three chromosomes (chr1, chr17, chr22) to show how the picker scores each differently.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2

import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')

Repo root: /home/adhisant/workspace/gunz-cm

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

from pathlib import Path
import sys
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import numpy as np
import _synthetic_data

from gunz_cm.compressions.scheme_picker import pick_codec_for_region, _DEFAULT_CANDIDATES
from gunz_cm.compressions import get_codec, WireFormat
from gunz_cm.compressions import ZstdEncoder, ZstdDecoder, Lz4Encoder, Lz4Decoder

rng = np.random.default_rng(42)

## 1. The picker scoring algorithm

For a region of `n` contact bins, the picker:

1. Samples 5% of the rows with a fixed-seed RNG.
2. For each candidate codec (`cmc`, `zstd-3`, `lz4-hc-9` by default):
   - Encodes the sample (timed).
   - Decodes the sample (timed).
   - If encoding or decoding raises (e.g. CMC binary missing), the score is `+inf` and the codec loses to any working alternative.
3. Computes the weighted score:

   `score = 0.5 * (decode_ms / raw_size * 1024) + 0.3 * (compressed_size / raw_size) + 0.2 * (encode_ms / raw_size * 1024)`

Lower score wins. The default candidate set lives in `gunz_cm.compressions.scheme_picker._DEFAULT_CANDIDATES`; the weights default to `(0.5, 0.3, 0.2)` in (decode, size, encode) order.
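The three steps above can be sketched in plain Python. This is a minimal illustration, not the code in `scheme_picker`: the stand-in codecs (`zlib`, `lzma` from the standard library), the helper names `sample_rows`, `score_codec` and `pick`, and the Poisson test matrix are all assumptions made for the example. Only the 5% fixed-seed sampling, the `+inf` rule, and the score formula come from the description above.

```python
import lzma
import math
import time
import zlib

import numpy as np

# Stand-in codecs from the standard library. These are NOT the picker's
# real candidates (cmc, zstd-3, lz4-hc-9); they only illustrate the flow.
CANDIDATES = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def _broken(_data):
    """Simulates a codec whose backend is missing."""
    raise RuntimeError("codec unavailable")


def sample_rows(n_rows, fraction=0.05, seed=0):
    """Step 1: pick a fixed-seed sample of row indices (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    k = max(1, round(n_rows * fraction))
    return np.sort(rng.choice(n_rows, size=k, replace=False))


def score_codec(encode, decode, raw, weights=(0.5, 0.3, 0.2)):
    """Steps 2 and 3: time encode/decode, then apply the weighted formula."""
    w_decode, w_size, w_encode = weights
    try:
        t0 = time.perf_counter()
        blob = encode(raw)
        t1 = time.perf_counter()
        decode(blob)
        t2 = time.perf_counter()
    except Exception:
        return math.inf  # a failing codec loses to any working alternative
    encode_ms = (t1 - t0) * 1e3
    decode_ms = (t2 - t1) * 1e3
    raw_size = len(raw)
    return (
        w_decode * (decode_ms / raw_size * 1024)
        + w_size * (len(blob) / raw_size)
        + w_encode * (encode_ms / raw_size * 1024)
    )


def pick(candidates, raw, weights=(0.5, 0.3, 0.2)):
    """Return the lowest-scoring codec name together with all scores."""
    scores = {
        name: score_codec(enc, dec, raw, weights)
        for name, (enc, dec) in candidates.items()
    }
    return min(scores, key=scores.get), scores


# Usage: a synthetic 400 x 400 count matrix, sampled at 5% of its rows.
matrix = np.random.default_rng(42).poisson(3, size=(400, 400)).astype(np.uint32)
rows = sample_rows(matrix.shape[0])
raw = matrix[rows].tobytes()
candidates = dict(CANDIDATES, broken=(_broken, _broken))
winner, scores = pick(candidates, raw)
print(f"sampled {len(rows)} rows, winner: {winner}")
```

Because the timings depend on the machine, the winner can differ between runs; the `broken` entry always scores `+inf` and therefore never wins while another candidate works.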

print('Default picker candidates:', _DEFAULT_CANDIDATES)
print('Default weights: (decode=0.5, size=0.3, encode=0.2)')

Default picker candidates: {'cmc': <function <lambda> at 0x7f142d3298a0>, 'zstd-3': <function <lambda> at 0x7f142d329940>, 'lz4-hc-9': <function <lambda> at 0x7f142d329ee0>}
Default weights: (decode=0.5, size=0.3, encode=0.2)

## 2. Score three different chromosomes

The picker sees the SAME weights but DIFFERENT data, so it can choose different codecs for different chromosomes. In a real v4 file, the picker is called once per region (chromosome), so each region can have its own codec. (Per-tile picking is a v5+ future proposal.)

results = {}
for chrom in ('chr1', 'chr17', 'chr22'):
    # Pick n_bins per chromosome; chr1 is biggest, chr22 smallest.
    n_bins = {'chr1': 200, 'chr17': 100, 'chr22': 80}[chrom]
    mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
    mat_dense = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(mat_dense)
    counts = mat_dense[row_ids, col_ids].astype(np.uint32)

    chosen, bit_width = pick_codec_for_region(
        row_ids, counts, n=n_bins, tile_size=256,
    )
    # Keep n_bins with each result so the report below does not reuse
    # the value left over from the last loop iteration.
    results[chrom] = (chosen, bit_width, len(row_ids), n_bins)

for chrom, (codec, bw, n, bins) in results.items():
    print(f'{chrom:6s} n_bins={bins}  n_contacts={n:6d}  '
          f'picker_chose={codec:8s}  bit_width={bw}')

chr1   n_bins=200  n_contacts= 40000  picker_chose=lz4-hc-9  bit_width=2
chr17  n_bins=100  n_contacts= 10000  picker_chose=lz4-hc-9  bit_width=2
chr22  n_bins=80  n_contacts=  6400  picker_chose=lz4-hc-9  bit_width=2

## 3. Retune the weights for size-prioritized use

The default weights favor decode speed (0.5). For storage-constrained use cases, swap the weights to `(0.2, 0.6, 0.2)` (decode, size, encode) so size dominates. The picker will then tend to prefer `zstd-3` over `lz4-hc-9` whenever the size gap outweighs the speed gap; lz4 is 1.75× larger than zstd on this data. In the run recorded below, `lz4-hc-9` still wins under both weightings.
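To see how a weight change can flip the winner, here is a small worked example. The per-codec measurements are hypothetical numbers chosen for illustration (only the 1.75× size ratio mirrors the text above); they are not measured on the tutorial's data, and `weighted_score` and `choose` are illustrative helpers rather than part of `gunz_cm`.

```python
# Hypothetical measurements, already normalized as in the score formula:
# (decode ms per KiB, compressed/raw size ratio, encode ms per KiB).
measurements = {
    "lz4-hc-9": (0.10, 0.70, 0.20),  # fast, but 1.75x larger than zstd-3
    "zstd-3": (0.30, 0.40, 0.40),    # slower, but smaller
}


def weighted_score(measurement, weights):
    """Weighted sum in (decode, size, encode) order; lower is better."""
    return sum(w * x for w, x in zip(weights, measurement))


def choose(measurements, weights):
    """Return the codec name with the lowest weighted score."""
    return min(
        measurements,
        key=lambda name: weighted_score(measurements[name], weights),
    )


default_weights = (0.5, 0.3, 0.2)
size_weights = (0.2, 0.6, 0.2)

default_pick = choose(measurements, default_weights)
size_pick = choose(measurements, size_weights)
print(f"default weights chose: {default_pick}")
print(f"size-prioritized weights chose: {size_pick}")
```

With these numbers the default weights give 0.30 for `lz4-hc-9` against 0.35 for `zstd-3`, while the size-prioritized weights give 0.48 against 0.38, so the choice flips. Whether it flips on real data depends on how large the speed and size gaps actually are.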

size_weights = (0.2, 0.6, 0.2)  # decode, size, encode
mat = _synthetic_data.make_synthetic_hic(n_bins=200, bin_size_bp=50_000)
mat_dense = mat.data if hasattr(mat, 'data') else mat
row_ids, col_ids = np.nonzero(mat_dense)
counts = mat_dense[row_ids, col_ids].astype(np.uint32)

default_choice, _ = pick_codec_for_region(row_ids, counts, n=200, tile_size=256)
size_choice, _ = pick_codec_for_region(
    row_ids, counts, n=200, tile_size=256, weights=size_weights,
)
print(f'default weights chose: {default_choice}')
print(f'size-prioritized weights chose: {size_choice}')

default weights chose: lz4-hc-9
size-prioritized weights chose: lz4-hc-9

## 4. The `codec_picker` block in a real v4 file

When you call `convert_to_gzcm(..., version=4, adaptive_codec=True)`, the writer records the picker’s decision into `meta["codec_picker"]`. Below we use the low-level writer to inspect what gets written.

import tempfile, pathlib
from gunz_cm.io.gnz import GzcmWriter, GzcmReader
from gunz_cm.converters.gzcm import _write_gzcm_v4_intra

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_picker.gzcm'
    mat = _synthetic_data.make_synthetic_hic(n_bins=200, bin_size_bp=50_000)
    mat_dense = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(mat_dense)
    counts = mat_dense[row_ids, col_ids].astype(np.uint32)
    meta = {'resolution': 50_000, 'region': 'chr1', 'chromosome1': 'chr1'}
    writer = GzcmWriter(out, overwrite=True, version=4)
    _write_gzcm_v4_intra(
        writer, row_ids, col_ids, counts, n=200, meta=meta,
        tile_size=256, compression='zstd', adaptive_codec=True,
        codec_candidates=('zstd-3',),
    )
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print('meta["codec_picker"]:', saved['codec_picker'])

meta["codec_picker"]: {'adaptive': True, 'candidates': ['zstd-3'], 'chosen': 'zstd-3', 'writer_codec': 'zstd'}
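For downstream inspection, a small helper can summarize the block and flag an inconsistent one. The sketch below works on the dictionary printed above, copied verbatim; `describe_picker` is a hypothetical helper, and the check that `chosen` appears in `candidates` is an assumption about the block's invariants rather than something the file format is known to guarantee.

```python
# The block printed above, copied verbatim from the tutorial output.
picker_meta = {
    "adaptive": True,
    "candidates": ["zstd-3"],
    "chosen": "zstd-3",
    "writer_codec": "zstd",
}


def describe_picker(meta):
    """Return a one-line summary of a codec_picker block (hypothetical helper).

    Raises ValueError if the chosen codec is not among the candidates,
    which this sketch assumes should never happen in a valid file.
    """
    if meta["chosen"] not in meta["candidates"]:
        raise ValueError(
            f"chosen codec {meta['chosen']!r} is not in {meta['candidates']!r}"
        )
    mode = "adaptive" if meta["adaptive"] else "fixed"
    count = len(meta["candidates"])
    return f"{mode}: chose {meta['chosen']} from {count} candidate(s)"


summary = describe_picker(picker_meta)
print(summary)
```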

## 5. Summary

* The picker scores each candidate codec on `(decode, size, encode)` with default weights `(0.5, 0.3, 0.2)`.
* Per-region (per-chromosome) picking is the current v4 default. Per-tile picking is a v5+ proposal.
* Retuning weights is cheap: pass `weights=(decode, size, encode)` to `pick_codec_for_region`.

## Where to go from here

* Tutorial 27: write a v4 file with `convert_to_gzcm(adaptive_codec=True)`.
* Tutorial 25: codec registry and the wire-format contract.
