# Tutorial 26: GZCM Codec Picker and Adaptive Codec Selection

GZCM v4 supports adaptive_codec=True, which means the writer asks the codec picker (gunz_cm.compressions.scheme_picker) to choose a codec per region based on a 5% sample of the contacts. The picker scores each candidate on three axes: decode time, compressed size, and encode time, with default weights (0.5, 0.3, 0.2). The lowest-score codec wins.

This tutorial explains the picker algorithm, shows the picker choosing different codecs for different chromosomes (chr1 is dense near the diagonal; chr17 and chr22 are sparser), and demonstrates how to retune the weights for size-prioritized vs speed-prioritized use cases. The output also shows the metadata that gets written to the .gzcm header for downstream inspection.

## Learning Objectives

* Understand the picker’s scoring algorithm (3-axis weighted score)
* See the picker choosing different codecs for different chromosomes
* Retune the picker weights for size-prioritized use
* Inspect the meta["codec_picker"] block written to a v4 file

## Prerequisites

* gunz-cm installed: pip install gunz-cm
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing notebooks/_synthetic_data.py helper. We build three chromosomes (chr1, chr17, chr22) to show how the picker scores each differently.

---
# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from pathlib import Path
import sys
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import numpy as np
import _synthetic_data
from gunz_cm.compressions.scheme_picker import pick_codec_for_region, _DEFAULT_CANDIDATES
from gunz_cm.compressions import get_codec, WireFormat
from gunz_cm.compressions import ZstdEncoder, ZstdDecoder, Lz4Encoder, Lz4Decoder
rng = np.random.default_rng(42)
## 1. The picker scoring algorithm

For a region of n contact bins, the picker:

1. Samples 5% of the rows with a fixed-seed RNG.
2. For each candidate codec (cmc, zstd-3, lz4-hc-9 by default):
   - Encodes the sample (timed).
   - Decodes the sample (timed).
   - If encoding or decoding raises (e.g. CMC binary missing), the score is +inf and the codec loses to any working alternative.
3. Computes the weighted score:

        score = 0.5 * (decode_ms / raw_size * 1024)
              + 0.3 * (compressed_size / raw_size)
              + 0.2 * (encode_ms / raw_size * 1024)

Lower score wins. The default candidates live in gunz_cm.compressions.scheme_picker._DEFAULT_CANDIDATES; the weights can be overridden through the weights argument of pick_codec_for_region.
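The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the gunz-cm implementation: the zlib and lzma codecs stand in for cmc, zstd-3, and lz4-hc-9, and the names `score_codec`, `pick_from_sample`, and `unavailable_encode` are hypothetical helpers introduced only for this example.

```python
import lzma
import math
import random
import time
import zlib

# Stand-in candidates (assumption): stdlib codecs, not GZCM's cmc / zstd-3 / lz4-hc-9.
STAND_IN_CANDIDATES = {
    "zlib-6": (lambda raw: zlib.compress(raw, 6), zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def unavailable_encode(raw: bytes) -> bytes:
    """Simulates a codec whose backend is missing."""
    raise RuntimeError("codec backend not installed")


def score_codec(sample, encode, decode, weights=(0.5, 0.3, 0.2)):
    """Time one encode/decode round trip; weights are (decode, size, encode)."""
    w_decode, w_size, w_encode = weights
    raw_size = len(sample)
    try:
        start = time.perf_counter()
        blob = encode(sample)
        encode_ms = (time.perf_counter() - start) * 1000.0
        start = time.perf_counter()
        decode(blob)
        decode_ms = (time.perf_counter() - start) * 1000.0
    except Exception:
        return math.inf  # a failing codec loses to any working one
    return (w_decode * (decode_ms / raw_size * 1024)
            + w_size * (len(blob) / raw_size)
            + w_encode * (encode_ms / raw_size * 1024))


def pick_from_sample(rows, candidates, sample_frac=0.05, seed=0):
    """Score every candidate on a fixed-seed row sample; return (winner, scores)."""
    n_sample = max(1, round(len(rows) * sample_frac))
    picked = sorted(random.Random(seed).sample(range(len(rows)), n_sample))
    sample = b"".join(rows[i] for i in picked)
    scores = {name: score_codec(sample, enc, dec)
              for name, (enc, dec) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical data: 200 rows of 400 small counts each, mostly zeros.
data_rng = random.Random(42)
demo_rows = [bytes(data_rng.choice((0, 0, 0, 1, 2)) for _ in range(400))
             for _ in range(200)]
demo_candidates = dict(STAND_IN_CANDIDATES,
                       broken=(unavailable_encode, zlib.decompress))
winner, demo_scores = pick_from_sample(demo_rows, demo_candidates)
print(winner, demo_scores)
```

Because two of the three terms are wall-clock timings, the exact scores vary from run to run; only the `broken` entry is fixed, at +inf.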
print('Default picker candidates:', _DEFAULT_CANDIDATES)
print('Default weights: (decode=0.5, size=0.3, encode=0.2)')
Default picker candidates: {'cmc': <function <lambda> at 0x7f142d3298a0>, 'zstd-3': <function <lambda> at 0x7f142d329940>, 'lz4-hc-9': <function <lambda> at 0x7f142d329ee0>}
Default weights: (decode=0.5, size=0.3, encode=0.2)
## 2. Score three different chromosomes

The picker sees the SAME weights but DIFFERENT data, so it can choose different codecs for different chromosomes. In a real v4 file, the picker is called once per region (chromosome), so each region can have its own codec. (Per-tile picking is a v5+ future proposal.)
results = {}
for chrom in ('chr1', 'chr17', 'chr22'):
    # Pick n_bins per chromosome; chr1 is biggest, chr22 smallest.
    n_bins = {'chr1': 200, 'chr17': 100, 'chr22': 80}[chrom]
    mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
    # Unwrap to a dense array, then extract the nonzero contacts.
    mat_dense = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(mat_dense)
    counts = mat_dense[row_ids, col_ids].astype(np.uint32)
    chosen, bit_width = pick_codec_for_region(
        row_ids, counts, n=n_bins, tile_size=256,
    )
    # Store n_bins with the result so the report prints the per-chromosome value.
    results[chrom] = (chosen, bit_width, len(row_ids), n_bins)
for chrom, (codec, bw, n, n_bins) in results.items():
    print(f'{chrom:6s} n_bins={n_bins} n_contacts={n:6d} '
          f'picker_chose={codec:8s} bit_width={bw}')
chr1 n_bins=200 n_contacts= 40000 picker_chose=lz4-hc-9 bit_width=2
chr17 n_bins=100 n_contacts= 10000 picker_chose=lz4-hc-9 bit_width=2
chr22 n_bins=80 n_contacts= 6400 picker_chose=lz4-hc-9 bit_width=2
## 3. Retune the weights for size-prioritized use

The default weights favor decode speed (0.5). For storage-constrained use cases, change the weights to (0.2, 0.6, 0.2) (decode, size, encode) so size dominates. The picker should then prefer zstd-3 over lz4-hc-9 for most regions, because lz4 output is about 1.75× larger than zstd output on this data. In the run recorded below, the small synthetic matrix still yields lz4-hc-9 under both weightings.
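To see how reweighting can flip the winner, the score formula can be evaluated on fixed numbers. The measurements below are made up for illustration (they are not benchmarks of zstd or lz4), and `weighted_score` and `pick_with_weights` are hypothetical helper names, not part of gunz-cm.

```python
def weighted_score(decode_ms, encode_ms, compressed_size, raw_size,
                   weights=(0.5, 0.3, 0.2)):
    """Score from the tutorial's formula; weights are (decode, size, encode)."""
    w_decode, w_size, w_encode = weights
    return (w_decode * (decode_ms / raw_size * 1024)
            + w_size * (compressed_size / raw_size)
            + w_encode * (encode_ms / raw_size * 1024))


# Hypothetical measurements for a 100 KiB sample (made up for illustration).
RAW_SIZE = 100 * 1024
MEASUREMENTS = {
    "fast-codec": {"decode_ms": 10.0, "encode_ms": 20.0, "compressed_size": 70_000},
    "small-codec": {"decode_ms": 40.0, "encode_ms": 60.0, "compressed_size": 40_000},
}


def pick_with_weights(weights):
    """Return (winner, scores) for the hypothetical measurements above."""
    scores = {name: weighted_score(raw_size=RAW_SIZE, weights=weights, **m)
              for name, m in MEASUREMENTS.items()}
    return min(scores, key=scores.get), scores


sketch_default, sketch_default_scores = pick_with_weights((0.5, 0.3, 0.2))
sketch_size, sketch_size_scores = pick_with_weights((0.2, 0.6, 0.2))
print(sketch_default, sketch_size)  # fast-codec small-codec
```

With the default weights the faster codec scores about 0.295 against 0.437; with the size-prioritized weights the smaller codec scores about 0.434 against 0.470. The flip only happens when the size gap outweighs the timing gap, which is why a small input can keep the same winner under both weightings.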
size_weights = (0.2, 0.6, 0.2)  # decode, size, encode
mat = _synthetic_data.make_synthetic_hic(n_bins=200, bin_size_bp=50_000)
# Unwrap to a dense array, then extract the nonzero contacts.
mat_dense = mat.data if hasattr(mat, 'data') else mat
row_ids, col_ids = np.nonzero(mat_dense)
counts = mat_dense[row_ids, col_ids].astype(np.uint32)
default_choice, _ = pick_codec_for_region(row_ids, counts, n=200, tile_size=256)
size_choice, _ = pick_codec_for_region(
    row_ids, counts, n=200, tile_size=256, weights=size_weights,
)
print(f'default weights chose: {default_choice}')
print(f'size-prioritized weights chose: {size_choice}')
default weights chose: lz4-hc-9
size-prioritized weights chose: lz4-hc-9
## 4. The codec_picker block in a real v4 file

When you call convert_to_gzcm(..., version=4, adaptive_codec=True), the writer records the picker’s decision into meta["codec_picker"]. Below we use the low-level writer to inspect what gets written.
import tempfile, pathlib
from gunz_cm.io.gnz import GzcmWriter, GzcmReader
from gunz_cm.converters.gzcm import _write_gzcm_v4_intra
with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_picker.gzcm'
    mat = _synthetic_data.make_synthetic_hic(n_bins=200, bin_size_bp=50_000)
    # Unwrap to a dense array, then extract the nonzero contacts.
    mat_dense = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(mat_dense)
    counts = mat_dense[row_ids, col_ids].astype(np.uint32)
    meta = {'resolution': 50_000, 'region': 'chr1', 'chromosome1': 'chr1'}
    writer = GzcmWriter(out, overwrite=True, version=4)
    _write_gzcm_v4_intra(
        writer, row_ids, col_ids, counts, n=200, meta=meta,
        tile_size=256, compression='zstd', adaptive_codec=True,
        codec_candidates=('zstd-3',),
    )
    # Read the metadata back while the temporary directory still exists.
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print('meta["codec_picker"]:', saved['codec_picker'])
meta["codec_picker"]: {'adaptive': True, 'candidates': ['zstd-3'], 'chosen': 'zstd-3', 'writer_codec': 'zstd'}
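For downstream inspection, the block can be summarized with a few lines of plain Python. This is a sketch that relies only on the four keys visible in the output above; `describe_codec_picker` is a hypothetical helper, and both the handling of a missing block and the check that the chosen codec appears among the candidates are assumptions rather than documented gunz-cm behavior.

```python
def describe_codec_picker(meta: dict) -> str:
    """Summarize the codec_picker block of a metadata dict as one line."""
    picker = meta.get("codec_picker")
    if picker is None:
        # Assumption: files written without adaptive_codec may lack the block.
        return "no codec_picker block"
    chosen = picker["chosen"]
    candidates = picker["candidates"]
    # Assumption: a consistent block lists the chosen codec among its candidates.
    if chosen not in candidates:
        raise ValueError(f"chosen codec {chosen!r} not in candidates {candidates}")
    mode = "adaptive" if picker["adaptive"] else "fixed"
    return (f"{mode}: chose {chosen} from {len(candidates)} candidate(s); "
            f"writer_codec={picker['writer_codec']}")


# Metadata copied from the output printed above.
saved_example = {
    "codec_picker": {
        "adaptive": True,
        "candidates": ["zstd-3"],
        "chosen": "zstd-3",
        "writer_codec": "zstd",
    }
}
summary = describe_codec_picker(saved_example)
print(summary)  # adaptive: chose zstd-3 from 1 candidate(s); writer_codec=zstd
```

In the notebook, passing the `saved` dict returned by reader.get_metadata() instead of `saved_example` gives the same one-line summary.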