GZCM v3 → v4 migration guide

date: 2026-06-25
target_audience: gunz-cm users who currently produce or consume v3 .gzcm files

TL;DR

  • v3 files remain readable indefinitely. No action required.

  • v4 is opt-in via convert_to_gzcm(..., version=4).

  • v4 is projected to produce smaller files (~30-50% smaller than v3 zstd on chr1 of 4DNFI1UEG1HD) at comparable or better read speed; measured numbers will follow when v4 ships.

  • New default: the v4.0 release (gunz-cm v2.15.0) will bump convert_to_gzcm’s default version from 1 (the current default, which unhelpfully produces uncompressed dense files) to 3 (the current compressed-tile format). v4 itself remains opt-in.

What changes for users

Reading files (transparent)

from gunz_cm.datasets import GzcmTileDataset

# v3 file (existing) — works unchanged.
ds_v3 = GzcmTileDataset("chr1_v3_zstd.gzcm", window_size=1_000_000)

# v4 file (new) — same constructor, same API.
ds_v4 = GzcmTileDataset("chr1_v4.gzcm", window_size=1_000_000)

The reader dispatches on the header’s version field. v3 files use the existing GzcmReader path (unchanged). v4 files use the new GzcmV4Reader.
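The dispatch described above can be sketched roughly as follows. This is an illustration only, not gunz-cm's actual code: the stub classes stand in for GzcmReader and GzcmV4Reader, and the registry name and open_reader helper are hypothetical. The sketch also assumes an unknown or missing version should fail loudly rather than fall back to a default.

```python
class StubV3Reader:
    """Stand-in for the existing GzcmReader path (not the real class)."""

    def __init__(self, path: str) -> None:
        self.path = path


class StubV4Reader:
    """Stand-in for the new GzcmV4Reader (not the real class)."""

    def __init__(self, path: str) -> None:
        self.path = path


# Hypothetical registry mapping the header's version field to a reader class.
READER_BY_VERSION = {3: StubV3Reader, 4: StubV4Reader}


def open_reader(path: str, header: dict):
    """Pick a reader from the header's version field; reject unknown versions."""
    version = header.get("version")
    reader_cls = READER_BY_VERSION.get(version)
    if reader_cls is None:
        raise ValueError(f"unsupported GZCM version: {version!r}")
    return reader_cls(path)
```

With a registry like this, supporting a later format version would amount to adding one entry, and callers never need to know which reader class they received.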

Writing files (opt-in)

from gunz_cm.converters.gzcm import convert_to_gzcm

# v3 (current behavior).
convert_to_gzcm("in.hic", "out_v3.gzcm", region1="1", bin_size_bp=50_000, version=3)

# v4 (new).
convert_to_gzcm(
    "in.hic", "out_v4.gzcm",
    region1="1", bin_size_bp=50_000,
    version=4,                                # NEW flag
    chunk_size=10_000_000,                    # NEW: memory-bounded conversion
    enable_region_layout_picker=True,          # NEW: adaptive scheme picker
    enable_anti_diagonal_layout=True,          # NEW: .hic-style for intra
    enable_roaring_row_index=True,             # NEW: sparse-roaring regions
    codec_picker_weights=(0.5, 0.3, 0.2),    # NEW: (decode, size, encode) weights
)
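One plausible reading of codec_picker_weights is a weighted sum over per-codec scores for decode speed, file size, and encode speed. The sketch below illustrates that reading only; the real picker may work differently. The function name, the score scale (0 to 1, higher is better), and the candidate scores are all made up for illustration and are not measurements.

```python
def pick_codec(candidates: dict, weights: tuple) -> str:
    """Return the codec name with the highest weighted score.

    candidates maps codec name -> (decode, size, encode) scores in [0, 1],
    where higher is better. weights uses the same (decode, size, encode) order.
    """
    if not candidates:
        raise ValueError("no candidate codecs")

    def score(name: str) -> float:
        return sum(w * s for w, s in zip(weights, candidates[name]))

    return max(candidates, key=score)


# Illustrative scores only (hypothetical, not benchmark results).
EXAMPLE_SCORES = {
    "zstd-3": (0.9, 0.5, 0.8),  # fast decode, medium size, fast encode
    "cmc": (0.1, 1.0, 0.2),     # slow decode, smallest size, slow encode
}

decode_heavy = pick_codec(EXAMPLE_SCORES, (0.5, 0.3, 0.2))
size_heavy = pick_codec(EXAMPLE_SCORES, (0.1, 0.8, 0.1))
```

Under these made-up scores, the decode-heavy weights from the example above favor zstd-3, while weights that emphasize size favor cmc.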

Default behavior change (v2.15.0): convert_to_gzcm’s version default goes from 1 to 3. Rationale: on the chr1 benchmark reported below, version=1 produces a 47 MB file (uncompressed dense), whereas version=3 produces 9-16 MB (compressed tiles) and has been the de facto default in benchmarks. This change affects any user relying on the default; see “Mitigation for the default-version bump” below.

Mitigation for the default-version bump

Two options, both available:

(a) Explicit version in your pipeline (recommended):

# Before upgrade (relies on default):
convert_to_gzcm("in.hic", "out.gzcm", region1="1", bin_size_bp=50_000)

# After upgrade (explicit v3, same behavior):
convert_to_gzcm("in.hic", "out.gzcm", region1="1", bin_size_bp=50_000, version=3)

(b) Environment variable override (if you can’t audit all call sites):

export GUNZ_CM_GZCM_DEFAULT_VERSION=1   # restores old default

We will emit a DeprecationWarning whenever version is not specified at the call site. The warning will become a FutureWarning in v2.17.0, before the default is actually bumped, so pin the version in your pipelines ahead of time.
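The precedence between the explicit argument, the environment variable, and the built-in default might look like the sketch below. It is an assumption about the resolution order, not the library's implementation: resolve_version and its environ parameter are hypothetical names, and only the environment variable name and the warning category come from this guide.

```python
import os
import warnings

ENV_VAR = "GUNZ_CM_GZCM_DEFAULT_VERSION"
BUILTIN_DEFAULT = 3  # the post-bump default described in this guide


def resolve_version(version=None, environ=None) -> int:
    """Resolve the output version: explicit argument, then env var, then default."""
    env = os.environ if environ is None else environ
    if version is not None:
        return version  # explicit call-site value: no warning
    warnings.warn(
        "convert_to_gzcm called without an explicit version; "
        "pass version=... to pin the output format",
        DeprecationWarning,
        stacklevel=2,
    )
    override = env.get(ENV_VAR)
    if override is not None:
        return int(override)  # e.g. "1" restores the old default
    return BUILTIN_DEFAULT
```

To find call sites that rely on the default, one option is to run the test suite with `python -W error::DeprecationWarning`, which turns the warning into an exception at the offending call.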

What changes for downstream tools

hictk users

GZCM v4 is not a .cool file. To interoperate with .cool-based tooling (cooler, cooltools, higlass), convert v4 → v3 first:

from gunz_cm.converters.gzcm import convert_to_gzcm
from gunz_cm.datasets import GzcmTileDataset

# gunz-cm roundtrip: v4 → v3
ds_v4 = GzcmTileDataset("chr1_v4.gzcm", window_size=...)
# Read all regions, write to v3
convert_to_gzcm("chr1_v4.gzcm", "chr1_v3_for_cooltools.gzcm", region1="1", version=3)

Future work: a gunz-cm dump-to-cool CLI command that would use tiledbsoma or cooler.create to write .cool files directly from GZCM v4 reads. (Not in v4.0 scope.)

TileDB-SOMA users

Not yet applicable. GZCM v4 is forward-compatible with a future TileDB-SOMA backend (see specs/gzcm-v4-design.md §13) but no direct bridge ships in v4.0.

HDF5 / cooler users

No change. v3 GZCM files remain readable. v4 files require conversion to v3 (or to .cool via cooler CLI) for use with HDF5-based tools.

File-format compatibility matrix

| Writer version | Reader (gunz-cm v2.13.x) | Reader (gunz-cm v2.14.x) | Reader (gunz-cm v2.15+) |
|---|---|---|---|
| v1, v2, v3 | reads OK | reads OK | reads OK |
| v4 | ERROR: unknown version, defaults to v1 (silent corruption risk) | ERROR: same | reads OK via new GzcmV4Reader |

Important: gunz-cm versions < v2.15.0 (the version that ships GZCM v4 support) will silently misread v4 files as v1 because GzcmReader._parse_header defaults to version = header.get("version", 1) if the version key is missing (src/gunz_cm/io/gnz.py:592).

Mitigation: v4.0 always writes "version": 4 in the JSON header. If you write v4 files, do not downgrade the gunz-cm reader to a version < v2.15.0.
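A pipeline that consumes v4 files could guard against an accidental downgrade by checking the installed gunz-cm version before reading. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, only the v2.15.0 threshold comes from the matrix above, and pre-release suffixes such as "rc1" are ignored by the parser.

```python
import re

MIN_V4_READER = (2, 15, 0)  # first gunz-cm release that reads v4, per this guide


def parse_release(version: str) -> tuple:
    """Extract the leading numeric (major, minor, patch) from a version string."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def can_read_gzcm_v4(installed_version: str) -> bool:
    """True if the given gunz-cm version is at or above the v4-capable release."""
    return parse_release(installed_version) >= MIN_V4_READER
```

In practice the version string could come from `importlib.metadata.version(...)` with the package's distribution name, which this guide does not state, so the sketch takes the string as a parameter instead.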

v3 → v4 metadata key differences

Verified against a real v3 file (4DNFI1UEG1HD chr1 @ 50 kb, KR, zstd):

v3 metadata keys (top-level metadata dict):

balancing, chromosome1, compression, n_tiles, original_shape,
padded_shape, region, resolution, source_file, tile_size,
tiles, version, version_gzcm

v4 metadata keys (added/replaced vs v3):

  • All v3 keys are preserved (backward compatibility).

  • metadata["regions"] is added: list of per-region descriptors (see spec §4.3 for full shape).

  • metadata["tiles"] (v3 per-tile bbox dict) is superseded by metadata["regions"][i]["tile_bboxes"] in v4. v4 readers still parse tiles if present (legacy compat).

  • NEW weights_KR, weights_VC arrays in arrays: v4 stores balancing weight vectors in the file. v3 did not (weights were fetched on-demand from the source .hic).
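Code that inspects tile bounding boxes directly can handle both layouts by preferring the v4 location and falling back to the v3 key. The helper below is a sketch under that assumption: the function name is hypothetical, and the shape of each bbox entry is left opaque because it is defined in the spec rather than in this guide.

```python
def get_tile_bboxes(metadata: dict, region_index: int = 0):
    """Return tile bboxes from v4 'regions' if present, else from v3 'tiles'."""
    regions = metadata.get("regions")
    if regions:
        # v4: per-region descriptors carry their own tile_bboxes.
        return regions[region_index]["tile_bboxes"]
    if "tiles" in metadata:
        # v3 (or legacy-compatible v4): top-level per-tile bbox dict.
        return metadata["tiles"]
    raise KeyError("metadata has neither 'regions' nor 'tiles'")
```

The fallback order mirrors the bullet above: regions supersedes tiles, but tiles is still honored when it is the only source.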

v4 regions list example (chrom 1 only, KR):

"regions": [
  {
    "id": 0,
    "name": "chr1:chr1",
    "layout": "sparse-tiled-intra",
    "n_tiles": 16,
    "tile_size": 256,
    "codec_per_tile": ["zstd-3", "zstd-3", "cmc", "cmc", ...]
  }
]

For inter-chromosomal regions (e.g., chr1:chr2), the layout is typically sparse-roaring. For dense regions (>50% density), it’s dense with blosc-lz4 codec.
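The layout rules above can be summarized as a small decision function. This is a sketch, not the actual region layout picker: the function name is hypothetical, and checking density before the intra/inter distinction is an assumption, since the guide only says the sparse-roaring choice is "typical".

```python
def pick_layout(chrom1: str, chrom2: str, density: float) -> str:
    """Choose a region layout name from chromosome pair and fill density.

    density is the fraction of non-zero pixels in the region, in [0, 1].
    """
    if not 0.0 <= density <= 1.0:
        raise ValueError(f"density must be in [0, 1], got {density}")
    if density > 0.5:
        return "dense"  # stored with the blosc-lz4 codec per this guide
    if chrom1 != chrom2:
        return "sparse-roaring"  # typical choice for inter-chromosomal regions
    return "sparse-tiled-intra"
```

For the chr1:chr1 example region above, a low-density call such as `pick_layout("chr1", "chr1", 0.02)` yields the sparse-tiled-intra layout shown in the regions list.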

Storage cost comparison (preliminary benchmarks)

Tested on 4DNFI1UEG1HD.hic chr1 at 50 kb resolution, 1 Mb window (window_size = 1,000,000):

| Format | File size | Single-thread throughput | Notes |
|---|---|---|---|
| GZCM v1 dense | 47 MB | 11,926 items/s | Uncompressed baseline |
| GZCM v3 cmc | 9.0 MB | 87 items/s | Slowest decode, smallest file |
| GZCM v3 zstd | 16 MB | 4,467 items/s | v2.14.0 default fallback codec |
| GZCM v4 (predicted) | 6-10 MB | 5,000-7,000 items/s | BTRBlocks adaptive + Roaring + simdcomp |

The v4 row is a projection based on the v2.14.0 benchmark at benchmarks/results/tile_dataset_benchmark_v2.14.0.md. Actual v4 numbers will be measured with the new benchmarks/datasets/benchmark_gzcm_v4_vs_v3.py when v4 ships.
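As a quick sanity check, the predicted v4 range can be expressed relative to the v3 zstd size and compared with the release checklist's target of v4 file size ≤ 70% of v3 zstd. The arithmetic below uses only the sizes from the table; the variable and function names are illustrative.

```python
V3_ZSTD_MB = 16.0              # v3 zstd file size from the table
V4_PREDICTED_MB = (6.0, 10.0)  # predicted v4 range from the table
CHECKLIST_MAX_RATIO = 0.70     # release checklist: v4 <= 70% of v3 zstd


def size_ratio(v4_mb: float, v3_mb: float) -> float:
    """Return v4 size as a fraction of v3 size."""
    if v3_mb <= 0:
        raise ValueError("v3 size must be positive")
    return v4_mb / v3_mb


ratios = [size_ratio(mb, V3_ZSTD_MB) for mb in V4_PREDICTED_MB]
reductions = [1.0 - r for r in ratios]  # fraction saved relative to v3 zstd
meets_target = all(r <= CHECKLIST_MAX_RATIO for r in ratios)
```

Here the predicted range corresponds to 37.5% to 62.5% of the v3 zstd size, so both ends of the projection would satisfy the 70% checklist threshold if the measurements bear them out.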

When to migrate to v4

  • Migrate now if you:

    • Store large Hi-C matrices (>10M pixels per region).

    • Run many training epochs over the same file (decode speed matters).

    • Need sparse row index access (Roaring bitmaps are 2-4× smaller than uint32 arrays).

    • Hit the v3 zstd/bsc decode speed ceiling.

  • Stay on v3 if you:

    • Have existing production pipelines depending on v3 file format.

    • Have downstream tools that read GZCM directly (rare — most use cooler/h5).

    • Are not CPU-bound on data loading (the common case; v3 is already fast enough).

  • Defer the decision if you:

    • Have small data (<1M pixels per file).

    • Don’t iterate frequently over the same file.

    • Are still evaluating gunz-cm for adoption.

Pre-flight checklist for v4.0 release

Before tagging v2.15.0:

  • [ ] GzcmV4Reader and GzcmV4Writer implemented.

  • [ ] _write_gzcm_v4 and _convert_chunked_v4 implemented.

  • [ ] convert_to_gzcm(..., version=4) smoke-tested on 4DNFI1UEG1HD chr1.

  • [ ] GzcmTileDataset reads v3 + v4 transparently.

  • [ ] benchmark_gzcm_v4_vs_v3.py reports v4 file size ≤ 70% of v3 zstd.

  • [ ] All v3 regression tests pass unchanged.

  • [ ] CHANGELOG entry under v2.15.0 with migration note.

  • [ ] docs/source/user_guide/datasets.md updated with v4 examples.

  • [ ] Deprecation warning fires on convert_to_gzcm without explicit version argument.

This guide is a living document. It will be updated with concrete migration examples as v4 ships and user feedback arrives.