GZCM Format#

GZCM (GunZ Contact Matrix) is a binary container format for Hi-C contact matrices, designed for efficient storage, memory-mapped access, and optional compression.

Version Overview#

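| Version | Storage Type    | Compression    | Use Case                                |
|---------|-----------------|----------------|-----------------------------------------|
| v1      | Dense row-major | None           | Small matrices, simple use              |
| v2      | Tiled 4D blocks | None           | Large matrices, streaming writes        |
| v3      | Tiled + CMC     | BSC, ZSTD, CMC | Maximum compression for large Hi-C data |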

File Layout#

All versions share the same header structure:

┌─────────────────────────────────────┐
│ Magic: "GZCM" (4 bytes)             │
├─────────────────────────────────────┤
│ Header Length (4 bytes, uint32)     │
├─────────────────────────────────────┤
│ Header JSON (4096-byte aligned)     │
├─────────────────────────────────────┤
│ Array Data (version-dependent)      │
└─────────────────────────────────────┘
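
The layout above can be read with nothing but the standard library. The following is a minimal sketch, not the library's own parser. It assumes the length field is a little-endian uint32, that the length covers the padded JSON region, and that padding is whitespace or NUL bytes; none of these details are stated above. `read_gzcm_header` is a hypothetical helper name.

```python
import io
import json
import struct

MAGIC = b"GZCM"

def read_gzcm_header(stream):
    """Parse the magic, header length, and JSON header from a binary stream."""
    magic = stream.read(4)
    if magic != MAGIC:
        raise ValueError(f"not a GZCM file: magic={magic!r}")
    # Assumption: the length field is a little-endian uint32.
    (header_len,) = struct.unpack("<I", stream.read(4))
    raw = stream.read(header_len)
    if len(raw) != header_len:
        raise ValueError("truncated header")
    # Assumption: alignment padding consists of NUL bytes or whitespace.
    return json.loads(raw.rstrip(b"\x00 \n").decode("utf-8"))

# Build a tiny synthetic file in memory to exercise the parser.
header = {"version": 1, "metadata": {"region": "chr1"}, "arrays": {}}
payload = json.dumps(header).encode("utf-8")
payload += b" " * (4096 - 8 - len(payload))  # pad so array data would start at byte 4096
blob = MAGIC + struct.pack("<I", len(payload)) + payload

parsed = read_gzcm_header(io.BytesIO(blob))
print(parsed["version"], len(blob))
```

In practice `GzcmReader` handles this step; the sketch is only meant to make the byte layout concrete.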

Header Schema#

The JSON header contains version, metadata, and array descriptions:

{
  "version": 3,
  "metadata": {
    "resolution": 10000,
    "region": "chr1",
    "source_file": "/path/to/input.hic"
  },
  "arrays": {
    "matrix": {
      "dtype": "float32",
      "shape": [1000, 1000],
      "offset": 4096,
      "order": "C"
    }
  }
}
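
As an illustration of how the `arrays` entries describe the file, the sketch below computes the byte range an uncompressed array would occupy from its `dtype`, `shape`, and `offset`. It assumes dtype strings follow NumPy naming and that the array is stored uncompressed (as in v1). The `ITEMSIZE` table and `array_extent` helper are hypothetical.

```python
from math import prod

# Assumption: dtype strings follow NumPy naming; extend the table as needed.
ITEMSIZE = {"float32": 4, "float64": 8, "int32": 4, "int64": 8, "uint8": 1}

def array_extent(desc):
    """Return (start, end) byte offsets of an uncompressed array description."""
    nbytes = prod(desc["shape"]) * ITEMSIZE[desc["dtype"]]
    return desc["offset"], desc["offset"] + nbytes

header = {
    "version": 1,
    "arrays": {
        "matrix": {"dtype": "float32", "shape": [1000, 1000], "offset": 4096, "order": "C"}
    },
}

for name, desc in header["arrays"].items():
    start, end = array_extent(desc)
    print(f"{name}: bytes {start}..{end} ({end - start} bytes)")
```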

Version Details#

v1 — Dense Storage#

The simplest format. Matrix is stored as a flat row-major array.

Characteristics:

  • No chunking — entire matrix must fit in memory

  • No compression — uses full float32 storage

  • Direct memory-mapping available

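When to use: Small matrices (< 100k bins) or when simplicity is preferred. Dense float32 storage needs 4·N² bytes for N bins (about 400 MB at 10,000 bins and 40 GB at 100,000 bins), so check the footprint against available memory.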
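
Because a v1 matrix is a flat row-major array, the position of any element follows directly from the header fields. The sketch below locates element (i, j) in a synthetic buffer. It assumes little-endian float32 values and uses a 16-byte stand-in header in place of a real one; `element_offset` is a hypothetical helper.

```python
import struct

def element_offset(desc, i, j, itemsize=4):
    """Byte offset of element (i, j) in a row-major (order "C") 2-D array."""
    n_rows, n_cols = desc["shape"]
    if not (0 <= i < n_rows and 0 <= j < n_cols):
        raise IndexError((i, j))
    return desc["offset"] + (i * n_cols + j) * itemsize

# Synthetic 3x4 float32 matrix stored after a 16-byte stand-in header.
desc = {"dtype": "float32", "shape": [3, 4], "offset": 16, "order": "C"}
values = [float(v) for v in range(12)]
blob = b"\x00" * 16 + struct.pack("<12f", *values)  # assumption: little-endian

pos = element_offset(desc, 2, 1)
(value,) = struct.unpack_from("<f", blob, pos)
print(pos, value)
```

This is the same arithmetic a memory-mapped array performs internally, which is why v1 supports direct memory-mapping.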

v2 — Tiled Storage#

Matrix is divided into fixed-size blocks (tiles) for cache-efficient access.

Characteristics:

  • Streaming writes without full matrix in memory

  • Cache-efficient block processing

  • Random block access

  • Block sizes: 256, 512, or 1024

When to use: Large matrices where you want block-wise processing without loading the full matrix.
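
The index arithmetic behind tiling can be sketched as follows. The functions are illustrative helpers, not part of the gunz_cm API, and the sketch says nothing about the order in which tiles are laid out on disk, which is defined by the v2 specification.

```python
def tile_grid(n_rows, n_cols, block):
    """Number of tiles along each axis (the last tile may be partial)."""
    return -(-n_rows // block), -(-n_cols // block)

def locate(row, col, block):
    """Map a matrix coordinate to (tile_row, tile_col, row_in_tile, col_in_tile)."""
    tile_row, in_row = divmod(row, block)
    tile_col, in_col = divmod(col, block)
    return tile_row, tile_col, in_row, in_col

def tile_bounds(tile_row, tile_col, n_rows, n_cols, block):
    """Half-open row and column ranges covered by a tile, clipped to the matrix."""
    r0, c0 = tile_row * block, tile_col * block
    return (r0, min(r0 + block, n_rows)), (c0, min(c0 + block, n_cols))

print(tile_grid(1000, 1000, 256))
print(locate(700, 300, 256))
print(tile_bounds(3, 3, 1000, 1000, 256))
```

With a block size of 256, a 1000 x 1000 matrix needs a 4 x 4 grid, and the tiles in the last row and column are clipped to the matrix edge.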

v3 — Compressed Tiles#

Tiles are independently compressed using CMC (Contact Matrix Codec).

Characteristics:

  • ~10-25x compression for Hi-C data

  • Random tile access (decompress single tile)

  • Supported codecs: bsc_cmc (default), cmc_zstd (recommended for cross-platform use), cmc, bsc, zstd

  • Lossless only

When to use: Large matrices where storage size is a concern. Recommended for most production use.
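
Independent per-tile compression is what makes random tile access possible: each tile can be decoded without reading its neighbours. The sketch below illustrates the idea with `zlib` from the standard library as a stand-in codec; it is not CMC, BSC, or ZSTD. The `(offset, length)` index is a hypothetical layout, not the v3 on-disk format.

```python
import struct
import zlib

def pack_tiles(tiles):
    """Compress each tile on its own and record (offset, length) per tile."""
    index, chunks, pos = {}, [], 0
    for key, values in tiles.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        comp = zlib.compress(raw)  # stand-in for the real codec
        index[key] = (pos, len(comp))
        chunks.append(comp)
        pos += len(comp)
    return index, b"".join(chunks)

def read_tile(index, blob, key):
    """Decompress a single tile without touching any other tile."""
    offset, length = index[key]
    raw = zlib.decompress(blob[offset:offset + length])
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Two synthetic 2x2 tiles, flattened row by row.
tiles = {(0, 0): [5.0, 1.0, 1.0, 4.0], (0, 1): [0.0, 0.0, 0.0, 2.0]}
index, blob = pack_tiles(tiles)
print(read_tile(index, blob, (0, 1)))
```

The round trip returns the original values exactly, which mirrors the lossless guarantee listed above.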


CLI Usage#

Convert to GZCM#

# Convert .hic to GZCM v3 (recommended)
gunz-cm converters to-gzcm input.hic output.gzcm chr1 10000 --version 3 --compression bsc_cmc

# Convert to GZCM v2 (tiled, no compression)
gunz-cm converters to-gzcm input.hic output.gzcm chr1 10000 --version 2

# Convert to GZCM v1 (dense)
gunz-cm converters to-gzcm input.hic output.gzcm chr1 10000 --version 1
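
To convert several regions in one go, the commands above can be generated from Python. This sketch only builds and prints the command lines, using the positional arguments and flags shown above. The region list, the output file naming, and the `to_gzcm_command` helper are illustrative assumptions.

```python
import shlex

def to_gzcm_command(hic_path, out_path, region, resolution, version=3, compression=None):
    """Build the argument list for `gunz-cm converters to-gzcm`."""
    cmd = ["gunz-cm", "converters", "to-gzcm", hic_path, out_path,
           region, str(resolution), "--version", str(version)]
    if compression is not None:
        cmd += ["--compression", compression]
    return cmd

# Hypothetical region names and output file names.
for region in ["chr1", "chr2", "chrX"]:
    cmd = to_gzcm_command("input.hic", f"{region}.gzcm", region, 10000, compression="bsc_cmc")
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```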

Normalize GZCM#

# Knight-Ruiz normalization
gunz-cm converters normalize input.gzcm normalized.gzcm --method kr

# ICE normalization
gunz-cm converters normalize input.gzcm normalized.gzcm --method ice

Python API#

Writing#

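from gunz_cm.io.gnz import GzcmWriter
import numpy as np

# Placeholder data; replace with a real contact matrix of the same shape
contact_matrix = np.zeros((1000, 1000), dtype=np.float32)

# Write GZCM v3 (compressed)
writer = GzcmWriter("matrix.gzcm", overwrite=True, version=3)
writer.set_metadata({"resolution": 10000, "region": "chr1"})
writer.init_streaming_array("matrix", (1000, 1000), dtype=np.float32)
writer.write()

# Fill the array through its writable view, then flush to disk
mm = writer.get_array_writable("matrix")
mm[:] = contact_matrix
mm.flush()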

Reading#

from gunz_cm.io.gnz import GzcmReader

reader = GzcmReader("matrix.gzcm")
matrix = reader.get_array("matrix")
print(f"Shape: {matrix.shape}, Metadata: {reader.metadata}")

Chunked Access#

from gunz_cm.io.gnz import GzcmChunkedReader

def process(chunk):
    # Placeholder: replace with your own per-chunk computation
    pass

chunked = GzcmChunkedReader("matrix.gzcm", chunk_size=1024)
for chunk, r, c in chunked.iter_chunks():
    process(chunk)
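
A typical use of chunked iteration is a streaming reduction that never holds the full matrix. The sketch below accumulates per-row sums from `(chunk, r, c)` triples. It assumes that `r` and `c` are the starting row and column of each chunk, which is not confirmed above, and it feeds the function hand-made chunks in place of a real reader.

```python
def row_sums_from_chunks(chunks, n_rows):
    """Accumulate per-row sums from (chunk, row_start, col_start) triples."""
    totals = [0.0] * n_rows
    for chunk, row_start, _col_start in chunks:
        for k, row in enumerate(chunk):
            totals[row_start + k] += float(sum(row))
    return totals

# Stand-in for chunked.iter_chunks(): a 4x4 matrix split into 2x2 chunks.
fake_chunks = [
    ([[1, 2], [3, 4]], 0, 0),
    ([[0, 1], [1, 0]], 0, 2),
    ([[2, 0], [0, 2]], 2, 0),
    ([[5, 5], [5, 5]], 2, 2),
]
print(row_sums_from_chunks(fake_chunks, n_rows=4))
```

Row sums of this kind are the basic quantity that balancing methods such as KR and ICE iterate on.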

Streaming Normalization#

from gunz_cm.io.gnz import kr_normalize_gzcm, ice_normalize_gzcm

# Knight-Ruiz normalization
weights = kr_normalize_gzcm("matrix.gzcm", "normalized.gzcm")

# ICE normalization
weights = ice_normalize_gzcm("matrix.gzcm", "normalized.gzcm")

Supported Codecs (v3)#

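| Codec    | Description         | Recommended                       |
|----------|---------------------|-----------------------------------|
| bsc_cmc  | BSC + CMC (default) | Yes (best compression)            |
| cmc_zstd | CMC + ZSTD          | Yes (good cross-platform support) |
| cmc      | CMC only            | For compatibility                 |
| bsc      | BSC only            | Good but less portable            |
| zstd     | ZSTD only           | Fast but less compression         |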


Compression Ratios#

Typical compression ratios for Hi-C data at 10 kb resolution:

| Tile Size | Compression Ratio |
|-----------|-------------------|
| 64        | ~10-20x           |
| 128       | ~15-25x           |
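
To turn these ratios into a storage estimate, divide the dense size by the ratio. The sketch below does this for a hypothetical region of 25,000 bins stored as float32; the bin count is an illustrative assumption, and real ratios depend on the data.

```python
def dense_nbytes(n_bins, itemsize=4):
    """Size in bytes of a dense n_bins x n_bins matrix (float32 by default)."""
    return n_bins * n_bins * itemsize

def compressed_range(n_bins, low_ratio, high_ratio):
    """Estimated compressed size range in bytes for a ratio interval."""
    raw = dense_nbytes(n_bins)
    return raw / high_ratio, raw / low_ratio

raw = dense_nbytes(25_000)  # hypothetical 25,000-bin region
small, large = compressed_range(25_000, 10, 25)
print(f"raw: {raw / 1e9:.2f} GB, compressed: {small / 1e6:.0f}-{large / 1e6:.0f} MB")
```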


See Also#

  • GZCM v1 Specification — Dense storage format

  • GZCM v2 Specification — Tiled storage format

  • GZCM v3 Specification — CMC compressed format

  • Codec Guide — Detailed codec selection