Tutorial: Downsample a Hi-C contact matrix to a coarser resolution#
This tutorial demonstrates how to downsample a Hi-C contact matrix from a fine resolution (e.g. 1 Mb) to a coarser resolution (e.g. 5 Mb). You will learn how to:
Understand the two main families of downsampling: binning (spatial aggregation) and random subsampling (read-count subsampling)
Downsample a matrix by binning adjacent bins and summing contact counts
Verify the shape transform (
N x N->(N//k) x (N//k))Visualize the original and downsampled contact maps side-by-side
Compare the manual approach with
gunz_cm.preprocs.rand_downsample(the stochastic read-subsampling API)
The worked example uses a synthetic 10x10 matrix at 1 Mb resolution; we downsample to 2x2 at 5 Mb resolution by binning 5x5 blocks and summing.
Initialization#
Set auto-reload so edits to the source code are picked up without restarting the kernel.
%load_ext autoreload
%autoreload 2
Add the repository root to sys.path so the gunz_cm package can be imported.
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
What is downsampling?#
Downsampling reduces the resolution of a Hi-C contact matrix — that is, it reduces the number of bins. This is useful for speeding up downstream analysis, comparing experiments done at different resolutions, or fitting large matrices into memory. Two common strategies are used in genomics: (1) binning (a.k.a. coarse-graining), which groups adjacent bins and aggregates their contact counts (sum or mean), preserving total read mass, and (2) random subsampling, which keeps the bin resolution but discards a fraction of the contacts (e.g. to simulate lower sequencing depth). Common downsample rates are 1 Mb -> 5 Mb (5x), 1 Mb -> 10 Mb (10x), and 25 kb -> 100 kb (4x). This tutorial covers the binning approach, which is the standard for resolution reduction; we also show how to invoke gunz_cm.preprocs.rand_downsample for the random-subsampling case.
Imports#
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from gunz_cm.preprocs import rand_downsample
Step 1: Create a synthetic 1 Mb-resolution contact matrix#
We build a 10x10 contact matrix representing 10 genomic bins at 1 Mb resolution. Contact counts decay with genomic distance (a hallmark of Hi-C data) and we add a hotspot to make structure visible at the coarser resolution.
rng = np.random.default_rng(42)
n = 10
mat = np.zeros((n, n), dtype=float)
for i in range(n):
for j in range(i, n):
dist = abs(i - j)
w = 100.0 * np.exp(-dist / 3.0)
mat[i, j] = w
mat[j, i] = w
mat[3:6, 3:6] += 50.0
mat = mat.astype(int)
print(f'Created 1Mb-resolution matrix, shape={mat.shape}, total counts={mat.sum()}')
Created 1Mb-resolution matrix, shape=(10, 10), total counts=4730
Step 2: Downsample 1 Mb -> 5 Mb by binning 5x5 blocks#
We downsample by a factor of 5 (1 Mb -> 5 Mb). The operation:
Reshape the (10, 10) matrix into a (2, 5, 2, 5) tensor of 5x5 blocks
Sum each block into a single value
The result is a (2, 2) matrix at 5 Mb resolution
The total contact count is preserved (every read is counted exactly once).
factor = 5
n = mat.shape[0]
assert n % factor == 0, f'Matrix size {n} must be divisible by factor {factor}'
m = n // factor
downsampled = mat.reshape(m, factor, m, factor).sum(axis=(1, 3))
print(f'Input shape: {mat.shape} total counts: {mat.sum()}')
print(f'Output shape: {downsampled.shape} total counts: {downsampled.sum()}')
print(f'Downsampled matrix:')
print(downsampled)
Input shape: (10, 10) total counts: 4730
Output shape: (2, 2) total counts: 4730
Downsampled matrix:
[[1770 670]
[ 670 1620]]
Step 3: Verify shape and total counts#
assert downsampled.shape == (2, 2), f'Expected (2, 2), got {downsampled.shape}'
assert downsampled.sum() == mat.sum(), 'Total contact count must be preserved by bin-sum'
assert (downsampled >= 0).all(), 'Binned counts must be non-negative'
print('All checks passed:')
print(f' - shape: {mat.shape} -> {downsampled.shape}')
print(f' - total counts preserved: {mat.sum()} == {downsampled.sum()}')
All checks passed:
- shape: (10, 10) -> (2, 2)
- total counts preserved: 4730 == 4730
Step 4: Visualize original vs downsampled#
fig, axes = plt.subplots(1, 2, figsize=(10, 4.5))
im0 = axes[0].imshow(np.log1p(mat), cmap='Reds', aspect='equal')
axes[0].set_title('1 Mb resolution (10x10)')
axes[0].set_xlabel('bin index')
axes[0].set_ylabel('bin index')
fig.colorbar(im0, ax=axes[0], label='log(1 + count)')
im1 = axes[1].imshow(np.log1p(downsampled), cmap='Reds', aspect='equal')
axes[1].set_title('5 Mb resolution (2x2, binned 5x5)')
axes[1].set_xlabel('bin index')
axes[1].set_ylabel('bin index')
fig.colorbar(im1, ax=axes[1], label='log(1 + count)')
plt.tight_layout()
out_png = '/tmp/gunz_cm_downsample.png'
plt.savefig(out_png, dpi=100)
print(f'Saved: {out_png}')
Saved: /tmp/gunz_cm_downsample.png
Step 5: Compare with gunz_cm.preprocs.rand_downsample (stochastic read-subsampling)#
gunz_cm.preprocs.rand_downsample performs stochastic read subsampling: it keeps the original bin resolution but discards a fraction of contacts to simulate lower sequencing depth. This is a different operation from binning — the shape stays (10, 10) but the total counts drop. The RandomDownsampleCM torch transform in gunz_cm.resolution_enhancements.transforms wraps this API but has known issues (per CHANGELOG), so we call the underlying preprocs.rand_downsample function directly with a try/except guard.
np.random.seed(0)
try:
target_rate = 0.5
subsampled = rand_downsample(mat.copy(), ratio=2)
print(f'rand_downsample succeeded (target_rate={target_rate}):')
print(f' - shape preserved: {subsampled.shape == mat.shape}')
print(f' - total counts: {mat.sum()} -> {subsampled.sum()} ')
print(f' - non-zero pixels: {(mat > 0).sum()} -> {(subsampled > 0).sum()}')
except Exception as e:
print(f'rand_downsample failed: {type(e).__name__}: {e}')
rand_downsample succeeded (target_rate=0.5):
- shape preserved: True
- total counts: 4730 -> 2365
- non-zero pixels: 100 -> 100
Summary#
In this tutorial you learned how to:
Distinguish two downsampling strategies: spatial binning (sum adjacent bins, reduces shape) vs. stochastic read-subsampling (keeps shape, drops counts).
Bin a contact matrix with a one-liner:
mat.reshape(m, k, m, k).sum(axis=(1, 3))to go from(N, N)to(N/k, N/k)while preserving total counts.Verify the shape transform with shape and sum assertions.
Visualize original and downsampled matrices side-by-side via matplotlib.
Compare with
gunz_cm.preprocs.rand_downsamplefor the stochastic read-subsampling case (theRandomDownsampleCMtorch wrapper is wrapped intry/exceptbecause of known issues).
Next steps:
tutorial_load_hic.ipynb— load.hicand.coolmatricestutorial_balance_kr.ipynb— apply KR / VC / ICE balancingtutorial_3d_reconstruction.ipynb— going from 2D contact matrix to 3D structure