Tutorial: Preprocessing Pipeline (matrices module)

version: 2.25.0

This tutorial covers the matrix-preprocessing pipeline in gunz_cm.preprocs.matrices. After working through it you will know how to:

Convert between DataFrame, COO sparse, and tuple formats (to_coo_matrix, to_dataframe, infer_mat_shape).
Build an adjacency matrix from a sparse contacts matrix (comp_single_graph_adj_mat).
Apply distribution transforms: min-max / z-score (scale_matrix), log scaling (log_scale_matrix), and rank-based Gaussian (transform_to_gaussian).
Mirror an upper-triangle DataFrame to a full symmetric matrix (mirror_upper_to_lower_triangle_df).
Add synthetic ligation noise to test downstream algorithms (add_rand_ligation_noise).

All examples use synthetic data only (fixed RNG seed for reproducibility).

import numpy as np
import pandas as pd
import scipy.sparse as sp

rng = np.random.default_rng(42)

# Build a synthetic sparse COO matrix (10x10, 15 random edges)
n = 10
n_edges = 15
row_ids = rng.integers(0, n, n_edges)
col_ids = rng.integers(0, n, n_edges)
counts = rng.integers(1, 100, n_edges)  # integer dtype for noise functions
coo = sp.coo_matrix((counts, (row_ids, col_ids)), shape=(n, n))
print(f"Synthetic COO: shape={coo.shape}, nnz={coo.nnz}")

Synthetic COO: shape=(10, 10), nnz=15

1. Conversion round-trip: `infer_mat_shape`, `to_coo_matrix`, `to_dataframe`

The canonical Hi-C format in gunz-cm is the COO sparse matrix; its DataFrame form uses the column names row_ids / col_ids / counts. Use these three functions to convert between formats:

infer_mat_shape(data) — infer (n_rows, n_cols) from a tuple/COO/DF
to_coo_matrix(data) — convert DF or tuple to scipy COO
to_dataframe(coo) — convert COO back to DataFrame with canonical column names

from gunz_cm.preprocs.matrices.converters import to_coo_matrix, to_dataframe
from gunz_cm.preprocs.matrices.infer_shape import infer_mat_shape

# 1. infer_mat_shape: works on tuple, COO, or DF
shape_from_tuple = infer_mat_shape((row_ids, col_ids))
shape_from_coo = infer_mat_shape(coo)
print(f"Shape from tuple: {shape_from_tuple}")
print(f"Shape from COO:   {shape_from_coo}")
assert shape_from_tuple == shape_from_coo == (n, n)
print("Verified: both infer to the same shape.")

Shape from tuple: (np.int64(10), np.int64(10))
Shape from COO:   (10, 10)
Verified: both infer to the same shape.

# 2. to_dataframe: COO → DataFrame with canonical columns
df = to_dataframe(coo)
print(f"DataFrame: {df.shape}")
print(df.head())
assert list(df.columns) == ['row_ids', 'col_ids', 'counts']
print("\nVerified: DataFrame has canonical columns 'row_ids', 'col_ids', 'counts'.")

DataFrame: (15, 3)
   row_ids  col_ids  counts
0        0        7      45
1        7        5      23
2        6        1      10
3        4        8      55
4        4        4      88

Verified: DataFrame has canonical columns 'row_ids', 'col_ids', 'counts'.

# 3. Round-trip: DataFrame → COO → DataFrame
coo2 = to_coo_matrix(df, is_triu_sym=False)  # our synthetic data is NOT symmetric
df2 = to_dataframe(coo2)
print(f"Round-trip: nnz={coo2.nnz}, columns={list(df2.columns)}")

# Sort both for comparison (order may differ)
df_sorted = df.sort_values(['row_ids', 'col_ids']).reset_index(drop=True)
df2_sorted = df2.sort_values(['row_ids', 'col_ids']).reset_index(drop=True)
assert df_sorted.equals(df2_sorted), "Round-trip data mismatch"
print("Verified: DataFrame → COO → DataFrame round-trip preserves data.")

Round-trip: nnz=15, columns=['row_ids', 'col_ids', 'counts']
Verified: DataFrame → COO → DataFrame round-trip preserves data.
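
The round-trip above can also be sketched with plain NumPy, which makes the underlying triplet logic visible. This is only an illustration: infer_square_shape is a hypothetical helper, and the "largest index + 1, square" rule is an assumption, not a statement of how infer_mat_shape is implemented.

```python
import numpy as np

def infer_square_shape(row_ids, col_ids):
    """Hypothetical helper: smallest square shape that holds every index."""
    n = int(max(np.max(row_ids), np.max(col_ids))) + 1
    return (n, n)

# A few (row, col, count) triplets, stored as three parallel arrays.
row_ids = np.array([0, 7, 6, 4, 4])
col_ids = np.array([7, 5, 1, 8, 4])
counts = np.array([45, 23, 10, 55, 88])

shape = infer_square_shape(row_ids, col_ids)

# Densify the triplets; np.add.at accumulates duplicate (row, col) pairs.
dense_rt = np.zeros(shape, dtype=counts.dtype)
np.add.at(dense_rt, (row_ids, col_ids), counts)

# Recover the triplets from the dense array (row-major order).
r2, c2 = np.nonzero(dense_rt)
v2 = dense_rt[r2, c2]

print(shape, len(v2), int(v2.sum()))
```

Casting with int() returns plain Python integers, which avoids the np.int64 wrappers visible in the tuple-based output above.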

2. `comp_single_graph_adj_mat`: build an adjacency matrix

Convert a sparse contacts matrix to a dense or sparse adjacency matrix. Useful for graph algorithms (shortest path, connected components) that operate on adjacencies.

from gunz_cm.preprocs.matrices.graphs import comp_single_graph_adj_mat

adj_dense = comp_single_graph_adj_mat(coo, allow_loop=True)
print(f"Adjacency (dense): shape={adj_dense.shape}, dtype={adj_dense.dtype}")
print(f"  nonzero edges: {(adj_dense != 0).sum()}")
print(f"  sum: {adj_dense.sum():.1f}")

Adjacency (dense): shape=(10, 10), dtype=int64
  nonzero edges: 7
  sum: 9.0

# Same operation, starting from a (row_ids, col_ids, counts) tuple.
# Convert tuple → COO first (comp_single_graph_adj_mat doesn't accept tuples directly).
from gunz_cm.preprocs.matrices.converters import to_coo_matrix
coo_from_tuple = to_coo_matrix((row_ids, col_ids, counts), is_triu_sym=False)
adj_from_tuple = comp_single_graph_adj_mat(coo_from_tuple, allow_loop=True, is_triu_sym=False)
print(f"From tuple: shape={adj_from_tuple.shape}")

# Compare totals. This call passes is_triu_sym=False while the first call used
# the default, so the two results are not expected to be identical.
# adj_from_tuple is sparse here, so call .sum() on it directly; wrapping it in
# np.asarray() yields a 0-d object array whose sum is the matrix itself.
print(f"sum(adj_dense): {adj_dense.sum()}, sum(adj_from_tuple): {adj_from_tuple.sum()}")

From tuple: shape=(10, 10)
sum(adj_dense): 9, sum(adj_from_tuple): 16
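
For intuition about what an adjacency matrix encodes, here is a minimal NumPy sketch that builds a 0/1 adjacency from an edge list. binary_adjacency and its symmetric flag are hypothetical names, and the conventions used (duplicates collapse to one edge, edges are mirrored) are one common choice; how comp_single_graph_adj_mat treats duplicates and the diagonal is not assumed here.

```python
import numpy as np

def binary_adjacency(row_ids, col_ids, n, allow_loop=True, symmetric=True):
    """Hypothetical helper: 0/1 adjacency matrix from an edge list."""
    adj = np.zeros((n, n), dtype=np.int64)
    adj[row_ids, col_ids] = 1            # duplicate edges collapse to one
    if symmetric:
        adj = np.maximum(adj, adj.T)     # undirected graph: mirror every edge
    if not allow_loop:
        np.fill_diagonal(adj, 0)         # drop self-loops
    return adj

# Edge (0, 1) appears twice and (2, 2) is a self-loop.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([1, 1, 2, 2, 3])

adj = binary_adjacency(rows, cols, n=4)
adj_no_loop = binary_adjacency(rows, cols, n=4, allow_loop=False)
print(adj)
```

With these conventions the self-loop contributes a single diagonal entry, and disabling allow_loop removes exactly that entry.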

3. `scale_matrix`: min-max or z-score

Scale a matrix to a target range ('minmax', default [0, 1]) or to z-scores ('zscore', mean 0, standard deviation 1). In the examples below, both methods are applied per column. The function accepts dense or sparse input.

from gunz_cm.preprocs.matrices.linear_scaler import scale_matrix

# Build a dense matrix with values spanning multiple orders of magnitude
dense = np.array([
    [1.0, 10.0, 100.0, 1000.0],
    [2.0, 20.0, 200.0, 2000.0],
    [3.0, 30.0, 300.0, 3000.0],
    [4.0, 40.0, 400.0, 4000.0],
])
print(f"Original matrix:\n{dense}")
print(f"  range: [{dense.min()}, {dense.max()}], mean: {dense.mean():.1f}")

Original matrix:
[[1.e+00 1.e+01 1.e+02 1.e+03]
 [2.e+00 2.e+01 2.e+02 2.e+03]
 [3.e+00 3.e+01 3.e+02 3.e+03]
 [4.e+00 4.e+01 4.e+02 4.e+03]]
  range: [1.0, 4000.0], mean: 694.4

# Min-max scaling to [0, 1]
scaled_minmax = scale_matrix(dense, scaling_method='minmax', min_val=0, max_val=1)
print(f"Min-max scaled:\n{scaled_minmax}")
print(f"  range: [{scaled_minmax.min()}, {scaled_minmax.max()}]")
assert scaled_minmax.min() == 0 and scaled_minmax.max() == 1
print("\nVerified: min-max scaling maps to [0, 1].")

Min-max scaled:
[[0.         0.         0.         0.        ]
 [0.33333333 0.33333333 0.33333333 0.33333333]
 [0.66666667 0.66666667 0.66666667 0.66666667]
 [1.         1.         1.         1.        ]]
  range: [0.0, 1.0]

Verified: min-max scaling maps to [0, 1].

# Z-score scaling (mean 0, std 1) — per column by default
scaled_z = scale_matrix(dense, scaling_method='zscore')
print(f"Z-scored (per column):\n{scaled_z}")
print(f"  column means:  {scaled_z.mean(axis=0)}")
print(f"  column stds:   {scaled_z.std(axis=0)}")
import numpy.testing as npt
npt.assert_allclose(scaled_z.mean(axis=0), 0, atol=1e-6)
npt.assert_allclose(scaled_z.std(axis=0), 1, atol=1e-6)
print("\nVerified: z-scoring gives mean=0, std=1 per column.")

Z-scored (per column):
[[-1.34164079 -1.34164079 -1.34164079 -1.34164079]
 [-0.4472136  -0.4472136  -0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136   0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079  1.34164079  1.34164079]]
  column means:  [0. 0. 0. 0.]
  column stds:   [1. 1. 1. 1.]

Verified: z-scoring gives mean=0, std=1 per column.

# Custom range scaling
scaled_custom = scale_matrix(dense, scaling_method='minmax', min_val=-1, max_val=1)
print(f"Min-max scaled to [-1, 1]:\n{scaled_custom}")
print(f"  range: [{scaled_custom.min()}, {scaled_custom.max()}]")

Min-max scaled to [-1, 1]:
[[-1.         -1.         -1.         -1.        ]
 [-0.33333333 -0.33333333 -0.33333333 -0.33333333]
 [ 0.33333333  0.33333333  0.33333333  0.33333333]
 [ 1.          1.          1.          1.        ]]
  range: [-1.0, 1.0000000000000002]
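
The per-column results above follow from two short formulas. The sketch below reproduces them with plain NumPy; minmax_columns and zscore_columns are hypothetical helpers written for illustration, not part of gunz_cm, and they assume no column is constant (a constant column would divide by zero).

```python
import numpy as np

def minmax_columns(x, lo=0.0, hi=1.0):
    """Rescale each column linearly so its min maps to lo and its max to hi."""
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return lo + (x - col_min) / (col_max - col_min) * (hi - lo)

def zscore_columns(x):
    """Center each column on 0 and divide by its population std (ddof=0)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

x = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

mm = minmax_columns(x)
z = zscore_columns(x)
print(mm[:, 0])
print(z[:, 0])
```

Because both transforms are linear within a column, columns that differ only by a constant factor (1, 2, 3, 4 versus 10, 20, 30, 40) end up identical after scaling, which is why every column in the outputs above looks the same.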

4. `log_scale_matrix`: optimized `log1p` transform

Hi-C contact counts span multiple orders of magnitude. log_scale_matrix applies log(1 + x) (i.e. numpy.log1p), which is numerically stable for small values and avoids the log(0) singularity.

from gunz_cm.preprocs.matrices.log_scaler import log_scale_matrix

logged = log_scale_matrix(dense, exclude_diagonal=False)
print(f"Original:\n{dense}")
print(f"\nlog(1+x):\n{logged}")
import numpy.testing as npt
npt.assert_allclose(logged, np.log1p(dense), atol=1e-6)
print("\nVerified: log_scale_matrix applies log1p element-wise.")

Original:
[[1.e+00 1.e+01 1.e+02 1.e+03]
 [2.e+00 2.e+01 2.e+02 2.e+03]
 [3.e+00 3.e+01 3.e+02 3.e+03]
 [4.e+00 4.e+01 4.e+02 4.e+03]]

log(1+x):
[[0.69314718 2.39789527 4.61512052 6.90875478]
 [1.09861229 3.04452244 5.30330491 7.60140233]
 [1.38629436 3.4339872  5.70711026 8.00670085]
 [1.60943791 3.71357207 5.99396143 8.29429961]]

Verified: log_scale_matrix applies log1p element-wise.
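
The numerical-stability claim for log1p can be checked with the standard library alone. The tiny input below is chosen for illustration (raw Hi-C counts are integers, but normalized contact frequencies can be very small).

```python
import math

tiny = 1e-18

# 1.0 + 1e-18 rounds to exactly 1.0 in double precision, so the naive
# formula loses the input entirely.
naive = math.log(1.0 + tiny)

# log1p evaluates log(1 + x) without forming 1 + x first.
stable = math.log1p(tiny)

print(naive, stable)
print(math.log1p(0.0))  # zero maps to zero, so empty cells stay empty
```

The last line is the property that matters for sparse matrices: log1p(0) is 0, so entries that are absent before the transform remain absent afterwards.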

5. `transform_to_gaussian`: rank-based Gaussian transform

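Map a matrix onto a Gaussian distribution with a rank-preserving inverse-CDF transform: each value is replaced by the inverse normal CDF evaluated at a quantile derived from its rank in the matrix. This is useful for downstream algorithms that assume Gaussian inputs. On small inputs the result is only approximately Gaussian: in the 4x4 example below the largest entry maps to about 6.36, which pulls the sample mean and standard deviation away from the requested mu=0 and sigma=1.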

from gunz_cm.preprocs.matrices.linear_scaler import transform_to_gaussian

transformed = transform_to_gaussian(dense, mu=0, sigma=1)
print(f"Original mean: {dense.mean():.3f}, std: {dense.std():.3f}")
print(f"Transformed mean: {transformed.mean():.3f}, std: {transformed.std():.3f}")
print(f"Transformed:\n{transformed}")

Original mean: 694.375, std: 1188.185
Transformed mean: 0.398, std: 1.741
Transformed:
[[-1.53412054 -0.48877641  0.15731068  0.88714656]
 [-1.15034938 -0.31863936  0.31863936  1.15034938]
 [-0.88714656 -0.15731068  0.48877641  1.53412054]
 [-0.67448975  0.          0.67448975  6.36134089]]

# Verify rank preservation: rank order in original = rank order in transformed
original_ranks = np.argsort(dense.flatten())
transformed_ranks = np.argsort(transformed.flatten())
ranks_match = np.array_equal(original_ranks, transformed_ranks)
print(f"Rank order preserved: {ranks_match}")
assert ranks_match, "Transform_to_gaussian should preserve rank order"
print("Verified: rank order is preserved (lowest input → lowest output).")

Rank order preserved: True
Verified: rank order is preserved (lowest input → lowest output).
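
A rank-based inverse-normal transform can be sketched in a few lines. rank_gaussianize is a hypothetical helper, and the (rank - 0.5) / n quantile convention is an assumption chosen to keep every quantile strictly inside (0, 1); it is not necessarily the convention transform_to_gaussian uses.

```python
import numpy as np
from statistics import NormalDist

def rank_gaussianize(x, mu=0.0, sigma=1.0):
    """Hypothetical helper: replace each value by a normal quantile of its rank."""
    flat = np.asarray(x, dtype=float).ravel()
    # argsort of argsort yields 0-based ranks; +1 makes them 1-based.
    ranks = np.argsort(np.argsort(flat, kind="stable"), kind="stable") + 1
    # The 0.5 offset keeps quantiles away from 0 and 1, where the
    # inverse CDF is unbounded.
    quantiles = (ranks - 0.5) / flat.size
    inv_cdf = NormalDist(mu, sigma).inv_cdf
    out = np.array([inv_cdf(float(q)) for q in quantiles])
    return out.reshape(np.shape(x))

x = np.array([[1.0, 1000.0],
              [10.0, 100.0]])
g = rank_gaussianize(x)
print(g)
```

With this convention the quantiles are symmetric around 0.5, so the transformed values are symmetric around mu and no entry is pushed to an extreme tail.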

6. `mirror_upper_to_lower_triangle_df`: mirror upper to lower

If your DataFrame contains only upper-triangle entries (row_ids <= col_ids), use this function to expand it to a full symmetric matrix by adding the mirrored lower-triangle entries. The example below uses strictly upper-triangle input (row_ids < col_ids), so the row count exactly doubles.

from gunz_cm.preprocs.matrices.mirrors import mirror_upper_to_lower_triangle_df

# Build a STRICTLY upper-triangle DataFrame (row_ids < col_ids).
# The mirror function expects strict-upper input + adds lower-mirror entries.
upper_df = pd.DataFrame({
    'row_ids': [0, 0, 1, 2],
    'col_ids': [1, 2, 2, 3],
    'counts':  [2.0, 3.0, 4.0, 5.0],
})
print(f"Upper-triangle: {len(upper_df)} rows")
print(upper_df)

Upper-triangle: 4 rows
   row_ids  col_ids  counts
0        0        1     2.0
1        0        2     3.0
2        1        2     4.0
3        2        3     5.0

full_df = mirror_upper_to_lower_triangle_df(upper_df)
print(f"Full symmetric: {len(full_df)} rows")
print(full_df.sort_values(['row_ids', 'col_ids']).reset_index(drop=True))
assert len(full_df) == 2 * len(upper_df)  # 4 upper + 4 mirror = 8 (no diagonal in input)
print(f"\nVerified: {len(upper_df)} upper → {len(full_df)} full (no diagonal entries in the input).")

Full symmetric: 8 rows
   row_ids  col_ids  counts
0        0        1     2.0
1        0        2     3.0
2        1        0     2.0
3        1        2     4.0
4        2        0     3.0
5        2        1     4.0
6        2        3     5.0
7        3        2     5.0

Verified: 4 upper → 8 full (no diagonal entries in the input).
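
The mirroring step itself is easy to express on raw triplets. The sketch below uses a hypothetical mirror_triplets helper and assumes diagonal entries should be kept once rather than duplicated; whether mirror_upper_to_lower_triangle_df handles diagonal input this way is not confirmed here.

```python
import numpy as np

def mirror_triplets(rows, cols, vals):
    """Hypothetical helper: add (col, row, val) for every off-diagonal entry."""
    off_diag = rows != cols
    full_rows = np.concatenate([rows, cols[off_diag]])
    full_cols = np.concatenate([cols, rows[off_diag]])
    full_vals = np.concatenate([vals, vals[off_diag]])
    return full_rows, full_cols, full_vals

# Upper-triangle input with one diagonal entry at (0, 0).
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 2, 3])
vals = np.array([9.0, 3.0, 4.0, 5.0])

r, c, v = mirror_triplets(rows, cols, vals)

# Densify to confirm the result is symmetric.
full = np.zeros((4, 4))
full[r, c] = v
print(len(r))
print(full)
```

Here four input entries become seven, not eight, because the diagonal entry has no distinct mirror image.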

7. `add_rand_ligation_noise`: synthetic ligation noise

Hi-C has systematic ligation bias (proximity in 3D space). To test whether downstream algorithms are robust to noise, use this function to add random ligation noise to a contacts matrix.

from gunz_cm.preprocs.matrices.noises import add_rand_ligation_noise

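ratio = 0.1  # 10% of counts are noise
noisy = add_rand_ligation_noise(coo, ratio=ratio, use_pseudo=False, is_triu_sym=False)
# NOTE: `dense` is the 4x4 matrix from section 3, not the 10x10 `coo` that was
# passed to add_rand_ligation_noise, so the "original" figures printed here are
# not comparable with the noisy ones. For a like-for-like comparison use
# coo.mean() and coo.sum().
print(f"Original mean: {dense.mean():.3f}")
print(f"Noisy mean:    {noisy.mean():.3f}")
print(f"Noisy:\n{noisy}")

# Compare total counts before and after adding noise (see the note above)
original_total = dense.sum()
noisy_total = noisy.sum()
print(f"\nTotal counts: original={original_total}, noisy={noisy_total}")
print("(Exact ratio depends on the random state; not asserting exact match)")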

Original mean: 694.375
Noisy mean:    7.610
Noisy:
<COOrdinate sparse matrix of dtype 'int64'
	with 52 stored elements and shape (10, 10)>
  Coords	Values
  (0, 0)	2
  (0, 1)	1
  (0, 3)	86
  (0, 5)	2
  (0, 7)	109
  (0, 8)	1
  (1, 0)	2
  (1, 4)	1
  (1, 5)	1
  (1, 6)	1
  (1, 7)	3
  (1, 8)	1
  (2, 0)	1
  (2, 2)	2
  (2, 3)	2
  (2, 8)	2
  (2, 9)	28
  (3, 5)	1
  (3, 7)	2
  (3, 8)	1
  (3, 9)	3
  (4, 0)	2
  (4, 2)	3
  (4, 4)	88
  (4, 8)	56
  :	:
  (5, 3)	1
  (5, 4)	1
  (5, 6)	18
  (5, 7)	3
  (5, 8)	1
  (5, 9)	1
  (6, 1)	94
  (6, 5)	1
  (6, 6)	1
  (6, 9)	1
  (7, 2)	1
  (7, 4)	8
  (7, 5)	59
  (7, 7)	1
  (7, 8)	70
  (7, 9)	1
  (8, 0)	1
  (8, 5)	7
  (8, 7)	1
  (8, 8)	2
  (9, 1)	2
  (9, 4)	79
  (9, 5)	1
  (9, 6)	1
  (9, 8)	1

Total counts: original=11110.0, noisy=761
(Exact ratio depends on the random state; not asserting exact match)

# Higher noise ratio (30%)
noisy_30 = add_rand_ligation_noise(coo, ratio=0.3, is_triu_sym=False)
print(f"30% noise mean: {noisy_30.mean():.3f}")
print(f"30% noise:\n{noisy_30}")

30% noise mean: 8.990
30% noise:
<COOrdinate sparse matrix of dtype 'int64'
	with 92 stored elements and shape (10, 10)>
  Coords	Values
  (0, 1)	3
  (0, 2)	2
  (0, 3)	85
  (0, 5)	5
  (0, 6)	2
  (0, 7)	110
  (0, 8)	3
  (1, 0)	1
  (1, 1)	2
  (1, 2)	4
  (1, 3)	3
  (1, 4)	5
  (1, 5)	1
  (1, 6)	2
  (1, 8)	3
  (1, 9)	1
  (2, 0)	1
  (2, 1)	3
  (2, 2)	2
  (2, 3)	2
  (2, 4)	1
  (2, 5)	2
  (2, 6)	5
  (2, 7)	1
  (2, 8)	1
  :	:
  (7, 3)	5
  (7, 4)	9
  (7, 5)	62
  (7, 6)	3
  (7, 7)	1
  (7, 8)	75
  (7, 9)	4
  (8, 0)	3
  (8, 1)	2
  (8, 2)	2
  (8, 3)	2
  (8, 4)	3
  (8, 5)	9
  (8, 6)	2
  (8, 8)	4
  (8, 9)	1
  (9, 0)	1
  (9, 1)	3
  (9, 2)	2
  (9, 3)	1
  (9, 4)	77
  (9, 5)	1
  (9, 7)	1
  (9, 8)	1
  (9, 9)	1
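
To make the idea of a noise ratio concrete, here is a toy noise model in plain NumPy. add_uniform_noise is a hypothetical helper, and the model (scatter ratio * total extra counts uniformly over all cells) is an assumption for illustration only; the sampling scheme used by add_rand_ligation_noise may differ.

```python
import numpy as np

def add_uniform_noise(dense, ratio, rng):
    """Hypothetical noise model: add ratio * total counts at random cells."""
    n_rows, n_cols = dense.shape
    n_noise = int(round(ratio * dense.sum()))
    r = rng.integers(0, n_rows, n_noise)
    c = rng.integers(0, n_cols, n_noise)
    noisy = dense.copy()
    np.add.at(noisy, (r, c), 1)   # accumulates when a cell is drawn twice
    return noisy

rng = np.random.default_rng(0)
clean = np.zeros((10, 10), dtype=np.int64)
clean[0, 7] = 45
clean[7, 5] = 23
clean[4, 4] = 88

noisy_toy = add_uniform_noise(clean, ratio=0.1, rng=rng)

# Compare like with like: the same matrix before and after adding noise.
print(int(clean.sum()), int(noisy_toy.sum()))
```

Under this model the total grows by exactly the rounded noise budget, so comparing totals of the same matrix before and after is a direct check on the ratio.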

Summary

Quick reference for the matrices module:

Use case	Function	Notes
Convert DF or tuple to COO	`to_coo_matrix`	With canonical `row_ids` / `col_ids` / `counts` columns
Convert COO back to DF	`to_dataframe`	Same canonical columns
Infer matrix shape	`infer_mat_shape`	Polymorphic (tuple, COO, DF)
Build adjacency matrix	`comp_single_graph_adj_mat`	For graph algorithms
Scale to [0, 1] or z-score	`scale_matrix`	`'minmax'` or `'zscore'`
Log transform	`log_scale_matrix`	Uses `log1p` for numerical stability
Rank-based Gaussian	`transform_to_gaussian`	Preserves rank order
Mirror upper to lower	`mirror_upper_to_lower_triangle_df`	Expands upper-triangle DF
Add ligation noise	`add_rand_ligation_noise`	For robustness testing

All functions operate polymorphically on dense numpy, sparse scipy, and DataFrames where possible. Use the inplace=True parameter on the mutating variants (scale_matrix, log_scale_matrix, transform_to_gaussian, add_rand_ligation_noise) to avoid copies.
