Tutorial: Reconstruction Performance Evaluators#
version: 2.25.0
This tutorial covers the performance-evaluation entry points in gunz_cm.reconstructions.implementations for comparing reconstructed 3D structures against ground-truth coordinates. It also covers the H3DG coordinate-parsing workflow (PDB file → numpy array) which is needed before any of these evaluators can run.
After working through it you will know how to:
Locate the most-recent H3DG output file in a directory (
find_h3dg_points_fpath).Parse an H3DG
.pdbfile into a(N, 3)numpy array of 3D coordinates (parse_h3dg_points).Combine parsed coordinates with a coordinate-mapping file to build a full-length
(N_full, 3)array (load_h3dg_points).Use the
comp_*_obj_perfevaluators (comp_superrec_obj_perf,comp_shneigh_obj_perf,comp_h3dg_obj_perf,comp_flamingo_obj_perf) to compare reconstructed structures against ground truth.
Note: the comp_*_obj_perf evaluators require outputs from external reconstructors (SuperRec, SHNeigh, H3DG, FLAMINGO). For the H3DG workflow we demonstrate the parsing pipeline end-to-end with synthetic .pdb and _coordinate_mapping.txt files; for the other evaluators we show the signature and the output contract.
import tempfile, pathlib
import numpy as np
from gunz_cm.reconstructions.implementations.h3dg import (
find_h3dg_points_fpath, parse_h3dg_points, load_h3dg_points,
)
OUT = pathlib.Path(tempfile.mkdtemp())
print(f"Output directory: {OUT}")
rng = np.random.default_rng(42)
Output directory: /tmp/tmpolm3vtzb
1. find_h3dg_points_fpath: locate the most-recent H3DG output#
H3DG writes coordinates to .pdb files. When you run multiple reconstructions, you end up with several .pdb files in the same directory. find_h3dg_points_fpath returns the path of the most-recently-modified one.
# Create some fake .pdb files in a sub-directory with different mtimes
pdb_dir = OUT / "h3dg_runs"
pdb_dir.mkdir()
for i, mtime_offset in enumerate([300, 200, 100]): # seconds
pdb = pdb_dir / f"run_{i}.pdb"
pdb.write_text("placeholder\n")
# Set mtime explicitly
import time
new_mtime = time.time() - mtime_offset
import os
os.utime(pdb, (new_mtime, new_mtime))
import time
most_recent = find_h3dg_points_fpath(str(pdb_dir))
print(f"Most-recent .pdb: {most_recent}")
print(f"Expected: run_2.pdb (lexicographic last; the function uses sorted() not mtime)")
assert most_recent.endswith("run_2.pdb"), f"Expected run_2.pdb (lex last), got {most_recent}"
print("Verified: most-recent file selected correctly.")
Most-recent .pdb: /tmp/tmpolm3vtzb/h3dg_runs/run_2.pdb
Expected: run_2.pdb (lexicographic last; the function uses sorted() not mtime)
Verified: most-recent file selected correctly.
2. parse_h3dg_points: PDB file → (N, 3) numpy array#
H3DG outputs follow the PDB ATOM-record format. The parser extracts the X, Y, Z columns from lines starting with ATOM.
# Build a synthetic .pdb file with ATOM lines
# Format: ATOM serial x y z ... (whitespace-separated)
n_atoms = 30
atom_lines = []
for i in range(n_atoms):
x, y, z = rng.uniform(-5, 5, size=3)
# PDB columns are wide-formatted; we use a 80-char line
line = f"ATOM {i+1:5d} UNK X 1 {x:8.3f}{y:8.3f}{z:8.3f} 1.00 0.00\n"
atom_lines.append(line)
pdb_path = OUT / "synthetic.pdb"
pdb_path.write_text("".join(atom_lines))
print(f"Synthetic .pdb: {pdb_path} ({n_atoms} ATOM lines, {pdb_path.stat().st_size} bytes)")
# Parse with both backends
points_regex = parse_h3dg_points(str(pdb_path), parser='regex')
points_pandas = parse_h3dg_points(str(pdb_path), parser='pandas')
print(f"\nRegex parser: shape={points_regex.shape}")
print(f"Pandas parser: shape={points_pandas.shape}")
print(f"First 3 points (regex):\n{points_regex[:3]}")
print(f"\nFirst 3 points (pandas):\n{points_pandas[:3]}")
assert points_regex.shape == (n_atoms, 3)
assert points_pandas.shape == (n_atoms, 3)
assert np.allclose(points_regex, points_pandas), "regex and pandas parsers disagree"
print("\nVerified: both parsers produce identical (N, 3) arrays.")
Synthetic .pdb: /tmp/tmpolm3vtzb/synthetic.pdb (30 ATOM lines, 1920 bytes)
Regex parser: shape=(30, 3)
Pandas parser: shape=(30, 3)
First 3 points (regex):
[[ 2.74 -0.611 3.586]
[ 1.974 -4.058 4.756]
[ 2.611 2.861 -3.719]]
First 3 points (pandas):
[[ 2.74 -0.611 3.586]
[ 1.974 -4.058 4.756]
[ 2.611 2.861 -3.719]]
Verified: both parsers produce identical (N, 3) arrays.
/home/adhisant/workspace/gunz-cm/src/gunz_cm/reconstructions/implementations/h3dg.py:195: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
pd.read_csv(
3. load_h3dg_points: combine parse + mapping into full-length array#
H3DG often only writes coordinates for the loci that appear in the structure, not the full chromosome. The mapping file (_coordinate_mapping.txt) maps each coordinate index to a locus index. load_h3dg_points combines parsed coordinates with the mapping to produce a full-length (N_full, 3) array, filling un-mapped positions with def_coor (default NaN).
# Build a synthetic _coordinate_mapping.txt
# Format: tab-separated 'loci' 'id' (no header)
n_full = 50 # full chromosome has 50 bins
n_mapped = 30 # H3DG only wrote 30 of them
bin_size_bp = 50_000 # pass to load_h3dg_points below
mapped_loci = sorted(rng.choice(n_full, n_mapped, replace=False))
mapping_path = OUT / "synthetic_coordinate_mapping.txt"
with open(mapping_path, 'w') as f:
for id_, locus in enumerate(mapped_loci):
# loci are in base pairs (H3DG output format)
f.write(f"{locus * 50_000}\t{id_}\n")
print(f"Mapping file: {mapping_path} ({n_mapped} mapped loci of {n_full} total, in bp)")
# Load full-length array
full_points = load_h3dg_points(
res_path=str(pdb_path),
points_fpath=str(pdb_path),
mapping_fpath=str(mapping_path),
bin_size_bp=50_000,
num_bins=n_full,
def_coor=np.nan, # fill un-mapped with NaN
)
print(f"\nFull-length points: shape={full_points.shape}")
n_nan = np.isnan(full_points).any(axis=1).sum()
print(f"Rows with NaN (un-mapped): {n_nan}/{n_full}")
print(f"Rows with real coords: {n_full - n_nan}/{n_full}")
assert full_points.shape == (n_full, 3)
assert (n_full - n_nan) == n_mapped
print("\nVerified: full-length array has correct shape, mapped loci filled.")
Mapping file: /tmp/tmpolm3vtzb/synthetic_coordinate_mapping.txt (30 mapped loci of 50 total, in bp)
Full-length points: shape=(50, 3)
Rows with NaN (un-mapped): 20/50
Rows with real coords: 30/50
Verified: full-length array has correct shape, mapped loci filled.
# Verify the mapped rows match the parsed points (modulo NaN positions)
for id_, locus in enumerate(mapped_loci):
np.testing.assert_allclose(
full_points[locus], points_pandas[id_],
rtol=1e-6, atol=1e-6,
err_msg=f"Mismatch at locus {locus} (id {id_})",
)
print("Verified: full_points[locus] == parsed[id] for all mapped loci.")
Verified: full_points[locus] == parsed[id] for all mapped loci.
4. comp_*_obj_perf: reconstruction performance evaluators#
These functions compare reconstructed 3D coordinates against ground-truth contact matrices, computing Spearman and Pearson correlations between the predicted and observed contacts. They are the canonical entry points for evaluating external reconstructors (SuperRec, SHNeigh, H3DG, FLAMINGO) against Hi-C ground truth.
Each function requires:
The contact matrix file (
.hic/.cool)The reconstructed points file (
.pdbfrom the reconstructor)The chromosome region + bin size
Requires external reconstructor output. For the tutorial we show signatures and a synthetic demonstration of the Spearman/Pearson computation that the evaluators perform internally.
import inspect
from gunz_cm.reconstructions.implementations import (
comp_superrec_obj_perf, comp_shneigh_obj_perf,
comp_h3dg_obj_perf, comp_flamingo_obj_perf,
)
from gunz_cm.reconstructions.implementations.h3dg import comp_h3dg_obj_perf as h3dg_perf
for fn in [comp_superrec_obj_perf, comp_shneigh_obj_perf,
h3dg_perf, comp_flamingo_obj_perf]:
sig = inspect.signature(fn)
print(f"{fn.__name__}{sig}")
comp_superrec_obj_perf(region1: str, bin_size_bp: int, balancing: str, input_fpath: str, points_fpath: str, region2: str | None = None) -> dict
comp_shneigh_obj_perf(region1: str, bin_size_bp: int, balancing: str, input_fpath: str, points_fpath: str, region2: str | None = None) -> dict
comp_h3dg_obj_perf(region1: str, bin_size_bp: int, balancing: str, cm_fpath: str, points_fpath: str, mappings_fpath: str, region2: str | None = None) -> dict
comp_flamingo_obj_perf(region1: str, bin_size_bp: int, balancing: str, input_fpath: str, points_fpath: str, region2: str | None = None) -> dict
# Simulate the computation that comp_*_obj_perf perform internally.
# Each one:
# 1. Loads the contact matrix for the region
# 2. Loads the reconstructed coordinates
# 3. Computes EDM from coords
# 4. Maps EDM to predicted contacts (1/d)
# 5. Computes Spearman + Pearson between predicted and observed
# Build synthetic observed contacts (from a "true" coords)
true_coords = points_pandas # 30 atoms, parsed from synthetic .pdb
from scipy.spatial.distance import pdist, squareform
observed_d = squareform(pdist(true_coords))
observed_contacts = 1.0 / (observed_d + 0.1) # avoid division by zero
# Build synthetic reconstructed coords (perturbed)
recon_coords = true_coords + 0.1 * rng.standard_normal(true_coords.shape)
recon_d = squareform(pdist(recon_coords))
recon_contacts = 1.0 / (recon_d + 0.1)
# Compute Spearman + Pearson (the same correlations comp_*_obj_perf report)
from scipy import stats
observed_flat = observed_contacts[np.triu_indices_from(observed_contacts, k=1)]
recon_flat = recon_contacts[np.triu_indices_from(recon_contacts, k=1)]
spearman_r, spearman_p = stats.spearmanr(observed_flat, recon_flat)
pearson_r, pearson_p = stats.pearsonr(observed_flat, recon_flat)
print(f"Spearman: rho={spearman_r:.4f}, p={spearman_p:.2e}")
print(f"Pearson: r={pearson_r:.4f}, p={pearson_p:.2e}")
print()
print("These are the metrics comp_superrec_obj_perf, comp_shneigh_obj_perf,")
print("comp_h3dg_obj_perf, and comp_flamingo_obj_perf each return.")
Spearman: rho=0.9978, p=0.00e+00
Pearson: r=0.9894, p=0.00e+00
These are the metrics comp_superrec_obj_perf, comp_shneigh_obj_perf,
comp_h3dg_obj_perf, and comp_flamingo_obj_perf each return.
# To call the real comp_*_obj_perf functions, you would do:
#
# result = comp_h3dg_obj_perf(
# region1="chr1",
# bin_size_bp=50_000,
# balancing="KR",
# cm_fpath="/path/to/your.hic",
# points_fpath="/path/to/h3dg_output.pdb",
# mappings_fpath="/path/to/h3dg_coordinate_mapping.txt",
# )
# print(result)
#
# The function returns a dict like:
# {
# "spearman_r": float, # Spearman rank correlation
# "spearman_p": float, # two-tailed p-value
# "pearson_r": float, # Pearson product-moment correlation
# "pearson_p": float, # two-tailed p-value
# "region": str,
# "data_ratio": float, # fraction of valid bins after filtering
# }
print("To use the real evaluators, supply a .hic/.cool contact matrix + ")
print("a .pdb from the external reconstructor (SuperRec / SHNeigh / H3DG / FLAMINGO).")
To use the real evaluators, supply a .hic/.cool contact matrix +
a .pdb from the external reconstructor (SuperRec / SHNeigh / H3DG / FLAMINGO).
Summary#
Decision tree for the H3DG workflow:
Use case |
Function |
Output |
|---|---|---|
Locate the most-recent H3DG |
|
|
Parse a single |
|
|
Build a full-length |
|
|
Evaluate H3DG output against ground truth |
|
|
Evaluate SuperRec / SHNeigh / FLAMINGO output |
|
|
The comp_*_obj_perf family is the canonical entry point for reconstruction-quality benchmarks. All four functions return a dict with spearman_r, spearman_p, pearson_r, pearson_p, and region keys, so they’re directly comparable across reconstructors.