gunz_cm.datasets

Module contents

Datasets module for gunz-cm.

Provides various PyTorch Dataset implementations for loading contact matrices.

class gunz_cm.datasets.GZCMDataset(*args: Any, **kwargs: Any)

Bases: Dataset

Dataset for .gzcm unified container format.

Supports GZCM v1 (dense), v2 (tiled/csr/block_sparse), and v3 (compressed tiles).

Parameters:

fpath (str) – Path to .gzcm file.
window_size (int) – Window size in bp.
output_type (str, default="sparse") – Output type: "sparse" or "dense".
downsample_ratio (float or tuple, optional) – Downsampling ratio.
decompress (bool, default=True) – If True, decode compressed tiles on access. If False, return raw bytes.
tile_cache_size (int, default=256) – Maximum number of decoded tiles to keep in the in-memory LRU cache. The cache is thread-safe (threading.RLock) so it is safe under PyTorch DataLoader(num_workers > 0). Set to 0 to disable caching.
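
The caching behaviour described for tile_cache_size can be pictured with a small, self-contained sketch. Everything named here (TileLRUCache, get_or_decode, fake_decode) is hypothetical and not part of gunz-cm. It only shows one common way to build a bounded LRU cache guarded by threading.RLock, where a capacity of 0 disables caching.

```python
import threading
from collections import OrderedDict
from typing import Callable, Hashable


class TileLRUCache:
    """Hypothetical bounded LRU cache; NOT the gunz-cm implementation."""

    def __init__(self, max_size: int = 256) -> None:
        self.max_size = max_size
        self._tiles: OrderedDict = OrderedDict()
        self._lock = threading.RLock()

    def get_or_decode(self, key: Hashable, decode: Callable[[], object]) -> object:
        # A capacity of 0 disables caching: decode on every access.
        if self.max_size <= 0:
            return decode()
        with self._lock:
            if key in self._tiles:
                self._tiles.move_to_end(key)  # mark as most recently used
                return self._tiles[key]
            tile = decode()
            self._tiles[key] = tile
            if len(self._tiles) > self.max_size:
                self._tiles.popitem(last=False)  # evict the least recently used tile
            return tile

    def __len__(self) -> int:
        with self._lock:
            return len(self._tiles)


decoded = []  # records which tiles were actually decoded


def fake_decode(tile_id: int) -> str:
    """Stand-in for an expensive tile decompression step."""
    decoded.append(tile_id)
    return f"tile-{tile_id}"


cache = TileLRUCache(max_size=2)
for tile_id in (0, 1, 0, 2, 1):
    cache.get_or_decode(tile_id, lambda t=tile_id: fake_decode(t))
```

With a capacity of 2, the second access to tile 0 is served from the cache, while tile 1 is evicted when tile 2 arrives and has to be decoded again on its next access.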

gunz_cm.datasets.GzcmDataset: alias of GZCMDataset

class gunz_cm.datasets.GzcmTileDataset(*args: Any, **kwargs: Any)

Bases: TileDataset

TileDataset reading from a .gzcm container file.

Parameters:

fpath (str) – Path to the .gzcm file. May be v1, v2, v3, or v4.
bin_size_bp (int) – Resolution in base pairs per axis bin.
window_size (int) – Tile width in base pairs.

class gunz_cm.datasets.HiCDataset(fpath: str, bin_size_bp: int, window_size: int, blacklist: pandas.core.frame.DataFrame | None = None, downsample_ratio: float | tuple[float, float] | None = None, balancing: gunz_cm.consts.Balancing | None = Balancing.NONE, output_type: str = 'sparse', **kwargs)

Bases: SparseCODataset

A PyTorch Dataset for on-the-fly loading of Hi-C patches from sparse files.

Inherits from SparseCODataset; subclasses only need to implement _load_patch() (the RCV fetch from the file) and the genomic-index mapping via _patch_boundaries(). The 4-key output dict (coords, features, target, info), the downsampling logic, and the dense output path all live in the base class.

gunz_cm.datasets.HiCSparseDataset: alias of HiCDataset

class gunz_cm.datasets.HiCTileDataset(*args: Any, **kwargs: Any)

Bases: TileDataset

TileDataset reading from .hic or .mcool via load_cm_data.

Parameters:

fpath (str) – Path to the .hic / .mcool / .cool source file.
bin_size_bp (int) – Resolution in base pairs per axis bin.
window_size (int) – Tile width in base pairs. Must be a multiple of bin_size_bp.
chrom (str, optional) – Restrict iteration to this chromosome. If None, iterate over every chromosome reported by the loader.
downsample_ratio, output_type, balancing, decompress, lr_fpath, lr_ds_ratio, lr_balancing – Forwarded to TileDataset.

Examples

>>> from gunz_cm.datasets import HiCTileDataset
>>> ds = HiCTileDataset(
...     fpath="data.hic",
...     bin_size_bp=50_000,
...     window_size=500_000,
...     chrom="chr1",
... )
>>> len(ds) > 0
True
>>> item = ds[0]
>>> item["coords"].dtype
torch.int64

class gunz_cm.datasets.MemmapTileDataset(*args: Any, **kwargs: Any)

Bases: TileDataset

TileDataset reading from a 2-D numpy array (memmap-friendly).

class gunz_cm.datasets.TileDataset(*args: Any, **kwargs: Any)

Bases: Dataset

Abstract base class for tile-based Hi-C contact matrix datasets.

Subclasses MUST implement _build_index() and _fetch_patch(s, e). Subclasses MAY override _load_weights() to populate self.weights from format-specific storage (e.g. .hic via load_cm_data, .gzcm via the embedded weights_* array).

Parameters:

window_size (int) – Tile width in base pairs.
bin_size_bp (int) – Resolution in base pairs per axis bin.
output_type ({"sparse", "dense"}) – Output format from __getitem__. Default "sparse".
downsample_ratio (float, tuple, or None) – Binomial subsampling ratio. A tuple (lo, hi) samples uniformly per call. None disables downsampling.
balancing (Balancing or None) – Normalization method applied at fetch time. None skips normalization.
decompress (bool) – If False, GZCM v3 datasets return raw tile bytes instead of decoded arrays. Ignored by non-GZCM subclasses. Default True.
lr_fpath (str or None) – Path to a low-resolution contact matrix for resolution-enhancement training. Setting this enables LR/HR pair output mode.
lr_ds_ratio (int or None) – Downscale factor for on-the-fly low-resolution downsampling when lr_fpath is None. Required when lr_fpath is set; ignored otherwise.
lr_balancing (Balancing or None) – Balancing method for the LR data. Defaults to balancing if unset.
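
The downsample_ratio semantics (binomial subsampling, with a (lo, hi) tuple sampled uniformly per call) can be illustrated with a standard-library sketch. The helper binomial_downsample is hypothetical and not a gunz-cm function. It assumes the ratio is the per-contact keep probability and that counts are non-negative integers.

```python
import random
from typing import List, Sequence, Tuple, Union

Ratio = Union[float, Tuple[float, float], None]


def binomial_downsample(counts: Sequence[int], ratio: Ratio, rng: random.Random) -> List[int]:
    """Hypothetical helper: thin integer contact counts by binomial sampling."""
    if ratio is None:
        return list(counts)  # downsampling disabled
    if isinstance(ratio, tuple):
        lo, hi = ratio
        p = rng.uniform(lo, hi)  # a fresh keep probability for every call
    else:
        p = float(ratio)
    # Binomial(n, p) drawn as the number of successes in n Bernoulli(p) trials.
    return [sum(rng.random() < p for _ in range(n)) for n in counts]


rng = random.Random(0)  # seeded only to make the sketch repeatable
counts = [12, 0, 7, 30]
thinned = binomial_downsample(counts, (0.1, 0.5), rng)
```

Each thinned count lies between 0 and the original count. On array data, a vectorised draw such as numpy.random.Generator.binomial would usually replace the per-contact loop.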

property index: DataFrame

Public DataFrame view of the tile index (chrom/start/end + start_bin/end_bin).

Kept for backward compatibility with code that introspects the index directly (e.g. SpatialBatchSampler, test fixtures).
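
To make the index columns concrete, here is a minimal sketch of how such a tile index could be laid out. The function build_tile_index is hypothetical, and several choices are assumptions rather than documented behaviour: tiles do not overlap, the last tile of a chromosome is clipped at the chromosome end, bin indices are per chromosome, and end_bin is exclusive. Rows are plain dicts, which could be passed to pandas.DataFrame to obtain a table with the same columns.

```python
from typing import Dict, List


def build_tile_index(chrom_sizes: Dict[str, int], window_size: int, bin_size_bp: int) -> List[dict]:
    """Hypothetical index builder: one row per non-overlapping tile."""
    if window_size % bin_size_bp != 0:
        raise ValueError("window_size must be a multiple of bin_size_bp")
    rows = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window_size):
            end = min(start + window_size, size)  # assumed: last tile is clipped
            rows.append({
                "chrom": chrom,
                "start": start,
                "end": end,
                "start_bin": start // bin_size_bp,
                "end_bin": -(-end // bin_size_bp),  # ceiling division, end-exclusive
            })
    return rows


index = build_tile_index({"chr1": 1_200_000}, window_size=500_000, bin_size_bp=50_000)
```

For a 1.2 Mb chromosome at 50 kb resolution with 500 kb tiles, this yields three rows covering bins 0 to 10, 10 to 20, and 20 to 24.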

gunz_cm.datasets.sparse_collate_fn(batch: List[Dict[str, Any]]) → Dict[str, Any]

Collate sparse-COO batch into a MinkowskiEngine-style dict.

Every batch item MUST have keys coords, features, target, info. The output has keys coords (with batch index prepended), features, target, infos (a list of per-item info dicts).

Empty input (no items) returns:

coords: shape (0, 3), int64
features: shape (0, 1), float32
target: shape (0,), float32
infos: []
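
The collation contract above can be sketched with plain Python lists standing in for tensors. The function collate_sparse is a hypothetical stand-in, not the library's sparse_collate_fn. It assumes each item's coords are (row, col) pairs, which is inferred from the (0, 3) empty shape once a batch index is prepended, and the info keys shown are invented for illustration.

```python
from typing import Any, Dict, List


def collate_sparse(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical list-based stand-in for the collation described above."""
    coords, features, target, infos = [], [], [], []
    for batch_idx, item in enumerate(batch):
        # Prepend the item's position in the batch to every (row, col) pair.
        coords.extend([batch_idx, row, col] for row, col in item["coords"])
        features.extend(item["features"])
        target.extend(item["target"])
        infos.append(item["info"])
    return {"coords": coords, "features": features, "target": target, "infos": infos}


batch = [
    {"coords": [[0, 1], [2, 2]], "features": [[3.0], [1.0]],
     "target": [5.0, 2.0], "info": {"chrom": "chr1"}},
    {"coords": [[4, 7]], "features": [[2.0]],
     "target": [4.0], "info": {"chrom": "chr2"}},
]
out = collate_sparse(batch)
# out["coords"] is [[0, 0, 1], [0, 2, 2], [1, 4, 7]]
```

The list version returns empty lists for an empty batch. The documented function instead returns empty arrays with the fixed shapes and dtypes listed above, so downstream code still sees the expected number of columns.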