gunz_cm.utils package
Submodules
gunz_cm.utils.intervals module
Genomic interval utilities for binning and set operations, implemented to reduce the dependency on bioframe in core data-loading tasks.
- gunz_cm.utils.intervals.binnify(chromsizes: Dict[str, int], binsize: int) → DataFrame
Divide a genome into evenly sized bins. Matches bioframe.binnify logic.
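The binning rule can be illustrated without pandas. The following is a minimal sketch, not the package's implementation: `binnify_sketch` is a hypothetical name, it returns plain `(chrom, start, end)` tuples instead of a DataFrame, and it assumes the bioframe convention that the last bin of each chromosome is clipped to the chromosome length.

```python
from typing import Dict, List, Tuple


def binnify_sketch(chromsizes: Dict[str, int], binsize: int) -> List[Tuple[str, int, int]]:
    """Hypothetical stand-in for binnify: returns (chrom, start, end) rows."""
    if binsize <= 0:
        raise ValueError("binsize must be a positive integer")
    rows = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            # The final bin is clipped so it never runs past the chromosome end.
            rows.append((chrom, start, min(start + binsize, length)))
    return rows


bins = binnify_sketch({"chr1": 250, "chr2": 100}, binsize=100)
for row in bins:
    print(row)
```

In this sketch a 250 bp chromosome with a bin size of 100 yields two full bins and one 50 bp remainder bin.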
- gunz_cm.utils.intervals.subtract(df1: DataFrame, df2: DataFrame) → DataFrame
Remove intervals from df1 that overlap with any interval in df2. Simplified implementation of interval subtraction.
- Parameters:
df1 (pd.DataFrame) – Target intervals (e.g., training windows).
df2 (pd.DataFrame) – Excluded intervals (e.g., centromeres, blacklisted regions).
- Returns:
Filtered df1 containing only intervals that do NOT overlap with df2.
- Return type:
pd.DataFrame
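The filtering behaviour described for subtract (whole intervals are dropped, not trimmed) can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: `subtract_sketch` and `overlaps` are hypothetical names, intervals are `(chrom, start, end)` tuples rather than DataFrame rows, and coordinates are assumed to be half-open.

```python
from typing import List, Tuple

# (chrom, start, end); half-open coordinates are an assumption of this sketch.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def subtract_sketch(targets: List[Interval], excluded: List[Interval]) -> List[Interval]:
    """Keep only the target intervals that overlap none of the excluded intervals."""
    return [t for t in targets if not any(overlaps(t, e) for e in excluded)]


windows = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
centromeres = [("chr1", 150, 160)]
kept = subtract_sketch(windows, centromeres)
print(kept)
```

The window `("chr1", 100, 200)` is removed entirely because it touches the excluded region; the remaining windows are returned unchanged.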
gunz_cm.utils.logger module
Centralized logging configuration for the gunz_cm package.
gunz_cm.utils.matrix module
Module.
Examples
gunz_cm.utils.path module
Utilities for path manipulation and repository root discovery.
- gunz_cm.utils.path.append_root_dir() → None
Append the root directory to sys.path.
This function first retrieves the root directory of the Git repository using get_root_dir(). If the root directory is not already in sys.path, it appends it.
- gunz_cm.utils.path.get_root_dir() → str
Get the root directory of the Git repository.
This function attempts to find the root directory of the Git repository by searching from the current directory upwards. If no Git repository is found, it raises a RuntimeError.
- Returns:
The root directory of the Git repository.
- Return type:
str
- Raises:
RuntimeError – If a Git repository could not be found in the current or parent directories.
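The upward search and the guarded `sys.path` append described above might look roughly like the sketch below. It is an assumption-laden illustration: `find_git_root` and `append_to_sys_path` are hypothetical helpers, the search is assumed to look for a `.git` entry in each parent directory, and the real functions take no arguments (the `start` parameter exists here only so the example can run against a temporary directory).

```python
import sys
import tempfile
from pathlib import Path


def find_git_root(start: Path) -> str:
    """Hypothetical helper: walk upwards from start until a .git entry is found."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / ".git").exists():
            return str(candidate)
    raise RuntimeError(f"No Git repository found at or above {start}")


def append_to_sys_path(root: str) -> None:
    """Append root to sys.path only if it is not already present."""
    if root not in sys.path:
        sys.path.append(root)


# Build a throwaway "repository" so the sketch does not depend on the caller's cwd.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve()
    (repo / ".git").mkdir()
    nested = repo / "src" / "pkg"
    nested.mkdir(parents=True)

    root = find_git_root(nested)
    found_expected_root = root == str(repo)

    append_to_sys_path(root)
    append_to_sys_path(root)  # second call is a no-op
    occurrences = sys.path.count(root)
    sys.path.remove(root)  # leave sys.path as we found it
```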
gunz_cm.utils.resources module
Utilities for fetching genomic resources (centromeres, blacklists).
- gunz_cm.utils.resources.fetch_centromeres(genome: str, cache: bool = True, cache_dir: Path = PosixPath('/home/adhisant/.gunz_cm/resources')) → DataFrame
Fetch centromere coordinates for a given genome assembly from UCSC.
- Parameters:
genome (str) – Genome assembly name (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
cache (bool, optional) – Whether to cache the downloaded data. Defaults to True.
cache_dir (pathlib.Path, optional) – Directory to store cached files. Defaults to ~/.gunz_cm/resources.
- Returns:
DataFrame with columns: [‘chrom’, ‘start’, ‘end’, ‘name’, ‘gieStain’].
- Return type:
pd.DataFrame
gunz_cm.utils.stream module
Utilities for low-level binary stream and byte manipulation.
- gunz_cm.utils.stream.bstr2int(payload: bytes, order: str | None = 'big') → int
Converts a byte string to an integer.
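- Parameters:
payload (bytes) – Byte string to convert.
order (str, optional) – Byte order, 'big' or 'little'. Defaults to 'big'.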
- Returns:
The integer representation of the byte string.
- Return type:
int
- Raises:
ValueError – If the order is not recognized or the payload is empty.
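The effect of the byte order is easiest to see on a two-byte payload. The sketch below is not the package's code: `bstr2int_sketch` is a hypothetical name, it is built on the standard `int.from_bytes`, it assumes unsigned interpretation, and it does not model what the real function does when `order` is None.

```python
def bstr2int_sketch(payload: bytes, order: str = "big") -> int:
    """Hypothetical stand-in for bstr2int, built on int.from_bytes."""
    if order not in ("big", "little"):
        raise ValueError(f"Unrecognized byte order: {order!r}")
    if not payload:
        raise ValueError("payload is empty")
    # Unsigned interpretation is an assumption of this sketch.
    return int.from_bytes(payload, byteorder=order)


big = bstr2int_sketch(b"\x01\x00")                     # most significant byte first
little = bstr2int_sketch(b"\x01\x00", order="little")  # least significant byte first
print(big, little)
```

The same two bytes decode to 256 in big-endian order and to 1 in little-endian order.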
- gunz_cm.utils.stream.int2bstr(val: int, len_in_byte: int, order: Literal['big', 'little'] = 'big') → bytes
Converts an integer to a byte string of a specified length and byte order.
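- Parameters:
val (int) – Integer to convert.
len_in_byte (int) – Length of the resulting byte string, in bytes.
order (Literal['big', 'little'], optional) – Byte order. Defaults to 'big'.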
- Returns:
The byte string representation of the integer.
- Return type:
bytes
- Raises:
ValueError – If the order is not recognized.
TypeError – If val is not an integer.
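A fixed-length encoding and its round trip can be sketched with the standard `int.to_bytes`. As before, this is an illustration under assumptions rather than the package's implementation: `int2bstr_sketch` is a hypothetical name, and only the TypeError and ValueError cases listed above are modelled.

```python
def int2bstr_sketch(val: int, len_in_byte: int, order: str = "big") -> bytes:
    """Hypothetical stand-in for int2bstr, built on int.to_bytes."""
    if not isinstance(val, int):
        raise TypeError(f"val must be an int, got {type(val).__name__}")
    if order not in ("big", "little"):
        raise ValueError(f"Unrecognized byte order: {order!r}")
    # The result is zero-padded to exactly len_in_byte bytes.
    return val.to_bytes(len_in_byte, byteorder=order)


encoded_big = int2bstr_sketch(1024, 4)
encoded_little = int2bstr_sketch(1024, 4, order="little")
round_trip = int.from_bytes(encoded_big, byteorder="big")
print(encoded_big, encoded_little, round_trip)
```

Note that `int.to_bytes` itself raises OverflowError when the value does not fit in the requested length; whether the package's function converts that into another exception is not stated in this reference.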