spac.data_utils module

spac.data_utils.add_pin_color_rules(adata, label_color_dict: dict, color_map_name: str = '_spac_colors', overwrite: bool = True) → Tuple[dict, str]

Adds pin color rules to the AnnData object and scans for matching labels.

This function scans the unique labels in each adata.obs column and the column names in all adata tables to find the labels defined by the pin color rules.

Parameters:

adata – The anndata object containing upstream analysis.
label_color_dict (dict) – Dictionary of pin color rules with label as key and color as value.
color_map_name (str) – The name to use for storing pin color rules in adata.uns.
overwrite (bool, optional) – Whether to overwrite existing pin color rules in adata.uns with the same name, by default True.

Returns:

label_matches (dict) – Dictionary with the matching labels in each section (obs, var, X, etc.).
result_str (str) – Summary string with the matching labels in each section (obs, var, X, etc.).

Raises:

ValueError – If color_map_name already exists in adata.uns and overwrite is False.
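
The storage and matching behavior described above can be pictured with a small plain-Python sketch. It is not the library's implementation: the helpers store_color_rules and find_matching_labels are hypothetical, a dict stands in for adata.uns, and a dict of lists stands in for the adata.obs columns.

```python
def store_color_rules(uns, label_color_dict, color_map_name="_spac_colors",
                      overwrite=True):
    """Store a label-to-color mapping in `uns` (a dict standing in for adata.uns)."""
    if color_map_name in uns and not overwrite:
        raise ValueError(
            f"'{color_map_name}' already exists and overwrite is False."
        )
    uns[color_map_name] = dict(label_color_dict)
    return uns


def find_matching_labels(obs_columns, label_color_dict):
    """Return, per column, the sorted unique labels that have a pinned color."""
    matches = {}
    for column, values in obs_columns.items():
        found = sorted(set(values) & set(label_color_dict))
        if found:
            matches[column] = found
    return matches


uns = {}
rules = {"tumor": "red", "stroma": "blue"}
store_color_rules(uns, rules)

# Hypothetical obs columns: only "cell_type" contains labels named in the rules.
obs = {"cell_type": ["tumor", "stroma", "tumor"], "region": ["A", "B", "A"]}
label_matches = find_matching_labels(obs, rules)
```

Calling store_color_rules a second time with overwrite=False raises ValueError, mirroring the documented behavior.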

spac.data_utils.add_rescaled_features(adata, min_quantile, max_quantile, layer)

Clip and rescale the features matrix.

The results will be added into a new layer in the AnnData object.

Parameters:

adata (anndata.AnnData) – The AnnData object.
min_quantile (float) – The minimum quantile to rescale to zero.
max_quantile (float) – The maximum quantile to rescale to one.
layer (str) – The name of the new layer to add to the anndata object.

spac.data_utils.append_annotation(data: DataFrame, annotation: dict) → DataFrame

Append a new annotation with a single value to a pandas DataFrame based on mapping rules.

Parameters:

data (pd.DataFrame) – The input DataFrame to which the new annotation will be appended.
annotation (dict) – Dictionary of pairs, each mapping a new annotation column name to its value, in the form <new annotation column name>:<value of the annotation>. Each value must be a single string or numeric value.

Returns:

The DataFrame with the new annotation appended.

Return type:

pd.DataFrame
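
A minimal pandas sketch of the idea, assuming each dictionary entry becomes one constant-valued column. The helper append_constant_columns is hypothetical, and working on a copy is an assumption of this sketch rather than documented behavior.

```python
import pandas as pd


def append_constant_columns(data, annotation):
    """Add one constant-valued column per (name, value) pair."""
    out = data.copy()  # assumption: leave the caller's frame untouched
    for name, value in annotation.items():
        out[name] = value  # the scalar is broadcast to every row
    return out


cells = pd.DataFrame({"cell_id": [1, 2, 3]})
annotated = append_constant_columns(cells, {"slide": "S01", "batch": 2})
```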

spac.data_utils.bin2cat(data, one_hot_annotations, new_annotation)

Combine a set of columns representing a binary one hot encoding of categories into a new categorical column.

Parameters:

data (pandas.DataFrame) – The pandas dataframe containing the one hot encoded annotations.
one_hot_annotations (str or list of str) – A string or a list of strings giving Python regular expressions that match the one-hot encoded annotation columns in the data frame.
new_annotation (str) – The column name for new categorical annotation to be created.

Returns:

pandas.DataFrame – DataFrame with new categorical column added.

Example

>>> data = pd.DataFrame({
...     'A': [1, 1, 0, 0],
...     'B': [0, 0, 1, 0]
... })
>>> one_hot_annotations = ['A', 'B']
>>> new_annotation = 'new_category'
>>> result = bin2cat(data, one_hot_annotations, new_annotation)
>>> print(result[new_annotation])
0      A
1      A
2      B
3    NaN
Name: new_category, dtype: object

spac.data_utils.calculate_centroid(data, x_min, x_max, y_min, y_max, new_x, new_y)

Calculate the spatial coordinates of the cell centroid as the average of min and max coordinates.

Parameters:

data (pd.DataFrame) – The input data frame. The dataframe should contain four columns for x_min, x_max, y_min, and y_max for centroid calculation.
x_min (str) – column name with minimum x value
x_max (str) – column name with maximum x value
y_min (str) – column name with minimum y value
y_max (str) – column name with maximum y value
new_x (str) – the new column name for the x dimension of the centroid; allowed characters are letters, digits, and underscores
new_y (str) – the new column name for the y dimension of the centroid; allowed characters are letters, digits, and underscores

Returns:

data – DataFrame with two new centroid columns added. Note that the DataFrame is modified in place.

Return type:

pd.DataFrame
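
The centroid is the midpoint of the bounding box along each axis. A short pandas sketch of that arithmetic follows; the column names (XMin, XMax, YMin, YMax, centroid_x, centroid_y) are illustrative and not taken from the library.

```python
import pandas as pd

boxes = pd.DataFrame({
    "XMin": [0.0, 10.0],
    "XMax": [4.0, 20.0],
    "YMin": [2.0, 0.0],
    "YMax": [6.0, 5.0],
})

# Midpoint of the min and max coordinate on each axis, added in place.
boxes["centroid_x"] = (boxes["XMin"] + boxes["XMax"]) / 2
boxes["centroid_y"] = (boxes["YMin"] + boxes["YMax"]) / 2
```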

spac.data_utils.combine_annotations(adata: AnnData, annotations: list, separator: str, new_annotation_name: str) → AnnData

Combine multiple annotations into a new annotation using a defined separator.

Parameters:

adata (AnnData) – The input AnnData object whose .obs will be modified.
annotations (list) – List of annotation column names to combine.
separator (str) – Separator to use when combining annotations.
new_annotation_name (str) – The name of the new annotation to be created.

Returns:

The AnnData object with the combined annotation added.

Return type:

AnnData
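
As a rough illustration of the combination step, the sketch below joins columns of a plain DataFrame standing in for adata.obs. Casting every value to a string before joining is an assumption of this sketch, and the column names are made up.

```python
import pandas as pd

obs = pd.DataFrame({
    "region": ["A", "B"],
    "cell_type": ["tumor", "stroma"],
})
annotations = ["region", "cell_type"]
separator = "_"

# Start from the first column, then append each further column with the separator.
combined = obs[annotations[0]].astype(str)
for column in annotations[1:]:
    combined = combined + separator + obs[column].astype(str)

obs["region_cell_type"] = combined
```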

spac.data_utils.combine_dfs(dataframes: list)

Combine multiple pandas DataFrames into one. The schema of the first DataFrame is considered primary. A warning is printed if the schema of a subsequent DataFrame differs from the primary.

Parameters:

dataframes (list[pd.DataFrame]) – A list of pandas DataFrames to be combined.

Returns:

The combined DataFrame.

Return type:

pd.DataFrame
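
One way to picture the schema check is the sketch below. The helper concat_with_schema_check is hypothetical; it compares column lists only and reports mismatches through the warnings module, both of which are assumptions about what "schema" and "printed" mean here.

```python
import warnings

import pandas as pd


def concat_with_schema_check(dataframes):
    """Concatenate frames, warning when a frame's columns differ from the first."""
    primary_columns = list(dataframes[0].columns)
    for position, frame in enumerate(dataframes[1:], start=1):
        if list(frame.columns) != primary_columns:
            warnings.warn(
                f"Schema of dataframe {position} differs from the first dataframe."
            )
    return pd.concat(dataframes, ignore_index=True)


first = pd.DataFrame({"a": [1], "b": [2]})
second = pd.DataFrame({"a": [3], "c": [4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    combined = concat_with_schema_check([first, second])
```

Columns missing from one of the inputs are filled with NaN in the combined result.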

spac.data_utils.concatinate_regions(regions)

Concatenate data from multiple regions and create new indexes.

Parameters:

regions (list of anndata.AnnData) – AnnData objects to be concatenated.

Returns:

New AnnData object with the concatenated values in AnnData.X.

Return type:

anndata.AnnData

spac.data_utils.downsample_cells(input_data, annotations, n_samples=None, stratify=False, rand=False, combined_col_name='_combined_', min_threshold=5)

Custom downsampling of data based on one or more annotations.

This function offers two primary modes of operation:

1. Grouping (stratify=False):
   - For a single annotation: The data is grouped by unique values of the annotation, and n_samples rows are selected from each group.
   - For multiple annotations: The data is grouped based on unique combinations of the annotations, and n_samples rows are selected from each combined group.
2. Stratification (stratify=True):
   - Annotations (single or multiple) are combined into a new column.
   - Proportionate stratified sampling is performed based on the unique combinations in the new column, ensuring that the downsampled dataset maintains the proportionate representation of each combined group from the original dataset.

Parameters:

input_data (pd.DataFrame) – The input data frame.
annotations (str or list of str) – The column name(s) to downsample on. If multiple column names are provided, their values are combined using an underscore as a separator.
n_samples (int, default=None) – The number of samples to return. Behavior differs based on the stratify parameter:
- stratify=False: Returns n_samples for each unique value (or combination) of annotations.
- stratify=True: Returns a total of n_samples stratified by the frequency of every label or combined labels in the annotation(s).
stratify (bool, default=False) – If True, perform proportionate stratified sampling based on the unique combinations of annotations. This ensures that the downsampled dataset maintains the proportionate representation of each combined group from the original dataset.
rand (bool, default=False) – If True and stratify is True, randomly select the returned cells. Otherwise, choose the first n cells.
combined_col_name (str, default='_combined_') – Name of the column that will store combined values when multiple annotation columns are provided.
min_threshold (int, default=5) – The minimum number of samples a combined group should have in the original dataset to be considered in the downsampled dataset. Groups with fewer samples than this threshold will be excluded from the stratification process. Adjusting this parameter determines the minimum presence a combined group should have in the original dataset to appear in the downsampled version.

Returns:

output_data – The proportionately stratified downsampled data frame.

Return type:

pd.DataFrame

Notes

This function emphasizes proportionate stratified sampling, ensuring that the downsampled dataset is a representative subset of the original data with respect to the combined annotations. Due to this proportionate nature, not all unique combinations from the original dataset might be present in the downsampled dataset, especially if a particular combination has very few samples in the original dataset. The min_threshold parameter can be adjusted to determine the minimum number of samples a combined group should have in the original dataset to appear in the downsampled version.
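
The two modes can be contrasted on a toy data frame. The sketch below is not the library's code: it always takes the first rows of each group (no random selection), and it allocates the stratified quota by simple rounding, which does not guarantee the quotas sum exactly to the requested total in general.

```python
import pandas as pd

cells = pd.DataFrame({
    "cell_type": ["tumor"] * 6 + ["stroma"] * 3 + ["immune"] * 1,
    "value": range(10),
})

# Grouping mode: up to n_samples rows from each group.
n_samples = 2
per_group = cells.groupby("cell_type", sort=False).head(n_samples)

# Stratified mode: split a total of n_total rows across groups by frequency,
# after dropping groups smaller than min_threshold.
n_total = 5
min_threshold = 2
counts = cells["cell_type"].value_counts()
kept = counts[counts >= min_threshold]
quota = (kept / kept.sum() * n_total).round().astype(int)
stratified = pd.concat(
    [cells[cells["cell_type"] == label].head(k) for label, k in quota.items()]
)
```

Here the single immune cell falls below the threshold and is excluded from the stratified result, while the grouping mode still returns it.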

spac.data_utils.ingest_cells(dataframe, regex_str, x_col=None, y_col=None, annotation=None)

Convert the data frame into an AnnData object.

The function will also initialize features and spatial coordinates.

Parameters:

dataframe (pandas.DataFrame) – The data frame that contains cells as rows and cell information as columns.
regex_str (str or list of str) – A string or a list of strings representing Python regular expressions for the feature columns in the data frame.
x_col (str) – The column name for the x coordinate of the cell.
y_col (str) – The column name for the y coordinate of the cell.
annotation (str or list of str) – The column name for the region that the cells belong to. If a list is passed, multiple annotations will be created in the returned AnnData object.

Returns:

The generated AnnData object

Return type:

anndata.AnnData
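
The column-selection step can be sketched without AnnData. The example below picks feature columns by regular expression and extracts coordinates as an array. Using re.fullmatch is an assumption of this sketch (the library may match differently), and the column names are made up.

```python
import re

import pandas as pd

table = pd.DataFrame({
    "CD4_mean": [0.1, 0.4],
    "CD8_mean": [0.3, 0.2],
    "X_centroid": [10.0, 12.0],
    "Y_centroid": [5.0, 7.0],
    "region": ["r1", "r2"],
})

patterns = [r".*_mean"]

# Keep every column that fully matches at least one pattern.
feature_columns = [
    column for column in table.columns
    if any(re.fullmatch(pattern, column) for pattern in patterns)
]

features = table[feature_columns]
coords = table[["X_centroid", "Y_centroid"]].to_numpy()
```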

spac.data_utils.load_csv_files(file_names)

Read the csv file(s) into a pandas dataframe.

Parameters:

file_names (str or list) – A CSV file path or a list of CSV file paths to be combined into a single DataFrame.

Returns:

A pandas DataFrame of all the CSV files. The returned dataset will have an extra column called "loaded_file_name" containing the source file name.

Return type:

pandas.DataFrame
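
A self-contained sketch of the loading pattern, writing two small CSV files to a temporary directory first. The helper read_csvs_with_source is hypothetical, and storing only the base file name (rather than the full path) in loaded_file_name is an assumption.

```python
import os
import tempfile

import pandas as pd


def read_csvs_with_source(file_names):
    """Read one or more CSV files and tag each row with its source file name."""
    if isinstance(file_names, str):
        file_names = [file_names]
    frames = []
    for path in file_names:
        frame = pd.read_csv(path)
        frame["loaded_file_name"] = os.path.basename(path)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in [("a.csv", "x,y\n1,2\n"), ("b.csv", "x,y\n3,4\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(content)
        paths.append(path)
    loaded = read_csvs_with_source(paths)
```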

spac.data_utils.rescale_features(features, min_quantile=0.01, max_quantile=0.99)

Clip and rescale features outside the minimum and maximum quantile.

The rescaled features will be between 0 and 1.

Parameters:

features (pandas.DataFrame) – The DataFrame of features.
min_quantile (float) – The minimum quantile to be considered zero.
max_quantile (float) – The maximum quantile to be considered one.

Returns:

The created DataFrame with normalized features.

Return type:

pandas.DataFrame
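
Clip-and-rescale by quantile can be written in a few lines of pandas. The sketch below is an illustration under assumptions, not the library's implementation: the helper name clip_and_rescale is hypothetical, quantiles use pandas' default linear interpolation, and a constant column (equal low and high quantiles) would divide by zero.

```python
import pandas as pd


def clip_and_rescale(features, min_quantile=0.01, max_quantile=0.99):
    """Clip each column to its quantile range, then map that range to [0, 1]."""
    low = features.quantile(min_quantile)
    high = features.quantile(max_quantile)
    clipped = features.clip(lower=low, upper=high, axis=1)
    return (clipped - low) / (high - low)


features = pd.DataFrame({"CD4": [0.0, 1.0, 2.0, 3.0, 100.0]})
scaled = clip_and_rescale(features, min_quantile=0.25, max_quantile=0.75)
```

With these inputs the 0.25 and 0.75 quantiles are 1.0 and 3.0, so the outlier 100.0 is clipped to 3.0 and maps to 1.0.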

spac.data_utils.select_values(data, annotation, values=None, exclude_values=None)

Selects values from either a pandas DataFrame or an AnnData object based on the annotation and values.

Parameters:

data (pandas.DataFrame or anndata.AnnData) – The input data. Can be a DataFrame for tabular data or an AnnData object.
annotation (str) – The column name in a DataFrame or the annotation key in an AnnData object to be used for selection.
values (str or list of str) – List of values for the annotation to include. If None, all values are considered for selection.
exclude_values (str or list of str) – List of values for the annotation to exclude. Can’t be combined with values.

Returns:

The filtered DataFrame or AnnData object containing only the selected rows based on the annotation and values.

Return type:

pandas.DataFrame or anndata.AnnData
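
For the DataFrame case, the include/exclude logic amounts to a boolean mask. The helper filter_by_annotation below is hypothetical, and raising ValueError when both arguments are given is an assumption about how the "can't be combined" rule is enforced.

```python
import pandas as pd


def filter_by_annotation(data, annotation, values=None, exclude_values=None):
    """Keep rows whose annotation is in `values`, or not in `exclude_values`."""
    if values is not None and exclude_values is not None:
        raise ValueError("Use either values or exclude_values, not both.")
    if isinstance(values, str):
        values = [values]
    if isinstance(exclude_values, str):
        exclude_values = [exclude_values]
    if values is not None:
        return data[data[annotation].isin(values)]
    if exclude_values is not None:
        return data[~data[annotation].isin(exclude_values)]
    return data  # nothing to filter on: all rows are kept


cells = pd.DataFrame({
    "phenotype": ["T", "B", "T", "NK"],
    "area": [1, 2, 3, 4],
})
kept = filter_by_annotation(cells, "phenotype", values=["T"])
dropped = filter_by_annotation(cells, "phenotype", exclude_values="NK")
```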

spac.data_utils.subtract_min_per_region(adata, annotation, layer, min_quantile=0.01)

Subtract the minimum quantile of every marker per region.

Parameters:

adata (anndata.AnnData) – The AnnData object.
annotation (str) – The name of the annotation in adata to define batches.
min_quantile (float) – The minimum quantile to rescale to zero.
layer (str) – The name of the new layer to add to the AnnData object.
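
A per-region version of the subtraction can be sketched with a pandas groupby on a plain DataFrame standing in for the AnnData object. Clipping negative results to zero is an assumption of this sketch, and the column names are illustrative.

```python
import pandas as pd

cells = pd.DataFrame({
    "region": ["r1", "r1", "r1", "r2", "r2", "r2"],
    "CD4": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
min_quantile = 0.5  # deliberately large so the effect is visible on six rows

# Quantile of the marker within each region, broadcast back to every row.
floor = cells.groupby("region")["CD4"].transform(
    lambda marker: marker.quantile(min_quantile)
)

# Subtract the regional floor; values below it become zero (assumption).
cells["CD4_shifted"] = (cells["CD4"] - floor).clip(lower=0)
```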

spac.data_utils.subtract_min_quantile(features, min_quantile=0.01)

Subtract the value at the minimum quantile of each feature from the corresponding column.

Parameters:

features (pandas.DataFrame) – The dataframe of features.
min_quantile (float) – The minimum quantile to be considered zero.

Returns:

DataFrame with rescaled features.

Return type:

pandas.DataFrame
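
A minimal sketch of column-wise quantile subtraction, assuming that values falling below the quantile are clipped to zero afterwards (the parameter description suggests this, but it is an assumption here).

```python
import pandas as pd

features = pd.DataFrame({"CD4": [0.0, 1.0, 2.0, 3.0, 4.0]})
min_quantile = 0.25

# Per-column value at the chosen quantile (1.0 for this column).
low = features.quantile(min_quantile)

# Shift every column down by its quantile value and floor the result at zero.
shifted = (features - low).clip(lower=0)
```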

spac.data_utils.summarize_dataframe(df: DataFrame, columns, print_nan_locations: bool = False) → dict

Summarize specified columns in a DataFrame.

For numeric columns, computes summary statistics. For categorical columns, returns unique labels and frequencies. In both cases, missing values (None/NaN) are flagged and their row indices identified.

Parameters:

df (pd.DataFrame) – The DataFrame to summarize.
columns (str or list of str) – The column name or list of column names to analyze.
print_nan_locations (bool, optional) – If True, prints the row indices where None/NaN values occur. Default is False.

Returns:

A dictionary where each key is a column name and its value is another dictionary with:

- 'data_type': either 'numeric' or 'categorical'
- 'missing_count': int
- 'missing_indices': list of row indices with missing values
- 'summary': summary statistics if numeric, or unique labels with counts if categorical

Return type:

dict
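
The shape of the returned dictionary can be illustrated with a short sketch. The helper summarize_column is hypothetical; using describe() for numeric columns and value_counts() for categorical ones is an assumption about what the summary contains.

```python
import pandas as pd


def summarize_column(series):
    """Build a summary dictionary for one column, in the documented shape."""
    missing = series.isna()
    if pd.api.types.is_numeric_dtype(series):
        data_type = "numeric"
        summary = series.describe().to_dict()
    else:
        data_type = "categorical"
        summary = series.value_counts().to_dict()  # NaN is not counted
    return {
        "data_type": data_type,
        "missing_count": int(missing.sum()),
        "missing_indices": series.index[missing.to_numpy()].tolist(),
        "summary": summary,
    }


df = pd.DataFrame({
    "area": [1.0, None, 3.0],
    "phenotype": ["T", "B", None],
})
report = {column: summarize_column(df[column]) for column in ["area", "phenotype"]}
```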
