- spac.data_utils.downsample_cells(input_data, annotations, n_samples=None, stratify=False, rand=False, combined_col_name='_combined_', min_threshold=5)
Custom downsampling of data based on one or more annotations.
This function offers two primary modes of operation:
1. Grouping (stratify=False):
  - For a single annotation: The data is grouped by unique values of the annotation, and ‘n_samples’ rows are selected from each group.
  - For multiple annotations: The data is grouped based on unique combinations of the annotations, and ‘n_samples’ rows are selected from each combined group.
2. Stratification (stratify=True):
  - Annotations (single or multiple) are combined into a new column.
  - Proportionate stratified sampling is performed based on the unique combinations in the new column, ensuring that the downsampled dataset maintains the proportionate representation of each combined group from the original dataset.
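The grouping mode can be sketched in plain Python. This is only an illustration of the behavior described above, not the library's implementation: `take_per_group` is a hypothetical helper, rows are dictionaries rather than a `pd.DataFrame`, and it assumes the first `n_samples` rows of each group are kept and that groups smaller than `n_samples` are returned whole.

```python
from collections import defaultdict

def take_per_group(rows, annotations, n_samples):
    """Keep the first n_samples rows of each group (hypothetical helper)."""
    if isinstance(annotations, str):
        annotations = [annotations]
    groups = defaultdict(list)
    for row in rows:
        # Multiple annotation values are joined with an underscore to form the group key.
        key = "_".join(str(row[a]) for a in annotations)
        groups[key].append(row)
    kept = []
    for members in groups.values():
        # Assumption: a group with fewer than n_samples rows is kept in full.
        kept.extend(members[:n_samples])
    return kept

cells = [
    {"cell_type": "T", "region": "core"},
    {"cell_type": "T", "region": "core"},
    {"cell_type": "T", "region": "edge"},
    {"cell_type": "B", "region": "core"},
    {"cell_type": "B", "region": "core"},
    {"cell_type": "B", "region": "core"},
]

by_type = take_per_group(cells, "cell_type", n_samples=2)
by_type_and_region = take_per_group(cells, ["cell_type", "region"], n_samples=2)
print(len(by_type))             # 4: two T rows and two B rows
print(len(by_type_and_region))  # 5: T_core 2, T_edge 1, B_core 2
```

Grouping by one annotation yields at most `n_samples` rows per label, while grouping by two annotations applies the same cap to each combination of labels.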
- Parameters:
input_data (pd.DataFrame) – The input data frame.
annotations (str or list of str) – The column name(s) to downsample on. If multiple column names are provided, their values are combined using an underscore as a separator.
n_samples (int, default=None) – The number of samples to return. Behavior differs based on the ‘stratify’ parameter:
  - stratify=False: Returns ‘n_samples’ for each unique value (or combination) of annotations.
  - stratify=True: Returns a total of ‘n_samples’ stratified by the frequency of every label or combined labels in the annotation(s).
stratify (bool, default=False) – If true, perform proportionate stratified sampling based on the unique combinations of annotations. This ensures that the downsampled dataset maintains the proportionate representation of each combined group from the original dataset.
rand (bool, default=False) – If true and stratify is True, randomly select the returned cells. Otherwise, choose the first n cells.
combined_col_name (str, default='_combined_') – Name of the column that will store combined values when multiple annotation columns are provided.
min_threshold (int, default=5) – The minimum number of samples a combined group must have in the original dataset to be considered in the downsampled dataset. Groups with fewer samples than this threshold are excluded from the stratification process.
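As a rough illustration of how `combined_col_name` and `min_threshold` interact, the sketch below builds the combined label and drops groups that fall below the threshold. `drop_small_groups` is a hypothetical helper written for this example; it works on a list of dictionaries and is not taken from the library's source.

```python
from collections import Counter

def drop_small_groups(rows, annotations, combined_col_name="_combined_", min_threshold=5):
    """Add a combined label to each row and drop under-populated groups (hypothetical helper)."""
    labelled = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are not modified
        # Combine annotation values with an underscore separator.
        new_row[combined_col_name] = "_".join(str(row[a]) for a in annotations)
        labelled.append(new_row)
    counts = Counter(r[combined_col_name] for r in labelled)
    # Groups with fewer rows than min_threshold are excluded.
    return [r for r in labelled if counts[r[combined_col_name]] >= min_threshold]

cells = [{"cell_type": "T", "region": "core"}] * 6 + [{"cell_type": "B", "region": "edge"}] * 2
kept = drop_small_groups(cells, ["cell_type", "region"])
print(len(kept))                        # 6
print({r["_combined_"] for r in kept})  # {'T_core'}
```

With the default threshold of 5, the two `B_edge` rows are excluded before any sampling takes place.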
- Returns:
output_data – The downsampled data frame; when stratify=True it is proportionately stratified.
- Return type:
pd.DataFrame
Notes
This function emphasizes proportionate stratified sampling, ensuring that the downsampled dataset is a representative subset of the original data with respect to the combined annotations. Due to this proportionate nature, not all unique combinations from the original dataset might be present in the downsampled dataset, especially if a particular combination has very few samples in the original dataset. The min_threshold parameter can be adjusted to determine the minimum number of samples a combined group should have in the original dataset to appear in the downsampled version.
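The effect described in this note can be seen with a small allocation sketch. `proportional_allocation` is a hypothetical helper, and the rounding rule is an assumption made for illustration; the library may distribute remainders differently.

```python
def proportional_allocation(group_sizes, n_samples):
    """Split n_samples across groups in proportion to their size (hypothetical helper)."""
    total = sum(group_sizes.values())
    # Assumption: each share is simply rounded to the nearest integer.
    return {
        label: round(n_samples * size / total)
        for label, size in group_sizes.items()
    }

group_sizes = {"A": 90, "B": 9, "C": 1}
allocation = proportional_allocation(group_sizes, n_samples=10)
print(allocation)  # {'A': 9, 'B': 1, 'C': 0}
```

Group `C` makes up 1% of the data, so its share of a 10-row sample rounds to zero and it disappears from the result. The sketch ignores `min_threshold`, which at its default of 5 would already have excluded `C` before allocation.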