`simcats_datasets.loading.pytorch`

Implementation of a PyTorch dataset class that can be used to train machine learning models on charge stability diagram (CSD) data.

@author: f.hader

Module Contents

Classes

`SimcatsDataset`	Pytorch Dataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.
`SimcatsConcatDataset`	Pytorch ConcatDataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.

Module Implementation Details

class simcats_datasets.loading.pytorch.SimcatsDataset(h5_path, specific_ids=None, load_ground_truth=None, data_preprocessors=None, ground_truth_preprocessors=None, format_output=None, preload=True, max_concurrent_preloads=100000, progress_bar=False, sensor_scan_dataset=False)

Bases: torch.utils.data.Dataset

Pytorch Dataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.

Initializes an object for providing simcats_datasets data to pytorch.

Parameters:

h5_path (str) – The path to the h5 file containing the dataset.
specific_ids (Union[range, List[int], numpy.ndarray, None]) – Restricts loading to the specified ids. With this option, the returned values follow the order of the specified ids, which is not necessarily ascending. If set to None, all data is loaded. Default is None.
load_ground_truth (Union[Callable, str, None]) –
Defines the type of ground truth data to be loaded. Accepts either a callable or a string. Callables must have the same structure/interface as load_zeros_masks defined in simcats_datasets.loading.load_ground_truth. Strings must map to the function names of the loading functions defined in simcats_datasets.loading.load_ground_truth. If this is None, no ground truth is loaded, which restricts which output formats are possible. Default is None.

Example of available types (full list at simcats_datasets.loading.load_ground_truth):
- 'tct_masks': The Total Charge Transition (TCT) mask generated by SimCATS.
- 'tc_region_masks': Regions with a fixed number of total charges.
- 'tc_region_minus_tct_masks': Regions with a fixed number of total charges, but with zeros between the regions (at TCTs).
data_preprocessors (Union[List[Union[str, Callable]], None]) –
Defines if data should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.

Example of available types (full list at simcats_datasets.support_functions.data_preprocessing):
- 'min_max_0_1': Min-max scaling of the data to [0, 1]
- 'standardization': Standardization of the data (mean=0, std=1)
- 'add_newaxis': Adds a new axis as the first axis (required for UNET)
ground_truth_preprocessors (Union[List[Union[str, Callable]], None]) –
Defines if ground truth should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.

Example of available types (full list at simcats_datasets.support_functions.data_preprocessing):
- 'only_two_classes': Reduces the number of classes in a mask to 2 (sets every pixel with a value > 1 to 1)
format_output (Union[Callable, str, None]) –
Defines the required data format of the output. Accepts either a callable or a string. Callables must have the same structure/interface as format_dict_csd_float_ground_truth_long defined in simcats_datasets.support_functions.pytorch_format_output. Strings must map to the function names of the format functions defined in simcats_datasets.support_functions.pytorch_format_output. If this is None, format_dict_csd_float_ground_truth_long is used, which returns the output as a dict with entries 'csd' and 'ground_truth' of dtype float and long, respectively. Default is None.

Example of available types (full list at simcats_datasets.support_functions.pytorch_format_output):
- 'format_dict_csd_float_ground_truth_long': Formats the output as a dict with entries 'csd' and 'ground_truth' of dtype float and long, respectively
preload (bool) – Enables preloading the whole dataset during the initialization (requires more RAM). Default is True.
max_concurrent_preloads (int) – Determines how many CSDs are loaded concurrently from the dataset during the preload phase. This option only affects instances with preload = True. It allows preloading large datasets that might not fit into memory at once by loading them step by step and, for example, converting the CSDs to float32 with a corresponding data preprocessor. Default is 100,000.
progress_bar (bool) – Determines whether to display a progress bar while loading data. Default is False.
sensor_scan_dataset (bool) – Determines whether the dataset is a sensor scan dataset (contains sensor scans instead of CSDs). Default is False.
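
The parameters above can be combined as in the following minimal sketch. The file name example_dataset.h5 and the chosen ids are placeholders, and the string options are taken from the example lists above. The helper names make_dataset_kwargs and build_dataset are hypothetical; the import is deferred into a function so that the argument dict can be inspected without the package installed.

```python
from typing import Any, Dict


def make_dataset_kwargs(h5_path: str) -> Dict[str, Any]:
    """Collect constructor arguments for SimcatsDataset (values are illustrative)."""
    return {
        "h5_path": h5_path,
        # Load only these ids; results follow this order, not ascending order.
        "specific_ids": [5, 2, 9],
        "load_ground_truth": "tct_masks",
        "data_preprocessors": ["min_max_0_1", "add_newaxis"],
        "ground_truth_preprocessors": ["only_two_classes"],
        # None selects format_dict_csd_float_ground_truth_long.
        "format_output": None,
        "preload": True,
        "progress_bar": False,
    }


def build_dataset(h5_path: str):
    """Create the dataset; requires simcats_datasets and torch to be installed."""
    from simcats_datasets.loading.pytorch import SimcatsDataset

    return SimcatsDataset(**make_dataset_kwargs(h5_path))


kwargs = make_dataset_kwargs("example_dataset.h5")  # placeholder path
```

With the default output format, indexing the resulting dataset should yield a dict with the entries 'csd' and 'ground_truth', as described for format_output.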

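To illustrate what two of the listed preprocessor options describe and how a chunked preload could work, the sketch below re-implements them in plain Python. These are illustrative stand-ins, not the functions from simcats_datasets.support_functions.data_preprocessing and not the library's preload logic. All helper names, the nested-list data layout, the per-image scaling, and the fake loader are assumptions.

```python
from typing import Callable, Iterator, List, Sequence

Image = List[List[float]]


def min_max_0_1_sketch(image: Image) -> Image:
    """Scale one image to [0, 1] (assumption: scaling is done per image)."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant images
    return [[(v - lo) / span for v in row] for row in image]


def only_two_classes_sketch(mask: List[List[int]]) -> List[List[int]]:
    """Set every pixel greater than 1 to 1, leaving the classes 0 and 1."""
    return [[1 if v > 1 else v for v in row] for row in mask]


def chunks(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of ids with at most `size` entries."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def preload_in_chunks(
    ids: Sequence[int],
    load_fn: Callable[[int], Image],
    preprocess_fn: Callable[[Image], Image],
    max_concurrent: int,
) -> List[Image]:
    """Load at most `max_concurrent` items at once and preprocess each chunk."""
    preloaded: List[Image] = []
    for chunk in chunks(ids, max_concurrent):
        raw = [load_fn(i) for i in chunk]  # only this chunk is held unprocessed
        preloaded.extend(preprocess_fn(img) for img in raw)
    return preloaded


def fake_load(i: int) -> Image:
    """Hypothetical loader standing in for reading one CSD from the h5 file."""
    return [[float(i), float(i + 2)], [float(i + 4), float(i + 1)]]


data = preload_in_chunks([0, 1, 2, 3, 4], fake_load, min_max_0_1_sketch, max_concurrent=2)
mask = only_two_classes_sketch([[0, 1], [2, 3]])
```

Processing each chunk before loading the next keeps the unprocessed data in memory small, which is the motivation given for max_concurrent_preloads.
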
property h5_path: str

Return type: str

property sensor_scan_dataset: bool

Return type: bool

property specific_ids: range | List[int] | numpy.ndarray | None

Return type: Union[range, List[int], numpy.ndarray, None]

property load_ground_truth: Callable

Return type: Callable

property data_preprocessors: List[Callable] | None

Return type: Union[List[Callable], None]

property ground_truth_preprocessors: List[Callable] | None

Return type: Union[List[Callable], None]

property format_output: Callable

Return type: Callable

property preload: bool

Return type: bool

property progress_bar: bool

Return type: bool

property shape: Tuple[int]

Return type: Tuple[int]

class simcats_datasets.loading.pytorch.SimcatsConcatDataset(h5_paths, specific_ids=None, load_ground_truth=None, data_preprocessors=None, ground_truth_preprocessors=None, format_output=None, preload=True, max_concurrent_preloads=100000, progress_bar=False, sensor_scan_dataset=False)

Bases: torch.utils.data.ConcatDataset

Pytorch ConcatDataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.

Initializes an object for providing concatenated simcats_datasets data to pytorch.
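
A concatenated dataset exposes several datasets through one global index. The sketch below shows the general index mapping used by torch.utils.data.ConcatDataset (cumulative sizes plus a binary search) in plain Python. The dataset lengths are made up, and the helper name locate is hypothetical; how SimcatsConcatDataset builds its sub-datasets from h5_paths is not shown here.

```python
import bisect
from itertools import accumulate
from typing import List, Tuple


def locate(global_index: int, cumulative_sizes: List[int]) -> Tuple[int, int]:
    """Map a global index to (dataset index, index within that dataset)."""
    if not 0 <= global_index < cumulative_sizes[-1]:
        raise IndexError("index out of range")
    # First dataset whose cumulative size exceeds the global index.
    dataset_idx = bisect.bisect_right(cumulative_sizes, global_index)
    offset = cumulative_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
    return dataset_idx, global_index - offset


lengths = [3, 2, 4]                      # made-up sizes of three h5 datasets
cumulative = list(accumulate(lengths))   # running totals of the lengths

first = locate(0, cumulative)            # first sample of the first dataset
boundary = locate(3, cumulative)         # first sample of the second dataset
last = locate(8, cumulative)             # last sample of the third dataset
```

Because the mapping depends only on the dataset lengths, samples from all files can be drawn through a single index range of size sum(lengths).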