simcats_datasets.loading.pytorch
Implementation of a PyTorch dataset class that can be used to train machine learning approaches on charge stability diagram (CSD) data.
@author: f.hader
Module Contents
Classes
SimcatsDataset: PyTorch Dataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.
SimcatsConcatDataset: Initializes an object for providing concatenated simcats_datasets data to PyTorch.
Module Implementation Details
- class simcats_datasets.loading.pytorch.SimcatsDataset(h5_path, specific_ids=None, load_ground_truth=None, data_preprocessors=None, ground_truth_preprocessors=None, format_output=None, preload=True, max_concurrent_preloads=100000, progress_bar=False, sensor_scan_dataset=False)
Bases:
torch.utils.data.Dataset
PyTorch Dataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.
Initializes an object for providing simcats_datasets data to PyTorch.
- Parameters:
h5_path (str) – The path to the h5 file containing the dataset.
specific_ids (Union[range, List[int], numpy.ndarray, None]) – Determines if only specific ids should be loaded. Using this option, the returned values are sorted according to the specified ids and not necessarily ascending. If set to None, all data is loaded. Default is None.
load_ground_truth (Union[Callable, str, None]) –
Defines the type of ground truth data to load. Accepts either a callable or a string. Callables must have the same structure/interface as load_zeros_masks defined in simcats_datasets.loading.load_ground_truth. Strings must match the names of the loading functions defined in simcats_datasets.loading.load_ground_truth. If this is None, no ground truth is loaded, which restricts which output formats are possible. Default is None.
Examples of available types (full list at simcats_datasets.loading.load_ground_truth):
'tct_masks': The Total Charge Transition (TCT) mask generated by SimCATS.
'tc_region_masks': Regions with a fixed number of total charges.
'tc_region_minus_tct_masks': Regions with a fixed number of total charges, but with zeros between the regions (at the TCTs).
data_preprocessors (Union[List[Union[str, Callable]], None]) –
Defines if data should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.
Example of available types (full list at simcats_datasets.support_functions.data_preprocessing):
'min_max_0_1': Min-max scaling of the data to [0, 1]
'standardization': Standardization of the data (mean=0, std=1)
'add_newaxis': Adds a new axis as the first axis (required for U-Net)
ground_truth_preprocessors (Union[List[Union[str, Callable]], None]) –
Defines if ground truth should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.
Example of available types (full list at simcats_datasets.support_functions.data_preprocessing):
'only_two_classes': Reduces the number of classes in a mask to 2 (every pixel with a value greater than 1 is set to 1)
format_output (Union[Callable, str, None]) –
Defines the data format of the output. Accepts either a callable or a string. Callables must have the same structure/interface as format_dict_csd_float_ground_truth_long defined in simcats_datasets.support_functions.pytorch_format_output. Strings must match the names of the format functions defined in simcats_datasets.support_functions.pytorch_format_output. If this is None, format_dict_csd_float_ground_truth_long is used, which returns the output as a dict with entries 'csd' and 'ground_truth' of dtype float and long, respectively. Default is None.
Examples of available types (full list at simcats_datasets.support_functions.pytorch_format_output):
'format_dict_csd_float_ground_truth_long': Formats the output as a dict with entries 'csd' and 'ground_truth' of dtype float and long, respectively
preload (bool) – Enables preloading the whole dataset during the initialization (requires more RAM). Default is True.
max_concurrent_preloads (int) – Determines how many CSDs are loaded concurrently from the dataset during the preload phase. This option only affects instances with preload=True. It makes it possible to preload large datasets (for which loading the whole dataset into memory at once might not be possible) by loading them step by step and, for example, converting the CSDs to float32 with a corresponding data preprocessor. Default is 100,000.
progress_bar (bool) – Determines whether to display a progress bar while loading data. Default is False.
sensor_scan_dataset (bool) – Determines whether the dataset is a sensor scan dataset (contains sensor scans instead of CSDs). Default is False.
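To make the preprocessor options above more concrete, the following is a minimal NumPy sketch of what 'min_max_0_1', 'add_newaxis' and 'only_two_classes' are described as doing. These are illustrative re-implementations written for this page, not the library's code. The single-array signature and the helper apply_chain are assumptions; the actual interface is the one defined by example_preprocessor in simcats_datasets.support_functions.data_preprocessing.

```python
import numpy as np


def min_max_0_1(csd):
    """Scale one 2D array linearly so that its values span [0, 1]."""
    csd = np.asarray(csd, dtype=np.float64)
    value_range = csd.max() - csd.min()
    if value_range == 0:
        # A constant image has no contrast; avoid dividing by zero.
        return np.zeros_like(csd)
    return (csd - csd.min()) / value_range


def add_newaxis(csd):
    """Prepend a channel axis: (H, W) -> (1, H, W)."""
    return np.asarray(csd)[np.newaxis, ...]


def only_two_classes(mask):
    """Set every label greater than 1 to 1, leaving 0 and 1 untouched."""
    mask = np.array(mask, copy=True)  # do not modify the caller's array
    mask[mask > 1] = 1
    return mask


def apply_chain(array, preprocessors):
    """Hypothetical helper: apply the callables one after another, in list order."""
    for preprocessor in preprocessors:
        array = preprocessor(array)
    return array


# Toy stand-ins for one CSD and one ground truth mask.
csd = np.array([[2.0, 4.0], [6.0, 10.0]])
mask = np.array([[0, 1], [2, 3]])

scaled = apply_chain(csd, [min_max_0_1, add_newaxis])
binary = apply_chain(mask, [only_two_classes])
```

Under these assumptions, scaled has shape (1, 2, 2) with values between 0 and 1, and binary contains only the labels 0 and 1. A custom callable with the library's real interface could be passed in the data_preprocessors or ground_truth_preprocessors list in place of a string.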
- property h5_path: str
- Return type:
str
- property sensor_scan_dataset: bool
- Return type:
bool
- property specific_ids: range | List[int] | numpy.ndarray | None
- Return type:
Union[range, List[int], numpy.ndarray, None]
- property load_ground_truth: Callable
- Return type:
Callable
- property data_preprocessors: List[Callable] | None
- Return type:
Union[List[Callable], None]
- property ground_truth_preprocessors: List[Callable] | None
- Return type:
Union[List[Callable], None]
- property format_output: Callable
- Return type:
Callable
- property preload: bool
- Return type:
bool
- property progress_bar: bool
- Return type:
bool
- property shape: Tuple[int]
- Return type:
Tuple[int]
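As a rough usage sketch, the constructor documented above could be combined with a standard torch.utils.data.DataLoader as follows. The function name build_loader, the batch size and the choice of options are assumptions made for illustration; only the parameter names and string options are taken from this page.

```python
def build_loader(h5_path, batch_size=8):
    """Create a shuffling DataLoader over one SimCATS h5 dataset (illustrative)."""
    # Imported here so the sketch can be read without the packages installed.
    from torch.utils.data import DataLoader
    from simcats_datasets.loading.pytorch import SimcatsDataset

    dataset = SimcatsDataset(
        h5_path=h5_path,
        load_ground_truth="tct_masks",
        data_preprocessors=["min_max_0_1", "add_newaxis"],
        ground_truth_preprocessors=["only_two_classes"],
        # Same as the documented default for format_output=None.
        format_output="format_dict_csd_float_ground_truth_long",
        preload=False,  # read from disk on access instead of preloading into RAM
    )
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

Called with the path to an existing dataset file (for example a hypothetical "path/to/dataset.h5"), the returned loader should yield batches as dicts with the entries 'csd' and 'ground_truth', assuming the default output format described above.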
- class simcats_datasets.loading.pytorch.SimcatsConcatDataset(h5_paths, specific_ids=None, load_ground_truth=None, data_preprocessors=None, ground_truth_preprocessors=None, format_output=None, preload=True, max_concurrent_preloads=100000, progress_bar=False, sensor_scan_dataset=False)
Bases:
torch.utils.data.ConcatDataset
Initializes an object for providing concatenated simcats_datasets data to PyTorch.
- Parameters:
h5_paths (List[str]) – The paths to the h5 files containing the datasets to be concatenated.
specific_ids (Union[List[Union[range, int, numpy.ndarray, None]], None]) – Determines if only specific ids should be loaded. Using this option, the returned values are sorted according to the specified ids and not necessarily ascending. If set to None, all data is loaded. Expects a list of specific_id settings, with one entry for each provided h5_path. Default is None.
load_ground_truth (Union[Callable, str, None]) –
Defines the type of ground truth data to load. Accepts either a callable or a string. Callables must have the same structure/interface as load_zeros_masks defined in simcats_datasets.loading.load_ground_truth. Strings must match the names of the loading functions defined in simcats_datasets.loading.load_ground_truth. If this is None, no ground truth is loaded, which restricts which output formats are possible. Default is None.
Examples of available types (full list at simcats_datasets.loading.load_ground_truth):
'tct_masks': The Total Charge Transition (TCT) mask generated by SimCATS.
'tc_region_masks': Regions with a fixed number of total charges.
'tc_region_minus_tct_masks': Regions with a fixed number of total charges, but with zeros between the regions (at the TCTs).
data_preprocessors (Union[List[Union[str, Callable]], None]) –
Defines if data should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.
Example of available types (full list at simcats_datasets.support_functions.data_preprocessing):
'min_max_0_1': Min-max scaling of the data to [0, 1]
'standardization': Standardization of the data (mean=0, std=1)
'add_newaxis': Adds a new axis as the first axis (required for U-Net)
ground_truth_preprocessors (Union[List[Union[str, Callable]], None]) –
Defines if ground truth should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.
Example of available types (full list at simcats_datasets.support_functions.data_preprocessing):
'only_two_classes': Reduces the number of classes in a mask to 2 (every pixel with a value greater than 1 is set to 1)
format_output (Union[Callable, str, None]) –
Defines the data format of the output. Accepts either a callable or a string. Callables must have the same structure/interface as format_dict_csd_float_ground_truth_long defined in simcats_datasets.support_functions.pytorch_format_output. Strings must match the names of the format functions defined in simcats_datasets.support_functions.pytorch_format_output. If this is None, format_dict_csd_float_ground_truth_long is used, which returns the output as a dict with entries 'csd' and 'ground_truth' of dtype float and long, respectively. Default is None.
Examples of available types (full list at simcats_datasets.support_functions.pytorch_format_output):
'format_dict_csd_float_ground_truth_long': Formats the output as a dict with entries 'csd' and 'ground_truth' of dtype float and long, respectively
preload (bool) – Enables preloading the whole dataset during the initialization (requires more RAM). Default is True.
max_concurrent_preloads (int) – Determines how many CSDs are loaded concurrently from the dataset during the preload phase. This option only affects instances with preload=True. It makes it possible to preload large datasets (for which loading the whole dataset into memory at once might not be possible) by loading them step by step and, for example, converting the CSDs to float32 with a corresponding data preprocessor. Default is 100,000.
progress_bar (bool) – Determines whether to display a progress bar while loading data. Default is False.
sensor_scan_dataset (bool) – Determines whether the datasets are sensor scan datasets (contain sensor scans instead of CSDs). Default is False.
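The step-by-step preloading that max_concurrent_preloads describes can be pictured with the sketch below. It is not the library's implementation: preload_in_chunks, load_range and to_float32 are hypothetical names, and a list of NumPy arrays stands in for the h5 file. The point is only that raw data for at most one chunk is held in memory while the smaller, converted copies accumulate.

```python
import numpy as np


def preload_in_chunks(load_range, num_items, max_concurrent_preloads, preprocessor):
    """Load num_items arrays chunk by chunk, keeping only the preprocessed copies."""
    preloaded = []
    for start in range(0, num_items, max_concurrent_preloads):
        stop = min(start + max_concurrent_preloads, num_items)
        # At most max_concurrent_preloads raw arrays are loaded here.
        chunk = load_range(start, stop)
        preloaded.extend(preprocessor(item) for item in chunk)
        # The raw chunk is replaced in the next iteration and can be freed.
    return preloaded


# Hypothetical stand-in for the CSDs stored in an h5 file: ten float64 arrays.
raw_store = [np.full((4, 4), float(i)) for i in range(10)]
chunk_sizes = []  # records how many items each load call returned


def load_range(start, stop):
    """Return copies of the raw arrays with indices start..stop-1."""
    chunk_sizes.append(stop - start)
    return [raw_store[i].copy() for i in range(start, stop)]


def to_float32(csd):
    """Example data preprocessor that halves the memory per value."""
    return csd.astype(np.float32)


data = preload_in_chunks(load_range, len(raw_store), 4, to_float32)
```

With ten items and a chunk size of four, the loader is called three times (for 4, 4 and 2 items), and every array in data is stored as float32.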
- property shape: Tuple[int]
- Return type:
Tuple[int]
- property h5_paths: List[str]
- Return type:
List[str]
- property sensor_scan_dataset: bool
- Return type:
bool
- property specific_ids: List[range | List[int] | numpy.ndarray | None] | None
- Return type:
Union[List[Union[range, List[int], numpy.ndarray, None]], None]
- property load_ground_truth: Callable
- Return type:
Callable
- property data_preprocessors: List[Callable] | None
- Return type:
Union[List[Callable], None]
- property ground_truth_preprocessors: List[Callable] | None
- Return type:
Union[List[Callable], None]
- property format_output: Callable
- Return type:
Callable
- property preload: bool
- Return type:
bool
- property progress_bar: bool
- Return type:
bool
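Since SimcatsConcatDataset derives from torch.utils.data.ConcatDataset, a global sample index is resolved to one of the underlying datasets plus a local index within it. The sketch below mirrors that lookup with cumulative sizes and bisect_right in plain Python. The function locate and the example lengths are hypothetical, negative indices are not handled, and whether SimcatsConcatDataset customizes indexing is not stated on this page.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, dataset_lengths):
    """Map a global index to (dataset index, index within that dataset)."""
    cumulative = list(accumulate(dataset_lengths))  # e.g. [3, 5, 2] -> [3, 8, 10]
    if not 0 <= global_index < cumulative[-1]:
        raise IndexError("index out of range for the concatenated dataset")
    # First dataset whose cumulative size is larger than the global index.
    dataset_index = bisect_right(cumulative, global_index)
    offset = cumulative[dataset_index - 1] if dataset_index > 0 else 0
    return dataset_index, global_index - offset


# Hypothetical sizes of three datasets, one per entry in h5_paths.
lengths = [3, 5, 2]
positions = [locate(i, lengths) for i in (0, 2, 3, 7, 8, 9)]
```

Here the global indices 0 and 2 fall into the first dataset, 3 and 7 into the second, and 8 and 9 into the third, so samples from several h5 files can be addressed through one continuous index range.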