simcats_datasets.loading.pytorch

Implementation of a PyTorch dataset class that can be used to train machine learning approaches with CSD (charge stability diagram) data.

@author: f.hader

Module Contents

Classes

SimcatsDataset

Pytorch Dataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.

SimcatsConcatDataset

Initializes an object for providing concatenated simcats_datasets data to pytorch.

Module Implementation Details

class simcats_datasets.loading.pytorch.SimcatsDataset(h5_path, specific_ids=None, load_ground_truth=None, data_preprocessors=None, ground_truth_preprocessors=None, format_output=None, preload=True, max_concurrent_preloads=100000, progress_bar=False, sensor_scan_dataset=False)

Bases: torch.utils.data.Dataset

Pytorch Dataset class implementation for SimCATS datasets. Uses simcats_datasets to load and provide (training) data.

Initializes an object for providing simcats_datasets data to pytorch.

Parameters:
  • h5_path (str) – The path to the h5 file containing the dataset.

  • specific_ids (Union[range, List[int], numpy.ndarray, None]) – Determines if only specific ids should be loaded. Using this option, the returned values are sorted according to the specified ids and not necessarily ascending. If set to None, all data is loaded. Default is None.

  • load_ground_truth (Union[Callable, str, None]) –

    Defines the required type of ground truth data to be loaded. Accepts either a callable or a string. Callables must be of the same structure/interface as load_zeros_masks defined in simcats_datasets.loading.load_ground_truth. Strings must map to the function names of the loading functions defined in simcats_datasets.loading.load_ground_truth. If this is None, no ground truth is loaded, which restricts which output formats are possible. Default is None.

    Examples of available types (full list at simcats_datasets.loading.load_ground_truth):

    • ‘tct_masks’: The Total Charge Transition (TCT) mask generated by SimCATS.

    • ‘tc_region_masks’: Regions with a fixed number of total charges.

    • ‘tc_region_minus_tct_masks’: Regions with a fixed number of total charges, but with zeros between the regions (at TCTs).

  • data_preprocessors (Union[List[Union[str, Callable]], None]) –

    Defines if data should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.

    Examples of available types (full list at simcats_datasets.support_functions.data_preprocessing):

    • ‘min_max_0_1’: Min-max scaling of the data to [0, 1].

    • ‘standardization’: Standardization of the data (mean = 0, std = 1).

    • ‘add_newaxis’: Adds a new axis as the first axis (required for U-Net).

  • ground_truth_preprocessors (Union[List[Union[str, Callable]], None]) –

    Defines if ground truth should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.

    Examples of available types (full list at simcats_datasets.support_functions.data_preprocessing):

    • ‘only_two_classes’: Reduces the number of classes in a mask to 2 (every pixel with a value > 1 is set to 1).

  • format_output (Union[Callable, str, None]) –

    Defines the required data format for the output. Accepts either a callable or a string. Callables must be of the same structure/interface as format_dict_csd_float_ground_truth_long defined in simcats_datasets.support_functions.pytorch_format_output. Strings must map to the function names of the format functions defined in simcats_datasets.support_functions.pytorch_format_output. If this is None, format_dict_csd_float_ground_truth_long is used, which returns the output as a dict with the entries ‘csd’ and ‘ground_truth’ of dtype float and long, respectively. Default is None.

    Examples of available types (full list at simcats_datasets.support_functions.pytorch_format_output):

    • ‘format_dict_csd_float_ground_truth_long’: Formats the output as a dict with the entries ‘csd’ and ‘ground_truth’ of dtype float and long, respectively.

  • preload (bool) – Enables preloading the whole dataset during the initialization (requires more RAM). Default is True.

  • max_concurrent_preloads (int) – Determines how many CSDs are loaded concurrently from the dataset during the preload phase. This option only affects instances with preload = True. It allows preloading large datasets that might not fit into memory at once by loading them step by step and, for example, converting the CSDs to float32 with a corresponding data preprocessor. Default is 100,000.

  • progress_bar (bool) – Determines whether to display a progress bar while loading data. Default is False.

  • sensor_scan_dataset (bool) – Determines whether the dataset is a sensor scan dataset (contains sensor scans instead of CSDs). Default is False.
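A minimal usage sketch based on the parameters above. The file path example_dataset.h5, the id range, the batch size, and the helper name build_dataset_and_loader are hypothetical; the string options are the ones listed in this section. The imports sit inside the helper so that the settings can be inspected without the packages or the h5 file at hand.

```python
# Illustrative settings for a segmentation-style setup.
dataset_kwargs = {
    "h5_path": "example_dataset.h5",  # hypothetical path
    "specific_ids": range(1000),  # hypothetical: only the first 1000 entries
    "load_ground_truth": "tct_masks",
    "data_preprocessors": ["min_max_0_1", "add_newaxis"],
    "ground_truth_preprocessors": ["only_two_classes"],
    "format_output": "format_dict_csd_float_ground_truth_long",
    "preload": True,
    "progress_bar": True,
}


def build_dataset_and_loader(kwargs, batch_size=16):
    """Create the dataset and wrap it in a standard PyTorch DataLoader."""
    # Imported lazily: requires torch and simcats_datasets to be installed.
    from torch.utils.data import DataLoader
    from simcats_datasets.loading.pytorch import SimcatsDataset

    dataset = SimcatsDataset(**kwargs)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return dataset, loader
```

With the output format selected here, each item is described as a dict with the entries csd and ground_truth, so a batch drawn from the loader would be expected to be a dict with the same two entries.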

property h5_path: str
Return type:

str

property sensor_scan_dataset: bool
Return type:

bool

property specific_ids: range | List[int] | numpy.ndarray | None
Return type:

Union[range, List[int], numpy.ndarray, None]

property load_ground_truth: Callable
Return type:

Callable

property data_preprocessors: List[Callable] | None
Return type:

Union[List[Callable], None]

property ground_truth_preprocessors: List[Callable] | None
Return type:

Union[List[Callable], None]

property format_output: Callable
Return type:

Callable

property preload: bool
Return type:

bool

property progress_bar: bool
Return type:

bool

property shape: Tuple[int]
Return type:

Tuple[int]
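The preprocessor parameters of this class accept callables as well as strings. The exact interface of example_preprocessor is not reproduced in this section, so the sketch below assumes that a preprocessor receives a list of NumPy arrays and returns the processed list. The function names min_max_scale and reduce_to_two_classes are hypothetical and are not the package's own implementations; they only illustrate the operations described for ‘min_max_0_1’ and ‘only_two_classes’.

```python
from typing import List

import numpy as np


def min_max_scale(data: List[np.ndarray]) -> List[np.ndarray]:
    """Scale every array to the range [0, 1] independently."""
    scaled = []
    for arr in data:
        arr = arr.astype(np.float32)
        lo, hi = arr.min(), arr.max()
        span = hi - lo
        # Guard against constant arrays to avoid a division by zero.
        scaled.append((arr - lo) / span if span > 0 else np.zeros_like(arr))
    return scaled


def reduce_to_two_classes(masks: List[np.ndarray]) -> List[np.ndarray]:
    """Map every label greater than 1 to 1, leaving 0 and 1 untouched."""
    return [np.minimum(mask, 1) for mask in masks]
```

If the assumed interface holds, such callables could be passed as data_preprocessors=[min_max_scale] or ground_truth_preprocessors=[reduce_to_two_classes], alone or mixed with the string names.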

class simcats_datasets.loading.pytorch.SimcatsConcatDataset(h5_paths, specific_ids=None, load_ground_truth=None, data_preprocessors=None, ground_truth_preprocessors=None, format_output=None, preload=True, max_concurrent_preloads=100000, progress_bar=False, sensor_scan_dataset=False)

Bases: torch.utils.data.ConcatDataset

Initializes an object for providing concatenated simcats_datasets data to pytorch.
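Because the class derives from torch.utils.data.ConcatDataset, indexing can be pictured as walking through the files one after another. The sketch below mirrors the cumulative-size lookup that ConcatDataset performs, written in plain Python. The function name locate and the example lengths are hypothetical and not part of simcats_datasets; negative indices are left out for brevity.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def locate(index: int, lengths: List[int]) -> Tuple[int, int]:
    """Map a global index to (dataset number, index within that dataset)."""
    # Running totals of the dataset lengths, e.g. [3, 5, 9] for [3, 2, 4].
    cumulative = list(accumulate(lengths))
    if not cumulative or not 0 <= index < cumulative[-1]:
        raise IndexError("index out of range")
    # First dataset whose running total exceeds the global index.
    dataset_idx = bisect_right(cumulative, index)
    offset = cumulative[dataset_idx - 1] if dataset_idx > 0 else 0
    return dataset_idx, index - offset
```

For three datasets with 3, 2, and 4 entries, global index 3 falls on the first entry of the second dataset, that is locate(3, [3, 2, 4]) gives (1, 0).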

Parameters:
  • h5_paths (List[str]) – The paths to the h5 files containing the datasets to be concatenated.

  • specific_ids (Union[List[Union[range, int, numpy.ndarray, None]], None]) – Determines if only specific ids should be loaded. Using this option, the returned values are sorted according to the specified ids and not necessarily ascending. If set to None, all data is loaded. Expects a list of specific_id settings, with one entry for each provided h5_path. Default is None.

  • load_ground_truth (Union[Callable, str, None]) –

    Defines the required type of ground truth data to be loaded. Accepts either a callable or a string. Callables must be of the same structure/interface as load_zeros_masks defined in simcats_datasets.loading.load_ground_truth. Strings must map to the function names of the loading functions defined in simcats_datasets.loading.load_ground_truth. If this is None, no ground truth is loaded, which restricts which output formats are possible. Default is None.

    Examples of available types (full list at simcats_datasets.loading.load_ground_truth):

    • ‘tct_masks’: The Total Charge Transition (TCT) mask generated by SimCATS.

    • ‘tc_region_masks’: Regions with a fixed number of total charges.

    • ‘tc_region_minus_tct_masks’: Regions with a fixed number of total charges, but with zeros between the regions (at TCTs).

  • data_preprocessors (Union[List[Union[str, Callable]], None]) –

    Defines if data should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.

    Examples of available types (full list at simcats_datasets.support_functions.data_preprocessing):

    • ‘min_max_0_1’: Min-max scaling of the data to [0, 1].

    • ‘standardization’: Standardization of the data (mean = 0, std = 1).

    • ‘add_newaxis’: Adds a new axis as the first axis (required for U-Net).

  • ground_truth_preprocessors (Union[List[Union[str, Callable]], None]) –

    Defines if ground truth should be preprocessed. Accepts a list of callables or strings. Callables must be of the same structure/interface as example_preprocessor defined in simcats_datasets.support_functions.data_preprocessing. Strings must map to the function names of the preprocessors defined in simcats_datasets.support_functions.data_preprocessing. Default is None.

    Examples of available types (full list at simcats_datasets.support_functions.data_preprocessing):

    • ‘only_two_classes’: Reduces the number of classes in a mask to 2 (every pixel with a value > 1 is set to 1).

  • format_output (Union[Callable, str, None]) –

    Defines the required data format for the output. Accepts either a callable or a string. Callables must be of the same structure/interface as format_dict_csd_float_ground_truth_long defined in simcats_datasets.support_functions.pytorch_format_output. Strings must map to the function names of the format functions defined in simcats_datasets.support_functions.pytorch_format_output. If this is None, format_dict_csd_float_ground_truth_long is used, which returns the output as a dict with the entries ‘csd’ and ‘ground_truth’ of dtype float and long, respectively. Default is None.

    Examples of available types (full list at simcats_datasets.support_functions.pytorch_format_output):

    • ‘format_dict_csd_float_ground_truth_long’: Formats the output as a dict with the entries ‘csd’ and ‘ground_truth’ of dtype float and long, respectively.

  • preload (bool) – Enables preloading the whole dataset during the initialization (requires more RAM). Default is True.

  • max_concurrent_preloads (int) – Determines how many CSDs are loaded concurrently from the dataset during the preload phase. This option only affects instances with preload = True. It allows preloading large datasets that might not fit into memory at once by loading them step by step and, for example, converting the CSDs to float32 with a corresponding data preprocessor. Default is 100,000.

  • progress_bar (bool) – Determines whether to display a progress bar while loading data. Default is False.

  • sensor_scan_dataset (bool) – Determines whether the datasets are sensor scan datasets (contain sensor scans instead of CSDs). Default is False.
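A short sketch of how the per-file specific_ids setting described above might be written down. The file names, the id range, and the helper name build_concat_dataset are hypothetical. The import sits inside the helper so that the settings can be checked without the package or the files being available.

```python
concat_kwargs = {
    "h5_paths": ["dataset_a.h5", "dataset_b.h5"],  # hypothetical paths
    # One specific_ids entry per file: a range for the first file,
    # None (load everything) for the second.
    "specific_ids": [range(0, 500), None],
    "load_ground_truth": "tc_region_masks",
    "data_preprocessors": ["standardization"],
    "preload": False,
}

# The list of id settings has to line up with the list of paths.
assert len(concat_kwargs["specific_ids"]) == len(concat_kwargs["h5_paths"])


def build_concat_dataset(kwargs):
    """Create the concatenated dataset from the settings above."""
    # Imported lazily: requires simcats_datasets (and torch) to be installed.
    from simcats_datasets.loading.pytorch import SimcatsConcatDataset

    return SimcatsConcatDataset(**kwargs)
```

Passing specific_ids=None instead of a list would, according to the parameter description, load all data from every file.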

property shape: Tuple[int]
Return type:

Tuple[int]

property h5_paths: List[str]
Return type:

List[str]

property sensor_scan_dataset: bool
Return type:

bool

property specific_ids: List[range | List[int] | numpy.ndarray | None] | None
Return type:

Union[List[Union[range, List[int], numpy.ndarray, None]], None]

property load_ground_truth: Callable
Return type:

Callable

property data_preprocessors: List[Callable] | None
Return type:

Union[List[Callable], None]

property ground_truth_preprocessors: List[Callable] | None
Return type:

Union[List[Callable], None]

property format_output: Callable
Return type:

Callable

property preload: bool
Return type:

bool

property progress_bar: bool
Return type:

bool
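The step-by-step preloading controlled by max_concurrent_preloads can be pictured with the plain-Python sketch below. How the package implements this internally is not shown in this section; the names iter_chunks, preload_in_chunks, load_fn, and preprocess are hypothetical stand-ins for reading entries from the h5 file and applying the data preprocessors.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(ids: Sequence[int], chunk_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of ids with at most chunk_size entries."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def preload_in_chunks(
    ids: Sequence[int],
    load_fn: Callable[[Sequence[int]], List],
    preprocess: Callable[[List], List],
    max_concurrent_preloads: int,
) -> List:
    """Load and preprocess chunk by chunk, keeping only processed results."""
    preloaded = []
    for chunk in iter_chunks(ids, max_concurrent_preloads):
        # At most max_concurrent_preloads raw items are held at this point.
        raw = load_fn(chunk)
        # The processed items (e.g. converted to float32) are what is kept.
        preloaded.extend(preprocess(raw))
    return preloaded
```

The design idea is that only one chunk of raw data exists at a time, so a preprocessor that shrinks each item (such as a float32 conversion) lowers the peak memory needed during the preload phase.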