simcats_datasets.generation

Module with functions for creating datasets.

Package Contents

Functions

create_dataset

Function for creating simcats_datasets v2 format datasets from given data.

create_simulated_dataset

Function for generating simulated datasets using SimCATS for simulations.

add_ct_by_dot_masks_to_dataset

Function for adding charge transitions labeled by dot masks to existing simulated datasets.

Package Implementation Details

simcats_datasets.generation.create_dataset(dataset_path, csds=None, sensor_scans=None, occupations=None, tct_masks=None, ct_by_dot_masks=None, line_coordinates=None, line_labels=None, metadata=None, max_len_line_coordinates_chunk=None, max_len_line_labels_chunk=None, max_len_metadata_chunk=None, dtype_csd=np.float32, dtype_sensor_scan=np.float32, dtype_occ=np.float32, dtype_tct=np.uint8, dtype_ct_by_dot=np.uint8, dtype_line_coordinates=np.float32)

Function for creating simcats_datasets v2 format datasets from given data.

Parameters:
  • dataset_path (str) – The path where the new (v2) HDF5 dataset will be stored.

  • csds (Optional[List[numpy.ndarray]]) – The list of CSDs to use for creating the dataset. A dataset can have either CSDs or sensor scans, but never both. Default is None.

  • sensor_scans (Optional[List[numpy.ndarray]]) – The list of sensor scans to use for creating the dataset. A dataset can have either CSDs or sensor scans, but never both. Default is None.

  • occupations (Optional[List[numpy.ndarray]]) – List of occupations to use for creating the dataset. Defaults to None.

  • tct_masks (Optional[List[numpy.ndarray]]) – List of TCT masks to use for creating the dataset. Defaults to None.

  • ct_by_dot_masks (Optional[List[numpy.ndarray]]) – List of CT by dot masks to use for creating the dataset. Defaults to None.

  • line_coordinates (Optional[List[numpy.ndarray]]) – List of line coordinates to use for creating the dataset. Defaults to None.

  • line_labels (Optional[List[dict]]) – List of line labels to use for creating the dataset. Defaults to None.

  • metadata (Optional[List[dict]]) – List of metadata to use for creating the dataset. Defaults to None.

  • max_len_line_coordinates_chunk (Optional[int]) – The expected maximum length of the line coordinates, in number of float values (each line requires 4 floats). If None, it is set to the largest dimension of the CSD (or sensor scan) shape. Default is None.

  • max_len_line_labels_chunk (Optional[int]) – The expected maximum length of the line labels, in number of uint8/char values (each line label, encoded as UTF-8 JSON, should require at most 80 chars). If None, it is set to 20 times the largest dimension of the CSD (or sensor scan) shape, which matches the number of lines allowed by the line coordinates chunk. Default is None.

  • max_len_metadata_chunk (Optional[int]) – The expected maximum length of the metadata, in number of uint8/char values. Each metadata dict, encoded as UTF-8 JSON, should require at most 8000 chars; around 4000 is typical, but the dot jumps metadata of high-resolution scans can be larger. If None, it is set to 8000. Default is None.

  • dtype_csd (numpy.dtype) – Specifies the dtype to be used for saving CSDs. Default is np.float32.

  • dtype_sensor_scan (numpy.dtype) – Specifies the dtype to be used for saving sensor scans. Default is np.float32.

  • dtype_occ (numpy.dtype) – Specifies the dtype to be used for saving occupations. Default is np.float32.

  • dtype_tct (numpy.dtype) – Specifies the dtype to be used for saving TCTs. Default is np.uint8.

  • dtype_ct_by_dot (numpy.dtype) – Specifies the dtype to be used for saving CT by dot masks. Default is np.uint8.

  • dtype_line_coordinates (numpy.dtype) – Specifies the dtype to be used for saving line coordinates. Default is np.float32.
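
The fallback values for the three max_len_* parameters follow simple arithmetic. The sketch below only restates what the parameter descriptions say (4 floats per line, at most 80 chars per line label, 8000 chars for metadata). Both helper functions are hypothetical and written for illustration; they are not part of simcats_datasets.

```python
def default_chunk_lengths(shape):
    """Mirror the documented fallbacks used when the max_len_* arguments are None.

    `shape` is the shape of one CSD (or sensor scan), e.g. (100, 100).
    Hypothetical helper for illustration; not part of simcats_datasets.
    """
    largest = max(shape)
    return {
        "max_len_line_coordinates_chunk": largest,  # floats, 4 per line
        "max_len_line_labels_chunk": largest * 20,  # chars, at most 80 per label
        "max_len_metadata_chunk": 8000,             # chars
    }


def line_capacity(chunk_lengths):
    """Number of lines that fit, judged by coordinates and by labels."""
    by_coordinates = chunk_lengths["max_len_line_coordinates_chunk"] // 4
    by_labels = chunk_lengths["max_len_line_labels_chunk"] // 80
    return by_coordinates, by_labels


lengths = default_chunk_lengths((100, 100))
print(lengths)
print(line_capacity(lengths))  # (25, 25)
```

For a 100 x 100 CSD this yields 100 floats and 2000 chars, so both chunks have room for the same number of lines.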

Return type:

None
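
A minimal sketch of a call, assuming numpy and simcats_datasets are installed. The helper names, the random placeholder data, and the assumption that CSDs can be stored without any masks, labels, or metadata (suggested by the None defaults) are illustrative and not confirmed by this reference.

```python
def measurement_kwargs(csds=None, sensor_scans=None):
    """Build the measurement part of a create_dataset call.

    Hypothetical helper: it enforces the documented rule that a dataset
    holds either CSDs or sensor scans, never both.
    """
    if csds is not None and sensor_scans is not None:
        raise ValueError("pass either csds or sensor_scans, not both")
    if csds is not None:
        return {"csds": list(csds)}
    if sensor_scans is not None:
        return {"sensor_scans": list(sensor_scans)}
    return {}


def write_random_csd_dataset(dataset_path, n_csds=3, shape=(100, 100)):
    """Store a few random placeholder CSDs in a new v2 dataset.

    The packages are imported here so that measurement_kwargs stays
    usable without them.
    """
    import numpy as np
    from simcats_datasets.generation import create_dataset

    rng = np.random.default_rng(seed=0)
    # Random values stand in for real or simulated measurements.
    csds = [rng.random(shape, dtype=np.float32) for _ in range(n_csds)]
    create_dataset(dataset_path=dataset_path, **measurement_kwargs(csds=csds))
```

Calling write_random_csd_dataset("example.h5") would then write the placeholder data to the given path; the file name is arbitrary.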

simcats_datasets.generation.create_simulated_dataset(dataset_path, simcats_config=default_configs['GaAs_v1'], n_runs=10000, resolution=np.array([100, 100]), volt_range=np.array([0.03, 0.03]), tags=None, num_workers=1, progress_bar=True, max_len_line_coordinates_chunk=100, max_len_line_labels_chunk=2000, max_len_metadata_chunk=8000, dtype_csd=np.float32, dtype_occ=np.float32, dtype_tct=np.uint8, dtype_line_coordinates=np.float32)

Function for generating simulated datasets using SimCATS for simulations.

Warning: This function expects that the simulation config uses IdealCSDGeometric from SimCATS. Other implementations are not guaranteed to work.

Parameters:
  • dataset_path (str) – The path where the dataset will be stored. This can also be the path of an existing dataset, to which the new data is added.

  • simcats_config (dict) – Configuration for the SimCATS simulation class. Default is the GaAs_v1 config provided by SimCATS.

  • n_runs (int) – Number of CSDs to be generated. Default is 10000.

  • resolution (numpy.ndarray) –

    Pixel resolution for both axes of the CSDs: first the number of columns (x), then the number of rows (y). Default is np.array([100, 100]).

    Example:

    [res_g1, res_g2]

  • volt_range (numpy.ndarray) – Voltage range for both axes of the CSDs. Individual CSDs of the specified size are randomly sampled in the voltage space. Default is np.array([0.03, 0.03]) (scans from the RWTH GaAs offler sample are usually 30 mV x 30 mV).

  • tags (Optional[dict]) –

    Additional tags for the data to be simulated, which will be added to the dataset DataFrame. Default is None.

    Example:

    {"tags": "shifted sensor, no noise", "sample": "GaAs"}

  • num_workers (int) – Number of workers to parallelize dataset creation. Minimum is 1. Default is 1.

  • progress_bar (bool) – Determines whether to display a progress bar. Default is True.

  • max_len_line_coordinates_chunk (int) – Maximum number of line coordinates. This is the size of the flattened array; with 4 floats per line, 100 corresponds to 25 lines. Default is 100.

  • max_len_line_labels_chunk (int) – Maximum number of chars for the line label dict. Default is 2000.

  • max_len_metadata_chunk (int) – Maximum number of chars for the metadata dict. Default is 8000.

  • dtype_csd (numpy.dtype) – Specifies the dtype to be used for saving CSDs. Default is np.float32.

  • dtype_occ (numpy.dtype) – Specifies the dtype to be used for saving occupations. Default is np.float32.

  • dtype_tct (numpy.dtype) – Specifies the dtype to be used for saving TCTs. Default is np.uint8.

  • dtype_line_coordinates (numpy.dtype) – Specifies the dtype to be used for saving line coordinates. Default is np.float32.

Return type:

None
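
A minimal sketch of a small simulation run, assuming numpy, simcats and simcats_datasets are installed. The helper names, the reduced n_runs and the tag value are illustrative choices. That volt_range uses the same axis order as resolution is also an assumption.

```python
def pixel_size_mv(volt_range, resolution):
    """Voltage covered by one pixel per axis, in millivolts.

    Illustrative helper; assumes volt_range (in volts) and resolution
    use the same axis order.
    """
    return tuple(1000.0 * v / r for v, r in zip(volt_range, resolution))


def simulate_small_dataset(dataset_path, n_runs=100):
    """Simulate a small dataset with the documented defaults spelled out.

    simcats_config is omitted, so the documented GaAs_v1 default applies.
    The packages are imported here so that pixel_size_mv stays usable
    without them.
    """
    import numpy as np
    from simcats_datasets.generation import create_simulated_dataset

    create_simulated_dataset(
        dataset_path=dataset_path,
        n_runs=n_runs,
        resolution=np.array([100, 100]),    # [res_g1, res_g2]: columns (x), rows (y)
        volt_range=np.array([0.03, 0.03]),  # 30 mV x 30 mV window per CSD
        tags={"sample": "GaAs"},
        num_workers=1,
        progress_bar=True,
    )


# With the defaults, each pixel spans roughly 0.3 mV along both axes.
print(pixel_size_mv((0.03, 0.03), (100, 100)))
```

Because dataset_path may point to an existing dataset, calling simulate_small_dataset repeatedly with the same path would add further CSDs to it.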

simcats_datasets.generation.add_ct_by_dot_masks_to_dataset(dataset_path, num_workers=10, progress_bar=True, dtype_ct_by_dot=np.uint8, batch_size_per_worker=100)

Function for adding charge transitions labeled by dot masks to existing simulated datasets.

Parameters:
  • dataset_path (str) – The path where the dataset is stored.

  • num_workers (int) – Number of workers used to parallelize the calculation of the masks. Minimum is 1. Default is 10.

  • progress_bar (bool) – Determines whether to display a progress bar. Default is True.

  • dtype_ct_by_dot (numpy.dtype) – Specifies the dtype to be used for saving CT_by_dot masks. Default is np.uint8.

  • batch_size_per_worker (int) – Determines how many CT_by_dot masks each worker calculates consecutively before they are saved. Default is 100.

Return type:

None
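
A minimal sketch of the two-step workflow: simulate a dataset first, then add the CT by dot masks to it. It assumes simcats and simcats_datasets are installed. The helper names are hypothetical, and the batch count is a rough estimate that assumes masks are saved once per full or partial batch; how work is actually distributed across workers is not stated in this reference.

```python
import math


def n_save_batches(n_entries, batch_size_per_worker=100):
    """Estimate how many batches are needed for n_entries masks.

    Illustrative only; assumes one save per (possibly partial) batch.
    """
    return math.ceil(n_entries / batch_size_per_worker)


def simulate_then_add_masks(dataset_path, n_runs=100, num_workers=1):
    """Create a simulated dataset, then add CT by dot masks to it.

    The package is imported here so that n_save_batches stays usable
    without it.
    """
    from simcats_datasets.generation import (
        add_ct_by_dot_masks_to_dataset,
        create_simulated_dataset,
    )

    create_simulated_dataset(dataset_path=dataset_path, n_runs=n_runs)
    add_ct_by_dot_masks_to_dataset(
        dataset_path=dataset_path,
        num_workers=num_workers,
        batch_size_per_worker=100,
    )


print(n_save_batches(10000))  # 100
```

A smaller batch_size_per_worker means results are saved more often, while a larger value keeps more masks in memory per worker before saving.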