Preparing datasets

This section documents the functions and classes available for building a dataset of images and preparing it for training. Functions that label unlabelled data for use in semi-supervised learning are also presented.

Examples of how to use these functions can be found in Build a dataset from scratch and in Improve a classification model using unlabelled images.

Data augmentation

class decavision.dataset_preparation.data_augmentation.DataAugmentor(path, distortion=False, flip_horizontal=False, flip_vertical=False, random_crop=False, random_erasing=False, rotate=False, resize=False, skew=False, shear=False, brightness=False, contrast=False, color=False, target_width=299, target_height=299)

Class to generate augmented images for classification purposes. Images must be located in a folder with subfolders for each category.

Parameters
  • path (str) – location of the folders containing images for each category

  • option (bool) – whether to apply the given option during augmentation; option can be any of distortion, flip_horizontal, flip_vertical, random_crop, random_erasing, rotate, resize, skew, shear, brightness, contrast and color (the options are described at https://github.com/mdbloice/Augmentor)

  • target_width (int) – new width of resized images (if resize is True)

  • target_height (int) – new height of resized images (if resize is True)

generate_images(class_size_approximation)

Generate augmented images for all categories in the main folder according to the specified options. Images are generated until the desired size is reached and they are saved in the same subfolder as the original images.

Parameters

class_size_approximation (int) – approximate total number of images wanted for each category

generate_images_single_class(class_size_approximation, category)

Generate augmented images for a single category according to the specified options. Images are generated until the desired size is reached and they are saved in an outputs folder at the location of the original images.

Parameters
  • class_size_approximation (int) – approximate total number of images wanted

  • category (str) – folder name where images to augment are located in the main folder
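
As a rough illustration of how class_size_approximation might be chosen, the sketch below counts the images in each category subfolder and then augments every category up to about the size of the largest one. The helpers count_images_per_category and augment_to_largest_class, the list of image extensions and the chosen options (flip_horizontal, rotate) are assumptions made for this example and are not part of decavision; only the DataAugmentor constructor and the generate_images call follow the signatures documented above.

```python
from pathlib import Path

# Assumption: which file extensions count as images for this example.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_category(path):
    """Return {category: image count} for a folder with one subfolder per category."""
    counts = {}
    for subfolder in sorted(Path(path).iterdir()):
        if subfolder.is_dir():
            counts[subfolder.name] = sum(
                1
                for f in subfolder.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


def augment_to_largest_class(path):
    """Hypothetical helper: augment each category to roughly the largest class size."""
    # Imported lazily so the counting helper works without decavision installed.
    from decavision.dataset_preparation.data_augmentation import DataAugmentor

    # Raises ValueError if the folder contains no category subfolders.
    target = max(count_images_per_category(path).values())
    augmentor = DataAugmentor(path, flip_horizontal=True, rotate=True)
    augmentor.generate_images(target)
```

Calling count_images_per_category first is also a cheap way to check that the folder really has one subfolder per category before any images are generated.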

Make tfrecords

class decavision.dataset_preparation.generate_tfrecords.TfrecordsGenerator

Class to convert images into the tfrecords format used to train neural networks. The resulting files can be saved locally or to Google Cloud Storage. It cannot be used with a TPU because local files need to be read. Strongly inspired by: https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36

convert_image_folder(img_folder='data/image_dataset/train', output_folder='data/tfrecords_dataset/train', multilabel=False, img_folder_new=None, target_size=None, shards=16, json_path=None)

Convert all images in a folder (such as train or val) to tfrecords. The folder must contain one subfolder per category. Data from two folders can be combined to perform progressive learning. Tfrecords can be saved locally or on Google Cloud Storage. A csv file containing the names of the classes is also saved.

For multilabel problems, all the images must be in a single folder, and a json file must exist whose keys are the filenames and whose values are lists of labels.

Parameters
  • img_folder (str) – location of the images

  • output_folder (str) – folder in which to save the results; any existing content of the folder is deleted before the new data is saved

  • img_folder_new (str) – if specified, images from this folder are included in the tfrecords as new categories for the purpose of progressive learning

  • multilabel (bool) – True if it is a multilabel problem

  • shards (int) – number of files to create

  • target_size (tuple(int,int)) – size to reshape the images if desired

  • json_path (str) – location of the json file, only used for multilabel
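
The multilabel json file described above (filenames as keys, lists of labels as values) could be produced along the following lines. This is a minimal sketch: write_multilabel_json and convert_multilabel are hypothetical helpers, the example filenames and labels are made up, and the requirement that each label be a string is an assumption of this example. The call to convert_image_folder uses only the parameters documented above.

```python
import json
from pathlib import Path


def write_multilabel_json(labels, json_path):
    """Write {filename: [label, ...]} to json_path after basic validation."""
    for filename, tags in labels.items():
        # Assumption: every value is a list of string labels.
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            raise ValueError(f"labels for {filename!r} must be a list of strings")
    Path(json_path).write_text(json.dumps(labels, indent=2))


def convert_multilabel(img_folder, json_path, output_folder):
    """Hypothetical helper: convert a single folder of images with a json label file."""
    # Imported lazily so the json helper works without decavision installed.
    from decavision.dataset_preparation.generate_tfrecords import TfrecordsGenerator

    generator = TfrecordsGenerator()
    generator.convert_image_folder(
        img_folder=img_folder,
        output_folder=output_folder,  # existing content of this folder is deleted
        multilabel=True,
        json_path=json_path,
    )


# Made-up example labels for two images located in the same folder.
example_labels = {
    "img_001.jpg": ["cat", "indoor"],
    "img_002.jpg": ["dog"],
}
```

Validating the label file before conversion avoids discovering a malformed entry only after the output folder has already been emptied.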

Generate pseudo labels for unlabelled data

class decavision.dataset_preparation.generate_pseudolabels.PseudoLabelGenerator(model_path='model.h5', train_data_path='data/image_dataset/train', unlabeled_path='data/unlabeled', pseudo_data_path='data/train_ssl', output_folder='outputs', csv_filename='data.csv')

Class to generate pseudo labels for unlabeled images using a trained model.

Parameters
  • model_path (str) – location of the h5 tensorflow model to use

  • train_data_path (str) – folder which holds training data

  • unlabeled_path (str) – folder which holds unlabeled data

  • pseudo_data_path (str) – folder to store training data and pseudo data combined

  • output_folder (str) – folder to store outputs

  • csv_filename (str) – name of csv file

generate_pseudolabel_data(plot_confidences=False, threshold=None, move_images=False, batch_size=32)

Use the trained model to generate pseudo labels and save them to a csv file. The results can also be plotted, and the unlabeled images can be moved directly to the category corresponding to their pseudo label.

Parameters
  • plot_confidences (bool) – whether to plot confidence graphs for raw confidences and per-class confidences

  • threshold (float) – Discard images with prediction below this confidence, default is None. Only used if move_images is True.

  • move_images (bool) – whether to move the images into the folders of their pseudo-labelled categories

  • batch_size (int) – Batch size while making predictions

Returns

A folder with both labeled and pseudo labeled images.

Return type

pseudo_data_path

move_unlabeled_images(threshold=None)

Split unlabeled images into folders of the training data based on pseudo labels. A new training dataset is created with the labeled and pseudo labeled data.

Parameters

threshold (float) – Discard images with prediction below this confidence, default is None.
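
To make the effect of threshold concrete, the sketch below shows one way confidence filtering of pseudo labels can work: each image receives the class with the highest predicted probability, and it is kept only if that probability reaches the threshold. This is an illustration, not decavision's implementation. The function select_pseudo_labels, its input format and the choice to keep predictions exactly equal to the threshold are assumptions. The helper build_pseudo_labelled_dataset is also hypothetical and only shows how the documented parameters might be combined, using the default paths from the class signature.

```python
def select_pseudo_labels(predictions, class_names, threshold=None):
    """Return {filename: (label, confidence)} for predictions that pass the threshold.

    predictions maps each filename to a list of per-class probabilities,
    given in the same order as class_names.
    """
    selected = {}
    for filename, probs in predictions.items():
        confidence = max(probs)
        label = class_names[probs.index(confidence)]
        # With threshold=None every image is kept, mirroring the documented default.
        if threshold is None or confidence >= threshold:
            selected[filename] = (label, confidence)
    return selected


def build_pseudo_labelled_dataset(threshold=0.8):
    """Hypothetical helper: label the unlabeled images and move the confident ones."""
    # Imported lazily so the filtering sketch works without decavision installed.
    from decavision.dataset_preparation.generate_pseudolabels import PseudoLabelGenerator

    generator = PseudoLabelGenerator(
        model_path="model.h5",
        train_data_path="data/image_dataset/train",
        unlabeled_path="data/unlabeled",
        pseudo_data_path="data/train_ssl",
    )
    generator.generate_pseudolabel_data(
        plot_confidences=True, threshold=threshold, move_images=True
    )
```

A higher threshold adds fewer but more reliable pseudo-labelled images to the training data, so inspecting the confidence plots before settling on a value is usually worthwhile.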

plot_confidence_scores(per_class=True, overall=True)

Generate bar plots for highest confidence predictions per class and overall and save them.

Parameters
  • per_class (bool) – make bar plots per class or not

  • overall (bool) – make overall bar plot or not