Preparing datasets
This page documents the functions and classes available to build a dataset of images and prepare it for training. Functions that label unlabelled data for semi-supervised learning are also covered.
Examples of how to use these functions can be found in Build a dataset from scratch and in Improve a classification model using unlabelled images.
Data augmentation
- class decavision.dataset_preparation.data_augmentation.DataAugmentor(path, distortion=False, flip_horizontal=False, flip_vertical=False, random_crop=False, random_erasing=False, rotate=False, resize=False, skew=False, shear=False, brightness=False, contrast=False, color=False, target_width=299, target_height=299)
Class to generate augmented images for classification purposes. Images must be located in a folder with subfolders for each category.
- Parameters
path (str) – location of the folders containing images for each category
option (bool) – whether to apply the given option during augmentation; option stands for any of distortion, flip_horizontal, flip_vertical, random_crop, random_erasing, rotate, resize, skew, shear, brightness, contrast and color (options are described at https://github.com/mdbloice/Augmentor)
target_width (int) – new width of resized images (if resize is True)
target_height (int) – new height of resized images (if resize is True)
- generate_images(class_size_approximation)
Generate augmented images for all categories in the main folder according to the specified options. Images are generated until the desired size is reached and they are saved in the same subfolder as the original images.
- Parameters
class_size_approximation (int) – approximate total number of images wanted for each category
- generate_images_single_class(class_size_approximation, category)
Generate augmented images for a single category according to the specified options. Images are generated until the desired size is reached and they are saved in an outputs folder at the location of the original images.
- Parameters
class_size_approximation (int) – approximate total number of images wanted
category (str) – folder name where images to augment are located in the main folder
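Before augmenting, it can help to check the folder layout and how far each category is from the target size. The following is a minimal sketch using only the standard library. The helper `images_needed` and the list of image extensions are illustrative assumptions, not part of decavision, and the library's own counting logic may differ.

```python
import tempfile
from pathlib import Path

# Assumption: which file extensions count as images (not taken from decavision).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def images_needed(path, class_size_approximation):
    """Hypothetical helper: extra images needed per category to reach the target."""
    shortfall = {}
    for category_dir in sorted(Path(path).iterdir()):
        if not category_dir.is_dir():
            continue  # only subfolders are treated as categories
        n_images = sum(
            1 for f in category_dir.iterdir()
            if f.suffix.lower() in IMAGE_EXTENSIONS
        )
        shortfall[category_dir.name] = max(0, class_size_approximation - n_images)
    return shortfall


# Build a toy dataset: one subfolder per category, as the class expects.
root = Path(tempfile.mkdtemp())
for category, count in {"cats": 3, "dogs": 5}.items():
    (root / category).mkdir()
    for i in range(count):
        (root / category / f"img_{i}.jpg").touch()

needed = images_needed(root, class_size_approximation=5)
print(needed)
```

With the library installed, and going by the signatures documented above, the corresponding call would be along the lines of `DataAugmentor(path, flip_horizontal=True).generate_images(class_size_approximation)`.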
Make tfrecords
- class decavision.dataset_preparation.generate_tfrecords.TfrecordsGenerator
Class to transform images into tfrecords format to train neural networks. Resulting files can be saved to Google Cloud Storage or locally. Can’t be used with a TPU because local files need to be read. Strongly inspired by: https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
- convert_image_folder(img_folder='data/image_dataset/train', output_folder='data/tfrecords_dataset/train', multilabel=False, img_folder_new=None, target_size=None, shards=16, json_path=None)
Convert all images in a folder (like train or val) to tfrecords. The folder must contain one subfolder per category. Data from two folders can be combined for progressive learning. Tfrecords can be saved locally or on Google Cloud Storage. A CSV file containing the names of the classes is also saved.
For multilabel problems, all images must be in a single folder, and a JSON file must exist whose keys are the filenames and whose values are lists of labels.
- Parameters
img_folder (str) – location of the images
output_folder (str) – folder in which to save the results; existing content of the folder is deleted before the new data is saved
img_folder_new (str) – if specified, images from this folder are included in the tfrecords as new categories for the purpose of progressive learning
multilabel (bool) – True if it is a multilabel problem
shards (int) – number of files to create
target_size (tuple(int,int)) – size to which the images are resized, if desired
json_path (str) – location of the json file, only used for multilabel
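The JSON file used for multilabel problems maps each filename to its list of labels. The sketch below writes and reads back such a file with the standard library. The filenames, labels and file name `labels.json` are made up for illustration, and deriving a sorted class list is an assumption about how the labels might be inspected, not a description of what decavision does internally.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical filenames and labels, for illustration only.
labels = {
    "img_001.jpg": ["beach", "sunset"],
    "img_002.jpg": ["mountain"],
    "img_003.jpg": ["beach", "people"],
}

# Write the mapping: keys are filenames, values are lists of labels.
json_path = Path(tempfile.mkdtemp()) / "labels.json"
json_path.write_text(json.dumps(labels, indent=2))

# Read the file back and collect the distinct labels it contains.
loaded = json.loads(json_path.read_text())
classes = sorted({label for image_labels in loaded.values() for label in image_labels})
print(classes)
```

The path to such a file would then be passed as `json_path`, together with `multilabel=True`, to `convert_image_folder`.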
Generate pseudo labels for unlabelled data
- class decavision.dataset_preparation.generate_pseudolabels.PseudoLabelGenerator(model_path='model.h5', train_data_path='data/image_dataset/train', unlabeled_path='data/unlabeled', pseudo_data_path='data/train_ssl', output_folder='outputs', csv_filename='data.csv')
Class to generate pseudo labels for unlabeled images using a trained model.
- Parameters
model_path (str) – location of the TensorFlow model (h5 format) to use
train_data_path (str) – folder which holds training data
unlabeled_path (str) – folder which holds unlabeled data
pseudo_data_path (str) – folder to store training data and pseudo data combined
output_folder (str) – folder to store outputs
csv_filename (str) – name of csv file
- generate_pseudolabel_data(plot_confidences=False, threshold=None, move_images=False, batch_size=32)
Use the trained model to generate pseudo labels and save them to a CSV file. The results can also be plotted, and the unlabeled images can be moved directly into the category corresponding to their pseudo label.
- Parameters
plot_confidences (bool) – whether to plot confidence graphs for raw confidences and per-class confidences
threshold (float) – discard images whose prediction confidence is below this value; default is None. Only used if move_images is True.
move_images (bool) – whether to move images into their categories
batch_size (int) – batch size used when making predictions
- Returns
A folder with both labeled and pseudo labeled images.
- Return type
pseudo_data_path
- move_unlabeled_images(threshold=None)
Split unlabeled images into folders of the training data based on pseudo labels. A new training dataset is created with the labeled and pseudo labeled data.
- Parameters
threshold (float) – discard images whose prediction confidence is below this value; default is None
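The effect of `threshold` can be illustrated without a model. The following sketch filters hypothetical predictions by confidence and groups the surviving filenames by pseudo label. The helper `select_confident`, the sample data and the use of `>=` for the comparison are assumptions made for illustration; the source does not state how decavision treats a confidence exactly equal to the threshold.

```python
# Hypothetical predictions: (filename, pseudo label, confidence).
predictions = [
    ("a.jpg", "cats", 0.97),
    ("b.jpg", "dogs", 0.55),
    ("c.jpg", "cats", 0.81),
]


def select_confident(predictions, threshold=None):
    """Hypothetical helper: keep predictions at or above `threshold` (all if None)."""
    if threshold is None:
        return list(predictions)
    return [p for p in predictions if p[2] >= threshold]


kept = select_confident(predictions, threshold=0.8)

# Group the surviving filenames by the category folder they would be moved to.
by_category = {}
for filename, label, _ in kept:
    by_category.setdefault(label, []).append(filename)
print(by_category)
```

A higher threshold keeps fewer but more reliable pseudo labels, while `threshold=None` keeps every unlabeled image.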
- plot_confidence_scores(per_class=True, overall=True)
Generate bar plots for highest confidence predictions per class and overall and save them.
- Parameters
per_class (bool) – make bar plots per class or not
overall (bool) – make overall bar plot or not