Home
"Demonstrating binary classification through machine learning, this project handles large image datasets in a cloud platform. It showcases cloud data management and the fundamentals of neural networks."
1. Dataset
The SEN12-FLOOD dataset [1] is built from 336 time series containing Sentinel-1 and Sentinel-2 satellite images of areas that suffered a major flooding event during the winter of 2019. The acquisition period runs from December 2018 to May 2019. The observed areas are clustered in East Africa, South West Africa, the Middle East, and Australia. A sequence corresponds to several tiles of size 512x512, each of which is a crop of a given acquisition.
The problem at hand is binary image classification for identification of the presence of flooding (True = flooding, False = no flooding). The feature data comprises the images in the dataset, and the labels are obtained from a set of accompanying JSON files containing information on geographic location, image date, satellite orbit, and flooding status (flooding or no flooding). Examples of the images and their respective labels are provided in Figure 1.
2. Image Processing and Partitioning
The dataset comprises more than 36,000 images distributed across various time series sets. Handling this vast amount of data locally could overtax the hard drive of a typical machine. To streamline this process, the images and their corresponding labels are stored in a Google Cloud Storage bucket.
To address the issue of large-scale data processing, a multiprocessing approach is employed. A Python script parses through the dataset directory, identifying each image file and corresponding label file. These files are processed in parallel using a pool of worker processes. Two functions handle the source image files and label files respectively. Each function extracts necessary information from the files, organizing the data into a structured dictionary format.
The source and label data, returned from the functions, are structured into separate dataframes. The dataframes are then merged based on a common key, creating a comprehensive dataframe containing all relevant information from both the source files and label files.
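A minimal sketch of this parsing and merging step is shown below. The function names, the directory layout, the JSON field (`FLOODING`), and the join key are illustrative assumptions rather than the project's actual identifiers.

```python
import json
import multiprocessing as mp
from pathlib import Path

import pandas as pd

def parse_source_file(path):
    """Extract identifying fields from an image file path (illustrative)."""
    p = Path(path)
    return {"key": p.stem, "image_path": str(p), "series": p.parent.name}

def parse_label_file(path):
    """Read a JSON label file and pull out the flooding flag (illustrative)."""
    with open(path) as f:
        meta = json.load(f)
    return {"key": Path(path).stem, "flooding": bool(meta.get("FLOODING", False))}

if __name__ == "__main__":
    image_paths = [str(p) for p in Path("sen12flood").rglob("*.tif")]
    label_paths = [str(p) for p in Path("sen12flood").rglob("*.json")]

    # Parse image and label files in parallel with a pool of worker processes.
    with mp.Pool() as pool:
        source_records = pool.map(parse_source_file, image_paths)
        label_records = pool.map(parse_label_file, label_paths)

    # Build one dataframe per file type, then merge on the shared key to get
    # a single comprehensive dataframe.
    source_df = pd.DataFrame(source_records)
    label_df = pd.DataFrame(label_records)
    full_df = pd.merge(source_df, label_df, on="key", how="inner")
```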
Once the data has been organized, the dataset is split into training, validation, and test sets using a stratified-split function. This function employs a stratified shuffle split approach, preserving the percentage of samples for each class during the splitting process. This is particularly vital when handling imbalanced datasets, ensuring each split has representative samples from all classes.
As a result, the stratified-split function returns separate dataframes for the training, validation, and test datasets, along with their respective labels. This approach ensures the data feeding into the machine learning models is well-structured, representative, and suitably split for optimal model training and performance evaluation.
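The splitting step could look roughly like the following sketch, which uses scikit-learn's `StratifiedShuffleSplit`; the split ratios, label column name, and random seed are assumptions.

```python
from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split(df, label_col="flooding", test_size=0.2, val_size=0.1, seed=42):
    """Split a dataframe into train/val/test while preserving class proportions."""
    # First carve out the test set.
    sss = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val_idx, test_idx = next(sss.split(df, df[label_col]))
    train_val_df, test_df = df.iloc[train_val_idx], df.iloc[test_idx]

    # Then split the remainder into training and validation sets.
    rel_val = val_size / (1.0 - test_size)
    sss_val = StratifiedShuffleSplit(n_splits=1, test_size=rel_val, random_state=seed)
    train_idx, val_idx = next(sss_val.split(train_val_df, train_val_df[label_col]))
    train_df, val_df = train_val_df.iloc[train_idx], train_val_df.iloc[val_idx]

    # Return features and labels for each split.
    return (
        train_df.drop(columns=[label_col]), train_df[label_col],
        val_df.drop(columns=[label_col]), val_df[label_col],
        test_df.drop(columns=[label_col]), test_df[label_col],
    )
```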
3. Feature Extraction
Google Colab notebooks can interface directly with the cloud storage bucket, using Python packages designed for this exact purpose. File paths stored in the comprehensive dataframe are then used to access individual files in the bucket directly, eliminating the need for intermediate storage. This approach significantly reduces memory and disk usage, making the operation more efficient and manageable.
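One common way to read bucket objects directly from a notebook is through the `gcsfs` package; the sketch below assumes this approach, and the project ID, bucket name, and object path are placeholders.

```python
import gcsfs
from PIL import Image

# gcsfs exposes the bucket as a file system; in Colab, credentials can be
# supplied beforehand with google.colab.auth.authenticate_user().
fs = gcsfs.GCSFileSystem(project="my-project-id")  # placeholder project ID

# A file path stored in the comprehensive dataframe can be opened straight
# from the bucket, with no intermediate download to local disk.
with fs.open("gs://my-flood-bucket/sen12flood/0001/S2_image.tif", "rb") as f:
    img = Image.open(f).convert("L")
```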
With the dataset securely hosted on Google Cloud Storage, the feature extraction phase (Figure 2) can commence. A custom image dataset class is implemented that allows the large image dataset to be accessed and processed with minimal effort. This class is initialized with the comprehensive dataframe containing the file paths to the images and, optionally, any image transformations to apply. It implements the `torch.utils.data.Dataset` interface, providing full control over how the data is accessed.
A class method is defined enabling access to individual images using their indices. This method opens an image file given its path, converts it to grayscale, applies any requested transformations, and finally retrieves its label. The transformations, defined using PyTorch's `transforms` module, include resizing the images to a standard 224x224 format, converting them to tensor format, and normalizing them using standard ImageNet constants. This ensures that the images are in an optimal format for the machine learning model.
Upon defining this class, the process of creating the image datasets is encapsulated into a function. This function takes in feature and target dataframes, along with any transformations, to output a ready-to-use instance of the image dataset.
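A condensed sketch of the dataset class and factory function is shown below. Class, column, and helper names are illustrative, and grayscale images are expanded to three channels here so the standard three-channel ImageNet normalization constants apply.

```python
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class FloodImageDataset(Dataset):
    """Wraps the comprehensive dataframe so images are read lazily by index."""

    def __init__(self, features_df, labels, transform=None):
        self.features_df = features_df.reset_index(drop=True)
        self.labels = labels.reset_index(drop=True)
        self.transform = transform

    def __len__(self):
        return len(self.features_df)

    def __getitem__(self, idx):
        # The path may be local or, in the cloud setup, opened through gcsfs.
        path = self.features_df.loc[idx, "image_path"]
        image = Image.open(path).convert("L")  # grayscale
        if self.transform:
            image = self.transform(image)
        label = torch.tensor(float(self.labels.loc[idx]))
        return image, label

# Resize to 224x224, convert to tensor, and normalize with ImageNet statistics.
default_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def make_dataset(features_df, labels, transform=default_transform):
    """Build a ready-to-use image dataset from feature and target dataframes."""
    return FloodImageDataset(features_df, labels, transform)
```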
To make this data accessible to the machine learning model, the datasets are wrapped with PyTorch's `DataLoader`. This provides an iterable over each dataset, allowing the images to be batched, shuffled, and loaded with multiple worker processes. The `DataLoader` instances (`train_loader`, `val_loader`, and `test_loader`) represent the training, validation, and test datasets respectively.
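Wrapping the datasets might look like the following, assuming `train_dataset`, `val_dataset`, and `test_dataset` were produced by the factory function above; the batch size of 32 matches the test-batch size reported later, while the worker count is an assumption.

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=2)
```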
With the feature extraction pipeline in place, model training can commence. This efficient and organized process allows for handling a large volume of images without overwhelming local resources, making it a scalable solution for image-based machine learning projects.
4. Model Implementation and Selection
A logistic regression model is implemented as a class employing PyTorch's neural network module. It is initialized with two arguments: the dimensionality of the input data and the learning rate for the optimization algorithm. The model uses GPU acceleration if available and otherwise defaults to the CPU. The model itself is defined by two parameters: the weight vector "w", initialized to zeros, and the bias term "b". The gradients of these parameters are stored in a dictionary called "grads".
The forward propagation of the model is calculated using the sigmoid function, which maps any real-valued number into the range between 0 and 1. This makes it suitable for models where the output needs to represent a probability. Backpropagation computes the gradient of the loss function with respect to the model parameters. This gradient information is later used by the optimizer to adjust the parameters and minimize the loss.
The model's loss function is calculated as the cross-entropy between the predicted probabilities and the true labels. It represents the average negative log-probability of the correct class under the model’s predictions.
A utility function is provided for prediction, where any value greater than or equal to 0.5 is rounded up to 1, otherwise, it's rounded down to 0. The accuracy of the model is then calculated as the fraction of predictions that match the true labels.
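Pulling these pieces together, a from-scratch version of such a model might look like the sketch below, written with plain tensors rather than `nn.Module` layers; inputs are assumed to be flattened image tensors already on the model's device, and all implementation details are illustrative.

```python
import torch

class LogisticRegression:
    """Manual logistic regression on flattened image tensors."""

    def __init__(self, n_features, learning_rate=0.01):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.lr = learning_rate
        # Weight vector "w" initialized to zeros, plus a scalar bias "b".
        self.w = torch.zeros(n_features, 1, device=self.device)
        self.b = torch.zeros(1, device=self.device)
        self.grads = {"dw": None, "db": None}

    def forward(self, X):
        # The sigmoid maps the linear score into a probability in (0, 1).
        return torch.sigmoid(X @ self.w + self.b)

    def loss(self, y_hat, y):
        # Average binary cross-entropy (negative log-probability of the true class).
        eps = 1e-8
        return -(y * torch.log(y_hat + eps) + (1 - y) * torch.log(1 - y_hat + eps)).mean()

    def backward(self, X, y_hat, y):
        # Gradients of the cross-entropy loss with respect to w and b.
        m = X.shape[0]
        self.grads["dw"] = X.T @ (y_hat - y) / m
        self.grads["db"] = (y_hat - y).mean()

    def step(self):
        # Gradient-descent update of the parameters.
        self.w -= self.lr * self.grads["dw"]
        self.b -= self.lr * self.grads["db"]

    def predict(self, X):
        # Threshold the predicted probabilities at 0.5 to obtain hard labels.
        return (self.forward(X) >= 0.5).float()
```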
For model selection, various learning rates are explored. The model is trained using these learning rates and the one that achieves the highest accuracy on the validation set is selected. During training, both the training and validation sets are passed through the model. After each iteration, the model parameters are updated to minimize the loss function.
There is a mechanism to stop training if the validation accuracy does not improve for a number of iterations, a method known as "Early Stopping". This helps prevent overfitting and reduces computational waste.
Lastly, the model is tested on a test set to evaluate its performance on unseen data. The model's cost and accuracy metrics are logged every 10 iterations for further analysis. The optimal learning rate and the corresponding best validation accuracy are then displayed.
The model parameters are saved whenever there's an improvement in validation accuracy. At the end of training, the parameters of the best model are loaded back into the model for future predictions.
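A sketch of the selection loop, including early stopping and checkpointing of the best parameters, is given below. The candidate learning rates, iteration budget, patience value, and the pre-flattened tensors `X_train`, `y_train`, `X_val`, and `y_val` are all assumptions.

```python
candidate_lrs = [0.001, 0.01, 0.1, 1.0]   # assumed search grid
patience = 20                             # iterations allowed without improvement
n_features = 3 * 224 * 224                # flattened 3 x 224 x 224 images

best_overall = {"lr": None, "val_acc": 0.0, "params": None}

for lr in candidate_lrs:
    model = LogisticRegression(n_features, learning_rate=lr)
    best_val_acc, stale = 0.0, 0
    best_params = (model.w.clone(), model.b.clone())

    for it in range(1000):
        # One full-batch gradient-descent step on the training data.
        y_hat = model.forward(X_train)
        cost = model.loss(y_hat, y_train)
        model.backward(X_train, y_hat, y_train)
        model.step()

        # Evaluate accuracy on the validation set.
        val_acc = (model.predict(X_val) == y_val).float().mean().item()

        if val_acc > best_val_acc:
            best_val_acc, stale = val_acc, 0
            best_params = (model.w.clone(), model.b.clone())   # checkpoint best parameters
        else:
            stale += 1
            if stale >= patience:   # early stopping
                break

        # Log cost and accuracy every 10 iterations.
        if it % 10 == 0:
            print(f"lr={lr} iter={it} cost={cost.item():.4f} val_acc={val_acc:.4f}")

    if best_val_acc > best_overall["val_acc"]:
        best_overall = {"lr": lr, "val_acc": best_val_acc, "params": best_params}

# Reload the best parameters for final evaluation on the test set.
print(f"Best learning rate: {best_overall['lr']}, best validation accuracy: {best_overall['val_acc']:.4f}")
best_model = LogisticRegression(n_features, learning_rate=best_overall["lr"])
best_model.w, best_model.b = best_overall["params"]
```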
5. Results and Conclusions
Results indicate that the best hyperparameter value found during model selection is a learning rate of 0.1, and that this configuration achieved the highest validation accuracy of 78.125%. The learning rate is a hyperparameter that determines the step size taken at each iteration while moving towards a minimum of the loss function; it dictates how much the model parameters change in response to the estimated error each time the weights are updated.
The model's performance on both a single batch of the test set (n=32) and the entire test set (n=5704) provides further insight into the model's capabilities and potential limitations.

| Evaluation set | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Single test batch (n=32) | 0.81 | 0.50 | 0.50 |
| Entire test set (n=5704) | 0.65 | 0.30 | 0.36 |

On the single test batch, the model correctly predicted the class 81% of the time, and precision and recall were both 0.50, indicating a balanced trade-off between false positives and false negatives for this batch. On the entire test set, however, accuracy dropped to 0.65, somewhat lower than the accuracy observed on the validation set and the single test batch. Precision decreased to 0.30, indicating a higher rate of false positives, and recall decreased to 0.36, implying that the model does not capture all the true positives it could.
The observed decrease in performance metrics on the entire test set suggests that the model may not be generalizing well to unseen data. There may be overfitting to the training data or the model may not have learned some patterns in the data that are necessary for making accurate predictions on the test set. This could imply the need for further hyperparameter tuning, application of regularization techniques, or a more complex model architecture to improve performance.
In practice, developing a classifier for use in a production environment would involve a much more thorough process. This process would typically include, but is not limited to, rigorous exploratory data analysis, more complex feature engineering and selection strategies, advanced model architectures, extensive hyperparameter tuning, robust evaluation techniques, and stringent model validation methodologies.
This project is intended primarily for demonstration and educational purposes, and it serves as a basic illustration of how a machine learning model can be developed, trained, and evaluated. The techniques used, and the performance of the model, may not be representative of what would be employed or achieved in a real-world, professional setting. The model in this demonstration was trained and tested without extensive fine-tuning. In a real-world scenario, we would typically use more data and apply more sophisticated techniques to maximize the predictive accuracy of the model, prevent overfitting, and ensure it generalizes well to unseen data.
Potential strategies for improving model accuracy include further hyperparameter tuning and expanding the model architecture. Other hyperparameters for logistic regression include the solver type and the regularization strength. Ranges of these hyperparameters can be tested using a grid or randomized search to identify the best model with the optimal hyperparameter configuration. Experimentation with model architectures more sophisticated than logistic regression, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), could also improve accuracy.
6. References