Synergos in Action

We’ve launched Synergos, a Federated Learning platform AI Singapore has been working on.

The launch marks the start of a long journey, and we invite you to join us on it. Start by using Synergos as a tool in your machine learning toolbox, and consider contributing your code to the repo. Do share your feedback and suggestions for new features or areas we can improve.

Now, let’s see a case study of Synergos in action.


The Problem Statement

AI Singapore has been working with Seagate Technology, the global leader in mass-data storage solutions, from hard drives to storage-as-a-service and mass-data storage shuttles. Many of its customers deploy large fleets of data storage devices manufactured by Seagate. These devices regularly generate telemetry logs, along with health logs that help monitor the health of the devices.

Seagate is developing a predictive analytics service for its customers to forecast which devices are likely to need servicing. This enables Seagate to provide a value-added service that ensures high availability of its customers’ operations, right-sizing maintenance to keep those operations undisrupted with low maintenance cost overhead.

Each customer keeps the logs of its whole fleet of devices locally. Moving data from different customers to a single location for training presents potential challenges in preserving data ownership, residency, and sovereignty. Therefore, to increase customer adoption of telemetry data predictive analytics, while achieving utility from the data, Seagate is collaborating with AI Singapore to apply Synergos in building the predictive maintenance model with Federated Learning.

This is a multi-phase collaboration. Before deploying Federated Learning into production, Seagate would like to run a pilot to validate its efficacy. In the current phase, the goal is to validate whether the models built with Synergos (i.e. federated training mode) can achieve the same level of performance as those built with different customers’ data pooled into a centralized location (i.e. centralized training mode).


The Data

In this phase, we are working with two datasets from Seagate. In keeping with Seagate’s uncompromising ethical standards when it comes to data, the raw log data was anonymized and in no way ties back to user data.

For each data source, the raw data is composed of the device fleets’ telemetry logs and their corresponding health logs. The data is collected daily over a 20-day period. The raw data is processed by Seagate to construct the data used in training.

  • Records: Each record corresponds to one device on a given day. A device typically has multiple records, since it may generate telemetry logs on different days.
  • Features: The telemetry data initially contains a few hundred features. After some feature engineering by Seagate (e.g. feature normalization, missing value imputation; see the sketch after this list), about ⅓ of the features are eventually selected. The feature names are not exposed to the AI Singapore team and are replaced by their indices; only the feature data types are provided.
  • Labels: Labels are derived from the health logs, which record the health status of the devices. There are three possible label values: 0, 1, and 2. Higher values indicate that a device is closer to requiring servicing, with label 0 meaning no failure within the data period.
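To make the feature engineering concrete, here is a minimal sketch of the two steps named above using scikit-learn. Seagate’s actual pipeline is not disclosed, so the file name and imputation strategy are assumptions.

```python
# Minimal sketch of the preprocessing steps named above. Seagate's actual
# pipeline is not disclosed; the file name and imputation strategy here
# are hypothetical.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing telemetry values
    ("scale", StandardScaler()),                   # normalize each feature
])

telemetry = pd.read_csv("telemetry_features.csv")  # hypothetical file; columns are feature indices
X = preprocess.fit_transform(telemetry)
```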

For Dataset 1, the distribution of labels is 0: 68.2% / 1: 15.4% / 2: 16.4%, while for Dataset 2 it is 0: 69.6% / 1: 22.3% / 2: 8.1%, as illustrated in the chart below. The individual datasets show signs of non-IID data, as the two data sources have different class ratios. Their sizes also differ significantly: Dataset 2 has about 35 times as many records as Dataset 1. Seagate confirmed this data heterogeneity, as its data sources apply different workloads to their devices.

For the train-test split, 30% of the unique devices that ever failed (class 1 or 2) and 30% of those that never failed (class 0) are kept in the test set, while the remaining 70% of devices in each category form the train set. This splits each dataset in a stratified manner at the device level, as sketched below.
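A rough sketch of this device-level split, assuming hypothetical device_id and label columns:

```python
# Sketch of the device-level stratified split: 30% of devices that ever
# failed (label 1 or 2) and 30% of devices that never failed go to the
# test set; every record of a device follows that device. Column names
# are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

records = pd.read_csv("records.csv")  # one row per device per day

# Device-level stratum: whether the device ever failed (max label > 0).
ever_failed = records.groupby("device_id")["label"].max().gt(0)

train_devices, test_devices = train_test_split(
    ever_failed.index,
    test_size=0.3,
    stratify=ever_failed.values,
    random_state=42,
)

train_df = records[records["device_id"].isin(train_devices)]
test_df = records[records["device_id"].isin(test_devices)]
```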


The Methodology

As discussed above, the main objective of the current phase is to validate whether the models built with Synergos in federated training mode can achieve the same level of performance as those built in centralized training mode.


Model Training in Centralized Mode

We first built models in centralized mode. Seagate had previously worked on a centralized Gradient Boosting Decision Tree (GBDT) model, a popular tree-based ensemble model. As Synergos’s support for non-neural-network models is still in development, however, we focus on neural network models in the current phase of the collaboration, and built a neural network model in centralized mode as the benchmark.

In the centralized mode, an aggregated dataset is created by combining data from both datasets. The training data and testing data are aggregated separately after the stratified split is done.

We then ran several experiments in Polyaxon, a tool used at AI Singapore for experiment management. The best performing model, denoted NN9, is selected based on its weighted F1 scores across datasets, balanced against the model architecture complexity and size needed to achieve those scores. The model architecture is shown below:
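The diagram holds the exact layer widths; from the text we only know that NN9 is a relatively simple network with three hidden layers, so the sketch below is merely an illustrative stand-in with hypothetical sizes.

```python
# Illustrative stand-in for NN9: the text only says it is a relatively
# simple network with three hidden layers. Input width, hidden sizes,
# and activations are hypothetical; the exact values are in the diagram.
import torch.nn as nn

N_FEATURES = 100  # hypothetical; the selected ~1/3 of a few hundred features
N_CLASSES = 3     # labels 0, 1, 2

nn9 = nn.Sequential(
    nn.Linear(N_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, N_CLASSES),  # logits; softmax is applied inside the loss
)
```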

This model serves as a baseline when validating the efficacy of Federated Learning.


Model Training in Federated Mode

There are two data sources, each treated as one participant in the federated training. Training is coordinated across these two participants by a trusted third party (TTP). The same train-test split strategy is applied to each party.

Synergos supports multiple federated aggregation algorithms, including FedAvg, FedProx, FedGKT, etc. In the current phase of the collaboration, we applied two of them, namely FedProx and FedGKT.

FedProx is an enhancement of FedAvg, which is usually seen as the most basic federated aggregation algorithm. In FedAvg, different parties train a global model collectively, with a TTP coordinating the training across parties. At each global training round t, the global model is sent to all parties. Each party performs local training on its own dataset, typically using mini-batch gradient descent, for E local epochs with mini-batch size B. After every E local epochs, each party sends the parameters of its most recent model state to the TTP. The TTP then updates the global model by taking a weighted average of the parameters received from the parties, with each party’s weight θ proportional to the number of records it used in local training. This process iterates until the global model converges or a predefined number of global training rounds is reached. The diagram below gives a simplified illustration of the FedAvg aggregation process.
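For clarity, here is a minimal code sketch of one FedAvg global round; the local_train callable and the (local_train, n_records) pairing are hypothetical stand-ins for each party’s E local epochs.

```python
# Minimal sketch of one FedAvg global round. Each party's local_train
# callable (hypothetical) runs E local epochs of mini-batch gradient
# descent; the TTP then averages parameters weighted by record counts.
import copy

def fedavg_round(global_model, parties):
    """parties: list of (local_train, n_records) pairs."""
    states, weights = [], []
    for local_train, n_records in parties:
        local_model = copy.deepcopy(global_model)  # start from the global model
        local_train(local_model)                   # E local epochs on local data
        states.append(local_model.state_dict())
        weights.append(n_records)

    total = sum(weights)
    # Weighted average of parameters, weights proportional to record counts.
    avg_state = {
        name: sum((w / total) * s[name] for w, s in zip(weights, states))
        for name in states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model
```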

FedProx uses a similar aggregation mechanism to FedAvg. One key improvement FedProx makes over FedAvg is an additional proximal term in the local training objective, which restricts the local updates to stay close to the latest global model and helps the federated training converge faster. The proximal term is scaled by a hyper-parameter µ, which is tuned during training.
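Concretely, the local objective becomes the task loss plus (µ/2)·‖w − w_t‖², where w_t is the latest global model. A sketch of one local training step with this proximal term (the function name and surrounding training loop are assumptions):

```python
# Sketch of a FedProx local training step: the usual task loss plus the
# proximal term (mu / 2) * ||w - w_global||^2, which keeps local updates
# close to the latest global model. The surrounding training loop is
# omitted; global_params are frozen copies of the global model's weights.
def fedprox_step(model, global_params, batch, loss_fn, optimizer, mu):
    x, y = batch
    optimizer.zero_grad()
    task_loss = loss_fn(model(x), y)
    # Squared distance between local and (frozen) global parameters.
    prox = sum(
        (p - g).pow(2).sum()
        for p, g in zip(model.parameters(), global_params)
    )
    (task_loss + 0.5 * mu * prox).backward()
    optimizer.step()
```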

Another federated aggregation algorithm used is FedGKT (Federated Group Knowledge Transfer). It was originally proposed to allow low-compute federated training of big CNN-based models with millions of parameters (e.g., ResNet 101, VGG 16/19, Inception, etc.) on resource-constrained edge devices (e.g., Raspberry Pi, Jetson Nano, etc.). The diagram below illustrates the training process of FedGKT.

FedGKT training (diagram adapted from original FedGKT paper)

Essentially, there is one model in FedGKT, split into two sub-models: each participating party trains a compact local model (called A), while the TTP trains a larger sub-model (called B). Model A on each party consists of a feature extractor and a classifier, and is trained with the party’s local data only (called local training). After local training, the feature extractors of all participating parties produce outputs of the same dimensions, which are fed as input to model B at the TTP. The TTP then trains B further, minimizing the gap between the ground truth and the soft labels (probabilistic predictions) produced by A’s classifier. When the TTP finishes training B, it sends B’s predicted soft labels back to the participating parties, who further train A’s classifier with local data only, again minimizing the gap between the ground truth and the soft labels predicted by B. The process iterates over multiple rounds until the model converges.

When the training finishes, the final model is a stacked combination of the local feature extractor and the shared model B. One of the main benefits of FedGKT is that it enables edge devices to train large CNN models, since the heavy compute is effectively shifted to the TTP, which usually has more compute power. Another benefit is model customization: different participating parties have different local feature extractors, each combined with the shared model B.

In this pilot, shifting heavy computation to the TTP is not the main motivation; FedGKT is chosen as one of the aggregation algorithms mainly for the benefit of model customization. When training with FedGKT, the selected baseline model NN9 is split into two parts at layer L. The layers below L are trained as the feature extractor (with an additional softmax layer acting as the classifier) by the participating parties with local data, while those above L are trained by the TTP. L is a hyperparameter to be tuned; we could set L=1 or L=2, since NN9 is a relatively simple model with three hidden layers.
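A sketch of the split at L=1, reusing the hypothetical NN9 layer widths from the centralized-mode sketch above:

```python
# Sketch of splitting NN9 at L=1 for FedGKT, reusing the hypothetical
# layer widths from the NN9 sketch above. Layers up to L stay on the
# party as the feature extractor (with a small local classifier); the
# remaining layers form the larger sub-model B trained at the TTP.
import torch.nn as nn

# Party side (model A): feature extractor + lightweight classifier.
extractor = nn.Sequential(nn.Linear(100, 128), nn.ReLU())
local_classifier = nn.Linear(128, 3)

# TTP side (model B): the rest of NN9, fed with the extractor's output.
server_model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),
)

# During training, each side minimizes cross-entropy on the ground truth
# plus a distillation term toward the other side's soft labels.
```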


Evaluation and Comparison of Different Models

After models are trained in both centralized and federated mode, we compile the results and compare the performance.

There are three models: the centralized model (which serves as the baseline), the federated model trained with FedProx, and the federated model trained with FedGKT.

When evaluating performance, we apply the models to the two datasets individually, focusing on Classes 1 and 2 (i.e. devices which failed within the recency thresholds). We then compare the performance achieved by the centralized model and the federated models; the federated models are expected to come close to the centralized model. The performance metrics used are precision, recall, and F1, computed per class as sketched below.
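A short sketch of how these per-class metrics can be computed with scikit-learn (variable and function names are illustrative):

```python
# Per-class precision, recall, and F1 for the two failure classes, as
# reported in the tables below. y_true and y_pred are label arrays for
# one dataset.
from sklearn.metrics import precision_recall_fscore_support

def failure_class_metrics(y_true, y_pred):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 2])
    return {
        cls: {"precision": p[i], "recall": r[i], "f1": f1[i]}
        for i, cls in enumerate([1, 2])
    }
```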


Actions in Synergos

The following illustrates the setup of Synergos. Each party runs as a Docker container, allowing convenient and easy deployment. With all the necessary containers initialized for each party, a single human operator can orchestrate the entire federated process from a Jupyter notebook or the GUI. Each customer runs a Synergos Worker container on its own compute resource. We set up Synergos to run a cluster of multiple federated grids. Each federated grid has one TTP, which coordinates the multiple Synergos Workers within the grid. A Director (running the Synergos Director container) orchestrates the multiple TTPs, leveraging a Message Queue Exchange to run jobs in parallel across the federated grids. The setup is illustrated below, with each terminal representing a running Docker container.

Users interact with Synergos via its GUI – Synergos Portal. There are two types of users, namely orchestrators and participants.

The orchestrator interacts with the Director, and defines the configuration of the federated training, i.e. the hierarchy of collaboration, project, experiment, and run.

A collaboration defines a coalition of parties agreeing to work together on a common goal (or problem statement). Within a collaboration, there may be multiple projects, each corresponding to a collection of data that the parties in the collaboration use. Under a project, there can be multiple experiments, each corresponding to one particular type of model to be trained, e.g. logistic regression or a neural network. Each experiment in turn holds multiple runs, each using a different set of hyper-parameters. In this case, the two datasets form a collaboration. Within this collaboration, one project has been defined, whose goal is to build the predictive maintenance model. Under this project, one experiment is defined, as we are only using the NN9 network. Under this experiment, there are multiple runs, each corresponding to a different hyper-parameter setting, including the federated aggregation method used (FedProx or FedGKT).
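To make the hierarchy concrete, here is a purely illustrative rendering of it as nested configuration; this is not Synergos’s actual API or schema, and the values are hypothetical:

```python
# Purely illustrative (not Synergos's actual API or schema): the
# collaboration > project > experiment > run hierarchy for this case study.
federation = {
    "collaboration": "seagate-aisg",  # the two datasets' coalition
    "projects": [{
        "name": "predictive-maintenance",
        "experiments": [{
            "model": "NN9",  # the only model type in this phase
            "runs": [        # one run per hyper-parameter setting
                {"aggregation": "FedProx", "mu": 0.1},       # mu value hypothetical
                {"aggregation": "FedGKT", "split_layer": 1},
            ],
        }],
    }],
}
```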

The interaction flow for the orchestrator is shown below.

With the configuration of federated training completed, the participants can then register themselves with the collaboration/project they want to contribute to, declaring the compute resources and data they are going to use.

After both the orchestrator and the participants have provided their metadata, the orchestrator starts the federated training; no further actions are required from the participants. The orchestrator can also view the progress of the federated training and the status of various Synergos components in a Command Station Analytics Dashboard. Please refer to our user guide for a walkthrough of the steps to build federated models in Synergos.


The Outcome

The performance of the centralized model NN9 is as follows. This serves as the baseline when comparing with the federated models.

| Dataset | Precision (Class 1) | Recall (Class 1) | F1 (Class 1) | Precision (Class 2) | Recall (Class 2) | F1 (Class 2) |
|---------|---------------------|------------------|--------------|---------------------|------------------|--------------|
| 1       | 0.483               | 0.244            | 0.324        | 0.534               | 0.633            | 0.579        |
| 2       | 0.528               | 0.408            | 0.461        | 0.300               | 0.410            | 0.346        |

Performance of the federated model trained with FedProx:

| Dataset | Precision (Class 1) | Recall (Class 1) | F1 (Class 1) | Precision (Class 2) | Recall (Class 2) | F1 (Class 2) |
|---------|---------------------|------------------|--------------|---------------------|------------------|--------------|
| 1       | 0.384               | 0.331            | 0.356        | 0.557               | 0.584            | 0.570        |
| 2       | 0.318               | 0.556            | 0.404        | 0.293               | 0.356            | 0.321        |

Performance of the federated model trained with FedGKT:

| Dataset | Precision (Class 1) | Recall (Class 1) | F1 (Class 1) | Precision (Class 2) | Recall (Class 2) | F1 (Class 2) |
|---------|---------------------|------------------|--------------|---------------------|------------------|--------------|
| 1       | 0.225               | 0.463            | 0.302        | 0.523               | 0.454            | 0.486        |
| 2       | 0.295               | 0.779            | 0.428        | 0.297               | 0.174            | 0.220        |

For easy comparison, the difference in F1 score between the centralized model and each federated model is shown in the table below. A negative value means that the federated model achieved a higher F1 than the baseline (i.e. the centralized model), signifying that the federated model performed better.


| Dataset | ΔF1 (Centralized vs. FedProx), Class 1 | ΔF1 (Centralized vs. FedProx), Class 2 | ΔF1 (Centralized vs. FedGKT), Class 1 | ΔF1 (Centralized vs. FedGKT), Class 2 |
|---------|----------------------------------------|----------------------------------------|---------------------------------------|---------------------------------------|
| 1       | -0.032                                 | 0.009                                  | 0.021                                 | 0.094                                 |
| 2       | 0.056                                  | 0.025                                  | 0.032                                 | 0.127                                 |

What is reported in the last three tables is the performance of the best performing federated models (FedProx and FedGKT). As shown in the last section, the Director in the Synergos Orchestration component was used to tune the hyperparameters; in total, 86 models were trained during the tuning process. The table below shows the mean and standard deviation of ΔF1 across all the models trained by the Director.


| Dataset |      | ΔF1 (Centralized vs. FedProx), Class 1 | ΔF1 (Centralized vs. FedProx), Class 2 | ΔF1 (Centralized vs. FedGKT), Class 1 | ΔF1 (Centralized vs. FedGKT), Class 2 |
|---------|------|----------------------------------------|----------------------------------------|---------------------------------------|---------------------------------------|
| 1       | mean | 0.026                                  | 0.044                                  | -0.047                                | 0.333                                 |
| 1       | std  | 0.041                                  | 0.032                                  | 0.073                                 | 0.143                                 |
| 2       | mean | 0.075                                  | 0.025                                  | 0.037                                 | 0.100                                 |
| 2       | std  | 0.013                                  | 0.003                                  | 0.054                                 | 0.059                                 |

It can be observed that the federated models trained with FedProx attain performance comparable to the baseline model (ΔF1 is small), while maintaining the confidentiality of individual data sources’ data. The best performing model trained with FedGKT also achieves performance close to the baseline. Nevertheless, FedGKT performs worse than FedProx on Class 2 across both datasets, and the models trained with FedGKT exhibit higher variance in performance. This could be because Class 2 is generally a smaller class (compared to Classes 0 and 1), and the simple model trained locally is not able to extract meaningful features for it.

 

Next Step

We have seen that the federated models do achieve a similar level of performance as that of the centralized baseline model, which serves the current phase’s objective. We have also seen how Synergos can be used to train and tune federated models with ease.

In their application of predictive analytics, Seagate originally used a GBDT model in their experiments, which highlights that machine learning in production is not restricted to deep neural network models. We are working on adding federated GBDT support to Synergos to extend the capabilities of the platform. This support also goes beyond this collaboration: it will provide Synergos users with a greater variety of models besides the current deep neural network-based ones.


Acknowledgments:

We’d like to thank the Seagate Technology team (Hamza Jeljeli, Ed Yasutake, and Saravanan Nagarajan), who provided the use case and supported the launch of our Federated Learning platform Synergos.

