MONet-FL Tutorial#

Federated Learning with MONet Bundle and NVIDIA Flare#

This tutorial guides you through running an nnUNet experiment in a Federated Learning context, using the MONet Bundle and the NVIDIA Flare API. The goal is to train a model on a dataset that is distributed across multiple sites in a secure and privacy-preserving manner, relying on the Federated Learning capabilities of NVIDIA Flare.

To showcase all the functionalities of the MONet Bundle in Federated Learning, we will use the Decathlon Spleen dataset and consider the Spleen segmentation as our task of reference.

The Spleen dataset (Task09_Spleen) was obtained from the Medical Segmentation Decathlon challenge (Simpson et al., 2019).

Model Training#

In detail, the following steps will be performed:

  1. Dataset Preparation: The dataset in the different sites will be prepared for training, harmonizing the data and creating the necessary files for training according to the nnUNet framework.

  2. nnUNet Experiment Planning and Preprocessing: One of the sites will be selected as the main site, where the nnUNet experiment will be planned and the data will be preprocessed. This step has to be done only on one site, as the nnUNet plans will be shared with the other sites.

  3. nnUNet Preprocessing: The data will be preprocessed according to the nnUNet plan in all the other sites.

  4. nnUNet Training: The model will be trained on the data of all the sites, using the nnUNet framework and aggregating the local gradients from the different sites.

Single-Site Training#

Single-site training is also available to the clients, to test the training process and evaluate the model at each individual site, establishing a baseline before moving on to the Federated Learning phase. In this phase, Data Preparation, Experiment Planning and Preprocessing, and Training are performed on each site separately.

Cross-Site Validation#

After the training is completed, a cross-site validation will be performed to evaluate the model’s performance across different sites. This step will ensure that the model is robust and generalizes well to data from different sites. The validation will be done by using a trained model from one site and evaluating it on the validation data from the other sites. In this step, we perform the inference from a trained model with MONAI Deploy, and then compute the metrics to evaluate the model’s performance on external data.
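The tutorial does not state which metrics are computed at this stage. The Dice coefficient is a common choice for segmentation and is used here only as an illustration. A minimal sketch, assuming binary masks flattened into sequences of 0/1 values; `dice_score` is a hypothetical helper and not part of the MONet Bundle:

```python
from typing import Sequence


def dice_score(prediction: Sequence[int], reference: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (hypothetical helper)."""
    if len(prediction) != len(reference):
        raise ValueError("Masks must have the same number of voxels")
    # Voxels that are foreground in both masks.
    intersection = sum(1 for p, r in zip(prediction, reference) if p and r)
    # Foreground voxels in the prediction plus foreground voxels in the reference.
    total = sum(1 for p in prediction if p) + sum(1 for r in reference if r)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: one overlapping foreground voxel, 2 predicted and 1 reference voxel.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

In practice the masks would come from the MONAI Deploy predictions and the site's validation labels, and a library implementation would normally be preferred over this hand-written version.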

Model Deployment#

In addition, we will perform some specific steps to prepare the model for deployment:

  - Convert the trained model to the TorchScript format, supported by MONAI Deploy.

  - Package the model into a MONAI bundle.

  - Upload the trained model to MLFlow.

Prerequisites#

This tutorial assumes that you have already installed the NVIDIA Flare API and have access to a Federated Learning cluster with the Lead role.

Additionally, all the sites should have the necessary data for training ready in a known folder location (not necessarily the same location across all sites).

To install the NVIDIA Flare API and the required PyTorch version, you can use the following commands:

pip install nvflare==2.4.0rc6 light-the-torch
ltt install torch==2.6.0
pip install cryptography==42 # Required version for NVFlare
pip install "monai[all]"
pip install fire monai-nvflare==0.2.4 odict pyhocon

pip install --no-deps git+https://github.com/SimoneBendazzoli93/MONAI.git@dev
pip install git+https://github.com/SimoneBendazzoli93/nnUNet.git
pip install git+https://github.com/SimoneBendazzoli93/monai-deploy-app-sdk.git@nifti-support
pip install highdicom

pip install batchgenerators==0.25 # Required version for nnUNet

The tutorial is designed to be executed within the MAIA Platform. If you are running it somewhere else, you will need to adapt the paths accordingly.

Configure the Federation in POC Mode#

Before starting the tutorial, we need to configure the federation in NVFlare POC (Proof of Concept) mode. This mode allows the federation to be configured within a single site, which is useful for testing and development purposes.

[ ]:
%%bash
export PATH=$PATH:$HOME/.local/bin

#export NVFLARE_POC_WORKSPACE=$(pwd)/"NVFlare_POC"
export NVFLARE_POC_WORKSPACE="/home/maia-user/Data/NVFlare_POC"
nvflare poc prepare
[ ]:
%%bash

#export NVFLARE_POC_WORKSPACE=$(pwd)/"NVFlare_POC"
export NVFLARE_POC_WORKSPACE="/home/maia-user/Data/NVFlare_POC"

nvflare poc start
[ ]:
import os
os.environ["NVFLARE_POC_WORKSPACE"]="/home/maia-user/Data/NVFlare_POC"

Alternatively, prepare and start the POC federation from Python:

[ ]:
from nvflare.tool.poc.poc_commands import _prepare_poc, _start_poc, _stop_poc, _clean_poc

_prepare_poc(
    ["site-1","site-2"],
    2,
    "NVFlare_POC",
)
_start_poc("NVFlare_POC",[0])

Split the Data Across Sites#

After starting the federation in POC mode, we need to split the data across the two sites. In this tutorial, we randomly split the Decathlon Spleen dataset into two parts, one for each site.

[ ]:
def subfiles(folder_path, prefix=None, suffix=None, join=True):
    import os
    if prefix is None:
        prefix = ""
    if suffix is None:
        suffix = ""
    files = []
    for root, dirs, file in os.walk(folder_path):
        for f in file:
            if f.startswith(prefix) and f.endswith(suffix):
                if join:
                    files.append(os.path.join(root, f))
                else:
                    files.append(f)
    return files
[ ]:
import os
import tempfile
from monai.apps import DecathlonDataset
[ ]:
os.environ["MONAI_DATA_DIRECTORY"] = "/home/maia-user/Data"
[ ]:
directory = os.environ.get("MONAI_DATA_DIRECTORY")
if directory is not None:
    os.makedirs(directory, exist_ok=True)
root_dir = tempfile.mkdtemp() if directory is None else directory
print(root_dir)
[ ]:
DecathlonDataset(root_dir=root_dir, task="Task09_Spleen", section="training", download=True, cache_num=1)
[ ]:
from random import shuffle
import shutil
from pathlib import Path
image_dir = os.environ["MONAI_DATA_DIRECTORY"] + "/Task09_Spleen/imagesTr"
label_dir = os.environ["MONAI_DATA_DIRECTORY"] + "/Task09_Spleen/labelsTr"

# Site 1 will have the "Decathlon" dataset format
Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath("data/site-1/imagesTr").mkdir(parents=True, exist_ok=True)
Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath("data/site-1/labelsTr").mkdir(parents=True, exist_ok=True)

# Site 2 will have the "Subfolders" dataset format
Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath("data/site-2").mkdir(parents=True, exist_ok=True)

images = subfiles(image_dir, prefix="spleen", suffix=".nii.gz")
labels = subfiles(label_dir, prefix="spleen", suffix=".nii.gz")

print(f"Images: {len(images)}")
print(f"Labels: {len(labels)}")

shuffle(images)

for image in images[:len(images)//2]:
    shutil.copy(image, Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath("data/site-1/imagesTr"))
    shutil.copy(image.replace("imagesTr", "labelsTr"), Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath("data/site-1/labelsTr"))

for image in images[len(images)//2:]:
    id = image.split("/")[-1].split(".")[0]
    Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath(f"data/site-2/{id}").mkdir(parents=True, exist_ok=True)
    shutil.copy(image, Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath(f"data/site-2/{id}/{id}_CT.nii.gz"))
    shutil.copy(image.replace("imagesTr", "labelsTr"), Path(os.environ["NVFLARE_POC_WORKSPACE"]).joinpath(f"data/site-2/{id}/{id}_label.nii.gz"))
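The cell above calls `shuffle` without a seed, so the assignment of cases to sites changes on every run. If a reproducible split is preferred, a seeded variant could look like the sketch below. The helper name `split_cases` and the seed value are assumptions made for illustration; they are not part of the tutorial code.

```python
import random
from typing import List, Tuple


def split_cases(cases: List[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of `cases` with a fixed seed and cut it in half."""
    # Sort first so the result does not depend on the order returned by os.walk.
    shuffled = sorted(cases)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


site_1, site_2 = split_cases([f"spleen_{i}.nii.gz" for i in range(10)], seed=42)
# No case may be assigned to both sites.
assert not set(site_1) & set(site_2)
print(len(site_1), len(site_2))
```

The two returned lists can then be copied into the site folders exactly as in the loops above.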

Download MONet Bundle#

Next, we download the MONet Bundle, one per client:

[ ]:
%%bash

mkdir -p ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-1
mkdir -p ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-2

wget  https://raw.githubusercontent.com/SimoneBendazzoli93/MONet-Bundle/main/MONetBundle.zip -O ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-1/MONetBundle.zip
unzip -o ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-1/MONetBundle.zip -d ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-1
rm ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-1/MONetBundle.zip

wget  https://raw.githubusercontent.com/SimoneBendazzoli93/MONet-Bundle/main/MONetBundle.zip -O ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-2/MONetBundle.zip
unzip -o ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-2/MONetBundle.zip -d ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-2
rm ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-2/MONetBundle.zip

cp -r ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/site-2/MONetBundle/src /home/maia-user/shared/src/Spleen/

FL Cluster Authentication#

[ ]:
from nvflare.fuel.flare_api.flare_api import new_secure_session
sess = new_secure_session(
    "admin@nvidia.com",
    "/home/maia-user/Data/NVFlare_POC/example_project/prod_00/admin@nvidia.com"
)
[ ]:
print(sess.get_system_info())

List Jobs#

[ ]:
jobs = sess.list_jobs()
[ ]:
jobs[-1]
[ ]:
job_id = jobs[-1]['job_id']

Terminate Jobs#

[ ]:
sess.abort_job(job_id)

Job Preparation#

The first step in the Federated Learning process is to prepare the job configuration files. The job configuration files contain the necessary information to run the job on the Federated Learning cluster, such as the job type, the resources required, and the parameters for the job execution.

The job configuration files are automatically generated by the script nvflare_generate_job_configs, which is installed together with this package. The script takes as input the client-specific configuration, together with the experiment-specific configuration, and generates the job configuration files for each client.

Client Configuration#

The client-specific configuration file should be in the following format:

data_dir: "<DATASET_FOLDER>"
modality_dict:
  ct: "<CT_SUFFIX>"
  label: "<SEG_MASK_SUFFIX>"
dataset_format: "<DATASET_FORMAT>"
patient_id_in_file_identifier: True
nnunet_root_folder: "<NNUNET_ROOT_FOLDER>"
client_name: "<CLIENT_NAME>"
subfolder_suffix: "<SUBFOLDER_SUFFIX>" [OPTIONAL]
bundle_root: "<BUNDLE_ROOT>" [OPTIONAL]
# Optional parameters for MONAI Deploy Inference, we discuss them in the Model Deployment section
app_path: ""
app_model_path: ""
app_output_path: ""
model_name: ""

where:

dataset_format should refer to one of these three formats, according to the data_dir structure:

  1. subfolders: The dataset is organized in subfolders, where each subfolder corresponds to a subject and contains the images and labels for that subject.

[Dataset_folder]
      [Subject_0]
          - Subject_0_image0.nii.gz    # Subject_0 modality 0
          - Subject_0_image1.nii.gz    # Subject_0 modality 1
          - Subject_0_mask.nii.gz      # Subject_0 semantic segmentation mask
      [Subject_1]
          - Subject_1_image0.nii.gz    # Subject_1 modality 0
          - Subject_1_image1.nii.gz    # Subject_1 modality 1
          - Subject_1_mask.nii.gz      # Subject_1 semantic segmentation mask
      ...
  2. decathlon: The dataset is organized in the format of the Medical Segmentation Decathlon challenge, where the images and labels are stored in separate folders.

[Dataset_folder]
      [imagesTr]
          - Subject_0_image0.nii.gz    # Subject_0 modality 0
          - Subject_0_image1.nii.gz    # Subject_0 modality 1
          - Subject_1_image0.nii.gz    # Subject_1 modality 0
          - Subject_1_image1.nii.gz    # Subject_1 modality 1
      [labelsTr]
          - Subject_1_mask.nii.gz      # Subject_1 semantic segmentation mask
          - Subject_0_mask.nii.gz      # Subject_0 semantic segmentation mask
      ...
  3. nnunet: The dataset has already been prepared according to the nnUNet framework, with the images and labels stored in separate folders.

[nnUNet_raw]
    [DatasetXYZ_TaskName]  # THIS IS THE DATASET FOLDER
        dataset.json
        [imagesTr]
            - Subject_0_image0.nii.gz    # Subject_0 modality 0
            - Subject_0_image1.nii.gz    # Subject_0 modality 1
            - Subject_1_image0.nii.gz    # Subject_1 modality 0
            - Subject_1_image1.nii.gz    # Subject_1 modality 1
        [labelsTr]
            - Subject_1_mask.nii.gz      # Subject_1 semantic segmentation mask
            - Subject_0_mask.nii.gz      # Subject_0 semantic segmentation mask
      ...
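The three layouts can be told apart by the folders and files they contain. The sketch below is a hypothetical helper (`guess_dataset_format` is not provided by the package) that guesses a value for dataset_format from the directory structure, assuming the layouts exactly as drawn above:

```python
import tempfile
from pathlib import Path


def guess_dataset_format(data_dir: str) -> str:
    """Guess the dataset_format value for `data_dir` (hypothetical helper)."""
    root = Path(data_dir)
    has_split_folders = (root / "imagesTr").is_dir() and (root / "labelsTr").is_dir()
    # The nnunet layout is the decathlon layout plus a dataset.json file.
    if has_split_folders and (root / "dataset.json").is_file():
        return "nnunet"
    if has_split_folders:
        return "decathlon"
    # Anything else is assumed to be one subfolder per subject.
    return "subfolders"


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "imagesTr").mkdir()
    (Path(tmp) / "labelsTr").mkdir()
    # No dataset.json is present, so this is not reported as the nnunet layout.
    print(guess_dataset_format(tmp))
```

This is only a convenience check before writing the client configuration; the value in the YAML file remains the one actually used.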

nnunet_root_folder should refer to the root folder used by the nnUNet framework, where the nnUNet experiments are stored. For the subfolders and decathlon dataset formats, this folder is created during the dataset preparation step. For the nnunet dataset format, this folder should contain the nnUNet experiments, with the following structure:

[nnunet_root_folder]
    [nnUNet_raw_data_base]
        [DatasetXYZ_TaskName]
            dataset.json
            [imagesTr]
                - Subject_0_image0.nii.gz    # Subject_0 modality 0
                - Subject_0_image1.nii.gz    # Subject_0 modality 1
                - Subject_1_image0.nii.gz    # Subject_1 modality 0
                - Subject_1_image1.nii.gz    # Subject_1 modality 1
            [labelsTr]
                - Subject_1_mask.nii.gz      # Subject_1 semantic segmentation mask
                - Subject_0_mask.nii.gz      # Subject_0 semantic segmentation mask
          ...

modality_dict is a dictionary that maps the modality names to the file suffixes. The suffixes are used to identify the files that correspond to the different modalities in the dataset. For example, if the CT images have the suffix _CT.nii.gz, the entry in the modality_dict should be ct: "_CT.nii.gz".

patient_id_in_file_identifier is a flag used to specify if the patient ID is included in the file name. If this flag is set to True, the patient ID will be extracted from the file name. If this flag is set to False, the patient ID will be extracted from the file path. If set to False, the filename should only contain the modality suffix.

client_name is a unique identifier for the client.

subfolder_suffix is an optional parameter that specifies the suffix of the subfolders that contain the images and labels for each subject. This parameter is used when the dataset is organized in subfolders, and the subfolders have a specific suffix that needs to be removed to extract the patient ID.
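To make the interplay of modality_dict, patient_id_in_file_identifier and subfolder_suffix concrete, the sketch below mimics the behaviour described above. It is an illustration under stated assumptions, not the package's implementation; `extract_patient_id` and the example paths are hypothetical.

```python
from pathlib import Path


def extract_patient_id(file_path: str, modality_suffix: str,
                       patient_id_in_file_identifier: bool = True,
                       subfolder_suffix: str = "") -> str:
    """Derive a patient ID from a file path (hypothetical helper)."""
    path = Path(file_path)
    if not path.name.endswith(modality_suffix):
        raise ValueError(f"{path.name} does not end with {modality_suffix!r}")
    if patient_id_in_file_identifier:
        # The ID is the file name with the modality suffix stripped.
        return path.name[: len(path.name) - len(modality_suffix)]
    # Otherwise the ID comes from the subject folder, minus the optional suffix.
    folder = path.parent.name
    if subfolder_suffix and folder.endswith(subfolder_suffix):
        folder = folder[: len(folder) - len(subfolder_suffix)]
    return folder


# ID taken from the file name.
print(extract_patient_id("data/site-2/spleen_10/spleen_10_CT.nii.gz", "_CT.nii.gz"))
# ID taken from the folder name, after removing an assumed "_scan" folder suffix.
print(extract_patient_id("data/spleen_10_scan/CT.nii.gz", "CT.nii.gz", False, "_scan"))
```

Both calls resolve to the same subject, which is the property the configuration options are meant to guarantee across differently organized sites.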

bundle_root is an optional parameter that specifies the root folder where the MONet Bundle is stored.

app_path is an optional parameter that specifies the path to the MONAI Deploy application to be used for inference. This parameter is used when the model is deployed using MONAI Deploy.

app_model_path is an optional parameter that specifies the path to the TorchScript model file to be used by the MONAI Deploy application. This parameter is used when the model is deployed using MONAI Deploy.

app_output_path is an optional parameter that specifies the path where the output of the MONAI Deploy Inference will be stored. This parameter is used when the model is deployed using MONAI Deploy.

model_name is an optional parameter that specifies the name of the model to be used by the MONAI Deploy application. This parameter is used when the model is deployed using MONAI Deploy.

Experiment Configuration#

The experiment-specific configuration file should be in the following format:

dataset_name_or_id: "<DATASET_NAME_OR_ID>"
experiment_name: "<EXPERIMENT_NAME>"
tracking_uri: "<TRACKING_URI>"
mlflow_token: "<MLFLOW_TOKEN>" [OPTIONAL]
nnunet_trainer: "<NNUNET_TRAINER>" [OPTIONAL]
num_rounds: "<NUM_ROUNDS>"
start_round: "<START_ROUND>"
local_epochs: "<LOCAL_EPOCHS>"
server_bundle_root: "<SERVER_BUNDLE_ROOT>" [OPTIONAL]
modality_list:
  - "<MODALITY_1>"
  - "<MODALITY_2>"
label_dict:
  class1: 1
# Extra parameters for the MONet Bundle
#bundle_extra_config:
#  resume_epoch: "latest"  # Optional, used to resume training from a specific epoch
#  region_based: True # Optional, used to enable region-based training
#  is_federated: True # Optional, used to enable federated training when preparing the bundle. Set to False if you want to prepare the bundle for single-site training

where:

dataset_name_or_id and experiment_name are used as a reference to the nnUNet Dataset ID and Experiment Name, respectively. These values are used to identify the nnUNet experiment in the nnUNet framework. Experiment Name is also used to identify the experiment in the MLFlow server, and to generate the zipped nnUNet model file (as <Experiment Name>.zip).

mlflow_token and tracking_uri are used to connect to the MLFlow server, where the experiments are logged, and the trained models are uploaded.

nnunet_trainer is an optional parameter that specifies the nnUNet trainer to be used for training. If this parameter is not specified, the default nnUNet trainer will be used.

num_rounds is the number of rounds to be executed in the Federated Learning process. This parameter is used to control the number of rounds of training that will be performed on the Federated Learning cluster.

start_round is the round number from which to start the training. This parameter is used to control the starting point of the training process.

local_epochs is the number of local epochs to be executed on each client. This parameter is used to control the number of epochs that will run on each client before the local model is sent to the server for aggregation.

server_bundle_root is the root folder where the MONAI Bundle is stored. This parameter is used to specify the location of the MONAI Bundle that will be used for model deployment.

modality_list is a list of the modalities that will be used for training. This parameter is used to specify the modalities that will be used in the nnUNet experiment.

label_dict is a dictionary that maps the class names to the class IDs. The class IDs are used to identify the different classes in the dataset. For example, if the dataset has two classes, Class1 and Class2, with IDs 1 and 2, respectively, the entry in the label_dict should be label_dict: Class1: 1, Class2: 2.
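A quick sanity check of label_dict before generating the jobs can catch typos early. The rules in the sketch below (positive integer IDs, no duplicates, 0 left free for the background class) are assumptions made for illustration, and `check_label_dict` is a hypothetical helper rather than a check performed by the package:

```python
from typing import Dict, List


def check_label_dict(label_dict: Dict[str, int]) -> List[str]:
    """Return a list of problems found in `label_dict` (empty if none)."""
    problems: List[str] = []
    seen: Dict[int, str] = {}
    for name, class_id in label_dict.items():
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(class_id, bool) or not isinstance(class_id, int) or class_id < 1:
            problems.append(f"{name}: class ID must be a positive integer, got {class_id!r}")
            continue
        if class_id in seen:
            problems.append(f"{name}: class ID {class_id} is already used by {seen[class_id]}")
        else:
            seen[class_id] = name
    return problems


print(check_label_dict({"Spleen": 1}))
print(check_label_dict({"Class1": 1, "Class2": 1}))
```

An empty list means the dictionary passed these assumed checks.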

Prepare the Job Configuration Files#

To prepare the job configuration files, call the following function:

monai.nvflare.nvflare_generate_job_configs.generate_configs(client_files, experiment_file, script_dir, job_dir)

where:

  - client_files is the list of client-specific configuration files, one per client.

  - experiment_file is the experiment-specific configuration file.

  - script_dir is the directory where the job scripts and Python files are stored.

  - job_dir is the directory where the job configuration files will be stored.
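Since the function receives one configuration file per client, it can help to build that list in one place and fail early if a file is missing. The sketch below assumes the client files are stored as `<root>/Clients/<client_name>.yaml`, which is the layout used in this tutorial; `collect_client_configs` is a hypothetical helper, and the temporary folder only stands in for the shared folder.

```python
import tempfile
from pathlib import Path
from typing import List


def collect_client_configs(root_folder: str, clients: List[str]) -> List[str]:
    """Return the client YAML paths, raising if any of them is missing."""
    paths = [Path(root_folder) / "Clients" / f"{client}.yaml" for client in clients]
    missing = [str(path) for path in paths if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing client configuration files: {missing}")
    return [str(path) for path in paths]


with tempfile.TemporaryDirectory() as root:
    clients_dir = Path(root) / "Clients"
    clients_dir.mkdir()
    for name in ("site-1", "site-2"):
        (clients_dir / f"{name}.yaml").write_text(f"client_name: {name}\n")
    client_files = collect_client_configs(root, ["site-1", "site-2"])
    print([Path(path).name for path in client_files])
```

The resulting list is what would be passed as the first argument to generate_configs.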

[ ]:
import sys
from pathlib import Path

ROOT_FOLDER = "/home/maia-user/shared"
sys.path.append(ROOT_FOLDER)

Path(ROOT_FOLDER).joinpath("Experiments").mkdir(parents=True, exist_ok=True)
Path(ROOT_FOLDER).joinpath("Clients").mkdir(parents=True, exist_ok=True)
[ ]:
%%writefile /home/maia-user/shared/Experiments/Spleen.yaml
dataset_name_or_id:
    site-1:
        id: "009"
        name: "Task09_Spleen-1"
    site-2:
        id: "010"
        name: "Task09_Spleen-2"
experiment_name: "Task09_Spleen"
tracking_uri: "http://localhost:5000"
nnunet_trainer: "nnUNetTrainer_10epochs"
num_rounds: 10
start_round: 0
local_epochs: 1
server_bundle_root: "/workspace/FedSpleen_Bundle"
modality_list:
- "CT"
label_dict:
    Spleen: 1
bundle_extra_config:
  is_federated: False
  resume_epoch: '"latest"'
[ ]:
%%writefile /home/maia-user/shared/Clients/site-1.yaml
data_dir: "/home/maia-user/Data/NVFlare_POC/data/site-1"
modality_dict:
  ct: ".nii.gz"
  label: ".nii.gz"
dataset_format: "decathlon"
patient_id_in_file_identifier: True
nnunet_root_folder: "/home/maia-user/Data/NVFlare_POC/nnUNet"
client_name: "site-1"
bundle_root: "/home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-1/MONetBundle"
app_path: ""
app_model_path: ""
app_output_path: ""
model_name: ""
[ ]:
%%writefile /home/maia-user/shared/Clients/site-2.yaml
data_dir: "/home/maia-user/Data/NVFlare_POC/data/site-2"
modality_dict:
  ct: "_CT.nii.gz"
  label: "_label.nii.gz"
dataset_format: "subfolders"
patient_id_in_file_identifier: True
nnunet_root_folder: "/home/maia-user/Data/NVFlare_POC/nnUNet"
client_name: "site-2"
bundle_root: "/home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-2/MONetBundle"
app_path: ""
app_model_path: ""
app_output_path: ""
model_name: ""
[ ]:
experiment = "Spleen"
clients = [
    "site-1",
    "site-2"
]

Path(ROOT_FOLDER).joinpath("src").joinpath(experiment).mkdir(parents=True, exist_ok=True)

[ ]:
%%bash

# -f avoids an error (and a failed cell) when the folder does not exist yet
rm -rf /home/maia-user/shared/Jobs/Spleen/
[ ]:
from monai.nvflare.nvflare_generate_job_configs import generate_configs

generate_configs(
    [str(Path(ROOT_FOLDER).joinpath("Clients",client+".yaml")) for client in clients],
    str(Path(ROOT_FOLDER).joinpath("Experiments",experiment+".yaml")),
    str(Path(ROOT_FOLDER).joinpath("src",experiment)),
    str(Path(ROOT_FOLDER).joinpath("Jobs",experiment)),
    tasks = ["prepare_bundle","train"]

)
[ ]:
JOB_DIR=str(Path(ROOT_FOLDER).joinpath("Jobs",experiment))

0.0 Check Python Packages#

This initial step is to check that the required Python packages are installed and available in the environment. This is done by importing the necessary packages and checking their versions.

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "check_client_packages"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")))
[ ]:
sess.monitor_job(job_id)
[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
import json
from pathlib import Path

with open(Path(job_dir).joinpath("workspace","package_report","package_report.json"),"r") as f:
    package_report = json.load(f)
[ ]:
print(package_report)
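Beyond printing the report, it is often useful to see which packages differ between sites. The structure of package_report.json is not documented here, so the sketch below assumes a mapping of the form `{client: {package: version}}`. Both that structure and the helper `find_version_mismatches` are assumptions and may need adapting to the real report.

```python
from typing import Dict


def find_version_mismatches(report: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return {package: {client: version}} for packages that differ across clients."""
    by_package: Dict[str, Dict[str, str]] = {}
    for client, packages in report.items():
        for package, version in packages.items():
            by_package.setdefault(package, {})[client] = version
    return {
        package: versions
        for package, versions in by_package.items()
        # Flag packages with differing versions or missing on some client.
        if len(set(versions.values())) > 1 or len(versions) < len(report)
    }


# Toy report with made-up version strings, for illustration only.
example_report = {
    "site-1": {"pkg_a": "1.0", "pkg_b": "2.0"},
    "site-2": {"pkg_a": "1.0", "pkg_b": "2.1"},
}
print(find_version_mismatches(example_report))
```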

0.1 Prepare Dataset#

In this step, the dataset in the different sites will be prepared for training, harmonizing the data structures and creating the necessary files for training according to the nnUNet framework.

This step internally uses the nnUNetV2Runner from MONAI, calling runner.convert_dataset().

Before running this step, you can start a local MLFlow server, to log the task and the FL experiments:

mkdir MLFlow
cd MLFlow
mlflow server

Submit the Job#

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "prepare"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")))
[ ]:
sess.monitor_job(job_id)
[ ]:
client_id = "site-1"

To monitor the job, or print the logs from either the server or the client side, you can use the following commands:

[ ]:
#print(sess.api.do_command(f"cat server log.txt")['data'][0]['data'])
print(sess.api.do_command(f"cat {client_id} {job_id}/log.txt")['data'][0]['data'])

Download the Job Results#

When the job is completed, you can download the results, which, for the Prepare Dataset job, include the dataset.json file with the information about the dataset. In detail, for each client, the dataset.json file lists the dataset files and whether each of them is valid.

[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
import json
from pathlib import Path

with open(Path(job_dir).joinpath("workspace","prepare","data_dict.json"),"r") as f:
    dataset_dict = json.load(f)

To inspect the dataset, run the following command:

[ ]:
client_id = "site-1"
[ ]:
verified = True

for case in dataset_dict[client_id]["training"]:
    for key in case:
        if key.endswith("_is_file") and not case[key]:
            file = case[key[:-len("_is_file")]]
            print(f"Error: {file} is not a valid file!")
            verified = False
if verified:
    print(f"Dataset successfully verified for client {client_id}")
[ ]:
print(len(dataset_dict[client_id]["training"]))
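The same check can be wrapped in a function so that it is easy to repeat for every client. The sketch below relies only on the structure used in the cell above (a "training" list per client, where each `<key>_is_file` flag refers to the path stored under `<key>`). The key names in the toy dictionary and the helper `find_missing_files` are made up for illustration.

```python
from typing import Dict, List


def find_missing_files(dataset_dict: Dict, client_id: str) -> List[str]:
    """Return the paths flagged as invalid for one client (hypothetical helper)."""
    flag_suffix = "_is_file"
    missing: List[str] = []
    for case in dataset_dict[client_id]["training"]:
        for key, is_file in case.items():
            if key.endswith(flag_suffix) and not is_file:
                # The path is stored under the same key without the "_is_file" suffix.
                missing.append(case[key[: -len(flag_suffix)]])
    return missing


# Toy dictionary with assumed key names, for illustration only.
toy_dict = {
    "site-1": {"training": [
        {"image": "a.nii.gz", "image_is_file": True,
         "label": "a_seg.nii.gz", "label_is_file": False},
    ]},
}
for client in toy_dict:
    print(client, find_missing_files(toy_dict, client))
```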

1. Plan and Preprocess#

After the dataset has been prepared, the nnUNet experiment has to be planned and the data preprocessed. This step has to be done only on one site, as the nnUNet plans will be shared with the other sites.

The steps to plan and preprocess the nnUNet experiment are the following:

  1. Run the plan_and_preprocess job on the chosen site.

  2. Extract the nnUNet plans from the job results.

  3. Share the nnUNet plans with the other sites.

  4. Run the preprocess job on the other sites.

This step internally uses the nnUNetV2Runner from MONAI, calling runner.plan_and_process().

Submit the Job#

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "plan_and_preprocess"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),
                 {"site-1":""} # Select only site-1 for this task
                 )
[ ]:
sess.monitor_job(job_id)
[ ]:
client_id = "site-1"
#print(sess.api.do_command(f"tail server log.txt")['data'][0]['data'])
print(sess.api.do_command(f"cat {client_id} {job_id}/log.txt")['data'][0]['data'])
[ ]:
job_dir = sess.download_job_result(job_id)

Inspect nnUNetPlans.json#

The nnUNet plans are stored in the nnUNetPlans.json file, which contains the configuration for the nnUNet experiment. The file can be found in the workspace/nnUNet_preprocessing folder of the job results.

[ ]:
import json

with open(Path(job_dir).joinpath("workspace","nnUNet_preprocessing","nnUNetPlans.json"),"r") as f:
    nnunet_plans = json.load(f)
[ ]:
print(json.dumps(nnunet_plans["site-1"],indent=4))

Copy nnUNetPlans into Transfer Folder#

To share the nnUNet plans with the other sites, copy the nnUNetPlans.json file into the src/Spleen folder.

[ ]:
with open(Path("/home/maia-user/shared/src/Spleen/nnUNetPlans.json"),"w") as f:
    json.dump(nnunet_plans["site-1"],f)

And regenerate the job configuration files, so that the other sites can use the nnUNet plans for preprocessing:

[ ]:
generate_configs(
    [str(Path(ROOT_FOLDER).joinpath("Clients",client+".yaml")) for client in clients],
    str(Path(ROOT_FOLDER).joinpath("Experiments",experiment+".yaml")),
    str(Path(ROOT_FOLDER).joinpath("src",experiment)),
    str(Path(ROOT_FOLDER).joinpath("Jobs",experiment)),

)

2. Preprocess#

After the nnUNet plans have been shared with the other sites, the data has to be preprocessed according to the nnUNet plans. This step has to be done on all the sites, except the one where the nnUNet experiment has been planned.

Submit Job#

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "preprocess"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),
                 {"site-2":""} # Select only site-2 for this task
                 )
[ ]:
sess.monitor_job(job_id)
[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
#print(sess.api.do_command(f"cat server log.txt")['data'][0]['data'])
# The preprocess job ran on site-2, so read the log of that client
print(sess.api.do_command(f"cat site-2 {job_id}/log.txt")['data'][0]['data'])
[ ]:
import json

with open(Path(job_dir).joinpath("workspace","nnUNet_preprocessing","nnUNetPlans.json"),"r") as f:
    nnunet_plans = json.load(f)
[ ]:
print(json.dumps(nnunet_plans["site-2"],indent=4))
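Because site-2 is preprocessed with the plans shared by site-1, the two plan dictionaries are expected to agree. A small comparison such as the sketch below can confirm this. `differing_keys` is a hypothetical helper, and ignoring the `dataset_name` entry is an assumption, on the grounds that the dataset name legitimately differs between sites.

```python
from typing import Any, Dict, Iterable, List

_MISSING = object()  # sentinel so that a missing key differs from an explicit None


def differing_keys(plans_a: Dict[str, Any], plans_b: Dict[str, Any],
                   ignore: Iterable[str] = ("dataset_name",)) -> List[str]:
    """Return the top-level keys whose values differ between two plan dictionaries."""
    keys = (set(plans_a) | set(plans_b)) - set(ignore)
    return sorted(
        key for key in keys
        if plans_a.get(key, _MISSING) != plans_b.get(key, _MISSING)
    )


# Toy dictionaries standing in for the two sites' plans.
plans_site_1 = {"dataset_name": "A", "configurations": {"3d_fullres": {"batch_size": 2}}}
plans_site_2 = {"dataset_name": "B", "configurations": {"3d_fullres": {"batch_size": 2}}}
print(differing_keys(plans_site_1, plans_site_2))
```

In the notebook, the site-1 plans would have to be kept in a separate variable, since nnunet_plans is overwritten by the cell above.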

3.0 Single-Site nnUNet Training#

Once the dataset has been prepared and preprocessed following the nnUNet plans, the nnUNet training is ready to begin. In this phase, we will train a simple nnUNet model on a single site using the MONet Bundle.

First, prepare the MONet Bundle and verify its configuration for the training:

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "prepare_bundle"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),
                 {"site-1":""} # Select only site-1 for this task
                 )
[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
import json
from pathlib import Path

with open(Path(job_dir).joinpath("workspace","nnUNet_prepare_bundle","bundle_config.json"),"r") as f:
    bundle_config = json.load(f)
[ ]:
print(json.dumps(bundle_config["site-1"], indent=4))

Then, you can start the training job:

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "train"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),
                 {"site-1":""} # Select only site-1 for this task
                 )

You can monitor the training on MLFlow, or:

[ ]:
sess.monitor_job(job_id)
[ ]:
# Optional: run this cell only if you need to stop the training job
sess.abort_job(job_id)

Once the training is completed, the validation will be performed on the validation set, and the results will be logged in MLFlow.

[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
from pathlib import Path
import json

with open(Path(job_dir).joinpath("workspace","nnUNet_train","val_summary.json"),"r") as f:
    print(json.load(f))

To check the training logs, run the following command:

[ ]:
print(sess.api.do_command(f"cat {client_id} {job_id}/log.txt")['data'][0]['data'])

Convert the trained MONet Bundle to a TorchScript model#

This step is needed to run the Cross-Site Validation and the Model Deployment steps, as the MONAI Deploy Inference requires a TorchScript model.

IMPORTANT! The native dynamic-network-architectures package, used by nnUNet, is not compatible with the TorchScript format.

Install this version:

pip install --force-reinstall --no-deps git+https://github.com/SimoneBendazzoli93/dynamic-network-architectures.git
[ ]:
%%bash

python convert_ckpt_to_ts.py --bundle_root /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-1/MONetBundle \
                             --checkpoint_name checkpoint_epoch=10.pt \
                             --fold 0 \
                             --nnunet_trainer_name nnUNetTrainer_10epochs

Cross-Site Evaluation#

To run the cross-site evaluation, we will use the trained model from one site and evaluate it on the validation data from the other sites. As a requirement, you need to have the MONAI Deploy package installed, including the ai_spleen_nifti_nnunet_seg_app, which is used to run the inference on the validation NIFTI data.

Update the site-specific configuration files to include the MONAI Deploy application path and the TorchScript model path:

[ ]:
%%bash
# Clone the repo shallowly


cd /home/maia-user/Data/NVFlare_POC/example_project/prod_00/site-2

git clone --depth 1 --filter=blob:none --sparse --branch nifti-support https://github.com/SimoneBendazzoli93/monai-deploy-app-sdk.git
cd monai-deploy-app-sdk

# Enable sparse checkout and set the folder
git sparse-checkout set examples/apps/ai_spleen_nifti_nnunet_seg_app

[ ]:
%%bash
mkdir -p /home/maia-user/Data/CrossSite_Validation/spleen_site-2/spleen_site-1_predictions
[ ]:
%%writefile /home/maia-user/shared/Clients/site-2.yaml
data_dir: "/home/maia-user/Data/NVFlare_POC/data/site-2"
modality_dict:
  ct: "_CT.nii.gz"
  label: "_label.nii.gz"
dataset_format: "subfolders"
patient_id_in_file_identifier: True
nnunet_root_folder: "/home/maia-user/Data/NVFlare_POC/nnUNet"
client_name: "site-2"
bundle_root: "/home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-2/MONetBundle"
app_path: "monai-deploy-app-sdk/examples/apps/ai_spleen_nifti_nnunet_seg_app"
app_model_path: "/home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-1/MONetBundle/models/fold_0/model.ts"
app_output_path: "/home/maia-user/Data/CrossSite_Validation/spleen_site-2/spleen_site-1_predictions"
model_name: "spleen_site-1"
[ ]:
generate_configs(
    [str(Path(ROOT_FOLDER).joinpath("Clients",client+".yaml")) for client in clients],
    str(Path(ROOT_FOLDER).joinpath("Experiments",experiment+".yaml")),
    str(Path(ROOT_FOLDER).joinpath("src",experiment)),
    str(Path(ROOT_FOLDER).joinpath("Jobs",experiment)),
    tasks=["cross_site_validation"]

)
[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "cross_site_validation"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),
                 {"site-2":""} # Select only site-2 for this task
                 )

3.1 Federated Learning Training#

The Federated Learning training aggregates the local gradients from the different sites to update the global model. Training proceeds in rounds, each consisting of a number of local epochs. The data remains on the client side and only the local gradients are shared with the server, which keeps the process secure and privacy-preserving.
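The tutorial does not specify which aggregation rule the server applies. A weighted average of the client updates (in the style of FedAvg) is a common choice and is assumed in the sketch below. `federated_average`, the parameter name `conv.weight` and the weights are illustrative only.

```python
from typing import Dict, List


def federated_average(client_updates: List[Dict[str, List[float]]],
                      client_weights: List[float]) -> Dict[str, List[float]]:
    """Weighted element-wise average of per-client parameter updates."""
    if not client_updates or len(client_updates) != len(client_weights):
        raise ValueError("Need one weight per client update")
    total_weight = sum(client_weights)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive value")
    aggregated: Dict[str, List[float]] = {}
    for name, reference in client_updates[0].items():
        aggregated[name] = [
            sum(weight * update[name][i]
                for update, weight in zip(client_updates, client_weights)) / total_weight
            for i in range(len(reference))
        ]
    return aggregated


# Two sites; the second one is weighted three times as much (e.g. more training cases).
site_1_update = {"conv.weight": [1.0, 2.0]}
site_2_update = {"conv.weight": [3.0, 4.0]}
print(federated_average([site_1_update, site_2_update], [1.0, 3.0]))
```

In each round, the server would apply such an aggregation to the updates received after the clients complete their local epochs, then send the result back as the new global model.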

Prior to starting the FL training, follow the steps below to correctly configure the federation.

On the Server Side:

  1. Install the required python packages and download the MONet Bundle to the server.

  2. Upload the plans.json and dataset.json files to the server (in <BUNDLE_ROOT>/models). dataset.json can be in the form:

{
    "task": "Dataset109_Task09_Spleen",
    "dim": 3,
    "test_labels": true,
    "tensorImageSize": "4D",
    "channel_names": {"0": "ct"},
    "labels": {"background": 0, "Spleen": 1},
    "numTraining": 0,
    "numTest": 0,
    "training": [],
    "test": [],
    "file_ending": ".nii.gz"
}
  3. Specify bundle_root in the server train bundle configuration file (<BUNDLE_ROOT>/configs/train.yaml):

network_def_fl:
    _target_: $monai.apps.nnunet.nnunet_bundle.get_network_from_nnunet_plans
    plans_file: "$@bundle_root+'/models/plans.json'"
    dataset_file: "$@bundle_root+'/models/dataset.json'"
    configuration: '@nnunet_configuration'
  4. Optionally, change nnunet_trainer_class_name and nnunet_plans_identifier in the train bundle configuration file (<BUNDLE_ROOT>/configs/train.yaml).

[ ]:
%%bash

export NVFLARE_POC_WORKSPACE="/home/maia-user/Data/NVFlare_POC"

mkdir -p ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server

wget  https://raw.githubusercontent.com/SimoneBendazzoli93/MONet-Bundle/main/MONetBundle.zip -O ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server/MONetBundle.zip
unzip -o ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server/MONetBundle.zip -d ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server
rm ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server/MONetBundle.zip
[ ]:
%%bash

export NVFLARE_POC_WORKSPACE="/home/maia-user/Data/NVFlare_POC"

cp -r ${NVFLARE_POC_WORKSPACE}/nnUNet/nnUNet_preprocessed/Dataset009_site-1/nnUNetPlans.json ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server/MONetBundle/models/plans.json
cp -r ${NVFLARE_POC_WORKSPACE}/nnUNet/nnUNet_preprocessed/Dataset009_site-1/dataset.json ${NVFLARE_POC_WORKSPACE}/MONet-Bundles/server/MONetBundle/models/dataset.json
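
Before starting the federation, it can help to confirm that both files landed where the server bundle configuration expects them. The following is a minimal sketch: `check_server_bundle` is a hypothetical helper (not part of the MONet Bundle), and the keys it checks are taken from the dataset.json example above rather than from an official schema. The demonstration runs on a temporary folder with placeholder content.

```python
import json
import tempfile
from pathlib import Path

# Keys taken from the dataset.json example in this tutorial (an assumption, not a schema).
EXPECTED_DATASET_KEYS = {"channel_names", "labels", "file_ending"}


def check_server_bundle(bundle_root):
    """Return a list of problems found in <bundle_root>/models; empty if none."""
    models_dir = Path(bundle_root) / "models"
    problems = []
    for name in ("plans.json", "dataset.json"):
        path = models_dir / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
            continue
        try:
            content = json.loads(path.read_text())
        except json.JSONDecodeError as err:
            problems.append(f"invalid JSON in {path}: {err}")
            continue
        if name == "dataset.json":
            missing = EXPECTED_DATASET_KEYS - set(content)
            if missing:
                problems.append(f"{path} lacks keys: {sorted(missing)}")
    return problems


# Demonstration on a temporary folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    problems_before = check_server_bundle(tmp)  # nothing copied yet
    models = Path(tmp) / "models"
    models.mkdir()
    (models / "plans.json").write_text("{}")
    (models / "dataset.json").write_text(
        json.dumps(
            {
                "channel_names": {"0": "ct"},
                "labels": {"background": 0, "Spleen": 1},
                "file_ending": ".nii.gz",
            }
        )
    )
    problems_after = check_server_bundle(tmp)

print(problems_before)
print(problems_after)
```

In this tutorial's setup, the helper would be called with the server bundle root, /home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle.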

On the Client Side:

  1. Specify the following parameters in the Experiment configuration:

num_rounds: 100
server_bundle_root: "<SERVER_BUNDLE_ROOT>"
start_round: 0
local_epochs: 10
bundle_extra_config:
  is_federated: True
  2. Re-execute the generate_job_configs script to update the job configurations.

[ ]:
%%writefile /home/maia-user/shared/Experiments/Spleen.yaml
dataset_name_or_id:
    site-1:
        id: "009"
        name: "Task09_Spleen-1"
    site-2:
        id: "010"
        name: "Task09_Spleen-2"
experiment_name: "Task09_Spleen"
tracking_uri: "http://localhost:5000"
nnunet_trainer: "nnUNetTrainer_10epochs"
num_rounds: 10
start_round: 0
local_epochs: 1
server_bundle_root: "/home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle"
modality_list:
- "CT"
label_dict:
    Spleen: 1
bundle_extra_config:
  is_federated: True
#  resume_epoch: 5  # start_round * local_epochs




[ ]:
from monai.nvflare.nvflare_generate_job_configs import generate_configs

generate_configs(
    [str(Path(ROOT_FOLDER).joinpath("Clients",client+".yaml")) for client in clients],
    str(Path(ROOT_FOLDER).joinpath("Experiments",experiment+".yaml")),
    str(Path(ROOT_FOLDER).joinpath("src",experiment)),
    str(Path(ROOT_FOLDER).joinpath("Jobs",experiment)),
    tasks = ["prepare_bundle","train_fl_monet_bundle"]

)

Prepare Bundle for FL Training#

The MONet Bundle has to be prepared for the Federated Learning training. In this step, the train and evaluation bundle parameters will be overwritten to match the Federated Learning training configuration.

To prepare the MONet Bundle for the Federated Learning training, run the following command:

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "prepare_bundle"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),

                 )
[ ]:
sess.monitor_job(job_id)
[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
import json

with open(Path(job_dir).joinpath("workspace","nnUNet_prepare_bundle","bundle_config.json"),"r") as f:
    bundle_config = json.load(f)
[ ]:
print(json.dumps(bundle_config["site-1"], indent=4))
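
Since the prepared configuration is keyed by site name, it may be useful to see which settings differ between sites rather than inspecting one site at a time. The sketch below assumes that each site's entry is a flat dictionary; `differing_keys` is a hypothetical helper, and the keys in the example data are placeholders, not the actual content of bundle_config.json.

```python
def differing_keys(bundle_config):
    """Return the sorted top-level keys whose values are not identical across all sites."""
    sites = list(bundle_config)
    all_keys = set()
    for site in sites:
        all_keys.update(bundle_config[site])
    differing = []
    for key in sorted(all_keys):
        # A key missing at one site is reported as a difference (get() yields None).
        values = [bundle_config[site].get(key) for site in sites]
        if any(value != values[0] for value in values[1:]):
            differing.append(key)
    return differing


# Placeholder data standing in for the loaded bundle_config.json.
example_config = {
    "site-1": {"bundle_root": "/bundles/site-1", "local_epochs": 1},
    "site-2": {"bundle_root": "/bundles/site-2", "local_epochs": 1},
}
diff = differing_keys(example_config)
print(diff)
```

Site-specific paths are expected to differ; a difference in a training parameter would be worth checking before starting the FL training.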

Run FL Training#

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "train_fl_monet_bundle"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),

                 )
[ ]:
sess.monitor_job(job_id)
[ ]:
# Optional: run this cell only to stop the job before it completes
sess.abort_job(job_id)

To download the job results from the server, including the trained global model, run the following command:

[ ]:
job_dir = sess.download_job_result(job_id)

Resume FL Training from Checkpoint#

To resume the Federated Learning training from an existing checkpoint:

  1. Download the global model from the server (sess.download_job_result(job_id)) and upload it to the MONAI Bundle on the server (<BUNDLE_ROOT>/models/fold_<id>).

  2. Add the following parameters to the server configuration file:

network_def_fl:
  _target_: $monai.apps.nnunet.nnunet_bundle.get_network_from_nnunet_plans
  plans_file: "$@bundle_root+'/models/plans.json'"
  dataset_file: "$@bundle_root+'/models/dataset.json'"
  configuration: '@nnunet_configuration'
  model_ckpt: "$@ckpt_dir+'/FL_global_model.pt'"
  3. Update the Experiment Configuration file with the new start_round and num_rounds values.

start_round: <START_ROUND>
num_rounds: <INITIAL_ROUNDS - START_ROUND>
bundle_extra_config:
  resume_epoch: 20  # start_round * local_epochs
  4. Re-execute the generate_job_configs script to update the job configurations.

[ ]:
from monai.nvflare.nvflare_generate_job_configs import generate_configs

generate_configs(
    [str(Path(ROOT_FOLDER).joinpath("Clients",client+".yaml")) for client in clients],
    str(Path(ROOT_FOLDER).joinpath("Experiments",experiment+".yaml")),
    str(Path(ROOT_FOLDER).joinpath("src",experiment)),
    str(Path(ROOT_FOLDER).joinpath("Jobs",experiment)),
    tasks = ["prepare_bundle","train_fl_monet_bundle"]

)
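
The values entered in the Experiment Configuration when resuming follow directly from the formulas given above (num_rounds as the initial rounds minus start_round, resume_epoch as start_round times local_epochs). The sketch below only performs that arithmetic; `resume_settings` is a hypothetical helper and not part of the MONet Bundle.

```python
def resume_settings(initial_rounds, start_round, local_epochs):
    """Compute the resume values for the Experiment Configuration file."""
    if not 0 <= start_round < initial_rounds:
        raise ValueError("start_round must lie in [0, initial_rounds).")
    if local_epochs <= 0:
        raise ValueError("local_epochs must be positive.")
    return {
        "start_round": start_round,
        # Remaining rounds to run after the checkpoint.
        "num_rounds": initial_rounds - start_round,
        # Local epochs already completed before the checkpoint.
        "bundle_extra_config": {"resume_epoch": start_round * local_epochs},
    }


# Example: a 100-round experiment with 10 local epochs, resumed at round 2.
settings = resume_settings(initial_rounds=100, start_round=2, local_epochs=10)
print(settings)
```

With these inputs, resume_epoch evaluates to 20, matching the value shown in the configuration snippet above.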

Export Federated Model as MONet Bundle#

To export the trained Federated Learning model as a MONet Bundle, you need to download the global trained model from the server and prepare the MONet Bundle for deployment:

[ ]:
job_dir = sess.download_job_result(job_id)
[ ]:
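import os

FL_global_model = Path(job_dir).joinpath("workspace","app_server","FL_global_model.pt")
# Export the path so that the %%bash cell below can read it as $FL_global_model_path
os.environ["FL_global_model_path"] = str(FL_global_model)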
[ ]:
%%bash
cp $FL_global_model_path \
   /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-1/MONetBundle/models/fold_0/FL_global_model.pt

cp /home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle/models/dataset.json \
    /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-1/MONetBundle/models/dataset.json

cp /home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle/models/plans.json \
    /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-1/MONetBundle/models/plans.json

cp $FL_global_model_path \
   /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-2/MONetBundle/models/fold_0/FL_global_model.pt

cp /home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle/models/dataset.json \
    /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-2/MONetBundle/models/dataset.json

cp /home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle/models/plans.json \
    /home/maia-user/Data/NVFlare_POC/MONet-Bundles/site-2/MONetBundle/models/plans.json
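
The copy steps above can also be sketched in Python, which avoids passing variables between Python and shell cells and scales to more than two sites. `distribute_global_model` is a hypothetical helper, not part of the MONet Bundle; unlike plain cp, it creates the fold directory if it does not exist yet. The demonstration uses a temporary folder with placeholder files.

```python
import shutil
import tempfile
from pathlib import Path


def distribute_global_model(global_model, server_bundle, site_bundles, fold="fold_0"):
    """Copy the global model and the server's plans/dataset files into each site bundle."""
    copied = []
    server_models = Path(server_bundle) / "models"
    for bundle in site_bundles:
        models_dir = Path(bundle) / "models"
        fold_dir = models_dir / fold
        fold_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(global_model, fold_dir / "FL_global_model.pt")))
        for name in ("dataset.json", "plans.json"):
            copied.append(Path(shutil.copy2(server_models / name, models_dir / name)))
    return copied


# Demonstration on a temporary folder mirroring the tutorial's layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    server = root / "server" / "MONetBundle"
    (server / "models").mkdir(parents=True)
    (server / "models" / "dataset.json").write_text("{}")
    (server / "models" / "plans.json").write_text("{}")
    model = root / "FL_global_model.pt"
    model.write_bytes(b"placeholder")
    sites = [root / "site-1" / "MONetBundle", root / "site-2" / "MONetBundle"]
    copied = distribute_global_model(model, server, sites)
    copied_names = sorted(p.relative_to(root).as_posix() for p in copied)

print(copied_names)
```

In this tutorial's setup, the arguments would be the downloaded FL_global_model.pt, the server bundle under /home/maia-user/Data/NVFlare_POC/MONet-Bundles/server/MONetBundle, and the site-1 and site-2 bundle folders.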

Next, we can validate the global Federated model on the individual clients. At the end of the validation, the metrics will be logged in MLflow.

[ ]:
from pathlib import Path
from monai.nvflare.nvflare_nnunet import run_job

task_name = "finalize"

job_id = run_job(sess, task_name, str(Path(JOB_DIR).joinpath("jobs")),

                 )

References#

Simpson, A. L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., van Ginneken, B., … & Menze, B. H. (2019). A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063.

Roth, H. R., Cheng, Y., Wen, Y., Yang, I., Xu, Z., Hsieh, Y.-T., … Feng, A. (2022). NVIDIA FLARE: Federated Learning from Simulation to Real-World. doi:10.48550/arXiv.2210.13291

Cardoso, M. J., Li, W., Brown, R., Ma, N., Kerfoot, E., Wang, Y., … Feng, A. (2022). MONAI: An open-source framework for deep learning in healthcare.