Running Experiments on CLI
Info
Please ensure you have cloned the repo and navigated to the MedAgentGym/ directory in your local environment before working with the project.
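For reference, a minimal setup could look like the following sketch (the repository URL is a placeholder; substitute the actual MedAgentGym URL):
# Placeholder URL; clone the actual MedAgentGym repository and enter its root directory.
git clone https://github.com/<org>/MedAgentGym.git
cd MedAgentGym/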
The most straightforward way to replicate our experimental results is to use the Python scripts provided in the MedAgentGym repo, following the pipeline below.
Evaluation
Currently, ./entrypoint.sh is the entrypoint script that initiates the evaluation for each task. Users can modify this file to run any experiment reported in the paper. To evaluate a new model on MedAgentGym, follow the instructions below:
The ./config/ directory contains the experimental configurations for each model. For example, we provide the evaluation configurations for the gpt-4.1-mini model on all evaluation datasets in ./config/gpt_4_1_mini/. The elements below control the generator LLM configuration and the hyperparameters used in the experiments:
Agent:
  llm:
    model_type: "Azure"
    model_name: "gpt-4.1-mini"
    max_total_tokens: 32768
    max_input_tokens: 8192
    max_new_tokens: 8192
    log_probs: False
    temperature: 0.0
    deployment_name: "gpt-4.1-mini"
    n_retry: 3
    retry_delay: 10
Data:
  metadata_path: "data/metadata.json"
  data_path: "data/mimic_iii"
Debugger:
  model_type: "Azure"
  model_name: "gpt-4.1-mini"
  max_total_tokens: 32768
  max_input_tokens: 8192
  max_new_tokens: 2048
  log_probs: False
  temperature: 0.0
  deployment_name: "gpt-4.1-mini"
Env:
  n_retry: 3
  task: "mimic_iii"
  credentials_path: "./credentials.toml"
  work_dir: "./workdir/gpt_4_1_mini"
  result_dir_tag: "train_gpt-4_1-mini-mimiciii"
start_idx: 0
end_idx: -1
num_steps: 15
In this example, the configuration file specifies the settings for running the Azure-deployed gpt-4.1-mini model on the MIMIC-III task. The agent is configured to handle up to 32,768 total tokens, accepting inputs of up to 8,192 tokens and generating outputs of up to 8,192 tokens, with deterministic decoding (temperature 0.0). The agent retries failed requests up to three times with a 10-second delay. Data are sourced from the data/mimic_iii directory, guided by the metadata located at data/metadata.json. A similar gpt-4.1-mini configuration is used for the debugger, albeit with a lower maximum output limit of 2,048 tokens. The execution environment likewise uses three retries for robustness, and results are organized under the directory tag train_gpt-4_1-mini-mimiciii within the specified working directory. The run covers the entire dataset (from index 0 to the end) for 15 steps, and the credentials needed for accessing resources are managed through the provided credentials.toml file.
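To add a new model, one simple pattern (a sketch, not an official script) is to copy an existing configuration directory and edit the model-specific fields:
# Sketch only: reuse the gpt-4.1-mini configs as a template for a new model.
mkdir -p ./config/my_new_model
cp ./config/gpt_4_1_mini/*.yaml ./config/my_new_model/
# Then update model_type, model_name, deployment_name, work_dir, and result_dir_tag in each copied file.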
Info
For the interpretation of each argument, please check the medagentgym.args
API or directly refer to the code implementation.
Note that the API documentation may not be entirely comprehensive.
After preparing the configuration files, we need to add the experiment-running commands to the entrypoint.sh file. The script format should be as below:
python main.py --config /home/configs/medgemma-4b/mimic_iii.yaml --async_run --parallel_backend joblib --n_jobs 10
Warning
For parallel experiments, the argument --async_run is essential: it guarantees that the experiments are assigned to different threads and run simultaneously. If the argument is omitted, the code runs all experiments sequentially by default. For the parallel backend engine, there are two choices, joblib and ray; pick whichever suits your hardware setup.
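For example, an entrypoint.sh that evaluates one model across several tasks might look like the following sketch (the config directory and file names are assumptions; adjust them to your setup):
#!/bin/bash
# Sketch: loop over every task config for a single model and launch each evaluation in parallel mode.
for cfg in /home/configs/gpt_4_1_mini/*.yaml; do
    python main.py --config "$cfg" --async_run --parallel_backend joblib --n_jobs 10
done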
To finally run the experiments, run the command sh test_docker.sh instead of executing the entrypoint.sh file directly, as the Dockerfile already includes entrypoint.sh as the container's pre-built entrypoint:
docker run \
--network host \
--shm-size=64g \
--gpus='"device=0"' \
-v /main.py:/home/main.py \
-v /rollout.py:/home/rollout.py \
-v /configs:/home/configs \
-v /cache:/home/cache \
-v /workdir:/home/workdir \
-v /entrypoint.sh:/home/entrypoint.sh \
-v /credentials.toml:/home/credentials.toml \
-e TASK_NAME="eicu" \
-it ehr_gym:latest
The motivation is to keep the main MedAgentGym functions outside the Docker image and mount them at runtime, which isolates execution for safety. Moreover, if the image has already been built and you are only modifying the run commands, you can simply mount the updated shell scripts into the container.
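If the ehr_gym:latest image is not yet available locally, it first needs to be built; a minimal sketch, assuming the Dockerfile sits at the repository root:
# Build the evaluation image once; later changes to configs or scripts only need to be re-mounted, not rebuilt.
docker build -t ehr_gym:latest .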
Supervised Fine-Tuning
By default, this project uses the verl package for supervised fine-tuning.
Because the training trajectories contain multiple turns of interaction between the LLM agent and environment feedback, we leverage the package's multi-turn SFT to train LLMs on multi-turn conversational datasets, enabling models to learn from full dialogue exchanges rather than single-turn prompts.
Under the hood, this is handled by a specialized dataset class (see multiturn_sft_dataset.py
) that takes a list of role-tagged messages and formats them into a single input sequence using the model's chat template.
In practice, to use this feature you can activate it in the training config: for example, the FSDP SFT trainer (verl.trainer.fsdp_sft_trainer) will switch to multi-turn mode when you set data.multiturn.messages_key=conversations.
The verl/examples/sft/multiturn
example demonstrates how to launch a multi-turn SFT run with these settings: pointing to your training dataset (such as a JSON/Parquet file of dialogues), defining model initialization (e.g., a base model checkpoint via model.partial_pretrain
) and then running torchrun -m verl.trainer.fsdp_sft_trainer
with the multi-turn flags enabled.
This workflow allows researchers to easily fine-tune a model on multi-turn dialogues: the SFT trainer automatically uses the MultiTurnSFTDataset logic to prepare each conversation turn as a training sample and handles the labeling of the assistant responses, so you can plug in your own conversational data and adapt the config (or Hydra overrides) to train custom multi-turn models with minimal code changes.
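For illustration only, a single training record keyed by messages (the key name must match data.multiturn.messages_key in your config; the dialogue content below is hypothetical) could be written like this:
# Hypothetical multi-turn record; check the exact schema against verl's MultiTurnSFTDataset.
cat <<'EOF' > example_record.json
{"messages": [
  {"role": "user", "content": "How many patients in the cohort were prescribed aspirin?"},
  {"role": "assistant", "content": "Let me query the prescriptions table first."},
  {"role": "user", "content": "Execution feedback: the query returned 128 rows."},
  {"role": "assistant", "content": "There are 128 patients prescribed aspirin."}
]}
EOF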
Below is an example of the launch script we used to run SFT for the 7B model:
#!/bin/bash
set -x

# Usage: bash run_qwen_7b_medagentgym.sh <nproc_per_node> <save_path> [extra overrides...]
nproc_per_node=$1
save_path=$2

# Shift the arguments so $@ refers to the remaining overrides
shift 2

export EXPERIMENT_NAME=qwen-3-7b-sft

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=/PATH_TO_DATA/train.parquet \
    data.val_files=/PATH_TO_DATA/test.parquet \
    data.multiturn.enable=true \
    data.multiturn.messages_key=messages \
    data.micro_batch_size=4 \
    data.max_length=40000 \
    model.partial_pretrain=Qwen/Qwen2.5-7B-Instruct \
    model.enable_gradient_checkpointing=True \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gym \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.logger=['console','wandb'] \
    trainer.total_epochs=5 \
    trainer.default_hdfs_dir=null \
    ulysses_sequence_parallel_size=2 \
    optim.lr=1e-4 \
    use_remove_padding=true \
    +model.fsdp_config.grad_offload=True \
    +model.fsdp_config.optimizer_offload=True \
    +model.fsdp_config.param_offload=True \
    "$@"
To run the script, launch the following command: bash run_qwen_7b_medagentgym.sh 4 /SAVE_PATH/. This runs the SFT experiment distributed over 4 GPUs and saves all checkpoints in /SAVE_PATH/.
Direct Preference Optimization (DPO)
Since verl does not support DPO training, we use another well-known open-source package, OpenRLHF, to conduct the DPO training.
The OpenRLHF package provides an efficient and scalable implementation of Direct Preference Optimization (DPO), an alignment method that fine-tunes large language models directly on human preference data.
DPO leverages paired comparisons—where human annotators indicate which response is preferred—to directly optimize the model's outputs without requiring reward modeling or explicit reinforcement learning rollouts.
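Concretely, DPO minimizes the following loss over preference pairs, where y_w is the chosen and y_l the rejected response:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
Here pi_theta is the model being optimized, pi_ref is the frozen reference model (typically the SFT checkpoint), sigma is the logistic function, and beta corresponds to the --beta 0.1 argument in the script below.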
In OpenRLHF, DPO training is driven by the train_dpo entry point, which combines dedicated dataset classes for preference data with a trainer for distributed training.
To launch DPO training, users specify paths to preference datasets (e.g., JSONL files), the base model checkpoint, and training hyperparameters via command-line arguments, as in the script below.
The repository also ships runnable example scripts demonstrating how to launch DPO with various base models (e.g., Llama or Qwen), and shows how the DPO loss is computed between chosen and rejected responses to drive model alignment.
This workflow enables ML researchers to efficiently perform preference-based fine-tuning on open-source LLMs, supporting large-scale, multi-GPU training and custom data integration with minimal code modification.
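As an illustration (the field names match the --chosen_key/--rejected_key arguments used below; the exact schema expected with --apply_chat_template should be checked against OpenRLHF's dataset implementation), a preference record might look like:
# Hypothetical preference pair; one JSON object per line in the JSONL dataset.
cat <<'EOF' >> preference_pairs.jsonl
{"chosen": [{"role": "user", "content": "Count ICU stays longer than 7 days."}, {"role": "assistant", "content": "A correct, successful trajectory."}], "rejected": [{"role": "user", "content": "Count ICU stays longer than 7 days."}, {"role": "assistant", "content": "A failed trajectory."}]}
EOF
The DPO training script we used is shown below: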
set -x

read -r -d '' training_commands <<EOF
train_dpo \
    --save_path /SAVE_PATH/ \
    --save_steps -1 \
    --logging_steps 1 \
    --eval_steps -1 \
    --train_batch_size 64 \
    --micro_train_batch_size 1 \
    --pretrain Qwen/Qwen2.5-7B-Instruct \
    --bf16 \
    --max_epochs 1 \
    --max_len 8192 \
    --zero_stage 3 \
    --learning_rate 5e-6 \
    --beta 0.1 \
    --dataset json@DATA_PATH \
    --apply_chat_template \
    --chosen_key chosen \
    --rejected_key rejected \
    --load_checkpoint \
    --gradient_checkpointing \
    --label_smoothing 0.1 \
    --use_wandb yczhuang
EOF
# --use_wandb

if [[ ${1} != "slurm" ]]; then
    deepspeed --master_port=29400 --include localhost:0,1,2,3 --module $training_commands
fi
To run the script, launch: bash dpo_training.sh. This runs the DPO experiment distributed over the 4 GPUs listed in --include localhost:0,1,2,3 and saves all checkpoints in /SAVE_PATH/.
Iterative DPO
We also conducted experiments that require collaboration between these two packages. To run DPO on a warmed-up (SFT-ed) model, we first use the verl package to perform multi-turn supervised fine-tuning (SFT) on 100 randomly selected samples so the model briefly learns the response style and format for the questions. Once SFT is complete, the resulting checkpoint is used to sample new online preference pairs on the training data according to the final outcome, and it can be seamlessly passed to the OpenRLHF package for DPO training.
For advanced workflows such as iterative DPO, we may perform SFT with verl, run an initial round of DPO with OpenRLHF, and use the resulting checkpoint to sample new online preference pairs on the training data. After the data sampling is complete, we repeat the DPO process using the updated preference data.
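Put together, one round of this pipeline could be orchestrated roughly as follows (a sketch only: the rollout and pair-construction steps are project-specific, and the script names are the placeholders used earlier in this page):
# Sketch of a single iterative-DPO round; all paths and helper steps are placeholders.
bash run_qwen_7b_medagentgym.sh 4 /SAVE_PATH/sft_warmup/   # 1) multi-turn SFT warm-up with verl
# 2) roll out the latest checkpoint on the training tasks (e.g., via main.py) and record outcomes
# 3) build chosen/rejected pairs from successful vs. failed trajectories into preference_pairs.jsonl
bash dpo_training.sh                                        # 4) DPO with OpenRLHF on the new pairs
# Repeat steps 2-4 with the updated checkpoint for additional iterations.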