Compare revisions

Commits on Source (66)
Showing with 999 additions and 289 deletions
.git/
models/**
data/
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive \
LANG=en_US.UTF-8 \
LANGUAGE=en_US:en \
LC_ALL=en_US.UTF-8 \
USER_NAME=aicrowd \
HOME_DIR=/home/aicrowd \
CONDA_DIR=/home/aicrowd/.conda \
PATH=/home/aicrowd/.conda/bin:${PATH} \
SHELL=/bin/bash
# Install system dependencies and clean up in one layer
COPY apt.txt /tmp/apt.txt
RUN apt -qq update && apt -qq install -y --no-install-recommends `cat /tmp/apt.txt | tr -d '\r'` locales wget build-essential \
&& locale-gen en_US.UTF-8 \
&& rm -rf /var/cache/apt/* /var/lib/apt/lists/* \
&& apt clean
# Set up user
RUN groupadd -g 1001 aicrowd && \
useradd -m -s /bin/bash -u 1001 -g aicrowd -G sudo aicrowd
USER ${USER_NAME}
WORKDIR ${HOME_DIR}
# Install Miniconda and Python packages. You can change the python version by using another Miniconda.
RUN wget -nv -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_22.11.1-1-Linux-x86_64.sh \
&& bash miniconda.sh -b -p ${CONDA_DIR} \
&& . ${CONDA_DIR}/etc/profile.d/conda.sh \
&& conda install cmake -y \
&& conda clean -y -a \
&& rm -rf miniconda.sh
COPY --chown=1001:1001 requirements.txt ${HOME_DIR}/requirements.txt
RUN pip install -r requirements.txt --no-cache-dir
COPY --chown=1001:1001 requirements_eval.txt ${HOME_DIR}/requirements_eval.txt
RUN pip install -r requirements_eval.txt --no-cache-dir
## Add your custom commands below
![AMAZON KDD CUP 2024: MULTI-TASK ONLINE SHOPPING CHALLENGE FOR LLMS](https://images.aicrowd.com/raw_images/challenges/social_media_image_file/1139/566667103918dae81381.jpg)
![AMAZON KDD CUP 2024: MULTI-TASK ONLINE SHOPPING CHALLENGE FOR LLMS](https://aicrowd-production.s3.eu-central-1.amazonaws.com/challenge_images/amazon-kdd-cup-2024/amazon-kdd-cup-24-banner.jpg)
[![Discord](https://img.shields.io/discord/565639094860775436.svg)](https://discord.gg/yWurtB2huX)
# 🛒 [Amazon KDD CUP 2024: Multi-Task Online Shopping Challenge for LLMs](https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms) Starter Kit
@@ -55,7 +55,9 @@ The development datasets will be given in json format with the following fields.
- `input_field`: This field contains the instructions and the question that should be answered by the model.
- `output_field`: This field contains the ground truth answer to the question.
- `task_type`: This field contains the type of the task (Details in the next Section, "Tasks")
- `task_name`: This field contains the name of the task. However, the exact task names are redacted, and we only provide participants with hashed task names (e.g. `task1`, `task2`).
- `metric`: This field contains the metric used to evaluate the question (Details in Section "Evaluation Metrics").
- `track`: This field specifies the track the question comes from.
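For illustration, a single development record could look like the example below; the field names match the list above, while every value is an invented placeholder rather than real challenge data:

```json
{
    "input_field": "Which of the following categories best matches the query 'wireless earbuds'? 0. Kitchen, 1. Electronics, 2. Apparel, 3. Garden",
    "output_field": "1",
    "task_type": "multiple-choice",
    "task_name": "task3",
    "metric": "accuracy",
    "track": "track-placeholder"
}
```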
However, the test dataset (which will be hidden from participants) will have a different format with only two fields:
- `input_field`, which is the same as above.
@@ -116,18 +118,18 @@ Please follow the instructions in [models/README.md](models/README.md) for instr
1. **Add your SSH key** to AIcrowd GitLab
You can add your SSH Keys to your GitLab account by going to your profile settings [here](https://gitlab.aicrowd.com/profile/keys). If you do not have SSH Keys, you will first need to [generate one](https://docs.gitlab.com/ee/ssh/README.html#generating-a-new-ssh-key-pair).
You can add your SSH Keys to your GitLab account by going to your profile settings [here](https://gitlab.aicrowd.com/-/profile/keys). If you do not have SSH Keys, you will first need to [generate one](https://docs.gitlab.com/ee/user/ssh.html).
2. **Fork the repository**. You can use [this link](https://gitlab.aicrowd.com/aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit/-/forks/new) to create a fork.
2. **Clone the repository**
3. **Clone the repository**
```bash
git clone git@gitlab.aicrowd.com:aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit.git
git clone git@gitlab.aicrowd.com:<YOUR-AICROWD-USER-NAME>/amazon-kdd-cup-2024-starter-kit.git
cd amazon-kdd-cup-2024-starter-kit
```
3. **Install** competition specific dependencies!
4. **Install** competition specific dependencies!
```bash
cd amazon-kdd-cup-2024-starter-kit
pip install -r requirements.txt
@@ -135,13 +137,13 @@ You can add your SSH Keys to your GitLab account by going to your profile settin
pip install -r requirements_eval.txt
```
4. Write your own model as described in [How to write your own model](#how-to-write-your-own-model) section.
5. Write your own model as described in [How to write your own model](#how-to-write-your-own-model) section.
5. Test your model locally using `python local_evaluation.py`.
6. Test your model locally using `python local_evaluation.py`.
6. Accept the Challenge Rules on the main [challenge page](https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms) by clicking on the **Participate** button. Also accept the Challenge Rules on the Task specific page (link on the challenge page) that you want to submit to.
7. Accept the Challenge Rules on the main [challenge page](https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms) by clicking on the **Participate** button. Also accept the Challenge Rules on the Task specific page (link on the challenge page) that you want to submit to.
7. Make a submission as described in [How to make a submission](#-how-to-make-a-submission) section.
8. Make a submission as described in [How to make a submission](#-how-to-make-a-submission) section.
## 📮 How to make a submission?
@@ -153,8 +155,22 @@ This also includes instructions on [specifying your software runtime](docs/submi
## 💻 What hardware does my code run on ?
You can find more details about the hardware and system configuration in [docs/hardware-and-system-config.md](docs/hardware-and-system-config.md).
In summary, we provide you `2` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 1; and `4` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 2.
In summary, we provide you `4` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 2.
Your solution will be given a certain amount of time for inference, after which it will be immediately killed and no results will be available. The time limits are set as follows:
| Phase | Track 1 | Track 2 | Track 3 | Track 4 | Track 5 |
| ------ | ------- | ------- | ------- | ------- | ------- |
| **Phase 2**| 70 minutes | 20 minutes | 30 minutes | 20 minutes | 140 minutes |
For reference, the baseline solution with zero-shot LLaMA3-8B-Instruct consumes the following amounts of time:
| Phase | Track 1 | Track 2 | Track 3 | Track 4 |
| ------ | ------- | ------- | ------- | ------- |
| **Phase 2**| 1490s | 397s | 576s | 359s |
We limit the prediction time to at most **10 seconds per sample**. This limit applies at the batch level; for example, for a batch of 8 samples, you should return the predictions within at most 80 seconds. Otherwise, your submission will be killed.
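As a local sanity check, a minimal sketch (not part of the starter kit) of how you might watch this budget while testing is shown below; the helper name is invented, while `batch_predict` and the `prompt` key come from the model interface described in `models/README.md`. The evaluator enforces the actual limit itself.

```python
import time

def run_batch_with_budget(model, batch, is_multiple_choice, per_sample_budget_s=10):
    """Call batch_predict and warn locally if the per-sample time budget is exceeded."""
    start = time.monotonic()
    responses = model.batch_predict(batch, is_multiple_choice)
    elapsed = time.monotonic() - start
    budget = per_sample_budget_s * len(batch["prompt"])  # budget scales with the batch size
    if elapsed > budget:
        print(f"WARNING: batch took {elapsed:.1f}s, exceeding the {budget:.1f}s budget")
    return responses
```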
Your maximum repo size is 200GB.
## 🧩 How are my model responses parsed by the evaluators ?
Please refer to [parsers.py](parsers.py) for more details on how we parse your model responses.
git
#!/bin/bash
# This script builds a Docker image from the current directory
# and runs a container from this image, executing local_evaluation.py
# with the current directory mounted at /submission inside the container.
# Step 1: Define the name of the Docker image.
LAST_COMMIT_HASH=$(git rev-parse --short HEAD)
IMAGE_NAME="aicrowd/amazon-kddcup24-submission:${LAST_COMMIT_HASH}"
# Step 2: Build the Docker image.
# The '.' at the end specifies that the Docker context is the current directory.
# This means Docker will look for a Dockerfile in the current directory to build the image.
START_TIME=$(date +%s)
DOCKER_BUILDKIT=1 docker build -t $IMAGE_NAME .
BUILD_STATUS=$?
if [ $BUILD_STATUS -ne 0 ]; then
echo "Docker build failed. Exiting..."
exit $BUILD_STATUS
fi
END_TIME=$(date +%s)
BUILD_TIME=$((END_TIME - START_TIME))
echo "Total build time: $BUILD_TIME seconds"
# Step 3: Run the Docker container.
# -v "$(pwd)":/submission mounts the current directory ($(pwd) outputs the current directory path)
# to /submission inside the container. This way, the container can access the contents
# of the current directory as if they were located at /submission inside the container.
# 'python /submission/local_evaluation.py' is the command executed inside the container.
# the -w flag sets the working directory to /submission.
# It then runs local_evaluation.py using the software runtime set up in the Dockerfile.
docker run \
--gpus all \
-v "$(pwd)":/submission \
-w /submission \
--shm-size=10.24gb \
$IMAGE_NAME python local_evaluation.py
# Note: We assume you have nvidia-container-toolkit installed and configured
# to use the --gpus all flag. If you are not using GPUs, you can remove this flag.
# Note 1: Please refer to the Dockerfile to understand how the software runtime is set up.
# The Dockerfile should include all necessary commands to install Python, the necessary
# dependencies, and any other software required to run local_evaluation.py.
# Note 2: Note the .dockerignore file in the root of this directory.
# In the .dockerignore file, specify any files or directories that should not be included
# in the Docker context. This typically includes large files, models, or datasets that
# are not necessary for building the Docker image. Excluding these can significantly
# speed up the build process by reducing the size of the build context sent to the Docker daemon.
# Ensure your Dockerfile and .dockerignore are properly set up before running this script.
### Setting Up and Downloading Baseline Model Weights with Hugging Face
This guide outlines the steps to download (and check in) the model weights required for the baseline models.
We will focus on `Meta-Llama-3-8B-Instruct`,
but the steps should work equally well for any other model on Hugging Face.
#### Preliminary Steps:
1. **Install the Hugging Face Hub Package**:
Begin by installing the `huggingface_hub` package, which includes the `hf_transfer` utility, by running the following command in your terminal:
```bash
pip install huggingface_hub[hf_transfer]
```
2. **Accept the LLaMA Terms**:
You must accept the LLaMA model's terms of use by visiting: [meta-llama/Meta-Llama-3-8B-Instruct Terms](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
3. **Create a Hugging Face CLI Token**:
Generate a CLI token by navigating to: [Hugging Face Token Settings](https://huggingface.co/settings/tokens). You will need this token for authentication.
#### Hugging Face Authentication:
1. **Login via CLI**:
Authenticate yourself with the Hugging Face CLI using the token created in the previous step. Run:
```bash
huggingface-cli login
```
When prompted, enter the token.
#### Model Downloads:
1. **Download the Meta-Llama-3-8B-Instruct Model**:
Execute the following command to download the `Meta-Llama-3-8B-Instruct` model to a local subdirectory. This command excludes unnecessary files to save space:
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir-use-symlinks False \
--local-dir models/meta-llama/Meta-Llama-3-8B-Instruct \
--exclude "*.pth" # These are alternates to the safetensors and hence not needed
```
#### Version Control with Git LFS:
1. **Track Model Weights**:
Use Git Large File Storage (LFS) to track the model directories. This ensures efficient handling of large files:
```bash
git lfs track "models/meta-llama/*"
```
2. **Commit and Push**:
Add the models to your Git repository, commit the changes, and push them to your remote repository:
```bash
git add models/
git commit -am "add weights"
git push origin master
```
If you are struggling with Git LFS, you are very much encouraged to check out [this post](https://discourse.aicrowd.com/t/how-to-upload-large-files-size-to-your-submission/2304).
@@ -11,18 +11,19 @@ We apply a limit on the hardware available to each participant to run their solu
- `40` x vCPU (`20` physical CPU cores)
- `180GB` RAM
**Note**: When running in `gpu:false` mode, you will have access to `4` x vCPUs (`2` physical cores) and `8GB` RAM.
Please note that the NVIDIA T4 uses a somewhat outdated architecture and is thus not compatible with certain acceleration toolkits (e.g. Flash Attention), so please be careful about compatibility.
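As an example, the small sketch below (not part of the starter kit) checks the GPU compute capability before enabling such optimizations; the T4 reports capability 7.5, while FlashAttention-2 requires Ampere (capability 8.0) or newer.

```python
import torch

# Gate optional accelerations on the detected device: the T4 (Turing) reports
# compute capability 7.5, below the 8.0 required by FlashAttention-2.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    use_flash_attention = (major, minor) >= (8, 0)
else:
    use_flash_attention = False
print(f"FlashAttention-2 enabled: {use_flash_attention}")
```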
In addition, the following restrictions will also be imposed:
- Network connection will be disabled (except for HuggingFace to download open-source checkpoints).
- Each submission will be assigned a certain amount of time to run. Submissions that exceed the time limits will be killed and will not be evaluated. The tentative time limit is set as follows **[TO BE TESTED WITH AICROWD SUBMISSION SYSTEM]**.
- Network connection will be disabled.
- Each submission will be assigned a certain amount of time to run. Submissions that exceed the time limits will be killed and will not be evaluated. The tentative time limit is set as follows.
| Phase | Track 1 | Track 2 | Track 3 | Track 4 | Track 5 |
| ------ | ------- | ------- | ------- | ------- | ------- |
| **Phase 1**| 140 minutes | 40 minutes | 60 minutes | 60 minutes | 5 hours |
- Each team will be able to make up to **4 submissions per week**, with a maximum of **2 Track 5 all-around submissions** **[TO BE TESTED WITH AICROWD SUBMISSION SYSTEM]**.
- Each team will be able to make up to **2 submissions per week** per track for Tracks 1-4, and **1 submission per week** for track 5 all-around.
Based on the hardware and system configuration, we recommend that participants begin with 7B models. According to our experiments, 7B models like Vicuna-7B and Mistral-7B can run inference smoothly on 2 NVIDIA T4 GPUs, while 13B models will result in out-of-memory (OOM) errors.
@@ -17,11 +17,13 @@ Few of the most common ways are as follows:
[...]
```
We suggest keeping `requirements.txt` to a minimum, with only the necessary packages in it. The more (unnecessary) packages you put in it, the more likely you are to encounter an error from some (maybe totally unnecessary) package.
* `apt.txt` -- The Debian packages (installed via apt) used by your inference code!
These files are used to construct your **AIcrowd submission docker containers** in which your code will run.
* `Dockerfile` -- **For advanced users only**. `Dockerfile` gives you more flexibility on defining the software runtime used during evaluations.
* `Dockerfile` -- `Dockerfile` gives you more flexibility on defining the software runtime used during evaluations. The `Dockerfile` under the root path of the starter kit will be used to build your solution. Feel free to modify anything in it, and test it locally.
----
@@ -20,9 +20,9 @@ This document is designed to assist you in making your initial submission smooth
Our platform supports custom runtime environments. This means you have the flexibility to choose any libraries or frameworks necessary for your project. Here’s how you can specify your runtime and dependencies:
- **`requirements.txt`**: List any PyPI packages your project needs.
- **`requirements.txt`**: List any PyPI packages your project needs. **Do specify versions, as we observe significant differences in inference time between different `transformers` versions** (see the illustrative example after this list).
- **`apt.txt`**: Include any apt packages required.
- **`Dockerfile`**: Optionally, you can provide your own Dockerfile. An example is located at `utilities/_Dockerfile`, which can serve as a helpful starting point.
- **`Dockerfile`**: The one located at the root will be used by default to build your submission. **You can specify the Python version here if you need a specific one.**
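For instance, a hypothetical pinned `requirements.txt` could look like the block below; the package names and version numbers are placeholders chosen for illustration, not the official baseline dependency set:

```
torch==2.1.2
transformers==4.40.0
vllm==0.4.1
sentence-transformers==2.7.0
```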
For detailed setup instructions regarding runtime dependencies, refer to the documentation in the `docs/runtime.md` file.
@@ -32,18 +32,21 @@ Your project should follow the structure outlined in the starter kit. Here’s a
```
.
├── .dockerignore # Please specify the paths to your model checkpoints so that the large files won't be built into the docker image.
├── README.md # Project documentation and setup instructions
├── aicrowd.json # Submission meta information - like your username, track name
├── data
│ └── development.json # Development dataset local testing
├── docs
│ └── runtime.md # Documentation on the runtime environment setup, dependency confifgs
│ └── runtime.md # Documentation on the runtime environment setup, dependency configs
├── Dockerfile # The Dockerfile that will be used to build your submission and all dependencies. The default one will work fine, but you can write your own.
├── docker_run.sh # This script builds your submission locally and calls `local_evaluation.py`. It can be used to debug (if your submission fails to build).
├── local_evaluation.py # Use this to check your model evaluation flow locally
├── metrics.py # Scripts to calculate evaluation metrics for your model's performance
├── models
│ ├── README.md # Documentation specific to the implementation of model interfaces
│ ├── base_model.py # Base model class
│ ├── dummy_model.py # A simple or placeholder model for demonstration or testing
│ ├── dummy_model.py # A simple or placeholder model for demonstration or testing. We also implement a simple Vicuna-7B baseline here.
│ └── user_config.py # IMPORTANT: Configuration file to specify your model
├── parsers.py # Model output parser
├── requirements.txt # Python packages to be installed for model development
@@ -52,7 +55,7 @@ Your project should follow the structure outlined in the starter kit. Here’s a
└── _Dockerfile # Example Dockerfile for specifying runtime via Docker
```
Remember, **your submission metadata JSON (`aicrowd.json`)** is crucial for mapping your submission to the challenge. Ensure it contains the correct `challenge_id`, `authors`, and other necessary information. To utilize GPUs, set the `"gpu": true` flag in your `aicrowd.json`.
Remember, **your submission metadata JSON (`aicrowd.json`)** is crucial for mapping your submission to the challenge. Ensure it contains the correct `challenge_id`, `authors`, and other necessary information. **To utilize GPUs, set the `"gpu": true` flag in your `aicrowd.json`.**
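A hypothetical `aicrowd.json` is sketched below; the field names follow the description above, but treat the values as placeholders and copy the exact track-specific `challenge_id` from the file shipped with the starter kit:

```json
{
    "challenge_id": "amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms",
    "authors": ["your-aicrowd-username"],
    "gpu": true
}
```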
## Submitting to Different Tracks
@@ -112,10 +115,12 @@ For more information on how to upload large files to your submission and detaile
To submit your code, push a tag beginning with "submission-" to your repository on [GitLab](https://gitlab.aicrowd.com/). Follow these steps to make a submission:
This assumes you have already cloned the repo by following the instructions [here](../README.md#setup) and made your changes.
1. Commit your changes with `git commit -am "Your commit message"`.
2. Tag your submission (e.g., `git tag -am "submission-v0.1" submission-v0.1`).
3. Push your changes and tags to the AIcrowd repository (replace `<YOUR_AICROWD_USER_NAME>` with your actual username).
3. Push your changes and tags to the AIcrowd repository (e.g. `git push origin submission-v0.1`)
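Putting the steps above together (the tag name is only an example, and this assumes your default branch is `master`):

```bash
git commit -am "Your commit message"
git tag -am "submission-v0.1" submission-v0.1
git push origin master
git push origin submission-v0.1
```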
After pushing your tag, you can view your submission details at `https://gitlab.aicrowd.com/<YOUR_AICROWD_USER_NAME>/amazon-kdd-cup-2024-starter-kit/issues`.
After pushing your tag, you can view your submission details at `https://gitlab.aicrowd.com/<YOUR-AICROWD-USER-NAME>/amazon-kdd-cup-2024-starter-kit/issues`. It may take about **30 minutes** for each submission to build and begin evaluation, so please be patient.
Ensure your `aicrowd.json` is correctly filled with the necessary metadata, and you've replaced `<YOUR_AICROWD_USER_NAME>` with your GitLab username in the provided URL.
Ensure your `aicrowd.json` is correctly filled with the necessary metadata, and you've replaced `<YOUR-AICROWD-USER-NAME>` with your GitLab username in the provided URL.
import pandas as pd
from tqdm import tqdm
import torch
import numpy as np
import os
from sentence_transformers import SentenceTransformer
import metrics
import numpy as np
import pandas as pd
import parsers
import torch
from tqdm import tqdm
VERSION = "0.1.0"
def print_sample(idx, generation, truth, metric, score):
@@ -51,18 +52,36 @@ def generate_model_outputs(data_df, model):
- A list containing the model outputs for each entry in the data DataFrame.
"""
outputs = []
for _, row in tqdm(
data_df.iterrows(), total=len(data_df), desc="Generating Responses"
):
is_multiple_choice = row["task_type"] == "multiple-choice"
prompt = row["input_field"]
model_output = model.predict(prompt, is_multiple_choice)
outputs.append(model_output)
return outputs
task_grouped_df = data_df.groupby(by=["task_type"])
for task_type, task_group_data_df in task_grouped_df:
task_group_data_df = task_group_data_df.reset_index(drop=True)
is_multiple_choice = task_type[0] == "multiple-choice"
batch_size = model.get_batch_size()
batches = [task_group_data_df[i:i+batch_size] for i in range(0,len(task_group_data_df),batch_size)]
for batch_df in batches:
batch = {
"prompt": batch_df["input_field"].tolist(),
}
model_output = model.batch_predict(
batch,
is_multiple_choice
)
outputs.append(
pd.DataFrame({
"input_field": batch["prompt"],
"model_output_str": model_output
}))
df_outputs = pd.concat(outputs)
return df_outputs
# Function to evaluate the generated model outputs
def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
def evaluate_outputs(data_df, log_every_n_steps=1):
"""
Evaluate the model outputs against ground truth values using specified metrics.
@@ -81,21 +100,18 @@ def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
for row_idx, row in tqdm(
data_df.iterrows(), total=len(data_df), desc="Evaluating"
):
task_type, metric, ground_truth = (
task_name, task_type, metric, ground_truth, model_output_str = (
row["task_name"],
row["task_type"],
row["metric"],
row["output_field"],
row["model_output_str"],
)
if metric not in eval_methods:
raise NotImplementedError(f"No metric for {metric=}")
task_name = f"{task_type}---{metric}"
# Note: In practice, here we are using the task_type-metric pair as a unique identifier, calling it as the task_name.
# During the actual evaluations, the task names are more semantically defined, meaning, there could be multiple tasks
# with the same task_type and metric.
model_output = task_parsers[task_type].parse(outputs[row_idx])
model_output = task_parsers[task_type].parse(model_output_str)
eval_fn = eval_methods[metric]
metric_score = eval_fn(model_output, ground_truth)
@@ -108,9 +124,9 @@ def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
per_task_metrics[task_name]["sample_score"].append(metric_score)
if row_idx % log_every_n_steps == 0:
if (row_idx + 1) % log_every_n_steps == 0:
print_sample(
row_idx, model_output, ground_truth, metric, metric_score
row_idx + 1, model_output, ground_truth, metric, metric_score
)
return per_task_metrics
@@ -143,7 +159,7 @@ def aggregate_scores(per_task_metrics):
overall_score = (
np.mean(sample_scores)
if metric != "micro f1"
else metrics.compute_f1_score(sample_scores)
else metrics.calculate_f1_score(sample_scores)
)
overall_metrics["task_name"].append(task_name)
@@ -163,26 +179,28 @@ def get_evaluation_methods():
Returns:
- A dictionary mapping metric names to their respective evaluation functions.
"""
device = "cuda" if torch.cuda.is_available() else "cpu"
sentence_all_lm = SentenceTransformer("all-MiniLM-L6-v2").to(device)
sentence_multilingual = SentenceTransformer(
"paraphrase-multilingual-MiniLM-L12-v2"
).to(device)
return {
"accuracy": metrics.accuracy,
"hit rate@3": metrics.hit_rate_3,
"rougel": metrics.rougel,
"sent-transformer": lambda g, t: metrics.sent_transformer(
g, t, sentence_all_lm
"accuracy": metrics.calculate_per_sample_accuracy,
"hit rate@3": metrics.calculate_hit_rate_3,
"rougel": metrics.calculate_rougel,
"sent-transformer": lambda generated_text, reference_texts: metrics.calculate_cosine_similarity(
generated_text=generated_text,
reference_texts=reference_texts,
model_name="all-MiniLM-L6-v2",
),
"multilingual-sent-transformer": lambda generated_text, reference_texts: metrics.calculate_cosine_similarity(
generated_text=generated_text,
reference_texts=reference_texts,
model_name="paraphrase-multilingual-MiniLM-L12-v2",
),
"multilingual-sent-transformer": lambda g, t: metrics.sent_transformer(
g, t, sentence_multilingual
"micro f1": metrics.calculate_true_positive_false_positives_false_negatives,
"ndcg": metrics.calculate_ndcg,
"bleu": metrics.calculate_bleu_score,
"jp-bleu": lambda generated_text, reference_text: metrics.calculate_bleu_score(
generated_text=generated_text,
reference_text=reference_text,
is_japanese=True,
),
"micro f1": metrics.tp_fp_fn,
"ndcg": metrics.ndcg_eval,
"bleu": metrics.bleu,
"jp-bleu": lambda g, t: metrics.bleu(g, t, jp=True),
}
@@ -208,14 +226,14 @@ def get_task_parsers():
# Main execution function to load data, generate model outputs, evaluate, and aggregate scores
def main():
# Load development data
# Please download the development data from : https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/dataset_files
# Please download the development data from : https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms/dataset_files
# and place it at: ./data/development.json
DATA_FILENAME = "./data/development.json"
if not os.path.exists(DATA_FILENAME):
raise FileNotFoundError(
f"Development data file not found at {DATA_FILENAME}."
"Please download the development data from : https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/dataset_files"
"Please download the development data from : https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms/dataset_files"
"and place it at: ./data/development.json"
)
@@ -229,14 +247,15 @@ def main():
model = UserModel()
# Generate model outputs
outputs = generate_model_outputs(data_df, model)
data_df["outputs"] = (
outputs # Optional: Add outputs back to DataFrame for inspection
)
print(data_df.head())
df_outputs = generate_model_outputs(data_df, model)
# add outputs to the data_df
merged_data_df = pd.merge(data_df, df_outputs, on="input_field")
print(merged_data_df.head())
# Evaluate the generated outputs and calculate metrics
per_task_metrics = evaluate_outputs(data_df, outputs)
per_task_metrics = evaluate_outputs(merged_data_df)
# Aggregate and display the evaluation scores
overall_metrics = aggregate_scores(per_task_metrics)
import os
from typing import List, Tuple, Union
import evaluate
import numpy as np
import torch
from loguru import logger
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
import numpy as np
import evaluate
from typing import List
sacrebleu = None
sentence_transformer_model_cache = {}
def calculate_per_sample_accuracy(prediction: int, truth: int) -> bool:
"""
Computes the accuracy of a single prediction.
def accuracy(prediction: int, truth: int):
This function checks if a given prediction matches the ground truth.
Parameters:
- prediction (int): The predicted value.
- truth (int): The actual ground truth value.
Returns:
- bool: True if the prediction matches the truth, False otherwise.
"""
return prediction == truth
def hit_rate_3(retrieved_int: List[int], truth: List[int]):
def calculate_hit_rate_3(retrieved_int: List[int], truth: List[int]) -> float:
"""
Calculates the hit rate within the top 3 retrieved integers.
This function assesses how many of the truth integers are present
within the first three elements of the retrieved list of integers.
Parameters:
- retrieved_int (List[int]): The list of retrieved integers, ordered by relevance.
- truth (List[int]): The list of ground truth integers.
Returns:
- float: The hit rate, calculated as the proportion of truth integers found
in the top 3 retrieved integers, relative to the total number of truth integers.
"""
# Calculate the number of hits within the top 3 retrieved integers
hit = len(set(truth).intersection(set(retrieved_int[:3])))
hit /= len(truth)
return hit
# Normalize the hit count by the total number of truth integers to get the hit rate
hit_rate = hit / len(truth)
return hit_rate
def calculate_rougel(generation: str, truth: str) -> float:
"""
Calculates the ROUGE-L F-measure score between a generated string and the truth string.
def rougel(generation: str, truth: str):
ROUGE-L measures the longest common subsequence between the generated text and the truth text,
considering both the precision and recall of the sequences. It is widely used in evaluating
the quality of text generation systems.
Parameters:
- generation (str): The generated text to evaluate.
- truth (str): The ground truth text to compare against.
Returns:
- float: The ROUGE-L F-measure score, indicating the quality of the generated text.
"""
# Initialize the ROUGE scorer with the ROUGE-L metric
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
# Calculate the ROUGE scores between the generated text and the truth text
scores = scorer.score(generation, truth)
# Extract and return the ROUGE-L F-measure score
return scores["rougeL"].fmeasure
def sent_transformer(generation: str, truth: str, sent_transformer_model):
generation_embedding = sent_transformer_model.encode([generation])[0]
def load_sentence_transformer_model(model_name: str) -> SentenceTransformer:
"""
Loads a Sentence Transformer model by its name and moves it to the appropriate device.
if isinstance(truth, str):
truth_embedding = sent_transformer_model.encode([truth])[0]
score = (generation_embedding * truth_embedding).sum()
score /= np.linalg.norm(generation_embedding, ord=2) * np.linalg.norm(
truth_embedding, ord=2
)
if score > 0:
return score
else:
return 0
Parameters:
- model_name (str): The name of the model to load.
Returns:
- SentenceTransformer: The loaded SentenceTransformer model.
"""
global sentence_transformer_model_cache
# a model cache ensures we do not load the model on every call
if model_name not in sentence_transformer_model_cache:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(model_name).to(device)
sentence_transformer_model_cache[model_name] = model
return sentence_transformer_model_cache[model_name]
def calculate_cosine_similarity(generated_text: str, reference_texts: Union[str, List[str]], model_name) -> float:
"""
Computes the cosine similarity score(s) between a generated text and reference text(s) using a sentence embedding model.
This function calculates the cosine similarity between the embedding of the generated text and the embedding(s)
of reference text(s). The embeddings are generated using a specified sentence embedding model. The cosine similarity
score is a measure of similarity between two vectors, ranging from -1 (completely different) to 1 (exactly the same).
Parameters:
- generated_text (str): The text generated by the model.
- reference_texts (Union[str, List[str]]): The reference text(s) for comparison. Can be a single string or a list of strings.
- model_name: The sentence embedding model used to generate text embeddings.
Returns:
- float: The average cosine similarity score between the generated text and the reference text(s). If reference_texts is a single
string, a single score is returned. If reference_texts is a list of strings, the average score across all references is returned.
The score is bounded between 0 (no similarity) and 1 (identical), with negative scores adjusted to 0.
"""
# Load/Reference model
model = load_sentence_transformer_model(model_name)
# Embedding for the generated text
generated_embedding = model.encode([generated_text])[0]
# Handling a single reference text
if isinstance(reference_texts, str):
# Embedding for the single reference text
reference_embedding = model.encode([reference_texts])[0]
# Compute cosine similarity
similarity_score = np.dot(generated_embedding, reference_embedding) / (np.linalg.norm(generated_embedding) * np.linalg.norm(reference_embedding))
# Ensure non-negative score
return max(similarity_score, 0)
# Handling multiple reference texts
else:
scores = []
for label_item in truth:
truth_embedding = sent_transformer_model.encode([label_item])[0]
score_ = (generation_embedding * truth_embedding).sum()
score_ /= np.linalg.norm(
generation_embedding, ord=2
) * np.linalg.norm(truth_embedding, ord=2)
scores.append(score_)
if np.mean(scores) > 0:
return np.mean(scores)
else:
return 0
def tp_fp_fn(entity_list, truth):
answer_lower = []
for a in entity_list:
answer_lower.append(a.lower().lstrip(" ").rstrip(" "))
truth_lower = []
for l in truth:
truth_lower.append(l.lower())
true_positive = len(set(answer_lower).intersection(set(truth_lower)))
false_positive = len(answer_lower) - true_positive
false_negative = len(truth_lower) - true_positive
return true_positive, false_positive, false_negative
def compute_f1_score(tp_fp_fn_list):
total_tp = 0
total_fp = 0
total_fn = 0
for tp, fp, fn in tp_fp_fn_list:
similarity_scores = []
for reference_text in reference_texts:
# Embedding for each reference text
reference_embedding = model.encode([reference_text])[0]
# Compute cosine similarity for each reference
individual_score = np.dot(generated_embedding, reference_embedding) / (np.linalg.norm(generated_embedding) * np.linalg.norm(reference_embedding))
similarity_scores.append(individual_score)
# Calculate and ensure non-negative average score
return max(np.mean(similarity_scores), 0)
def calculate_true_positive_false_positives_false_negatives(extracted_entities: List[str], ground_truth_entities: List[str]) -> Tuple[int, int, int]:
"""
Calculates true positives, false positives, and false negatives for entity extraction.
This function compares a list of extracted entities against a list of ground truth entities
to determine the count of true positives (correctly extracted entities), false positives
(incorrectly extracted entities), and false negatives (missed entities).
Both lists are case-insensitive, and leading/trailing spaces in extracted entities are ignored.
Parameters:
- extracted_entities (List[str]): The list of entities extracted by the model.
- ground_truth_entities (List[str]): The list of actual entities (ground truth).
Returns:
- Tuple[int, int, int]: A tuple containing the counts of true positives, false positives, and false negatives.
"""
# Normalize the extracted entities by making them lowercase and stripping leading/trailing spaces
normalized_extracted_entities = [entity.lower().strip() for entity in extracted_entities]
# Normalize the ground truth entities by making them lowercase
normalized_ground_truth_entities = [entity.lower() for entity in ground_truth_entities]
# Calculate true positives by finding the intersection between extracted and ground truth entities
true_positives = len(set(normalized_extracted_entities).intersection(set(normalized_ground_truth_entities)))
# Calculate false positives as extracted entities not in ground truth
false_positives = len(normalized_extracted_entities) - true_positives
# Calculate false negatives as ground truth entities not extracted
false_negatives = len(normalized_ground_truth_entities) - true_positives
return true_positives, false_positives, false_negatives
def calculate_f1_score(metrics_list: List[Tuple[int, int, int]]) -> float:
"""
Calculates the F1 score from a list of tuples containing true positives, false positives, and false negatives.
Parameters:
- metrics_list (List[Tuple[int, int, int]]): A list of tuples, where each tuple contains counts of true positives,
false positives, and false negatives in that order for various classifications or entity extractions.
Returns:
- float: The computed F1 score, ranging from 0 to 1.
"""
total_tp, total_fp, total_fn = 0, 0, 0
# Aggregate total true positives, false positives, and false negatives
for tp, fp, fn in metrics_list:
total_tp += tp
total_fp += fp
total_fn += fn
precision = total_tp / (total_tp + total_fp)
recall = total_tp / (total_tp + total_fn)
# Calculate precision and recall
precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
# Calculate F1 score, handling the case where precision + recall equals 0
if precision + recall == 0:
return 0
else:
return 2 * precision * recall / (precision + recall)
def calculate_ndcg(predicted_relevance_scores: List[int], true_relevance_weights: List[float]) -> float:
"""
Calculates and evaluates the Normalized Discounted Cumulative Gain (NDCG) score directly from predicted relevance scores
against true relevance weights. It normalizes the scores to ensure a fair comparison, trimming the predicted scores
if necessary to match the length of the true relevance weights.
Parameters:
- predicted_relevance_scores (List[int]): Indices of items ranked by the algorithm, expected to be integers starting from 1.
- true_relevance_weights (List[float]): Actual relevance weights for the items, with higher values indicating greater relevance.
Returns:
- float: The NDCG score, normalized against the ideal ranking, ranging from 0 to 1.
"""
# Trim the predicted scores to match the true scores length if necessary
if len(predicted_relevance_scores) > len(true_relevance_weights):
predicted_relevance_scores = predicted_relevance_scores[:len(true_relevance_weights)]
def ndcg(ranked_list, weight):
idcg = 0
dcg = 0
for i in range(len(ranked_list)):
position = i + 1
if ranked_list[i] - 1 < len(weight):
relevance = weight[ranked_list[i] - 1]
dcg, idcg = 0.0, 0.0
# Calculate DCG for the predicted ranking
for i, score_index in enumerate(predicted_relevance_scores, start=1):
if score_index - 1 < len(true_relevance_weights):
relevance = true_relevance_weights[score_index - 1]
else:
relevance = 0
dcg += (np.power(2, relevance) - 1) / np.log2(position + 1)
weight.sort(reverse=True)
for i in range(len(weight)):
position = i + 1
relevance = weight[i]
idcg += (np.power(2, relevance) - 1) / np.log2(position + 1)
return dcg / idcg
dcg += (np.power(2, relevance) - 1) / np.log2(i + 1)
# Calculate IDCG using sorted true relevance weights
for i, weight in enumerate(sorted(true_relevance_weights, reverse=True), start=1):
idcg += (np.power(2, weight) - 1) / np.log2(i + 1)
# Avoid division by zero
return 0 if idcg == 0 else dcg / idcg
def ndcg_eval(relevance_scores: List[float], truth: List[float]):
if len(relevance_scores) > len(truth):
relevance_scores = relevance_scores[: len(truth)]
return ndcg(relevance_scores, truth)
def calculate_bleu_score(generated_text: str, reference_text: str, is_japanese: bool = False) -> float:
"""
Calculates the BLEU score for a generated text compared to a reference truth text. This function supports
both general text and Japanese-specific evaluation by using the sacrebleu library.
Parameters:
- generated_text (str): The generated text to be evaluated.
- reference_text (str): The reference truth text.
- is_japanese (bool, optional): Flag to indicate whether the text is in Japanese, requiring special tokenization.
def bleu(generation, truth, jp=False):
Returns:
- float: The BLEU score as a percentage (0 to 1 scale) for the generated text against the reference truth.
"""
global sacrebleu
if sacrebleu is None:
print("\nsacrebleu loading...")
sacrebleu = evaluate.load("sacrebleu")
generation = generation.lstrip("\n").rstrip("\n").split("\n")[0]
candidate = [generation]
reference = [[truth]]
if not jp:
score = (
sacrebleu.compute(
predictions=candidate, references=reference, lowercase=True
)["score"]
/ 100
)
else:
score = (
sacrebleu.compute(
predictions=candidate,
references=reference,
lowercase=True,
tokenize="ja-mecab",
)["score"]
/ 100
)
# Preprocess input texts
generated_text = generated_text.lstrip("\n").rstrip("\n").split("\n")[0]
candidate = [generated_text]
reference = [[reference_text]]
# Compute BLEU score with or without Japanese-specific tokenization
bleu_args = {"predictions": candidate, "references": reference, "lowercase": True}
if is_japanese:
bleu_args["tokenize"] = "ja-mecab"
score = sacrebleu.compute(**bleu_args)["score"] / 100
return score
@@ -4,7 +4,7 @@
For a streamlined experience, we suggest placing the code for all your models within the `models` directory. This is a recommendation for organizational purposes, but it's not a strict requirement.
## Model Base Class
Your models should inherit from the `ShopBenchBaseModel` class found in [base_model.py](base_model.py). We provide an example model, `dummy_model.py`, to illustrate how you might structure your own model. Crucially, your model class must implement the `predict` method.
Your models should inherit from the `ShopBenchBaseModel` class found in [base_model.py](base_model.py). We provide an example model, `dummy_model.py`, to illustrate how you might structure your own model. Crucially, your model class must implement the `batch_predict` method.
## Configuring Your Model
To ensure your model is recognized and utilized correctly, please specify your model class name in the [`user_config.py`](user_config.py) file, by following the instructions in the inline comments.
@@ -12,12 +12,14 @@ To ensure your model is recognized and utilized correctly, please specify your m
## Model Inputs and Outputs
### Inputs
Your model will receive two pieces of information for every task:
- `prompt` (`str`): This is the specific task's input prompt.
- `batch` (`Dict[str, Any]`): A batch of inputs as a dictionary, where the dictionary has the following key:
- `prompt` (`List[str]`): A list of prompts representing the tasks in a batch.
- `is_multiple_choice` (`bool`): This indicates whether the task is a multiple choice question.
### Outputs
The output from your model's `predict` function should always be a string. Depending on the task, this could be:
The output from your model's `batch_predict` function should be a list of string responses for all the prompts in the input batch.
Depending on the task, each response could be:
- A single integer (in the range [0, 3]) for multiple choice tasks.
- A comma-separated list of integers for ranking tasks.
- A comma-separated list of named entities for Named Entity Recognition (NER) tasks.
from typing import Any, Dict, List
class ShopBenchBaseModel:
def __init__(self):
pass
def predict(self, prompt: str, is_multiple_choice: bool) -> str:
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
raise NotImplementedError("get_batch_size method not implemented")
def batch_predict(self, batch: Dict[str, Any], is_multiple_choice:bool) -> List[str]:
"""
Generates a prediction based on the input prompt and task type.
Generates a batch of predictions based on the associated prompts and task_type.
For multiple choice tasks, it randomly selects a choice.
For other tasks, it returns a list of integers as a string,
representing the model's prediction in a format compatible with task-specific parsers.
Args:
prompt (str): The input prompt for the model.
is_multiple_choice (bool): Indicates whether the task is a multiple choice question.
Parameters:
- batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys
- prompt (List[str]): a list of input prompts for the model.
- is_multiple_choice bool: A boolean flag indicating if all the items in this batch belong to multiple choice tasks.
Returns:
str: The prediction as a string representing a single integer[0, 3] for multiple choice tasks,
List[str]: A list of predictions for each of the prompts received in the batch.
Each prediction is
a string representing a single integer[0, 3] for multiple choice tasks,
or a string representing a comma separated list of integers for Ranking, Retrieval tasks,
or a string representing a comma separated list of named entities for Named Entity Recognition tasks.
or a string representing the (unconstrained) generated response for the generation tasks
from typing import List, Union
import random
import os
import random
from typing import Any, Dict, List
from .base_model import ShopBenchBaseModel
@@ -19,33 +19,55 @@ class DummyModel(ShopBenchBaseModel):
"""Initializes the model and sets the random seed for consistency."""
random.seed(AICROWD_RUN_SEED)
def predict(self, prompt: str, is_multiple_choice: bool) -> str:
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
Generates a prediction based on the input prompt and task type.
self.batch_size = 4
return self.batch_size
def batch_predict(self, batch: Dict[str, Any], is_multiple_choice:bool) -> List[str]:
"""
Generates a batch of predictions based on the associated prompts and task_type.
For multiple choice tasks, it randomly selects a choice.
For other tasks, it returns a list of integers as a string,
representing the model's prediction in a format compatible with task-specific parsers.
Args:
prompt (str): The input prompt for the model.
is_multiple_choice (bool): Indicates whether the task is a multiple choice question.
Parameters:
- batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys
- prompt (List[str]): a list of input prompts for the model.
- is_multiple_choice bool: A boolean flag indicating if all the items in this batch belong to multiple choice tasks.
Returns:
str: The prediction as a string representing a single integer[0, 3] for multiple choice tasks,
List[str]: A list of predictions for each of the prompts received in the batch.
Each prediction is
a string representing a single integer[0, 3] for multiple choice tasks,
or a string representing a comma separated list of integers for Ranking, Retrieval tasks,
or a string representing a comma separated list of named entities for Named Entity Recognition tasks.
or a string representing the (unconstrained) generated response for the generation tasks
Please refer to parsers.py for more details on how these responses will be parsed by the evaluator.
"""
prompts = batch["prompt"]
possible_responses = [1, 2, 3, 4]
if is_multiple_choice:
# Randomly select one of the possible responses for multiple choice tasks
return str(random.choice(possible_responses))
else:
# For other tasks, shuffle the possible responses and return as a string
random.shuffle(possible_responses)
return str(possible_responses)
# Note: As this is dummy model, we are returning random responses for non-multiple choice tasks.
# For generation tasks, this should ideally return an unconstrained string.
batch_response = []
for prompt in prompts:
if is_multiple_choice:
# Randomly select one of the possible responses for multiple choice tasks
batch_response.append(str(random.choice(possible_responses)))
else:
# For other tasks, shuffle the possible responses and return as a string
random.shuffle(possible_responses)
batch_response.append(str(possible_responses))
# Note: As this is dummy model, we are returning random responses for non-multiple choice tasks.
# For generation tasks, this should ideally return an unconstrained string.
return batch_response
@@ -7,6 +7,7 @@ from models.dummy_model import DummyModel
# This approach allows for easier reference to your model class when evaluating your models,
UserModel = DummyModel
# When implementing your own model please follow this pattern:
#
# from models.your_model import YourModel
@@ -17,3 +18,11 @@ UserModel = DummyModel
# Finally, assign YourModel to UserModel as shown below to use it throughout your script.
#
# UserModel = YourModel
# For example, to use the Llama3 8B Instruct baseline, you can uncomment the lines below.
# Please remember to download the model weights and check them into the repository
# before submitting.
# from models.vanilla_llama3_baseline import Llama3_8B_ZeroShotModel
# UserModel = Llama3_8B_ZeroShotModel
import os
import random
from typing import Any, Dict, List
import vllm
from .base_model import ShopBenchBaseModel
#### CONFIG PARAMETERS ---
# Set a consistent seed for reproducibility
AICROWD_RUN_SEED = int(os.getenv("AICROWD_RUN_SEED", 773815))
# Batch size you wish the evaluators will use to call the `batch_generate_answer` function
AICROWD_SUBMISSION_BATCH_SIZE = 16 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
# VLLM Parameters
VLLM_TENSOR_PARALLEL_SIZE = 4 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
VLLM_GPU_MEMORY_UTILIZATION = 0.85 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
class Llama3_8B_ZeroShotModel(ShopBenchBaseModel):
"""
A zero-shot Llama 3 8B Instruct baseline for ShopBench, illustrating how to handle both
multiple choice and other types of tasks like Ranking, Retrieval, and Named Entity Recognition.
This model uses a consistent random seed for reproducible results.
"""
def __init__(self):
"""Initializes the model and sets the random seed for consistency."""
random.seed(AICROWD_RUN_SEED)
self.initialize_models()
def initialize_models(self):
# Initialize Meta Llama 3 - 8B Instruct Model
self.model_name = "models/meta-llama/Meta-Llama-3-8B-Instruct"
if not os.path.exists(self.model_name):
raise Exception(
f"""
The evaluators expect the model weights to be checked into the repository,
but we could not find the model weights at {self.model_name}
Please follow the instructions in the docs below to download and check in the model weights.
https://gitlab.aicrowd.com/aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit/-/blob/master/docs/download-baseline-model-weights.md
"""
)
# initialize the model with vllm
self.llm = vllm.LLM(
self.model_name,
worker_use_ray=True,
tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE,
gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION,
trust_remote_code=True,
dtype="half", # note: bfloat16 is not supported on nvidia-T4 GPUs
enforce_eager=True
)
self.tokenizer = self.llm.get_tokenizer()
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
self.batch_size = AICROWD_SUBMISSION_BATCH_SIZE
return self.batch_size
def batch_predict(self, batch: Dict[str, Any], is_multiple_choice:bool) -> List[str]:
"""
Generates a batch of predictions based on the associated prompts and task_type.
For multiple choice tasks, it randomly selects a choice.
For other tasks, it returns a list of integers as a string,
representing the model's prediction in a format compatible with task-specific parsers.
Parameters:
- batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys
- prompt (List[str]): a list of input prompts for the model.
- is_multiple_choice bool: A boolean flag indicating if all the items in this batch belong to multiple choice tasks.
Returns:
List[str]: A list of predictions for each of the prompts received in the batch.
Each prediction is
a string representing a single integer[0, 3] for multiple choice tasks,
or a string representing a comma separated list of integers for Ranking, Retrieval tasks,
or a string representing a comma separated list of named entities for Named Entity Recognition tasks.
or a string representing the (unconstrained) generated response for the generation tasks
Please refer to parsers.py for more details on how these responses will be parsed by the evaluator.
"""
prompts = batch["prompt"]
# format prompts using the chat template
formatted_prompts = self.format_prompts(prompts)
# set max new tokens to be generated
max_new_tokens = 100
if is_multiple_choice:
max_new_tokens = 1 # For MCQ tasks, we only need to generate 1 token
# Generate responses via vllm
responses = self.llm.generate(
formatted_prompts,
vllm.SamplingParams(
n=1, # Number of output sequences to return for each prompt.
top_p=0.9, # Float that controls the cumulative probability of the top tokens to consider.
temperature=0, # randomness of the sampling
seed=AICROWD_RUN_SEED, # Seed for reproducibility
skip_special_tokens=True, # Whether to skip special tokens in the output.
max_tokens=max_new_tokens, # Maximum number of tokens to generate per output sequence.
),
use_tqdm = False
)
# Aggregate answers into List[str]
batch_response = []
for response in responses:
batch_response.append(response.outputs[0].text)
if is_multiple_choice:
print("MCQ: ", batch_response)
return batch_response
def format_prompts(self, prompts):
"""
Formats prompts using the chat_template of the model.
Parameters:
- prompts (list of str): A list of prompts to be formatted.
"""
system_prompt = "You are a helpful online shopping assistant. Please answer the following question about online shopping and follow the given instructions.\n\n"
formatted_prompts = []
for prompt in prompts:
formatted_prompts.append(system_prompt + prompt)
return formatted_prompts
import ast
from loguru import logger
VERSION = "0.1.1"
MAX_RESPONSE_CHARACTERS = 5000
class ShoppingBenchTaskParsers:
"""
@@ -49,6 +55,9 @@ class ShoppingBenchTaskParsers:
response, str
), f"Response must be a string, but got {type(response)}"
# Consider only the first MAX_RESPONSE_CHARACTERS
response = response[:MAX_RESPONSE_CHARACTERS]
# Attempt to retrieve the appropriate parser method for the task type.
parser_method = task_parser_methods.get(self.task_type)
@@ -73,10 +82,15 @@
An integer representing the selected option. Returns -1 if the parsing fails due to
an invalid response format.
"""
default_response = -1
try:
return int(response.strip()[0])
except ValueError:
return -1
response = response.strip()
return int(response[0])
except Exception as e:
logger.warning(
f"SHOPBENCH_PARSER_WARNING::: Error parsing multichoice response: {e}. Responding with default : {default_response}"
)
return default_response
def _parse_ranking(self, response: str) -> list:
"""
@@ -91,6 +105,7 @@
A list of integers representing the items in ranked order. Limits to the first 5 unique
elements. Returns an empty list if duplicates are found or parsing fails.
"""
default_response = []
# Keep only numeric characters and specific punctuation.
cleaned_response = "".join(
c for c in response if c.isnumeric() or c in [",", " "]
@@ -101,7 +116,9 @@
for item in cleaned_response.split(","):
try:
# Attempt to convert each item to an integer and add it to the list.
ranked_items.append(int(item))
int_item = int(item)
if int_item <= 5: # we know int_item can be at most 5
ranked_items.append(int_item)
except ValueError:
pass # Skip non-numeric items.
@@ -110,7 +127,7 @@
# If there are duplicates, empty the list
if len(ranked_items) != len(set(ranked_items)):
ranked_items = []
            ranked_items = default_response
return ranked_items
def _parse_generation(self, response: str) -> str:
@@ -139,24 +156,30 @@
Returns:
A list of integers representing the first 3 unique retrieved item indices.
"""
# Similar to ranking parser, but only returns the first 3 elements.
cleaned_response = "".join(
c for c in response if c.isnumeric() or c in [",", " "]
)
# Convert to list of integers
response = []
for item in cleaned_response.split(","):
try:
# Attempt to convert each item to an integer and add it to the list.
response.append(int(item))
except ValueError:
pass # Skip non-numeric items.
# consider only the first 3 elements
retrieved_items = response[:3]
default_response = []
try:
# Similar to ranking parser, but only returns the first 3 elements.
cleaned_response = "".join(
c for c in response if c.isnumeric() or c in [",", " "]
)
return retrieved_items
# Convert to list of integers
response = []
for item in cleaned_response.split(","):
try:
# Attempt to convert each item to an integer and add it to the list.
response.append(int(item))
except ValueError:
pass # Skip non-numeric items.
# consider only the first 3 elements
retrieved_items = response[:3]
return retrieved_items
except Exception as e:
logger.warning(
f"SHOPBENCH_PARSER_WARNING::: Error parsing retrieval response: {e}. Responding with default : {default_response}"
)
return default_response
def _parse_named_entity_recognition(self, response: str) -> list:
"""
@@ -182,78 +205,124 @@
raise SyntaxError(
"Unexpected Syntax error - fall back to comma separated list."
)
except (SyntaxError, ValueError):
except Exception as e:
# Fallback: split the string by commas and strip whitespace.
return [entity.strip() for entity in response.split(",")]
            # Empty entities are removed; this will not cause issues, it is simply an implementation choice.
return [
entity.strip()
for entity in response.split(",")
if entity.strip() != ""
]
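    # Illustrative parse paths (hypothetical inputs):
    #   "['Entity A', 'Entity B']" -> ast.literal_eval succeeds -> ['Entity A', 'Entity B']
    #   "Entity A, Entity B"       -> literal_eval raises -> comma-split fallback -> ['Entity A', 'Entity B']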
import unittest
class TestShoppingBenchTaskParsers(unittest.TestCase):
def test_multichoice(self):
parser = ShoppingBenchTaskParsers("multichoice")
# Check for a valid numeric response
self.assertEqual(parser.parse("2"), 2)
# Check for an invalid (alphabetic) response, expecting failure code -1
self.assertEqual(parser.parse("a"), -1)
# Check handling of newline-only input, expecting failure code -1
self.assertEqual(parser.parse("\n"), -1)
# Check handling of space-only input, expecting failure code -1
self.assertEqual(parser.parse(" "), -1)
# Check handling of leading space before a valid response
self.assertEqual(parser.parse(" 2"), 2)
# Check handling of newline before a valid response
self.assertEqual(parser.parse("\n1"), 1)
# Check for newline and space before a valid response
self.assertEqual(parser.parse("\n 3"), 3)
# Check for newline and space only, expecting failure code -1
self.assertEqual(parser.parse("\n "), -1)
def test_ranking(self):
parser = ShoppingBenchTaskParsers("ranking")
# Basic successful parse of a comma-separated list of numbers
self.assertEqual(parser.parse("1, 2, 3, 4, 5"), [1, 2, 3, 4, 5])
# Successfully parses even when wrapped in square brackets
self.assertEqual(parser.parse("[1, 2, 3, 4, 5]"), [1, 2, 3, 4, 5])
# Fails (empty list) when numbers are repeated
self.assertEqual(parser.parse("1, 2, 2, 3"), [])
# Filters out non-numeric values correctly, keeping the valid numbers
self.assertEqual(parser.parse("1, 2, 4, aicrowd, 5"), [1, 2, 4, 5])
# Check handling of newline-only input, expecting empty list
self.assertEqual(parser.parse("\n"), [])
# Check handling of space and newline input, expecting empty list
self.assertEqual(parser.parse(" \n"), [])
# Parses numbers correctly even when prefixed by non-numeric text
self.assertEqual(
parser.parse("The answer is: 1, 2, 3, 4, 5"), [1, 2, 3, 4, 5]
)
# Correctly handles a leading comma
self.assertEqual(parser.parse(",1,2,3,4,5"), [1, 2, 3, 4, 5])
# Fails (empty list) when numbers are not comma-separated
self.assertEqual(parser.parse("1 2"), [])
def test_generation(self):
parser = ShoppingBenchTaskParsers("generation")
# Verifies correct response without modification
self.assertEqual(
parser.parse("This is a generated response."),
"This is a generated response.",
)
# Handles and trims extraneous newlines and spaces correctly
self.assertEqual(
parser.parse("\nThe answer is \n\n good.\n\n\n\n\n\n\n"),
"The answer is \n\n good.",
)
# Correctly returns empty string for newline and space-only inputs
self.assertEqual(parser.parse("\n \n"), "")
def test_retrieval(self):
parser = ShoppingBenchTaskParsers("retrieval")
# Basic successful parse of a comma-separated list of numbers
self.assertEqual(parser.parse("100, 200, 300"), [100, 200, 300])
# Successfully handles shorter than expected input lists
self.assertEqual(parser.parse("100, 200"), [100, 200])
# Filters out non-numeric values correctly, keeping the valid numbers
self.assertEqual(parser.parse("100, 200, jjhg"), [100, 200])
# Correctly parses numbers despite excessive spacing and newlines
self.assertEqual(
parser.parse("100, 200, \n\n\n 300"), [100, 200, 300]
)
# Limits output to first three elements if more are provided
self.assertEqual(parser.parse("100, 200, 300, 400"), [100, 200, 300])
# Correctly handles newline before valid input
self.assertEqual(parser.parse("\n 100, 200, 300"), [100, 200, 300])
# Returns empty list for newline-only inputs
self.assertEqual(parser.parse("\n \n \n"), [])
def test_named_entity_recognition(self):
parser = ShoppingBenchTaskParsers("named_entity_recognition")
# Successfully parses a list of strings, correctly interpreting them as separate entities
self.assertEqual(
parser.parse("['New York', 'ShopBench', 'Amazon']"),
["New York", "ShopBench", "Amazon"],
)
# Successfully parses comma-separated entities without brackets or quotes
self.assertEqual(
parser.parse("New York, ShopBench, Amazon"),
["New York", "ShopBench", "Amazon"],
)
# Incorrectly includes the opening bracket in the first entity and the closing bracket in the last entity,
# indicating an unintentional parsing error with brackets when quotes are not used.
self.assertEqual(
parser.parse("[New York, ShopBench, Amazon]"),
["[New York", "ShopBench", "Amazon]"],
)
# Correctly parses entities even when the input starts with a newline and a comma, trimming unnecessary characters
self.assertEqual(
parser.parse("\n, New York, ShopBench"), ["New York", "ShopBench"]
)
# Returns an empty list when parsing only a space, indicating no entities found
self.assertEqual(parser.parse(" "), [])
# Returns an empty list for inputs consisting only of newlines and spaces, indicating no entities found
self.assertEqual(parser.parse("\n \n"), [])
if __name__ == "__main__":
# Example usage of the ShoppingBenchTaskParsers class for various task types.
# MULTICHOICE EXAMPLE
multic_choice_parser = ShoppingBenchTaskParsers("multichoice")
print("Multichoice Example:")
print(multic_choice_parser.parse("2")) # Expected output: 2
print(
multic_choice_parser.parse("a")
) # Expected output (failure case): -1
print()
# RANKING EXAMPLE
ranking_parser = ShoppingBenchTaskParsers("ranking")
print("Ranking Example:")
print(
ranking_parser.parse("1, 2, 3, 4, 5")
) # Expected output: [1, 2, 3, 4, 5]
print(
ranking_parser.parse("[1, 2, 3, 4, 5]")
) # Expected output: [1, 2, 3, 4, 5] - tolerant to [, ]
print(
ranking_parser.parse("1, 2, 2, 3")
) # Expected output (failure case): [] # because of repeating numbers
print(
ranking_parser.parse("1, 4, 5, aicrowd, 6")
    )  # Expected output: [1, 4, 5] - non-numeric items are dropped and values above 5 are filtered out
print()
# GENERATION EXAMPLE
generation_parser = ShoppingBenchTaskParsers("generation")
print("Generation Example:")
print(
generation_parser.parse("This is a generated response")
    )  # Expected output: 'This is a generated response'
print()
# RETRIEVAL EXAMPLE
retrieval_parser = ShoppingBenchTaskParsers("retrieval")
print("Retrieval Example:")
print(
retrieval_parser.parse("100, 200, 300")
) # Expected output: [100, 200, 300]
print(
retrieval_parser.parse("100, 200")
) # Expected output (shorter than 3): [100, 200]
print(
retrieval_parser.parse("100, 200, jjhg")
    )  # Expected output (non-numeric items removed): [100, 200]
print(
retrieval_parser.parse("100, 200, 300, 400")
) # Expected output (only consider first 3 elems): [100, 200, 300]
print()
# NAMED ENTITY RECOGNITION EXAMPLE
ner_parser = ShoppingBenchTaskParsers("named_entity_recognition")
print("Named Entity Recognition Example:")
print(
ner_parser.parse("['New York', 'ShopBench', 'Amazon']")
) # Expected output: ['New York', 'ShopBench', 'Amazon']
print(
ner_parser.parse("New York, ShopBench, Amazon")
) # Expected output: ['New York', 'ShopBench', 'Amazon']
print(
ner_parser.parse("[New York, ShopBench, Amazon]")
    )  # Failure case - not tolerant to '[' and ']' when quotes are not used:
    # the brackets remain attached to the boundary elements.
    # Expected output: ['[New York', 'ShopBench', 'Amazon]']
unittest.main()
torch
\ No newline at end of file
torch
vllm>=0.4.2
loguru
@@ -3,5 +3,5 @@ pandas
sentence-transformers
rouge_score
evaluate
sacrebleu
sacrebleu[ja]
\ No newline at end of file
sacrebleu==2.4.1
sacrebleu[ja]