Compare revisions

Commits on Source (66)
Showing with 999 additions and 289 deletions
.git/
models/**
data/
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive \
LANG=en_US.UTF-8 \
LANGUAGE=en_US:en \
LC_ALL=en_US.UTF-8 \
USER_NAME=aicrowd \
HOME_DIR=/home/aicrowd \
CONDA_DIR=/home/aicrowd/.conda \
PATH=/home/aicrowd/.conda/bin:${PATH} \
SHELL=/bin/bash
# Install system dependencies and clean up in one layer
COPY apt.txt /tmp/apt.txt
RUN apt -qq update && apt -qq install -y --no-install-recommends `cat /tmp/apt.txt | tr -d '\r'` locales wget build-essential \
&& locale-gen en_US.UTF-8 \
&& rm -rf /var/cache/apt/* /var/lib/apt/lists/* \
&& apt clean
# Set up user
RUN groupadd -g 1001 aicrowd && \
useradd -m -s /bin/bash -u 1001 -g aicrowd -G sudo aicrowd
USER ${USER_NAME}
WORKDIR ${HOME_DIR}
# Install Miniconda and Python packages. You can change the python version by using another Miniconda.
RUN wget -nv -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_22.11.1-1-Linux-x86_64.sh \
&& bash miniconda.sh -b -p ${CONDA_DIR} \
&& . ${CONDA_DIR}/etc/profile.d/conda.sh \
&& conda install cmake -y \
&& conda clean -y -a \
&& rm -rf miniconda.sh
COPY --chown=1001:1001 requirements.txt ${HOME_DIR}/requirements.txt
RUN pip install -r requirements.txt --no-cache-dir
COPY --chown=1001:1001 requirements_eval.txt ${HOME_DIR}/requirements_eval.txt
RUN pip install -r requirements_eval.txt --no-cache-dir
## Add your custom commands below
![AMAZON KDD CUP 2024: MULTI-TASK ONLINE SHOPPING CHALLENGE FOR LLMS](https://images.aicrowd.com/raw_images/challenges/social_media_image_file/1139/566667103918dae81381.jpg)
![AMAZON KDD CUP 2024: MULTI-TASK ONLINE SHOPPING CHALLENGE FOR LLMS](https://aicrowd-production.s3.eu-central-1.amazonaws.com/challenge_images/amazon-kdd-cup-2024/amazon-kdd-cup-24-banner.jpg)
[![Discord](https://img.shields.io/discord/565639094860775436.svg)](https://discord.gg/yWurtB2huX)
# 🛒 [Amazon KDD CUP 2024: Multi-Task Online Shopping Challenge for LLMs](https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms) Starter Kit
@@ -55,7 +55,9 @@ The development datasets will be given in json format with the following fields.
- `input_field`: This field contains the instructions and the question that should be answered by the model.
- `output_field`: This field contains the ground truth answer to the question.
- `task_type`: This field contains the type of the task (Details in the next Section, "Tasks")
- `task_name`: This field contains the name of the task. However, the exact task names are redacted, and we only provide participants with hashed task names (e.g. `task1`, `task2`).
- `metric`: This field contains the metric used to evaluate the question (Details in Section "Evaluation Metrics").
- `track`: This field specifies the track the question comes from.
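For illustration, a single development record could look like the example below; the field names match the list above, while every value is an invented placeholder rather than real challenge data:

```json
{
    "input_field": "Which of the following categories best matches the query 'wireless earbuds'? 0. Kitchen, 1. Electronics, 2. Apparel, 3. Garden",
    "output_field": "1",
    "task_type": "multiple-choice",
    "task_name": "task3",
    "metric": "accuracy",
    "track": "track-placeholder"
}
```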
However, the test dataset (which will be hidden from participants) will have a different format with only two fields:
- `input_field`, which is the same as above.
@@ -116,18 +118,18 @@ Please follow the instructions in [models/README.md](models/README.md) for instr
1. **Add your SSH key** to AIcrowd GitLab
You can add your SSH Keys to your GitLab account by going to your profile settings [here](https://gitlab.aicrowd.com/profile/keys). If you do not have SSH Keys, you will first need to [generate one](https://docs.gitlab.com/ee/ssh/README.html#generating-a-new-ssh-key-pair).
You can add your SSH Keys to your GitLab account by going to your profile settings [here](https://gitlab.aicrowd.com/-/profile/keys). If you do not have SSH Keys, you will first need to [generate one](https://docs.gitlab.com/ee/user/ssh.html).
2. **Fork the repository**. You can use [this link](https://gitlab.aicrowd.com/aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit/-/forks/new) to create a fork.
2. **Clone the repository**
3. **Clone the repository**
```bash
git clone git@gitlab.aicrowd.com:aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit.git
git clone git@gitlab.aicrowd.com:<YOUR-AICROWD-USER-NAME>/amazon-kdd-cup-2024-starter-kit.git
cd amazon-kdd-cup-2024-starter-kit
```
3. **Install** competition specific dependencies!
4. **Install** competition specific dependencies!
```bash
cd amazon-kdd-cup-2024-starter-kit
pip install -r requirements.txt
@@ -135,13 +137,13 @@ You can add your SSH Keys to your GitLab account by going to your profile settin
pip install -r requirements_eval.txt
```
4. Write your own model as described in [How to write your own model](#how-to-write-your-own-model) section.
5. Write your own model as described in [How to write your own model](#how-to-write-your-own-model) section.
5. Test your model locally using `python local_evaluation.py`.
6. Test your model locally using `python local_evaluation.py`.
6. Accept the Challenge Rules on the main [challenge page](https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms) by clicking on the **Participate** button. Also accept the Challenge Rules on the Task specific page (link on the challenge page) that you want to submit to.
7. Accept the Challenge Rules on the main [challenge page](https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms) by clicking on the **Participate** button. Also accept the Challenge Rules on the Task specific page (link on the challenge page) that you want to submit to.
7. Make a submission as described in [How to make a submission](#-how-to-make-a-submission) section.
8. Make a submission as described in [How to make a submission](#-how-to-make-a-submission) section.
## 📮 How to make a submission?
@@ -153,8 +155,22 @@ This also includes instructions on [specifying your software runtime](docs/submi
## 💻 What hardware does my code run on ?
You can find more details about the hardware and system configuration in [docs/hardware-and-system-config.md](docs/hardware-and-system-config.md).
In summary, we provide you `2` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 1; and `4` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 2.
In summary, we provide you `4` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 2.
Your solution will be given a certain amount of time for inference, after which it will be immediately killed and no results will be available. The time limits are set as follows:
| Phase | Track 1 | Track 2 | Track 3 | Track 4 | Track 5 |
| ------ | ------- | ------- | ------- | ------- | ------- |
| **Phase 2**| 70 minutes | 20 minutes | 30 minutes | 20 minutes | 140 minutes |
For reference, the baseline solution with zero-shot LLaMA3-8B-Instruct consumes the following amounts of time:
| Phase | Track 1 | Track 2 | Track 3 | Track 4 |
| ------ | ------- | ------- | ------- | ------- |
| **Phase 2**| 1490s | 397s | 576s | 359s |
We limit the prediction time to at most **10 seconds per sample**. This limit applies at the batch level; for example, for a batch of 8 samples, you should return the predictions within at most 80 seconds. Otherwise, your submission will be killed.
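As a local sanity check, a minimal sketch (not part of the starter kit) of how you might watch this budget while testing is shown below; the helper name is invented, while `batch_predict` and the `prompt` key come from the model interface described in `models/README.md`. The evaluator enforces the actual limit itself.

```python
import time

def run_batch_with_budget(model, batch, is_multiple_choice, per_sample_budget_s=10):
    """Call batch_predict and warn locally if the per-sample time budget is exceeded."""
    start = time.monotonic()
    responses = model.batch_predict(batch, is_multiple_choice)
    elapsed = time.monotonic() - start
    budget = per_sample_budget_s * len(batch["prompt"])  # budget scales with the batch size
    if elapsed > budget:
        print(f"WARNING: batch took {elapsed:.1f}s, exceeding the {budget:.1f}s budget")
    return responses
```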
Your maximum repo size is 200GB.
## 🧩 How are my model responses parsed by the evaluators ?
Please refer to [parsers.py](parsers.py) for more details on how we parse your model responses.
git
#!/bin/bash
# This script builds a Docker image from the current directory
# and runs a container from this image, executing local_evaluation.py
# with the current directory mounted at /submission inside the container.
# Step 1: Define the name of the Docker image.
LAST_COMMIT_HASH=$(git rev-parse --short HEAD)
IMAGE_NAME="aicrowd/amazon-kddcup24-submission:${LAST_COMMIT_HASH}"
# Step 2: Build the Docker image.
# The '.' at the end specifies that the Docker context is the current directory.
# This means Docker will look for a Dockerfile in the current directory to build the image.
START_TIME=$(date +%s)
DOCKER_BUILDKIT=1 docker build -t $IMAGE_NAME .
BUILD_STATUS=$?
if [ $BUILD_STATUS -ne 0 ]; then
echo "Docker build failed. Exiting..."
exit $BUILD_STATUS
fi
END_TIME=$(date +%s)
BUILD_TIME=$((END_TIME - START_TIME))
echo "Total build time: $BUILD_TIME seconds"
# Step 3: Run the Docker container.
# -v "$(pwd)":/submission mounts the current directory ($(pwd) outputs the current directory path)
# to /submission inside the container. This way, the container can access the contents
# of the current directory as if they were located at /submission inside the container.
# 'python /submission/local_evaluation.py' is the command executed inside the container.
# the -w flag sets the working directory to /submission.
# It then runs local_evaluation.py using the software runtime set up in the Dockerfile.
docker run \
--gpus all \
-v "$(pwd)":/submission \
-w /submission \
--shm-size=10.24gb \
$IMAGE_NAME python local_evaluation.py
# Note: We assume you have nvidia-container-toolkit installed and configured
# to use the --gpus all flag. If you are not using GPUs, you can remove this flag.
# Note 1: Please refer to the Dockerfile to understand how the software runtime is set up.
# The Dockerfile should include all necessary commands to install Python, the necessary
# dependencies, and any other software required to run local_evaluation.py.
# Note 2: Note the .dockerignore file in the root of this directory.
# In the .dockerignore file, specify any files or directories that should not be included
# in the Docker context. This typically includes large files, models, or datasets that
# are not necessary for building the Docker image. Excluding these can significantly
# speed up the build process by reducing the size of the build context sent to the Docker daemon.
# Ensure your Dockerfile and .dockerignore are properly set up before running this script.
### Setting Up and Downloading Baseline Model Weights with Hugging Face
This guide outlines the steps to download (and check in) the model weights required for the baseline models.
We will focus on `Meta-Llama-3-8B-Instruct`,
but the steps should work equally well for any other model on Hugging Face.
#### Preliminary Steps:
1. **Install the Hugging Face Hub Package**:
Begin by installing the `huggingface_hub` package, which includes the `hf_transfer` utility, by running the following command in your terminal:
```bash
pip install huggingface_hub[hf_transfer]
```
2. **Accept the LLaMA Terms**:
You must accept the LLaMA model's terms of use by visiting: [meta-llama/Meta-Llama-3-8B-Instruct Terms](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
3. **Create a Hugging Face CLI Token**:
Generate a CLI token by navigating to: [Hugging Face Token Settings](https://huggingface.co/settings/tokens). You will need this token for authentication.
#### Hugging Face Authentication:
1. **Login via CLI**:
Authenticate yourself with the Hugging Face CLI using the token created in the previous step. Run:
```bash
huggingface-cli login
```
When prompted, enter the token.
#### Model Downloads:
1. **Download the Meta-Llama-3-8B-Instruct Model**:
Execute the following command to download the `Meta-Llama-3-8B-Instruct` model to a local subdirectory. This command excludes unnecessary files to save space:
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir-use-symlinks False \
--local-dir models/meta-llama/Meta-Llama-3-8B-Instruct \
--exclude "*.pth" # These are alternates to the safetensors and hence not needed
```
#### Version Control with Git LFS:
1. **Track Model Weights**:
Use Git Large File Storage (LFS) to track the model directories. This ensures efficient handling of large files:
```bash
git lfs track "models/meta-llama/*"
```
2. **Commit and Push**:
Add the models to your Git repository, commit the changes, and push them to your remote repository:
```bash
git add models/
git commit -am "add weights"
git push origin master
```
If you are struggling with Git LFS, you are very much encouraged to check out [this post](https://discourse.aicrowd.com/t/how-to-upload-large-files-size-to-your-submission/2304).
@@ -11,18 +11,19 @@ We apply a limit on the hardware available to each participant to run their solu
- `40` x vCPU (`20` physical CPU cores)
- `180GB` RAM
**Note**: When running in `gpu:false` mode, you will have access to `4` x vCPUs (`2` physical cores) and `8GB` RAM.
Please note that the NVIDIA T4 uses a somewhat outdated architecture and is thus not compatible with certain acceleration toolkits (e.g. Flash Attention), so please be careful about compatibility.
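As an example, the small sketch below (not part of the starter kit) checks the GPU compute capability before enabling such optimizations; the T4 reports capability 7.5, while FlashAttention-2 requires Ampere (capability 8.0) or newer.

```python
import torch

# Gate optional accelerations on the detected device: the T4 (Turing) reports
# compute capability 7.5, below the 8.0 required by FlashAttention-2.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    use_flash_attention = (major, minor) >= (8, 0)
else:
    use_flash_attention = False
print(f"FlashAttention-2 enabled: {use_flash_attention}")
```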
In addition, the following restrictions will also be imposed:
- Network connection will be disabled (except for HuggingFace to download open-source checkpoints).
- Each submission will be assigned a certain amount of time to run. Submissions that exceed the time limits will be killed and will not be evaluated. The tentative time limit is set as follows **[TO BE TESTED WITH AICROWD SUBMISSION SYSTEM]**.
- Network connection will be disabled.
- Each submission will be assigned a certain amount of time to run. Submissions that exceed the time limits will be killed and will not be evaluated. The tentative time limit is set as follows.
| Phase | Track 1 | Track 2 | Track 3 | Track 4 | Track 5 |
| ------ | ------- | ------- | ------- | ------- | ------- |
| **Phase 1**| 140 minutes | 40 minutes | 60 minutes | 60 minutes | 5 hours |
- Each team will be able to make up to **4 submissions per week**, with a maximum of **2 Track 5 all-around submissions** **[TO BE TESTED WITH AICROWD SUBMISSION SYSTEM]**.
- Each team will be able to make up to **2 submissions per week** per track for Tracks 1-4, and **1 submission per week** for track 5 all-around.
Based on the hardware and system configuration, we recommend that participants begin with 7B models. According to our experiments, 7B models like Vicuna-7B and Mistral-7B can run inference smoothly on 2 NVIDIA T4 GPUs, while 13B models will result in out-of-memory (OOM) errors.
@@ -17,11 +17,13 @@ Few of the most common ways are as follows:
[...]
```
We suggest keeping `requirements.txt` to a minimum, with only the necessary packages in it. The more (unnecessary) packages you put in it, the more likely you are to encounter an error from some (maybe totally unnecessary) package.
* `apt.txt` -- The Debian packages (installed via apt) used by your inference code!
These files are used to construct your **AIcrowd submission docker containers** in which your code will run.
* `Dockerfile` -- **For advanced users only**. `Dockerfile` gives you more flexibility on defining the software runtime used during evaluations.
* `Dockerfile` -- `Dockerfile` gives you more flexibility on defining the software runtime used during evaluations. The `Dockerfile` under the root path of the starter kit will be used to build your solution. Feel free to modify anything in it, and test it locally.
----
@@ -20,9 +20,9 @@ This document is designed to assist you in making your initial submission smooth
Our platform supports custom runtime environments. This means you have the flexibility to choose any libraries or frameworks necessary for your project. Here’s how you can specify your runtime and dependencies:
- **`requirements.txt`**: List any PyPI packages your project needs.
- **`requirements.txt`**: List any PyPI packages your project needs. **Do specify versions, as we observe significant differences in inference time between different `transformers` versions** (see the illustrative example after this list).
- **`apt.txt`**: Include any apt packages required.
- **`Dockerfile`**: Optionally, you can provide your own Dockerfile. An example is located at `utilities/_Dockerfile`, which can serve as a helpful starting point.
- **`Dockerfile`**: The one located at the root will be used by default to build your submission. **You can specify the Python version here if you need a specific one.**
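For instance, a hypothetical pinned `requirements.txt` could look like the block below; the package names and version numbers are placeholders chosen for illustration, not the official baseline dependency set:

```
torch==2.1.2
transformers==4.40.0
vllm==0.4.1
sentence-transformers==2.7.0
```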
For detailed setup instructions regarding runtime dependencies, refer to the documentation in the `docs/runtime.md` file.
@@ -32,18 +32,21 @@ Your project should follow the structure outlined in the starter kit. Here’s a
```
.
├── .dockerignore # Please specify the paths to your model checkpoints so that the large files won't be built into the docker image.
├── README.md # Project documentation and setup instructions
├── aicrowd.json # Submission meta information - like your username, track name
├── data
│ └── development.json # Development dataset local testing
├── docs
│ └── runtime.md # Documentation on the runtime environment setup, dependency confifgs
│ └── runtime.md # Documentation on the runtime environment setup, dependency configs
├── Dockerfile # The Dockerfile that will be used to build your submission and all dependencies. The default one will work fine, but you can write your own.
├── docker_run.sh # This script builds your submission locally and calls `local_evaluation.py`. It can be used to debug (if your submission fails to build).
├── local_evaluation.py # Use this to check your model evaluation flow locally
├── metrics.py # Scripts to calculate evaluation metrics for your model's performance
├── models
│ ├── README.md # Documentation specific to the implementation of model interfaces
│ ├── base_model.py # Base model class
│ ├── dummy_model.py # A simple or placeholder model for demonstration or testing
│ ├── dummy_model.py # A simple or placeholder model for demonstration or testing. We also implement a simple Vicuna-7B baseline here.
│ └── user_config.py # IMPORTANT: Configuration file to specify your model
├── parsers.py # Model output parser
├── requirements.txt # Python packages to be installed for model development
@@ -52,7 +55,7 @@ Your project should follow the structure outlined in the starter kit. Here’s a
└── _Dockerfile # Example Dockerfile for specifying runtime via Docker
```
Remember, **your submission metadata JSON (`aicrowd.json`)** is crucial for mapping your submission to the challenge. Ensure it contains the correct `challenge_id`, `authors`, and other necessary information. To utilize GPUs, set the `"gpu": true` flag in your `aicrowd.json`.
Remember, **your submission metadata JSON (`aicrowd.json`)** is crucial for mapping your submission to the challenge. Ensure it contains the correct `challenge_id`, `authors`, and other necessary information. **To utilize GPUs, set the `"gpu": true` flag in your `aicrowd.json`.**
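A hypothetical `aicrowd.json` is sketched below; the field names follow the description above, but treat the values as placeholders and copy the exact track-specific `challenge_id` from the file shipped with the starter kit:

```json
{
    "challenge_id": "amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms",
    "authors": ["your-aicrowd-username"],
    "gpu": true
}
```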
## Submitting to Different Tracks
@@ -112,10 +115,12 @@ For more information on how to upload large files to your submission and detaile
To submit your code, push a tag beginning with "submission-" to your repository on [GitLab](https://gitlab.aicrowd.com/). Follow these steps to make a submission:
This assumes you have already cloned the repo by following the instructions [here](../README.md#setup) and made your changes.
1. Commit your changes with `git commit -am "Your commit message"`.
2. Tag your submission (e.g., `git tag -am "submission-v0.1" submission-v0.1`).
3. Push your changes and tags to the AIcrowd repository (replace `<YOUR_AICROWD_USER_NAME>` with your actual username).
3. Push your changes and tags to the AIcrowd repository (e.g. `git push origin submission-v0.1`)
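Putting the steps above together (the tag name is only an example, and this assumes your default branch is `master`):

```bash
git commit -am "Your commit message"
git tag -am "submission-v0.1" submission-v0.1
git push origin master
git push origin submission-v0.1
```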
After pushing your tag, you can view your submission details at `https://gitlab.aicrowd.com/<YOUR_AICROWD_USER_NAME>/amazon-kdd-cup-2024-starter-kit/issues`.
After pushing your tag, you can view your submission details at `https://gitlab.aicrowd.com/<YOUR-AICROWD-USER-NAME>/amazon-kdd-cup-2024-starter-kit/issues`. It may take about **30 minutes** for each submission to build and begin evaluation, so please be patient.
Ensure your `aicrowd.json` is correctly filled with the necessary metadata, and you've replaced `<YOUR_AICROWD_USER_NAME>` with your GitLab username in the provided URL.
Ensure your `aicrowd.json` is correctly filled with the necessary metadata, and you've replaced `<YOUR-AICROWD-USER-NAME>` with your GitLab username in the provided URL.
import pandas as pd
from tqdm import tqdm
import torch
import numpy as np
import os
from sentence_transformers import SentenceTransformer
import metrics
import numpy as np
import pandas as pd
import parsers
import torch
from tqdm import tqdm
VERSION = "0.1.0"
def print_sample(idx, generation, truth, metric, score):
@@ -51,18 +52,36 @@ def generate_model_outputs(data_df, model):
- A list containing the model outputs for each entry in the data DataFrame.
"""
outputs = []
for _, row in tqdm(
data_df.iterrows(), total=len(data_df), desc="Generating Responses"
):
is_multiple_choice = row["task_type"] == "multiple-choice"
prompt = row["input_field"]
model_output = model.predict(prompt, is_multiple_choice)
outputs.append(model_output)
return outputs
task_grouped_df = data_df.groupby(by=["task_type"])
for task_type, task_group_data_df in task_grouped_df:
task_group_data_df = task_group_data_df.reset_index(drop=True)
is_multiple_choice = task_type[0] == "multiple-choice"
batch_size = model.get_batch_size()
batches = [task_group_data_df[i:i+batch_size] for i in range(0,len(task_group_data_df),batch_size)]
for batch_df in batches:
batch = {
"prompt": batch_df["input_field"].tolist(),
}
model_output = model.batch_predict(
batch,
is_multiple_choice
)
outputs.append(
pd.DataFrame({
"input_field": batch["prompt"],
"model_output_str": model_output
}))
df_outputs = pd.concat(outputs)
return df_outputs
# Function to evaluate the generated model outputs
def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
def evaluate_outputs(data_df, log_every_n_steps=1):
"""
Evaluate the model outputs against ground truth values using specified metrics.
@@ -81,21 +100,18 @@ def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
for row_idx, row in tqdm(
data_df.iterrows(), total=len(data_df), desc="Evaluating"
):
task_type, metric, ground_truth = (
task_name, task_type, metric, ground_truth, model_output_str = (
row["task_name"],
row["task_type"],
row["metric"],
row["output_field"],
row["model_output_str"],
)
if metric not in eval_methods:
raise NotImplementedError(f"No metric for {metric=}")
task_name = f"{task_type}---{metric}"
# Note: In practice, here we are using the task_type-metric pair as a unique identifier, calling it as the task_name.
# During the actual evaluations, the task names are more semantically defined, meaning, there could be multiple tasks
# with the same task_type and metric.
model_output = task_parsers[task_type].parse(outputs[row_idx])
model_output = task_parsers[task_type].parse(model_output_str)
eval_fn = eval_methods[metric]
metric_score = eval_fn(model_output, ground_truth)
@@ -108,9 +124,9 @@ def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
per_task_metrics[task_name]["sample_score"].append(metric_score)
if row_idx % log_every_n_steps == 0:
if (row_idx + 1) % log_every_n_steps == 0:
print_sample(
row_idx, model_output, ground_truth, metric, metric_score
row_idx + 1, model_output, ground_truth, metric, metric_score
)
return per_task_metrics
@@ -143,7 +159,7 @@ def aggregate_scores(per_task_metrics):
overall_score = (
np.mean(sample_scores)
if metric != "micro f1"
else metrics.compute_f1_score(sample_scores)
else metrics.calculate_f1_score(sample_scores)
)
overall_metrics["task_name"].append(task_name)
@@ -163,26 +179,28 @@ def get_evaluation_methods():
Returns:
- A dictionary mapping metric names to their respective evaluation functions.
"""
device = "cuda" if torch.cuda.is_available() else "cpu"
sentence_all_lm = SentenceTransformer("all-MiniLM-L6-v2").to(device)
sentence_multilingual = SentenceTransformer(
"paraphrase-multilingual-MiniLM-L12-v2"
).to(device)
return {
"accuracy": metrics.accuracy,
"hit rate@3": metrics.hit_rate_3,
"rougel": metrics.rougel,
"sent-transformer": lambda g, t: metrics.sent_transformer(
g, t, sentence_all_lm
"accuracy": metrics.calculate_per_sample_accuracy,
"hit rate@3": metrics.calculate_hit_rate_3,
"rougel": metrics.calculate_rougel,
"sent-transformer": lambda generated_text, reference_texts: metrics.calculate_cosine_similarity(
generated_text=generated_text,
reference_texts=reference_texts,
model_name="all-MiniLM-L6-v2",
),
"multilingual-sent-transformer": lambda generated_text, reference_texts: metrics.calculate_cosine_similarity(
generated_text=generated_text,
reference_texts=reference_texts,
model_name="paraphrase-multilingual-MiniLM-L12-v2",
),
"multilingual-sent-transformer": lambda g, t: metrics.sent_transformer(
g, t, sentence_multilingual
"micro f1": metrics.calculate_true_positive_false_positives_false_negatives,
"ndcg": metrics.calculate_ndcg,
"bleu": metrics.calculate_bleu_score,
"jp-bleu": lambda generated_text, reference_text: metrics.calculate_bleu_score(
generated_text=generated_text,
reference_text=reference_text,
is_japanese=True,
),
"micro f1": metrics.tp_fp_fn,
"ndcg": metrics.ndcg_eval,
"bleu": metrics.bleu,
"jp-bleu": lambda g, t: metrics.bleu(g, t, jp=True),
}
@@ -208,14 +226,14 @@ def get_task_parsers():
# Main execution function to load data, generate model outputs, evaluate, and aggregate scores
def main():
# Load development data
# Please download the development data from : https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/dataset_files
# Please download the development data from : https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms/dataset_files
# and place it at: ./data/development.json
DATA_FILENAME = "./data/development.json"
if not os.path.exists(DATA_FILENAME):
raise FileNotFoundError(
f"Development data file not found at {DATA_FILENAME}."
"Please download the development data from : https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/dataset_files"
"Please download the development data from : https://www.aicrowd.com/challenges/amazon-kdd-cup-2024-multi-task-online-shopping-challenge-for-llms/dataset_files"
"and place it at: ./data/development.json"
)
@@ -229,14 +247,15 @@ def main():
model = UserModel()
# Generate model outputs
outputs = generate_model_outputs(data_df, model)
data_df["outputs"] = (
outputs # Optional: Add outputs back to DataFrame for inspection
)
print(data_df.head())
df_outputs = generate_model_outputs(data_df, model)
# add outputs to the data_df
merged_data_df = pd.merge(data_df, df_outputs, on="input_field")
print(merged_data_df.head())
# Evaluate the generated outputs and calculate metrics
per_task_metrics = evaluate_outputs(data_df, outputs)
per_task_metrics = evaluate_outputs(merged_data_df)
# Aggregate and display the evaluation scores
overall_metrics = aggregate_scores(per_task_metrics)
import os
from typing import List, Tuple, Union
import evaluate
import numpy as np
import torch
from loguru import logger
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
import numpy as np
import evaluate
from typing import List
sacrebleu = None
sentence_transformer_model_cache = {}
def calculate_per_sample_accuracy(prediction: int, truth: int) -> bool:
"""
Computes the accuracy of a single prediction.
def accuracy(prediction: int, truth: int):
This function checks if a given prediction matches the ground truth.
Parameters:
- prediction (int): The predicted value.
- truth (int): The actual ground truth value.
Returns:
- bool: True if the prediction matches the truth, False otherwise.
"""
return prediction == truth
def hit_rate_3(retrieved_int: List[int], truth: List[int]):
def calculate_hit_rate_3(retrieved_int: List[int], truth: List[int]) -> float:
"""
Calculates the hit rate within the top 3 retrieved integers.
This function assesses how many of the truth integers are present
within the first three elements of the retrieved list of integers.
Parameters:
- retrieved_int (List[int]): The list of retrieved integers, ordered by relevance.
- truth (List[int]): The list of ground truth integers.
Returns:
- float: The hit rate, calculated as the proportion of truth integers found
in the top 3 retrieved integers, relative to the total number of truth integers.
"""
# Calculate the number of hits within the top 3 retrieved integers
hit = len(set(truth).intersection(set(retrieved_int[:3])))
hit /= len(truth)
return hit
# Normalize the hit count by the total number of truth integers to get the hit rate
hit_rate = hit / len(truth)
return hit_rate
def calculate_rougel(generation: str, truth: str) -> float:
"""
Calculates the ROUGE-L F-measure score between a generated string and the truth string.
def rougel(generation: str, truth: str):
ROUGE-L measures the longest common subsequence between the generated text and the truth text,
considering both the precision and recall of the sequences. It is widely used in evaluating
the quality of text generation systems.
Parameters:
- generation (str): The generated text to evaluate.
- truth (str): The ground truth text to compare against.
Returns:
- float: The ROUGE-L F-measure score, indicating the quality of the generated text.
"""
# Initialize the ROUGE scorer with the ROUGE-L metric
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
# Calculate the ROUGE scores between the generated text and the truth text
scores = scorer.score(generation, truth)
# Extract and return the ROUGE-L F-measure score
return scores["rougeL"].fmeasure
def sent_transformer(generation: str, truth: str, sent_transformer_model):
generation_embedding = sent_transformer_model.encode([generation])[0]
def load_sentence_transformer_model(model_name: str) -> SentenceTransformer:
"""
Loads a Sentence Transformer model by its name and moves it to the appropriate device.
if isinstance(truth, str):
truth_embedding = sent_transformer_model.encode([truth])[0]
score = (generation_embedding * truth_embedding).sum()
score /= np.linalg.norm(generation_embedding, ord=2) * np.linalg.norm(
truth_embedding, ord=2
)
if score > 0:
return score
else:
return 0
Parameters:
- model_name (str): The name of the model to load.
Returns:
- SentenceTransformer: The loaded SentenceTransformer model.
"""
global sentence_transformer_model_cache
# a model cache ensures we do not load the model on every call
if model_name not in sentence_transformer_model_cache:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(model_name).to(device)
sentence_transformer_model_cache[model_name] = model
return sentence_transformer_model_cache[model_name]
def calculate_cosine_similarity(generated_text: str, reference_texts: Union[str, List[str]], model_name) -> float:
"""
Computes the cosine similarity score(s) between a generated text and reference text(s) using a sentence embedding model.
This function calculates the cosine similarity between the embedding of the generated text and the embedding(s)
of reference text(s). The embeddings are generated using a specified sentence embedding model. The cosine similarity
score is a measure of similarity between two vectors, ranging from -1 (completely different) to 1 (exactly the same).
Parameters:
- generated_text (str): The text generated by the model.
- reference_texts (Union[str, List[str]]): The reference text(s) for comparison. Can be a single string or a list of strings.
- model_name: The sentence embedding model used to generate text embeddings.
Returns:
- float: The average cosine similarity score between the generated text and the reference text(s). If reference_texts is a single
string, a single score is returned. If reference_texts is a list of strings, the average score across all references is returned.
The score is bounded between 0 (no similarity) and 1 (identical), with negative scores adjusted to 0.
"""
# Load/Reference model
model = load_sentence_transformer_model(model_name)
# Embedding for the generated text
generated_embedding = model.encode([generated_text])[0]
# Handling a single reference text
if isinstance(reference_texts, str):
# Embedding for the single reference text
reference_embedding = model.encode([reference_texts])[0]
# Compute cosine similarity
similarity_score = np.dot(generated_embedding, reference_embedding) / (np.linalg.norm(generated_embedding) * np.linalg.norm(reference_embedding))
# Ensure non-negative score
return max(similarity_score, 0)
# Handling multiple reference texts
else:
scores = []
for label_item in truth:
truth_embedding = sent_transformer_model.encode([label_item])[0]
score_ = (generation_embedding * truth_embedding).sum()
score_ /= np.linalg.norm(
generation_embedding, ord=2
) * np.linalg.norm(truth_embedding, ord=2)
scores.append(score_)
if np.mean(scores) > 0:
return np.mean(scores)
else:
return 0
def tp_fp_fn(entity_list, truth):
answer_lower = []
for a in entity_list:
answer_lower.append(a.lower().lstrip(" ").rstrip(" "))
truth_lower = []
for l in truth:
truth_lower.append(l.lower())
true_positive = len(set(answer_lower).intersection(set(truth_lower)))
false_positive = len(answer_lower) - true_positive
false_negative = len(truth_lower) - true_positive
return true_positive, false_positive, false_negative
def compute_f1_score(tp_fp_fn_list):
total_tp = 0
total_fp = 0
total_fn = 0
for tp, fp, fn in tp_fp_fn_list:
similarity_scores = []
for reference_text in reference_texts:
# Embedding for each reference text
reference_embedding = model.encode([reference_text])[0]
# Compute cosine similarity for each reference
individual_score = np.dot(generated_embedding, reference_embedding) / (np.linalg.norm(generated_embedding) * np.linalg.norm(reference_embedding))
similarity_scores.append(individual_score)
# Calculate and ensure non-negative average score
return max(np.mean(similarity_scores), 0)
def calculate_true_positive_false_positives_false_negatives(extracted_entities: List[str], ground_truth_entities: List[str]) -> Tuple[int, int, int]:
"""
Calculates true positives, false positives, and false negatives for entity extraction.
This function compares a list of extracted entities against a list of ground truth entities
to determine the count of true positives (correctly extracted entities), false positives
(incorrectly extracted entities), and false negatives (missed entities).
Both lists are case-insensitive, and leading/trailing spaces in extracted entities are ignored.
Parameters:
- extracted_entities (List[str]): The list of entities extracted by the model.
- ground_truth_entities (List[str]): The list of actual entities (ground truth).
Returns:
- Tuple[int, int, int]: A tuple containing the counts of true positives, false positives, and false negatives.
"""
# Normalize the extracted entities by making them lowercase and stripping leading/trailing spaces
normalized_extracted_entities = [entity.lower().strip() for entity in extracted_entities]
# Normalize the ground truth entities by making them lowercase
normalized_ground_truth_entities = [entity.lower() for entity in ground_truth_entities]
# Calculate true positives by finding the intersection between extracted and ground truth entities
true_positives = len(set(normalized_extracted_entities).intersection(set(normalized_ground_truth_entities)))
# Calculate false positives as extracted entities not in ground truth
false_positives = len(normalized_extracted_entities) - true_positives
# Calculate false negatives as ground truth entities not extracted
false_negatives = len(normalized_ground_truth_entities) - true_positives
return true_positives, false_positives, false_negatives
def calculate_f1_score(metrics_list: List[Tuple[int, int, int]]) -> float:
"""
Calculates the F1 score from a list of tuples containing true positives, false positives, and false negatives.
Parameters:
- metrics_list (List[Tuple[int, int, int]]): A list of tuples, where each tuple contains counts of true positives,
false positives, and false negatives in that order for various classifications or entity extractions.
Returns:
- float: The computed F1 score, ranging from 0 to 1.
"""
total_tp, total_fp, total_fn = 0, 0, 0
# Aggregate total true positives, false positives, and false negatives
for tp, fp, fn in metrics_list:
total_tp += tp
total_fp += fp
total_fn += fn
precision = total_tp / (total_tp + total_fp)
recall = total_tp / (total_tp + total_fn)
# Calculate precision and recall
precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
# Calculate F1 score, handling the case where precision + recall equals 0
if precision + recall == 0:
return 0
else:
return 2 * precision * recall / (precision + recall)
def calculate_ndcg(predicted_relevance_scores: List[int], true_relevance_weights: List[float]) -> float:
"""
Calculates and evaluates the Normalized Discounted Cumulative Gain (NDCG) score directly from predicted relevance scores
against true relevance weights. It normalizes the scores to ensure a fair comparison, trimming the predicted scores
if necessary to match the length of the true relevance weights.
Parameters:
- predicted_relevance_scores (List[int]): Indices of items ranked by the algorithm, expected to be integers starting from 1.
- true_relevance_weights (List[float]): Actual relevance weights for the items, with higher values indicating greater relevance.
Returns:
- float: The NDCG score, normalized against the ideal ranking, ranging from 0 to 1.
"""
# Trim the predicted scores to match the true scores length if necessary
if len(predicted_relevance_scores) > len(true_relevance_weights):
predicted_relevance_scores = predicted_relevance_scores[:len(true_relevance_weights)]
def ndcg(ranked_list, weight):
idcg = 0
dcg = 0
for i in range(len(ranked_list)):
position = i + 1
if ranked_list[i] - 1 < len(weight):
relevance = weight[ranked_list[i] - 1]
dcg, idcg = 0.0, 0.0
# Calculate DCG for the predicted ranking
for i, score_index in enumerate(predicted_relevance_scores, start=1):
if score_index - 1 < len(true_relevance_weights):
relevance = true_relevance_weights[score_index - 1]
else:
relevance = 0
dcg += (np.power(2, relevance) - 1) / np.log2(position + 1)
weight.sort(reverse=True)
for i in range(len(weight)):
position = i + 1
relevance = weight[i]
idcg += (np.power(2, relevance) - 1) / np.log2(position + 1)
return dcg / idcg
dcg += (np.power(2, relevance) - 1) / np.log2(i + 1)
# Calculate IDCG using sorted true relevance weights
for i, weight in enumerate(sorted(true_relevance_weights, reverse=True), start=1):
idcg += (np.power(2, weight) - 1) / np.log2(i + 1)
# Avoid division by zero
return 0 if idcg == 0 else dcg / idcg
def ndcg_eval(relevance_scores: List[float], truth: List[float]):
if len(relevance_scores) > len(truth):
relevance_scores = relevance_scores[: len(truth)]
return ndcg(relevance_scores, truth)
def calculate_bleu_score(generated_text: str, reference_text: str, is_japanese: bool = False) -> float:
"""
Calculates the BLEU score for a generated text compared to a reference truth text. This function supports
both general text and Japanese-specific evaluation by using the sacrebleu library.
Parameters:
- generated_text (str): The generated text to be evaluated.
- reference_text (str): The reference truth text.
- is_japanese (bool, optional): Flag to indicate whether the text is in Japanese, requiring special tokenization.
def bleu(generation, truth, jp=False):
Returns:
- float: The BLEU score as a percentage (0 to 1 scale) for the generated text against the reference truth.
"""
global sacrebleu
if sacrebleu is None:
print("\nsacrebleu loading...")
sacrebleu = evaluate.load("sacrebleu")
generation = generation.lstrip("\n").rstrip("\n").split("\n")[0]
candidate = [generation]
reference = [[truth]]
if not jp:
score = (
sacrebleu.compute(
predictions=candidate, references=reference, lowercase=True
)["score"]
/ 100
)
else:
score = (
sacrebleu.compute(
predictions=candidate,
references=reference,
lowercase=True,
tokenize="ja-mecab",
)["score"]
/ 100
)
# Preprocess input texts
generated_text = generated_text.lstrip("\n").rstrip("\n").split("\n")[0]
candidate = [generated_text]
reference = [[reference_text]]
# Compute BLEU score with or without Japanese-specific tokenization
bleu_args = {"predictions": candidate, "references": reference, "lowercase": True}
if is_japanese:
bleu_args["tokenize"] = "ja-mecab"
score = sacrebleu.compute(**bleu_args)["score"] / 100
return score
@@ -4,7 +4,7 @@
For a streamlined experience, we suggest placing the code for all your models within the `models` directory. This is a recommendation for organizational purposes, but it's not a strict requirement.
## Model Base Class
Your models should inherit from the `ShopBenchBaseModel` class found in [base_model.py](base_model.py). We provide an example model, `dummy_model.py`, to illustrate how you might structure your own model. Crucially, your model class must implement the `predict` method.
Your models should inherit from the `ShopBenchBaseModel` class found in [base_model.py](base_model.py). We provide an example model, `dummy_model.py`, to illustrate how you might structure your own model. Crucially, your model class must implement the `batch_predict` method.
## Configuring Your Model
To ensure your model is recognized and utilized correctly, please specify your model class name in the [`user_config.py`](user_config.py) file, by following the instructions in the inline comments.
@@ -12,12 +12,14 @@ To ensure your model is recognized and utilized correctly, please specify your m
## Model Inputs and Outputs
### Inputs
Your model will receive two pieces of information for every task:
- `prompt` (`str`): This is the specific task's input prompt.
- `batch` (`Dict[str, Any]`): A batch of inputs as a dictionary, where the dictionary has the following key:
- `prompt` (`List[str]`): A list of prompts representing the tasks in a batch.
- `is_multiple_choice` (`bool`): This indicates whether the task is a multiple choice question.
### Outputs
The output from your model's `predict` function should always be a string. Depending on the task, this could be:
The output from your model's `batch_predict` function should be a list of string responses for all the prompts in the input batch.
Depending on the task, each response could be:
- A single integer (in the range [0, 3]) for multiple choice tasks.
- A comma-separated list of integers for ranking tasks.
- A comma-separated list of named entities for Named Entity Recognition (NER) tasks.
from typing import Any, Dict, List
class ShopBenchBaseModel:
def __init__(self):
pass
def predict(self, prompt: str, is_multiple_choice: bool) -> str:
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
raise NotImplementedError("get_batch_size method not implemented")
def batch_predict(self, batch: Dict[str, Any], is_multiple_choice:bool) -> List[str]:
"""
Generates a prediction based on the input prompt and task type.
Generates a batch of predictions based on the associated prompts and task_type.
For multiple choice tasks, it randomly selects a choice.
For other tasks, it returns a list of integers as a string,
representing the model's prediction in a format compatible with task-specific parsers.
Args:
prompt (str): The input prompt for the model.
is_multiple_choice (bool): Indicates whether the task is a multiple choice question.
Parameters:
- batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys
- prompt (List[str]): a list of input prompts for the model.
- is_multiple_choice bool: A boolean flag indicating if all the items in this batch belong to multiple choice tasks.
Returns:
str: The prediction as a string representing a single integer[0, 3] for multiple choice tasks,
List[str]: A list of predictions for each of the prompts received in the batch.
Each prediction is
a string representing a single integer[0, 3] for multiple choice tasks,
or a string representing a comma separated list of integers for Ranking, Retrieval tasks,
or a string representing a comma separated list of named entities for Named Entity Recognition tasks.
or a string representing the (unconstrained) generated response for the generation tasks
from typing import List, Union
import random
import os
import random
from typing import Any, Dict, List
from .base_model import ShopBenchBaseModel
@@ -19,33 +19,55 @@ class DummyModel(ShopBenchBaseModel):
"""Initializes the model and sets the random seed for consistency."""
random.seed(AICROWD_RUN_SEED)
def predict(self, prompt: str, is_multiple_choice: bool) -> str:
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
Generates a prediction based on the input prompt and task type.
self.batch_size = 4
return self.batch_size
def batch_predict(self, batch: Dict[str, Any], is_multiple_choice:bool) -> List[str]:
"""
Generates a batch of predictions based on the associated prompts and task_type.
For multiple choice tasks, it randomly selects a choice.
For other tasks, it returns a list of integers as a string,
representing the model's prediction in a format compatible with task-specific parsers.
Args:
prompt (str): The input prompt for the model.
is_multiple_choice (bool): Indicates whether the task is a multiple choice question.
Parameters:
- batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys
- prompt (List[str]): a list of input prompts for the model.
- is_multiple_choice bool: A boolean flag indicating if all the items in this batch belong to multiple choice tasks.
Returns:
str: The prediction as a string representing a single integer[0, 3] for multiple choice tasks,
List[str]: A list of predictions for each of the prompts received in the batch.
Each prediction is
a string representing a single integer[0, 3] for multiple choice tasks,
or a string representing a comma separated list of integers for Ranking, Retrieval tasks,
or a string representing a comma separated list of named entities for Named Entity Recognition tasks.
or a string representing the (unconstrained) generated response for the generation tasks
Please refer to parsers.py for more details on how these responses will be parsed by the evaluator.
"""
prompts = batch["prompt"]
possible_responses = [1, 2, 3, 4]
if is_multiple_choice:
# Randomly select one of the possible responses for multiple choice tasks
return str(random.choice(possible_responses))
else:
# For other tasks, shuffle the possible responses and return as a string
random.shuffle(possible_responses)
return str(possible_responses)
# Note: As this is dummy model, we are returning random responses for non-multiple choice tasks.
# For generation tasks, this should ideally return an unconstrained string.
batch_response = []
for prompt in prompts:
if is_multiple_choice:
# Randomly select one of the possible responses for multiple choice tasks
batch_response.append(str(random.choice(possible_responses)))
else:
# For other tasks, shuffle the possible responses and return as a string
random.shuffle(possible_responses)
batch_response.append(str(possible_responses))
# Note: As this is dummy model, we are returning random responses for non-multiple choice tasks.
# For generation tasks, this should ideally return an unconstrained string.
return batch_response
@@ -7,6 +7,7 @@ from models.dummy_model import DummyModel
# This approach allows for easier reference to your model class when evaluating your models,
UserModel = DummyModel
# When implementing your own model please follow this pattern:
#
# from models.your_model import YourModel
@@ -17,3 +18,11 @@ UserModel = DummyModel
# Finally, assign YourModel to UserModel as shown below to use it throughout your script.
#
# UserModel = YourModel
# For example, to use the Llama3 8B Instruct baseline, you can uncomment the lines below.
# Please remember to download the model weights and check them into the repository
# before submitting.
# from models.vanilla_llama3_baseline import Llama3_8B_ZeroShotModel
# UserModel = Llama3_8B_ZeroShotModel
import os
import random
from typing import Any, Dict, List
import vllm
from .base_model import ShopBenchBaseModel
#### CONFIG PARAMETERS ---
# Set a consistent seed for reproducibility
AICROWD_RUN_SEED = int(os.getenv("AICROWD_RUN_SEED", 773815))
# Batch size you wish the evaluators will use to call the `batch_generate_answer` function
AICROWD_SUBMISSION_BATCH_SIZE = 16 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
# VLLM Parameters
VLLM_TENSOR_PARALLEL_SIZE = 4 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
VLLM_GPU_MEMORY_UTILIZATION = 0.85 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
class Llama3_8B_ZeroShotModel(ShopBenchBaseModel):
"""
A zero-shot Llama 3 8B Instruct baseline for ShopBench, illustrating how to handle both
multiple choice and other types of tasks like Ranking, Retrieval, and Named Entity Recognition.
This model uses a consistent random seed for reproducible results.
"""
def __init__(self):
"""Initializes the model and sets the random seed for consistency."""
random.seed(AICROWD_RUN_SEED)
self.initialize_models()
def initialize_models(self):
# Initialize Meta Llama 3 - 8B Instruct Model
self.model_name = "models/meta-llama/Meta-Llama-3-8B-Instruct"
if not os.path.exists(self.model_name):
raise Exception(
f"""
The evaluators expect the model weights to be checked into the repository,
but we could not find the model weights at {self.model_name}
Please follow the instructions in the docs below to download and check in the model weights.
https://gitlab.aicrowd.com/aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit/-/blob/master/docs/download-baseline-model-weights.md
"""
)
# initialize the model with vllm
self.llm = vllm.LLM(
self.model_name,
worker_use_ray=True,
tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE,
gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION,
trust_remote_code=True,
dtype="half", # note: bfloat16 is not supported on nvidia-T4 GPUs
enforce_eager=True
)
self.tokenizer = self.llm.get_tokenizer()
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
self.batch_size = AICROWD_SUBMISSION_BATCH_SIZE
return self.batch_size
def batch_predict(self, batch: Dict[str, Any], is_multiple_choice:bool) -> List[str]:
"""
Generates a batch of predictions based on the associated prompts and task_type.
For multiple choice tasks, it randomly selects a choice.
For other tasks, it returns a list of integers as a string,
representing the model's prediction in a format compatible with task-specific parsers.
Parameters:
- batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys
- prompt (List[str]): a list of input prompts for the model.
- is_multiple_choice bool: A boolean flag indicating if all the items in this batch belong to multiple choice tasks.
Returns:
List[str]: A list of predictions for each of the prompts received in the batch.
Each prediction is
a string representing a single integer[0, 3] for multiple choice tasks,
or a string representing a comma separated list of integers for Ranking, Retrieval tasks,
or a string representing a comma separated list of named entities for Named Entity Recognition tasks.
or a string representing the (unconstrained) generated response for the generation tasks
Please refer to parsers.py for more details on how these responses will be parsed by the evaluator.
"""
prompts = batch["prompt"]
# format prompts using the chat template
formatted_prompts = self.format_prompts(prompts)
# set max new tokens to be generated
max_new_tokens = 100
if is_multiple_choice:
max_new_tokens = 1 # For MCQ tasks, we only need to generate 1 token
# Generate responses via vllm
responses = self.llm.generate(
formatted_prompts,
vllm.SamplingParams(
n=1, # Number of output sequences to return for each prompt.
top_p=0.9, # Float that controls the cumulative probability of the top tokens to consider.
temperature=0, # randomness of the sampling
seed=AICROWD_RUN_SEED, # Seed for reproducibility
skip_special_tokens=True, # Whether to skip special tokens in the output.
max_tokens=max_new_tokens, # Maximum number of tokens to generate per output sequence.
),
use_tqdm = False
)
# Aggregate answers into List[str]
batch_response = []
for response in responses:
batch_response.append(response.outputs[0].text)
if is_multiple_choice:
print("MCQ: ", batch_response)
return batch_response
def format_prompts(self, prompts):
"""
Formats prompts using the chat_template of the model.
Parameters:
- prompts (list of str): A list of prompts to be formatted.
"""
system_prompt = "You are a helpful online shopping assistant. Please answer the following question about online shopping and follow the given instructions.\n\n"
formatted_prompts = []
for prompt in prompts:
formatted_prompts.append(system_prompt + prompt)
return formatted_prompts
import ast
from loguru import logger
VERSION = "0.1.1"
MAX_RESPONSE_CHARACTERS = 5000
class ShoppingBenchTaskParsers:
"""
@@ -49,6 +55,9 @@ class ShoppingBenchTaskParsers:
response, str
), f"Response must be a string, but got {type(response)}"
# Consider only the first MAX_RESPONSE_CHARACTERS
response = response[:MAX_RESPONSE_CHARACTERS]
# Attempt to retrieve the appropriate parser method for the task type.
parser_method = task_parser_methods.get(self.task_type)
@@ -73,10 +82,15 @@
An integer representing the selected option. Returns -1 if the parsing fails due to
an invalid response format.
"""
default_response = -1
try:
return int(response.strip()[0])
except ValueError:
return -1
response = response.strip()
return int(response[0])
except Exception as e:
logger.warning(
f"SHOPBENCH_PARSER_WARNING::: Error parsing multichoice response: {e}. Responding with default : {default_response}"
)
return default_response
def _parse_ranking(self, response: str) -> list:
"""
@@ -91,6 +105,7 @@
A list of integers representing the items in ranked order. Limits to the first 5 unique
elements. Returns an empty list if duplicates are found or parsing fails.
"""
default_response = []
# Keep only numeric characters and specific punctuation.
cleaned_response = "".join(
c for c in response if c.isnumeric() or c in [",", " "]
@@ -101,7 +116,9 @@
for item in cleaned_response.split(","):
try:
# Attempt to convert each item to an integer and add it to the list.
ranked_items.append(int(item))
int_item = int(item)
if int_item <= 5: # we know int_item can be at most 5
ranked_items.append(int_item)
except ValueError:
pass # Skip non-numeric items.
@@ -110,7 +127,7 @@
# If there are duplicates, empty the list
if len(ranked_items) != len(set(ranked_items)):
ranked_items = []
            ranked_items = default_response
return ranked_items
def _parse_generation(self, response: str) -> str:
@@ -139,24 +156,30 @@
Returns:
A list of integers representing the first 3 unique retrieved item indices.
"""
# Similar to ranking parser, but only returns the first 3 elements.
cleaned_response = "".join(
c for c in response if c.isnumeric() or c in [",", " "]
)
# Convert to list of integers
response = []
for item in cleaned_response.split(","):
try:
# Attempt to convert each item to an integer and add it to the list.
response.append(int(item))
except ValueError:
pass # Skip non-numeric items.
# consider only the first 3 elements
retrieved_items = response[:3]
default_response = []
try:
# Similar to ranking parser, but only returns the first 3 elements.
cleaned_response = "".join(
c for c in response if c.isnumeric() or c in [",", " "]
)
return retrieved_items
# Convert to list of integers
response = []
for item in cleaned_response.split(","):
try:
# Attempt to convert each item to an integer and add it to the list.
response.append(int(item))
except ValueError:
pass # Skip non-numeric items.
# consider only the first 3 elements
retrieved_items = response[:3]
return retrieved_items
except Exception as e:
logger.warning(
f"SHOPBENCH_PARSER_WARNING::: Error parsing retrieval response: {e}. Responding with default : {default_response}"
)
return default_response
def _parse_named_entity_recognition(self, response: str) -> list:
"""
@@ -182,78 +205,124 @@
raise SyntaxError(
"Unexpected Syntax error - fall back to comma separated list."
)
except (SyntaxError, ValueError):
except Exception as e:
# Fallback: split the string by commas and strip whitespace.
return [entity.strip() for entity in response.split(",")]
            # Empty entities are removed; this will not cause issues, it is simply an implementation choice.
return [
entity.strip()
for entity in response.split(",")
if entity.strip() != ""
]
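    # Illustrative parse paths (hypothetical inputs):
    #   "['Entity A', 'Entity B']" -> ast.literal_eval succeeds -> ['Entity A', 'Entity B']
    #   "Entity A, Entity B"       -> literal_eval raises -> comma-split fallback -> ['Entity A', 'Entity B']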
import unittest
class TestShoppingBenchTaskParsers(unittest.TestCase):
def test_multichoice(self):
parser = ShoppingBenchTaskParsers("multichoice")
# Check for a valid numeric response
self.assertEqual(parser.parse("2"), 2)
# Check for an invalid (alphabetic) response, expecting failure code -1
self.assertEqual(parser.parse("a"), -1)
# Check handling of newline-only input, expecting failure code -1
self.assertEqual(parser.parse("\n"), -1)
# Check handling of space-only input, expecting failure code -1
self.assertEqual(parser.parse(" "), -1)
# Check handling of leading space before a valid response
self.assertEqual(parser.parse(" 2"), 2)
# Check handling of newline before a valid response
self.assertEqual(parser.parse("\n1"), 1)
# Check for newline and space before a valid response
self.assertEqual(parser.parse("\n 3"), 3)
# Check for newline and space only, expecting failure code -1
self.assertEqual(parser.parse("\n "), -1)
def test_ranking(self):
parser = ShoppingBenchTaskParsers("ranking")
# Basic successful parse of a comma-separated list of numbers
self.assertEqual(parser.parse("1, 2, 3, 4, 5"), [1, 2, 3, 4, 5])
# Successfully parses even when wrapped in square brackets
self.assertEqual(parser.parse("[1, 2, 3, 4, 5]"), [1, 2, 3, 4, 5])
# Fails (empty list) when numbers are repeated
self.assertEqual(parser.parse("1, 2, 2, 3"), [])
# Filters out non-numeric values correctly, keeping the valid numbers
self.assertEqual(parser.parse("1, 2, 4, aicrowd, 5"), [1, 2, 4, 5])
# Check handling of newline-only input, expecting empty list
self.assertEqual(parser.parse("\n"), [])
# Check handling of space and newline input, expecting empty list
self.assertEqual(parser.parse(" \n"), [])
# Parses numbers correctly even when prefixed by non-numeric text
self.assertEqual(
parser.parse("The answer is: 1, 2, 3, 4, 5"), [1, 2, 3, 4, 5]
)
# Correctly handles a leading comma
self.assertEqual(parser.parse(",1,2,3,4,5"), [1, 2, 3, 4, 5])
# Fails (empty list) when numbers are not comma-separated
self.assertEqual(parser.parse("1 2"), [])
def test_generation(self):
parser = ShoppingBenchTaskParsers("generation")
# Verifies correct response without modification
self.assertEqual(
parser.parse("This is a generated response."),
"This is a generated response.",
)
# Handles and trims extraneous newlines and spaces correctly
self.assertEqual(
parser.parse("\nThe answer is \n\n good.\n\n\n\n\n\n\n"),
"The answer is \n\n good.",
)
# Correctly returns empty string for newline and space-only inputs
self.assertEqual(parser.parse("\n \n"), "")
def test_retrieval(self):
parser = ShoppingBenchTaskParsers("retrieval")
# Basic successful parse of a comma-separated list of numbers
self.assertEqual(parser.parse("100, 200, 300"), [100, 200, 300])
# Successfully handles shorter than expected input lists
self.assertEqual(parser.parse("100, 200"), [100, 200])
# Filters out non-numeric values correctly, keeping the valid numbers
self.assertEqual(parser.parse("100, 200, jjhg"), [100, 200])
# Correctly parses numbers despite excessive spacing and newlines
self.assertEqual(
parser.parse("100, 200, \n\n\n 300"), [100, 200, 300]
)
# Limits output to first three elements if more are provided
self.assertEqual(parser.parse("100, 200, 300, 400"), [100, 200, 300])
# Correctly handles newline before valid input
self.assertEqual(parser.parse("\n 100, 200, 300"), [100, 200, 300])
# Returns empty list for newline-only inputs
self.assertEqual(parser.parse("\n \n \n"), [])
def test_named_entity_recognition(self):
parser = ShoppingBenchTaskParsers("named_entity_recognition")
# Successfully parses a list of strings, correctly interpreting them as separate entities
self.assertEqual(
parser.parse("['New York', 'ShopBench', 'Amazon']"),
["New York", "ShopBench", "Amazon"],
)
# Successfully parses comma-separated entities without brackets or quotes
self.assertEqual(
parser.parse("New York, ShopBench, Amazon"),
["New York", "ShopBench", "Amazon"],
)
# Incorrectly includes the opening bracket in the first entity and the closing bracket in the last entity,
# indicating an unintentional parsing error with brackets when quotes are not used.
self.assertEqual(
parser.parse("[New York, ShopBench, Amazon]"),
["[New York", "ShopBench", "Amazon]"],
)
# Correctly parses entities even when the input starts with a newline and a comma, trimming unnecessary characters
self.assertEqual(
parser.parse("\n, New York, ShopBench"), ["New York", "ShopBench"]
)
# Returns an empty list when parsing only a space, indicating no entities found
self.assertEqual(parser.parse(" "), [])
# Returns an empty list for inputs consisting only of newlines and spaces, indicating no entities found
self.assertEqual(parser.parse("\n \n"), [])
if __name__ == "__main__":
# Example usage of the ShoppingBenchTaskParsers class for various task types.
# MULTICHOICE EXAMPLE
multic_choice_parser = ShoppingBenchTaskParsers("multichoice")
print("Multichoice Example:")
print(multic_choice_parser.parse("2")) # Expected output: 2
print(
multic_choice_parser.parse("a")
) # Expected output (failure case): -1
print()
# RANKING EXAMPLE
ranking_parser = ShoppingBenchTaskParsers("ranking")
print("Ranking Example:")
print(
ranking_parser.parse("1, 2, 3, 4, 5")
) # Expected output: [1, 2, 3, 4, 5]
print(
ranking_parser.parse("[1, 2, 3, 4, 5]")
) # Expected output: [1, 2, 3, 4, 5] - tolerant to [, ]
print(
ranking_parser.parse("1, 2, 2, 3")
) # Expected output (failure case): [] # because of repeating numbers
print(
ranking_parser.parse("1, 4, 5, aicrowd, 6")
    )  # Expected output: [1, 4, 5] - non-numeric items are dropped and values above 5 are filtered out
print()
# GENERATION EXAMPLE
generation_parser = ShoppingBenchTaskParsers("generation")
print("Generation Example:")
print(
generation_parser.parse("This is a generated response")
    )  # Expected output: 'This is a generated response'
print()
# RETRIEVAL EXAMPLE
retrieval_parser = ShoppingBenchTaskParsers("retrieval")
print("Retrieval Example:")
print(
retrieval_parser.parse("100, 200, 300")
) # Expected output: [100, 200, 300]
print(
retrieval_parser.parse("100, 200")
) # Expected output (shorter than 3): [100, 200]
print(
retrieval_parser.parse("100, 200, jjhg")
    )  # Expected output (non-numeric items removed): [100, 200]
print(
retrieval_parser.parse("100, 200, 300, 400")
) # Expected output (only consider first 3 elems): [100, 200, 300]
print()
# NAMED ENTITY RECOGNITION EXAMPLE
ner_parser = ShoppingBenchTaskParsers("named_entity_recognition")
print("Named Entity Recognition Example:")
print(
ner_parser.parse("['New York', 'ShopBench', 'Amazon']")
) # Expected output: ['New York', 'ShopBench', 'Amazon']
print(
ner_parser.parse("New York, ShopBench, Amazon")
) # Expected output: ['New York', 'ShopBench', 'Amazon']
print(
ner_parser.parse("[New York, ShopBench, Amazon]")
    )  # Failure case - not tolerant to '[' and ']' when quotes are not used:
    # the brackets remain attached to the boundary elements.
    # Expected output: ['[New York', 'ShopBench', 'Amazon]']
unittest.main()
torch
\ No newline at end of file
torch
vllm>=0.4.2
loguru
@@ -3,5 +3,5 @@ pandas
sentence-transformers
rouge_score
evaluate
sacrebleu
sacrebleu[ja]
\ No newline at end of file
sacrebleu==2.4.1
sacrebleu[ja]