Compare revisions

Commits on Source (8)
models/**
\ No newline at end of file
.git/
models/**
data/
\ No newline at end of file
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive \
LANG=en_US.UTF-8 \
......
......@@ -155,20 +155,22 @@ This also includes instructions on [specifying your software runtime](docs/submi
## 💻 What hardware does my code run on ?
You can find more details about the hardware and system configuration in [docs/hardware-and-system-config.md](docs/hardware-and-system-config.md).
In summary, we provide you `2` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 1; and `4` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 2.
In summary, we provide you `4` x [[NVIDIA T4 GPUs](https://www.nvidia.com/en-us/data-center/tesla-t4/)] in Phase 2.
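As a quick sanity check inside your container, you can confirm how many GPUs your code actually sees. This is a minimal sketch for local debugging (not part of the starter kit), assuming `torch` is installed as listed in `requirements.txt`:
```python
import torch

# Confirm the GPUs visible to your submission (expected: 4 x NVIDIA T4 in Phase 2).
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))
```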
Your solution will be given a fixed amount of time for inference, after which it will be immediately killed and no results will be available. The time limits are as follows:
| Phase | Track 1 | Track 2 | Track 3 | Track 4 | Track 5 |
| ------ | ------- | ------- | ------- | ------- | ------- |
| **Phase 1**| 140 minutes | 40 minutes | 60 minutes | 60 minutes | 5 hours |
| **Phase 2**| 70 minutes | 20 minutes | 30 minutes | 20 minutes | 140 minutes |
For reference, the baseline solution with zero-shot [Vicuna-7B](https://huggingface.co/lmsys/vicuna-7b-v1.5) (Find it [**here**](https://gitlab.aicrowd.com/aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit/-/blob/master/models/dummy_model.py)) consumes the following amount of time.
For reference, the baseline solution with zero-shot LLaMA3-8B-instruct consumes the following amount of time.
| Phase | Track 1 | Track 2 | Track 3 | Track 4 |
| ------ | ------- | ------- | ------- | ------- |
| **Phase 1**| ~50 minutes | ~3 minutes | ~25 minutes | ~35 minutes |
| **Phase 2**| 1490s | 397s | 576s | 359s |
We limit the prediction time of each sample to at most **15 seconds**.
We limit the prediction time of each sample to at most **10 seconds**. This limit applies at a batch level. For example, for a batch of 8 samples, you should return the prediction after at most 80 seconds. Otherwise, your submission will be killed.
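For local testing, you may want to check that your `batch_predict` calls stay within this budget before submitting. Below is a minimal sketch of such a check; the helper name `timed_batch_predict` is ours, and nothing like it is run by the evaluator:
```python
import time

def timed_batch_predict(model, batch, is_multiple_choice, per_sample_limit_s=10):
    # Warn locally if a batch exceeds the documented budget of 10 seconds per sample
    # (e.g. a batch of 8 samples gets an 80-second budget).
    start = time.perf_counter()
    outputs = model.batch_predict(batch, is_multiple_choice)
    elapsed = time.perf_counter() - start
    budget = per_sample_limit_s * len(batch["prompt"])
    if elapsed > budget:
        print(f"Warning: batch took {elapsed:.1f}s, budget was {budget:.1f}s")
    return outputs
```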
Your maximum repo size is 200GB.
## 🧩 How are my model responses parsed by the evaluators ?
Please refer to [parsers.py](parsers.py) for more details on how we parse your model responses.
......
......@@ -35,6 +35,7 @@ docker run \
--gpus all \
-v "$(pwd)":/submission \
-w /submission \
--shm-size=10.24gb \
$IMAGE_NAME python local_evaluation.py
# Note: We assume you have nvidia-container-toolkit installed and configured
......
### Setting Up and Downloading Baseline Model Weights with Hugging Face
This guide outlines the steps to download (and check in) the model weights required for the baseline models.
We will focus on `Meta-Llama-3-8B-Instruct`, but the steps should work equally well for any other model on Hugging Face.
#### Preliminary Steps:
1. **Install the Hugging Face Hub Package**:
Begin by installing the `huggingface_hub` package, which includes the `hf_transfer` utility, by running the following command in your terminal:
```bash
pip install "huggingface_hub[hf_transfer]"
```
2. **Accept the Llama 3 Terms**:
You must accept the model's terms of use by visiting: [meta-llama/Meta-Llama-3-8B-Instruct Terms](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
3. **Create a Hugging Face CLI Token**:
Generate a CLI token by navigating to: [Hugging Face Token Settings](https://huggingface.co/settings/tokens). You will need this token for authentication.
#### Hugging Face Authentication:
1. **Login via CLI**:
Authenticate yourself with the Hugging Face CLI using the token created in the previous step. Run:
```bash
huggingface-cli login
```
When prompted, enter the token.
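If you prefer to authenticate from Python instead of the CLI, the same `huggingface_hub` package exposes a `login` helper. A minimal sketch, assuming you export your token as the environment variable `HF_TOKEN` (our convention, not a requirement):
```python
import os
from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login`.
login(token=os.environ["HF_TOKEN"])
```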
#### Model Downloads:
1. **Download the Meta-Llama-3-8B-Instruct Model**:
Execute the following command to download the `Meta-Llama-3-8B-Instruct` model to a local subdirectory. This command excludes unnecessary files to save space:
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir-use-symlinks False \
--local-dir models/meta-llama/Meta-Llama-3-8B-Instruct \
--exclude "*.pth" # These are alternatives to the safetensors, hence not needed
```
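To sanity-check that the weights landed in the expected local directory, you can try loading the tokenizer offline. This is an optional sketch, assuming `transformers` is available in your environment:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the local checkout only (no network access) to confirm
# the download produced a usable checkpoint layout.
tokenizer = AutoTokenizer.from_pretrained(
    "models/meta-llama/Meta-Llama-3-8B-Instruct",
    local_files_only=True,
)
print(tokenizer("Hello, shopper!")["input_ids"][:5])
```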
#### Version Control with Git LFS:
1. **Track Model Weights**:
Use Git Large File Storage (LFS) to track the model directories. This ensures efficient handling of large files:
```bash
git lfs track "models/meta-llama/*"
```
2. **Commit and Push**:
Add the models to your Git repository, commit the changes, and push them to your remote repository:
```bash
git add models/
git commit -am "add weights"
git push origin master
```
If you are struggling with Git LFS, we strongly encourage you to check out [this post](https://discourse.aicrowd.com/t/how-to-upload-large-files-size-to-your-submission/2304).
......@@ -52,20 +52,36 @@ def generate_model_outputs(data_df, model):
- A list containing the model outputs for each entry in the data DataFrame.
"""
    outputs = []
    for _, row in tqdm(
        data_df.iterrows(), total=len(data_df), desc="Generating Responses"
    ):
        is_multiple_choice = row["task_type"] == "multiple-choice"
        # the 'task_type' column won't be available during evaluation, so you should use something like
        # `is_multiple_choice = row['is_multiple_choice']`
        prompt = row["input_field"]
        model_output = model.predict(prompt, is_multiple_choice)
        outputs.append(model_output)
    return outputs
    task_grouped_df = data_df.groupby(by=["task_type"])
    for task_type, task_group_data_df in task_grouped_df:
        task_group_data_df = task_group_data_df.reset_index(drop=True)
        is_multiple_choice = task_type[0] == "multiple-choice"
        batch_size = model.get_batch_size()
        batches = [task_group_data_df[i:i + batch_size] for i in range(0, len(task_group_data_df), batch_size)]
        for batch_df in batches:
            batch = {
                "prompt": batch_df["input_field"].tolist(),
            }
            model_output = model.batch_predict(
                batch,
                is_multiple_choice
            )
            outputs.append(
                pd.DataFrame({
                    "input_field": batch["prompt"],
                    "model_output_str": model_output
                })
            )
    df_outputs = pd.concat(outputs)
    return df_outputs
# Function to evaluate the generated model outputs
def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
def evaluate_outputs(data_df, log_every_n_steps=1):
"""
Evaluate the model outputs against ground truth values using specified metrics.
......@@ -84,17 +100,18 @@ def evaluate_outputs(data_df, outputs, log_every_n_steps=1):
    for row_idx, row in tqdm(
        data_df.iterrows(), total=len(data_df), desc="Evaluating"
    ):
        task_name, task_type, metric, ground_truth = (
        task_name, task_type, metric, ground_truth, model_output_str = (
            row["task_name"],
            row["task_type"],
            row["metric"],
            row["output_field"],
            row["model_output_str"],
        )
        if metric not in eval_methods:
            raise NotImplementedError(f"No metric for {metric=}")
        model_output = task_parsers[task_type].parse(outputs[row_idx])
        model_output = task_parsers[task_type].parse(model_output_str)
        eval_fn = eval_methods[metric]
        metric_score = eval_fn(model_output, ground_truth)
......@@ -230,14 +247,15 @@ def main():
    model = UserModel()
    # Generate model outputs
    outputs = generate_model_outputs(data_df, model)
    data_df["outputs"] = (
        outputs  # Optional: Add outputs back to DataFrame for inspection
    )
    print(data_df.head())
    df_outputs = generate_model_outputs(data_df, model)
    # add outputs to the data_df
    merged_data_df = pd.merge(data_df, df_outputs, on="input_field")
    print(merged_data_df.head())
    # Evaluate the generated outputs and calculate metrics
    per_task_metrics = evaluate_outputs(data_df, outputs)
    per_task_metrics = evaluate_outputs(merged_data_df)
    # Aggregate and display the evaluation scores
    overall_metrics = aggregate_scores(per_task_metrics)
......
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
import numpy as np
import evaluate
import os
from typing import List, Tuple, Union
import evaluate
import numpy as np
import torch
from typing import List, Union, Tuple
from loguru import logger
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
sacrebleu = None
sentence_transformer_model_cache = {}
......
......@@ -4,7 +4,7 @@
For a streamlined experience, we suggest placing the code for all your models within the `models` directory. This is a recommendation for organizational purposes, but it's not a strict requirement.
## Model Base Class
Your models should inherit from the `ShopBenchBaseModel` class found in [base_model.py](base_model.py). We provide an example model, `dummy_model.py`, to illustrate how you might structure your own model. Crucially, your model class must implement the `predict` method.
Your models should inherit from the `ShopBenchBaseModel` class found in [base_model.py](base_model.py). We provide an example model, `dummy_model.py`, to illustrate how you might structure your own model. Crucially, your model class must implement the `batch_predict` method.
## Configuring Your Model
To ensure your model is recognized and utilized correctly, please specify your model class name in the [`user_config.py`](user_config.py) file, by following the instructions in the inline comments.
......@@ -12,12 +12,14 @@ To ensure your model is recognized and utilized correctly, please specify your m
## Model Inputs and Outputs
### Inputs
Your model will receive two pieces of information for every task:
- `prompt` (`str`): This is the specific task's input prompt.
- `batch` (`Dict[str, Any]`): A batch of inputs as a dictionary, where the dictionary has the following key:
  - `prompt` (`List[str]`): A list of prompts representing the tasks in the batch.
- `is_multiple_choice` (`bool`): This indicates whether the task is a multiple choice question.
### Outputs
The output from your model's `predict` function should always be a string. Depending on the task, this could be:
The output from your model's `batch_predict` function should be a list of string responses for all the prompts in the input batch.
Depending on the task, each response could be:
- A single integer (in the range [0, 3]) for multiple choice tasks.
- A comma-separated list of integers for ranking tasks.
- A comma-separated list of named entities for Named Entity Recognition (NER) tasks.
......
from typing import Any, Dict, List
class ShopBenchBaseModel:
    def __init__(self):
        pass
    def predict(self, prompt: str, is_multiple_choice: bool) -> str:
    def get_batch_size(self) -> int:
        """
        Determines the batch size that is used by the evaluator when calling the `batch_predict` function.

        Returns:
            int: The batch size, an integer between 1 and 16. This value indicates how many
                queries should be processed together in a single batch. It can be dynamic
                across different batch_predict calls, or stay a static value.
        """
        raise NotImplementedError("get_batch_size method not implemented")

    def batch_predict(self, batch: Dict[str, Any], is_multiple_choice: bool) -> List[str]:
        """
        Generates a prediction based on the input prompt and task type.
        Generates a batch of predictions based on the associated prompts and task type.
        For multiple choice tasks, it randomly selects a choice.
        For other tasks, it returns a list of integers as a string,
        representing the model's prediction in a format compatible with task-specific parsers.

        Args:
            prompt (str): The input prompt for the model.
            is_multiple_choice (bool): Indicates whether the task is a multiple choice question.
        Parameters:
            - batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys:
                - prompt (List[str]): A list of input prompts for the model.
            - is_multiple_choice (bool): A boolean flag indicating if all the items in this batch belong to multiple choice tasks.

        Returns:
            str: The prediction as a string representing a single integer [0, 3] for multiple choice tasks,
            List[str]: A list of predictions, one for each prompt received in the batch.
                Each prediction is
                a string representing a single integer in [0, 3] for multiple choice tasks,
                or a string representing a comma-separated list of integers for Ranking and Retrieval tasks,
                or a string representing a comma-separated list of named entities for Named Entity Recognition tasks,
                or a string representing the (unconstrained) generated response for the generation tasks.
......
from typing import List, Union
import random
import os
import random
from typing import Any, Dict, List
from .base_model import ShopBenchBaseModel
......@@ -19,34 +19,55 @@ class DummyModel(ShopBenchBaseModel):
"""Initializes the model and sets the random seed for consistency."""
random.seed(AICROWD_RUN_SEED)
def predict(self, prompt: str, is_multiple_choice: bool) -> str:
def get_batch_size(self) -> int:
"""
Determines the batch size that is used by the evaluator when calling the `batch_predict` function.
Returns:
int: The batch size, an integer between 1 and 16. This value indicates how many
queries should be processed together in a single batch. It can be dynamic
across different batch_predict calls, or stay a static value.
"""
self.batch_size = 4
return self.batch_size
    def batch_predict(self, batch: Dict[str, Any], is_multiple_choice: bool) -> List[str]:
        """
        Generates a prediction based on the input prompt and task type.
        Generates a batch of predictions based on the associated prompts and task type.
        For multiple choice tasks, it randomly selects a choice.
        For other tasks, it returns a list of integers as a string,
        representing the model's prediction in a format compatible with task-specific parsers.

        Args:
            prompt (str): The input prompt for the model.
            is_multiple_choice (bool): Indicates whether the task is a multiple choice question.
        Parameters:
            - batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys:
                - prompt (List[str]): A list of input prompts for the model.
            - is_multiple_choice (bool): A boolean flag indicating if all the items in this batch belong to multiple choice tasks.

        Returns:
            str: The prediction as a string representing a single integer [0, 3] for multiple choice tasks,
            List[str]: A list of predictions, one for each prompt received in the batch.
                Each prediction is
                a string representing a single integer in [0, 3] for multiple choice tasks,
                or a string representing a comma-separated list of integers for Ranking and Retrieval tasks,
                or a string representing a comma-separated list of named entities for Named Entity Recognition tasks,
                or a string representing the (unconstrained) generated response for the generation tasks.

        Please refer to parsers.py for more details on how these responses will be parsed by the evaluator.
        """
        prompts = batch["prompt"]
        possible_responses = [1, 2, 3, 4]

        if is_multiple_choice:
            # Randomly select one of the possible responses for multiple choice tasks
            return str(random.choice(possible_responses))
        else:
            # For other tasks, shuffle the possible responses and return as a string
            random.shuffle(possible_responses)
            return str(possible_responses)
        # Note: As this is a dummy model, we are returning random responses for non-multiple choice tasks.
        # For generation tasks, this should ideally return an unconstrained string.

        batch_response = []
        for prompt in prompts:
            if is_multiple_choice:
                # Randomly select one of the possible responses for multiple choice tasks
                batch_response.append(str(random.choice(possible_responses)))
            else:
                # For other tasks, shuffle the possible responses and return as a string
                random.shuffle(possible_responses)
                batch_response.append(str(possible_responses))
        # Note: As this is a dummy model, we are returning random responses for non-multiple choice tasks.
        # For generation tasks, this should ideally return an unconstrained string.
        return batch_response
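
# --- Editor's sketch (not part of the starter kit): a minimal local smoke test of the
# --- batch interface above. Assumes `models` is importable as a package, e.g. run it
# --- from the repository root with `python -m models.dummy_model`.
if __name__ == "__main__":
    model = DummyModel()
    toy_batch = {"prompt": ["Which option is correct?", "Rank the following products."]}
    print(model.get_batch_size())                                    # -> 4
    print(model.batch_predict(toy_batch, is_multiple_choice=True))   # e.g. ['2', '4']
    print(model.batch_predict(toy_batch, is_multiple_choice=False))  # e.g. ['[3, 1, 4, 2]', '[2, 4, 1, 3]']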
......@@ -19,3 +19,10 @@ UserModel = DummyModel
#
# UserModel = YourModel
# For example, to use the Llama3 8B Instruct baseline, you can uncomment the lines below:
# please remember to download the model weights and check them into the repository
# before submitting
# from models.vanilla_llama3_baseline import Llama3_8B_ZeroShotModel
# UserModel = Llama3_8B_ZeroShotModel
import os
import random
from typing import Any, Dict, List
import vllm
from .base_model import ShopBenchBaseModel
#### CONFIG PARAMETERS ---
# Set a consistent seed for reproducibility
AICROWD_RUN_SEED = int(os.getenv("AICROWD_RUN_SEED", 773815))
# Batch size you wish the evaluators to use when calling the `batch_predict` function
AICROWD_SUBMISSION_BATCH_SIZE = 16 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
# VLLM Parameters
VLLM_TENSOR_PARALLEL_SIZE = 4 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
VLLM_GPU_MEMORY_UTILIZATION = 0.85 # TUNE THIS VARIABLE depending on the number of GPUs you are requesting and the size of your model.
class Llama3_8B_ZeroShotModel(ShopBenchBaseModel):
    """
    A zero-shot Llama 3 8B Instruct baseline for ShopBench, illustrating how to handle both
    multiple choice and other types of tasks like Ranking, Retrieval, and Named Entity Recognition.

    This model uses a consistent random seed for reproducible results.
    """
    def __init__(self):
        """Initializes the model and sets the random seed for consistency."""
        random.seed(AICROWD_RUN_SEED)
        self.initialize_models()
    def initialize_models(self):
        # Initialize Meta Llama 3 - 8B Instruct Model
        self.model_name = "models/meta-llama/Meta-Llama-3-8B-Instruct"

        if not os.path.exists(self.model_name):
            raise Exception(
                f"""
                The evaluators expect the model weights to be checked into the repository,
                but we could not find the model weights at {self.model_name}

                Please follow the instructions in the docs below to download and check in the model weights.
                https://gitlab.aicrowd.com/aicrowd/challenges/amazon-kdd-cup-2024/amazon-kdd-cup-2024-starter-kit/-/blob/master/docs/download-baseline-model-weights.md
                """
            )

        # initialize the model with vllm
        self.llm = vllm.LLM(
            self.model_name,
            worker_use_ray=True,
            tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE,
            gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION,
            trust_remote_code=True,
            dtype="half",  # note: bfloat16 is not supported on nvidia-T4 GPUs
            enforce_eager=True,
        )
        self.tokenizer = self.llm.get_tokenizer()
    def get_batch_size(self) -> int:
        """
        Determines the batch size that is used by the evaluator when calling the `batch_predict` function.

        Returns:
            int: The batch size, an integer between 1 and 16. This value indicates how many
                queries should be processed together in a single batch. It can be dynamic
                across different batch_predict calls, or stay a static value.
        """
        self.batch_size = AICROWD_SUBMISSION_BATCH_SIZE
        return self.batch_size
    def batch_predict(self, batch: Dict[str, Any], is_multiple_choice: bool) -> List[str]:
        """
        Generates a batch of predictions for the associated prompts and task type by
        querying the zero-shot Llama 3 8B Instruct model via vLLM and returning its
        responses in a format compatible with the task-specific parsers.

        Parameters:
            - batch (Dict[str, Any]): A dictionary containing a batch of input prompts with the following keys:
                - prompt (List[str]): A list of input prompts for the model.
            - is_multiple_choice (bool): A boolean flag indicating if all the items in this batch belong to multiple choice tasks.

        Returns:
            List[str]: A list of predictions, one for each prompt received in the batch.
                Each prediction is
                a string representing a single integer in [0, 3] for multiple choice tasks,
                or a string representing a comma-separated list of integers for Ranking and Retrieval tasks,
                or a string representing a comma-separated list of named entities for Named Entity Recognition tasks,
                or a string representing the (unconstrained) generated response for the generation tasks.

        Please refer to parsers.py for more details on how these responses will be parsed by the evaluator.
        """
        prompts = batch["prompt"]

        # format prompts using the system prompt
        formatted_prompts = self.format_prompts(prompts)

        # set max new tokens to be generated
        max_new_tokens = 100
        if is_multiple_choice:
            max_new_tokens = 1  # For MCQ tasks, we only need to generate 1 token

        # Generate responses via vllm
        responses = self.llm.generate(
            formatted_prompts,
            vllm.SamplingParams(
                n=1,  # Number of output sequences to return for each prompt.
                top_p=0.9,  # Float that controls the cumulative probability of the top tokens to consider.
                temperature=0,  # randomness of the sampling
                seed=AICROWD_RUN_SEED,  # Seed for reproducibility
                skip_special_tokens=True,  # Whether to skip special tokens in the output.
                max_tokens=max_new_tokens,  # Maximum number of tokens to generate per output sequence.
            ),
            use_tqdm=False,
        )

        # Aggregate answers into List[str]
        batch_response = []
        for response in responses:
            batch_response.append(response.outputs[0].text)

        if is_multiple_choice:
            print("MCQ: ", batch_response)

        return batch_response
    def format_prompts(self, prompts):
        """
        Formats prompts by prepending the baseline's system prompt to each input.

        Parameters:
            - prompts (List[str]): A list of input prompts to be formatted.
        """
        system_prompt = "You are a helpful online shopping assistant. Please answer the following question about online shopping and follow the given instructions.\n\n"
        formatted_prompts = []
        for prompt in prompts:
            formatted_prompts.append(system_prompt + prompt)

        return formatted_prompts
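
    # --- Editor's sketch (not part of the baseline): the baseline above simply prepends a
    # --- plain system prompt. If you would rather apply Llama 3's chat template, a hypothetical
    # --- alternative (assuming the tokenizer returned by self.llm.get_tokenizer() supports
    # --- `apply_chat_template`) could look like this:
    def format_prompts_with_chat_template(self, prompts):
        """Hedged sketch: format prompts with the tokenizer's chat template."""
        system_prompt = "You are a helpful online shopping assistant. Please answer the following question about online shopping and follow the given instructions."
        formatted_prompts = []
        for prompt in prompts:
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ]
            formatted_prompts.append(
                self.tokenizer.apply_chat_template(
                    messages, tokenize=False, add_generation_prompt=True
                )
            )
        return formatted_prompts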
torch
vllm>=0.4.2
loguru