Commonsense Persona-grounded Dialogue Challenge - Task 1 - Starter kit
This repository is the CPD Challenge (Task 1) Submission template and Starter kit! Clone the repository to compete now!
This repository contains:
- Documentation on how to submit your models to the leaderboard
- Best practices, and information on how we evaluate your model, etc.
- Starter code for you to get started!
Table of Contents
- Commonsense Persona-grounded Dialogue Challenge - Task 1 - Starter kit
- Table of Contents
- Competition Overview
- Getting Started
- How to write your own model?
- How to start participating?
- Other Concepts
- 📎 Important links
Competition Overview
This challenge is an opportunity for researchers and machine learning enthusiasts to test their skills on the challenging tasks of Commonsense Dialogue Response Generation (Task 1) and Commonsense Persona Knowledge Linking (Task 2) for persona-grounded dialogue.
Research on dialogue systems has a long history, but thanks to Transformers and Large Language Models (LLMs), conversational AI has come a long way in the last five years and become much more human-like. On the other hand, it is still challenging to collect natural dialogue data for research and to benchmark which models ultimately perform best, because there are no definitive evaluation datasets or metrics, and comparisons are often limited to a small set of models.
We contribute to the research and development of state-of-the-art dialogue systems by crafting high-quality human-human dialogues for model testing, and by providing a common benchmarking venue through this CPDC 2023 competition.
The competition aims to identify the best approach among state-of-the-art participant models on an evaluation dataset of natural conversations. The submitted systems will be evaluated on a new Commonsense Persona-grounded Dialogue dataset. To this end, we first created several persona profiles, similar to ConvAI2, with natural personalities based on a commonsense persona-grounded knowledge graph (PeaCoK†) newly released at ACL 2023, which allows us to obtain naturally related persona sentences. Based on these personas, we then created natural dialogues between two people and prepared a sufficient amount of dialogue data for evaluation.
The Commonsense Persona-grounded Dialogue (CPD) Challenge hosts one track on Commonsense Dialogue Response Generation (Task 1) and one track on Commonsense Persona Knowledge Linking (Task 2). Independent leaderboards are set up for the two tracks, each with a separate prize pool. In either case, participants may use any training data. In Task 1, participants will submit dialogue response generation systems, which we will evaluate on the prepared persona-grounded dialogue dataset mentioned above. In Task 2, participants will submit systems that link knowledge to a dialogue. This task is designed in a similar spirit to ComFact, which was released along with its paper published at EMNLP 2022. We will evaluate these systems by checking whether persona-grounded knowledge can be linked successfully on the persona-grounded dialogue dataset.
† PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives (ACL 2023 Outstanding Paper Award)
Task 1: Commonsense Dialogue Response Generation
Participants will submit dialogue response generation systems. We provide a baseline model trained on the ConvAI2 PERSONA-CHAT dataset with PeaCoK persona knowledge augmentation. Our trained baseline model checkpoint can be downloaded from this repository.
Participants may use any datasets for training their models, not limited to the training datasets we used for developing the baseline model. The provided training data include (a minimal loading sketch follows the list):
- Original PERSONA-CHAT (with either original or revised PERSONA-CHAT profiles):
  - Training set (original profiles): `data/persona_peacok/train_persona_original_chat_convai2.json`
  - Validation set (original profiles): `data/persona_peacok/valid_persona_original_chat_convai2.json`
  - Training set (revised profiles): `data/persona_peacok/train_persona_revised_chat_convai2.json`
  - Validation set (revised profiles): `data/persona_peacok/valid_persona_revised_chat_convai2.json`
- PERSONA-CHAT with profiles augmented with PeaCoK facts (up to 5 randomly chosen to augment each profile):
  - Training set (augmented original profiles): `data/persona_peacok/train_persona_original_chat_ext.json`
  - Validation set (augmented original profiles): `data/persona_peacok/valid_persona_original_chat_ext.json`
  - Training set (augmented revised profiles): `data/persona_peacok/train_persona_revised_chat_ext.json`
  - Validation set (augmented revised profiles): `data/persona_peacok/valid_persona_revised_chat_ext.json`
- Full set of PeaCoK facts linked to each PERSONA-CHAT profile:
  - For original profiles: `data/persona_peacok/persona_extend_full_original.json`
  - For revised profiles: `data/persona_peacok/persona_extend_full_revised.json`
- Full PeaCoK knowledge graph: `data/peacok_kg.json`
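If you want a quick look at any of these files, they are plain JSON. The minimal sketch below (using one of the paths listed above) simply loads a file and prints its top-level structure; the exact field layout is best checked by inspecting the data itself:

```python
# Minimal sketch: inspect one of the provided training files.
# The exact field layout is defined by the data itself, so check the loaded
# structure before relying on any particular key names.
import json

with open("data/persona_peacok/train_persona_original_chat_ext.json") as f:
    train_data = json.load(f)

print(type(train_data))
# Print the top-level keys (if a dict) or the number of examples (if a list).
if isinstance(train_data, dict):
    print(list(train_data.keys())[:10])
else:
    print(len(train_data))
```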
We will evaluate submitted systems on an internal persona-grounded dialogue dataset. The dialogues in our evaluation dataset have persona sentences similar to those in the PERSONA-CHAT dataset, but each person has more than five persona sentences. The major part of each persona is derived from the PeaCoK knowledge graph.
We also provide a list of other resources that may be related to this task:
- Original PERSONA-CHAT Paper
- PERSONA-CHAT Leaderboard
- Partner Personas Generation for Diverse Dialogue Generation (PPG): Paper and Code
- On Symbolic and Neural Commonsense Knowledge Graphs (COMET-ATOMIC 2020): Paper and Code
GPU and Prompt Engineering Tracks
We provide two separate settings for participants to choose from, the GPU track and the Prompt Engineering Track.
GPU Track
In this track we provide participants with access to a single GPU with 24 GB of VRAM, which allows them to fine-tune and submit their own LLMs specific to this task.
Prompt Engineering Track
In the prompt engineering track, we provide participants with access to the OpenAI API. This allows anyone to test their prompt engineering skills with a powerful LLM and to combine it with advanced retrieval-based methods to generate context, as sketched below.
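As an illustration of the kind of retrieval-based context building mentioned above, here is a minimal sketch that picks the persona sentences most similar to the latest utterance using TF-IDF from scikit-learn. The persona sentences and utterance below are made up, and this is not part of the provided starter code:

```python
# Minimal retrieval sketch: select the persona sentences most relevant to the
# latest utterance and pack them into extra prompt context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

persona_sentences = [
    "I love hiking in the mountains.",
    "I work as a nurse at a children's hospital.",
    "My favourite food is ramen.",
]
last_utterance = "Do you do anything outdoors on weekends?"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(persona_sentences + [last_utterance])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Keep the top-2 most similar persona sentences as context for the prompt.
top_k = [persona_sentences[i] for i in scores.argsort()[::-1][:2]]
context = "Relevant persona facts:\n" + "\n".join(top_k)
print(context)
```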
Can I participate in both tracks?
Yes, anyone can participate in both tracks; the prize pool is shared. The submission limits apply to both tracks combined. See below for details on how to specify the track for a submission.
Getting Started
- Sign up to join the competition on the AIcrowd website.
- Fork this starter kit repository. You can use this link to create a fork.
- Clone your forked repo and start developing your model.
- Develop your model(s) following the template in how to write your own model section.
- Submit your trained models to AIcrowd GitLab for evaluation (full instructions below). The automated evaluation setup will evaluate the submissions on the private datasets and report the metrics on the competition leaderboard.
How to write your own model?
We recommend that you place the code for all your models in the `agents/` directory (though it is not mandatory). You should implement the following function:
- `generate_responses` - This function is called to generate the response of a conversation given the persona information.

Add your agent name in `agents/user_config.py`; this is what will be used for the evaluations.
An example is provided in `agents/dummy_agent.py`, and a minimal sketch is shown below.
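Below is a minimal sketch of what an agent file might look like, assuming a `generate_responses` method that receives a batch of conversations and returns one response string per conversation. The field names and return format are illustrative placeholders; `agents/dummy_agent.py` defines the exact interface the evaluator expects:

```python
# agents/my_agent.py -- a minimal sketch only.
# See agents/dummy_agent.py for the exact interface used by the evaluator;
# the field names and return format below are illustrative placeholders.
from typing import Any, Dict, List


class MyResponseAgent:
    def __init__(self):
        # Load your model/tokenizer here; this runs once during the setup window.
        pass

    def generate_responses(self, test_data: List[Dict[str, Any]]) -> List[str]:
        """Return one response per conversation in the batch."""
        responses = []
        for conversation in test_data:
            persona = conversation.get("persona B", [])  # illustrative field name
            history = conversation.get("dialogue", [])   # illustrative field name
            # Replace this placeholder with your model's generation logic.
            responses.append("That sounds great, tell me more!")
        return responses
```

You would then register the agent so the evaluator picks it up (again, illustrative):

```python
# agents/user_config.py (illustrative registration)
from agents.my_agent import MyResponseAgent

UserAgent = MyResponseAgent
```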
How to start participating?
Setup
- Add your SSH key to AIcrowd GitLab

  You can add your SSH keys to your GitLab account by going to your profile settings here. If you do not have SSH keys, you will first need to generate one.

- Fork the repository. You can use this link to create a fork.

- Clone the repository:
  - `git clone git@gitlab.aicrowd.com:aicrowd/challenges/commonsense-persona-grounded-dialogue-challenge-2023/commonsense-persona-grounded-dialogue-challenge-task-1-starter-kit`

- Install competition-specific dependencies:
  - `cd commonsense-persona-grounded-dialogue-challenge-task-1-starter-kit`
  - `pip install -r requirements.txt`

- Write your own model as described in the How to write your own model section.

- Test your model locally using `python local_evaluation.py` or `python local_evaluation_with_api.py`.

- Make a submission as described in the How to make a submission section.
How do I specify my software runtime / dependencies?
We accept submissions with custom runtimes, so you don't need to worry about which libraries or frameworks to pick.
The configuration files typically include `requirements.txt` (PyPI packages), `apt.txt` (apt packages), or even your own `Dockerfile`.
An example Dockerfile is provided in `utilities/_Dockerfile`, which you can use as a starting point.
You can check detailed information about setting up runtime dependencies in the 👉 docs/runtime.md file.
What should my code structure be like?
Please follow the example structure in the starter kit for your code. The different files and directories have the following meaning:

    .
    ├── aicrowd.json                  # Submission meta information - like your username
    ├── apt.txt                       # Linux packages to be installed inside docker image
    ├── requirements.txt              # Python packages to be installed
    ├── local_evaluation.py           # Use this to check your model evaluation flow locally
    ├── local_evaluation_with_api.py  # Use this to check your model evaluation flow locally
    ├── dummy_data_task1.json         # A set of dummy conversations you can use for integration testing
    └── agents                        # Place your models related code here
        ├── dummy_agent.py            # Dummy agent for example interface
        └── user_config.py            # IMPORTANT: Add your agent name here
Finally, you must specify an AIcrowd submission JSON in `aicrowd.json` to be scored! The `aicrowd.json` of each submission should contain the following content:
For the GPU Track, set the `gpu` flag to `true`:

    {
        "challenge_id": "task-1-commonsense-dialogue-response-generation",
        "authors": ["your-aicrowd-username"],
        "gpu": true,
        "description": "(optional) description about your awesome model"
    }
For the Prompt Engineering Track, set the `gpu` flag to `false`:

    {
        "challenge_id": "task-1-commonsense-dialogue-response-generation",
        "authors": ["your-aicrowd-username"],
        "gpu": false,
        "description": "(optional) description about your awesome model"
    }
This JSON is used to map your submission to the challenge, so please remember to use the correct `challenge_id` as specified above. You can modify the `authors` and `description` keys. Please DO NOT add any additional keys to `aicrowd.json` unless otherwise communicated during the course of the challenge.
Other Concepts
Evaluation Metrics
Time, compute, and API constraints
You will be provided conversations with 7 turns each, in batches of up to 50 conversations. For each batch of conversations, the first set of turns will be provided to your model. After the responses are received, the further turns of the same conversations will be provided. Each conversation will have exactly 7 turns. Your model needs to complete all 7 responses for 50 conversations within **1 hour**. The number of batches of conversations your model will process will vary based on the challenge round.
Before running on the challenge dataset, your model will be run on the dummy data as a sanity check. This shows up as the `convai-validation` phase on your submission pages. The dummy data contains 5 conversations of 7 turns each; your model needs to complete the validation phase within **15 minutes**.
Before your model starts processing conversations, it is given up to 5 additional minutes to load models or preprocess any data, if needed. The overall interaction pattern is sketched below.
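For intuition only, the turn-by-turn interaction roughly looks like the sketch below. The real evaluator differs; the data layout and the `generate_responses` return format here just mirror the illustrative agent sketch from the earlier section:

```python
# Rough illustration of the turn-by-turn evaluation loop (NOT the real evaluator):
# 7 turns per conversation, batches of up to 50 conversations, and all responses
# for a batch due within 1 hour.
import time

from agents.user_config import UserAgent

NUM_TURNS = 7
BATCH_SIZE = 50

agent = UserAgent()
# Hypothetical in-memory batch; the real evaluator supplies the actual data.
batch = [{"persona B": [], "dialogue": []} for _ in range(BATCH_SIZE)]

start = time.time()
for turn in range(NUM_TURNS):
    # One more user turn per conversation is revealed...
    for conv in batch:
        conv["dialogue"].append({"role": "user", "text": f"user turn {turn}"})
    # ...then the agent produces one response per conversation.
    responses = agent.generate_responses(batch)
    for conv, response in zip(batch, responses):
        conv["dialogue"].append({"role": "agent", "text": response})

elapsed = time.time() - start
assert elapsed < 3600, "all responses for the batch must finish within 1 hour"
```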
GPU Track
Your model will be run on an AWS g5.2xlarge node. This node has 8 vCPUs, 32 GB RAM, and one Nvidia A10G GPU with 24 GB VRAM.
Prompt Engineering Track
Your model will be run on an AWS m5.xlarge node. This node has 4 vCPUs and 16 GB RAM.
For API usage, the following constraints apply (see the budget-tracking sketch after this list):
- A maximum of 2 API calls per utterance is allowed.
- Input token limit per dialogue (the combined number of input tokens across the 7 utterances): 10,000
- Output token limit per dialogue (the combined number of output tokens across the 7 utterances): 1,000
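A minimal sketch of how an agent could track these per-dialogue budgets is shown below. The limits mirror the numbers above; `call_llm` is a hypothetical stand-in for however your agent actually issues API calls and is not part of the starter kit:

```python
# Minimal sketch of per-dialogue API budget tracking for the prompt engineering
# track. `call_llm` is a hypothetical stand-in for your real API call.

MAX_CALLS_PER_UTTERANCE = 2
MAX_INPUT_TOKENS_PER_DIALOGUE = 10_000
MAX_OUTPUT_TOKENS_PER_DIALOGUE = 1_000


def call_llm(prompt: str):
    """Hypothetical wrapper around the actual API call.

    Returns (response_text, prompt_tokens, completion_tokens).
    """
    return "dummy response", len(prompt.split()), 10


class ApiBudget:
    """Tracks the remaining API budget for a single 7-turn dialogue."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.input_tokens += prompt_tokens
        self.output_tokens += completion_tokens
        if self.input_tokens > MAX_INPUT_TOKENS_PER_DIALOGUE:
            raise RuntimeError("input token budget for this dialogue exceeded")
        if self.output_tokens > MAX_OUTPUT_TOKENS_PER_DIALOGUE:
            raise RuntimeError("output token budget for this dialogue exceeded")


def respond(prompt: str, budget: ApiBudget) -> str:
    # Only one of the MAX_CALLS_PER_UTTERANCE allowed calls is used here.
    text, prompt_tokens, completion_tokens = call_llm(prompt)
    budget.record(prompt_tokens, completion_tokens)
    return text
```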
Local Evaluation
Participants can run the evaluation protocol for their model locally, with or without the constraints posed by the challenge, to benchmark their models privately. See `local_evaluation.py` for details. You can change it as you like; your changes to `local_evaluation.py` will NOT be used for the competition.
To test your submission under the prompt engineering track, please use `local_evaluation_with_api.py`.
Note about the dummy test data
The file `dummy_data_task1.json` is a dummy test dataset for testing your code before submission. All dialogues in this dataset are based on the same pair of persona A and persona B; the actual test dataset used for evaluation is not like this and was created from different pairs of personas. In addition, the `persona A` and `gold_reference` fields will not be visible in the real evaluation data. To make this clearer, we label the invisible fields as an illustration in `dummy_data_task1_with_notes.json`.
Contributing
🙏 You can share your solutions or any other baselines by contributing directly to this repository and opening a merge request.
- Add your implementation as `agents/<your_agent>.py`.
- Import it in `user_config.py`.
- Test it out using `python local_evaluation.py`.
- Add any documentation for your approach at the top of your file.
- Create a merge request! 🎉🎉🎉
How to make a submission?
👉 Follow the instructions provided in docs/submission.md
Best of Luck 🎉 🎉