

ZEW Data Purchasing Challenge 2022 - Starter Kit


This repository is the main Data Purchasing Challenge template and Starter kit. Clone the repository to compete now!

This repository contains:

  • Documentation on how to submit your models to the leaderboard
  • Best practices and information on how we evaluate your submission
  • Starter code for you to get started!

Coming Soon

  • Baselines

Note: You can also make submissions online using the notebook present here.


🏆 About the Challenge

In short: You have to classify images. Some images in your training set are labelled but most of them aren't. How do you decide which images to label if you have a limited budget to do so?

In more detail: You face a multi-label image classification task. The dataset consists of synthetically generated images of painted metal sheets. A classifier is meant to predict whether the sheets have production damages and if so which ones. You have access to a set of images, a subset of which are labeled with respect to production damages. Because labeling is costly and your budget is limited, you have to decide for which of the unlabeled images labels should be purchased in order to maximize prediction accuracy.

Each image has a 4-dimensional label indicating the presence or absence of ['scratch_small', 'scratch_large', 'dent_small', 'dent_large'] in the image.
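As a rough illustration of such a label, the sketch below encodes a set of damages as a 4-element 0/1 vector in the order listed above. The helper names `encode_label` and `decode_label` are hypothetical, and the exact in-memory representation used by the starter kit's dataset class may differ.

```python
# Label order as listed in the challenge description.
LABEL_NAMES = ["scratch_small", "scratch_large", "dent_small", "dent_large"]


def encode_label(damages):
    """Return a 4-element 0/1 list marking which damages are present.

    Hypothetical helper for illustration; not part of the starter kit.
    """
    unknown = set(damages) - set(LABEL_NAMES)
    if unknown:
        raise ValueError(f"unknown damage types: {sorted(unknown)}")
    return [1 if name in damages else 0 for name in LABEL_NAMES]


def decode_label(vector):
    """Return the names of the damages flagged in a 0/1 label vector."""
    return [name for name, flag in zip(LABEL_NAMES, vector) if flag]


# An image with a small scratch and a large dent.
example = encode_label(["scratch_small", "dent_large"])
```

Because several entries can be 1 at once, this is a multi-label target rather than a single-class one.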

What's special about this challenge?

As you may have noticed, the challenge is called the "Data Purchasing Challenge". Wonder why? 😉

This challenge features online evaluation in which your submissions not only train and predict online, but also go through a purchase phase.

What is a Purchase Phase? 🤔

Part of the dataset in this challenge is unlabelled. During the purchase phase, your model is given a fixed budget, which it can spend on requesting labels for images using the purchase_label function.

In this sense, participants have to make a data purchasing decision.
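One possible way to spend the budget is uncertainty sampling: buy labels for the images your current model is least sure about. The sketch below is only an illustration under stated assumptions. `select_indices_to_purchase` and `FakeUnlabelledDataset` are hypothetical, and the real `purchase_label` signature is defined in the starter kit and may differ from the stand-in used here.

```python
def select_indices_to_purchase(probabilities, budget):
    """Pick up to `budget` images whose predictions are least certain.

    probabilities: dict mapping image index -> list of per-class probabilities.
    """
    def uncertainty(probs):
        # 0.5 is maximally uncertain; 0.0 or 1.0 is fully certain.
        return sum(0.5 - abs(p - 0.5) for p in probs)

    ranked = sorted(
        probabilities,
        key=lambda idx: uncertainty(probabilities[idx]),
        reverse=True,
    )
    return ranked[:max(budget, 0)]


class FakeUnlabelledDataset:
    """Hypothetical stand-in for the challenge dataset; not the real API."""

    def __init__(self, hidden_labels, budget):
        self._hidden = hidden_labels
        self.budget = budget

    def purchase_label(self, idx):
        # Each purchase consumes one unit of budget.
        if self.budget <= 0:
            raise RuntimeError("budget exhausted")
        self.budget -= 1
        return self._hidden[idx]


# Made-up model outputs for three unlabelled images.
probs = {
    0: [0.9, 0.1, 0.95, 0.05],
    1: [0.5, 0.4, 0.6, 0.5],
    2: [0.7, 0.2, 0.5, 0.9],
}
dataset = FakeUnlabelledDataset(
    hidden_labels={0: [1, 0, 1, 0], 1: [0, 0, 1, 1], 2: [1, 0, 0, 1]},
    budget=2,
)
chosen = select_indices_to_purchase(probs, dataset.budget)
purchased = {idx: dataset.purchase_label(idx) for idx in chosen}
```

Other strategies (random sampling, class-balance heuristics, and so on) fit the same shape: rank the unlabelled images, then purchase from the top until the budget is gone.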

We hope you are as excited as we are!! 🤩

💪 Getting Started

Download Dataset

# Go to the data directory
cd data/

# Listing dataset files
aicrowd dataset list -c data-purchasing-challenge-2022

# Downloading debug dataset (6MB)
aicrowd dataset download -c data-purchasing-challenge-2022 debug.tar.gz

# Downloading all dataset files (~1G)
aicrowd dataset download -c data-purchasing-challenge-2022

Don't have AIcrowd CLI installed? 🥺
You can install it here, or download the datasets without the CLI.

Dataset Distribution

A quick distribution of the dataset is as follows:

The publicly released dataset is for local experiments and validating your code base. The private dataset which is used for all the phases during evaluation is different from the publicly released one.

Using this repository

This repository contains a submission template.

# Clone the repository
git clone https://gitlab.aicrowd.com/zew/data-purchasing-challenge-2022-starter-kit.git
cd data-purchasing-challenge-2022-starter-kit

# Install dependencies
pip install -r requirements.txt

# Download the dataset, and place it in `data/` folder
#   Check Download Dataset section above.

# Run codebase locally
python run.py

This runs all the phases (pre_training, purchase & prediction) locally and returns your score.

👥 Participation

The participation flow looks as follows:

A quick description of each phase:

  • Runtime Setup
    You can use requirements.txt to list all your Python package requirements. If you are an advanced developer and need more freedom, check out the other supported runtime configurations here.
  • Pre-Train Phase
    This is your typical training phase. You need to implement the pre_training_phase function, which has access to training_dataset (an instance of ZEWDPCBaseDataset). Learn more about it by referring to the inline documentation here.
  • Purchase Phase
    In this phase you also have access to the unlabelled dataset, which you can probe until your budget runs out. Learn more about it by referring to the inline documentation here.
  • Prediction Phase
    In this phase you have access to a test set, and you are expected to make predictions using your trained models. Learn more about it by referring to the inline documentation here.
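The three phases can be sketched as methods on a single class. This is a simplified illustration, not the starter kit's code: the real entry point is the ZEWDPCBaseRun class in run.py. Only `pre_training_phase` and `training_dataset` are named above, so the other method names, all signatures, and the data shapes here are assumptions.

```python
class RunSketch:
    """Illustrative three-phase layout; see run.py for the real interface."""

    def __init__(self):
        self.known_labels = {}  # image index -> 4-element 0/1 label

    def pre_training_phase(self, training_dataset):
        # Assumption: training_dataset yields (index, label) pairs.
        for idx, label in training_dataset:
            self.known_labels[idx] = label

    def purchase_phase(self, unlabelled_indices, purchase_label, budget):
        # Hypothetical signature: buy labels in order until the budget is spent.
        purchased = {}
        for idx in unlabelled_indices[:budget]:
            purchased[idx] = purchase_label(idx)
        self.known_labels.update(purchased)
        return purchased

    def prediction_phase(self, test_indices):
        # Placeholder "model": predict the per-class majority of known labels.
        n = len(self.known_labels)
        majority = [
            1 if n and 2 * sum(lab[c] for lab in self.known_labels.values()) > n else 0
            for c in range(4)
        ]
        return {idx: list(majority) for idx in test_indices}


# Tiny local dry run with made-up data.
run = RunSketch()
run.pre_training_phase([(0, [1, 0, 0, 0]), (1, [1, 0, 1, 0])])
hidden = {2: [1, 1, 0, 0], 3: [0, 0, 0, 1]}
bought = run.purchase_phase([2, 3], hidden.__getitem__, budget=1)
predictions = run.prediction_phase([10])
```

In a real submission, the placeholder majority vote would be replaced by an actual image classifier that is retrained or fine-tuned after the purchase phase.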

🧩 Repository structure

Required files

| File | Description |
| --- | --- |
| ZEWDPCBaseRun (class in run.py) | Entry point to your implementation. Your code goes here |
| local_evaluation.py | Run your codebase locally on all the phases |
| aicrowd.json | A configuration file used to identify the challenge and resources needed for evaluation |
| requirements.txt | List of PyPI packages that should be installed for your code to run |
| submit.sh | Utility script to submit your codebase as submission to this challenge. |

Other important files

| File | Description |
| --- | --- |
| data/ | Directory containing the dataset (you don't need to upload the dataset with submissions) |
| evaluator/evaluation_metrics.py | Helps you generate the score for your run locally |
| evaluator/dataset.py | Dataset wrapper implementation through which you can access the dataset easily and purchase labels during the purchase phase |

🚀 Submission

  • Prepare your runtime environment
  • Make submissions by pushing your code repository
  • Get scores, iterate and improve! 💪

GitLab submission

To keep things simple, this starter kit includes a quick submission utility script. You can make a submission as follows:

./submit.sh <unique-submission-name>

Example: ./submit.sh "bayes v0.1"

If the utility script doesn't fit your use case, detailed information about the submission process is available in SUBMISSION.md.

Notebook submission

You can also make submissions online using the notebook present here.

Evaluation hardware and timeouts

In Round 1, your code will have access to a machine with 4 CPUs, 16 GB RAM, and 1 NVIDIA T4 GPU, with 3 hours of runtime per submission. In Round 2 of this competition, your code will be evaluated across multiple budget-runtime constraints, which will be announced later.
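Since a run that exceeds the time limit is presumably of little use, it can help to track the remaining time inside your own code. The sketch below is one way to do that. The `TimeBudget` class and the 10-minute safety margin are assumptions for illustration, not part of the starter kit or the evaluator.

```python
import time


class TimeBudget:
    """Tracks time left before a self-imposed deadline (hypothetical helper)."""

    def __init__(self, limit_seconds, safety_margin_seconds=0.0, clock=time.monotonic):
        self._clock = clock
        # Stop early by the safety margin to leave room for prediction and I/O.
        self._deadline = clock() + limit_seconds - safety_margin_seconds

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._deadline - self._clock())

    def expired(self):
        return self.remaining() == 0.0


# Round 1 limit from the text above: 3 hours per submission.
budget = TimeBudget(limit_seconds=3 * 60 * 60, safety_margin_seconds=10 * 60)
```

A training loop could then check `budget.expired()` once per epoch and break out early so that the prediction phase still has time to finish.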

📎 Important links

✍️ Maintainers