- Pandas loads the data into DataFrames and makes it easier for us to analyse the data.
- Learn more about it [here](https://www.tutorialspoint.com/python_data_science/python_pandas.htm) 🤓
%% Cell type:code id: tags:
``` python
train_info_path = "data/train_info.csv"
test_info_path = "data/test_info.csv"
PUZZLES_DIRECTORY = "data/puzzles"
METADATA_FILE = "data/metadata.csv"
train_images_path = "data/train/"
test_images_path = "data/test/"
train_info = pd.read_csv(train_info_path)
test_info = pd.read_csv(test_info_path)
OUTPUT_PATH = "data/submission.tar.gz"
metadata_df = pd.read_csv(METADATA_FILE)
```
%% Cell type:markdown id: tags:
## Visualize the images👀
Create directory to store the solved image.
%% Cell type:code id: tags:
``` python
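def plot_image(img_path):
    img = cv2.imread(img_path)
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows the right colours
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # print("Shape of the captcha ", img.shape)
    plt.imshow(img)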
```
%% Cell type:code id: tags:
``` python
fig = plt.figure(figsize=(20, 20))
columns = 3
rows = 3
# Show a 3x3 grid of training images
for i in range(1, columns * rows + 1):
    img = train_images_path + train_info['filename'][i]
    fig.add_subplot(rows, columns, i)
    plot_image(img)
plt.show()
```
%% Cell type:markdown id: tags:
## Split Data into Train and Validation 🔪
- The next step is to think of a way to test how well our model is performing. We cannot use the given test data, as it does not contain the labels we would need for verification.
- The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unseen data: we hold back a chunk of data while training the model and then use it for testing. It is also a standard way to fine-tune the hyperparameters of a model.
- There are multiple ways to split a dataset into validation and training sets. Two popular ones are [k-fold](https://machinelearningmastery.com/k-fold-cross-validation/) and [leave one out](https://en.wikipedia.org/wiki/Cross-validation_statistics). 🧐
- Validation sets are also used to keep your model from [overfitting](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) on the train dataset.
- We have decided to split the data with 20% as validation and 80% as training.
- To learn more about the `train_test_split` function, [click here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 🧐
- This is of course the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
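
The 80/20 split described above can be sketched with scikit-learn's `train_test_split`. The toy DataFrame below and its `label` column are assumptions made for illustration (only the `filename` column appears in this notebook); in practice you would pass `train_info` instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_info; the "label" column name is an assumption.
toy_info = pd.DataFrame({
    "filename": [f"{i}.png" for i in range(10)],
    "label": [f"text{i}" for i in range(10)],
})

# Hold back 20% of the rows for validation.
# random_state makes the random split repeatable between runs.
train_df, val_df = train_test_split(toy_info, test_size=0.2, random_state=42)

print(len(train_df), len(val_df))
```

Fixing `random_state` is a design choice: it keeps the validation set identical across runs, so score changes reflect changes to the model and not to the split.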
%% Cell type:markdown id: tags:
- Now that our data is split into train and validation sets, we need to separate the corresponding labels from the data.
- With this done, we are all set to move to the next step with a prepared dataset.
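
As a minimal sketch of separating labels from data, assuming each split is a DataFrame with the `filename` column used earlier and a hypothetical `label` column (the real column name may differ):

```python
import pandas as pd

# Toy stand-in for one of the splits; "label" is a hypothetical column name.
split_df = pd.DataFrame({
    "filename": ["0.png", "1.png", "2.png"],
    "label": ["ab12", "cd34", "ef56"],
})

# Inputs: full image paths built from the file names.
images_dir = "data/train/"
X = [images_dir + name for name in split_df["filename"]]

# Targets: the label column as a plain Python list, in the same row order as X.
y = split_df["label"].tolist()

print(X[0], y[0])
```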
**This is a very naive approach: create a new image of the desired size and paste all the individual puzzle pieces at random locations.**
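
The naive approach can be sketched roughly as follows. The function name `paste_randomly`, the canvas size, and the dummy pieces are all assumptions for illustration; real pieces would be loaded from the puzzles directory.

```python
import numpy as np

def paste_randomly(pieces, canvas_size, seed=None):
    """Paste each piece (H x W x 3 uint8 array) at a random spot on a blank canvas."""
    rng = np.random.default_rng(seed)
    height, width = canvas_size
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for piece in pieces:
        ph, pw = piece.shape[:2]
        # Pick a top-left corner so that the piece fits entirely inside the canvas.
        top = int(rng.integers(0, height - ph + 1))
        left = int(rng.integers(0, width - pw + 1))
        # Later pieces may overwrite earlier ones where they overlap.
        canvas[top:top + ph, left:left + pw] = piece
    return canvas

# Four dummy 32x32 pieces with different grey levels stand in for real puzzle pieces.
pieces = [np.full((32, 32, 3), 60 * (i + 1), dtype=np.uint8) for i in range(4)]
solved = paste_randomly(pieces, canvas_size=(128, 128), seed=0)
print(solved.shape)
```

Because the positions are random, this only produces an image of the right size and is meant as a starting point to improve on.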
%% Cell type:markdown id: tags:
# TRAINING PHASE 🏋️
%% Cell type:markdown id: tags:
**We will use PyTesseract, an Optical Character Recognition library, to recognize the characters in the test captchas directly and make a submission in this notebook. But first, let's see its performance on the train set.**
**For evaluation, the mean of the normalised [Levenshtein Similarity Score](https://en.wikipedia.org/wiki/Levenshtein_distance) will be used to measure how well the model performs.**
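
To make the metric concrete, here is a small pure-Python sketch. The exact normalisation used by the challenge is not stated in this notebook, so the one below (1 minus the distance divided by the length of the longer string) is an assumption, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        previous = current
    return previous[-1]

def normalised_similarity(truth, prediction):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(truth), len(prediction))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(truth, prediction) / longest

def mean_similarity(truths, predictions):
    """Average the per-sample similarity over paired truths and predictions."""
    scores = [normalised_similarity(t, p) for t, p in zip(truths, predictions)]
    return sum(scores) / len(scores)

print(mean_similarity(["abcd", "ab"], ["abcf", "ab"]))
```

A score of 1.0 means every prediction matches its label exactly; each wrong, missing, or extra character lowers the score for that sample.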
- Follow all submission guidelines strictly to avoid inconvenience.
%% Cell type:code id: tags:
``` python
print("Wrote output file to : ", OUTPUT_PATH)
```
%% Cell type:markdown id: tags:
## To download the generated csv and tar.gz in Colab, run the command below
%% Cell type:code id: tags:
``` python
try:
    from google.colab import files
    files.download('submission.csv')
    files.download(OUTPUT_PATH)
except Exception:
    # The import fails outside Colab, so the download helpers are unavailable there
    print("Option only available in Google Colab")
```
%% Cell type:markdown id: tags:
### Well Done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the [challenge page](https://www.aicrowd.com/challenges/CPTCHA) and make one.
### Well Done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the [challenge page](https://www.aicrowd.com/challenges/jigsaw) and make one.