How to Create and Deploy a Text Classification Model on AWS SageMaker
SageMaker can be pretty intimidating when you are a first-timer on the platform. Do not expect deployment to take anywhere near as little time as building the model in a Jupyter Notebook. The architecture of the AWS SageMaker environment takes time to decipher, as much of the current documentation around the container structure is a little misleading. I hope this blog helps first-time SageMaker enthusiasts get started.
Building a real-time predictor in SageMaker involves the following:
- The right IAM role on SageMaker, with access to the S3 folder
- An understanding of the Docker container OS environment, whether your own image or one of the existing ones
- The required scripts; in my case: preprocessor.py, train.py, and evaluate.py
To start, the training data needs to be stored in an accessible S3 bucket. Access the location and read the files to confirm you are reading the right data.
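A quick sanity check might look like this; the bucket name and prefix below are placeholders for your own, and pandas reads s3:// URLs directly once s3fs is installed:

```python
import pandas as pd

BUCKET = "my-sagemaker-bucket"       # hypothetical bucket name
PREFIX = "text-classification/data"  # hypothetical prefix

# pandas delegates s3:// URLs to s3fs, so this reads straight from the bucket
train_df = pd.read_csv(f"s3://{BUCKET}/{PREFIX}/train.csv")
validation_df = pd.read_csv(f"s3://{BUCKET}/{PREFIX}/validation.csv")

print(train_df.shape, validation_df.shape)
print(train_df.head())
```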
Once we are sure about the data, we can start building preprocessor.py. This is a Python script that runs inside the container image discussed below. It first reads the same files from S3, train.csv and validation.csv, and stores them in the OS path of the container image for further processing. Since the training data is text, it needs to be cleaned of non-alphanumeric characters, punctuation, and so on; after processing, the data is stored in the output OS path.
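Here is a minimal sketch of what preprocessor.py could look like. The "text" column name and the exact cleaning steps are assumptions; adapt them to your dataset:

```python
# preprocessor.py -- a minimal sketch, not the author's exact script
import os
import re
import pandas as pd

BUCKET = "my-sagemaker-bucket"            # hypothetical, same as above
PREFIX = "text-classification/data"       # hypothetical, same as above
OUTPUT_DIR = "/opt/ml/processing/output"  # files written here get uploaded back to S3

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and other non-alphanumerics
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for name in ("train.csv", "validation.csv"):
        df = pd.read_csv(f"s3://{BUCKET}/{PREFIX}/{name}")  # s3fs handles the s3:// URL
        df["text"] = df["text"].astype(str).apply(clean)
        df.to_csv(os.path.join(OUTPUT_DIR, name), index=False)
```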
We now initiate the sklearn processor to run the preprocessing script in the container, with the container image passed as one of its parameters.
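A sketch of launching the job follows. The image URI, instance type, and destination path are placeholders; if you use the stock scikit-learn image rather than a tailored one, SKLearnProcessor with a framework_version parameter serves the same purpose:

```python
import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingOutput

role = sagemaker.get_execution_role()  # the IAM role mentioned earlier

sklearn_processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-sklearn:latest",  # your tailored image
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

sklearn_processor.run(
    code="preprocessor.py",
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",               # where preprocessor.py wrote its files
            destination=f"s3://{BUCKET}/{PREFIX}/processed",  # BUCKET/PREFIX as defined earlier
        )
    ],
)
```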
I imported a scikit-learn Docker container and tailored it to my business case; the container comes with scikit-learn ready to use. A requirements.txt file lists all the packages that need to be installed and imported in the container environment. In my case, the following packages were included: nltk, s3fs, pandas, numpy, tar, joblib, etc.
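For reference, the pip-installable part of that list would sit in requirements.txt roughly like this (tar needs no entry of its own, since Python ships with the tarfile module):

```
nltk
s3fs
pandas
numpy
joblib
```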
After the preprocessing script runs successfully, we build the training script, train.py. The training script essentially picks up the processed data from the OS path (this path is specific to the container; check the container documentation to understand the paths) and trains a model.
In my train.py file, I used the Pipeline class to chain a count vectorizer, a TF-IDF transformer, and a MultinomialNB() model, so that incoming text to be classified runs through the whole pipeline and I don't have to write another script to transform text into features with the existing vocabulary of the training model. The model is then dumped to the /opt/ml/model path.
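A minimal sketch of that training script, assuming the processed data is mounted at /opt/ml/processing/input and the "text"/"label" column names from before:

```python
# train.py -- a minimal sketch of the training step
import os
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

TRAIN_PATH = "/opt/ml/processing/input/train.csv"  # processed data mounted here (assumption)
MODEL_DIR = "/opt/ml/model"                        # the model is dumped to this path

if __name__ == "__main__":
    df = pd.read_csv(TRAIN_PATH)

    # One pipeline, so raw text at inference time is vectorized with the
    # training vocabulary automatically -- no separate transform script needed.
    model = Pipeline([
        ("count_vectorizer", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ])
    model.fit(df["text"], df["label"])

    os.makedirs(MODEL_DIR, exist_ok=True)
    joblib.dump(model, os.path.join(MODEL_DIR, "model.joblib"))
```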
The training script also needs to define the functions input_fn, predict_fn, and output_fn, which are called later when we use the model to predict the class of new data. It is essential to write them for text classification, as the default ones would not work sufficiently.
Now it is time to run the training script. Like the preprocessing script, train.py is run through the sklearn processor.
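In sketch form, with the input pointing at the processed data from the previous job and the output collecting the dumped model:

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="train.py",
    inputs=[
        ProcessingInput(
            source=f"s3://{BUCKET}/{PREFIX}/processed",  # output of the preprocessing job
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/model",                      # where train.py dumped the pipeline
            destination=f"s3://{BUCKET}/{PREFIX}/model",
        )
    ],
)
```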
Once the model is trained, you can save it to S3 as a model artifact and can also access it directly to predict results. In my case, I am reading the model from the /opt/ml/model path.
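For deployment later, SageMaker expects the artifact packaged as a model.tar.gz; a sketch of doing that by hand, assuming model.joblib is available locally:

```python
import tarfile
import boto3

# bundle the dumped pipeline into the tarball format SageMaker expects
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.joblib")

boto3.client("s3").upload_file("model.tar.gz", BUCKET, f"{PREFIX}/model/model.tar.gz")
```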
Now it is time for the evaluate.py script. This script will read the model from the /opt/ml/processing/model path and run prediction on the test set. It will generate the desired scores from the confusion matrix.
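A sketch of that evaluation script; the test file name, column names, and choice of metrics are assumptions:

```python
# evaluate.py -- a minimal sketch of the evaluation step
import os
import json
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

MODEL_PATH = "/opt/ml/processing/model/model.joblib"
TEST_PATH = "/opt/ml/processing/test/test.csv"
EVAL_DIR = "/opt/ml/processing/evaluation"

if __name__ == "__main__":
    model = joblib.load(MODEL_PATH)
    test_df = pd.read_csv(TEST_PATH)

    predictions = model.predict(test_df["text"])
    report = {
        "accuracy": accuracy_score(test_df["label"], predictions),
        "confusion_matrix": confusion_matrix(test_df["label"], predictions).tolist(),
    }

    os.makedirs(EVAL_DIR, exist_ok=True)
    with open(os.path.join(EVAL_DIR, "evaluation.json"), "w") as f:
        json.dump(report, f)
```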
We run the evaluation script using the sklearn processor, just as we did with train.py and preprocessor.py. The container also has a path for storing the evaluation results, and we use it below: /opt/ml/processing/evaluation.
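Sketched out, assuming a test.csv stored alongside the training data:

```python
sklearn_processor.run(
    code="evaluate.py",
    inputs=[
        ProcessingInput(
            source=f"s3://{BUCKET}/{PREFIX}/model",     # trained model from the previous job
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=f"s3://{BUCKET}/{PREFIX}/test.csv",  # hypothetical test set location
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/evaluation",
            destination=f"s3://{BUCKET}/{PREFIX}/evaluation",
        )
    ],
)
```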
If the evaluation script gives the desired results, the model is ready to be deployed.
There are built-in functions that run when we call the deploy function on the model. But if the input data type or the way the output is generated deviates from the defaults, as it does in the text classification case, the following functions need to be changed and substituted in the train.py script (as mentioned above); a minimal skeleton is sketched after the list. I am going to cover the changes I made to these functions in my case in a separate blog. You can comment in the section below if you want to discuss your case.
1. model_fn
2. predict_fn
3. input_fn
4. output_fn
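As a starting point, here is a skeleton of the four hooks, assuming the pipeline model above and JSON input; my exact versions are for that separate post:

```python
# Inference hooks for the scikit-learn container -- a sketch, not the exact versions
import os
import json
import joblib

def model_fn(model_dir):
    # Load the pipeline dumped by train.py.
    return joblib.load(os.path.join(model_dir, "model.joblib"))

def input_fn(request_body, request_content_type):
    # Accept a JSON list of raw strings, e.g. '["some text to classify"]'.
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model):
    # The pipeline vectorizes raw text itself, so we can predict directly.
    return model.predict(input_data)

def output_fn(prediction, accept):
    # Return predicted labels as a JSON list.
    return json.dumps(prediction.tolist())
```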
There are no more obstacles if you are through this step. Just deploy the model with an appropriate endpoint_name, or go with a system-generated one.
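A deployment sketch, assuming the model.tar.gz from earlier and the inference hooks living in train.py; the framework version, instance type, and endpoint name are placeholders:

```python
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = SKLearnModel(
    model_data=f"s3://{BUCKET}/{PREFIX}/model/model.tar.gz",
    role=role,
    entry_point="train.py",  # contains model_fn / input_fn / predict_fn / output_fn
    framework_version="0.23-1",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="text-classifier",  # or omit for a system-generated name
    serializer=JSONSerializer(),      # match input_fn's application/json
    deserializer=JSONDeserializer(),
)

print(predictor.predict(["this product is great"]))  # e.g. ['positive']
```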
Do not forget to delete the endpoint when you are not using it, and save some money!
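One line does it:

```python
predictor.delete_endpoint()
```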
You can find the whole code in the GitHub repo: https://github.com/pranidhii/My_Fav_Projects