Cracking video recommendation on a web page with AWS Comprehend — Part 1
Hosting a video on a web page improves the page's SEO ranking, and playing the most relevant video on the page increases user engagement, time spent, and user retention. In short, playing the right video on a page is a recipe for boosting page performance.
While solving this problem, I had three things on hand — 1) the transcripts of all videos, 2) an existing AWS environment, and 3) a small time frame to execute the project. I knew that the web page could be scraped to get its text content. So, at a high level, the problem could be classified as a text analytics problem.
The right solution in this context seemed to be topic modeling, a text analytics technique that classifies text into topics with scores. Topic models are particularly useful for discovering statistical regularities hidden in textual data in supervised, semi-supervised, or unsupervised settings. The approach was to identify weighted topics for both the video answers and the web page text; the answer whose topic scores were closest to the page's scores would play on the page.
Given the existing AWS environment, I chose AWS Comprehend for faster implementation.
AWS Comprehend is a Natural Language Processing (NLP) service that uses machine learning algorithms to find insights and relationships in unstructured text. The service identifies the language of the text; extracts key phrases, places, people, brands, and events; measures how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic. It assigns weights to the most relevant topics in a block of text.
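As a quick illustration, here is a minimal boto3 sketch exercising a few of these built-in analyses; the region and sample text are my own placeholders:

```python
import boto3

# Minimal sketch of Comprehend's built-in analyses.
# The region and sample text are placeholders, not from the project itself.
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "The new tutorial video explains how to deploy a web app on AWS."

# Identify the dominant language of the text
languages = comprehend.detect_dominant_language(Text=text)
print(languages["Languages"])

# Extract key phrases
phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
print([p["Text"] for p in phrases["KeyPhrases"]])

# Gauge how positive or negative the text is
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])
```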
The interesting aspect of the video transcripts is that they belong to specific domain topics, so this data, along with tagged topics, can be used to train a Comprehend custom classifier.
Readying the data for AWS Comprehend
To create a custom classifier in AWS Comprehend, you must train it with data in one of the following two formats:
- Using multi-class mode — the training file must have one class and one document per line, be in .csv format, and contain at least 10 documents per class (see the sample file after this list).
- Using multi-label mode — the training file must have one or more classes per line and one document per line, with classes separated by a delimiter; the file must be in .csv format with at least 10 documents per class.
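For illustration, here is what a tiny multi-class training file might look like when written from Python; the topic names and documents are hypothetical:

```python
import csv

# Hypothetical topics and documents; in multi-class mode Comprehend expects
# one class and one document per line, with no header row.
rows = [
    ("CLOUD_COMPUTING", "In this video we walk through launching an EC2 instance..."),
    ("MACHINE_LEARNING", "This transcript covers training a simple regression model..."),
    ("WEB_DEVELOPMENT", "Here we build a responsive landing page with HTML and CSS..."),
]

with open("comprehend_train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```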
Tagging every video transcript would have been a very expensive task, so I applied unsupervised clustering to the video transcripts, using the k-means method. The candidate range of clusters was chosen from the elbow on the graph of the number of clusters against the within-cluster sum of squared Euclidean distances.
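A minimal sketch of this step, assuming the transcripts are available one per line in a hypothetical local file; the feature settings and cluster range are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

# Assumption: one transcript per line in this hypothetical file
with open("transcripts.txt") as f:
    transcripts = [line.strip() for line in f if line.strip()]

# Turn the raw transcripts into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(transcripts)

# Inertia = within-cluster sum of squared Euclidean distances
k_values = range(2, 20)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" in this curve suggests a reasonable number of clusters
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```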
After clustering the videos, my team and I took random samples from the transcripts and tagged them with topics. Once these topics were agreed upon for every cluster number, they were extended to the whole data set. Using k-means clustering on the unlabelled data let me create useful insights and get a feel for the data while sparing my colleagues quite a bit of manual work, which went down quite well. Now every video transcript had a topic attached to it. This formed the first master data set (multi-class type), which was then uploaded to an AWS S3 bucket for training the Comprehend model. Of course, the idea is to have our users keep correcting the topics for new videos so that the topics become more relevant with time; I will try to cover this aspect in my next blog.
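Continuing the clustering sketch above, extending the agreed topics to every transcript and uploading the result might look like this; the cluster-to-topic mapping, file names, and bucket name are placeholders:

```python
import csv
import boto3

# Hypothetical mapping from cluster id to the topic the team settled on;
# `km` and `transcripts` carry over from the clustering sketch above,
# with `km` refit at the chosen k.
cluster_to_topic = {0: "CLOUD_COMPUTING", 1: "MACHINE_LEARNING", 2: "WEB_DEVELOPMENT"}

# Write the multi-class master file: one "TOPIC,document" row per transcript
with open("master_training_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, transcript in zip(km.labels_, transcripts):
        writer.writerow([cluster_to_topic[label], transcript])

# Upload the training file to S3 (bucket and key are placeholders)
s3 = boto3.client("s3")
s3.upload_file(
    "master_training_data.csv",
    "my-training-bucket",
    "comprehend/master_training_data.csv",
)
```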
Other inputs for creating the classifier (a boto3 sketch follows the list):
- Access to the S3 bucket where the input training data is uploaded
- Output data (optional) — can be stored in another S3 bucket folder; this will hold the confusion matrix for the training data set.
- An IAM (Identity and Access Management) role
- VPC setting (optional) — use a VPC to restrict the data that can be uploaded to, or downloaded from, an S3 bucket that you use with Amazon Comprehend.
- Tags (optional) — a tag is a label you can add to a resource as metadata to help you organize, search, or filter your data. Each tag consists of a key and an optional value.
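Putting these inputs together, a classifier can be created through the console or the API. A minimal boto3 sketch, with placeholder bucket names, role ARN, and classifier name:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# The classifier name, role ARN, and S3 URIs below are placeholders
response = comprehend.create_document_classifier(
    DocumentClassifierName="video-topic-classifier",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3AccessRole",
    InputDataConfig={
        "DataFormat": "COMPREHEND_CSV",
        "S3Uri": "s3://my-training-bucket/comprehend/master_training_data.csv",
    },
    OutputDataConfig={
        "S3Uri": "s3://my-training-bucket/comprehend/output/",
    },
    LanguageCode="en",
    Mode="MULTI_CLASS",
)
print(response["DocumentClassifierArn"])
```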
Once the classifier is trained with the above inputs, you can inspect its confusion matrix; a detailed output file is saved to the S3 bucket specified in the output data location.
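To check the training status and the headline evaluation metrics programmatically, something like the following works; the classifier ARN is a placeholder:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Placeholder ARN returned by create_document_classifier
classifier_arn = (
    "arn:aws:comprehend:us-east-1:123456789012:"
    "document-classifier/video-topic-classifier"
)

props = comprehend.describe_document_classifier(
    DocumentClassifierArn=classifier_arn
)["DocumentClassifierProperties"]

print(props["Status"])  # becomes TRAINED once training completes
if props["Status"] == "TRAINED":
    # Aggregate accuracy, precision, recall, and F1 on the held-out test split
    print(props["ClassifierMetadata"]["EvaluationMetrics"])
```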
To start using the classifier
Once the classifier is successfully created, there are two methods to run custom classification:
- Creating an endpoint — creating an endpoint on the classifier model enables synchronous analysis requests; the endpoint can be used to analyze any document in real time (see the sketch after this list).
- Batch analysis — creates asynchronous custom classification jobs to classify documents using custom labels.
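Here is a minimal sketch of the endpoint route with boto3; the ARNs and endpoint name are placeholders, and the endpoint must reach the IN_SERVICE state before it can serve requests:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Create a real-time endpoint on the trained classifier (placeholder ARN/name);
# DesiredInferenceUnits sizes the endpoint's throughput
endpoint = comprehend.create_endpoint(
    EndpointName="video-topic-endpoint",
    ModelArn=(
        "arn:aws:comprehend:us-east-1:123456789012:"
        "document-classifier/video-topic-classifier"
    ),
    DesiredInferenceUnits=1,
)

# Once the endpoint is IN_SERVICE, documents can be classified synchronously
result = comprehend.classify_document(
    Text="This video walks through hosting a static site on S3.",
    EndpointArn=endpoint["EndpointArn"],
)
for c in result["Classes"]:
    print(c["Name"], c["Score"])  # each label comes back with a weight
```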
I used the endpoint method to get labels for text in real time. The output scores, which are labels with weights, are stored against each answer in the database. I will share more updates on the implementation of scraping web data and matching the scores in my next blog.