Indian Sign Language Recognition using Google’s Mediapipe Framework
Sign Language Recognition using Artificial Intelligence has been an area of interest for many years. Researchers have tried several approaches and have had a lot of success in training machine learning and deep learning models that can recognize signs corresponding to different words. The majority of this research has been done for American Sign Language (ASL), and the methodologies require some kind of motion sensor or hand glove to detect the positions of the fingers accurately. These approaches are no doubt very effective and can account for almost every sign, but what bothers me is that they require highly sensitive hardware that cannot be used by everyone and often needs a specific environment. Other approaches to sign language recognition use deep learning models that work on skin-masked images. A skin-masked image is formed by segmenting out the region of the image that matches the color of skin: that region is assigned one color (white) and all the remaining pixels are assigned another (black). Refer to Figure 2 below.
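To give a rough idea of what such skin masking looks like in code, here is a minimal OpenCV sketch; the HSV thresholds and the file name are illustrative assumptions only, since the exact values depend heavily on lighting and skin tone:

```python
import cv2
import numpy as np

# Load a frame containing a hand (file name is just an example).
frame = cv2.imread("hand.jpg")

# Convert to HSV, where skin tones are easier to threshold than in BGR.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Rough skin-color range; the bounds here are illustrative, not universal.
lower_skin = np.array([0, 40, 60], dtype=np.uint8)
upper_skin = np.array([25, 180, 255], dtype=np.uint8)

# Pixels inside the range become white (255), everything else black (0).
mask = cv2.inRange(hsv, lower_skin, upper_skin)

# Remove small speckles before the mask is passed on for feature extraction.
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

cv2.imwrite("skin_mask.jpg", mask)
```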
In such approaches, after skin masking, important features are extracted from the images using several techniques, and deep learning models are trained to classify the different signs. Libraries such as OpenCV have made these tasks really easy, and the approaches have proven to be fast in real time. Still, deep learning models need more resources and might not perform well on common devices with limited hardware. Moreover, as can be seen in Figure 2, many important features, such as the exact positions of some fingers, are lost due to the self-occlusion of the hands and the complexity of the signs, which makes many different signs look the same (for example, M and N). This approach might work really well for a small dataset, but as the dataset grows it loses its effectiveness.
The third approach, which researchers have recently started using widely, relies on pose estimation models such as OpenPose. These methods have been the most effective to date and can accurately classify almost any sign without any special hardware such as motion sensors or hand gloves. A set of key points is detected in the image with the help of deep learning models, and these key points can accurately represent any hand position. The only problem is that, although this method can run in real time, it requires a good amount of resources and achieves only about 0.1 to 0.3 frames per second, which is far too slow to process frames smoothly in real time.
In mid-2019, Google released the Mediapipe framework with cross-platform support. What makes it better is its impressive frame rate and its ability to run in almost any environment (see Figure 3). It opens up a whole new world of possibilities for sign language recognition, because the requirement for a huge amount of resources has been eliminated.
Initially, Mediapipe only had support for Mac and Linux, but recently the developers made it available as a pip package with a new update on November 5. I had already been working on my final year project, Indian Sign Language recognition, when this update came, and I had already prepared a huge dataset of alphanumeric signs (0–9, A–Z) and trained on it. I was using a light hand key point tracking model and had prepared a preprocessed dataset containing the x and y coordinates of these key points in the images. The model I was using was fast but not accurate: it could not identify the key points reliably (see Figure 4). On observing closely, you will notice that the positions of the key points for the alphabets M and N are almost the same. This model was also unable to recognize proper key points for alphabets such as Q and C, because part of the hand is not visible in these signs. Due to this lack of accuracy, the machine learning model that I had trained was not very good at classifying the different signs.

I wanted to use the Mediapipe Hands model, but due to its incompatibility with Windows and integration issues, I was not able to implement it successfully even after investing 3–4 days. As soon as I found out about the new update I got very excited, and after facing some errors (for obvious reasons) I was finally able to use the functionality provided by the pip package in real time. It was so fast that I was able to reprocess my entire dataset in just 2 hours, something that had previously taken more than 2 days. I trained a Random Forest Classifier on the preprocessed data and this time got an accuracy of 98% in classifying the different alphabets and digits.
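A minimal sketch of that reprocessing and training step might look like the following. Here `image_paths` and `labels` stand in for my dataset on disk and the hyperparameters are illustrative assumptions; this version trains on the raw normalized coordinates only, while the extra features I actually used are described further below.

```python
import cv2
import mediapipe as mp
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mp_hands = mp.solutions.hands

def image_to_keypoints(path, hands):
    """Return the 21 normalized (x, y) hand key points for one image, or None if no hand is found."""
    image = cv2.imread(path)
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    return [(lm.x, lm.y) for lm in results.multi_hand_landmarks[0].landmark]

# Assumed dataset description, e.g.:
# image_paths = ["dataset/A/0001.jpg", ...]; labels = ["A", ...]
X, y = [], []
with mp_hands.Hands(static_image_mode=True, max_num_hands=1,
                    min_detection_confidence=0.5) as hands:
    for path, label in zip(image_paths, labels):
        keypoints = image_to_keypoints(path, hands)
        if keypoints is not None:
            X.append(np.array(keypoints).flatten())  # 42 values per sample
            y.append(label)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```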
The accuracy of Mediapipe Hands can be clearly observed by comparing Figure 5 with Figure 4. Accurate key point detection makes it much easier for the machine learning model to distinguish between signs that differ only slightly.
Now, let’s dive a little deeper into the machine learning model I trained. I did not simply use the x and y coordinates returned by Mediapipe Hands to train the model. Although these features are enough to classify the 36 signs (26 alphabets and 10 digits), if more signs were added to the dataset, more features would be required for good accuracy. Therefore, I also included the distances between the 0th key point (the key point at the very bottom of the palm) and the remaining 20 key points as features. The Mediapipe Hands model returns normalized coordinates for these key points, i.e. it divides each x coordinate by the width of the frame and each y coordinate by the height of the frame. For better normalization, I calculated new coordinates by shifting the origin to the 0th key point itself, so that we have the positions of the key points relative to the 0th key point. As a result, the location of the hand within the frame has little effect on these coordinates, and the model can generalize better.
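As a minimal sketch of these two feature computations, assuming each hand is given as a list of 21 normalized (x, y) key points (the function names are my own):

```python
import math

def shift_origin(keypoints):
    """Re-express the normalized (x, y) key points relative to the 0th (wrist) key point."""
    x0, y0 = keypoints[0]
    return [(x - x0, y - y0) for x, y in keypoints]

def distances_from_wrist(keypoints):
    """Euclidean distance from the 0th key point to each of the remaining 20 key points."""
    x0, y0 = keypoints[0]
    return [math.hypot(x - x0, y - y0) for x, y in keypoints[1:]]
```

For a single hand this yields 42 shifted coordinate values and 20 distances, to which the handedness is added as described in the workflow below.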
I have explained the general real-time workflow of the whole project below:
Initially, a frame is captured from the video and passed to the Mediapipe framework. Mediapipe runs two models in a pipeline: the first is a palm detection model that finds the region of the frame containing the hand, and this cropped region is then given as input to a second model that locates the hand landmarks. The recent release can also efficiently determine whether it is the right or the left hand. The detected hand landmarks are then passed to two separate functions: the first calculates the new coordinates after the origin is shifted to the 0th key point, and the second calculates the Euclidean distance between the 0th key point and the rest of the key points. Finally, the new coordinates, the distances, and the handedness (left or right) are given as input to the machine learning model, which predicts and returns the class corresponding to the sign.
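Put together, the real-time loop could look roughly like the sketch below, assuming the `shift_origin` and `distances_from_wrist` helpers from earlier and a trained classifier `clf`; the handedness encoding and window handling are my own simplifications, not the exact code from the project.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)  # webcam

with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB input; OpenCV captures BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            keypoints = [(lm.x, lm.y) for lm in results.multi_hand_landmarks[0].landmark]
            handed = results.multi_handedness[0].classification[0].label  # 'Left' or 'Right'
            shifted = shift_origin(keypoints)
            dists = distances_from_wrist(keypoints)
            # 42 shifted coordinates + 20 distances + handedness flag (encoding is an assumption).
            features = [c for xy in shifted for c in xy] + dists + [1 if handed == "Right" else 0]
            sign = clf.predict([features])[0]
            cv2.putText(frame, str(sign), (30, 60),
                        cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
        cv2.imshow("ISL recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
```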
I hope this gave you some idea of the power and potential of Mediapipe, and I would definitely encourage everyone to build more applications that use Mediapipe and can work efficiently in real time. Good luck!