The main objective of Sign Language Production (SLP) is to generate human-like signing avatars from text input. Deep-learning-based SLP methods typically follow several steps. First, the text is translated into gloss, a written notation that labels individual signs and their order. The gloss is then used to generate a skeletal pose video that imitates sign language. Finally, this video is rendered into a more realistic avatar animation that closely resembles a real signer. The complexity of this pipeline makes sign language data difficult to gather and process.
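To make the pipeline concrete, here is a minimal, purely illustrative Python sketch of the text-to-gloss-to-pose-to-video flow. The toy lexicon, the placeholder pose format, and all function names are assumptions made for illustration, not part of any actual SLP system.

```python
from typing import List

# Toy text-to-gloss step: a real system would use a trained seq2seq model.
TOY_LEXICON = {"hello": "HELLO", "world": "WORLD"}

def text_to_gloss(text: str) -> List[str]:
    # Map each word to a gloss label (upper-case placeholder if unknown).
    return [TOY_LEXICON.get(w.lower(), w.upper()) for w in text.split()]

def gloss_to_pose(glosses: List[str]) -> List[List[float]]:
    # A real model would predict per-frame skeletal keypoints; here each
    # gloss simply maps to one placeholder "frame" of (x, y) joint values.
    return [[0.0, 0.0] for _ in glosses]

def pose_to_video(pose_frames: List[List[float]]) -> str:
    # A rendering stage would animate an avatar; here we just report a path.
    return f"rendered_{len(pose_frames)}_frames.mp4"

# text -> gloss -> pose sequence -> rendered avatar video
print(pose_to_video(gloss_to_pose(text_to_gloss("hello world"))))
```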
In recent years, most studies have centered on a German Sign Language (GSL) dataset called PHOENIX14T, along with other lesser-known datasets, for Sign Language Production, Recognition, and Translation tasks (SLP, SLR, and SLT). The challenges these datasets pose, including the absence of standardized tools and slow progress on under-resourced languages, have significantly dampened researchers' enthusiasm. The fact that studies on American Sign Language (ASL) datasets are still in their early stages further underscores the difficulty of the problem.
While mainstream datasets have enabled significant progress in the field, they do not address several emerging challenges:
- Existing datasets often store data in complex forms, such as raw images, scripts, OpenPose skeleton keypoints, graphs, and other intermediate preprocessing formats, rather than in a directly trainable form.
- Manually annotating glosses is a laborious and time-consuming process.
- Sign videos must be recorded by sign language experts and then transformed into various formats, which makes expanding a dataset difficult.
Researchers from various institutions present Prompt2Sign, a pioneering dataset that tracks the upper body movements of sign language demonstrators on a large scale. This dataset is a significant advancement in multilingual sign language recognition and generation, combining eight distinct sign languages using publicly available online videos and datasets to overcome the limitations of previous efforts.
To construct the dataset, the researchers first use OpenPose to extract pose information from video frames and standardize it into a preset format. Storing only the key information in this standardized format reduces redundancy and simplifies training with seq2seq and text2text models. Prompt words are generated automatically to reduce the need for human annotation, and the processing and data-collection tools are designed to be highly automated, efficient, and lightweight.
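As a rough illustration, the snippet below sketches what such a standardization step could look like: it reads per-frame OpenPose JSON output (the `people`, `pose_keypoints_2d`, and `hand_*_keypoints_2d` fields are OpenPose's standard output layout) and flattens selected upper-body and hand keypoints into one compact line per frame. The target format, joint selection, and file naming are assumptions, not the exact Prompt2Sign specification.

```python
import json, glob

# BODY_25 indices 0-7: nose, neck, shoulders, elbows, wrists (upper body only).
UPPER_BODY = list(range(0, 8))

def frame_to_vector(openpose_json_path: str) -> list:
    with open(openpose_json_path) as f:
        data = json.load(f)
    person = data["people"][0]                       # first detected signer
    body = person["pose_keypoints_2d"]               # flat [x, y, conf, x, y, conf, ...]
    hands = person["hand_left_keypoints_2d"] + person["hand_right_keypoints_2d"]

    vec = []
    for j in UPPER_BODY:                             # keep only upper-body joints
        vec += body[3 * j: 3 * j + 2]                # drop the confidence value
    for k in range(0, len(hands), 3):                # keep all 2 x 21 hand joints
        vec += hands[k: k + 2]
    return vec

def video_to_compact(frame_dir: str, out_path: str) -> None:
    # One line of whitespace-separated floats per frame: easy to stream into
    # seq2seq / text2text training without further preprocessing.
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(f"{frame_dir}/*_keypoints.json")):
            out.write(" ".join(f"{v:.2f}" for v in frame_to_vector(path)) + "\n")
```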
The team acknowledges that a new dataset poses its own challenges during model training and that existing models may require adjustments. Handling multiple sign languages simultaneously is difficult because sign languages vary across countries, and faster training techniques, under-researched settings, and better language-comprehension prompts all need investigation. To address these issues, the team introduces SignLLM, the first large-scale multilingual SLP model, built on the Prompt2Sign dataset.
SignLLM can generate skeletal poses for eight different sign languages from text or prompt inputs. It features two modes: the Multi-Language Switching Framework (MLSF), which generates multiple sign languages in parallel, and the Prompt2LangGloss module, which supports generation with static encoder-decoder pairs. With the new dataset, the team aims to establish a standard for multilingual sign language recognition and generation.
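The snippet below is a hedged, PyTorch-style sketch of the general multi-language switching idea: one encoder-decoder pair per sign language, selected at run time by a language tag. The module choices, sizes, and language tags are illustrative assumptions and do not reproduce the actual SignLLM architecture.

```python
import torch
import torch.nn as nn

# Placeholder tags for eight sign languages; not necessarily the eight in Prompt2Sign.
LANGS = ["ASL", "GSL", "LANG3", "LANG4", "LANG5", "LANG6", "LANG7", "LANG8"]

class MultiLangSLP(nn.Module):
    def __init__(self, d_model: int = 256, pose_dim: int = 100):
        super().__init__()
        # A dedicated encoder-decoder pair for each language.
        self.encoders = nn.ModuleDict({l: nn.GRU(d_model, d_model, batch_first=True) for l in LANGS})
        self.decoders = nn.ModuleDict({l: nn.GRU(d_model, d_model, batch_first=True) for l in LANGS})
        self.to_pose = nn.Linear(d_model, pose_dim)   # shared projection to skeletal keypoints

    def forward(self, text_emb: torch.Tensor, lang: str) -> torch.Tensor:
        # text_emb: (batch, seq_len, d_model) embedded text or prompt tokens
        enc_out, h = self.encoders[lang](text_emb)
        dec_out, _ = self.decoders[lang](enc_out, h)
        return self.to_pose(dec_out)                  # (batch, seq_len, pose_dim) pose sequence

model = MultiLangSLP()
poses = model(torch.randn(2, 16, 256), lang="ASL")    # switch languages by changing the tag
```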
To address the prolonged training time, the team's new loss function incorporates a reinforcement learning (RL) module that accelerates training across multiple languages and larger datasets. In extensive experiments and ablation studies, SignLLM outperforms baseline methods on both the development and test sets for all eight sign languages.
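One generic way to fold a reinforcement-style signal into a pose-regression loss is to reweight per-sample errors by a reward so that training focuses on poorly produced samples. The sketch below illustrates only this general idea; it is not the paper's actual loss formulation, and the reward definition is an assumption.

```python
import torch

def rl_weighted_pose_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (batch, frames, pose_dim) predicted and ground-truth poses
    per_sample = ((pred - target) ** 2).mean(dim=(1, 2))         # MSE per sample
    reward = 1.0 / (1.0 + per_sample.detach())                   # higher reward = better sample
    weights = torch.softmax(1.0 - reward, dim=0) * len(reward)   # emphasize low-reward samples
    return (weights * per_sample).mean()
```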
While this work significantly advances the automation of sign language data capture and processing, a comprehensive end-to-end solution is still missing. For instance, using one's own private dataset still requires extracting 2D keypoint JSON files with OpenPose and making manual updates, as illustrated below.
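In practice, this manual step typically amounts to running the OpenPose demo over each private video to dump per-frame keypoint JSON files, roughly as sketched below. The binary path and output directory are assumptions to adjust for your install, while `--video`, `--hand`, `--write_json`, `--display`, and `--render_pose` are standard OpenPose demo flags.

```python
import subprocess
from pathlib import Path

def extract_keypoints(video: str, out_dir: str = "keypoints_json") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run([
        "./build/examples/openpose/openpose.bin",  # adjust to your OpenPose install
        "--video", video,
        "--hand",                  # also track the 2 x 21 hand keypoints
        "--write_json", out_dir,   # one *_keypoints.json file per frame
        "--display", "0",
        "--render_pose", "0",      # skip on-screen rendering for speed
    ], check=True)

extract_keypoints("my_private_video.mp4")
```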
For more information, refer to the Paper and Project. All credit for this research goes to the project’s researchers.
Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies across the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.