Recent breakthroughs in generative AI and in large language, vision, and multimodal models can serve as a foundation for open-domain knowledge, inference, and generation capabilities, enabling open-ended task assistance scenarios. However, the capacity to produce pertinent instructions and content is just the beginning of what is needed to build AI systems that work with humans in the real world, such as mixed-reality task assistants, interactive robots, smart manufacturing floors, and autonomous vehicles.
To work seamlessly with humans in the real world, AI systems must continuously perceive and reason about their environment multimodally and in a streaming fashion. This requirement extends beyond object detection and tracking. For physical collaboration to succeed, all parties involved must understand what objects can be used for, how they relate to one another, what spatial constraints apply, and how these factors change over time.
These systems must reason not only about the physical world but also about people. This reasoning should cover lower-level judgments about body pose, speech, and actions, as well as higher-level judgments about cognitive states and the social norms of real-time collaborative behavior.
Microsoft Research introduces SIGMA, an interactive application that combines mixed-reality and AI technologies, including large language and vision models, to guide users through procedural tasks on HoloLens 2. Tasks can be drawn from a set of manually defined steps in a task library or generated dynamically by a large language model such as GPT-4. When a user asks SIGMA an open-ended question during the interaction, the system can use its large language model to provide an answer. In addition, SIGMA can locate and highlight task-relevant objects in the user’s field of view using vision models such as Detic and SEEM.
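The two ways of obtaining task steps described above can be illustrated with a minimal Python sketch. This is not SIGMA's code: the task library contents, the `get_task_steps` function, and the `fake_llm_generate` stand-in for a large language model call are all hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical task library: manually authored steps keyed by task name.
TASK_LIBRARY: Dict[str, List[str]] = {
    "make pour-over coffee": [
        "Boil water.",
        "Place the filter in the dripper.",
        "Add ground coffee.",
        "Pour water slowly over the grounds.",
    ],
}


def get_task_steps(
    task: str,
    library: Dict[str, List[str]],
    generate_steps: Callable[[str], List[str]],
) -> List[str]:
    """Return manually defined steps if the task is in the library,
    otherwise fall back to dynamically generated steps."""
    key = task.strip().lower()
    if key in library:
        return list(library[key])  # copy so callers cannot mutate the library
    return generate_steps(task)


def fake_llm_generate(task: str) -> List[str]:
    """Stand-in for a large language model call (assumption: a real system
    would prompt a model such as GPT-4 and parse its reply into steps)."""
    return [
        f"Gather what you need to {task}.",
        f"Carry out the task: {task}.",
        "Check the result.",
    ]
```

Calling `get_task_steps("Make pour-over coffee", TASK_LIBRARY, fake_llm_generate)` returns the four library steps, while a task that is not in the library, such as `"replace a bike tire"`, is routed to the generator instead.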
Several design choices support these research goals. For one, the system is implemented as a client-server architecture. A lightweight client application runs on the HoloLens 2 device and transmits multiple multimodal data streams to a more powerful desktop server. These streams include RGB (red, green, and blue) video, depth, audio, and head, hand, and gaze tracking information. The desktop server implements the application’s core functionality and sends data and instructions back to the client app on what content to display on the device. This design lets researchers work around the headset’s current compute limits and opens the door to extending the application to other mixed-reality devices.
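The division of labor in this client-server design can be sketched in a few lines of Python. This is a toy, in-process illustration under stated assumptions, not SIGMA's implementation: the class names, the message shapes, and the rule that an audio payload of `"next"` advances the task are all hypothetical, and a real system would send these messages over a network.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Stream names taken from the article; the order here is arbitrary.
STREAMS = ("rgb", "depth", "audio", "head", "hand", "gaze")


@dataclass
class StreamMessage:
    stream: str        # which sensor stream this message belongs to
    timestamp_ms: int  # capture time on the device
    payload: Any       # sensor data (kept abstract in this sketch)


@dataclass
class RenderCommand:
    kind: str  # what the client should display, e.g. "show_step"
    text: str


class HeadsetClient:
    """Lightweight client: packages sensor data and displays what it is told."""

    def __init__(self) -> None:
        self.displayed: List[str] = []

    def capture(self, timestamp_ms: int, sensors: Dict[str, Any]) -> List[StreamMessage]:
        # Emit one message per known stream that has data at this instant.
        return [
            StreamMessage(name, timestamp_ms, sensors[name])
            for name in STREAMS
            if name in sensors
        ]

    def render(self, commands: List[RenderCommand]) -> None:
        for command in commands:
            self.displayed.append(f"{command.kind}: {command.text}")


class DesktopServer:
    """Holds the application logic and decides what the client should show."""

    def __init__(self, steps: List[str]) -> None:
        self.steps = steps
        self.current = 0

    def process(self, messages: List[StreamMessage]) -> List[RenderCommand]:
        for message in messages:
            # Toy rule (assumption): the audio payload "next" advances the task.
            if (
                message.stream == "audio"
                and message.payload == "next"
                and self.current < len(self.steps) - 1
            ):
                self.current += 1
        return [RenderCommand("show_step", self.steps[self.current])]
```

In this sketch the client never decides what to show. It only forwards `StreamMessage` objects and renders the `RenderCommand` objects it gets back, which is what keeps the on-device application small.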
SIGMA is built on Platform for Situated Intelligence (\psi), an open-source framework for developing and researching multimodal, integrative AI systems. The underlying \psi framework provides performant streaming and logging infrastructure and supports fast prototyping. Its data replay infrastructure makes data-driven development and tuning at the application level possible. Finally, Platform for Situated Intelligence Studio offers extensive support for visualization, debugging, tuning, and maintenance.
While SIGMA’s current functionality is relatively simple, it serves as a foundation for future research at the intersection of mixed reality and AI. Many research problems, particularly in perception, can be and have been explored using collected datasets. These problems range from computer vision to speech recognition.
SIGMA is a research platform and one example of Microsoft’s ongoing efforts to explore new AI and mixed-reality technologies. Microsoft also offers an enterprise-ready mixed-reality solution for frontline workers, Dynamics 365 Guides. Copilot in Dynamics 365 Guides, which customers are currently using in private preview, combines AI and mixed reality to give frontline workers step-by-step procedural guidance and relevant information in the flow of work. Dynamics 365 Guides is a feature-rich product designed for frontline workers who perform complex tasks.
By making the system publicly available, the researchers hope to spare other researchers the foundational engineering work of building a full-stack interactive application, so they can move straight to the open research questions in their field.
Dhanshree Shenwai is a computer science engineer with experience at FinTech companies in the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.