Introduction
Imagine being able to ask a question using a picture, query instructional videos to receive a single targeted clip, or search through a diverse collection of multimedia files using your voice. All of this is possible with the Open Platform for Enterprise AI (OPEA™) Multimodal Question and Answer (MMQnA) chatbot. The MMQnA chatbot leverages the power of multimodal AI to deliver a flexible and intuitive way to interact with complex datasets. Whether you’re a developer, a data scientist, or an enterprise looking to enhance your information retrieval capabilities, this tool is designed to help you efficiently meet your needs.
In the era of Large Language Models (LLMs), we can now apply robust, accurate models to complex datasets. Instead of being limited to a single modality, like text, we can leverage transformer architectures that accept any modality as input. Here, we introduce an MMQnA chatbot capable of handling any mix of text, images, spoken audio, and video in a Retrieval-Augmented Generation (RAG) workflow.
This article will walk you through the steps to deploy and test drive OPEA’s MMQnA megaservice on the Intel® Gaudi® 2 AI accelerator using Intel® Tiber™ AI Cloud. From setup to execution, we’ll cover everything you need to know to get started with this multimodal GenAI application.
What is OPEA?
OPEA is an open platform consisting of composable building blocks for state-of-the-art generative AI systems. It is ideal for showcasing MMQnA because it is flexible, secure, and cost-effective. OPEA makes it easy to integrate advanced AI solutions into business systems, speeding up development and adding value. It uses a modular approach with microservices for flexibility and megaservices for comprehensive solutions, simplifying the development and scaling of complex AI applications. OPEA also supports powerful hardware like Intel Gaudi 2 and Intel® Xeon® Scalable Processors, which are adept at handling the heavy demands of AI models. Plus, OPEA’s GenAIExamples repository demonstrates many different scenarios and makes accessing the services easy and user-friendly.
Overview of the MMQnA Chatbot
The MMQnA chatbot example uses advanced open-source Large Vision-Language Models (LVMs) to handle any mix of modalities, providing accurate and contextually relevant responses. The features introduced in this tutorial can address real-world challenges across many different industries, from healthcare to retail. With the MMQnA chatbot, you can chat with:
- A collection of videos
- A collection of audio files (e.g. a podcast library)
- Images that have customized captions or labels
- Multimodal PDFs
- A diverse set of multimodal data (images, text, video, audio, and PDFs) using a spoken audio query
The MMQnA chatbot makes use of the innovative BridgeTower model, a state-of-the-art multimodal encoding transformer that seamlessly integrates visual and textual data into a unified semantic space. This allows the retrieval system to dynamically fetch the most relevant multimodal information, whether it be frames, transcripts, or captions, from your data collection to answer complex queries.
At its core, MMQnA consists of three key components: the embedding service, the retrieval service, and the LVM service. During the data ingestion phase, the BridgeTower model processes visual and textual data, embedding them and storing these embeddings in a vector database. When a user poses a question, the retrieval service retrieves the most relevant multimodal content from this vector store and feeds it into the LVM to generate a comprehensive response. This architecture ensures that MMQnA can handle a wide range of queries.

The user interface (UI) for MMQnA makes it easy to interact with the system (see Figure 3). The UI is included with the MMQnA example and is deployed in a Docker container, providing a user-friendly way to upload data, input queries, and view chat responses along with the retrieved media.
Prerequisites
Before you begin setting up the MMQnA chatbot, ensure you have the following prerequisites in place:
- Hardware: You must have access to a machine with two or more Intel Gaudi 2 processor cards. We will be using Intel Tiber AI Cloud for this tutorial – specifically, an instance with 8 Gaudi 2 HL-225H mezzanine cards, 3rd Generation Xeon processors, 1 TB of RAM, and 30 TB of disk space. Because the UI runs on the remote machine, you will need to forward the UI port (5173) over SSH to access the user interface from your local browser.
- Docker Compose: Docker Compose will be used to run the services. Ensure Docker Compose is installed on your machine.
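A quick way to verify this on the target machine (these commands only print version information):

```bash
# Confirm that Docker and the Compose plugin are installed
docker --version
docker compose version
```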
With these prerequisites in place, you are ready to proceed with the step-by-step tutorial, where we will set up and deploy the MMQnA chatbot on Intel Gaudi 2 using Intel Tiber AI Cloud.
Step-by-Step Tutorial
Follow these steps to get the MMQnA megaservice up and running on Intel Gaudi 2 using Intel Tiber AI Cloud and start interacting with your own multimodal data.
Step 1: SSH to the Intel Gaudi Machine
If you are using Intel Tiber AI Cloud, start an Intel Gaudi 2 instance and wait for it to be in the “Ready” state. SSH to the VM with port forwarding to access the user interface at port 5173.
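A minimal example of the SSH command is shown below; the user and address are placeholders, and the Intel Tiber AI Cloud console displays the exact SSH command for your instance.

```bash
# Forward local port 5173 to port 5173 on the Gaudi machine so the UI
# can later be opened at http://127.0.0.1:5173
ssh -L 5173:localhost:5173 <user>@<gaudi-instance-address>
```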
Step 2: Clone the GenAIExamples Repository
Clone the GenAIExamples repository and navigate to the MMQnA Intel Gaudi directory (the docker_compose directory contains options for running with different hardware).
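For example, assuming the repository layout at the time of writing (adjust the path if your release differs):

```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi
```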
Step 3: Configure the Environment
Set the necessary environment variables for the host IP and Hugging Face token, and run the set_env.sh script for Intel Gaudi. Note that if your enterprise is behind a proxy, OPEA components require specific proxy environment variables. Many parameters can be customized by editing the set_env.sh script before you source it. For example, on Intel Gaudi the LVM model defaults to llava-hf/llava-v1.6-vicuna-13b-hf. While a token is not required for the default LLaVA model, some Hugging Face models are gated and do require one.
The services’ many port numbers are specified in the set_env.sh script and can be edited there if any changes are necessary before starting the application. When you are finished inspecting and editing the entries in the set_env.sh script, source it in your environment.
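A sketch of this step is shown below, assuming the variable names used in the OPEA examples (host_ip and HUGGINGFACEHUB_API_TOKEN); check set_env.sh and the README for the exact names in your release.

```bash
# IP address of the host running the services
export host_ip=$(hostname -I | awk '{print $1}')

# Only needed if you switch to a gated Hugging Face model
export HUGGINGFACEHUB_API_TOKEN="<your_hf_token>"

# Optionally edit set_env.sh first (model names, ports, etc.), then source it
source set_env.sh
```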
Step 4: Start the Services with Docker Compose
The Docker Compose YAML configuration file defines and manages the multi-container MMQnA application. Running the docker compose up command reads the YAML file and starts all the defined services with the specified networks, variables, ports, and dependencies required for the application to run. Start the services using Docker Compose:
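(In recent releases the compose file is named compose.yaml; adjust the -f argument if your checkout uses a different file name.)

```bash
# Start all of the MMQnA services in detached mode
docker compose -f compose.yaml up -d
```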
Step 5: Monitor the Service Initialization
Docker Compose will pull images from Docker Hub and start the containers according to the configuration file. Some of the services will take a while to initialize. The embedding-multimodal-bridgetower container will take one or two minutes to show as Healthy, and the TGI Gaudi service will take up to ten minutes to become operational because of the large size of the LVM model. You can monitor a service's status by following its container logs, like this:
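(The container name below is the default used throughout this tutorial; use docker ps to confirm the names in your deployment.)

```bash
# Follow the TGI Gaudi log until the LVM finishes loading
docker logs -f tgi-llava-gaudi-server
```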
When the TGI Gaudi service is ready, the end of the log file will have a Connected or Uvicorn running on <url> message.
The logs of other containers can be reviewed for readiness before testing. For example:
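(These are generic Docker commands; the embedding container name shown is the one referenced in this tutorial.)

```bash
# List all containers and their health status
docker ps

# Inspect the BridgeTower embedding service log for readiness
docker logs embedding-multimodal-bridgetower
```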
When all of the services are up and running properly, you will be ready to test them.
Step 6: Sanity Check with Curl Commands (Optional)
The README in the GenAIExamples repository contains shell commands for sanity checking the microservice and megaservice APIs. By running these curl commands, you can validate that the services are properly set up and responsive. Below is a short example that tests the embedding-multimodal-bridgetower service, sets up some data, tests the dataprep microservice, and then queries the megaservice. You can refer to this README for a more complete explanation and set of tests.
First, test the embedding-multimodal-bridgetower service by sending a POST request with some sample text:
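(The port and route below are typical defaults for this example at the time of writing; confirm them against set_env.sh and the README for your release.)

```bash
# Send sample text to the BridgeTower embedding service
curl http://${host_ip}:6006/v1/encode \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a sample sentence."}'
```

A successful call should return a JSON payload containing an embedding vector.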
Next, download a sample video file to set up some data for ingestion:
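(Any short .mp4 clip will work here; the URL below points to a publicly hosted sample video.)

```bash
# Download a short sample video to use for ingestion
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
```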
Then, test the dataprep microservice by generating a transcript for the video file:
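(The dataprep port and route below are typical defaults for this example; verify them in set_env.sh and the README.)

```bash
# Ask the dataprep microservice to transcribe and ingest the video
curl http://${host_ip}:6007/v1/generate_transcripts \
  -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./${video_fn}"
```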
You should see a response that includes status 200 and the message saying Data preparation succeeded. Finally, test the overall functionality of the megaservice by sending a text query:
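(The megaservice port and route below are typical defaults; the question itself is only an illustration and should relate to the video you ingested.)

```bash
# Query the MMQnA megaservice with a plain text question
curl http://${host_ip}:8888/v1/multimodalqna \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"messages": "What is happening in the video?"}'
```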
You should see a response that answers the question using the downloaded video. If you see an error, check the logs of the tgi-llava-gaudi-server service again and make sure it is ready. By performing these sanity checks, you have confirmed that the entire MMQnA pipeline, from data ingestion to query response, is working seamlessly.
Step 7: Access the User Interface
From your browser, navigate to http://127.0.0.1:5173 to access the MMQnA UI.
Step 8: Data Ingestion
There are four different UI tabs for data ingestion – Upload Video, Upload Image, Upload Audio, and Upload PDF. This part of the tutorial demonstrates uploading a PDF, but you can mix and match the four types to your own needs. Download a PDF with at least one image. We will use this example: https://www.coris.noaa.gov/activities/resourceCD/resources/edge_abyss_bm.pdf. Click on the “Upload PDF” tab and either drag the file into the upload box or select the file after clicking the upload box. The images and text from the PDF file will be uploaded and you will see the message Congratulations, your upload is done!

Step 9: Query the Megaservice
Click back to the “MultimodalQnA” query tab and enter a prompt into the text box. You will be able to chat with the contents of the file you have uploaded.

You can use text, an image, or the microphone to input follow-up queries. Use the “Clear” button to start over with a new conversation.
Step 10: Shutdown
When you have finished using the MMQnA service, shut down all running containers to free up system resources. To do this, navigate to the directory containing your Docker Compose YAML file and run the following command:
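(Run this from the same directory used in Step 4 so Compose picks up the same file; adjust the -f argument if your file name differs.)

```bash
# Stop and remove the MMQnA containers and their network
docker compose -f compose.yaml down
```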
This step-by-step tutorial has shown you how to set up and run the MMQnA chatbot application. If you encountered any problems along the way, the next section will help you address them. After that, it will be time to start building and sharing your own multimodal data chatbot.
Tips and Troubleshooting
When working with the MMQnA application, you may encounter some issues. Here are a few tips to help you troubleshoot and resolve the most common problems.
- Check proxy variables: Ensure that your proxy environment variables (http_proxy, https_proxy, and no_proxy) are set correctly; an example is shown after this list. Incorrect proxy settings can prevent services from communicating properly.
- Check that the LVM container is initialized: Check the log of the tgi-llava-gaudi-server container with docker logs tgi-llava-gaudi-server. If it contains the string Connected near the bottom, then it is fully initialized. Otherwise, you will need to wait a little longer.
- View container logs: If a problem is not solvable by the above tips, you may need to inspect the logs of other containers to identify where the failure is occurring. Use docker ps to see the container names and docker logs <container_name> to figure out which service is failing. If an exception has been thrown, you will be able to view the stack trace and error type at the end of the log.
- Verify basic curl commands: Revisit the sanity checks and read the full microservices verification section in the README to isolate the problem. If you can narrow it down to a single microservice, API call, or input, it will make it easier to spot the problem. Look at the container logs in combination with specific curl commands.
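As a minimal sketch for the proxy tip above, the exports might look like the following; the proxy address is a placeholder, and the exact no_proxy list for your deployment is described in the OPEA documentation.

```bash
# Replace the proxy address with your organization's proxy
export http_proxy="http://proxy.example.com:8080"
export https_proxy="http://proxy.example.com:8080"
# Keep local and host addresses off the proxy so services can reach each other
export no_proxy="localhost,127.0.0.1,${host_ip}"
```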
For more detailed documentation, refer to the GenAIExamples GitHub repository.
Conclusion
We’ve walked through the setup, deployment, and usage of the MMQnA GenAI example from OPEA and provided some tips and tricks to get it all running smoothly. Now it’s your turn to imagine the possibilities for multimodal RAG in your enterprise. We encourage you to try out the MMQnA application, build with it, and experience firsthand how it can transform interactions with your datasets. Your feedback is appreciated, so please share your experiences and suggestions by submitting comments or issues through our GitHub repository.
Sources & Further Reading
- DeepLearning.AI. Multimodal RAG: Chat with videos. https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/
- Lal, V. (2023). Our demo at Intel Vision: Multimodal GenAI and RAG. LinkedIn. https://www.linkedin.com/pulse/our-demo-intel-vision-multimodal-genai-rag-vasudev-lal-qfurc/
- Liu, H., Li, C., Li, Y., & Lee, Y. J. (2024). Improved baselines with visual instruction tuning. arXiv. https://arxiv.org/abs/2310.03744
- Xu, X., Wu, C., Rosenman, S., Lal, V., Che, W., & Duan, N. (2024). BridgeTower: Building bridges between encoders in vision-language representation learning. arXiv. https://arxiv.org/abs/2206.08657
- OPEA Project 1.2 Links:
  - GenAIExamples GitHub repo
  - GenAIComps GitHub repo
  - Official documentation
Acknowledgements
Thank you to our colleagues who made contributions and helped to review this blog: Omar Khleif, Harsha Ramayanam, and Abolfazl Shahbazi.