Introduction
Imagine being able to ask a question using a picture, query instructional videos to receive a single targeted clip, or search through a diverse collection of multimedia files using your voice. All of this is possible with the Open Platform for Enterprise AI (OPEA™) Multimodal Question and Answer (MMQnA) chatbot. The MMQnA chatbot leverages the power of multimodal AI to deliver a flexible and intuitive way to interact with complex datasets. Whether you’re a developer, a data scientist, or an enterprise looking to enhance your information retrieval capabilities, this tool is designed to help you efficiently meet your needs.
In the era of Large Language Models (LLMs), we can now apply robust, accurate models to complex datasets. Instead of being limited to a single modality, like text, we can leverage transformer architectures that accept any modality as input. Here, we introduce an MMQnA chatbot capable of handling any mix of text, images, spoken audio, and video in a Retrieval-Augmented Generation (RAG) workflow.
This article will walk you through the steps to deploy and test drive OPEA’s MMQnA megaservice on the Intel® Gaudi® 2 AI accelerator using Intel® Tiber™ AI Cloud. From setup to execution, we’ll cover everything you need to know to get started with this multimodal GenAI application.
What is OPEA?
OPEA is an open platform consisting of composable building blocks for state-of-the-art generative AI systems. It is ideal for showcasing MMQnA because it is flexible, secure, and cost-effective. OPEA makes it easy to integrate advanced AI solutions into business systems, speeding up development and adding value. It uses a modular approach with microservices for flexibility and megaservices for comprehensive solutions, simplifying the development and scaling of complex AI applications. OPEA also supports powerful hardware like Intel Gaudi 2 and Intel® Xeon® Scalable Processors, which are adept at handling the heavy demands of AI models. Plus, OPEA’s GenAIExamples repository demonstrates many different scenarios and makes accessing the services easy and user-friendly.
Overview of the MMQnA Chatbot
The MMQnA chatbot example uses advanced open-source Large Vision-Language Models (LVMs) to handle any mix of modalities, providing accurate and contextually relevant responses. The features introduced in this tutorial can address real-world challenges across many different industries, from healthcare to retail. With the MMQnA chatbot, you can chat with:
- A collection of videos
- A collection of audio files (e.g. a podcast library)
- Images that have customized captions or labels
- Multimodal PDFs
- A diverse set of multimodal data (images, text, video, audio, and PDFs) using a spoken audio query
The MMQnA chatbot makes use of the innovative BridgeTower model, a state-of-the-art multimodal encoding transformer that seamlessly integrates visual and textual data into a unified semantic space. This allows the retrieval system to dynamically fetch the most relevant multimodal information, whether it be frames, transcripts, or captions, from your data collection to answer complex queries.
At its core, MMQnA consists of three key components: the embedding service, the retrieval service, and the LVM service. During the data ingestion phase, the BridgeTower model processes visual and textual data, embedding them and storing these embeddings in a vector database. When a user poses a question, the retrieval service retrieves the most relevant multimodal content from this vector store and feeds it into the LVM to generate a comprehensive response. This architecture ensures that MMQnA can handle a wide range of queries.

The user interface (UI) for MMQnA makes it easy to interact with the system (see Figure 3). The UI is included with the MMQnA example and is deployed in a Docker container, providing a user-friendly way to upload data, input queries, and view chat responses along with the retrieved media.
Prerequisites
Before you begin setting up the MMQnA chatbot, ensure you have the following prerequisites in place:
- Hardware: You must have access to a machine with two or more Intel Gaudi 2 processor cards. We will be using Intel Tiber AI Cloud for this tutorial – specifically, an instance with 8 Gaudi 2 HL-225H mezzanine cards, 3rd Generation Xeon processors, 1 TB of RAM, and 30 TB of disk space. Because the UI runs on the remote machine, you will need to forward the UI port (5173) over SSH to access the user interface from your local browser.
- Docker Compose: Docker Compose will be used to run the services. Ensure Docker Compose is installed on your machine.
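A quick way to verify this on the target machine (these commands only print version information):

```bash
# Confirm that Docker and the Compose plugin are installed
docker --version
docker compose version
```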
With these prerequisites in place, you are ready to proceed with the step-by-step tutorial, where we will set up and deploy the MMQnA chatbot on Intel Gaudi 2 using Intel Tiber AI Cloud.
Step-by-Step Tutorial
Follow these steps to get the MMQnA megaservice up and running on Intel Gaudi 2 using Intel Tiber AI Cloud and start interacting with your own multimodal data.
Step 1: SSH to the Intel Gaudi Machine
If you are using Intel Tiber AI Cloud, start an Intel Gaudi 2 instance and wait for it to be in the “Ready” state. SSH to the VM with port forwarding to access the user interface at port 5173.
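A minimal example of the SSH command is shown below; the user and address are placeholders, and the Intel Tiber AI Cloud console displays the exact SSH command for your instance.

```bash
# Forward local port 5173 to port 5173 on the Gaudi machine so the UI
# can later be opened at http://127.0.0.1:5173
ssh -L 5173:localhost:5173 <user>@<gaudi-instance-address>
```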
Step 2: Clone the GenAIExamples Repository
Clone the GenAIExamples repository and navigate to the MMQnA Intel Gaudi directory (the docker_compose directory contains options for running with different hardware).
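For example, assuming the repository layout at the time of writing (adjust the path if your release differs):

```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi
```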
Step 3: Configure the Environment
Set the necessary environment variables for the host IP and Hugging Face token, and run the set_env.sh script for Intel Gaudi. Note that if your enterprise is behind a proxy, OPEA components require specific proxy environment variables. Many parameters can be customized by editing the set_env.sh script before you source it. For example, on Intel Gaudi the LVM model defaults to llava-hf/llava-v1.6-vicuna-13b-hf. While a token is not required for the default LLaVA model, some Hugging Face models are gated and do require one.
The services’ many port numbers are specified in the set_env.sh script and can be edited there if any changes are necessary before starting the application. When you are finished inspecting and editing the entries in the set_env.sh script, source it in your environment.
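A sketch of this step is shown below, assuming the variable names used in the OPEA examples (host_ip and HUGGINGFACEHUB_API_TOKEN); check set_env.sh and the README for the exact names in your release.

```bash
# IP address of the host running the services
export host_ip=$(hostname -I | awk '{print $1}')

# Only needed if you switch to a gated Hugging Face model
export HUGGINGFACEHUB_API_TOKEN="<your_hf_token>"

# Optionally edit set_env.sh first (model names, ports, etc.), then source it
source set_env.sh
```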
Step 4: Start the Services with Docker Compose
The Docker Compose YAML configuration file defines and manages the multi-container MMQnA application. Running the docker compose up command reads the YAML file and starts all the defined services with the specified networks, variables, ports, and dependencies required for the application to run. Start the services using Docker Compose:
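(In recent releases the compose file is named compose.yaml; adjust the -f argument if your checkout uses a different file name.)

```bash
# Start all of the MMQnA services in detached mode
docker compose -f compose.yaml up -d
```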
Step 5: Monitor the Service Initialization
Docker Compose will pull images from Docker Hub and start the containers according to the configuration file. Some of the services will take a while to initialize. The embedding-multimodal-bridgetower container will take one or two minutes to show as Healthy, and the TGI Gaudi service will take up to ten minutes to become operational because of the large size of the LVM model. You can monitor a service's status by following its container logs, like this:
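(The container name below is the default used throughout this tutorial; use docker ps to confirm the names in your deployment.)

```bash
# Follow the TGI Gaudi log until the LVM finishes loading
docker logs -f tgi-llava-gaudi-server
```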
When the TGI Gaudi service is ready, the end of the log file will have a Connected or Uvicorn running on <url> message.
The logs of other containers can be reviewed for readiness before testing. For example:
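(These are generic Docker commands; the embedding container name shown is the one referenced in this tutorial.)

```bash
# List all containers and their health status
docker ps

# Inspect the BridgeTower embedding service log for readiness
docker logs embedding-multimodal-bridgetower
```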
When all of the services are up and running properly, you will be ready to test them.
Step 6: Sanity Check with Curl Commands (Optional)
The README in the GenAIExamples repository contains shell commands for sanity checking the microservice and megaservice APIs. By running these curl commands, you can validate that the services are properly set up and responsive. Below is a short example that tests the embedding-multimodal-bridgetower service, sets up some data, tests the dataprep microservice, and then queries the megaservice. You can refer to this README for a more complete explanation and set of tests.
First, test the embedding-multimodal-bridgetower service by sending a POST request with some sample text:
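(The port and route below are typical defaults for this example at the time of writing; confirm them against set_env.sh and the README for your release.)

```bash
# Send sample text to the BridgeTower embedding service
curl http://${host_ip}:6006/v1/encode \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a sample sentence."}'
```

A successful call should return a JSON payload containing an embedding vector.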
Next, download a sample video file to set up some data for ingestion:
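(Any short .mp4 clip will work here; the URL below points to a publicly hosted sample video.)

```bash
# Download a short sample video to use for ingestion
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
```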
Then, test the dataprep microservice by generating a transcript for the video file:
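(The dataprep port and route below are typical defaults for this example; verify them in set_env.sh and the README.)

```bash
# Ask the dataprep microservice to transcribe and ingest the video
curl http://${host_ip}:6007/v1/generate_transcripts \
  -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./${video_fn}"
```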
You should see a response that includes status 200 and the message saying Data preparation succeeded. Finally, test the overall functionality of the megaservice by sending a text query:
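(The megaservice port and route below are typical defaults; the question itself is only an illustration and should relate to the video you ingested.)

```bash
# Query the MMQnA megaservice with a plain text question
curl http://${host_ip}:8888/v1/multimodalqna \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"messages": "What is happening in the video?"}'
```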
You should see a response that answers the question using the downloaded video. If you see an error, check the logs of the tgi-llava-gaudi-server service again and make sure it is ready. By performing these sanity checks, you have confirmed that the entire MMQnA pipeline, from data ingestion to query response, is working seamlessly.
Step 7: Access the User Interface
From your browser, navigate to http://127.0.0.1:5173 to access the MMQnA UI.
Step 8: Data Ingestion
There are four different UI tabs for data ingestion – Upload Video, Upload Image, Upload Audio, and Upload PDF. This part of the tutorial demonstrates uploading a PDF, but you can mix and match the four types to your own needs. Download a PDF with at least one image. We will use this example: https://www.coris.noaa.gov/activities/resourceCD/resources/edge_abyss_bm.pdf. Click on the “Upload PDF” tab and either drag the file into the upload box or select the file after clicking the upload box. The images and text from the PDF file will be uploaded and you will see the message Congratulations, your upload is done!

Step 9: Query the Megaservice
Click back to the “MultimodalQnA” query tab and enter a prompt into the text box. You will be able to chat with the contents of the file you have uploaded.

You can use text, an image, or the microphone to input follow-up queries. Use the “Clear” button to start over with a new conversation.
Step 10: Shutdown
When you have finished using the MMQnA service, shut down all running containers to free up system resources. To do this, navigate to the directory containing your Docker Compose YAML file and run the following command:
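(Run this from the same directory used in Step 4 so Compose picks up the same file; adjust the -f argument if your file name differs.)

```bash
# Stop and remove the MMQnA containers and their network
docker compose -f compose.yaml down
```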
This step-by-step tutorial has shown you how to set up and run the MMQnA chatbot application. If you encountered any problems along the way, the next section will help you address them. After that, it will be time to start building and sharing your own multimodal data chatbot.
Tips and Troubleshooting
When working with the MMQnA application, you may encounter some issues. Here are a few tips to help you troubleshoot and resolve the most common problems.
- Check proxy variables: Ensure that your proxy environment variables (http_proxy, https_proxy, and no_proxy) are set correctly; an example is shown after this list. Incorrect proxy settings can prevent services from communicating properly.
- Check that the LVM container is initialized: Check the log of the tgi-llava-gaudi-server container with docker logs tgi-llava-gaudi-server. If it contains the string Connected near the bottom, then it is fully initialized. Otherwise, you will need to wait a little longer.
- View container logs: If a problem is not solvable by the above tips, you may need to inspect the logs of other containers to identify where the failure is occurring. Use docker ps to see the container names and docker logs <container_name> to figure out which service is failing. If an exception has been thrown, you will be able to view the stack trace and error type at the end of the log.
- Verify basic curl commands: Revisit the sanity checks and read the full microservices verification section in the README to isolate the problem. If you can narrow it down to a single microservice, API call, or input, it will make it easier to spot the problem. Look at the container logs in combination with specific curl commands.
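As a minimal sketch for the proxy tip above, the exports might look like the following; the proxy address is a placeholder, and the exact no_proxy list for your deployment is described in the OPEA documentation.

```bash
# Replace the proxy address with your organization's proxy
export http_proxy="http://proxy.example.com:8080"
export https_proxy="http://proxy.example.com:8080"
# Keep local and host addresses off the proxy so services can reach each other
export no_proxy="localhost,127.0.0.1,${host_ip}"
```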
For more detailed documentation, refer to the GenAIExamples GitHub repository.
Conclusion
We’ve walked through the setup, deployment, and usage of the MMQnA GenAI example from OPEA and provided some tips and tricks to get it all running smoothly. Now it’s your turn to imagine the possibilities for multimodal RAG in your enterprise. We encourage you to try out the MMQnA application, build with it, and experience firsthand how it can transform interactions with your datasets. Your feedback is appreciated, so please share your experiences and suggestions by submitting comments or issues through our GitHub repository.
Sources & Further Reading
- DeepLearning.AI. Multimodal RAG: Chat with videos. https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/
- Lal, V. (2023). Our demo at Intel Vision: Multimodal GenAI and RAG. LinkedIn. https://www.linkedin.com/pulse/our-demo-intel-vision-multimodal-genai-rag-vasudev-lal-qfurc/
- Liu, H., Li, C., Li, Y., & Lee, Y. J. (2024). Improved baselines with visual instruction tuning. arXiv. https://arxiv.org/abs/2310.03744
- Xu, X., Wu, C., Rosenman, S., Lal, V., Che, W., & Duan, N. (2024). BridgeTower: Building bridges between encoders in vision-language representation learning. arXiv. https://arxiv.org/abs/2206.08657
- OPEA Project 1.2 Links:
  - GenAIExamples GitHub repo
  - GenAIComps GitHub repo
  - Official documentation
Acknowledgements
Thank you to our colleagues who made contributions and helped to review this blog: Omar Khleif, Harsha Ramayanam, and Abolfazl Shahbazi.