The DeepSeek-R1 model stunned the world when it debuted, demonstrating reasoning skills on par with OpenAI o1* while being far more cost-efficient. It responds step by step, much like a human’s thought process, to solve problems in science, mathematics, coding, and general knowledge.
The DeepSeek-R1-Zero model was trained from the DeepSeek-V3-Base large language model (LLM) using reinforcement learning (RL), a machine learning technique that rewards the model for correct answers or sound reasoning steps rather than training on labeled datasets. Its training also skipped supervised fine-tuning (SFT) as a preliminary step. Because DeepSeek-R1-Zero suffers from repetition, poor readability, and language mixing, the DeepSeek-R1 model was introduced to address these issues. DeepSeek-R1 incorporates two RL stages to discover reasoning patterns similar to humans, and two SFT stages to seed the model’s reasoning and non-reasoning capabilities.
DeepSeek-R1 and the DeepSeek-R1-Distill models, which are fine-tuned from open source models including Meta* Llama and Alibaba Cloud* Qwen, are all available on the Hugging Face* hub. The models can be easily deployed using Amazon Web Services (AWS)* and the Open Platform for Enterprise AI (OPEA) ChatQnA example. The distilled versions allow researchers with limited computing power to run the model. According to Mario Krenn, leader of the Artificial Scientist Lab at the Max Planck Institute for the Science of Light in Erlangen, Germany, “an experiment that cost $370 to run with o1, cost less than $10 with R1.”
Our example uses a cost-effective m7i.4xlarge AWS instance, powered by 4th Generation Intel® Xeon® Scalable processors with 16 virtual CPUs (vCPUs), to run DeepSeek-R1-Distill-Qwen-1.5B on a fundamental math problem: solving a quadratic equation.
For the prompt, we will ask DeepSeek-R1-Distill-Qwen-1.5B to find the solutions to x^2 + 3x - 9 = 0 and to separate the reasoning from the final answer. For reference, the correct answer is two solutions: (-3 + 3√5)/2 and (-3 - 3√5)/2, or approximately 1.8541 and -4.8541 in decimal.
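For readers who want to check the arithmetic, applying the quadratic formula to x^2 + 3x - 9 = 0 gives:

```latex
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
  = \frac{-3 \pm \sqrt{3^2 - 4(1)(-9)}}{2(1)}
  = \frac{-3 \pm \sqrt{45}}{2}
  = \frac{-3 \pm 3\sqrt{5}}{2}
```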
Look at the difference in responses between Qwen2.5-Math-1.5B and DeepSeek-R1-Distill-Qwen-1.5B:
Figure 1: OPEA ChatQnA running Qwen2.5-Math-1.5B gives the most straightforward response and answer.
Figure 2: OPEA ChatQnA running DeepSeek-R1-Distill-Qwen-1.5B provides the entire thought process to arrive at the correct answer, imitating human reasoning.
The original Qwen model just gives a straightforward answer that can feel robotic. In contrast, DeepSeek-R1-Distill-Qwen-1.5B explains everything in a more human-like way, similar to how a teacher would walk students through the solution. The experience for the end user can be completely different with the DeepSeek-R1-Distill model.
Building an entire pipeline from scratch is time-consuming and costly. The OPEA ChatQnA example integrates all the building blocks needed to build and deploy a retrieval augmented generation (RAG) based chatbot. To run with DeepSeek-R1 distill models, all it takes is changing a single environment variable to specify the corresponding Hugging Face model card. This guide shows you how you can run DeepSeek yourself in minutes on an AWS instance with OPEA.
OPEA’s Purpose
Let’s say you want to build a generative AI (GenAI) application but don’t know where to start. Or perhaps you do, but there are many components to wire together: vector databases, embeddings, retrieval and ranking, LLM inference, and more. Implementing these components from scratch and getting them to work as a system amounts to reinventing the wheel. That is where OPEA comes in: to simplify the development, production, and adoption of GenAI applications for enterprises.
OPEA is a Linux Foundation project that provides an open source framework of building blocks called microservices for developers to incorporate into their GenAI applications. Full examples of GenAI applications using these microservices are also available, including AgentQnA, ChatQnA, Code Generation, Document Summarization, Text2Image, and more. OPEA has about 50 partners, all experts in specific areas, contributing to the open source project. By building your application with OPEA, you are leveraging the work of these experts without having to start from scratch, and you can get something running in minutes.
Running an LLM can be as simple as setting up and deploying the ChatQnA example with Docker*. As of the most recent 1.2 release, the default LLM is meta-llama/Meta-Llama-3-8B-Instruct. To run any of the DeepSeek-R1-Distill models instead, just change one environment variable. Let’s see how you can run ChatQnA on your AWS instance.
Setting Up the OPEA ChatQnA Example on AWS
The OPEA documentation contains a getting started guide that covers running OPEA on-premises and on various cloud service providers, including AWS. It includes instructions for launching your AWS instance and recommends an m7i.4xlarge instance or larger. This instance has a 4th Gen Intel® Xeon® Scalable processor with 16 vCPUs and 64 GB of memory. You can use larger instances with more cores, or multiple nodes, to improve throughput and latency and to gain more memory for larger models.
After logging on to your instance, the first step is to install Docker Engine. You can follow the link or download and run the install_docker.sh script using the commands below:
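The exact location of install_docker.sh is given in the OPEA guide linked above; as a sketch, Docker’s official convenience script accomplishes the same thing on a typical Linux AWS instance:

```bash
# Install Docker Engine using Docker's official convenience script
# (equivalent to the install_docker.sh script referenced in the OPEA guide)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
docker --version   # verify the installation
```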
You will also need to configure Docker to run as a non-root user with these instructions.
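The standard Docker post-installation steps look like this (log out and back in, or run newgrp, for the group change to take effect):

```bash
# Allow running docker without sudo
sudo groupadd docker             # the group may already exist
sudo usermod -aG docker $USER
newgrp docker                    # or log out and log back in
docker run hello-world           # verify non-root access works
```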
Next, follow the sample guide for ChatQnA on Xeon. It will walk you through how to download the GenAIComps and GenAIExamples GitHub* repositories and what commands to run in your AWS instance. For simplicity, only the minimum steps will be shown below.
Clone the GenAIExamples GitHub repository for a specific release version of OPEA. This example shows 1.2.
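A minimal sketch of the clone step, assuming the 1.2 release is tagged v1.2 in the repository:

```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples
git checkout v1.2   # check out the 1.2 release tag (tag name assumed)
```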
Set up additional environment variables such as your Hugging Face token, IP address, and proxy settings if needed.
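The variable names below follow the OPEA 1.2 sample guide; double-check them against the guide for your release, and substitute your own values:

```bash
export HUGGINGFACEHUB_API_TOKEN="<your Hugging Face token>"
export host_ip=$(hostname -I | awk '{print $1}')   # the instance's IP address
# Only needed if your instance sits behind a proxy
export http_proxy="<your proxy>"
export https_proxy="<your proxy>"
export no_proxy="localhost,127.0.0.1,${host_ip}"
```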
Set up the ChatQnA specific environment. Navigate to the docker compose directory for Intel Xeon CPUs:
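In the 1.2 repository layout, the Xeon compose files live under the path below (this assumes you are in the GenAIExamples root):

```bash
cd ChatQnA/docker_compose/intel/cpu/xeon/
```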
In the set_env.sh script, you’ll need to set LLM_MODEL_ID to the Hugging Face model card corresponding to the DeepSeek-R1-Distill model you wish to work with. Here, we will set it to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. You can also print out LLM_MODEL_ID to confirm it is set properly.
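The relevant line in set_env.sh then looks like this:

```bash
# In set_env.sh: point the LLM microservice at the DeepSeek-R1-Distill model
export LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
```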
Figure 3: ChatQnA’s set_env.sh for Intel Xeon CPUs as of OPEA release 1.2, modified to run DeepSeek-R1-Distill-Qwen-1.5B
When finished, run set_env.sh:
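Source the script so the variables persist in your shell, then confirm the model ID:

```bash
source set_env.sh
echo $LLM_MODEL_ID   # should print deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```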
The final step is to run docker compose to start all the microservices and the ChatQnA megaservice. The default compose.yaml file uses vLLM as the inference serving engine for the LLM microservice. If this is your first time running docker compose, it will take extra time to pull the Docker images from Docker Hub.
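From the Xeon compose directory, this is the standard Docker Compose invocation:

```bash
docker compose -f compose.yaml up -d
```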
You can use docker ps to see all the containers that are running; it will take a few minutes for everything to come up. Setup is complete when the output shows vllm-service with a “healthy” status, or when docker logs vllm-service prints the message “Application startup complete.”
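Both checks look like this (the container name vllm-service comes from the default compose file):

```bash
docker ps                                                   # watch for a "healthy" status
docker logs vllm-service 2>&1 | grep "Application startup complete"
```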
Figure 4: docker ps showing the docker containers starting up. The vllm-service has a health status of “starting.”
Figure 5: Confirmation that the vllm-service has completed setup.
Interacting with ChatQnA
Now you have ChatQnA up and running! It is time to test out the user interface (UI). Open a web browser and type in the public IP address of your AWS instance. You should see the UI looking something like this:
Figure 6: ChatQnA UI
Because the Qwen model was originally trained on math, science, coding, and general knowledge problems, stick to prompts in these areas for the best and most accurate results. Try asking it math problems involving quadratic equations, comparing numbers, arithmetic, calculus, physics, chemistry...you name it!
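If you prefer the command line over the UI, you can also send a prompt straight to the ChatQnA megaservice. The port (8888) and route below follow the OPEA ChatQnA documentation and may differ if you have customized the deployment:

```bash
# Query the ChatQnA megaservice directly (port and route assumed from the OPEA docs)
curl http://${host_ip}:8888/v1/chatqna \
    -H "Content-Type: application/json" \
    -d '{"messages": "Find the solutions to x^2 + 3x - 9 = 0. Separate the reasoning and the final answer."}'
```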
The key differentiator of the DeepSeek-R1-Distill models over the original open source models is the way they show the chain-of-thought process before arriving at the answer. They may also provide alternative solutions or perspectives to answer the question thoroughly, imitating how humans solve similar problems.
You can also upload a file or set of files with knowledge the model might not know, which it can retrieve answers from. Just click on the upload icon in the upper right corner of the UI to upload. It will create embeddings out of those files and store them in a vector database. Next time you ask the chatbot a question, it will use RAG to leverage the uploaded data in the final response.
When you are finished and want to stop your ChatQnA session, you can do so with this command in your current directory:
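From the same Xeon compose directory you used to start the services:

```bash
docker compose -f compose.yaml down
```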
Key Takeaways and Final Remarks
Using OPEA microservices and examples, you can get the small, low-cost DeepSeek-R1-Distill models running in an application on AWS with Intel® Xeon® processors in just minutes, without writing any new code. Once it’s running, OPEA lets you build on the application and customize it to suit your GenAI needs. These smaller models may also be more practical for your enterprise: they require less memory and power, so Intel® Xeon® CPUs may be a cost-effective platform to get the job done.
After this tutorial, there are more ways to improve performance and work with OPEA to best address your requirements:
- Increase the performance of ChatQnA with DeepSeek-R1-Distill models by using a larger AWS instance or multiple nodes.
- Determine the smallest and lowest cost AWS instance with Intel® Xeon® CPUs that runs ChatQnA with your model of choice while meeting your requirements.
- Change the model to other DeepSeek-R1-Distill models, including versions fine-tuned from Llama 3.1 and Qwen 2.5.
- Run other GenAI examples from OPEA for your GenAI application, or build on top of them, including AgentQnA, AudioQnA, CodeGen, CodeTrans, DocSum, VideoQnA, and more.
Resources
For more information about AWS instances powered by Intel® Xeon® Scalable processors, OPEA, and the DeepSeek-R1 models you can run on these hardware and software platforms, check out the links below.
- opea.dev: Open Platform for Enterprise AI main site
- Amazon EC2 Instance Types
- OPEA Projects and Source Code
- OPEA Documentation
- DeepSeek-R1
- Resources for getting started with GenAI development
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
About the Author
Alex Sin, AI Software Solutions Engineer, Intel
Alex Sin is an AI software solutions engineer at Intel who consults with and enables customers, partners, and developers to build generative AI applications using Intel software solutions on Intel® Xeon® Scalable processors and Intel® Gaudi® AI accelerators. He is familiar with Intel-optimized PyTorch, DeepSpeed, and the Open Platform for Enterprise AI (OPEA), which reduce time-to-solution for popular generative AI tasks. Alex also delivers Gaudi workshops, hosts webinars, and shows demos to beginner and advanced developers alike on how to get started with generative AI and large language models. Alex holds bachelor’s and master’s degrees in electrical engineering from UCLA, where his background was in embedded systems and deep learning.