Introduction
Imagine building lightweight AI-powered applications that run entirely in a web browser on an AI PC. These apps reduce server costs, enhance privacy, and even work offline. From real-time text summarization and chatbots to image enhancement tools, voice assistants, and AI-powered note-taking apps, in-browser AI development is opening exciting new possibilities.
While full-scale models like ChatGPT still require substantial resources, advances in hardware, browser capabilities, and the JavaScript ecosystem now make it possible to run smaller AI models efficiently in the browser. In this article, we’ll explore JavaScript frameworks that enable in-browser LLM inference, discuss their advantages, and highlight key considerations for developers looking to harness the power of AI PCs.
The Hidden Complexity Behind In-Browser AI
Before we dive into the available frameworks, it's important to understand the underlying layers that make in-browser LLMs possible. Developers may not need to interact with these layers directly, but having a high-level understanding can help when optimizing performance, troubleshooting issues, and deciding which framework to use.
Let’s discuss each of these layers one-by-one, starting from the bottom:
1. Hardware (GPU/CPU/NPU): Efficient AI computation relies on specialized hardware. GPUs handle general-purpose parallel AI workloads, while NPUs (Neural Processing Units) are purpose-built for AI inference in power-constrained devices such as AI PCs, where power efficiency and performance optimization are crucial.
2. OS Native APIs (Windows/macOS/Linux/iOS/Android): The OS manages hardware resources and provides native APIs that browsers leverage to interact with the hardware.
3. Browser APIs (WebGL/WebGPU/WebNN/WebAssembly): These technologies enable efficient hardware acceleration without requiring platform-specific code, ensuring compatibility across devices.
- WebGPU: A modern web API that provides low-level access to the GPU for high-performance computation and graphics rendering. As the successor to WebGL, it adds support for general-purpose GPU (GPGPU) tasks and reduces CPU overhead by offloading expensive computations to the GPU. Check supported browsers: WebGPU Compatibility. WebGPU is often not enabled by default, so you may need to turn it on via browser-specific feature flags. Here’s how to enable it on popular browsers:
  - Chromium: Navigate to chrome://flags, search for "Unsafe WebGPU", and enable it.
  - Firefox: Go to about:config, search for "dom.webgpu.enabled" and "gfx.webgpu.force-enabled", then set both to true.
After enabling these flags, you may need to restart your browser for the changes to take effect.
- WebNN: A JavaScript API designed to enable efficient neural network inference directly in web browsers. It provides a unified abstraction for running machine learning models on underlying hardware such as CPUs, GPUs, or NPUs, enabling near-native performance for AI inference in web apps. Detailed steps for enabling WebNN on Windows devices with Intel® Core™ Ultra processors can be found here.
- WebAssembly: A binary instruction format that enables near-native performance for code written in C, C++, or Rust within the browser.
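Before reaching for a framework, it can be handy to check which of these APIs the current browser actually exposes. Here’s a minimal feature-detection sketch; `navigator.gpu` and `navigator.ml` are the entry points for WebGPU and WebNN, and are only present in browsers that ship those APIs:

```javascript
// Minimal feature detection for the browser APIs discussed above.
async function detectBackends() {
  const support = {
    webgpu: false,
    webnn: "ml" in navigator,              // WebNN entry point
    wasm: typeof WebAssembly === "object", // WebAssembly support
  };
  if ("gpu" in navigator) {
    // requestAdapter() resolves to null if no suitable GPU is available.
    const adapter = await navigator.gpu.requestAdapter();
    support.webgpu = adapter !== null;
  }
  return support;
}

detectBackends().then((support) => console.log(support));
```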
Most JavaScript frameworks are built on top of one or more of these browser technologies, providing easy-to-use interfaces that abstract away low-level optimizations. This allows developers to focus on building AI-powered applications rather than dealing with device or OS compatibility. Check out this link to track browser support for these technologies.
JavaScript Frameworks for In-Browser LLMs
Equipped with this knowledge of the underlying layers, let’s explore the JavaScript frameworks available for running LLMs in the browser. These frameworks simplify application development by abstracting away the browser APIs and exposing user-friendly interfaces.
WebLLM
A high-performance, open-source framework designed for in-browser LLM inference. It offers:
- WebGPU & WebAssembly Acceleration: Combines WebGPU for efficient local GPU acceleration with WebAssembly (Wasm) for high-performance CPU computation, enabling seamless AI model execution in the browser.
- OpenAI-style API: Easy to integrate into existing applications.
- Model Support: Works with Llama 3, Phi 3, Gemma, Mistral, Qwen, DeepSeek, etc.
- Structured Generation: Supports JSON mode for structured output.
- Web Worker Integration: Offloads computations to separate threads for smooth UI interactions.
Here’s a code snippet showing just how easy it is to use WebLLM:
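Below is a minimal sketch. The model ID is illustrative (one of WebLLM’s prebuilt models; check the docs for the current list):

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Download and compile the model (cached by the browser after the first run).
// The model ID below is illustrative; see the WebLLM docs for supported IDs.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (progress) => console.log(progress.text),
});

// OpenAI-style chat completion, running entirely in the browser.
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Why does in-browser inference matter?" },
  ],
});
console.log(reply.choices[0].message.content);
```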
The above code snippet demonstrates how closely WebLLM mimics the OpenAI-style API. This makes it incredibly easy to switch over and use WebLLM as a replacement for OpenAI API calls.
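Streaming works the same way as with the OpenAI client. A brief sketch, reusing the `engine` from the snippet above:

```javascript
// Streaming chat completion: chunks arrive as an async iterable,
// mirroring the OpenAI JS client's streaming interface.
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Summarize WebGPU in two sentences." }],
  stream: true,
});

let text = "";
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content ?? "";
}
console.log(text);
```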
To get the full experience and see everything WebLLM is capable of, try out WebLLM Chat; to start building, check out the docs.
Transformers.js
Transformers.js is another powerful framework, optimized for running transformer-based models in the browser and built on top of ONNX Runtime Web. It provides:
- WebGPU Acceleration: Optimized for running AI models efficiently with WebGPU. It can also make use of other backends like WebNN and WASM.
- ONNX Model Support: Allows execution of ONNX-formatted models, a widely used standard in machine learning.
- Specialized for Transformer Models: Designed specifically for architectures underlying modern LLMs.
Here’s a small code snippet demonstrating text generation:
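The following is a minimal sketch, assuming Transformers.js v3 (`@huggingface/transformers`); the model ID and the WebGPU device option are illustrative:

```javascript
import { pipeline } from "@huggingface/transformers";

// Create a text-generation pipeline; the model is downloaded and cached
// by the browser on first use. Model ID and device here are illustrative.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct",
  { device: "webgpu" }
);

const messages = [
  { role: "user", content: "Explain WebGPU in one sentence." },
];

const output = await generator(messages, { max_new_tokens: 64 });
// The model's reply is appended as the last message in generated_text.
console.log(output[0].generated_text.at(-1).content);
```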
As you can see, this closely resembles the Hugging Face Transformers Python API, and that’s intentional: it lets you tackle a wide range of tasks across NLP, Vision, Audio, and Multimodal domains. You can explore the full list of available tasks here. To get started, check out the Transformers.js docs.
Note: You’ll need to build some custom tooling to replicate the full experience of the OpenAI JS client.
ONNX Runtime Web
ONNX Runtime Web is a versatile framework that supports multiple backends, including WebGPU, WebGL, WebNN and WebAssembly, making it suitable for running pre-trained ONNX models in-browser.
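Here’s a rough sketch of the core workflow; the model path, input shape, and dtype are placeholders for whatever model you load:

```javascript
import * as ort from "onnxruntime-web";

// Create a session; execution providers are tried in order, so the runtime
// falls back from WebGPU to WebAssembly when WebGPU is unavailable.
const session = await ort.InferenceSession.create("model.onnx", {
  executionProviders: ["webgpu", "wasm"],
});

// Input names, shapes, and dtypes must match the model's signature;
// the values below are placeholders for a 224x224 RGB image model.
const data = new Float32Array(1 * 3 * 224 * 224);
const input = new ort.Tensor("float32", data, [1, 3, 224, 224]);

const results = await session.run({ [session.inputNames[0]]: input });
console.log(results);
```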
Some Other Frameworks
TensorFlow.js
TensorFlow.js is a well-established machine learning library that supports in-browser model training and inference. While it can run some BERT-based models, it’s not ideal for LLMs, as its ecosystem is geared toward general-purpose machine learning rather than large generative models.
MediaPipe LLM Inference
MediaPipe LLM Inference is the new kid on the block. It’s experimental and still under active development, but worth keeping an eye on.
Choosing the Right Framework and Browser Settings
Now that we've explored browser capabilities and various JavaScript frameworks, here are the key points to keep in mind when selecting the right tools for your next project:
Browser and Device-Specific Considerations:
- Target Chromium-based Browsers: These browsers offer the best support for WebGPU and WebNN, which are essential for AI workloads.
- Intel® AI PCs (Core Ultra + Windows): These PCs are powered by Intel® Core™ Ultra processors, which combine a CPU, a GPU, and an NPU (Neural Processing Unit) to handle AI workloads more efficiently. If you're using one of Intel's latest AI PC laptops, make sure to enable the WebNN and WebGPU features in your browser. Frameworks like ONNX Runtime Web and Transformers.js can leverage the WebNN API to fully utilize the NPU for optimal performance (see the sketch below).
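As a quick check, you can ask WebNN for an NPU-backed context directly. A hedged sketch: the `deviceType` option is part of the WebNN spec, but actual NPU support depends on the browser build and driver stack:

```javascript
// Try to create a WebNN context backed by the NPU, falling back gracefully.
// "npu" as a deviceType is specified by WebNN but not universally supported.
try {
  const context = await navigator.ml.createContext({ deviceType: "npu" });
  console.log("WebNN NPU context created:", context);
} catch (err) {
  console.warn("NPU context unavailable; falling back to GPU/CPU:", err);
}
```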
JavaScript Frameworks:
- WebLLM and Transformers.js: These are the best starting points for most developers due to their ease of use and optimizations for Large Language Models (LLMs).
- WebLLM: Designed specifically for LLM inference, it uses WebGPU for acceleration but currently cannot utilize NPUs.
- Transformers.js: Best for NLP and vision models.
- ONNX Runtime Web: This framework supports a wide range of ML models, providing more flexibility and customization. However, it can be verbose and requires you to implement or define many components. If you need flexibility, it's worth exploring, but be prepared for added complexity.
Here's a comparison table summarizing the key features of WebLLM, Transformers.js, and ONNX Runtime Web:
| Feature | WebLLM | Transformers.js | ONNX Runtime Web |
| --- | --- | --- | --- |
| Primary Focus | Large Language Model (LLM) inference in the browser | Natural Language Processing (NLP) and vision models in the browser | Wide range of machine learning models in the browser |
| Ease of Use | High; designed for seamless integration with an OpenAI-style API | High; closely mirrors Hugging Face's Python library | Moderate; requires more setup and understanding of model components |
| Hardware Acceleration | WebGPU for GPU acceleration; does not support NPUs | WebGPU for GPU acceleration, plus other backends such as WebNN and Wasm; supports NPUs | WebGPU, WebGL, WebNN, and WebAssembly; supports NPUs |
The Big Wins: Why In-Browser AI Matters
The move toward in-browser AI is not just about technical feasibility; it represents a fundamental shift in how AI applications are developed and deployed. Some key benefits include:
- Privacy: User data stays on the device, minimizing security risks.
- Cost Savings: Eliminates the need for expensive cloud servers.
- Low Latency: AI models respond instantly, without relying on server round trips.
- Cross-Platform Compatibility: One codebase can work seamlessly across Windows, macOS, iOS, and Android.
Modern browsers are rapidly evolving to support AI workloads natively. With technologies like WebGPU and WebNN, developers now have direct access to hardware acceleration, making in-browser LLMs a viable and powerful alternative to server-based AI solutions.
Conclusion
Running LLMs in the browser is no longer just a futuristic idea—it’s a reality, thanks to advancements in hardware and browser technologies. Whether you’re building chatbots, AI assistants, or real-time text summarization tools, leveraging frameworks like WebLLM, Transformers.js, and ONNX Runtime Web can make development seamless and efficient. As AI PCs become more widespread, browser-based AI will only become more powerful, unlocking new possibilities for developers and users alike.
By knowing the available frameworks and how they interact with browser APIs, you can create high-performance AI applications that take full advantage of client-side processing.