Tags: vllm · ollama · inference · llm · performance · gpu · production

How We Achieved 10x LLM Throughput by Migrating from Ollama to vLLM

Mohammed M. Ahmed · 12 min read

If you are running local large language models in production and still using Ollama as your inference backend, this article might save you weeks of optimization work. We recently migrated a high-volume text processing pipeline from Ollama to vLLM and saw throughput jump from approximately 67 operations per minute to over 675 — a verified 10x improvement with no degradation in output quality.

Here is exactly what we changed, why it worked, and how you can replicate it.

Hardware Environment

All benchmarks were run on a single workstation — no multi-GPU, no cloud instances:

Component    Specification
GPU          NVIDIA RTX 3090 (24 GB VRAM)
CPU          Dual Intel Xeon Gold 6148 (40 cores / 80 threads)
RAM          256 GB DDR4 ECC
Platform     Dell Precision 7820 Tower
Inference    Single-GPU, local serving

This is a capable but not exotic setup. Many ML practitioners and small teams have access to equivalent hardware. The results here should be broadly reproducible on any 24 GB VRAM GPU with a modern multi-core CPU.

One telling metric: under Ollama, GPU utilization hovered around 35-45% even with concurrent workers — the scheduling overhead left the GPU idle between requests. Under vLLM with continuous batching, utilization climbed to a sustained 85-95%. The hardware did not change. The software did.

The Problem: Ollama Hits a Ceiling

Ollama is an excellent tool. It lowers the barrier to running open-weight models locally, and for prototyping or low-volume inference, it is hard to beat. But when your workload scales to tens of thousands of structured extraction and translation tasks against dense, domain-specific text, Ollama's architecture becomes the bottleneck.

Our pipeline processes large volumes of multilingual scholarly text — source material that requires structured field extraction, classification, and translation. Each document needs careful parsing: identifying names, relationships, categorical judgments, dates, and biographical metadata. At production scale, we were looking at processing 60,000 to 80,000 entries across multiple corpora.

With Ollama, we hit three fundamental constraints:

  1. Request-level scheduling. Ollama processes one request at a time per model instance. Even with async clients and concurrent workers, the GPU sits partially idle between requests as it waits for each completion before loading the next.

  2. No native batching. Each inference call is isolated. If you need to process 20 similar items, you make 20 separate API calls — each with its own prompt loading, tokenization, and generation overhead.

  3. Oversized models for the task. Ollama makes it easy to pull large models (we were running a 32B parameter model), but for structured extraction tasks, a well-quantized smaller model often matches quality while running significantly faster.

At roughly 67 processed items per minute, our full pipeline would have taken days to complete. For a dataset that needs periodic re-extraction as source material is corrected or expanded, that turnaround time is unacceptable.

Defining "operation": Each operation corresponds to one structured extraction unit — a single document parsed into typed fields. The average workload per unit is approximately 900 input tokens and 300 output tokens of structured JSON. All throughput figures in this article refer to completed extraction units per minute, not raw token counts.

The Solution: vLLM with Three Key Changes

The migration to vLLM was not simply swapping one inference server for another. The 10x improvement came from three architectural changes that vLLM enabled.

1. Continuous Batching at the Token Level

This is vLLM's defining advantage. Unlike Ollama's request-level scheduling — where each request occupies the GPU exclusively until completion — vLLM implements continuous batching. It schedules inference at the token level, interleaving generation across multiple concurrent requests.

The practical effect: as soon as one request completes, its slot in the batch is immediately filled by a waiting request, and at each decode step the GPU generates the next token for every in-flight request at once. GPU utilization goes from sporadic bursts to sustained throughput.

Under the hood, this is powered by vLLM's PagedAttention mechanism — an efficient KV cache management system that eliminates the memory fragmentation problem that plagues naive concurrent inference. Traditional inference servers allocate contiguous memory blocks for each request's key-value cache, which leads to fragmentation and wasted VRAM as requests of different lengths compete for space. PagedAttention instead manages KV cache in non-contiguous memory pages (analogous to virtual memory in operating systems), allowing near-optimal memory utilization even under highly variable concurrent loads.

We configured 10 concurrent async workers hitting the vLLM server simultaneously. Instead of queueing behind each other, all 10 requests were being served in overlapping fashion. The GPU stayed consistently saturated.
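The 10-worker pattern is plain asyncio: a semaphore caps how many requests are in flight while vLLM's scheduler interleaves them on the GPU. Below is a minimal sketch of that pattern; the client setup, model name, and `extract` helper in the comment are illustrative assumptions, not our exact production code.

```python
import asyncio

async def bounded_gather(coros, limit=10):
    """Run coroutines concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:
            return await coro

    # gather preserves input order, so results line up with inputs
    return await asyncio.gather(*(guarded(c) for c in coros))

# In the pipeline, each coroutine would be one vLLM request, roughly:
#
#   client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   async def extract(prompt):
#       resp = await client.chat.completions.create(
#           model="Qwen/Qwen2.5-14B-Instruct-AWQ",
#           messages=[{"role": "user", "content": prompt}],
#       )
#       return resp.choices[0].message.content
#
#   results = asyncio.run(bounded_gather([extract(p) for p in prompts], limit=10))
```

Unlike Ollama, where 10 workers queue behind a single-request server, vLLM serves all 10 in overlapping fashion, so raising the limit raises utilization rather than just the queue depth.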

2. Batch Prompting — 20 Items Per Call

vLLM's OpenAI-compatible API let us restructure our prompts to include multiple items in a single inference call. Instead of sending one document per request, we packed 20 items into a single numbered prompt and asked the model to return structured JSON with matched identifiers.

This is not a vLLM-specific feature — you could theoretically do this with any inference backend — but vLLM's stability and throughput under high-concurrency loads made it practical. With Ollama, large multi-item prompts under concurrent load frequently timed out or produced malformed responses.

The batch prompting alone reduced our API call volume by 20x, and combined with continuous batching, the effective throughput multiplied dramatically.
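The mechanics of batch prompting are simple: number the items on the way in, require an `id` on the way out, and re-match results to inputs so a dropped or malformed item can be retried individually. A minimal sketch, assuming a generic extraction instruction (our production prompt wording differs):

```python
import json

def build_batch_prompt(items):
    """Pack multiple documents into one numbered prompt."""
    numbered = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(items))
    return (
        "Extract the structured fields from each numbered document below. "
        'Return a JSON array of objects, each with an "id" matching the '
        "document number.\n\n" + numbered
    )

def parse_batch_response(raw, expected_ids):
    """Match results back to inputs by id; a None entry signals a retry."""
    records = {r["id"]: r for r in json.loads(raw)}
    return [records.get(i) for i in expected_ids]
```

Requiring matched identifiers is what makes the 20-per-call packing safe: if the model skips item 13, you know exactly which document to re-queue instead of discarding the whole batch.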

3. Right-Sized Quantized Model

We moved from a 32B parameter Q4-quantized model to a 14B parameter AWQ-quantized model (specifically, an AWQ variant of Qwen 2.5 Instruct). This was the least intuitive change, but the results speak for themselves.

For structured extraction tasks — where the model is parsing known field types from semi-structured text, not generating creative prose — a well-quantized 14B model matches the output quality of a 32B model. We validated this by running both models against the same test corpus and comparing extraction accuracy. The differences were negligible.
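The validation is easy to automate because the output space is constrained: compare extracted fields against a reference set and count exact matches. A minimal sketch of that comparison (the production check may use fuzzier matching for translated fields, so treat this as illustrative):

```python
def field_accuracy(predictions, gold):
    """Fraction of reference fields the model reproduced exactly.

    `predictions` and `gold` are parallel lists of dicts, one per document.
    """
    correct = total = 0
    for pred, ref in zip(predictions, gold):
        for field, value in ref.items():
            total += 1
            if pred.get(field) == value:
                correct += 1
    return correct / total if total else 0.0
```

Running both models over the same test corpus and comparing per-field scores like this is what let us downsize with confidence rather than on faith.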

The smaller model footprint meant:

  • Faster time-to-first-token
  • Lower VRAM usage, leaving headroom for larger batch sizes
  • Higher tokens-per-second throughput
  • The ability to serve more concurrent requests without OOM errors

The Numbers

Metric               Ollama (Before)                     vLLM (After)
Model                32B Q4-quantized (Ollama default)   14B AWQ quantized
API Style            Native Ollama API                   OpenAI-compatible
Items per call       1                                   20
Concurrent workers   6                                   10
Throughput           ~67/min                             ~675/min
Improvement          1x (baseline)                       ~10x

These are real production numbers from a live data pipeline, not synthetic benchmarks. The workload involves multilingual text extraction with structured JSON output, which is arguably more demanding than typical English-language tasks due to the complexity of the source language's morphology and the specialized vocabulary involved.

Why 10x? The Multiplicative Effect

The 10x improvement was not a single optimization — it was three independent multipliers compounding:

Effective throughput = Baseline × (Continuous batching factor)
                                × (Batch prompt factor)
                                × (Model efficiency factor)

                    ≈ 67/min  × ~2.5x  × ~3x  × ~1.3x
                    ≈ 650–700/min

  • Continuous batching (~2.5x): Saturating the GPU via token-level interleaving instead of request-level queueing, plus the memory efficiency of PagedAttention allowing more concurrent sequences.
  • Batch prompting (~3x): Packing 20 items per call amortizes prompt overhead and reduces per-item latency. The factor is less than 20x because generation time still scales with output length.
  • Model efficiency (~1.3x): The AWQ-quantized 14B model generates tokens faster than the 32B Q4-quantized model while producing equivalent structured output.

Each factor alone is modest. Together, they multiply to 10x. This is not magic — it is multiplicative architecture.

A note on latency vs. throughput: Under heavy batching, latency per individual request may increase slightly — a single item processed in isolation would be marginally faster on a quiet server. But total work completed per hour increases dramatically, which is the metric that matters for batch data pipelines. If your workload is latency-sensitive (real-time chat, interactive UIs), tune your concurrency and batch sizes accordingly.

Throughput at a Glance

Ollama   ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   ~67/min
vLLM     █████████████████████████████████████████░  ~675/min
         ─────────────────────────────────────────
         0           Operations per minute      700

Implementation Details

The API Migration Was Nearly Free

vLLM serves an OpenAI-compatible API out of the box. Our client code migrated from raw aiohttp calls against Ollama's /api/generate endpoint to the official AsyncOpenAI Python client pointing at localhost:8000/v1. The client code became cleaner and more maintainable in the process.

# Before: Ollama native API
import aiohttp

async with aiohttp.ClientSession() as session:
    response = await session.post(
        "http://localhost:11434/api/generate",
        # stream defaults to true; disable it to get one JSON body back
        json={"model": model, "prompt": prompt, "stream": False},
    )
    text = (await response.json())["response"]

# After: vLLM via OpenAI-compatible API
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = await client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
)
text = response.choices[0].message.content

This also means that if we ever need to fall back to a cloud-hosted OpenAI model for quality-critical paths, the client code requires zero changes — just swap the base URL and API key.

vLLM Server Setup

Starting the vLLM server is straightforward:

vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
    --quantization awq \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

The key parameters:

  • --quantization awq: Enables AWQ kernel optimizations for the quantized model weights.
  • --max-model-len: Set based on your longest expected prompt + completion. Larger values consume more memory.
  • --gpu-memory-utilization: How much VRAM vLLM is allowed to use. We set 0.90 to leave some headroom.

Preserving the Fallback Path

We kept the Ollama configuration and extractor code in place. The system supports pluggable inference backends — Ollama for quick local prototyping, vLLM for production throughput, and OpenAI API for cloud-based quality checks. The backend is selected via environment configuration, not code changes.
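Because all three backends speak (or can speak) the OpenAI wire format, backend selection reduces to picking a base URL and API key from the environment. A minimal sketch of that pattern; the `LLM_BACKEND` variable name and the URL table are illustrative assumptions, not our exact configuration:

```python
import os

# Illustrative backend table: Ollama also exposes an OpenAI-compatible
# endpoint under /v1, so the same client works against all three.
BACKENDS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "unused"},
    "vllm":   {"base_url": "http://localhost:8000/v1",  "api_key": "unused"},
    "openai": {"base_url": "https://api.openai.com/v1",
               "api_key": os.environ.get("OPENAI_API_KEY", "")},
}

def backend_config(name=None):
    """Resolve the inference backend from LLM_BACKEND, defaulting to vLLM."""
    name = name or os.environ.get("LLM_BACKEND", "vllm")
    if name not in BACKENDS:
        raise ValueError(f"Unknown backend: {name}")
    return BACKENDS[name]
```

The resolved dict feeds straight into the `AsyncOpenAI` constructor, which is what makes switching backends a deployment decision rather than a code change.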

Lessons Learned

1. Your Inference Backend Matters More Than Your Model Size

The industry obsesses over model size and benchmark scores. But in production systems, inference architecture and scheduling often matter more than incremental model quality improvements. A 14B model on the right infrastructure will outperform a 70B model on the wrong infrastructure — not in benchmark accuracy, but in the only metric that matters at scale: work completed per hour.

We spent weeks tuning prompts and experimenting with larger models when the real bottleneck was the inference server's scheduling architecture. A 14B model served by vLLM dramatically outperformed a 32B model served by Ollama — not because the model was better, but because the server was better at keeping the GPU busy.

2. Batch Prompting Is Underutilized

Most LLM application code sends one item per request because that is how chat interfaces work. But for data processing pipelines, packing multiple items into a single prompt — with clear numbering and structured output expectations — is a straightforward multiplier. It reduces network overhead, amortizes prompt loading costs, and plays well with continuous batching.

3. AWQ Quantization Is Production-Ready for Structured Tasks

There is still hesitation in the industry around quantized models. For open-ended generation, that caution may be warranted. But for structured extraction — where the model is selecting from a constrained output space and the task is well-defined — AWQ quantization delivers negligible quality loss with significant performance gains. Test it against your specific workload and let the data decide.

4. Ollama and vLLM Serve Different Needs

This is not an argument against Ollama. Ollama excels at what it was designed for: making local LLM usage accessible. If you are running a personal assistant, a chatbot prototype, or occasional ad-hoc queries, Ollama is the right tool. But if you are building a production data pipeline that needs to process tens of thousands of items with predictable throughput, vLLM is purpose-built for that workload.

It is worth noting that vLLM introduces more operational complexity than Ollama — model compatibility constraints, CLI configuration, quantization format requirements, and less forgiving error messages. Teams should weigh that added complexity against their throughput needs. For many use cases, Ollama's simplicity is the correct trade-off.

When Should You Make This Move?

Consider migrating from Ollama to vLLM if:

  • Your pipeline processes more than 1,000 items per run. Below that threshold, Ollama's simplicity may outweigh vLLM's throughput advantage.
  • You need predictable throughput for SLA-bound workloads. vLLM's continuous batching provides much more consistent performance under load.
  • You are running concurrent async workers. This is where vLLM's architecture shines. If your workload is single-threaded and sequential, the improvement will be more modest.
  • You are using structured output (JSON mode, function calling). vLLM's handling of constrained generation is mature and well-tested.
  • You have GPU headroom and want to maximize utilization. If your GPU is only hitting 30-40% utilization under Ollama, vLLM will put it to work.

Conclusion

The move from Ollama to vLLM was one of the highest-impact optimizations we have made in our ML infrastructure. A 10x throughput improvement — achieved through continuous batching, batch prompting, and right-sized model selection — turned a multi-day pipeline into one that completes in hours.

The migration itself was straightforward. vLLM's OpenAI-compatible API meant minimal client code changes. The quantized model delivered equivalent quality for our structured extraction tasks. And the architectural flexibility to swap backends via configuration means we are not locked into any single inference provider.

If you are building production LLM pipelines on local hardware, give vLLM serious consideration. The throughput gains are real, reproducible, and well worth the modest setup effort.


Mohammed M. Ahmed is the CEO of Green Olive Tech. He builds production AI systems for large-scale multilingual text processing and data extraction.

Have questions about local LLM inference optimization? Connect with me on LinkedIn.
