🧠 Running DeepSeek-OCR Locally with FlashAttention2 on RTX 3080 (No A100 Needed)

“You don’t need an A100 to run modern multimodal models — you just need persistence, BF16, and a bit of debugging.”


🚀 Why I Tried This

Most multimodal or OCR models assume you’re using a cloud A100 or H100 GPU.
But for developers like me running local experiments on consumer GPUs, that’s overkill — and expensive.

So I decided to run DeepSeek-OCR (a cutting-edge open-source OCR model) locally on my RTX 3080 (10 GB) using FlashAttention2 for acceleration.

It took some trial and error, but the final setup processes an 800×500 image in about 10 seconds, entirely offline.

This post covers what worked, what didn’t, and how to replicate it.


⚙️ Environment Setup

Hardware

  • GPU: NVIDIA GeForce RTX 3080 (10 GB VRAM)

  • OS: Windows 11 + WSL2 (Ubuntu 22.04)

  • Drivers: NVIDIA WSL-compatible drivers (CUDA 11.8 support)

Software Stack

conda create -n deepseek-ocr python=3.12 -y
conda activate deepseek-ocr

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.7.3 --no-build-isolation
pip install "accelerate>=0.26.0"   # quoted so the shell does not treat > as a redirect
pip install -r requirements.txt

Tip: Make sure nvidia-smi works in WSL and shows your GPU.


⚡ The Key Breakthroughs

1. FlashAttention2

Pass this argument to from_pretrained when loading the model:

_attn_implementation="flash_attention_2"

FlashAttention2 replaces PyTorch’s standard attention kernel with a tiled, on-chip computation.
Result: lower VRAM usage and faster inference. In my runs that meant 8–15 s per image, versus ~20 s with the SDPA fallback.
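
To make "tiled" a bit more concrete, here is a toy pure-Python sketch of the online-softmax trick that this style of kernel relies on. It is not the real CUDA kernel and not code from flash-attn; the function names are my own, and it handles a single query vector only. The point is that the keys and values can be visited one block at a time while keeping just a running max, a running denominator, and a running output, and the answer still matches ordinary attention.

```python
import math


def attention_reference(q, keys, values):
    """Ordinary attention for one query: softmax(q.k / sqrt(d)) applied to values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, computed block by block without storing all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

On a GPU, each block is small enough to stay in fast on-chip memory, which is where the VRAM and speed savings come from.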

2. device_map="cuda"

This tells Hugging Face Transformers to load weights directly onto the GPU, instead of loading on CPU first and transferring later.

Without it, model initialization takes forever and uses CPU RAM for minutes before moving to GPU.

3. torch_dtype=torch.bfloat16

This fixed the dreaded dtype mismatch error:

RuntimeError: masked_scatter_: expected self and source to have same dtypes but got Half and Float

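The error says a Half (FP16) tensor met a Float (FP32) tensor in the same operation. As far as I can tell, DeepSeek-OCR internally mixes FP32 (from the vision encoder) and FP16 (from the text head), and loading everything in bfloat16 removes the mismatch.
bfloat16 keeps the exponent range of FP32 in a 16-bit format, at the cost of some precision, and Ampere cards like the RTX 30 series support it natively.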
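
The range-versus-precision trade-off is easy to see without a GPU. The sketch below uses only the standard library; the helper names are mine, and the bfloat16 conversion truncates the low 16 bits of the float32 encoding, whereas real hardware typically rounds to nearest, so treat the exact digits as illustrative.

```python
import struct


def to_bfloat16(x):
    """Round-trip x through a bfloat16-like format.

    Keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of the
    float32 encoding (truncation, not round-to-nearest).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_float16(x):
    """Round-trip x through IEEE half precision; overflow becomes infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


if __name__ == "__main__":
    for value in (1.001, 70000.0, 1e38):
        print(value, "-> fp16:", to_float16(value), "| bf16:", to_bfloat16(value))
```

Large values survive in bfloat16 but overflow in FP16, while small differences near 1.0 survive in FP16 but get flattened in bfloat16.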


🧩 The Working Script

Here’s the minimal working example for a 3080:

from transformers import AutoModel, AutoTokenizer
import torch, os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
torch.backends.cudnn.benchmark = True

model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,      # ✅ use bfloat16
    device_map="cuda",               # ✅ load directly to GPU
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval()

p = next(model.parameters())
print("Loaded on:", p.device, "| dtype:", p.dtype)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = "ocr_test.png"
output_path = "./output"

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=512,
    image_size=512,
    crop_mode=False,
    save_results=True,
    test_compress=True,
)
print(res)

🧠 Debugging Notes

1. Checking GPU Availability

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# True NVIDIA GeForce RTX 3080

If you see False, reinstall PyTorch from the cu118 wheel index used in the setup section.
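
If you want a slightly fuller check than the one-liner, something like the sketch below could work. The function name and the dictionary keys are my own invention; the torch calls (torch.version.cuda, torch.cuda.is_available, torch.cuda.get_device_name, torch.cuda.is_bf16_supported) are standard PyTorch. It is written so that it still runs when torch or flash-attn is missing.

```python
import importlib.util
import shutil


def environment_report():
    """Collect the facts this setup depends on without crashing if a piece is missing."""
    report = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "accelerate_installed": importlib.util.find_spec("accelerate") is not None,
        "torch_version": None,
        "torch_cuda_build": None,
        "cuda_available": False,
        "gpu_name": None,
        "bf16_supported": False,
    }
    if report["torch_installed"]:
        try:
            import torch
        except ImportError:
            # torch is present on disk but failed to import (broken install).
            report["torch_installed"] = False
            return report
        report["torch_version"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        if torch.cuda.is_available():
            report["cuda_available"] = True
            report["gpu_name"] = torch.cuda.get_device_name(0)
            report["bf16_supported"] = bool(torch.cuda.is_bf16_supported())
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A torch_cuda_build of None is the usual sign that a CPU-only wheel got installed.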


2. Common Errors I Hit

| Error | Fix |
| --- | --- |
| masked_scatter_: expected self and source to have same dtypes | Use torch_dtype=torch.bfloat16 |
| ImportError: Accelerate not installed | pip install accelerate |
| CPU-only performance (10 mins per image) | Add device_map="cuda" |
| FlashAttention warning | Ensure you installed flash-attn==2.7.3 and loaded the model on GPU |
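
One way to avoid the FlashAttention warning on machines where flash-attn did not build is to pick the attention backend at load time. This is a sketch with hypothetical helper names. "sdpa" is the standard Transformers name for PyTorch's scaled-dot-product attention; I am assuming the model's remote code accepts it through the same _attn_implementation argument.

```python
import importlib.util


def pick_attn_implementation(flash_attn_installed=None):
    """Return "flash_attention_2" when flash-attn is importable, else "sdpa"."""
    if flash_attn_installed is None:
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if flash_attn_installed else "sdpa"


def build_load_kwargs(flash_attn_installed=None):
    """Keyword arguments for from_pretrained, minus torch_dtype (which needs torch)."""
    return {
        "device_map": "cuda",
        "_attn_implementation": pick_attn_implementation(flash_attn_installed),
        "trust_remote_code": True,
        "use_safetensors": True,
    }
```

Usage would look like AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, **build_load_kwargs()).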

🧪 Results

| Setting | Runtime | VRAM | Notes |
| --- | --- | --- | --- |
| CPU (no CUDA) | ~8–10 min | - | unusable |
| GPU FP16 (buggy) | crashed | 9 GB | dtype mismatch |
| GPU BF16 + FlashAttn2 | 8–15 s | 8.3 GB | ✅ stable |
| GPU SDPA fallback | ~20 s | 8 GB | still OK |

In my tests, the output quality looked indistinguishable from the A100 benchmarks.
Even large 1024×1024 documents process in under 25 seconds.
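
If you want to collect comparable numbers on your own card, a small timing harness is enough. This is a generic sketch, not the exact way the figures above were measured; time_call and its parameters are hypothetical names.

```python
import statistics
import time


def time_call(fn, repeats=3, warmup=1):
    """Run fn a few times and report wall-clock statistics in seconds.

    The warmup runs are discarded so one-time costs (kernel selection,
    caches) do not skew the result.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }
```

For example, time_call(lambda: model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=512, image_size=512, crop_mode=False)) wraps the inference call from the script. For peak VRAM, torch.cuda.max_memory_allocated() after a run gives a number to compare against nvidia-smi.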


🔍 Why This Matters

This setup shows that multimodal models like DeepSeek-OCR can run locally and efficiently on consumer GPUs.
That’s important for:

  • Offline document analysis

  • Privacy-sensitive OCR

  • Edge AI and indie devs building local ML apps

And it reinforces a bigger point:

Optimization and understanding the stack beat brute-force hardware every time.


🏁 Next Steps

You can build on this in several ways:

  • 🔄 Batch OCR loop: process entire folders in one go.

  • 📊 Benchmark FlashAttention vs SDPA vs Eager kernels.

  • 🌐 Wrap it in a Flask/Gradio app for quick uploads.

  • 🧩 Integrate into a local AI agent (e.g., your research assistant).
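
As a starting point for the batch idea, here is a minimal sketch of a folder loop. batch_ocr and IMAGE_SUFFIXES are hypothetical names, and the OCR call is passed in as a plain function so the loop itself does not depend on the model.

```python
from pathlib import Path

# File types to pick up; extend as needed (an assumption, not a model limit).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def batch_ocr(input_dir, ocr_fn, suffixes=IMAGE_SUFFIXES):
    """Run ocr_fn on every image in input_dir and return {filename: result}.

    A failure on one file is recorded as an error string so the rest of
    the folder still gets processed.
    """
    results = {}
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            try:
                results[path.name] = ocr_fn(str(path))
            except Exception as exc:
                results[path.name] = f"ERROR: {exc}"
    return results
```

With the model from the script loaded, ocr_fn could be lambda p: model.infer(tokenizer, prompt=prompt, image_file=p, output_path=output_path, base_size=512, image_size=512, crop_mode=False). You may want a separate output_path per image so saved results do not overwrite each other.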


💬 Final Thoughts

This little journey reminded me why local development still matters.
Getting this model to run wasn’t just about “making it work” — it was about understanding how these GPU kernels, data types, and frameworks all fit together.

If you’re tinkering with models like this:
👉 Try it locally first.
You’ll learn more in two hours debugging CUDA than in two weeks watching cloud logs.