🧠 Running DeepSeek-OCR Locally with FlashAttention2 on RTX 3080 (No A100 Needed)

“You don’t need an A100 to run modern multimodal models — you just need persistence, BF16, and a bit of debugging.”

🚀 Why I Tried This

Most multimodal or OCR models assume you’re using a cloud A100 or H100 GPU.
But for developers like me running local experiments on consumer GPUs, that’s overkill — and expensive.

So I decided to run DeepSeek-OCR (a cutting-edge open-source OCR model) locally on my RTX 3080 (10 GB) using FlashAttention2 for acceleration.

It took some trial and error, but the final setup processes an 800×500 image in about 10 seconds, entirely offline.

This post covers what worked, what didn’t, and how to replicate it.

⚙️ Environment Setup

Hardware

GPU: NVIDIA GeForce RTX 3080 (10 GB VRAM)
OS: Windows 11 + WSL2 (Ubuntu 22.04)
Drivers: NVIDIA WSL-compatible drivers (CUDA 11.8 support)

Software Stack

conda create -n deepseek-ocr python=3.12 -y
conda activate deepseek-ocr

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.7.3 --no-build-isolation
pip install "accelerate>=0.26.0"
pip install -r requirements.txt

✅ Tip: Make sure nvidia-smi works in WSL and shows your GPU.
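
If you would rather script that check than eyeball it, a minimal sketch might look like this. The helper name `list_gpus` is made up for illustration; it only assumes that `nvidia-smi` is on the PATH inside WSL and returns `None` when it is not.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per GPU, or None if nvidia-smi is unusable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # driver tools are not visible from this shell
    try:
        proc = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print("GPUs visible:", gpus if gpus else "none (check the WSL driver install)")
```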

⚡ The Key Breakthroughs

1. FlashAttention2

Pass this argument to from_pretrained when loading your model:

_attn_implementation="flash_attention_2"

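FlashAttention2 replaces PyTorch’s standard attention kernel with a tiled computation that keeps intermediate results in fast on-chip memory instead of materializing the full attention matrix in VRAM.
Result: lower VRAM usage and faster inference on consumer GPUs. In my runs that meant 8–15 s per image instead of ~20 s with the SDPA fallback (see Results).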
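
To make the "tiled" idea concrete, here is a toy pure-Python sketch for a single query vector. It is not the real CUDA kernel, and the function names are invented for this example. It only shows the running-max and running-sum bookkeeping that lets attention be computed one tile of keys at a time while matching the ordinary softmax result.

```python
import math


def _score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax and mix."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Same output, but only one tile of scores exists at any moment."""
    acc = [0.0] * len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(keys), tile):
        scores = [_score(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.2, -1.0, 0.5]
keys = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0], [3.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

The real kernel applies the same rescaling trick to whole blocks of queries and keys in GPU shared memory, which is where the VRAM savings come from.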

2. `device_map="cuda"`

This tells Hugging Face Transformers to load weights directly onto the GPU, instead of loading them on the CPU first and transferring them later.

Without it, model initialization is very slow: the weights sit in CPU RAM for minutes before they move to the GPU.

3. `torch_dtype=torch.bfloat16`

This fixed the dreaded dtype mismatch error:

RuntimeError: masked_scatter_: expected self and source to have same dtypes but got Half and Float

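DeepSeek-OCR internally mixes FP32 (from the vision encoder) and FP16 (from the text head), which is what triggers the mismatch. Loading the whole model in bfloat16 gives every tensor the same dtype.
bfloat16 keeps the exponent range of FP32 (with fewer mantissa bits, so less precision) at the speed of FP16, and RTX 30-series GPUs support it natively.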
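
The range-versus-precision trade-off is easy to see without a GPU. The sketch below uses only the standard library. One simplification to flag: `to_bf16` truncates the float32 bit pattern to its top 16 bits, whereas real hardware usually rounds to nearest, so the last digit can differ slightly from what PyTorch would give.

```python
import struct


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_fp16(x: float):
    """Round-trip through IEEE half precision; None means the value overflowed."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return None


for value in (0.1, 1000.5, 70000.0):
    print(value, "-> fp16:", to_fp16(value), "| bf16:", to_bf16(value))
```

Large values survive in bfloat16 but overflow FP16, while mid-sized values keep more digits in FP16. For inference, the wider range is usually the safer trade.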

🧩 The Working Script

Here’s the minimal working example for a 3080:

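import os

# Pin the process to the first GPU before any CUDA work starts.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoModel, AutoTokenizer

# Let cuDNN auto-tune convolution algorithms for the input shapes it sees.
torch.backends.cudnn.benchmark = True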

model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,      # ✅ use bfloat16
    device_map="cuda",               # ✅ load directly to GPU
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval()

p = next(model.parameters())
print("Loaded on:", p.device, "| dtype:", p.dtype)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = "ocr_test.png"
output_path = "./output"

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=512,
    image_size=512,
    crop_mode=False,
    save_results=True,
    test_compress=True,
)
print(res)

🧠 Debugging Notes

1. Checking GPU Availability

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# True NVIDIA GeForce RTX 3080

If you see False, reinstall PyTorch with CUDA (+cu118 build).
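
A wrong or CPU-only wheel is the usual culprit, so it can help to compare what is installed against the pins from the setup section. This is a small sketch using only `importlib.metadata`; the `PINNED` table and `check_versions` name are illustrative, and the versions are the ones listed above.

```python
from importlib import metadata

# Versions taken from the install commands in the setup section.
PINNED = {"torch": "2.6.0", "torchvision": "0.21.0", "flash-attn": "2.7.3"}


def check_versions(pinned=PINNED):
    """Map each package to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # PyTorch wheels carry a local tag such as "+cu118"; compare the base version only.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


for package, (found, ok) in check_versions().items():
    print(f"{package:12s} installed={found} {'OK' if ok else 'MISMATCH'}")
```

If `torch` shows up without a `+cu118` suffix, that is a hint the CPU-only build was installed.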

2. Common Errors I Hit

| Error | Fix |
| --- | --- |
| `masked_scatter_: expected self and source to have same dtypes` | Use `torch_dtype=torch.bfloat16` |
| `ImportError: Accelerate not installed` | `pip install accelerate` |
| CPU-only performance (10 mins per image) | Add `device_map="cuda"` |
| FlashAttention warning | Ensure you installed `flash-attn==2.7.3` and loaded the model on GPU |
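
For the FlashAttention case, one option is to pick the attention backend at runtime instead of hard-coding it. The helper below is a hypothetical sketch: it only checks whether the `flash_attn` package is importable and otherwise falls back to `"sdpa"`. That the model runs on SDPA at all is taken from the results in this post rather than from the model's documentation, so treat it as an assumption.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Prefer FlashAttention2 when the package is installed, else fall back to SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


choice = pick_attn_implementation()
print("attention backend:", choice)

# Usage with the loading call from the working script:
# model = AutoModel.from_pretrained(
#     model_name,
#     torch_dtype=torch.bfloat16,
#     device_map="cuda",
#     _attn_implementation=choice,
#     trust_remote_code=True,
#     use_safetensors=True,
# ).eval()
```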

🧪 Results

| Setting | Runtime | VRAM | Notes |
| --- | --- | --- | --- |
| CPU (no CUDA) | ~8–10 min | — | unusable |
| GPU FP16 (buggy) | crashed | 9 GB | dtype mismatch |
| GPU BF16 + FlashAttn2 | 8–15 s | 8.3 GB | ✅ stable |
| GPU SDPA fallback | ~20 s | 8 GB | still OK |

On my test documents, the output quality was indistinguishable from the A100 benchmark results.
Even large 1024×1024 documents process in under 25 seconds.
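
To reproduce timings like these on your own machine, a small wall-clock helper is enough. This is a sketch rather than the harness used for the table: `time_call` is a made-up name, and the optional `sync` hook exists because GPU work is asynchronous, so you would pass `torch.cuda.synchronize` when timing real inference.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1, sync=None):
    """Time fn() several times and return median/min/max seconds."""
    for _ in range(warmup):
        fn()  # the first call often pays one-off setup costs
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure earlier GPU work has finished
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this call's GPU work before stopping the clock
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Stand-in workload so the example runs anywhere.
print(time_call(lambda: sum(i * i for i in range(100_000)), repeats=3))

# With the model loaded as in the working script, something like:
# time_call(lambda: model.infer(tokenizer, prompt=prompt, image_file=image_file,
#                               output_path=output_path, base_size=512, image_size=512,
#                               crop_mode=False, save_results=True, test_compress=True),
#           sync=torch.cuda.synchronize)
```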

🔍 Why This Matters

This setup shows you can run multimodal models like DeepSeek-OCR locally and efficiently on consumer GPUs.
That’s important for:

Offline document analysis
Privacy-sensitive OCR
Edge AI and indie devs building local ML apps

And it reinforces a bigger point:

Optimization and understanding the stack beats brute-force hardware every time.

🏁 Next Steps

You can build on this in several ways:

🔄 Batch OCR loop: process entire folders in one go.
📊 Benchmark FlashAttention vs SDPA vs Eager kernels.
🌐 Wrap it in a Flask/Gradio app for quick uploads.
🧩 Integrate into a local AI agent (e.g., your research assistant).
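
As a starting point for the batch OCR idea, here is a minimal sketch of the folder loop. The names `find_images` and `run_batch` are hypothetical, and the actual OCR call is passed in as a function so the loop can be tried without a GPU. One failed image is recorded as an error string instead of stopping the whole run.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_images(folder):
    """Return image files in a folder (non-recursive), sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def run_batch(folder, ocr_one, output_root="./output"):
    """Call ocr_one(image_file, output_path) per image; collect results by file name."""
    results = {}
    for image_path in find_images(folder):
        out_dir = Path(output_root) / image_path.stem  # one output folder per image
        out_dir.mkdir(parents=True, exist_ok=True)
        try:
            results[image_path.name] = ocr_one(str(image_path), str(out_dir))
        except Exception as exc:  # keep going if a single image fails
            results[image_path.name] = f"ERROR: {exc}"
    return results


# With model, tokenizer and prompt from the working script, ocr_one could be:
# def ocr_one(image_file, output_path):
#     return model.infer(tokenizer, prompt=prompt, image_file=image_file,
#                        output_path=output_path, base_size=512, image_size=512,
#                        crop_mode=False, save_results=True, test_compress=True)
# results = run_batch("./scans", ocr_one)
```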

💬 Final Thoughts

This little journey reminded me why local development still matters.
Getting this model to run wasn’t just about “making it work” — it was about understanding how these GPU kernels, data types, and frameworks all fit together.

If you’re tinkering with models like this:
👉 Try it locally first.
You’ll learn more in two hours debugging CUDA than in two weeks watching cloud logs.
