Stop Wrestling SDKs: A Cheat Sheet for Unified LLM APIs
After spending way too much time untangling SDK quirks, I created a cheat sheet comparing three unified APIs so you don’t have to.
Modernizing marimo’s AI system wasn’t supposed to be simple, but I didn’t expect SDK quirks to eat most of my time. Every new feature meant wrestling with a different SDK, each with its own rules and response formats. Take reasoning tokens as an example: OpenAI’s SDK doesn’t always expose reasoning tokens (sometimes you only get a reasoning summary), Anthropic’s reasoning mode has constraints around temperature that can clash with coding/data science best practices (which often use 0), and Gemini’s streaming format is different enough from OpenAI/Anthropic to make integration painful. Same feature, three providers, three headaches. After spending way too many hours untangling this, I realized the obvious truth:
One model? Stick to that provider’s SDK.
Multiple models? You need a unified API.
At marimo, an AI-native Python notebook (15k+★), every new feature touches multiple models, and until now we've been building our own SDK integrations. Lately we've been exploring a unified API to simplify this, so I went down the rabbit hole of what's out there. This post is the culmination of my research so far: a cheat sheet on the two main options right now (LiteLLM and LangChain) plus one strong up-and-comer (any-llm, backed by Mozilla AI). There are other options out there, but for our use case, these three made the shortlist.
Use this cheat sheet as a quick, practical rundown of how these APIs stack up on the things that actually matter when you're adding new features, which in marimo's case included: 1) streaming, 2) tool calling, 3) reasoning, 4) provider coverage and 5) reliability (timeouts, retries, rate limits). I included links where relevant.
If you find this post helpful, type your email and hit Subscribe. I’ll send the next installment straight to your inbox.
LiteLLM
What it is: An open-source SDK + deployable proxy (LLM gateway) by BerriAI that unifies many providers behind an OpenAI-compatible interface.
Primary differentiator: Breadth + ops in one package: 100+ model/provider coverage, a deployable proxy, routing/fallbacks, budgets and cost tracking.
1) Streaming
✅ Pros: SDK + Proxy support stream=True with examples and a helper to rebuild full text. Proxy is OpenAI-compatible so SSE clients work out of the box.
⚠️ Gotchas: Stream shape is normalized to OpenAI deltas so provider-specific nuances may be flattened. SSE via some proxies/CDNs may need tuning, but LiteLLM exposes logs to help debug.
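For reference, here's roughly what that looks like with the SDK. A minimal sketch, assuming litellm.completion and the stream_chunk_builder helper behave as documented (the model string is a placeholder):

```python
import litellm

# Stream deltas in the normalized OpenAI format, collecting chunks as we go.
chunks = []
response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",  # placeholder; any supported provider/model
    messages=[{"role": "user", "content": "Summarize this notebook cell."}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
    chunks.append(chunk)

# Rebuild a full response object from the streamed chunks.
full_response = litellm.stream_chunk_builder(chunks)
```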
2) Tool calling
✅ Pros: Supports function/tool calling, including parallel. Falls back to JSON-mode for providers without native tools (e.g., Ollama).
⚠️ Gotchas: Tool behavior depends on the provider. JSON-mode fallback loses vendor-specific semantics (e.g., tool deltas).
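A hedged sketch of tool calling through the same completion call, using the standard OpenAI tool schema (get_weather is a made-up example):

```python
import litellm

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)

# Tool calls come back in the normalized OpenAI shape regardless of provider.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```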
3) Reasoning
✅ Pros: Uses OpenAI-style inputs across providers with passthrough for vendor-specific reasoning options where supported.
⚠️ Gotchas: Advanced reasoning features aren’t always normalized. New vendor options may require provider-specific params and can lag in docs (OpenAI-first approach).
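Here's what that passthrough looks like in practice. A hedged sketch, assuming reasoning_effort and the normalized reasoning_content field are supported for your provider (check the docs first):

```python
import litellm

resp = litellm.completion(
    model="anthropic/claude-3-7-sonnet-20250219",  # placeholder reasoning-capable model
    messages=[{"role": "user", "content": "Plan a data-cleaning pipeline."}],
    reasoning_effort="low",  # mapped to the provider's thinking/reasoning settings where supported
)

msg = resp.choices[0].message
# Not every provider returns reasoning content, so treat it as optional.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```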
4) Provider coverage
✅ Pros: Very broad catalog (OpenAI, Azure, Anthropic, Google, Bedrock, Groq, Ollama, vLLM, etc.); “100+ LLMs” listed on its central providers page.
⚠️ Gotchas: Coverage depth varies by provider (endpoints/params); check docs before relying on niche features.
5) Reliability
✅ Pros: Built-in retries (num_retries), routing with load balancing/fallbacks, and rate-limit-aware handling (Redis-assisted). The proxy can also retry within a model group on rate limits.
⚠️ Gotchas: Some controls are beta or require env flags (e.g., multi-instance rate-limiting); expect config work in production.
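In code, that's num_retries on a single call plus a Router for load balancing and fallbacks. A sketch assuming the model-group config format from the LiteLLM docs:

```python
import litellm
from litellm import Router

# Per-call retries on transient errors.
resp = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
    num_retries=3,
)

# Router: balance within a model group and fall back to another group on failure.
router = Router(
    model_list=[
        {"model_name": "default", "litellm_params": {"model": "openai/gpt-4o"}},
        {"model_name": "backup", "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022"}},
    ],
    fallbacks=[{"default": ["backup"]}],
)
resp = router.completion(model="default", messages=[{"role": "user", "content": "ping"}])
```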
Bottom line: LiteLLM is the fastest path to broad coverage with real ops features (routing, budgets, spend tracking, logs). Trade-off: OpenAI-centric normalization flattens vendor-specific features like tool deltas and exact stream semantics.
LangChain
What it is: A framework for building LLM apps (Python/JS) with standard model/tool/data interfaces; part of a suite with LangGraph (orchestration) and LangSmith (observability/evals).
Primary differentiator: A full stack: LangChain (framework), LangGraph (orchestration) and LangSmith (observability/evals), not just a thin SDK.
1) Streaming
✅ Pros: Supported in chat models and Runnables (stream/astream); LangGraph adds run-level streaming events.
⚠️ Gotchas: Token-level streaming is provider-dependent; default adapters may return a single chunk if the backend doesn’t support it.
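Streaming is just the .stream()/.astream() iterator on any chat model. A minimal sketch with langchain-openai (swap in your provider's package):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model

# Each chunk is an AIMessageChunk; backends without token streaming may yield one big chunk.
for chunk in llm.stream("Explain what a marimo notebook is in one sentence."):
    print(chunk.content, end="", flush=True)
```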
2) Tool calling
✅ Pros: First-class APIs + tool abstractions; supports both model-generated args and user-controlled execution.
⚠️ Gotchas: Behavior varies by model/provider; validate tool args and handle providers without native support.
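The bind_tools flow looks roughly like this (get_weather is an illustrative stand-in; validating and executing the call is on you):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"Sunny in {city}"  # stand-in implementation

llm = ChatOpenAI(model="gpt-4o-mini").bind_tools([get_weather])
msg = llm.invoke("What's the weather in Berlin?")

# The model only proposes tool calls; running them and feeding results back is your job.
for call in msg.tool_calls:
    print(call["name"], call["args"])
```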
3) Reasoning
✅ Pros: Passes through provider-specific kwargs (OpenAI, Anthropic, Gemini, Bedrock) to enable reasoning when models support it.
⚠️ Gotchas: No unified reasoning API. Exposure and event shapes vary by model/provider.
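Since there's no unified reasoning API, you reach for the provider package's own knobs. A hedged sketch with langchain-anthropic's extended-thinking parameters; the exact kwarg names can differ across versions and providers:

```python
from langchain_anthropic import ChatAnthropic

# Anthropic's extended thinking is enabled via provider-specific kwargs;
# OpenAI o-series, Gemini and Bedrock each use their own parameters instead.
llm = ChatAnthropic(
    model="claude-3-7-sonnet-20250219",  # placeholder reasoning-capable model
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
)
msg = llm.invoke("Think through a migration plan for our SDK layer.")
print(msg.content)  # content blocks may include "thinking" segments, depending on provider
```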
4) Provider coverage
✅ Pros: Broad integrations (OpenAI, Anthropic, Google/Gemini, Bedrock, Groq, Mistral, Vertex, etc.) with per-provider docs.
⚠️ Gotchas: Depth is uneven. Niche features/params may lag or live only in specific provider packages; check docs.
5) Reliability
✅ Pros: Built-in retries (RunnableRetry, .with_retry), in-memory rate limiter, and backoff support; LangGraph adds per-node retry configs.
⚠️ Gotchas: Retries/limits are opt-in and per-component; some adapters have retry quirks, so expect backend-specific tuning.
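Both are opt-in and per-component; a sketch combining the in-memory rate limiter with .with_retry():

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# Client-side rate limiting: cap request rate before calls hit the provider.
rate_limiter = InMemoryRateLimiter(requests_per_second=2, max_bucket_size=4)
llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

# Opt-in retries with exponential backoff, applied per Runnable.
resilient_llm = llm.with_retry(stop_after_attempt=3, wait_exponential_jitter=True)
print(resilient_llm.invoke("ping").content)
```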
Bottom line: LangChain is a full ecosystem for agent workflows with broad integrations, mature tool calling and strong observability via LangSmith. If you don’t need all of that, it might be overkill. Trade-off: Heavy abstraction and less precision for vendor-specific features (reasoning, retries).
any-llm
What it is: A Python library from Mozilla AI that unifies multiple LLM providers behind one interface.
Primary differentiator: Uses official provider SDKs under the hood and offers both a Completions API and an (experimental) OpenAI-style Responses API.
1) Streaming
✅ Pros: Completions support stream=True (sync/async) with clear docs + a working public demo.
⚠️ Gotchas: Responses API streaming is experimental, inconsistent across providers, and may raise NotImplementedError.
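A hedged sketch of streaming Completions; I'm assuming the "provider/model" string format from the any-llm README and OpenAI-style chunk objects, so double-check the docs for your version:

```python
from any_llm import completion  # assumes the top-level Completions API from the any-llm docs

# Completions streaming; chunks are assumed to follow the OpenAI-style delta shape.
for chunk in completion(
    model="openai/gpt-4o-mini",  # "<provider>/<model>" format, per the any-llm docs
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```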
2) Tool calling
✅ Pros: Tools supported on Completions + Responses; accepts Python callables (with docstrings + type hints) or OpenAI-style dicts.
⚠️ Gotchas: Multi-tool behavior depends on the provider; any-llm won’t add features that aren’t there.
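The callable-as-tool convenience looks roughly like this. A sketch assuming any-llm derives the schema from the type hints and docstring, as its docs describe (get_weather is made up):

```python
from any_llm import completion

def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"Sunny in {city}"  # stand-in implementation

# any-llm builds the tool schema from the signature + docstring;
# OpenAI-style tool dicts are also accepted.
resp = completion(
    model="anthropic/claude-3-5-sonnet-20241022",  # placeholder model
    messages=[{"role": "user", "content": "Weather in Berlin?"}],
    tools=[get_weather],
)
print(resp.choices[0].message.tool_calls)
```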
3) Reasoning
✅ Pros: Exposes reasoning_effort/reasoning config where supported; docs + demo show “thinking” content.
⚠️ Gotchas: Reasoning traces differ by provider and model. Check the provider matrix before relying on support.
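Where the provider supports it, reasoning is just another kwarg on the same call. A hedged sketch using the reasoning_effort parameter the docs mention:

```python
from any_llm import completion

resp = completion(
    model="openai/o4-mini",  # placeholder reasoning-capable model
    messages=[{"role": "user", "content": "Plan a refactor of our SDK layer."}],
    reasoning_effort="low",  # only honored by providers/models that support it
)
# How (and whether) "thinking" content is exposed varies by provider; see the capability matrix.
print(resp.choices[0].message.content)
```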
4) Provider coverage
✅ Pros: Growing list: OpenAI, Anthropic, Google, Mistral, Groq, Ollama, Azure OpenAI, etc., with a per-provider capability grid.
⚠️ Gotchas: Depth differs; some providers support Completions but not Responses or reasoning.
5) Reliability
✅ Pros: Pass timeouts + provider-specific kwargs (e.g., api_timeout, provider params) in one call.
⚠️ Gotchas: No cross-provider retries/backoff; behavior mirrors underlying SDKs (SDK-first approach). Responses API is “use with caution” (pre-GA).
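Timeouts and provider kwargs ride along on the call itself; a hedged sketch of the api_timeout knob, with retries left to you (e.g., wrap the call in tenacity):

```python
from any_llm import completion

# No unified retry/backoff layer: wrap calls yourself if you need one.
resp = completion(
    model="mistral/mistral-small-latest",  # placeholder model
    messages=[{"role": "user", "content": "ping"}],
    api_timeout=30,  # passed through to the underlying provider SDK
)
print(resp.choices[0].message.content)
```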
Bottom line: any-llm is a thin, SDK-first unifier. Strong for quick, multi-provider Completions (with streaming + tools). Trade-off: Responses API is early/experimental and reasoning/streaming support varies by provider. Check the matrix before you depend on it.
When to choose
Roll-your-own: If you want pixel-perfect streaming + day-0 vendor features with full control. Trade-off: Highest maintenance; you own the SDK quirks as well as retries, rate limits and churn.
LiteLLM: If you want the fastest breadth + ops (routing, retries, budgets, spend logs) on an OpenAI surface. Trade-off: Normalization flattens vendor-specific features (e.g., tool call semantics, reasoning traces and stream shapes).
any-llm: If you want a lightweight, SDK-first library that stays close to vendor behavior with the basics (completions, streaming, tools). Trade-off: Still early, so coverage is uneven and there's no built-in ops/telemetry.
LangChain: If you want a full ecosystem for agent workflows (tools, memory, retrieval) with LangGraph/LangSmith. Trade-off: Heavy abstraction; response and reasoning formats vary across models/providers, so you'll still need to handle different shapes for reasoning, structured output and messages.
Rule of thumb:
When you need exact streaming/trace fidelity → Roll-your-own
When you want to ship a multi-provider MVP fast with routing/cost controls → LiteLLM
When you prefer lightweight, no proxy and staying close to vendor SDKs → any-llm
When you’re building agentic apps with tools, memory, evals, and tracing → LangChain
If you found this post useful, share it with a friend and consider subscribing. I will be sharing more lessons from the trenches of open‑source, Gen AI, and MCP every week.

