ChatGPT, Claude, and Gemini Render Markdown in the Browser. I Do the Opposite
The big AI chat apps ship heavy rendering libraries to every device. Cheddy Chat renders markdown server-side and streams finished HTML, eliminating 160-440KB of client JavaScript while keeping the main thread free.
A guideline I follow is: if something can be done on the server rather than the client, you should probably do it on the server.
When I started Cheddy Chat, I knew I needed excellent markdown parsing. That means handling code blocks, tables, charts, and LaTeX, among other things. You can perform all these rendering gymnastics on the client, or you can do them on the server. Unlike all the major AI chat apps, I chose the latter.
Cheddy Chat’s front end never sees a markdown parser. The server renders everything and streams finished HTML over SSE. The client receives it and paints it to the screen. That’s the entire rendering pipeline on the frontend: receive HTML, render HTML. It’s a simple, fast, low-complexity architectural choice.
Here is a question most frontend developers rarely think to ask: why is this running in the browser? Not “how do I make this run faster” or “which library should I use.” Why is this code running on hardware I’ve never seen, competing for resources with 47 other tabs, on a device I can’t profile or reproduce bugs on?
ChatGPT, Claude.ai, and Gemini are all heavyweight, feature-rich web apps. They ship huge JS bundles because they do a lot of work in the browser, which is typical in the web app space: the client carries a large amount of code for rendering UI components, plus whatever libraries support that work. Even with SSR, where the page arrives as HTML, you still ship a client runtime (hydration/resume code) and component rendering code so the UI can rerender components as needed.
These chat apps also perform a very specific type of rendering in the browser: they render markdown (the language and format of LLMs) to HTML.
To accomplish this, the big three (and others) need to ship parsing libraries, syntax highlighting libraries, LaTeX renderers, and HTML sanitizers to every user’s device. This is all required to render a beautiful UI in the browser from markdown received from an LLM response.
Cheddy Chat’s SSE stream connects the browser directly to the API server because the data from an LLM originates there. Rendering at the source means the client never downloads a parser, never blocks the main thread on highlighting, and never ships all the custom rendering code that works in conjunction with the markdown rendering library.
I believe the biggest AI chat apps parse markdown in the browser because it’s conventional wisdom, not a deliberate architectural choice. (I could be wrong about this, but I doubt it.) I sidestep most of it. Users get strong rendering quality while their devices (and CPUs, and batteries, and memory) do far less work.
To be clear: this isn’t SSR in the Next.js sense. There’s no hydration step, no virtual DOM diffing, no framework re-attaching event listeners to server-rendered markup. This is plain hypermedia. The server streams finished HTML over SSE. The client appends it to the DOM. That’s the whole contract. If that sounds boring, that’s the point.
This post explains why I made that choice, what it actually costs on the server side, and where it hurts.
Main-thread preservation
Two principles drive this decision.
The first is to render where the stream terminates. My SSE stream connects the browser directly to the API server. The SSR/Node server (I’m using Marko Run) isn’t in that path. Rendering on the Node server would mean proxying every streaming chunk through it, adding a network hop per token. The real choice isn’t between “server vs client.” It’s between “render where the data originates vs ship rendering libraries to the browser.”
Here’s the data path:
```mermaid
flowchart LR
    Browser -->|SSE stream| FastAPI
    FastAPI -->|LLM tokens| LLM
    FastAPI -->|HTML via SSE| Browser
    Browser -->|PWA assets| MarkoServer
```
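The wire format itself is trivial. Here's a minimal sketch of the SSE framing step (the `html_chunk` event name and `sse_event` helper are my illustrations, not Cheddy Chat's actual protocol); a FastAPI handler would yield strings like these from a `StreamingResponse` with `media_type="text/event-stream"`:

```python
def sse_event(event: str, html: str) -> str:
    """Frame a rendered HTML fragment as a Server-Sent Event."""
    # SSE payloads must not contain raw newlines: split multi-line HTML
    # across consecutive "data:" lines, per the text/event-stream format.
    data_lines = "".join(f"data: {line}\n" for line in html.splitlines())
    return f"event: {event}\n{data_lines}\n"

frame = sse_event("html_chunk", "<p>Hello, <strong>world</strong></p>")
```

The client's side of the contract is symmetric: an `EventSource` listener appends each event's payload to the DOM, and that's the whole rendering pipeline.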
The second is to reserve the main thread for interaction. Markdown parsing, regex-heavy syntax highlighting, LaTeX equation rendering, and DOM sanitization are CPU-intensive. On the server, they run on hardware I control. On a mid-range Android phone with 30 tabs open, they compete with scrolling, typing, and touch handling. I chose to keep the main thread free for what browsers do best: painting HTML and responding to user input.
On the server, the environment is deterministic. Same runtime, same library versions, same output for every user. On the client, every variable is someone else’s: the hardware, the browser version, what else is running. The less parsing logic I put there, the fewer surprises I get back.
This is not “never use JavaScript”
I am a proud member of the Anti-JavaScript JavaScript club. I love JS, but I love good architecture more, and good architecture usually means not sending megabytes of data down the wire to render a page. There’s usually no real need to; it’s just bad practice dressed up as orthodoxy.
Cheddy Chat’s frontend is a Marko Run PWA so there’s plenty of client-side JavaScript. But every byte earns its place. Resumability, optimistic UI updates, IndexedDB reads via Dexie.js, SSE event handling. That’s JavaScript doing what JavaScript does best.
I blend SPA and hypermedia techniques, using each where it fits. Local-first data and optimistic UI are pure SPA. Streaming pre-rendered HTML via SSE is a hypermedia pattern where the server sends ready-to-display markup and the client paints it. I’m not in either camp. I pick the pattern that fits the job.
Marko Run is a fullstack framework with SSR, so I could render chat HTML on the Marko server. But the SSE stream connects the browser directly to the FastAPI server, and that’s where LLM tokens arrive. Rendering on the Marko server would mean proxying every chunk through it just to use its rendering pipeline. Using a fullstack framework doesn’t mean every piece of HTML has to come from that framework. Sometimes the right hypermedia server is your API.
The argument isn’t anti-JS. It’s anti-waste. Markdown parsing, syntax highlighting, and sanitization produce the same output regardless of who runs them. So I run them once, on hardware I control, and stream the result.
What I moved off the client
| Library | What it does | Gzipped |
|---|---|---|
| markdown-it | Markdown parsing | 50.7KB |
| highlight.js (15 common langs) | Syntax highlighting | 26.3KB |
| highlight.js (all 190 langs) | Syntax highlighting | 304.7KB |
| KaTeX | LaTeX rendering | 75.3KB |
| DOMPurify | HTML sanitization | 8.3KB |
| Conservative total (15 langs) | | 160.7KB |
| Full total (all langs) | | 439.0KB |
That’s 160-440KB of JavaScript my client never downloads, parses, or executes. KaTeX also ships ~25KB of CSS and variable-size font files on top of that.
Smaller bundles aren’t just faster to load. They’re gentler on the CPUs and batteries of your users’ devices. Every byte of JavaScript the browser parses and executes is work that drains power and generates heat. This almost never comes up in the web app discourse, but I think we should respect our users’ devices, not just their time.
But bundle size is only the download tax. The runtime cost is worse.
Runtime cost: main thread blocking
Those libraries don’t just sit there after download. Every message that streams in triggers markdown parsing, regex-heavy syntax highlighting, LaTeX rendering, and DOM sanitization, all on the main thread. While highlight.js grinds through a 200-line code block and KaTeX renders an equation, the user’s scrolling, typing, and taps compete for that same thread. The UI can freeze, stutter, or jank. There are ways around this, but only at the cost of added complexity and even more code.
Yes, you can push parsing/highlighting into workers, use WASM-based highlighters, or do lazy highlighting on idle. But you’re still shipping the full parsing/highlighting/sanitization toolchain to every device, and you’re still spending the user’s CPU budget on work that produces the same output regardless of where it runs.
How streaming rendering works
A naive “render on the server” approach would wait for the full response to finish and only then send completed HTML. That defeats the point of streaming.
Instead, the server sends small pieces of ready-to-display HTML as the response arrives, then does a final pass at the end to make sure structures that depend on full context, like tables or nested lists, are rendered correctly.
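As a sketch of that idea (the block-boundary heuristic here is illustrative, not Cheddy Chat's actual pipeline, and `render` is a stand-in for the real markdown-to-sanitized-HTML step):

```python
def render(md: str) -> str:
    # Stand-in for the real pipeline (markdown -> highlighted, sanitized HTML).
    return f"<p>{md.strip()}</p>"

def stream_render(tokens):
    """Yield display-ready HTML fragments as markdown tokens arrive."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        # A blank line marks a block boundary: safe to render that block now.
        while "\n\n" in buffer:
            block, buffer = buffer.split("\n\n", 1)
            if block.strip():
                yield render(block)
    # Final pass on the trailing block. A real implementation would also
    # re-render structures that need full context (tables, nested lists).
    if buffer.strip():
        yield render(buffer)
```

Each yielded fragment goes straight onto the SSE stream, so the user sees rendered output at roughly the same cadence as raw-token streaming.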
The important part is that the browser still does almost no rendering work. It receives safe, display-ready markup and inserts it into the page. The markdown remains the source of truth for storage and sync, while the HTML is the presentation layer users see in real time.
Server-side rendering: the right tools, not the most tools
My API server is FastAPI (Python), so this is a Python-native toolchain.
Moving rendering to the server only makes sense if the server-side libraries are just as good or better for the job. In my case they are, but not always for the reasons you’d expect.
Pygments vs highlight.js/shiki. The headline stat is 500+ lexers vs ~190, but the real win isn’t coverage breadth. It’s zero client cost. Every language Pygments supports costs the user nothing: no bundle bloat, no conditional loading, no “should I ship all 190 or just the top 15?” decision. The user pastes Terraform? Highlighted. VHDL? Highlighted. The client didn’t load a single extra byte for either. When a language isn’t recognized, Pygments falls back to TextLexer (clean plaintext). highlight.js’s auto-detection mode guesses, and guesses wrong often enough to be a support issue.
nh3 vs DOMPurify. nh3 is a Python binding to the Rust ammonia crate. Native Rust code, no DOM emulation, no jsdom, no WASM. One function: nh3.clean(html, tags=..., attributes=...). DOMPurify needs a DOM; in the browser that’s free, on a Node server you need jsdom (~800KB). nh3 also auto-adds rel="noopener noreferrer" to links. With DOMPurify, you configure that yourself.
python-markdown vs markdown-it. This one’s genuinely close. markdown-it has better CommonMark compliance, and if strict compliance mattered for my use case, I’d choose it. But python-markdown’s extension system (fenced_code, tables, nl2br) is stable and hasn’t had a breaking change in years. For a rendering pipeline I want to set and forget, boring is the feature. The compliance gap doesn’t affect the markdown patterns LLMs actually produce.
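For reference, this is the standard python-markdown API with the extensions mentioned above:

```python
import markdown

# Reusable converter with the extensions the pipeline needs.
md = markdown.Markdown(extensions=["fenced_code", "tables", "nl2br"])

html = md.convert("| a | b |\n|---|---|\n| 1 | 2 |")
md.reset()  # Markdown instances keep state between convert() calls
```

That's the whole integration surface, which is exactly why "boring" works here.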
latex2mathml vs KaTeX. KaTeX (a JS library) produces better visual output. But less than 2% of messages in a typical chat session contain LaTeX. KaTeX costs 75KB gzipped + CSS + fonts for that 2%. latex2mathml outputs MathML (natively supported in all major browsers since Chrome 109), costs nearly nothing server-side, and requires zero client JavaScript. The output has visible quality differences (simpler typesetting, less precise spacing), but for the occasional inline equation in a chat message, it’s an acceptable tradeoff. If I were building a math-focused tool, I’d choose differently.
Interactive charts: when you need client-side JavaScript
Static rendering covers most of what an AI chat app displays: prose, code blocks, tables, LaTeX, even Mermaid diagrams. But interactive charts with zoom, pan, tooltips, and responsive axes genuinely need JavaScript running in the browser. This is the case that tests the architecture.
Here’s how I handle it without compromising the main app.
When a user asks for a data visualization, the heavy lifting happens on the backend. Sometimes the result is a static graphic that can be streamed like any other rendered content. When interactivity is needed, the client gets only a lightweight chart payload and displays it separately from the core chat UI.
When a user opens an interactive chart, it runs in an isolated sandbox so its JavaScript stays contained and can’t interfere with the main app. The charting code is loaded only when someone actually opens a chart, then reused for later views. Users who never touch charts never pay the cost.
This is the pattern the blog post is really about. The question isn’t “server or client.” It’s “does this code earn its place on the user’s device?” Static rendering doesn’t, so I moved it to the server. Interactive chart rendering does, so it stays on the client, but isolated from the main app, loaded only when needed, and invisible to the 95% of interactions that never touch it.
Where this hurts
Now for the negative aspects of this architectural choice.
Server CPU in the hot path. Pygments highlighting and markdown rendering now run on the same server handling LLM streaming. Under load, rendering competes with request handling. I haven’t hit this limit yet, but it’s a scaling concern. Rendering is per-chunk and fast, and horizontally scaling the API tier scales rendering with it. But the reality is, moving the work from the client to the server means additional costs for me.
Tail latency. Rendering in the streaming hot path adds time between receiving an LLM token and the client seeing it. For most chunks (plain text, short code), this is sub-millisecond. For a chunk that completes a large fenced code block, the Pygments pass is measurable. I accept this because shipping that highlighting work to the client’s main thread has worse tail latency and blocks user interaction.
Large code blocks. A user pasting a 2000-line code block triggers a single expensive Pygments render. This is the worst case for per-chunk latency, and in-process CPU bursts can absolutely affect throughput under load. I haven’t needed to optimize this yet. If I do, pushing highlighting into a thread pool (or separate worker) would be my next step.
Offline rendering. This is the sharpest tension. I’m building a local-first PWA with offline history via IndexedDB. If the device is offline and has messages stored locally, what renders the markdown? I persist both the raw markdown (source of truth) and the rendered HTML (a derived cache) along with a renderer version. Offline viewing displays the pre-rendered HTML. This means I store ~2x the content per message, but it means offline users see fully rendered messages without needing any rendering libraries on the client. If I change the renderer, I can invalidate and re-render cached HTML. If a message was never rendered (rare edge case: created offline on another device, synced as markdown only), the client displays the raw markdown as plain text, functional but unstyled.
Adding new rendering features. Client-side, you npm install a new library. Server-side, you add a Python dependency and redeploy the API. This is a real operational cost. Adding Mermaid diagram support, for example, requires a server deploy rather than a frontend build.
The core tradeoff is simple: I use more of my server’s CPU so my users don’t have to use theirs.
Security
Server-side rendering with strict allowlist sanitization (via nh3) reduces XSS attack surface by centralizing the parsing/sanitization pipeline and removing client-side markdown-to-HTML conversion, historically a rich source of injection vectors. But “reduces” isn’t “eliminates.” I still need strict tag/attribute allowlists, safe link handling (rel="noopener noreferrer", javascript: URL blocking), a Content Security Policy, and careful treatment of user-generated content that bypasses the renderer. The win is that there’s one place to audit, one set of rules, running in an environment I control instead of a moving target across browser versions and extensions.
Interlude: a note on AI agents and architecture
This is the kind of architectural decision an LLM coding agent wouldn’t suggest if you were trying to one-shot an app. Ask one to build a chat app like Cheddy Chat and you’ll get React + client-side markdown parser + highlight.js, the median of its training data. Contrarian choices that differentiate a product come from understanding specific constraints in a way that contradicts conventional wisdom. More and more we let the agent write our code, but we need to own the architecture.
The result
Cheddy Chat’s heaviest page ships ~160KB of gzipped JavaScript. The total JS shipped to the client, including the Sentry and Fathom scripts, is ~285KB (gzipped). That page has real-time streaming chat, private mode, conversation branching, voice-to-text, Mermaid diagrams, artifact creation, model switching, file uploading, dataset support, and data analysis with visualizations. It loads quickly on mobile networks and feels native, not “fast for a web app.” The rendering pipeline that makes this possible is zero lines of client-side code. The server does the work. The browser paints the result. That’s the whole trade.
If you want a chat app you can install on your phone and desktop, that loads fast on mobile, respects your device, and lets you use your own API keys with sync across devices, try Cheddy Chat. Pro subscribers get Private models and a privacy-focused option that never stores or trains on your data.