
SpeechDown: Building a speech-to-markdown app with three coding agents

I set out to build a proof of concept app using the web speech recognition API, and decided to test three different LLM coding agents head-to-head on the same task. What I found surprised me.

TL;DR: Cursor CLI delivered a production-ready app immediately, without any follow-up messages. Claude needed one fix but recovered well. Gemini struggled with basic setup and required manual intervention. The code quality differences matched the development experience, and I’m surprised Cursor’s Composer-1 isn’t included in more coding leaderboards.

The goal: dictate text into a markdown editor

This project started with me wanting to learn more about current web speech recognition APIs. I thought it would be cool to be able to dictate text directly into a markdown document, so I tested some coding agents to see how well they could build a first proof of concept using my favorite frontend technology: SvelteKit.

The prompt was straightforward: create a SvelteKit app that could write markdown in a text area using voice input via the Web Speech API. The agents I had readily available were Cursor CLI (using Composer-1), Claude, and Gemini CLI.
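For context, the core feature the prompt asks for is small. A minimal sketch of the idea in a Svelte 5 component (my own illustration, not code taken from any of the generated apps) could look roughly like this, using the browser’s SpeechRecognition interface:

```svelte
<script lang="ts">
  // Rough sketch of the core feature: dictate into a markdown textarea.
  // SpeechRecognition is prefixed in Chromium browsers and missing from
  // TypeScript's DOM types, hence the `any` casts.
  let markdown = $state('');
  let interim = $state('');
  let listening = $state(false);

  function startDictation() {
    const Recognition =
      (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
    if (!Recognition) return; // API not supported in this browser

    const recognition = new Recognition();
    recognition.continuous = true;
    recognition.interimResults = true; // needed for real-time transcription

    recognition.onresult = (event: any) => {
      interim = '';
      for (let i = event.resultIndex; i < event.results.length; i++) {
        const transcript = event.results[i][0].transcript;
        if (event.results[i].isFinal) markdown += transcript;
        else interim += transcript;
      }
    };
    recognition.onend = () => (listening = false);

    recognition.start();
    listening = true;
  }
</script>

<button onclick={startDictation} disabled={listening}>Start dictation</button>
<textarea bind:value={markdown} rows="10"></textarea>
<p><em>{interim}</em></p>
```

Note that SpeechRecognition is not available in every browser (Chromium-based ones expose it under a webkit prefix), so a real app needs at least a feature check and a warning.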

Note: My Gemini CLI had preview features disabled and used Gemini 2.x models. By the time I figured this out, I had burned through my quota and haven’t been able to redo the task with the newer 3.x models yet. This may have significantly impacted Gemini’s performance in this comparison.

1. Cursor CLI

[Cursor screenshot] · View Cursor result

Cursor generated a proof of concept app very quickly. After the initial prompt it worked immediately, with no fixups or extra input from me needed. The UI also looked good to me: it’s pretty clean and it shows real-time transcription as you dictate. One minor flaw is that it renders the markdown, but with the selected daisyUI theme an <h1> tag is not styled, so you can’t actually see the headings in the preview pane.
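If I were to patch that up myself, my first guess (just an assumption, I haven’t checked how the generated preview is actually implemented) would be to wrap the rendered HTML in Tailwind Typography’s prose class, which brings back the heading and list styles that Tailwind’s preflight reset removes:

```svelte
<!-- Hypothetical preview pane: `renderedMarkdown` stands in for whatever
     HTML the app derives from the textarea contents. The `prose` class from
     the @tailwindcss/typography plugin restores heading, list and paragraph
     styles that Tailwind's preflight reset (and thus the daisyUI theme)
     otherwise leaves flat. -->
<article class="prose">
  {@html renderedMarkdown}
</article>
```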

2. Claude

[Claude screenshot] · View Claude result

Claude was a bit slower to get going. After it was done generating its first version, there were some issues with the Tailwind CSS integration, and I got a working but completely unstyled app. I asked it to fix this and it managed to correct its own error. The end result worked well and the UI was also clean.
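I didn’t dig into exactly what Claude got wrong, but in my experience the classic cause of a working-but-unstyled SvelteKit app is the Tailwind entry stylesheet never being imported in the root layout, something along these lines:

```svelte
<!-- src/routes/+layout.svelte (Svelte 5 syntax) -->
<script lang="ts">
  // This import is what pulls Tailwind's generated CSS into the app;
  // without it everything renders, but completely unstyled.
  import '../app.css';

  let { children } = $props();
</script>

{@render children()}
```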

3. Gemini CLI

[Gemini screenshot] · View Gemini result

Gemini was a bit of a mess compared to the other two. When it said it was done, all I got was a bunch of compile errors from Vite. It took several more back-and-forths to get through those (at one point Gemini claimed “I’m sure it’ll work now.”, followed by “I’ve made repeated errors, and I’m sorry.” when I pointed out that there were still compile errors). When the dev server was finally up and running, I only saw the default Svelte (not SvelteKit) demo app. I asked Gemini to fix it, but it couldn’t; I had to manually instruct it to download the SvelteKit starter template and copy over the files that were needed.

It worked after that, but the result was very basic. It was also the only one of the three that doesn’t show the transcription in real time.

Technical Comparison by Claude

Once I had three working apps, I decided to use AI to compare the technical aspects of the codebases as well (I renamed them implementation 1, 2 and 3, just to make sure the LLM couldn’t easily slip some model preference into its judgement). I asked Claude to generate a report, including a one-line summary of each codebase. This is what it came up with:

  • Cursor CLI → “Senior architect’s production-ready blueprint” — Built like it’s going straight to production with enterprise-grade patterns.
  • Claude → “Mid-level developer’s solid MVP” — Ships features that work well without gold-plating.
  • Gemini → “Weekend hacker’s ‘works on my machine’ special” — Built fast, works now, refactor later.

I think Claude judged the code pretty well, and its assessment perfectly matches my development experience. The full report is available in the GitHub repo.

Why this matters

Cursor CLI outperformed Claude and Gemini by a big margin in this test. This surprised me: I had expected them to be closer, and for Claude to come out on top based on my recent experience with it. Given this result, it really surprises me that I rarely see Cursor’s Composer-1 model included in leaderboards like SWE-bench.

This also makes me wonder: is there something specific about my task? A few thoughts that came out of this project:

  • Svelte is not the most popular framework, and models probably haven’t been trained on it as much as on, for instance, React. Does this mean that the future of coding agents will put strong pressure towards the most popular frameworks, because those just work better?
  • Or is the reason that Svelte 5 is still pretty new and the distinction between pure Svelte and SvelteKit is a bit tricky? Both Claude and Gemini seemed to struggle with the setup of the project and at first tried to start from the pure Svelte starter.
  • How good is the ‘breadth’ of the existing agentic coding benchmarks? Now that I’ve looked it up, I see that SWE-bench is focused on Python code. Additionally, my use case of building a completely new web app from scratch may be quite different from testing whether an agent can fix a bug in a (potentially huge) existing codebase.

Check out the code on GitHub

The results and prompt are available at

https://github.com/apptornado/speechdown

Discuss on Hacker News