Dictation tools optimize the wrong half. They pour years of engineering into the first transformation (turning your speech into accurate, well-punctuated, filler-free text) and then stop the instant the words show up in a box. The race is all about that first half: lower word-error rates, faster transcription, smarter formatting. And it's genuinely impressive work.
It's also beside the point. Because text in a box was never the goal.
Nobody dictates to admire a paragraph. You dictate because there's something you want to do with those words — send them, sharpen them, translate them for a client, ask a question about them, turn them into a post. The dictation tool hands you a clean block of text and then walks away, right at the moment the actual work begins. It perfected the input and ignored the point.
What you actually do after you dictate
Watch the real moment, the half-second after the words appear. You almost never just leave them there. You:
- reread the tone and decide it's too stiff, so you start rewriting;
- realize the recipient speaks Spanish, so you switch tabs to a translator;
- copy the text, alt-tab to Slack, find the right channel, paste;
- think of a follow-up question and open a chatbot to ask it;
- decide it'd make a good LinkedIn post and start reshaping it into one.
Every one of those is a separate action, and today every one is manual. The dictation tool's job ended when the text appeared. Your job — the reason you opened your mouth in the first place — started there. So the thing marketed as "type by speaking" actually saves you the typing and leaves you with all the app-switching. You replaced one keyboard with a faster keyboard.
A faster keyboard is still a keyboard
That's the framing the whole category inherited: voice as a drop-in replacement for the keys. Type with your mouth instead of your fingers. It's a reasonable starting point, and it set the terms of the competition — whoever transcribes fastest and cleanest wins.
But notice the ceiling that framing builds in. A keyboard, however fast, doesn't do anything. It puts characters in a field and waits for you to do everything else — the selecting, the sending, the switching. If your mental model of voice is "a better keyboard," then the best you can possibly build is a better way to fill a box. The box becomes the finish line. And the box was never where anyone was trying to go.
This is why a tool can have a near-perfect transcription engine and still leave you doing the same amount of clicking you did before. The accuracy went up. The number of apps you have to juggle didn't move.
The category already knows
Here's the tell: the better tools are quietly racing past their own framing.
Wispr Flow added a Command Mode: you highlight text, speak an instruction like "make this more concise" or "translate to French," and it rewrites the selection in place. It works, and it's a real step past the box. It's worth noting that it sits behind a paid plan and has to be switched on in settings before you can use it (Wispr Flow Help Center). Other tools have gone further: VoiceOS ships an "Agent mode" that connects calendar, email, and Slack so you can trigger real actions by voice, and the 2026 dictation roundups now describe the frontier as "voice input that triggers actions in other applications, not just text insertion" (Zapier's 2026 dictation guide; VoiceOS Agent-mode review). There's even a new entrant whose entire pitch is the phrase itself: "Voice to Action OS" (Zavi on Product Hunt).
None of that is a knock. It's the opposite — it's the market conceding the point. Everyone serious is admitting that text-in-a-box was a waypoint, not the destination. The category isn't "voice to text." It's becoming voice to action, and the tools are getting there one bolted-on feature at a time.
davr's stance: the dictation is the input, the action is the product
davr starts from the other end. Dictation is assumed — table stakes, the part everyone can do. What we build is the layer that comes after the words appear: the action.
Concretely, that means the thing you wanted to do is the thing you say:
- Transform Text — highlight any text in any app and rewrite, expand, shorten, or change its language by voice, replaced in place.
- Inline translation — speak one of around 40 languages and have the output arrive in another, so "say it in English, send it in Spanish" is a single step, not a tab switch.
- Ask Claude from a hotkey — pose a question out loud anywhere and get the answer typed in wherever your cursor already is, no chatbot tab required. [link: ask-claude-from-a-hotkey-in-any-app]
- Briefs — hand a spoken thought to davr and get back a finished post, email, or piece of content, not a transcript you still have to write up (Max).
- Veil — dictate a private message and have it encrypted and buried inside innocent-looking cover text only your contact can decode (Max).
- Tasks — speak a task and davr auto-generates a calendar event; opening it adds it to your calendar of choice.
- Vibe Coding and Voice Navigation — dictate code straight into your IDE, and drive your machine — scroll, click, switch apps — hands-free.
One throughline runs under all of it: the speech doesn't just become text, it triggers the action, and it does so in whatever app you're already standing in. That's the whole thesis in one line — connect all your words to all your actions, in any application. The point was never to fill the box faster. It was to skip the box.
(Today davr runs on Windows. Mac, iPhone, and Android are rolling out over the next couple of months, so if you're on a Mac right now, this is a "coming soon," not a "go install it.")
Why privacy makes this matter more, not less
There's a catch in building the action layer, and it's the reason we lead with privacy everywhere else.
A voice-to-text tool only ever sees a sentence on its way to a box. A voice-to-action tool sees the whole shape of your day: the client replies, the questions you ask Claude, the message sensitive enough that you reached for Veil, the same thought rewritten in three languages, your to-do list, your code. The more of the second half a tool handles, the more of your life flows through that one pipe. Richer action layer, higher stakes.
That's exactly why davr makes it easy to answer where your words go. Two toggles control it. Local runs Whisper on your device, so speech-to-text never reaches OpenAI; Privacy Mode drops the Claude cleanup step, so the text never reaches Anthropic. Flip both and the whole pass, voice and text, stays on your machine. And there's a bring-your-own-key mode on top, where any cloud step you do keep runs through your own provider account instead of ours. The point isn't that privacy is a nice extra bolted onto the action layer. It's that an action layer you can't trust with your words is one you'll never actually talk to candidly, which means it'll never replace the app-switching it was built to kill. Architecture, not a toggle. [link: privacy-isnt-a-setting-its-the-architecture]
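If it helps to see the two toggles as logic, here is a minimal sketch of which outside providers would see data on a single dictation pass. It is an illustration only, not davr's actual code: the names `PrivacySettings` and `external_providers` are hypothetical, and the sketch assumes that with Local off the audio goes to OpenAI for transcription, and that with Privacy Mode off the transcript goes to Anthropic for cleanup. Bring-your-own-key is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacySettings:
    """Hypothetical model of the two toggles described in the text."""
    local: bool          # True: speech-to-text runs on-device with Whisper
    privacy_mode: bool   # True: the Claude cleanup step is skipped


def external_providers(settings: PrivacySettings) -> set[str]:
    """Return the outside providers that would receive data for one pass.

    Assumption: cloud transcription means OpenAI and cloud cleanup means
    Anthropic, as the surrounding text implies.
    """
    providers: set[str] = set()
    if not settings.local:
        providers.add("OpenAI")      # audio leaves the machine for transcription
    if not settings.privacy_mode:
        providers.add("Anthropic")   # transcript leaves the machine for cleanup
    return providers


if __name__ == "__main__":
    # Walk all four toggle combinations and show who sees what.
    for local in (False, True):
        for privacy_mode in (False, True):
            settings = PrivacySettings(local=local, privacy_mode=privacy_mode)
            seen_by = sorted(external_providers(settings)) or ["nobody (fully on-device)"]
            print(f"local={local}, privacy_mode={privacy_mode}: {', '.join(seen_by)}")
```

Under those assumptions, only the combination with both toggles on returns an empty set, which is the "whole pass stays on your machine" case.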
Try the half nobody else finished
If you've used a dictation tool and walked away faintly underwhelmed, this is probably why: it nailed the easy half and handed the hard half back to you. The words were clean. You still did all the work.
davr is built to take the second half too — to make the action the thing you say, in the app you're already in, with the option to keep your voice on your own machine while you do it.
You can test that for nothing. davr is free when you bring your own API key — dictation on your own OpenAI or Anthropic account, no middleman. And if you'd rather try the action features without wiring up keys first, there's a 14-day trial with no credit card. Dictate something, and then — instead of reaching for the mouse — just say what you wanted to do with it.
Start free with your own key, or take the 14-day trial — no card required.