There is a moment in every dictation app where the user pauses and thinks: "Wait, should I switch modes?" They are writing an email but the app doesn't know that. It transcribes their words literally, no greeting, no sign-off, no formatting. Just a wall of text they'll have to manually edit into something that looks like an email.

I decided to kill that moment. Yakki should detect that you're composing an email and act accordingly: capture the recipients, subject, thread context, and use all of that to format your dictation properly. No mode switching. No menus. You hold the key, you speak, and the text comes out formatted like an email because the app already knows that's what you're writing.

The Two-Layer Detection Strategy

Email detection has a fundamental split: native apps and browser-based email. Apple Mail has a bundle identifier you can match. Gmail in Chrome does not. So I built two detection layers that run in sequence.

Layer 1: Bundle ID matching for native apps.

private static let emailAppBundleIds: Set<String> = [
    "com.apple.mail",
    "com.microsoft.Outlook",
    "com.readdle.smartemail-Mac",  // Spark
    "com.superhuman.electron",
    "com.freron.MailMate",
    "com.postbox-inc.postbox",
    "com.mimestream.Mimestream",
]

Fast and definitive. If the frontmost app has one of these bundle IDs, you're in an email context. No ambiguity, no false positives.

Layer 2: Window title pattern matching for browser-based email.

private static let emailWindowPatterns: [String] = [
    "Gmail", "Outlook", "Yahoo Mail",
    "Proton Mail", "ProtonMail",
    // ... compose window indicators
]

var isEmailContext: Bool {
    // Check native email apps by bundle ID
    if Self.emailAppBundleIds.contains(bundleIdentifier) {
        return true
    }

    // Check browser with email-related window title
    let category = ApplicationCategory.category(for: bundleIdentifier)
    if category == .browser, let title = windowTitle {
        return Self.emailWindowPatterns.contains { pattern in
            title.localizedCaseInsensitiveContains(pattern)
        }
    }

    return false
}

The key insight: only check window titles when the frontmost app is a browser. This prevents false positives from, say, a text editor with a file called "Gmail notes.md" open. The ApplicationCategory system classifies apps by bundle ID into categories (browser, editor, terminal, etc.), and only falls through to title matching for browsers.
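The category check that gates title matching can be sketched as a plain lookup table. A minimal sketch, assuming a simple set-based classifier — the bundle IDs and the exact shape of the real ApplicationCategory are illustrative:

```swift
// Hypothetical sketch of a bundle-ID classifier like ApplicationCategory.
// The IDs listed here are examples; the real table is presumably larger.
enum ApplicationCategory {
    case browser, editor, terminal, other

    private static let browsers: Set<String> = [
        "com.apple.Safari", "com.google.Chrome",
        "org.mozilla.firefox", "com.microsoft.edgemac",
    ]
    private static let terminals: Set<String> = [
        "com.apple.Terminal", "com.googlecode.iterm2",
    ]

    static func category(for bundleId: String) -> ApplicationCategory {
        if browsers.contains(bundleId) { return .browser }
        if terminals.contains(bundleId) { return .terminal }
        return .other
    }
}
```

With this shape, the title check in isEmailContext only ever runs for apps classified as browsers, so the editor holding "Gmail notes.md" never reaches pattern matching.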

This two-layer approach catches every native email client and every major browser-based email service without any user configuration.

The Timing Problem

The whole illusion breaks if the user feels a delay. Context detection has a race condition built into its requirements. Three things must happen in the same instant:

  1. Know which app was in the foreground (before Yakki's window appears)
  2. Capture the window screenshot (before the app state changes)
  3. Start recording audio (without delay)

These compete. The screenshot needs the app to be visible. The audio needs to start immediately. And step 1 has to happen before macOS shifts focus to Yakki.

The solution: capture the app reference synchronously on hotkey press, then run everything else in parallel.

func hotkeyPressed() {
    // CONTEXT AWARE: Store reference to the current frontmost app
    // BEFORE Yakki takes focus. This must happen immediately.
    if appState.contextConfig.isEnabled {
        ContextIdentificationManager.shared.storePreviousApplication()
    }

    // ... start recording, UI updates, etc.
}

The call is nonisolated for a reason:

nonisolated func storePreviousApplication() {
    let app = NSWorkspace.shared.frontmostApplication
    Task { @MainActor in
        self.previousApplication = app
    }
}

The reference is captured on the calling thread. Microseconds, not milliseconds. The Task ships it to MainActor storage asynchronously. By the time the audio engine starts, the context manager already knows which app was active.

Meanwhile, the full context capture runs concurrently with recording:

// CONTEXT AWARE: Capture context in parallel with audio capture starting
if appState.contextConfig.isEnabled {
    Task {
        let context = await ContextIdentificationManager.shared.captureContext(
            config: appState.contextConfig
        )
        if let ctx = context, ctx.isEmailContext {
            // Discard the local value; captureEmailContextIfNeeded
            // presumably retains the result for the formatting pass.
            _ = await ContextIdentificationManager.shared
                .captureEmailContextIfNeeded(config: appState.contextConfig)
        }
    }
}

Hold the key. Start speaking. While your voice becomes text, the context manager is capturing a screenshot, running OCR, parsing recipients. By the time you release the key, both are done.

The OCR Pipeline

Knowing you're in an email app is half the problem. The app also needs to know who you're writing to, what the subject is, and whether this is a reply or a fresh thread. That's the difference between generic formatting and formatting that reads like your assistant already saw the conversation.

Stage 1: Screenshot capture.

The challenge is finding the right window. Apps like Apple Mail open a separate compose window, and I need to capture that one, not the inbox. The focused window's CGWindowID comes through the Accessibility API:

let appElement = AXUIElementCreateApplication(app.processIdentifier)
var windowRef: AnyObject?
let windowResult = AXUIElementCopyAttributeValue(
    appElement, kAXFocusedWindowAttribute as CFString, &windowRef
)
guard windowResult == .success, let focusedWindow = windowRef else { return nil }

var focusedWID: CGWindowID = 0
let axErr = _AXUIElementGetWindow(focusedWindow as! AXUIElement, &focusedWID)
guard axErr == .success else { return nil }

_AXUIElementGetWindow is a private HIServices function, stable since macOS 10.5, used by every accessibility tool on the platform. It gives us the exact CGWindowID for the focused window, even when the app has multiple windows across multiple screens.
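Because the symbol has no public header, it has to be declared before Swift can call it. A common pattern is an @_silgen_name declaration — shown here as a sketch; since this is private API, the signature below is the community-documented one, not an Apple-guaranteed contract:

```swift
import ApplicationServices

// Binds the private HIServices symbol so Swift can call it directly.
// Takes an AXUIElement and writes the window's CGWindowID into `windowID`.
@_silgen_name("_AXUIElementGetWindow")
@discardableResult
func _AXUIElementGetWindow(
    _ element: AXUIElement,
    _ windowID: inout CGWindowID
) -> AXError
```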

Stage 2: Vision OCR.

A screenshot is just pixels. To get structured data, the pixels have to become text:

func extractContext(from image: CGImage) async -> EmailContext {
    let (ocrText, confidence) = await performOCR(on: image)

    let recipients = extractRecipients(from: ocrText)
    let subject = extractSubject(from: ocrText)
    let (isReply, isForward, threadContext) = extractThreadInfo(from: ocrText)
    let signatureStyle = detectSignatureStyle(from: ocrText)

    return EmailContext(
        recipients: recipients,
        subject: subject,
        threadContext: threadContext,
        signatureStyle: signatureStyle,
        isReply: isReply,
        isForward: isForward,
        ocrConfidence: confidence
    )
}

Vision runs entirely on-device. No network calls, no privacy concerns. The full OCR pass typically completes in 40-80ms depending on window size.
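For reference, here's what the Vision call inside performOCR might look like, written synchronously since VNImageRequestHandler.perform blocks anyway. A sketch under that assumption; the real performOCR may configure the request differently, and `recognizeText` is a name of my choosing:

```swift
import Vision
import CoreGraphics

// Sketch of the OCR stage: recognize text lines in a window screenshot
// and average the per-line confidence. Runs fully on-device.
func recognizeText(in image: CGImage) -> (text: String, confidence: Float) {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // favor accuracy over speed
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    guard (try? handler.perform([request])) != nil,
          let observations = request.results, !observations.isEmpty else {
        return ("", 0)
    }

    var lines: [String] = []
    var confidences: [Float] = []
    for observation in observations {
        // topCandidates(1) yields the most likely transcription per line
        if let best = observation.topCandidates(1).first {
            lines.append(best.string)
            confidences.append(best.confidence)
        }
    }
    guard !confidences.isEmpty else { return ("", 0) }
    let avg = confidences.reduce(0, +) / Float(confidences.count)
    return (lines.joined(separator: "\n"), avg)
}
```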

Stage 3: Pattern matching to extract fields.

Email fields aren't just "To:" and "Subject:". Depending on the user's locale, they're "Para:", "À:", "An:", "Asunto:", "Objet:", or "Betreff:":

private let toPatterns = ["To:", "Para:", "À:", "An:", "A:"]
private let ccPatterns = ["Cc:", "CC:", "Copia:", "Copie:", "Kopie:"]
private let subjectPatterns = ["Subject:", "Asunto:", "Objet:", "Betreff:", "Oggetto:", "件名:"]

private let replyPatterns = [
    "On .+ wrote:",        // English
    "El .+ escribió:",     // Spanish
    "Le .+ a écrit:",      // French
    "Am .+ schrieb:",      // German
]
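
Both kinds of pattern reduce to a few lines of string matching over the OCR output. A hedged sketch of what extractSubject and the reply check might look like — the helper names and exact logic here are illustrative, not Yakki's actual implementation:

```swift
import Foundation

let subjectLabels = ["Subject:", "Asunto:", "Objet:", "Betreff:", "Oggetto:", "件名:"]
let replyMarkers = ["On .+ wrote:", "El .+ escribió:", "Le .+ a écrit:", "Am .+ schrieb:"]

// Scan OCR lines for a localized subject label and return the remainder.
func extractSubject(from ocrText: String) -> String? {
    for line in ocrText.split(separator: "\n") {
        let trimmed = line.trimmingCharacters(in: .whitespaces)
        for label in subjectLabels where trimmed.hasPrefix(label) {
            let value = trimmed.dropFirst(label.count)
                .trimmingCharacters(in: .whitespaces)
            return value.isEmpty ? nil : value
        }
    }
    return nil
}

// A reply is detected when any localized quoted-reply header matches.
func looksLikeReply(_ ocrText: String) -> Bool {
    replyMarkers.contains { marker in
        ocrText.range(of: marker, options: .regularExpression) != nil
    }
}
```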

The extractor also classifies signature style (formal or casual) so the LLM can match the user's tone:

private let formalSignaturePatterns = [
    "Best regards", "Kind regards", "Sincerely",
    "Cordialmente", "Atentamente",
]
private let casualSignaturePatterns = [
    "Thanks", "Thank you", "Cheers",
    "Gracias", "Merci", "Danke",
]
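
The classifier itself can be as simple as a case-insensitive scan, checking formal closings first so a thread containing both "Best regards" and a stray "thanks" reads as formal. A hypothetical sketch:

```swift
import Foundation

enum SignatureStyle { case formal, casual }

let formalClosings = ["Best regards", "Kind regards", "Sincerely",
                      "Cordialmente", "Atentamente"]
let casualClosings = ["Thanks", "Thank you", "Cheers",
                      "Gracias", "Merci", "Danke"]

// Formal closings win ties; nil means "no signal, let the LLM decide".
func detectSignatureStyle(in ocrText: String) -> SignatureStyle? {
    if formalClosings.contains(where: { ocrText.localizedCaseInsensitiveContains($0) }) {
        return .formal
    }
    if casualClosings.contains(where: { ocrText.localizedCaseInsensitiveContains($0) }) {
        return .casual
    }
    return nil
}
```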

All of this collapses into one struct:

struct EmailContext {
    var recipients: [EmailRecipient]
    var subject: String?
    var threadContext: String?
    var existingBodyText: String?
    var signatureStyle: SignatureStyle?
    var isReply: Bool
    var isForward: Bool
    var captureLatency: TimeInterval
    var ocrConfidence: Float
}
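
Downstream, the struct gets flattened into context the LLM formatting pass can use. The actual prompt format isn't shown in this post, so here's a hypothetical rendering to illustrate the handoff — the helper name and field wording are mine:

```swift
struct EmailRecipient { let name: String?; let address: String }
enum SignatureStyle: String { case formal, casual }

// Hypothetical: turn captured context into prompt lines for the
// formatting model. Layout and wording are illustrative only.
func promptPreamble(recipients: [EmailRecipient],
                    subject: String?,
                    isReply: Bool,
                    signatureStyle: SignatureStyle?) -> String {
    var lines = ["Format the dictated text below as an email."]
    if !recipients.isEmpty {
        let names = recipients.map { $0.name ?? $0.address }
        lines.append("Recipients: " + names.joined(separator: ", "))
    }
    if let subject { lines.append("Subject: \(subject)") }
    lines.append(isReply ? "This is a reply in an existing thread."
                         : "This starts a new thread.")
    if let signatureStyle {
        lines.append("Match a \(signatureStyle.rawValue) sign-off.")
    }
    return lines.joined(separator: "\n")
}
```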

What This Requires

Building this kind of intelligence requires getting three things right.

Detection must be fast and definitive. Bundle IDs never lie. Window titles are accurate 99% of the time. Accessibility attributes give you the focused window with certainty. If you're building any kind of context-aware macOS app, start with NSWorkspace.shared.frontmostApplication and a bundle ID lookup table: that's basic application awareness in 20 lines of code. Accessibility APIs and OCR come after. The multilingual field-extraction patterns are another 30 lines, and they're the difference between a feature that works for everyone and one that works only for English speakers.

Timing must be invisible. Capture the app reference synchronously on the hotkey press. Run everything else (screenshots, OCR, pattern matching) in a parallel Task. The entire pipeline runs in under 100ms, imperceptible to the user, especially since it runs concurrently with audio recording. The user should never feel the system thinking. The moment they do, the illusion is broken.

Privacy must be architectural. Some apps should never be observed: password managers, banking apps, terminals with sensitive output. I maintain a ContextBlocklist that short-circuits detection before any screenshot is taken. Privacy isn't a feature. It's a constraint that shapes every other decision in the system.
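
A minimal sketch of that short circuit, assuming a set-based blocklist — the bundle IDs below are examples, not Yakki's actual list:

```swift
// Hypothetical ContextBlocklist: consulted before any screenshot or
// AX query. If the app is blocked, detection stops here.
struct ContextBlocklist {
    private static let blocked: Set<String> = [
        "com.1password.1password",   // password manager
        "com.apple.Passwords",
        "com.apple.Terminal",        // may display sensitive output
    ]

    static func allowsCapture(of bundleId: String) -> Bool {
        !blocked.contains(bundleId)
    }
}
```

The check costs one set lookup, so it can run unconditionally at the very top of the capture path.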


This is part of an ongoing series about building Yakki, a macOS dictation app. The context-aware system described here is the foundation for the email formatting mode, which uses LLM-based post-processing to turn raw dictation into properly formatted emails.