the operator api automates desktop applications through accessibility roles and elements. it lets you find, click, fill, and scroll ui elements programmatically.

compatibility note:

some functions of the operator api currently work on macos only (windows support is in progress). on windows/linux, please use the pixel api instead.
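
the platform split described in this note can be encoded as a small guard. this is a minimal sketch, not part of the screenpipe sdk: the name pickAutomationApi is hypothetical, and it assumes the platform string uses node's process.platform values ("darwin", "win32", "linux").

```typescript
// hypothetical helper, not part of the screenpipe sdk
type AutomationApi = "accessibility" | "pixel"

// platform is assumed to follow node's process.platform values:
// "darwin" (macos), "win32" (windows), "linux"
function pickAutomationApi(platform: string): AutomationApi {
  // role-based methods (getByRole, getById, ...) are documented as macos only,
  // so every other platform falls back to the pixel api
  return platform === "darwin" ? "accessibility" : "pixel"
}
```

in node.js you could call pickAutomationApi(process.platform) once at startup and branch on the result.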

to understand roles better, open the macos accessibility inspector and examine the roles for any application.

feel free to use our docs as context in cursor agent through MCP

installation

npm install @screenpipe/browser

this also works in node.js using @screenpipe/js

basic usage

import { pipe } from '@screenpipe/browser'

async function simpleAutomation() {
  try {
    // open an application
    await pipe.operator.openApplication("Chrome")
    
    // navigate to a website
    await pipe.operator.openUrl("https://github.com/mediar-ai/screenpipe")
    
    // find search fields by role
    const searchFields = await pipe.operator
      .getByRole("searchfield", {
        app: "Chrome",
        activateApp: true
      })
      .all(3)
    
    if (searchFields.length > 0) {
      // fill the search field
      await pipe.operator
        .getById(searchFields[0].id, { app: "Chrome" })
        .fill("automation")
      
      // click a button
      const buttons = await pipe.operator
        .getByRole("button", { app: "Chrome" })
        .all(5)
      
      if (buttons.length > 0) {
        await pipe.operator
          .getById(buttons[0].id, { app: "Chrome" })
          .click()
      }
      
      // scroll page content
      await pipe.operator
        .getById(searchFields[0].id, { app: "Chrome" })
        .scroll("down", 300)
    }
  } catch (error) {
    console.error("automation failed:", error)
  }
}
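
the find-then-act pattern used above (query by role, then act on the element by id) can be wrapped in a small helper. this is a sketch under assumptions: the interfaces below only mirror the calls shown in this section and are not the sdk's own types, and clickFirstByRole is a hypothetical name.

```typescript
// minimal structural types covering only the calls used in this sketch;
// they mirror the usage shown above and are assumptions, not the sdk's types
interface ElementRef { id: string }
interface ElementQuery { all(limit: number): Promise<ElementRef[]> }
interface ElementHandle { click(): Promise<unknown> }
interface OperatorLike {
  getByRole(role: string, options: { app: string; activateApp?: boolean }): ElementQuery
  getById(id: string, options: { app: string }): ElementHandle
}

// hypothetical helper: click the first element with the given role.
// resolves to false (instead of throwing) when nothing matches.
async function clickFirstByRole(
  operator: OperatorLike,
  role: string,
  app: string,
): Promise<boolean> {
  const matches = await operator
    .getByRole(role, { app, activateApp: true })
    .all(1)
  const [first] = matches
  if (!first) return false
  await operator.getById(first.id, { app }).click()
  return true
}
```

assuming pipe.operator is structurally compatible with OperatorLike, usage would look like: await clickFirstByRole(pipe.operator, "button", "Chrome"). taking the operator as a parameter also makes the helper easy to test with a fake.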

core methods

the operator api provides a set of intuitive methods for automating desktop interactions:

| method | description | compatibility | example |
| --- | --- | --- | --- |
| openApplication(name) | launches an application | macOS only | pipe.operator.openApplication("Chrome") |
| openUrl(url, browser?) | opens a url in a browser | macOS only | pipe.operator.openUrl("github.com") |
| getByRole(role, options) | finds elements by accessibility role | macOS only | pipe.operator.getByRole("button", {app: "Chrome"}) |
| getById(id, options) | gets element by id | macOS only | pipe.operator.getById("element-123", {app: "Chrome"}) |
| .click() | clicks an element | macOS only | pipe.operator.getById(id).click() |
| .fill(text) | enters text in a field | macOS only | pipe.operator.getById(id).fill("hello") |
| .scroll(direction, amount) | scrolls an element | macOS only | pipe.operator.getById(id).scroll("down", 300) |
| pixel.type(text) | types text | all platforms | pipe.operator.pixel.type("hello world") |
| pixel.press(key) | presses a keyboard key | all platforms | pipe.operator.pixel.press("enter") |
| pixel.moveMouse(x, y) | moves mouse cursor to position | all platforms | pipe.operator.pixel.moveMouse(100, 200) |
| pixel.click(button) | clicks mouse button | all platforms | pipe.operator.pixel.click("left") |

pixel api

the pixel api is a higher-level api that is useful for:

  • controlling your iPhone through iPhone mirroring (because you cannot parse the screen of your iPhone)
  • Windows and Linux, which do not yet support functions like openApplication, getByRole, click, etc.

// type text
await pipe.operator.pixel.type("hello world")

// press key
await pipe.operator.pixel.press("enter")

// move mouse
await pipe.operator.pixel.moveMouse(100, 200)

// click
await pipe.operator.pixel.click("left") // "left" | "right" | "middle"
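
the four pixel calls above can be chained into one interaction. the sketch below is illustrative only: PixelLike is an assumed structural type that mirrors the calls shown in this section, typeAtPosition is a hypothetical helper, and the coordinates are whatever position your target field occupies on screen.

```typescript
// assumed shape of pipe.operator.pixel, limited to the calls shown above
interface PixelLike {
  type(text: string): Promise<unknown>
  press(key: string): Promise<unknown>
  moveMouse(x: number, y: number): Promise<unknown>
  click(button: "left" | "right" | "middle"): Promise<unknown>
}

// hypothetical helper: focus a field by clicking at screen coordinates,
// type into it, then submit with the enter key
async function typeAtPosition(
  pixel: PixelLike,
  x: number,
  y: number,
  text: string,
): Promise<void> {
  await pixel.moveMouse(x, y) // move the cursor over the field
  await pixel.click("left")   // click to give the field focus
  await pixel.type(text)      // type the text
  await pixel.press("enter")  // submit
}
```

assuming pipe.operator.pixel matches PixelLike, a call would look like: await typeAtPosition(pipe.operator.pixel, 100, 200, "hello world"). because this approach relies on coordinates rather than roles, it is sensitive to window position and screen resolution.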

common accessibility roles

to understand roles better, open the macos accessibility inspector and look at the roles used by any application.

when using getByRole(), you’ll need to specify the accessibility role. here are common ones:

  • "button" - clickable buttons
  • "textfield" - text input fields
  • "searchfield" - search input fields
  • "checkbox" - checkbox elements
  • "radiobutton" - radio button elements
  • "combobox" - dropdown menus
  • "link" - hyperlinks
  • "image" - images
  • "statictext" - text labels
  • "scrollarea" - scrollable containers
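
when role names come from user input or from an ai tool call, it can help to check them against the list above before querying. this is a minimal sketch: COMMON_ROLES and isCommonRole are hypothetical names, and the list only covers the roles documented here, so a role that fails the check may still be valid for a given application.

```typescript
// the roles listed in this section; not an exhaustive list of macos roles
const COMMON_ROLES = [
  "button",
  "textfield",
  "searchfield",
  "checkbox",
  "radiobutton",
  "combobox",
  "link",
  "image",
  "statictext",
  "scrollarea",
] as const

type CommonRole = (typeof COMMON_ROLES)[number]

// hypothetical type guard: narrows an arbitrary string to a documented role
function isCommonRole(value: string): value is CommonRole {
  return (COMMON_ROLES as readonly string[]).indexOf(value) !== -1
}
```

the check is case-sensitive, matching the lowercase spelling used in this list.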

advanced usage examples

automating form filling

async function fillContactForm() {
  // open the app and navigate to form
  await pipe.operator.openApplication("Chrome")
  await pipe.operator.openUrl("https://example.com/contact")
  
  // wait for page to load (simple delay)
  await new Promise(resolve => setTimeout(resolve, 2000))
  
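  // find form fields: calling .first() three times on the same role would
  // return the same element each time, so fetch three text fields in one query
  const fields = await pipe.operator
    .getByRole("textfield", { 
      app: "Chrome",
      activateApp: true
    })
    .all(3)
  
  if (fields.length < 3) {
    throw new Error(`expected 3 text fields, found ${fields.length}`)
  }
  
  // assumption: results come back in on-screen order (name, email, message)
  // and no other text field (e.g. the address bar) precedes them;
  // verify the order with the accessibility inspector for your form
  const [nameField, emailField, messageField] = fields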
  
  // fill the form
  await pipe.operator
    .getById(nameField.id, { app: "Chrome" })
    .fill("john doe")
  
  await pipe.operator
    .getById(emailField.id, { app: "Chrome" })
    .fill("john@example.com")
  
  await pipe.operator
    .getById(messageField.id, { app: "Chrome" })
    .fill("this is an automated message from screenpipe!")
  
  // find and click submit button
  const submitButton = await pipe.operator
    .getByRole("button", { 
      app: "Chrome"
    })
    .first()
  
  await pipe.operator
    .getById(submitButton.id, { app: "Chrome" })
    .click()
}

automating app workflows

async function processImages() {
  // open photoshop
  await pipe.operator.openApplication("Adobe Photoshop")
  await new Promise(resolve => setTimeout(resolve, 3000)) // wait for app to launch
  
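  // open file
  // (placeholder: .first() returns the first menu item found, so a real
  // workflow must pick the specific menu item it needs at each step)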
  await pipe.operator
    .getByRole("menuitem", { 
      app: "Adobe Photoshop"
    })
    .first()
    .then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
  
  await pipe.operator
    .getByRole("menuitem", { 
      app: "Adobe Photoshop"
    })
    .first()
    .then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
  
  // navigate file picker (simplified)
  // in practice, you'd need more complex logic to navigate the file picker
  
  // apply filter
  await pipe.operator
    .getByRole("menuitem", { 
      app: "Adobe Photoshop"
    })
    .first()
    .then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
  
  // and so on...
}

ai-powered automation

for more powerful automation, combine the operator api with the vercel ai sdk to enable ai-driven desktop interactions:

import { useState } from "react"
import { streamText, convertToCoreMessages } from "ai"
import { createOpenAI } from "@ai-sdk/openai"
import { pipe } from "@screenpipe/browser"
import { z } from "zod"

export function AIAutomationAgent() {
  const [input, setInput] = useState("")
  const [output, setOutput] = useState("")
  
  const handleSubmit = async (e) => {
    e.preventDefault()
    setOutput("")
    
    const model = createOpenAI({
      apiKey: process.env.OPENAI_API_KEY
    })("gpt-4o")
    
    const result = streamText({
      model,
      messages: convertToCoreMessages([{
        role: "user",
        content: input
      }]),
      system: "you are a desktop automation assistant. help users by performing actions on their computer.",
      tools: {
        openApplication: {
          description: "open an application",
          parameters: z.object({
            appName: z.string().describe("the name of the application to open")
          }),
          execute: async ({ appName }) => {
            const success = await pipe.operator.openApplication(appName)
            return success
              ? `opened ${appName} successfully`
              : `failed to open ${appName}`
          }
        },
        
        openUrl: {
          description: "open a url in a browser",
          parameters: z.object({
            url: z.string().describe("the url to open"),
            browser: z.string().optional().describe("the browser to use")
          }),
          execute: async ({ url, browser }) => {
            const success = await pipe.operator.openUrl(url, browser)
            return success
              ? `opened ${url} in ${browser || "default browser"}`
              : `failed to open ${url}`
          }
        },
        
        findAndClick: {
          description: "find an element by role and click it",
          parameters: z.object({
            app: z.string().describe("the application name"),
            role: z.string().describe("the accessibility role of the element")
          }),
          execute: async ({ app, role }) => {
            try {
              const elements = await pipe.operator
                .getByRole(role, { app, activateApp: true })
                .all(5)
                
              if (elements.length > 0) {
                await pipe.operator
                  .getById(elements[0].id, { app })
                  .click()
                return `clicked ${role} element in ${app}`
              }
              return `no ${role} elements found in ${app}`
            } catch (error) {
              return `error: ${error.message}`
            }
          }
        },
        
        fillText: {
          description: "fill text in a form field",
          parameters: z.object({
            app: z.string().describe("the application name"),
            role: z.string().describe("the role of the element to fill"),
            text: z.string().describe("the text to enter")
          }),
          execute: async ({ app, role, text }) => {
            try {
              const elements = await pipe.operator
                .getByRole(role, { app, activateApp: true })
                .all(5)
                
              if (elements.length > 0) {
                await pipe.operator
                  .getById(elements[0].id, { app })
                  .fill(text)
                return `filled text in ${role} element in ${app}`
              }
              return `no ${role} elements found in ${app}`
            } catch (error) {
              return `error: ${error.message}`
            }
          }
        }
      },
      toolCallStreaming: true,
      maxSteps: 5
    })
    
    for await (const chunk of result.textStream) {
      setOutput(prev => prev + chunk)
    }
  }
  
  return (
    <div>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          placeholder="e.g., 'open chrome and go to github'"
        />
        <button type="submit">run</button>
      </form>
      <div>{output}</div>
    </div>
  )
}

for a complete implementation with automatic tool selection, see the hello-world-computer-use example pipe.

troubleshooting

if you’re having issues with the operator api:

  1. macos permissions: ensure screenpipe has accessibility permissions in system settings > privacy & security > accessibility
  2. app names: use exact app names as they appear in the applications folder
  3. timing issues: add delays between operations, as ui elements may take time to load
  4. debugging: log element ids and roles to help identify the right elements
  5. app focus: use the activateApp: true option to ensure the target app is in focus

for more detailed debugging, use the macos accessibility inspector to identify exact roles and properties of ui elements.
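
for the timing issues mentioned above, polling is often more reliable than a fixed delay. the sketch below is an illustration, not part of the sdk: waitForElements is a hypothetical helper, and the default timeout and interval are arbitrary values you should tune.

```typescript
// hypothetical helper: call `find` repeatedly until it returns at least one
// item or the timeout elapses. resolves to an empty array on timeout.
// errors thrown by `find` are not caught and propagate to the caller.
async function waitForElements<T>(
  find: () => Promise<T[]>,
  timeoutMs = 5000,
  intervalMs = 250,
): Promise<T[]> {
  const deadline = Date.now() + timeoutMs
  while (true) {
    const found = await find()
    if (found.length > 0) return found
    if (Date.now() >= deadline) return []
    // pause before the next attempt
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs))
  }
}
```

a typical call would wrap a role query, for example: await waitForElements(() => pipe.operator.getByRole("textfield", { app: "Chrome" }).all(3)). logging the ids of the returned elements at this point also helps with the debugging step above.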

practical use cases

here are some real-world applications for the operator api:

  • messaging automation

    • scrape whatsapp conversations and export them to spreadsheets
    • auto-respond to common imessage inquiries when you’re busy
    • batch message linkedin connections with personalized outreach
    • track response rates across different messaging platforms
  • social media management

    • schedule and post content across multiple platforms
    • collect engagement metrics from twitter, instagram, or linkedin
    • automate following/unfollowing based on specific criteria
    • export comments and replies for sentiment analysis
  • data collection and research

    • extract data from websites that don’t have accessible apis
    • compile information across multiple applications into a single report
    • monitor prices or availability of products across different sites
    • build comprehensive research databases from scattered sources
  • personal productivity

    • automate repetitive daily tasks (checking emails, organizing files)
    • create custom workflows between applications that don’t normally integrate
    • set up intelligent reminders based on content of messages or emails
    • auto-fill forms with personal or business information
  • customer relationship management

    • track conversations across multiple platforms for each contact
    • automatically update crm systems with new interaction data
    • generate follow-up reminders based on conversation content
    • build comprehensive customer profiles from scattered data sources
  • content creation and editing

    • automate screenshots or recordings of specific application states
    • batch process images or documents using desktop applications
    • extract text from images or pdfs for further processing
    • organize and tag media files based on their content

these automation ideas become even more powerful when combined with ai for intelligent decision-making based on the content being processed.
