the operator api allows for powerful desktop automation using accessibility roles and elements. it provides a robust way to interact with applications programmatically.
compatibility note:
some functions of the operator api currently works on macos only. on windows/linux, please use the pixel api
windows in progress
instead.
to understand roles better, open the macos accessibility inspector and examine the roles for any application.
feel free to use our docs as context in cursor agent through MCP
installation
npm install @screenpipe/browser
npm install @screenpipe/browser
pnpm add @screenpipe/browser
bun add @screenpipe/browser
yarn add @screenpipe/browser
this also works in node.js using @screenpipe/js
basic usage
import { pipe } from '@screenpipe/browser'
async function simpleAutomation() {
try {
// open an application
await pipe.operator.openApplication("Chrome")
// navigate to a website
await pipe.operator.openUrl("https://github.com/mediar-ai/screenpipe")
// find search fields by role
const searchFields = await pipe.operator
.getByRole("searchfield", {
app: "Chrome",
activateApp: true
})
.all(3)
if (searchFields.length > 0) {
// fill the search field
await pipe.operator
.getById(searchFields[0].id, { app: "Chrome" })
.fill("automation")
// click a button
const buttons = await pipe.operator
.getByRole("button", { app: "Chrome" })
.all(5)
if (buttons.length > 0) {
await pipe.operator
.getById(buttons[0].id, { app: "Chrome" })
.click()
}
// scroll page content
await pipe.operator
.getById(searchFields[0].id, { app: "Chrome" })
.scroll("down", 300)
}
} catch (error) {
console.error("automation failed:", error)
}
}
core methods
the operator api provides a set of intuitive methods for automating desktop interactions:
method | description | compatibility | example |
---|
openApplication(name) | launches an application | macOS only | pipe.operator.openApplication("Chrome") |
openUrl(url, browser?) | opens a url in a browser | macOS only | pipe.operator.openUrl("github.com") |
getByRole(role, options) | finds elements by accessibility role | macOS only | pipe.operator.getByRole("button", {app: "Chrome"}) |
getById(id, options) | gets element by id | macOS only | pipe.operator.getById("element-123", {app: "Chrome"}) |
.click() | clicks an element | macOS only | pipe.operator.getById(id).click() |
.fill(text) | enters text in a field | macOS only | pipe.operator.getById(id).fill("hello") |
.scroll(direction, amount) | scrolls an element | macOS only | pipe.operator.getById(id).scroll("down", 300) |
pixel.type(text) | types text | all platforms | pipe.operator.pixel.type("hello world") |
pixel.press(key) | presses a keyboard key | all platforms | pipe.operator.pixel.press("enter") |
pixel.moveMouse(x, y) | moves mouse cursor to position | all platforms | pipe.operator.pixel.moveMouse(100, 200) |
pixel.click(button) | clicks mouse button | all platforms | pipe.operator.pixel.click("left") |
pixel api
pixel API is a higher level API that is useful for:
- controlling your iPhone through iPhone mirroring (because you cannot parse the screen of your iPhone)
- Windows and Linux which does not support yet the functions like
openApplication
, getByRole
, click
, etc.
// type text
await pipe.operator.pixel.type("hello world")
// press key
await pipe.operator.pixel.press("enter")
// move mouse
await pipe.operator.pixel.moveMouse(100, 200)
// click
await pipe.operator.pixel.click("left") // "left" | "right" | "middle"
common accessibility roles
to understand better roles, feel free to open MacOS Accessibility Inspector and see the roles for any application.
when using getByRole()
, you’ll need to specify the accessibility role. here are common ones:
"button"
- clickable buttons
"textfield"
- text input fields
"searchfield"
- search input fields
"checkbox"
- checkbox elements
"radiobutton"
- radio button elements
"combobox"
- dropdown menus
"link"
- hyperlinks
"image"
- images
"statictext"
- text labels
"scrollarea"
- scrollable containers
advanced usage examples
async function fillContactForm() {
// open the app and navigate to form
await pipe.operator.openApplication("Chrome")
await pipe.operator.openUrl("https://example.com/contact")
// wait for page to load (simple delay)
await new Promise(resolve => setTimeout(resolve, 2000))
// find form fields
const nameField = await pipe.operator
.getByRole("textfield", {
app: "Chrome",
activateApp: true
})
.first()
const emailField = await pipe.operator
.getByRole("textfield", {
app: "Chrome"
})
.first()
const messageField = await pipe.operator
.getByRole("textfield", {
app: "Chrome"
})
.first()
// fill the form
await pipe.operator
.getById(nameField.id, { app: "Chrome" })
.fill("john doe")
await pipe.operator
.getById(emailField.id, { app: "Chrome" })
.fill("john@example.com")
await pipe.operator
.getById(messageField.id, { app: "Chrome" })
.fill("this is an automated message from screenpipe!")
// find and click submit button
const submitButton = await pipe.operator
.getByRole("button", {
app: "Chrome"
})
.first()
await pipe.operator
.getById(submitButton.id, { app: "Chrome" })
.click()
}
automating app workflows
async function processImages() {
// open photoshop
await pipe.operator.openApplication("Adobe Photoshop")
await new Promise(resolve => setTimeout(resolve, 3000)) // wait for app to launch
// open file
await pipe.operator
.getByRole("menuitem", {
app: "Adobe Photoshop"
})
.first()
.then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
await pipe.operator
.getByRole("menuitem", {
app: "Adobe Photoshop"
})
.first()
.then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
// navigate file picker (simplified)
// in practice, you'd need more complex logic to navigate the file picker
// apply filter
await pipe.operator
.getByRole("menuitem", {
app: "Adobe Photoshop"
})
.first()
.then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
// and so on...
}
ai-powered automation
for more powerful automation, combine the operator api with vercel ai sdk to enable ai-driven desktop interactions:
import { useState } from "react"
import { streamText, convertToCoreMessages } from "ai"
import { createOpenAI } from "@ai-sdk/openai"
import { pipe } from "@screenpipe/browser"
import { z } from "zod"
export function AIAutomationAgent() {
const [input, setInput] = useState("")
const [output, setOutput] = useState("")
const handleSubmit = async (e) => {
e.preventDefault()
setOutput("")
const model = createOpenAI({
apiKey: process.env.OPENAI_API_KEY
})("gpt-4o")
const result = streamText({
model,
messages: convertToCoreMessages([{
role: "user",
content: input
}]),
system: "you are a desktop automation assistant. help users by performing actions on their computer.",
tools: {
openApplication: {
description: "open an application",
parameters: z.object({
appName: z.string().describe("the name of the application to open")
}),
execute: async ({ appName }) => {
const success = await pipe.operator.openApplication(appName)
return success
? `opened ${appName} successfully`
: `failed to open ${appName}`
}
},
openUrl: {
description: "open a url in a browser",
parameters: z.object({
url: z.string().describe("the url to open"),
browser: z.string().optional().describe("the browser to use")
}),
execute: async ({ url, browser }) => {
const success = await pipe.operator.openUrl(url, browser)
return success
? `opened ${url} in ${browser || "default browser"}`
: `failed to open ${url}`
}
},
findAndClick: {
description: "find an element by role and click it",
parameters: z.object({
app: z.string().describe("the application name"),
role: z.string().describe("the accessibility role of the element")
}),
execute: async ({ app, role }) => {
try {
const elements = await pipe.operator
.getByRole(role, { app, activateApp: true })
.all(5)
if (elements.length > 0) {
await pipe.operator
.getById(elements[0].id, { app })
.click()
return `clicked ${role} element in ${app}`
}
return `no ${role} elements found in ${app}`
} catch (error) {
return `error: ${error.message}`
}
}
},
fillText: {
description: "fill text in a form field",
parameters: z.object({
app: z.string().describe("the application name"),
role: z.string().describe("the role of the element to fill"),
text: z.string().describe("the text to enter")
}),
execute: async ({ app, role, text }) => {
try {
const elements = await pipe.operator
.getByRole(role, { app, activateApp: true })
.all(5)
if (elements.length > 0) {
await pipe.operator
.getById(elements[0].id, { app })
.fill(text)
return `filled text in ${role} element in ${app}`
}
return `no ${role} elements found in ${app}`
} catch (error) {
return `error: ${error.message}`
}
}
}
},
toolCallStreaming: true,
maxSteps: 5
})
for await (const chunk of result.textStream) {
setOutput(prev => prev + chunk)
}
}
return (
<div>
<form onSubmit={handleSubmit}>
<input
value={input}
onChange={e => setInput(e.target.value)}
placeholder="e.g., 'open chrome and go to github'"
/>
<button type="submit">run</button>
</form>
<div>{output}</div>
</div>
)
}
for a complete implementation with automatic tool selection, see the hello-world-computer-use example pipe.
troubleshooting
if you’re having issues with the operator api:
- macos permissions: ensure screenpipe has accessibility permissions in system settings > privacy & security > accessibility
- app names: use exact app names as they appear in the applications folder
- timing issues: add delays between operations, as ui elements may take time to load
- debugging: log element ids and roles to help identify the right elements
- app focus: use the
activateApp: true
option to ensure the target app is in focus
for more detailed debugging, use the macos accessibility inspector to identify exact roles and properties of ui elements.
practical use cases
here are some real-world applications for the operator api:
-
messaging automation
- scrape whatsapp conversations and export them to spreadsheets
- auto-respond to common imessage inquiries when you’re busy
- batch message linkedin connections with personalized outreach
- track response rates across different messaging platforms
-
social media management
- schedule and post content across multiple platforms
- collect engagement metrics from twitter, instagram, or linkedin
- automate following/unfollowing based on specific criteria
- export comments and replies for sentiment analysis
-
data collection and research
- extract data from websites that don’t have accessible apis
- compile information across multiple applications into a single report
- monitor prices or availability of products across different sites
- build comprehensive research databases from scattered sources
-
personal productivity
- automate repetitive daily tasks (checking emails, organizing files)
- create custom workflows between applications that don’t normally integrate
- set up intelligent reminders based on content of messages or emails
- auto-fill forms with personal or business information
-
customer relationship management
- track conversations across multiple platforms for each contact
- automatically update crm systems with new interaction data
- generate follow-up reminders based on conversation content
- build comprehensive customer profiles from scattered data sources
-
content creation and editing
- automate screenshots or recordings of specific application states
- batch process images or documents using desktop applications
- extract text from images or pdfs for further processing
- organize and tag media files based on their content
these automation ideas become even more powerful when combined with ai for intelligent decision-making based on the content being processed.
examples: