architecture overview
screenpipe's architecture handles continuous screen and audio capture, local data storage, and real-time processing. here's a breakdown of the key components:
conceptual overview
at its core, screenpipe acts as a bridge between your digital activities and AI systems, creating a memory layer that provides context for intelligent applications. here's how to think about it:
capturing layer
- screen recording: captures visual content at configurable frame rates
- audio recording: captures spoken content from multiple sources
- ui monitoring: (experimental) captures accessibility metadata about UI elements
processing layer
- ocr engines: extract text from screen recordings (apple native, windows native, tesseract, unstructured)
- stt engines: convert audio to text (whisper, deepgram)
- speaker identification: identifies and labels different speakers
- pii removal: optionally redacts sensitive information
storage layer
- sqlite database: stores metadata, text, and references to media
- media files: stores the actual mp4/mp3 recordings
- embeddings: (coming soon) vector representations for semantic search
retrieval layer
- search api: filtered content retrieval for applications
- streaming apis: real-time access to new content
- memory apis: structured access to historical context
extension layer (pipes)
- pipes ecosystem: extensible plugins for building applications
- pipe sdk: typescript interface for building custom pipes (see the sketch after this list)
- pipe runtime: sandboxed execution environment for pipes
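to make the extension layer concrete, here's a minimal pipe sketch using the typescript SDK. the `@screenpipe/js` import path, the `startTime` field, and the response shape are assumptions based on the query examples later in this doc; check the SDK reference for your version before relying on them.

```typescript
// hypothetical minimal pipe: collect the last hour of screen text.
// import path, startTime field, and response shape are assumptions.
import { pipe } from "@screenpipe/js";

async function main() {
  // query recent OCR content through the retrieval layer
  const results = await pipe.queryScreenpipe({
    q: "",                                                   // empty query = everything in the window
    contentType: "ocr",                                      // screen text only, no audio transcripts
    startTime: new Date(Date.now() - 3600_000).toISOString(),
    limit: 50,
  });

  // a real pipe would summarize this with an LLM or push it to an integration
  const items = results?.data ?? [];
  console.log(`collected ${items.length} screen-text items`);
}

main().catch(console.error);
```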
diagram overview
- input: screen and audio data
- processing: ocr, stt, transcription, multimodal integration
- storage: sqlite database
- plugins: custom pipes
- integrations: ollama, deepgram, notion, whatsapp, etc.
this modular architecture makes screenpipe adaptable to various use cases, from personal productivity tracking to advanced business intelligence.
data flow & lifecycle
here's the typical data flow through the screenpipe system:
1. capture
   - screen is captured at the configured fps (default 1.0, or 0.5 on macos)
   - audio is captured in chunks (default 30 seconds)
   - ui events are optionally captured (macos only currently)
2. processing
   - captured frames are processed through OCR to extract text
   - audio chunks are processed through STT to generate transcriptions
   - speaker identification is applied to audio transcriptions
3. storage
   - processed data is stored in the local sqlite database
   - raw media files are stored in the configured data directory
   - metadata is indexed for efficient retrieval
4. retrieval
   - applications query the database through the REST API (see the sketch after this list)
   - real-time data can be streamed through SSE endpoints
   - pipes can access data through the typescript SDK
5. extension
   - pipes process the data to create higher-level abstractions
   - pipes can integrate with external services (LLMs, etc.)
   - pipes can control the system through the input API
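as a sketch of the retrieval step, the snippet below hits the local REST API directly. the port (3030) and the exact `/search` parameter names are assumptions here, chosen to mirror the SDK fields shown later in this doc; verify them against your running server.

```typescript
// hedged sketch: query screenpipe's local search API for recent OCR text.
// port 3030 and the parameter names are assumptions -- adjust to your setup.
const BASE_URL = "http://localhost:3030";

async function searchRecent(query: string) {
  const params = new URLSearchParams({
    q: query,
    content_type: "ocr",   // assumed values: "ocr" | "audio" | "all"
    limit: "5",
  });
  const res = await fetch(`${BASE_URL}/search?${params}`);
  if (!res.ok) throw new Error(`search failed: ${res.status}`);
  return res.json();
}

searchRecent("meeting notes").then((results) => console.log(results));
```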
data abstraction layers
screenpipe organizes data in concentric layers of abstraction, from raw data to high-level intelligence:
- core (mp4 files): the innermost layer contains the raw screen recordings and audio captures in mp4 format
- processing layer: contains the direct processing outputs
  - OCR embeddings: vectorized text extracted from screen
  - human id: anonymized user identification
  - accessibility: metadata for improved data access
  - transcripts: processed audio-to-text
- AI memories: the outermost layer represents the highest level of abstraction where AI processes and synthesizes all lower-level data into meaningful insights
- pipes enrich: custom processing modules that can interact with and enhance data at any layer
this layered approach enables both granular access to raw data and sophisticated AI-powered insights while maintaining data privacy and efficiency.
session and state management
screenpipe maintains several types of state:
1. session state
   - managed by the core screenpipe server
   - controls recording status, device selection, etc.
   - accessible through the health API endpoint
2. configuration state
   - stored in the settings database
   - controls behavior of the core system
   - accessible through the settings API
3. pipe state
   - each pipe maintains its own state
   - stored in the pipe's local storage or in screenpipe's settings
   - isolated from other pipes for security
understanding the different state models is important for building robust applications. the health API (`/health`) is particularly useful for checking the system's current state and ensuring services are running correctly.
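as a quick illustration, you can poll that endpoint from a script; the port (3030) and the response fields are assumptions here, so inspect the raw JSON on your install before depending on specific keys:

```typescript
// hedged sketch: check whether screenpipe is up and recording via /health.
// port and response shape are assumptions -- log the raw body to see what
// your version actually returns.
async function checkHealth() {
  const res = await fetch("http://localhost:3030/health");
  if (!res.ok) {
    console.error(`screenpipe not reachable: ${res.status}`);
    return;
  }
  const health = await res.json();
  console.log("health:", health);
}

checkHealth();
```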
database schema
screenpipe uses a SQLite database with the following main tables:
- frames: stores metadata about captured screen frames
- ocr_results: stores text extracted from frames
- audio_chunks: stores metadata about audio recordings
- transcriptions: stores text transcribed from audio
- speakers: stores identified speakers and their metadata
- ui_elements: stores UI elements captured from the screen
- settings: stores application configuration
- pipes: stores installed pipes and their configuration
detailed schema information is available by querying the database directly:

```bash
sqlite3 ~/.screenpipe/db.sqlite .schema
```
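if you'd rather explore the schema from code, here's a rough sketch using the `better-sqlite3` package (an assumed third-party dependency, not something screenpipe ships); it only reads sqlite's own metadata, so it works regardless of the exact column layout:

```typescript
// exploratory sketch: open the local db read-only and list tables + columns.
// uses the default ~/.screenpipe/db.sqlite path mentioned above.
import Database from "better-sqlite3";
import { homedir } from "os";
import { join } from "path";

const db = new Database(join(homedir(), ".screenpipe", "db.sqlite"), { readonly: true });

// list all tables, then print each table's columns
const tables = db
  .prepare("select name from sqlite_master where type = 'table' order by name")
  .all() as { name: string }[];

for (const { name } of tables) {
  const columns = db.pragma(`table_info(${name})`) as { name: string }[];
  console.log(name, "->", columns.map((c) => c.name).join(", "));
}

db.close();
```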
integration patterns
developers typically interact with screenpipe in one of these patterns:
1. retrieval pattern: query for relevant context based on the current task

   ```typescript
   const context = await pipe.queryScreenpipe({ q: "meeting notes", contentType: "all", limit: 10 })
   ```

2. streaming pattern: process events as they occur

   ```typescript
   for await (const event of pipe.streamVision()) {
     // Process each new screen event
   }
   ```

3. augmentation pattern: enhance user experience with context

   ```typescript
   // When user asks about a recent meeting
   const meetingContext = await pipe.queryScreenpipe({ q: "meeting", contentType: "audio" })

   // Use context to generate response
   const response = await generateResponse(userQuery, meetingContext)
   ```

4. automation pattern: take actions based on context

   ```typescript
   // Monitor for specific content
   for await (const event of pipe.streamVision()) {
     if (event.data.text.includes("meeting starting")) {
       // Take action like sending notification
     }
   }
   ```
understanding these patterns will help you design effective applications that leverage screenpipe's capabilities.
status
alpha: runs 24/7 on my MacBook Pro M3 (32 GB RAM) and on a $400 Windows laptop. uses about 600 MB of memory and 10% CPU.
- Integrations
- ollama
- openai
- Friend wearable
- Fileorganizer2000
- mem0
- Brilliant Frames
- Vercel AI SDK
- supermemory
- deepgram
- unstructured
- excalidraw
- Obsidian
- Apple shortcut
- multion
- iPhone
- Android
- Camera
- Keyboard
- Browser
- Pipe Store: a list of "pipes" you can build, share & easily install to get more value out of your screen & mic data without effort; pipes run in a Bun TypeScript runtime within screenpipe on your computer
- screenshots + OCR with different engines to optimise privacy, quality, or energy consumption
- tesseract
- Windows native OCR
- Apple native OCR
- unstructured.io
- screenpipe screen/audio specialised LLM
- audio + STT (works with multiple input devices, like your iPhone + mac mic; many STT engines)
- Linux, MacOS, Windows input & output devices
- iPhone microphone
- remote capture (run screenpipe in your cloud while it captures your local machine; only tested on Linux), useful for example when you have a low-compute laptop
- optimised screen & audio recording (mp4 encoding, estimated ~30 GB/month with default settings)
- sqlite local db
- local api
- Cross platform CLI, desktop app (MacOS, Windows, Linux)
- Metal, CUDA
- TS SDK
- multimodal embeddings
- cloud storage options (s3, pgsql, etc.)
- cloud computing options (deepgram for audio, unstructured for OCR)
- custom storage settings: customizable capture settings (fps, resolution)
- security
- window specific capture (e.g. you can decide to capture only a specific tab of cursor, chrome, or obsidian, or only a specific app)
- encryption
- PII removal
- fast, optimised, energy-efficient modes
- webhooks/events (for automations)
- abstractions for multiplayer usage (e.g. aggregate sales team data, company team data, partner, etc.)
LLM links
paste these links into your Cursor chat for context:
- https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-server/src/server.rs
- https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-server/src/core.rs
- https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/src/core.rs
- https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-vision/src/core.rs