conceptual overview
at its core, screenpipe acts as a bridge between your digital activities and ai systems, creating a memory layer that provides context for intelligent applications. here’s how to think about it:
capture layer
- screen recording: captures visual content at configurable frame rates
- audio recording: captures spoken content from multiple sources
- ui events (accessibility): captures keyboard input, mouse clicks, app switches, and clipboard events via accessibility apis (macos)
processing layer
- ocr engines: extract text from screen recordings (apple native, windows native, tesseract, unstructured)
- stt engines: convert audio to text (whisper, deepgram)
- speaker identification: identifies and labels different speakers
- pii removal: optionally redacts sensitive information
storage layer
- sqlite database: stores metadata, text, and references to media
- media files: stores the actual mp4/mp3 recordings
- embeddings: (coming soon) vector representations for semantic search
retrieval layer
- search api: filtered content retrieval for applications
- streaming apis: real-time access to new content
- memory apis: structured access to historical context
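for example, a retrieval call against the local search api might look like the sketch below. it assumes the default local server on port 3030 and a /search endpoint with q / content_type / limit params; adjust to your setup:

```typescript
// minimal sketch: query recent ocr text from the local screenpipe server.
// port, endpoint path, and query params are assumptions -- check the api
// reference for your installed version.
const params = new URLSearchParams({
  q: "standup notes",
  content_type: "ocr",
  limit: "10",
});

const res = await fetch(`http://localhost:3030/search?${params}`);
if (!res.ok) throw new Error(`search failed: ${res.status}`);

const { data } = await res.json();
for (const item of data ?? []) {
  console.log(item);
}
```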
extension layer (pipes)
- pipes ecosystem: extensible plugins for building applications
- pipe sdk: typescript interface for building custom pipes
- pipe runtime: sandboxed execution environment for pipes
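a minimal pipe using the typescript sdk might look like this sketch. the package name and queryScreenpipe options are assumptions based on the @screenpipe/js sdk; check the sdk reference for the exact api:

```typescript
// sketch of a tiny pipe: pull the last five minutes of ocr results and log them.
// the method name, option fields, and result shape are illustrative, not definitive.
import { pipe } from "@screenpipe/js";

const results = await pipe.queryScreenpipe({
  contentType: "ocr",
  startTime: new Date(Date.now() - 5 * 60 * 1000).toISOString(),
  limit: 20,
});

for (const item of results?.data ?? []) {
  // for ocr items, content typically holds the extracted text plus frame metadata
  console.log(item.content);
}
```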
diagram overview
- input: screen and audio data
- processing: ocr, stt, transcription, multimodal integration
- storage: sqlite database
- plugins: custom pipes
- integrations: ollama, deepgram, notion, whatsapp, etc.
data flow & lifecycle
here’s the typical data flow through the screenpipe system:
- capture
- screen is captured at the configured fps (default 1.0, or 0.5 on macos)
- audio is captured in chunks (default 30 seconds)
- ui events (keyboard, mouse, app switches, clipboard) are captured via accessibility apis (macos; enable in settings)
- processing
- captured frames are processed through ocr to extract text
- audio chunks are processed through stt to generate transcriptions
- speaker identification is applied to audio transcriptions
- storage
- processed data is stored in the local sqlite database
- raw media files are stored in the configured data directory
- metadata is indexed for efficient retrieval
- retrieval
- applications query the database through the rest api
- real-time data can be streamed through sse endpoints
- pipes can access data through the typescript sdk
- extension
- pipes process the data to create higher-level abstractions
- pipes can integrate with external services (llms, etc.)
- pipes can control the system through the input api
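to make the retrieval and streaming steps above concrete, the sketch below subscribes to a server-sent-events stream of newly indexed content. the endpoint path /sse/vision is an assumption; substitute whatever streaming endpoint your version exposes:

```typescript
// sketch: consume a server-sent-events stream of newly indexed content.
// the endpoint path below is an assumption -- substitute the streaming
// endpoint your screenpipe version actually exposes.
const res = await fetch("http://localhost:3030/sse/vision");
if (!res.ok || !res.body) throw new Error(`stream failed: ${res.status}`);

const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
let buffer = "";

while (true) {
  const chunk = await reader.read();
  if (chunk.done) break;
  buffer += chunk.value;

  // sse events are separated by a blank line; payload lines start with "data:"
  const events = buffer.split("\n\n");
  buffer = events.pop() ?? "";
  for (const evt of events) {
    const data = evt
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trim())
      .join("\n");
    if (data) console.log("new event:", data);
  }
}
```

polling the search api on an interval is a simpler alternative when you don't need near-real-time latency.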
data abstraction layers
- core (mp4 files): the innermost layer contains the raw screen recordings and audio captures in mp4 format
- processing layer: contains the direct processing outputs
- ocr embeddings: vectorized text extracted from screen
- human id: anonymized user identification
- accessibility: metadata for improved data access
- transcripts: processed audio-to-text
- ai memories: the outermost layer represents the highest level of abstraction where ai processes and synthesizes all lower-level data into meaningful insights
- pipes enrich: custom processing modules that can interact with and enhance data at any layer
session and state management
screenpipe maintains several types of state:
- session state
- managed by the core screenpipe server
- controls recording status, device selection, etc.
- accessible through the health api endpoint
- configuration state
- stored in the settings database
- controls behavior of the core system
- accessible through the settings api
- pipe state
- each pipe maintains its own state
- stored in the pipe’s local storage or in screenpipe’s settings
- isolated from other pipes for security
the health endpoint (/health) is particularly useful for checking the system’s current state and ensuring services are running correctly.
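a quick health check from typescript might look like this sketch (the default port and response field names are assumptions; inspect the actual response for your version):

```typescript
// sketch: check that the local screenpipe server is up and recording.
// field names below are assumptions -- log the raw body to see the real shape.
const res = await fetch("http://localhost:3030/health");
if (!res.ok) throw new Error(`health check failed: ${res.status}`);

const health = await res.json();
console.log("status:", health.status);
console.log("last frame:", health.last_frame_timestamp ?? "unknown");
console.log("last audio:", health.last_audio_timestamp ?? "unknown");
```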
database schema
screenpipe uses a sqlite database with the following main tables:
- frames: stores metadata about captured screen frames
- ocr_results: stores text extracted from frames
- audio_chunks: stores metadata about audio recordings
- transcriptions: stores text transcribed from audio
- speakers: stores identified speakers and their metadata
- ui_elements: stores ui elements captured from the screen
- settings: stores application configuration
- pipes: stores installed pipes and their configuration
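since pipes run in bun, you can also read the database directly for ad-hoc analysis. the sketch below joins frames with ocr_results; the database path and column names are assumptions, so inspect your local schema first:

```typescript
// sketch: read recent ocr text straight from the local sqlite database.
// the path and column names are assumptions -- open the db with the sqlite3
// cli and run ".schema" to confirm before relying on them.
import { Database } from "bun:sqlite";
import { homedir } from "node:os";
import { join } from "node:path";

const db = new Database(join(homedir(), ".screenpipe", "db.sqlite"), {
  readonly: true,
});

const rows = db
  .query(
    `SELECT f.timestamp, o.text
     FROM ocr_results o
     JOIN frames f ON f.id = o.frame_id
     ORDER BY f.timestamp DESC
     LIMIT 10`
  )
  .all();

console.log(rows);
```

for anything user-facing, prefer the rest api or the typescript sdk over direct database access, since the schema may change between releases.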
integration patterns
developers typically interact with screenpipe in one of these patterns:
- retrieval pattern: query for relevant context based on the current task
- streaming pattern: process events as they occur
- augmentation pattern: enhance the user experience with context
- automation pattern: take actions based on context
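the retrieval and augmentation patterns are often combined: pull recent context from screenpipe, then hand it to a local llm. the sketch below uses ollama's generate endpoint; the model name, query params, and response field names are assumptions:

```typescript
// sketch: retrieval + augmentation -- fetch recent ocr context from
// screenpipe, then summarize it with a local ollama model.
const search = await fetch(
  "http://localhost:3030/search?content_type=ocr&limit=20"
);
const { data } = await search.json();

// the item shape is an assumption; adapt to what your server returns
const context = (data ?? [])
  .map((item: any) => item.content?.text ?? "")
  .join("\n");

// ollama's non-streaming generate endpoint on its default port
const llm = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2",
    prompt: `summarize what i was working on:\n${context}`,
    stream: false,
  }),
});

const { response } = await llm.json();
console.log(response);
```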
status
alpha: runs on my computer (macbook pro m3, 32 gb ram) and a $400 windows laptop, 24/7.
uses 600 mb, 10% cpu.
- integrations
- ollama
- openai
- friend wearable
- fileorganizer2000
- mem0
- brilliant frames
- vercel ai sdk
- supermemory
- deepgram
- unstructured
- excalidraw
- obsidian
- apple shortcut
- multion
- iphone
- android
- camera
- keyboard
- browser
- pipes (plugins you can build, share & install to extend screenpipe); they run in a bun typescript engine within screenpipe on your computer
- screenshots + ocr with different engines to optimise privacy, quality, or energy consumption
- tesseract
- windows native ocr
- apple native ocr
- unstructured.io
- screenpipe screen/audio specialised llm
- audio + stt (works with multiple input devices, like your iphone + mac mic, and many stt engines)
- linux, macos, windows input & output devices
- iphone microphone
- remote capture (run screenpipe in the cloud and capture your local machine; only tested on linux), e.g. when you have a low-compute laptop
- optimised screen & audio recording (mp4 encoding, estimating 30 gb/m with default settings)
- sqlite local db
- local api
- cross platform cli, desktop app (macos, windows, linux)
- metal, cuda
- ts sdk
- multimodal embeddings
- cloud storage options (s3, pgsql, etc.)
- cloud computing options (deepgram for audio, unstructured for ocr)
- customizable storage and capture settings (fps, resolution)
- security
- window-specific capture (e.g. only capture a specific window or tab in chrome, cursor, or obsidian, or only specific apps)
- encryption
- pii removal
- fast, optimised, energy-efficient modes
- webhooks/events (for automations)
- abstractions for multiplayer usage (e.g. aggregating data across a sales team, a company team, partners, etc.)