architecture overview
screenpipe's architecture handles continuous screen and audio capture, local data storage, and real-time processing. here's a breakdown of the key components:
diagram overview
- input: screen and audio data
- processing: ocr, stt, transcription, multimodal integration
- storage: sqlite database
- plugins: custom pipes
- integrations: ollama, deepgram, notion, whatsapp, etc.
this modular architecture makes screenpipe adaptable to various use cases, from personal productivity tracking to advanced business intelligence.
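as a concrete illustration of the input → processing → storage flow, captured data ends up queryable through screenpipe's local API. a minimal TypeScript sketch, assuming the default `localhost:3030` `/search` endpoint and its `q`/`content_type`/`limit` query parameters (check your version's API reference, these may drift):

```typescript
// sketch: building a query against screenpipe's local search API.
// assumption: the default local API at http://localhost:3030 with a
// /search endpoint taking q, content_type, and limit parameters.
export interface SearchParams {
  q: string;
  contentType?: "ocr" | "audio" | "all";
  limit?: number;
}

export function buildSearchUrl(
  p: SearchParams,
  base = "http://localhost:3030",
): string {
  const qs = new URLSearchParams({ q: p.q });
  if (p.contentType) qs.set("content_type", p.contentType);
  if (p.limit !== undefined) qs.set("limit", String(p.limit));
  return `${base}/search?${qs.toString()}`;
}

// usage (requires screenpipe running locally):
// const res = await fetch(buildSearchUrl({ q: "standup", contentType: "ocr", limit: 5 }));
// const body = await res.json();
```

keeping the URL construction in a pure function makes it easy to point the same query at a remote-capture instance by swapping `base`.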
data abstraction layers
screenpipe organizes data in concentric layers of abstraction, from raw data to high-level intelligence:
- core (mp4 files): the innermost layer contains the raw screen recordings and audio captures in mp4 format
- processing layer: contains the direct processing outputs
  - OCR embeddings: vectorized text extracted from screen
  - human id: anonymized user identification
  - accessibility: metadata for improved data access
  - transcripts: processed audio-to-text
- AI memories: the outermost layer represents the highest level of abstraction where AI processes and synthesizes all lower-level data into meaningful insights
- pipes: custom processing modules that can interact with and enrich data at any layer
this layered approach enables both granular access to raw data and sophisticated AI-powered insights while maintaining data privacy and efficiency.
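to make the layering concrete, the layers can be modeled as a discriminated union — the type names and fields below are illustrative only, not screenpipe's actual schema:

```typescript
// sketch: the data abstraction layers as a TypeScript discriminated union.
// assumption: these names/fields are hypothetical, for illustration only.
type LayeredRecord =
  | { layer: "core"; mp4Path: string } // raw screen/audio capture
  | { layer: "processing"; kind: "ocr" | "transcript" | "accessibility"; text: string }
  | { layer: "memory"; insight: string; sources: string[] }; // AI memories

// pick the most abstract representation available for a moment in time,
// falling back to lower layers when higher ones don't exist yet
function mostAbstract(records: LayeredRecord[]): LayeredRecord | undefined {
  const rank = { core: 0, processing: 1, memory: 2 } as const;
  return records.reduce<LayeredRecord | undefined>(
    (best, r) => (!best || rank[r.layer] > rank[best.layer] ? r : best),
    undefined,
  );
}
```

the fallback behavior mirrors the design goal above: consumers get AI memories when available, but can always drop down to transcripts or raw mp4.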
status
Alpha: runs 24/7 on my MacBook Pro M3 (32 GB RAM)
and on a $400 Windows laptop.
Uses about 600 MB of RAM and 10% CPU.
- Integrations
- ollama
- openai
- Friend wearable
- Fileorganizer2000
- mem0
- Brilliant Frames
- Vercel AI SDK
- supermemory
- deepgram
- unstructured
- excalidraw
- Obsidian
- Apple shortcut
- multion
- iPhone
- Android
- Camera
- Keyboard
- Browser
- Pipe Store: a list of "pipes" you can build, share & easily install to get more value out of your screen & mic data without effort. Pipes run in a Bun TypeScript engine within screenpipe on your computer
- screenshots + OCR with different engines to optimise privacy, quality, or energy consumption
  - tesseract
  - Windows native OCR
  - Apple native OCR
  - unstructured.io
- screenpipe screen/audio specialised LLM
- audio + STT (works with multiple input devices, like your iPhone + Mac mic; many STT engines)
  - Linux, MacOS, Windows input & output devices
  - iPhone microphone
- remote capture (run screenpipe in your cloud and capture from your local machine; only tested on Linux), for example when you have a low-compute laptop
- optimised screen & audio recording (mp4 encoding, roughly 30 GB per month with default settings)
- sqlite local db
- local api
- Cross platform CLI, desktop app (MacOS, Windows, Linux)
- Metal, CUDA
- TS SDK
- multimodal embeddings
- cloud storage options (s3, pgsql, etc.)
- cloud computing options (deepgram for audio, unstructured for OCR)
- customizable capture & storage settings (fps, resolution)
- security
- window specific capture (e.g. capture only a specific tab of Cursor, Chrome, or Obsidian, or only a specific app)
- encryption
- PII removal
- fast, optimised, energy-efficient modes
- webhooks/events (for automations)
- abstractions for multiplayer usage (e.g. aggregate sales team data, company team data, partner, etc.)
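for instance, the webhook/event hooks above could drive a simple pipe-style automation: scan OCR results for a keyword and notify an endpoint. a TypeScript sketch — the OCR result shape and webhook payload are assumptions, not screenpipe's actual event format:

```typescript
// sketch: a minimal pipe-style automation on top of screenpipe data.
// assumptions: OcrResult fields and the webhook payload shape are
// hypothetical; adapt them to the real local API response.
export interface OcrResult {
  text: string;
  appName: string;
  timestamp: string;
}

// pure filter: find OCR results mentioning a keyword (case-insensitive)
export function matchKeyword(results: OcrResult[], keyword: string): OcrResult[] {
  const needle = keyword.toLowerCase();
  return results.filter((r) => r.text.toLowerCase().includes(needle));
}

// side effect: POST matches to a webhook for downstream automation
export async function notify(webhookUrl: string, hits: OcrResult[]): Promise<void> {
  if (hits.length === 0) return;
  await fetch(webhookUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ hits }),
  });
}
```

splitting the pure filter from the network call keeps the matching logic testable without screenpipe or a webhook receiver running.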