Chroma Sync
Serverless data ingestion for Chroma Cloud. Connect your S3 buckets, GitHub repositories, and websites. Chroma handles parsing, chunking, and embedding so you can start searching in minutes.
Built for scale and reliability
Whether you're syncing a handful of files or millions of documents, Sync runs the same pipeline: a queue-based system with retries, rate-limit awareness, and automatic error recovery.
The pipeline is designed to maximize throughput without dropping work. You can monitor every invocation in the dashboard or through the Sync API.
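To make the queue-and-retry idea concrete, here is a minimal sketch of a worker loop that re-queues failed tasks with exponential backoff and records exhausted tasks instead of dropping them. It is an illustration only, not Sync's implementation: `run_queue`, `RateLimited`, and every parameter name are hypothetical.

```python
import collections
import time


class RateLimited(Exception):
    """Hypothetical signal that the upstream source asked us to slow down."""


def run_queue(tasks, handler, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Process every task; failed tasks are re-queued until max_attempts is reached.

    Returns (succeeded, dead_letter) so that no task is silently dropped.
    """
    queue = collections.deque((task, 1) for task in tasks)
    succeeded, dead_letter = [], []
    while queue:
        task, attempt = queue.popleft()
        try:
            handler(task)
            succeeded.append(task)
        except Exception as exc:
            if attempt >= max_attempts:
                # Out of attempts: record the failure rather than losing the task.
                dead_letter.append((task, repr(exc)))
                continue
            # Exponential backoff; wait twice as long when explicitly rate limited.
            delay = base_delay * (2 ** (attempt - 1))
            if isinstance(exc, RateLimited):
                delay *= 2
            sleep(delay)
            queue.append((task, attempt + 1))
    return succeeded, dead_letter
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays. A production queue would typically also persist its state so that work survives a restart.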
+--------------------------------------+
|          YOUR DATA SOURCES           |
|                                      |
|  +--------+ +--------+ +--------+    |
|  |   S3   | | GitHub | |  Web   |    |
|  | bucket | | repos  | | pages  |    |
|  +---+----+ +---+----+ +---+----+    |
|      |          |          |         |
+------+----------+----------+---------+
       |          |          |
       v          v          v
+======================================+
||            CHROMA SYNC             ||
||                                    ||
||     PARSE --> CHUNK --> EMBED      ||
||                                    ||
||   * Retries & error recovery       ||
||   * Rate-limit awareness           ||
||   * Maximum throughput             ||
||                                    ||
+================+=====================+
                 |
                 v
+--------------------------------------+
|           CHROMA DATABASE            |
|                                      |
|     Ready for vector, full-text,     |
|   regex, sparse, and hybrid search   |
|                                      |
+--------------------------------------+
Three sources, one pipeline
Sync Features
S3 Sync
- Bucket-level connections
- PDFs, docs, images, and text
- Auto-sync for file updates
- Queue-based ingest at scale
GitHub Sync
- Public and private repositories
- Branch or commit targeting
- Diff-based incremental updates
- Syntax-aware code chunking
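One common way to get diff-based incremental updates is to compare content fingerprints between the last sync and the target commit, then re-process only what changed. The sketch below assumes that approach. The source does not say how Sync computes its diffs, and `fingerprint` and `plan_incremental_sync` are hypothetical names.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable content hash used to detect changed files."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental_sync(previous: dict, current_files: dict):
    """Compare stored fingerprints with current file contents.

    previous:      path -> fingerprint recorded at the last sync
    current_files: path -> raw bytes at the target branch or commit
    Returns (paths to re-ingest, paths to delete, new fingerprint state).
    """
    current = {path: fingerprint(data) for path, data in current_files.items()}
    # New or modified files need to be parsed, chunked, and embedded again.
    to_upsert = sorted(p for p, h in current.items() if previous.get(p) != h)
    # Files that disappeared should have their chunks removed.
    to_delete = sorted(p for p in previous if p not in current)
    return to_upsert, to_delete, current
```

Unchanged files are skipped entirely, which is what keeps repeat syncs cheap on large repositories.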
Web Sync
- JavaScript page rendering
- Recursive crawl from a seed URL
- Depth and path filters
- Structured markdown extraction
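A recursive crawl with depth and path filters can be sketched as a breadth-first traversal. This is a simplified illustration, not Sync's crawler: `crawl`, `get_links`, `max_depth`, and `path_prefix` are hypothetical names, and the page fetcher is passed in so the example runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed, get_links, max_depth=2, path_prefix="/"):
    """Breadth-first crawl from a seed URL, bounded by depth and a path prefix.

    get_links(url) -> iterable of absolute URLs found on that page
    (a hypothetical fetcher supplied by the caller).
    """
    seed_host = urlparse(seed).netloc
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth filter: do not follow links past the limit
        for link in get_links(url):
            parts = urlparse(link)
            if parts.netloc != seed_host:
                continue  # stay on the seed's host
            if not parts.path.startswith(path_prefix):
                continue  # path filter
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

For example, calling `crawl("https://example.com/docs/", fetch_links, max_depth=2, path_prefix="/docs/")` with a caller-supplied `fetch_links` would visit only pages under `/docs/` that are at most two links away from the seed.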
From data to search
1. Parse
PDFs, Office documents, images, ebooks, HTML, and code are converted to clean markdown, with tables, headings, and structure preserved.
2. Chunk
Code is chunked with Tree-sitter for syntax-aware splitting, and documents with structured markdown chunking, so chunks respect function boundaries and section breaks.
3. Embed
Dense and sparse embeddings are generated automatically with open models. No extra API keys needed.
4. Search
Your data is ready for semantic, sparse, hybrid, regex, and full-text search.
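As a rough illustration of the Chunk step for documents, the sketch below splits markdown at headings and breaks oversized sections on paragraph boundaries. It is an assumption about what structured markdown chunking can look like, not Sync's chunker: `chunk_markdown` and `max_chars` are hypothetical. For brevity it also treats any line starting with `#` as a heading, including one inside a fenced code block.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text, max_chars=500):
    """Split markdown into chunks at headings; long sections split on blank lines."""
    sections, title, lines = [], "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            if lines or title:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if lines or title:
        sections.append((title, "\n".join(lines).strip()))

    chunks = []
    for title, body in sections:
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        buffer = ""
        for para in paragraphs:
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                # Adding this paragraph would exceed the limit: emit and start over.
                chunks.append({"heading": title, "text": buffer})
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        if buffer or title:
            chunks.append({"heading": title, "text": buffer})
    return chunks
```

Keeping the heading alongside each chunk preserves context that would otherwise be lost when a section is split into several pieces.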
Usage-based pricing
$0.04 per GiB processed
+ $0.01 per document page extracted
+ $0.01 per webpage scraped
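To estimate a bill from these rates, the three charges can simply be added together. The sketch below assumes that partial GiB are prorated and that the total is rounded to the nearest cent. Neither detail is stated above, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Rates from the pricing list above.
PRICE_PER_GIB = Decimal("0.04")
PRICE_PER_DOCUMENT_PAGE = Decimal("0.01")
PRICE_PER_WEBPAGE = Decimal("0.01")
GIB = 1024 ** 3  # bytes in one GiB


def estimate_cost(bytes_processed, document_pages=0, webpages=0):
    """Estimate cost in USD (assumes prorated GiB and rounding to cents)."""
    gib = Decimal(bytes_processed) / Decimal(GIB)
    total = (
        gib * PRICE_PER_GIB
        + document_pages * PRICE_PER_DOCUMENT_PAGE
        + webpages * PRICE_PER_WEBPAGE
    )
    return total.quantize(Decimal("0.01"))
```

Under these assumptions, 10 GiB processed with 2,000 document pages extracted and 500 webpages scraped comes to 0.40 + 20.00 + 5.00 = $25.40.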