Chroma Sync

Serverless data ingestion for Chroma Cloud. Connect your S3 buckets, GitHub repositories, and websites. Chroma handles parsing, chunking, and embedding so you can start searching in minutes.

Built for scale and reliability

Whether you're syncing a handful of files or millions of documents, Sync runs the same pipeline: a queue-based system with retries, rate-limit awareness, and automatic error recovery.

Sync is designed to maximize throughput without dropping work. You can monitor every invocation in the dashboard or through the Sync API.
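To make "queue-based with retries" concrete, here is a minimal sketch of the general pattern: a work queue where failed jobs are re-enqueued with exponential backoff, and rate-limit errors wait longer. This is an illustration only, not Chroma's implementation. The names `run_queue` and `RateLimited`, and the attempt and delay defaults, are all hypothetical.

```python
import random
import time
from collections import deque


class RateLimited(Exception):
    """Hypothetical error: an upstream source asked us to slow down."""


def run_queue(jobs, handler, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Process jobs from a queue, retrying failures with exponential backoff.

    Returns (succeeded, failed). No job is silently dropped: each one either
    succeeds or lands in `failed` after `max_attempts` tries.
    """
    queue = deque((job, 1) for job in jobs)
    succeeded, failed = [], []
    while queue:
        job, attempt = queue.popleft()
        try:
            handler(job)
            succeeded.append(job)
        except Exception as exc:
            if attempt >= max_attempts:
                failed.append(job)  # give up, but keep a record of the job
                continue
            # Delay doubles on every attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            if isinstance(exc, RateLimited):
                delay *= 2  # back off harder when the source is throttling
            # A little jitter keeps retries from firing in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
            queue.append((job, attempt + 1))
    return succeeded, failed
```

Passing `sleep` as a parameter keeps the sketch easy to test without waiting. A real system would schedule the retry rather than block the worker while it sleeps.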


 +--------------------------------------+
 |        YOUR DATA SOURCES             |
 |                                      |
 |  +--------+ +--------+ +--------+    |
 |  | S3     | | GitHub | | Web    |    |
 |  | bucket | | repos  | | pages  |    |
 |  +---+----+ +---+----+ +---+----+    |
 |      |          |          |         |
 +------+----------+----------+---------+
        |          |          |
        v          v          v
 +======================================+
 ||         CHROMA SYNC                ||
 ||                                    ||
 ||  PARSE --> CHUNK --> EMBED         ||
 ||                                    ||
 ||  * Retries & error recovery        ||
 ||  * Rate-limit awareness            ||
 ||  * Maximum throughput              ||
 ||                                    ||
 +================+=====================+
                  |
                  v
 +--------------------------------------+
 |        CHROMA DATABASE               |
 |                                      |
 |  Ready for vector, full-text,        |
 |  regex, sparse, and hybrid search    |
 |                                      |
 +--------------------------------------+

Three sources, one pipeline

Sync features

S3 sync

Bucket-level connections
PDFs, docs, images, and text
Auto-sync for file updates
Queue-based ingest at scale

GitHub sync

Public and private repositories
Branch or commit targeting
Diff-based incremental updates
Syntax-aware code chunking

Web sync

JavaScript page rendering
Recursive crawl from a seed URL
Depth and path filters
Structured markdown extraction
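As an illustration of what depth and path filters mean for a recursive crawl, the sketch below decides whether a discovered link is in scope. It is a guess at the general idea, not the Sync configuration format. The function `should_crawl`, its parameters, and the same-host rule are assumptions made for the example.

```python
from urllib.parse import urlparse


def should_crawl(url, seed, depth, max_depth=2, include_prefixes=("/",)):
    """Return True if a discovered link is in scope for the crawl.

    `depth` is the number of links followed from the seed URL to reach `url`.
    """
    target, origin = urlparse(url), urlparse(seed)
    if target.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links
    if target.netloc != origin.netloc:
        return False  # assumption: stay on the seed's host
    if depth > max_depth:
        return False  # depth filter
    path = target.path or "/"
    # Path filter: keep only URLs under one of the allowed prefixes.
    return any(path.startswith(prefix) for prefix in include_prefixes)
```

For example, with a seed of `https://example.com/docs` and `include_prefixes=("/docs",)`, a link to `https://example.com/docs/intro` at depth 1 is kept, while `https://example.com/blog/post` is skipped.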

From data to search

Parse

PDFs, Office documents, images, ebooks, HTML, and code are converted to clean markdown with tables, headings, and structure preserved.

Chunk

Code is chunked with Tree-sitter, so chunks respect function boundaries. Documents are chunked along their markdown structure, so chunks respect sections.
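The document side of this step can be sketched in a few lines: split markdown at headings so no chunk crosses a section boundary, then split oversized sections on paragraph breaks. This is a simplified illustration, not Chroma's chunker. The function name `chunk_markdown` and the `max_chars` limit are hypothetical, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text, max_chars=1000):
    """Split markdown into chunks that never span a heading boundary.

    Each chunk is a dict holding the heading it falls under and its text.
    Sections longer than `max_chars` are split further on blank lines.
    """
    sections, heading, lines = [], "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if lines:  # close the section that was open so far
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((heading, "\n".join(lines).strip()))

    chunks = []
    for heading, body in sections:
        if not body:
            continue  # heading with no text under it
        current = ""
        for para in re.split(r"\n\s*\n", body):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"heading": heading, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        chunks.append({"heading": heading, "text": current})
    return chunks
```

Keeping the heading alongside each chunk preserves some context for retrieval. A single paragraph longer than `max_chars` is left whole here rather than cut mid-sentence.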

Embed

Dense and sparse embeddings generated automatically with open models. No extra API keys needed.

Search

Ready for semantic, sparse, hybrid, regex, and full-text search across your data.

Usage-based pricing

$0.04 per GiB processed

+ $0.01 per document page extracted

+ $0.01 per webpage scraped
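For a rough budget, the three rates can be combined into a quick estimate. The sketch below assumes the charges are simply additive and that partial GiB are billed proportionally. Neither assumption is confirmed here, so treat the result as a ballpark and check the full pricing page. The function name `estimate_sync_cost` is hypothetical.

```python
from decimal import Decimal

# Rates as listed above, in US dollars.
PER_GIB = Decimal("0.04")
PER_DOC_PAGE = Decimal("0.01")
PER_WEBPAGE = Decimal("0.01")
GIB = 1024 ** 3  # bytes in one GiB


def estimate_sync_cost(bytes_processed, doc_pages=0, webpages=0):
    """Estimate a Sync bill in dollars, rounded to the nearest cent.

    Assumes additive charges and proportional billing for partial GiB.
    """
    gib = Decimal(bytes_processed) / Decimal(GIB)
    total = gib * PER_GIB + doc_pages * PER_DOC_PAGE + webpages * PER_WEBPAGE
    return total.quantize(Decimal("0.01"))


# 50 GiB processed, 2,000 document pages, 500 webpages:
# 50 * 0.04 + 2000 * 0.01 + 500 * 0.01 = 2.00 + 20.00 + 5.00
print(estimate_sync_cost(50 * GIB, doc_pages=2000, webpages=500))  # 27.00
```

In this example the per-page charges dominate the per-GiB charge, which is typical when the input is many PDFs or crawled pages rather than raw text.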

Start syncing your data
