Chroma Sync

Serverless data ingestion for Chroma Cloud. Connect your S3 buckets, GitHub repositories, and websites. Chroma handles parsing, chunking, and embedding so you can start searching in minutes.

Built for scale and reliability

Whether you're syncing a handful of files or millions of documents, Sync runs the same pipeline: a queue-based system with retries, rate-limit awareness, and automatic error recovery.

Sync is designed to maximize throughput without dropping work. You can monitor every invocation in the dashboard or through the Sync API.
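To make the queue-with-retries idea concrete, here is a minimal, generic sketch of the pattern. It is not Chroma's implementation and does not use any Sync API: `process_queue`, `RateLimited`, and `flaky_handler` are hypothetical names, and the backoff policy (exponential delay for ordinary errors, a source-supplied delay for rate limits) is an assumption chosen for illustration.

```python
import time
from collections import deque


class RateLimited(Exception):
    """Hypothetical error raised when a source asks the caller to slow down."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def process_queue(items, handler, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run handler on every item, re-queueing failures instead of dropping them.

    Returns (done, failed): results for items that succeeded, and the items
    that still failed after max_attempts tries.
    """
    queue = deque((item, 1) for item in items)
    done, failed = [], []
    while queue:
        item, attempt = queue.popleft()
        try:
            done.append(handler(item))
        except Exception as exc:
            if attempt >= max_attempts:
                failed.append(item)  # give up, but report it rather than lose it
                continue
            if isinstance(exc, RateLimited):
                delay = exc.retry_after  # honor the delay the source requested
            else:
                delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            sleep(delay)
            queue.append((item, attempt + 1))  # retry after the other queued work
    return done, failed


# Usage: one file is rate limited once, another never parses.
calls = {}


def flaky_handler(item):
    calls[item] = calls.get(item, 0) + 1
    if item == "b.pdf" and calls[item] == 1:
        raise RateLimited(retry_after=0.0)
    if item == "c.pdf":
        raise ValueError("unparseable")
    return item.upper()


done, failed = process_queue(
    ["a.pdf", "b.pdf", "c.pdf"], flaky_handler, sleep=lambda seconds: None
)
print(done, failed)
```

In this sketch a failed item goes to the back of the queue, so one slow or rate-limited source does not block the rest of the work.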


 +--------------------------------------+
 |        YOUR DATA SOURCES             |
 |                                      |
 |  +--------+ +--------+ +--------+    |
 |  | S3     | | GitHub | | Web    |    |
 |  | bucket | | repos  | | pages  |    |
 |  +---+----+ +---+----+ +---+----+    |
 |      |          |          |         |
 +------+----------+----------+---------+
        |          |          |
        v          v          v
 +======================================+
 ||         CHROMA SYNC                ||
 ||                                    ||
 ||  PARSE --> CHUNK --> EMBED         ||
 ||                                    ||
 ||  * Retries & error recovery        ||
 ||  * Rate-limit awareness            ||
 ||  * Maximum throughput              ||
 ||                                    ||
 +================+=====================+
                  |
                  v
 +--------------------------------------+
 |        CHROMA DATABASE               |
 |                                      |
 |  Ready for vector, full-text,        |
 |  regex, sparse, and hybrid search    |
 |                                      |
 +--------------------------------------+

Three sources, one pipeline

Sync Features
S3 Sync
  • Bucket-level connections
  • PDFs, docs, images, and text
  • Auto-sync for file updates
  • Queue-based ingest at scale
GitHub Sync
  • Public and private repositories
  • Branch or commit targeting
  • Diff-based incremental updates
  • Syntax-aware code chunking
Web Sync
  • JavaScript page rendering
  • Recursive crawl from a seed URL
  • Depth and path filters
  • Structured markdown extraction
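As an illustration of what "recursive crawl from a seed URL" with "depth and path filters" can mean, the sketch below walks a small in-memory link graph breadth-first. It is a simplified stand-in, not how Web Sync is implemented: `crawl`, `get_links`, `max_depth`, and `path_prefix` are hypothetical names, and the same-host restriction is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed, get_links, max_depth=2, path_prefix="/"):
    """Breadth-first crawl from seed, bounded by link depth and URL path prefix.

    get_links(url) must return the absolute URLs linked from that page.
    Returns the visited URLs in visit order.
    """
    host = urlparse(seed).netloc
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth filter: do not follow links any further
        for link in get_links(url):
            parsed = urlparse(link)
            # Path filter (plus an assumed same-host rule).
            if parsed.netloc != host or not parsed.path.startswith(path_prefix):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Usage with a fake site so the example runs without network access.
SITE = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/blog/x",
    ],
    "https://example.com/docs/a": [
        "https://example.com/docs/b",
        "https://other.example/docs/z",
    ],
    "https://example.com/docs/b": ["https://example.com/docs/c"],
}

pages = crawl(
    "https://example.com/docs/",
    lambda url: SITE.get(url, []),
    max_depth=2,
    path_prefix="/docs/",
)
print(pages)
```

Here the blog page and the off-site link are filtered out, and `/docs/c` is never reached because it sits three links away from the seed.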

From data to search

1. Parse

PDFs, Office documents, images, ebooks, HTML, and code, converted to clean markdown with tables, headings, and structure preserved.

2. Chunk

Tree-sitter for syntax-aware code chunking. Structured markdown chunking for documents. Respects function boundaries and sections.

3. Embed

Dense and sparse embeddings generated automatically with open models. No extra API keys needed.

4. Search

Ready for semantic, sparse, hybrid, regex, and full-text search across your data.
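The chunking step is the easiest of these to picture in code. The sketch below shows one simple way to chunk markdown along its structure: split at headings, then pack paragraphs into chunks up to a size limit. It is an assumption-laden illustration, not Sync's chunker: `split_sections`, `chunk_markdown`, and `max_chars` are hypothetical names, it uses a regular expression rather than a real markdown parser (so a `#` line inside a code fence would be misread as a heading), and it does not cover the Tree-sitter path for code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(text):
    """Return (heading, body) pairs, one per markdown heading."""
    sections = []
    heading, lines = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if heading or any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if heading or any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def chunk_markdown(text, max_chars=200):
    """Chunk markdown so that no chunk crosses a heading boundary.

    Paragraphs within a section are packed together until adding another
    would exceed max_chars. A single oversized paragraph is kept whole.
    """
    chunks = []
    for heading, body in split_sections(text):
        current = ""
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"heading": heading, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"heading": heading, "text": current})
    return chunks


doc = """# Install

Run the installer.

# Usage

First paragraph of usage.

Second paragraph of usage.
"""

chunks = chunk_markdown(doc, max_chars=30)
for chunk in chunks:
    print(chunk["heading"], "|", chunk["text"])
```

Keeping the heading alongside each chunk is a common design choice: it gives the chunk context when it is later embedded or shown in search results.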

Usage-based pricing

$0.04 per GiB processed
+ $0.01 / document page extracted
+ $0.01 / webpage scraped

View full pricing →
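For a rough budget, the listed rates can be combined into a small estimator. This is a back-of-the-envelope sketch only: it assumes the three charges simply add up, with no minimums, free tier, or per-line rounding, none of which this page specifies. `estimate_cost` and its parameters are hypothetical names; check the full pricing page for the authoritative rules.

```python
from decimal import Decimal

GIB = 1024 ** 3  # bytes in one GiB

# Rates as listed on this page (USD).
RATE_PER_GIB = Decimal("0.04")
RATE_PER_DOC_PAGE = Decimal("0.01")
RATE_PER_WEBPAGE = Decimal("0.01")


def estimate_cost(bytes_processed, doc_pages=0, webpages=0):
    """Estimate a Sync bill in USD, assuming the charges are purely additive."""
    data_cost = Decimal(bytes_processed) / GIB * RATE_PER_GIB
    page_cost = doc_pages * RATE_PER_DOC_PAGE
    web_cost = webpages * RATE_PER_WEBPAGE
    return (data_cost + page_cost + web_cost).quantize(Decimal("0.01"))


# Example: 50 GiB processed, 2,000 document pages extracted, 300 webpages scraped.
# 50 * 0.04 + 2000 * 0.01 + 300 * 0.01 = 2.00 + 20.00 + 3.00
print(estimate_cost(50 * GIB, doc_pages=2000, webpages=300))  # 25.00
```

`Decimal` is used instead of floats so that cent-level amounts do not pick up binary rounding errors.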

Start syncing your data
