| Parameter | Type | Default | Description |
|---|---|---|---|
| embedder | Union[str, Embedder, BaseEmbeddings] | OpenAIEmbedder | The embedder configuration. Can be an Agno Embedder (e.g., OpenAIEmbedder, GeminiEmbedder), a Chonkie BaseEmbeddings instance (e.g., OpenAIEmbeddings), or a string model identifier (e.g., "text-embedding-3-small") for Chonkie's AutoEmbeddings. |
| chunk_size | int | 5000 | Maximum tokens allowed per chunk. |
| similarity_threshold | float | 0.5 | Similarity threshold for grouping sentences (0-1). Lower values create larger groups (fewer chunks). |
| similarity_window | int | 3 | Number of sentences to consider for similarity calculation. |
| min_sentences_per_chunk | int | 1 | Minimum number of sentences per chunk. |
| min_characters_per_sentence | int | 24 | Minimum number of characters per sentence. |
| delimiters | List[str] | [". ", "! ", "? ", "\n"] | Delimiters to split sentences on. |
| include_delimiters | Literal["prev", "next", None] | "prev" | Include delimiters in the chunk text. Specify whether to include with the previous or next sentence. |
| skip_window | int | 0 | Number of groups to skip when looking for similar content to merge. 0 (default) uses standard semantic grouping; higher values enable merging of non-consecutive semantically similar groups. |
| filter_window | int | 5 | Window length for the Savitzky-Golay filter used in boundary detection. |
| filter_polyorder | int | 3 | Polynomial order for the Savitzky-Golay filter. |
| filter_tolerance | float | 0.2 | Tolerance for the Savitzky-Golay filter boundary detection. |
| chunker_params | Dict[str, Any] | None | Additional parameters to pass directly to Chonkie's SemanticChunker. |
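
To make the `delimiters`, `include_delimiters`, and `min_characters_per_sentence` rows more concrete, here is a minimal pure-Python sketch of how sentence splitting with these options could behave. It is not Chonkie's implementation: the helper names `split_sentences` and `merge_short` are hypothetical, and the merging rule for short sentences is an assumption made for illustration.

```python
from typing import List, Optional, Sequence


def split_sentences(
    text: str,
    delimiters: Sequence[str] = (". ", "! ", "? ", "\n"),
    include_delim: Optional[str] = "prev",
) -> List[str]:
    """Split text on delimiters (hypothetical helper, for illustration only).

    include_delim="prev" keeps each delimiter at the end of the sentence before it,
    "next" moves it to the start of the sentence after it, and None drops it.
    """
    pieces: List[str] = []
    start = 0   # index where the current sentence begins
    i = 0       # scan position
    carry = ""  # delimiter waiting to be prepended when include_delim == "next"
    while i < len(text):
        match = next((d for d in delimiters if text.startswith(d, i)), None)
        if match is None:
            i += 1
            continue
        body = carry + text[start:i]
        carry = ""
        if include_delim == "prev":
            body += match
        elif include_delim == "next":
            carry = match
        if body:
            pieces.append(body)
        i += len(match)
        start = i
    tail = carry + text[start:]
    if tail:
        pieces.append(tail)
    return pieces


def merge_short(pieces: Sequence[str], min_chars: int = 24) -> List[str]:
    """Merge pieces until each has at least min_chars characters (assumed rule)."""
    merged: List[str] = []
    buffer = ""
    for piece in pieces:
        buffer += piece
        if len(buffer.strip()) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Attach a too-short remainder to the last sentence, if there is one.
        if merged:
            merged[-1] += buffer
        else:
            merged.append(buffer)
    return merged


text = "Hello world. How are you? Fine!\nBye"
print(split_sentences(text, include_delim="prev"))
print(split_sentences(text, include_delim="next"))
print(merge_short(split_sentences(text), min_chars=10))
```

With `"prev"` or `"next"`, joining the pieces reproduces the original text exactly, which is usually what you want when chunk offsets need to map back to the source document.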
Semantic Chunking
Semantic chunking is a method of splitting documents into smaller chunks by analyzing semantic similarity between text segments using embeddings.
It uses the Chonkie library to identify natural breakpoints where the semantic meaning changes significantly, based on a configurable similarity threshold.
This helps preserve context and meaning better than fixed-size chunking by ensuring semantically related content stays together in the same chunk, while splitting occurs at meaningful topic transitions.
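
The idea of splitting where similarity drops below a threshold can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the algorithm Chonkie uses: `bow_embed` is a toy bag-of-words stand-in for a real embedder, tokens are approximated by whitespace-separated words, only adjacent sentences are compared, and the similarity window and Savitzky-Golay filtering are left out. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def bow_embed(sentence: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedder)."""
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: Sequence[str],
    similarity_threshold: float = 0.5,
    chunk_size: int = 5000,
    embed: Callable[[str], Counter] = bow_embed,
) -> List[List[str]]:
    """Group consecutive sentences; start a new chunk when similarity to the
    previous sentence falls below the threshold or the chunk would exceed
    chunk_size (counted here in words, as a rough proxy for tokens)."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        tokens = len(sentence.split())
        if current:
            sim = cosine(embed(current[-1]), embed(sentence))
            if sim < similarity_threshold or current_tokens + tokens > chunk_size:
                chunks.append(current)
                current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "Cats chase mice.",
    "Cats chase birds.",
    "Stocks fell sharply today.",
    "Stocks fell again today.",
]
for chunk in semantic_chunks(sentences, similarity_threshold=0.5):
    print(chunk)
```

In this toy example the two sentences about cats land in one chunk and the two about stocks in another. Raising the threshold to 0.7 splits the cat sentences apart as well, which matches the rule that lower thresholds produce larger groups and fewer chunks.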