Write and Validate a Custom Chunking Function
In Code Extension, function runs are specific to search index chunking, where your code controls how content is split before vectorization and indexing. When you select your custom function in the search index builder, Data 360 passes a list of document elements to your function, and your function must return a dictionary that contains the generated chunks. This topic focuses on writing that logic and validating it before deployment.
| Edition Table |
|---|
| Available in: Developer, Enterprise, Performance, and Unlimited Editions. See Data 360 edition availability. |
| Permission Sets Needed | |
|---|---|
| To write and validate a custom chunking function: | Permission set: |
- Required input: request object with an input list.
- Optional input metadata: fields such as page_number, transcript_fields, text_as_html, source_dmo_fields, prepend, and image metadata fields (image_base64, image_mime_type, image_type), when available.
- Required output: top-level output list.
- Required chunk fields: text, seq_no, chunk_type.
- Optional output fields: citations.
Before you implement payload/entrypoint.py, define the logic your function applies to each DocElement list:
- Identify source patterns: List the content patterns you expect, for example, titles, section text, tables, lists, or mixed HTML blocks from your Data 360 source data.
- Define chunk boundaries: Decide exactly where one chunk starts and ends, for example, per heading, per section, or by token/length limits.
- Define context rules: Specify what must stay together in a chunk, for example, product name, specs, and link, to preserve retrieval quality.
- Define citation mapping: Decide which metadata fields to carry into citations, for example, FilePath__c, page_number, or source type, for traceability.
- Define validation scenarios: List the sample and org-derived payloads you use to verify output shape and chunk quality.
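One way to picture the "per heading" boundary rule from the planning list is the sketch below. It assumes DocElements arrive as plain dictionaries with text and metadata keys; the helper name group_by_title is hypothetical and is not part of the scaffold.

```python
from typing import Any

DocElement = dict[str, Any]


def group_by_title(elements: list[DocElement]) -> list[list[DocElement]]:
    """Start a new group at every title element and keep the elements that follow with it."""
    groups: list[list[DocElement]] = []
    for element in elements:
        element_type = (element.get("metadata") or {}).get("type", "text")
        if element_type == "title" or not groups:
            groups.append([element])
        else:
            groups[-1].append(element)
    return groups


elements = [
    {"text": "Install", "metadata": {"type": "title"}},
    {"text": "Run the installer.", "metadata": {"type": "text"}},
    {"text": "Configure", "metadata": {"type": "title"}},
    {"text": "Edit the settings file.", "metadata": {"type": "text"}},
]
sections = group_by_title(elements)
# Each section keeps its heading and body together as one candidate chunk.
section_texts = ["\n".join(e["text"] for e in group) for group in sections]
```

Grouping by heading keeps related context together; a length or token cap could be layered on top if sections in your source data run long.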
Encapsulate your custom chunking logic in a top-level callable in payload/entrypoint.py. In the initiated function package scaffold, the callable is:
- request (SearchIndexChunkingV1Request): The typed request object containing input documents to chunk.
- runtime (Runtime): Runtime context provided by the platform (required by the callable signature).
- returns (SearchIndexChunkingV1Response): The typed response object containing generated chunks under output.
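The parameter list above can be illustrated with a rough sketch. Everything here is an assumption made so the snippet runs on its own: the three classes are local stand-ins for the types the scaffold provides, and the function name chunk is hypothetical. The signature that the scaffold generates is the one to keep.

```python
from dataclasses import dataclass, field
from typing import Any


# Local stand-ins so this sketch is self-contained. In a real package, use the
# classes that the scaffold provides instead of defining these.
@dataclass
class SearchIndexChunkingV1Request:
    input: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class SearchIndexChunkingV1Response:
    output: list[dict[str, Any]] = field(default_factory=list)


class Runtime:
    """Placeholder for the platform-provided runtime context."""


# "chunk" is a hypothetical name; keep the name that the scaffold generated.
def chunk(
    request: SearchIndexChunkingV1Request, runtime: Runtime
) -> SearchIndexChunkingV1Response:
    # Returns an empty chunk list; replace with real chunking logic.
    return SearchIndexChunkingV1Response(output=[])


response = chunk(SearchIndexChunkingV1Request(input=[]), Runtime())
```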
The runtime reads the entry point file from payload/config.json and executes your callable from that file. In the initiated scaffold, keep the callable signature as just shown.
The callable receives request and runtime. The request object uses the SearchIndexChunkingV1Request schema, which contains one top-level key, input.
| Request Key | Data Type | Required | Description |
|---|---|---|---|
| input | List of DocElement | Yes | List of document elements to chunk. |
Each DocElement in the list is a dictionary representing a piece of content from the source document.
| Key | Description |
|---|---|
| text | The text content of the element. |
| metadata | Information about the element (type, origin, and so on). |
Example:
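A minimal sketch of one DocElement, written as a Python dictionary. The values are illustrative assumptions, not output from a real source.

```python
# Illustrative values only; real metadata depends on the source and element type.
doc_element = {
    "text": "Quarterly Results",
    "metadata": {
        "type": "title",
        "page_number": 1,
    },
}
```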
The fields inside metadata vary by element type.
The element type is specified by the type key inside metadata. Types are grouped into these categories.
| Type | Description |
|---|---|
| title | Structural titles and section headers. |
| text | Standard blocks of text. |
| list_item | Individual items within a list. |
| table | Structured, tabular data where text contains extracted table text. When Code Extension parsing and preprocessing is enabled, metadata can also include table HTML. |
| code_snippet | Pre-formatted source code or command snippets. |
| image | Visual elements (charts, diagrams, pictures). Metadata includes image_base64. |
| page_metadata | Peripheral content such as headers, footers, and page numbers. |
For simple file formats, for example, .html, .txt, .log, and .aspx, you can pass the entire file content as a single DocElement. In that case, text holds the full content and the metadata type is text.
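As a sketch of how a function might branch on element type, the snippet below reads the type key and filters out elements it doesn't want to chunk. Skipping page_metadata and image elements is a design choice made for this illustration, not a platform rule, and the helper names are hypothetical.

```python
from typing import Any

DocElement = dict[str, Any]

# Design choice for this sketch only: skip peripheral content and images.
SKIPPED_TYPES = {"page_metadata", "image"}


def element_type(element: DocElement) -> str:
    # Treat a missing type as "text".
    return (element.get("metadata") or {}).get("type", "text")


def chunkable(elements: list[DocElement]) -> list[DocElement]:
    """Keep only the elements whose type this sketch chooses to chunk."""
    return [e for e in elements if element_type(e) not in SKIPPED_TYPES]


sample = [
    {"text": "Overview", "metadata": {"type": "title"}},
    {"text": "Page 3", "metadata": {"type": "page_metadata"}},
    {"text": "Whole file content", "metadata": {}},
]
kept = chunkable(sample)
```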
Use this as the complete metadata field reference.
| Metadata Key | Data Type | Required | Description |
|---|---|---|---|
| type | Enum (text, title, table, image, list_item, code_snippet, page_metadata) | No | Element type. Defaults to text when omitted. |
| page_number | Integer | No | Page number in the source document. |
| transcript_fields | Object | No | Transcript metadata for audio/video content. |
| text_as_html | String | No | HTML representation of content (for example, table HTML). |
| source_dmo_fields | Dictionary (string -> string/int/float) | No | Source DMO field values as key-value pairs. |
| prepend | List of objects | No | Fields whose values are prepended before indexing. |
| image_base64 | String | No | Base64 image data for image-type elements. |
| image_mime_type | String | No | Image MIME type (for example, image/png). |
| image_type | String | No | Semantic image category (for example, diagram). |
Nested object fields:
- transcript_fields.speaker (String, optional)
- transcript_fields.start_timestamp (Float, optional)
- transcript_fields.end_timestamp (Float, optional)
- prepend[].dmo_name (String, optional)
- prepend[].field_name (String, optional)
- prepend[].value (String, optional)
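Because every nested field is optional, reads need defaults. The sketch below shows one possible way to fold transcript metadata into chunk text. The helper name transcript_text and the bracketed prefix format are assumptions for illustration; the timestamp unit isn't stated here, so the values are printed without a unit.

```python
from typing import Any


def transcript_text(element: dict[str, Any]) -> str:
    """Prefix the element text with speaker and time range when they are present."""
    metadata = element.get("metadata") or {}
    fields = metadata.get("transcript_fields") or {}
    speaker = fields.get("speaker")
    start = fields.get("start_timestamp")
    end = fields.get("end_timestamp")

    prefix_parts = []
    if speaker:
        prefix_parts.append(speaker)
    if start is not None and end is not None:
        prefix_parts.append(f"{start:.1f}-{end:.1f}")
    prefix = f"[{' '.join(prefix_parts)}] " if prefix_parts else ""
    return prefix + element.get("text", "")


spoken = {
    "text": "Hello.",
    "metadata": {
        "transcript_fields": {
            "speaker": "Agent",
            "start_timestamp": 0.0,
            "end_timestamp": 4.5,
        }
    },
}
labeled = transcript_text(spoken)
unlabeled = transcript_text({"text": "Hi", "metadata": {}})
```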
- Unstructured Data Model Objects (UDMOs): Input metadata sent to your function can include fields such as Col1__c through Col5__c, ContentType__c, DataSourceObject__c, DataSource__c, ETag__c, FilePath__c, InternalOrganization__c, KQ_FilePath__c, LastModified__c, Metadata__c, ParentId__c, ResolvedFilePath__c, and Size__c, as applicable.
- Structured DMOs: Input metadata sent to your function can include fields such as datasourceobject__c and datasource__c when chunking from a specific field.
These are input fields provided in the request metadata when available. You don’t create these metadata fields in your function. Use available fields in your citations to trace chunks back to their source.
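A sketch of carrying source fields into citations follows. Two assumptions are baked in and labeled in the code: the selection of keys is hypothetical, and a field is looked up both directly in metadata and under source_dmo_fields because its exact location can vary. Values are converted to strings because citations hold string key-value pairs.

```python
from typing import Any

# Hypothetical selection of fields to carry forward; adjust to your source.
CITATION_KEYS = ("FilePath__c", "page_number")


def build_citations(element: dict[str, Any]) -> dict[str, str]:
    metadata = element.get("metadata") or {}
    # Assumption: a field can appear directly in metadata or under source_dmo_fields.
    source_fields = metadata.get("source_dmo_fields") or {}
    citations: dict[str, str] = {}
    for key in CITATION_KEYS:
        value = metadata.get(key, source_fields.get(key))
        if value is not None:
            citations[key] = str(value)  # citations hold string values
    return citations


element = {
    "text": "Warranty terms",
    "metadata": {
        "page_number": 4,
        "source_dmo_fields": {"FilePath__c": "docs/guide.pdf"},
    },
}
citations = build_citations(element)
```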
Your function generates the output. It must return a dictionary with an output key. The value of output must be a list of Chunk dictionaries, in sequential order.
Use this table to decide which keys are mandatory.
| Key | Required | Description |
|---|---|---|
| output | Yes | List of chunk dictionaries returned by your function. |
Each chunk in the list must be a dictionary with these keys.
| Key | Data Type | Description |
|---|---|---|
| text | String (required) | The text content of the chunk. |
| seq_no | Integer (required) | The sequence number of the chunk, starting from 1. Must be a continuous, ordered sequence. |
| chunk_type | String (required) | Chunk type. Use text. |
| citations | Dictionary of string key-value pairs | Optional metadata from source DocElements (for example, FilePath__c, page_number) to identify the origin of the chunk. |
The initiated project sample can include additional keys that are ignored by the typed model. Keep your function output aligned with the response schema by always returning a valid output list with correctly ordered chunk objects.
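Filtering or merging chunks late in a function can leave gaps in seq_no. One way to keep the sequence continuous is to renumber as the final step, as in this sketch. The helper name renumber is hypothetical, and dropping blank chunks is a choice made for the illustration.

```python
from typing import Any


def renumber(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop blank chunks, then assign seq_no values 1, 2, 3, ... in list order."""
    kept = [chunk for chunk in chunks if (chunk.get("text") or "").strip()]
    return [{**chunk, "seq_no": i} for i, chunk in enumerate(kept, start=1)]


draft = [
    {"text": "First section", "seq_no": 1, "chunk_type": "text"},
    {"text": "   ", "seq_no": 2, "chunk_type": "text"},
    {"text": "Second section", "seq_no": 3, "chunk_type": "text"},
]
final_chunks = renumber(draft)
```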
Use this template as a minimal baseline when you start authoring.
The following example creates one chunk per DocElement.
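A minimal sketch of that approach, under stated assumptions: the request is shown as a plain dictionary with an input list rather than the typed request object, the function name is hypothetical, and skipping empty elements is a choice made here rather than a platform requirement.

```python
from typing import Any


def one_chunk_per_element(request: dict[str, Any], runtime: Any = None) -> dict[str, Any]:
    """Return one text chunk for each non-empty DocElement, numbered from 1."""
    chunks: list[dict[str, Any]] = []
    for element in request.get("input", []):
        text = element.get("text") or ""
        if not text.strip():
            continue  # skipping empty elements is a choice made for this sketch
        chunks.append(
            {
                "text": text,
                "seq_no": len(chunks) + 1,  # continuous sequence starting at 1
                "chunk_type": "text",
            }
        )
    return {"output": chunks}


request = {
    "input": [
        {"text": "Product Overview", "metadata": {"type": "title"}},
        {"text": "", "metadata": {"type": "text"}},
        {"text": "The device ships with two cables.", "metadata": {"type": "text"}},
    ]
}
result = one_chunk_per_element(request)
```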
- Use the auto-generated sample payload payload/tests/test.json to run an initial validation of your custom logic in payload/entrypoint.py.
- Validate the package by running this command.
  Example:
  - --entrypoint ./payload/entrypoint.py (required): Path to your function entry point.
  - --test-with ./payload/tests/test.json (required): Path to the JSON file that provides test input for the function run.
  - --target-org <ORG_ALIAS> (optional): The alias or username of the Salesforce org where you want to run the code extension package. Required when your function testing includes callouts to generative or predictive models registered in your org. The org must have Data 360 enabled with the appropriate permissions.
  - --config-file <path> (optional): JSON file that supplies input for the run. If omitted, the command uses the default payload/config.json.
  - --dependencies <value> (optional): Comma-separated dependency overrides for local runs.
- If you need additional scenarios, create more test payloads in payload/tests/ by following the same request structure as test.json, and run the command again with the new file path in --test-with.
- Verify that the returned dictionary has the correct output shape, such as a list of chunks with text, seq_no, chunk_type, and optional citations, and that chunking logic behaves as expected.
- Refresh package metadata before packaging or deployment.
  Example:
  - --entrypoint ./payload/entrypoint.py (optional if you run the command from the package root): Path to your function entry point.
  - --dry-run (optional): Preview changes without modifying files.
  - --no-requirements (optional): Update config.json without changing requirements.txt.
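The output-shape check in the steps above can also be done in code while you iterate locally. The sketch below is a hypothetical helper, not part of the validation command; it assumes the function result is available as a plain dictionary and reports problems instead of raising.

```python
from typing import Any

REQUIRED_KEYS = ("text", "seq_no", "chunk_type")


def output_problems(response: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape looks right."""
    output = response.get("output")
    if not isinstance(output, list):
        return ["'output' must be a list"]

    problems: list[str] = []
    for position, chunk in enumerate(output, start=1):
        if not isinstance(chunk, dict):
            problems.append(f"chunk {position}: not a dictionary")
            continue
        for key in REQUIRED_KEYS:
            if key not in chunk:
                problems.append(f"chunk {position}: missing '{key}'")
        if chunk.get("seq_no") != position:
            problems.append(f"chunk {position}: seq_no should be {position}")
        if not isinstance(chunk.get("citations", {}), dict):
            problems.append(f"chunk {position}: citations must be a dictionary")
    return problems


good = {"output": [{"text": "a", "seq_no": 1, "chunk_type": "text"}]}
bad = {"output": [{"text": "a", "seq_no": 2}]}
good_report = output_problems(good)
bad_report = output_problems(bad)
```

A shape check like this only covers structure. Chunk quality, such as whether related context stays together, still needs review against your sample and org-derived payloads.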
For best practices and limitations (for example, deterministic output, edge-case handling, and execution model constraints), see Considerations When Writing Custom Chunking Functions.
If you want to implement prompt-driven chunking logic by calling the LLM gateway from your function, see Use LLM Gateway Calls in Custom Chunking Functions.