Write and Validate a Custom Chunking Function

In Code Extension, function runs apply only to search index chunking, where your code controls how content is split before vectorization and indexing. When you select your custom function in the search index builder, Data 360 passes a list of document elements to your function, and your function must return a dictionary whose output key contains the generated chunks. This topic covers writing that logic and validating it before deployment.

Edition Table
Available in: Developer, Enterprise, Performance, and Unlimited Editions. See Data 360 edition availability.

Permission Sets Needed
To write and validate a custom chunking function:	Permission set: Data Cloud Architect

Authoring

Required and Optional Fields at a Glance

Required input: request object with an input list.
Optional input metadata: fields such as page_number, transcript_fields, text_as_html, source_dmo_fields, prepend, and image metadata fields (image_base64, image_mime_type, image_type), when available.
Required output: top-level output list.
Required chunk fields: text, seq_no, chunk_type.
Optional output fields: citations.

Before you implement payload/entrypoint.py, define the logic your function applies to each DocElement list:

Identify source patterns: List the content patterns you expect, for example, titles, section text, tables, lists, or mixed HTML blocks from your Data 360 source data.
Define chunk boundaries: Decide exactly where one chunk starts and ends, for example, per heading, per section, or by token/length limits.
Define context rules: Specify what must stay together in a chunk, for example, product name, specs, and link, to preserve retrieval quality.
Define citation mapping: Decide which metadata fields to carry into citations, for example, FilePath__c, page_number, or source type, for traceability.
Define validation scenarios: List the sample and org-derived payloads you use to verify output shape and chunk quality.
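
The boundary and context decisions above can be prototyped on plain dictionaries before you work with the typed request and response models. The following is a minimal sketch, assuming a per-heading boundary rule. The function name chunk_by_title and the sample data are hypothetical, and the code operates on raw payload dictionaries rather than the typed request object.

```python
from typing import Any, Dict, List


def chunk_by_title(elements: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group elements into one chunk per title, keeping each title with its body."""
    groups: List[List[str]] = []
    for element in elements:
        text = (element.get("text") or "").strip()
        if not text:
            continue  # skip empty elements so they don't produce blank chunks
        element_type = (element.get("metadata") or {}).get("type") or "text"
        # A title opens a new chunk; so does the very first element.
        if element_type == "title" or not groups:
            groups.append([text])
        else:
            groups[-1].append(text)
    output = [
        {"text": "\n".join(parts), "seq_no": seq_no, "chunk_type": "text"}
        for seq_no, parts in enumerate(groups, start=1)
    ]
    return {"output": output}


# Hypothetical sample: two sections, each introduced by a title element.
sample = [
    {"text": "Setup", "metadata": {"type": "title"}},
    {"text": "Install the package.", "metadata": {"type": "text"}},
    {"text": "Usage", "metadata": {"type": "title"}},
    {"text": "Run the command.", "metadata": {"type": "text"}},
]
result = chunk_by_title(sample)
```

Keeping the title inside the chunk is one possible context rule. If your sources rarely contain title elements, a length-based boundary is likely a better fit.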

Function Signature

Encapsulate your custom chunking logic in a top-level callable in payload/entrypoint.py. In the initiated function package scaffold, the callable is:

def function(request: SearchIndexChunkingV1Request, runtime: Runtime) -> SearchIndexChunkingV1Response:

request (SearchIndexChunkingV1Request): The typed request object containing input documents to chunk.
runtime (Runtime): Runtime context provided by the platform (required by the callable signature).
returns (SearchIndexChunkingV1Response): The typed response object containing generated chunks under output.

The runtime reads the entry point file from payload/config.json and executes your callable from that file. In the initiated scaffold, keep the callable signature as just shown.

Input Specification

The callable receives request and runtime. The request object uses the SearchIndexChunkingV1Request schema, which contains one top-level key, input.

Request Key	Data Type	Required	Description
`input`	List of `DocElement`	Yes	List of document elements to chunk.

{
  "input": [
    { ... DocElement 1 ... },
    { ... DocElement 2 ... }
  ]
}

DocElement Object

Each DocElement in the list is a dictionary representing a piece of content from the source document.

Key	Description
text	The text content of the element.
metadata	Information about the element (type, origin, and so on).

Example:

{
  "text": "The actual text content of the element.",
  "metadata": { ... }
}

The fields inside metadata vary by element type.
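
Because metadata varies from element to element, it can help to take an inventory of a payload before you design chunk boundaries. The following is a small sketch that counts elements by type in a raw request dictionary, such as one loaded from a test JSON file. The helper name count_element_types is hypothetical, and a missing type is treated as text, which matches the default stated in the metadata field reference.

```python
from collections import Counter
from typing import Any, Dict


def count_element_types(request: Dict[str, Any]) -> Counter:
    """Count DocElements by metadata type, treating a missing type as 'text'."""
    counts: Counter = Counter()
    for element in request.get("input", []):
        # metadata may be absent or null, so fall back to an empty dict
        metadata = element.get("metadata") or {}
        counts[metadata.get("type") or "text"] += 1
    return counts


# Hypothetical payload with one element that has no metadata at all.
request = {
    "input": [
        {"text": "Overview", "metadata": {"type": "title"}},
        {"text": "First paragraph.", "metadata": {"type": "text"}},
        {"text": "No metadata here."},
    ]
}
type_counts = count_element_types(request)
```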

DocElement Types

The element type is specified by the type key inside metadata. Types are grouped into these categories.

Type	Description
`title`	Structural titles and section headers.
`text`	Standard blocks of text.
`list_item`	Individual items within a list.
`table`	Structured, tabular data where `text` contains extracted table text. When Code Extension parsing and preprocessing is enabled, metadata can also include table HTML.
`code_snippet`	Pre-formatted source code or command snippets.
`image`	Visual elements (charts, diagrams, pictures). Metadata includes `image_base64`.
`page_metadata`	Peripheral content such as headers, footers, and page numbers.

For simple file formats, for example, .html, .txt, .log, and .aspx, you can pass the entire file content as a single DocElement. In that case, text holds the full content and the metadata type is text.
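
One way to act on element types is to decide, per type, which text goes into a chunk. The sketch below is illustrative only: dropping page_metadata elements and preferring text_as_html for tables are assumed design choices, not platform requirements, and select_text and SKIPPED_TYPES are hypothetical names. It works on raw element dictionaries.

```python
from typing import Any, Dict, Optional

# Assumption: headers, footers, and page numbers add noise to retrieval.
SKIPPED_TYPES = {"page_metadata"}


def select_text(element: Dict[str, Any]) -> Optional[str]:
    """Return the text to chunk for an element, or None to skip it."""
    metadata = element.get("metadata") or {}
    element_type = metadata.get("type") or "text"
    if element_type in SKIPPED_TYPES:
        return None
    # Assumption: table HTML, when present, preserves more structure than plain text.
    if element_type == "table" and metadata.get("text_as_html"):
        return metadata["text_as_html"]
    text = (element.get("text") or "").strip()
    return text or None


footer = {"text": "Page 3 of 10", "metadata": {"type": "page_metadata"}}
table = {
    "text": "Name Size",
    "metadata": {"type": "table", "text_as_html": "<table><tr><td>Name</td></tr></table>"},
}
paragraph = {"text": "  Plain paragraph.  ", "metadata": {"type": "text"}}
```

Returning None for skipped elements keeps the filtering decision in one place, so the calling loop only needs to append chunks for non-empty results.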

Metadata Fields

Use this as the complete metadata field reference.

Metadata Key	Data Type	Required	Description
`type`	Enum (`text`, `title`, `table`, `image`, `list_item`, `code_snippet`, `page_metadata`)	No	Element type. Defaults to `text` when omitted.
`page_number`	Integer	No	Page number in the source document.
`transcript_fields`	Object	No	Transcript metadata for audio/video content.
`text_as_html`	String	No	HTML representation of content (for example, table HTML).
`source_dmo_fields`	Dictionary (`string -> string/int/float`)	No	Source DMO field values as key-value pairs.
`prepend`	List of objects	No	Fields whose values are prepended before indexing.
`image_base64`	String	No	Base64 image data for image-type elements.
`image_mime_type`	String	No	Image MIME type (for example, `image/png`).
`image_type`	String	No	Semantic image category (for example, `diagram`).

Nested object fields:

transcript_fields.speaker (String, optional)
transcript_fields.start_timestamp (Float, optional)
transcript_fields.end_timestamp (Float, optional)
prepend[].dmo_name (String, optional)
prepend[].field_name (String, optional)
prepend[].value (String, optional)
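
For audio or video sources, the transcript fields can be folded into the chunk text so that speaker context stays with the words. This is a minimal sketch, assuming you want a bracketed label in front of the text. The helper name format_transcript_text and the label format are hypothetical, and timestamps are printed as provided because their unit isn't specified here.

```python
from typing import Any, Dict


def format_transcript_text(element: Dict[str, Any]) -> str:
    """Prefix element text with speaker and timestamps when they are available."""
    metadata = element.get("metadata") or {}
    fields = metadata.get("transcript_fields") or {}
    text = element.get("text") or ""

    label_parts = []
    if fields.get("speaker"):
        label_parts.append(str(fields["speaker"]))
    start = fields.get("start_timestamp")
    end = fields.get("end_timestamp")
    # Only show a time range when both ends are known.
    if start is not None and end is not None:
        label_parts.append(f"{start}-{end}")

    if not label_parts:
        return text  # no transcript metadata: leave the text unchanged
    label = " ".join(label_parts)
    return f"[{label}] {text}"


utterance = {
    "text": "Hello there.",
    "metadata": {
        "transcript_fields": {
            "speaker": "Agent",
            "start_timestamp": 12.5,
            "end_timestamp": 18.0,
        }
    },
}
labeled = format_transcript_text(utterance)
```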

Standard Metadata Fields for UDMOs and DMOs

Unstructured Data Model Objects (UDMOs): Input metadata sent to your function can include fields such as Col1__c through Col5__c, ContentType__c, DataSourceObject__c, DataSource__c, ETag__c, FilePath__c, InternalOrganization__c, KQ_FilePath__c, LastModified__c, Metadata__c, ParentId__c, ResolvedFilePath__c, and Size__c, as applicable.
Structured DMOs: Input metadata sent to your function can include fields such as datasourceobject__c and datasource__c when chunking from a specific field.

These are input fields provided in the request metadata when available. You don’t create these metadata fields in your function. Use available fields in your citations to trace chunks back to their source.
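
Since citations are string key-value pairs and source field values can be strings or numbers, a small helper can select and stringify the fields you want to carry forward. The following is a sketch under assumptions: the allowlist CITATION_FIELDS and the helper name build_citations are hypothetical, and the input is the raw metadata dictionary of one element.

```python
from typing import Any, Dict

# Hypothetical allowlist: choose the fields that matter for traceability in your org.
CITATION_FIELDS = ("FilePath__c", "ContentType__c", "LastModified__c")


def build_citations(metadata: Dict[str, Any]) -> Dict[str, str]:
    """Copy selected source fields and the page number into a string-only dict."""
    source_fields = metadata.get("source_dmo_fields") or {}
    citations = {
        name: str(source_fields[name])
        for name in CITATION_FIELDS
        if source_fields.get(name) not in (None, "")
    }
    if metadata.get("page_number") is not None:
        citations["page_number"] = str(metadata["page_number"])
    return citations


metadata = {
    "page_number": 3,
    "source_dmo_fields": {"FilePath__c": "docs/sample.pdf", "Size__c": 2048},
}
citations = build_citations(metadata)
```

Fields that are missing or empty are left out rather than emitted as empty strings, which keeps the citation dictionary compact.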

Output Specification

Your function generates the output. It must return a dictionary with an output key. The value of output must be a list of Chunk dictionaries, in sequential order.

{
  "output": [
    { ... Chunk 1 ... },
    { ... Chunk 2 ... }
  ]
}

Top-Level Response

Use this table to decide which keys are mandatory.

Key	Required	Description
`output`	Yes	List of chunk dictionaries returned by your function.

Chunk Object

Each chunk in the list must be a dictionary with these keys.

Key	Data Type	Required	Description
`text`	String	Yes	The text content of the chunk.
`seq_no`	Integer	Yes	The sequence number of the chunk, starting from 1. Must be a continuous, ordered sequence.
`chunk_type`	String	Yes	Chunk type. Use `text`.
`citations`	Dictionary of string key-value pairs	No	Metadata from source DocElements (for example, `FilePath__c`, `page_number`) that identifies the origin of the chunk.

The initiated project sample can include additional keys that are ignored by the typed model. Keep your function output aligned with the response schema by always returning a valid output list with correctly ordered chunk objects.
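
The output rules above can be checked mechanically during development. The following is a minimal sketch of such a check, assuming the response has been converted to a plain dictionary. The helper name find_output_problems is hypothetical, and it only covers the rules stated in this topic, not everything the platform may enforce.

```python
from typing import Any, Dict, List

REQUIRED_CHUNK_KEYS = ("text", "seq_no", "chunk_type")


def find_output_problems(response: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    output = response.get("output")
    if not isinstance(output, list):
        return ["'output' must be a list"]

    problems = []
    for position, chunk in enumerate(output, start=1):
        if not isinstance(chunk, dict):
            problems.append(f"chunk {position}: not a dictionary")
            continue
        for key in REQUIRED_CHUNK_KEYS:
            if key not in chunk:
                problems.append(f"chunk {position}: missing '{key}'")
        # seq_no must start at 1 and increase by 1 with no gaps.
        if "seq_no" in chunk and chunk["seq_no"] != position:
            problems.append(
                f"chunk {position}: seq_no is {chunk['seq_no']}, expected {position}"
            )
        citations = chunk.get("citations") or {}
        if not isinstance(citations, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in citations.items()
        ):
            problems.append(f"chunk {position}: citations must map strings to strings")
    return problems


good = {"output": [{"text": "a", "seq_no": 1, "chunk_type": "text"}]}
bad = {"output": [{"text": "a", "seq_no": 2, "citations": {"page_number": 1}}]}
```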

Minimum Valid Function

Use this template as a minimal baseline when you start authoring.

from datacustomcode.function import Runtime
from datacustomcode.function.feature_types.chunking import (
    ChunkType,
    SearchIndexChunkingV1Output,
    SearchIndexChunkingV1Request,
    SearchIndexChunkingV1Response,
)


def function(
    request: SearchIndexChunkingV1Request, runtime: Runtime
) -> SearchIndexChunkingV1Response:
    chunks = []
    for i, element in enumerate(request.input, start=1):
        chunks.append(
            SearchIndexChunkingV1Output(
                text=element.text,
                seq_no=i,
                chunk_type=ChunkType.TEXT,
            )
        )
    return SearchIndexChunkingV1Response(output=chunks)

Example Function

The following example creates one chunk per DocElement.

Example Input

{
  "input": [
    {
      "text": "Document Title",
      "metadata": {
        "type": "text",
        "page_number": 1,
        "source_dmo_fields": {
          "FilePath__c": "relative_file_path_from_UDMO"
        }
      }
    },
    {
      "text": "This is the first paragraph of the document, discussing important topics.",
      "metadata": {
        "type": "text",
        "page_number": 1,
        "source_dmo_fields": {
          "FilePath__c": "docs/sample.pdf"
        }
      }
    }
  ]
}

Example Implementation

from datacustomcode.function import Runtime
from datacustomcode.function.feature_types.chunking import (
    ChunkType,
    SearchIndexChunkingV1Output,
    SearchIndexChunkingV1Request,
    SearchIndexChunkingV1Response,
)


def function(
    request: SearchIndexChunkingV1Request, runtime: Runtime
) -> SearchIndexChunkingV1Response:
    """
    A simple chunking strategy that creates one chunk per DocElement.
    """
    chunks = []
    for i, element in enumerate(request.input):
        # Metadata and its fields are optional, so guard every lookup.
        source_fields = element.metadata.source_dmo_fields if element.metadata else {}
        citation_data = {
            "source_type": element.metadata.type.value if element.metadata and element.metadata.type else "",
            "page_number": str(element.metadata.page_number) if element.metadata and element.metadata.page_number is not None else "",
            "file_path": str(source_fields.get("FilePath__c")) if source_fields and source_fields.get("FilePath__c") else "",
        }

        chunk = SearchIndexChunkingV1Output(
            text=element.text,
            seq_no=i + 1,  # enumerate is zero-based; seq_no starts at 1
            chunk_type=ChunkType.TEXT,
            # Keep only the citation entries that have a value.
            citations={k: v for k, v in citation_data.items() if v},
        )
        chunks.append(chunk)

    return SearchIndexChunkingV1Response(output=chunks)

Example Output

{
  "output": [
    {
      "chunk_type": "text",
      "text": "Document Title",
      "seq_no": 1,
      "citations": {
        "source_type": "text",
        "page_number": "1",
        "file_path": "relative_file_path_from_UDMO"
      }
    },
    {
      "chunk_type": "text",
      "text": "This is the first paragraph of the document, discussing important topics.",
      "seq_no": 2,
      "citations": {
        "source_type": "text",
        "page_number": "1",
        "file_path": "docs/sample.pdf"
      }
    }
  ]
}
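
The one-chunk-per-element strategy shown in this example can produce chunks that are too long when a single element holds a lot of text. As a possible variation, the sketch below splits each element at word boundaries under a character limit and numbers the resulting chunks in one continuous sequence. The names split_by_length and chunk_with_limit and the default limit of 40 characters are hypothetical, and the code uses plain dictionaries instead of the typed models.

```python
from typing import Any, Dict, List


def split_by_length(text: str, max_chars: int) -> List[str]:
    """Split text at whitespace so each piece is at most max_chars long.

    A single word longer than max_chars is kept whole rather than cut.
    """
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_with_limit(elements: List[Dict[str, Any]], max_chars: int = 40) -> Dict[str, Any]:
    """Create chunks per element, splitting long text and numbering across all elements."""
    output: List[Dict[str, Any]] = []
    for element in elements:
        for piece in split_by_length(element.get("text") or "", max_chars):
            # len(output) + 1 keeps seq_no continuous even when an element yields several chunks.
            output.append({"text": piece, "seq_no": len(output) + 1, "chunk_type": "text"})
    return {"output": output}


limited = chunk_with_limit([{"text": "aaa bbb ccc"}, {"text": "ddd"}], max_chars=7)
```

Deriving seq_no from the running output length, rather than from the element index, is what keeps the sequence gap-free once elements and chunks stop mapping one to one.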

Validating

Use the auto-generated sample payload payload/tests/test.json to run an initial validation of your custom logic in payload/entrypoint.py.
Validate the package by running this command.
sf data-code-extension function run --entrypoint <ENTRYPOINT_PATH> --test-with <TEST_JSON_PATH> [--target-org <ORG_ALIAS>] [--config-file <CONFIG_PATH>] [--dependencies <DEPENDENCIES>]
Example:
sf data-code-extension function run --entrypoint ./payload/entrypoint.py --test-with ./payload/tests/test.json
- --entrypoint ./payload/entrypoint.py (required): Path to your function entry point.
- --test-with ./payload/tests/test.json (required): Path to the JSON file that provides test input for the function run.
- --target-org <ORG_ALIAS> (optional): The alias or username of the Salesforce org where you want to run the code extension package. Required when your function testing includes callouts to generative or predictive models registered in your org. The org must have Data 360 enabled with the appropriate permissions.
- --config-file <CONFIG_PATH> (optional): JSON file that supplies input for the run. If omitted, the command uses the default payload/config.json.
- --dependencies <DEPENDENCIES> (optional): Comma-separated dependency overrides for local runs.
If you need additional scenarios, create more test payloads in payload/tests/ by following the same request structure as test.json, and run the command again with the new file path in --test-with.
Verify that the returned dictionary has the correct output shape, such as a list of chunks with text, seq_no, chunk_type, and optional citations, and that chunking logic behaves as expected.
Refresh package metadata before packaging or deployment.
sf data-code-extension function scan [--entrypoint <ENTRYPOINT_PATH>] [--config-file <CONFIG_PATH>] [--dry-run] [--no-requirements]
Example:
sf data-code-extension function scan --entrypoint ./payload/entrypoint.py
- --entrypoint ./payload/entrypoint.py (optional if you run the command from the package root): Path to your function entry point.
- --dry-run (optional): Preview changes without modifying files.
- --no-requirements (optional): Update config.json without changing requirements.txt.
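
Additional test payloads for the validation run can be generated with a short script instead of being written by hand. The following is a sketch, assuming each test file needs only the top-level input list described in the input specification. The helper name write_test_payload, the scenario names, and the scenario contents are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Hypothetical edge-case scenarios, keyed by the file name they are saved under.
SCENARIOS: Dict[str, List[Dict[str, Any]]] = {
    "empty_text": [{"text": "", "metadata": {"type": "text"}}],
    "no_metadata": [{"text": "Element without metadata."}],
    "table_with_html": [
        {
            "text": "Name Size",
            "metadata": {"type": "table", "text_as_html": "<table></table>"},
        }
    ],
}


def write_test_payload(directory: Path, name: str, elements: List[Dict[str, Any]]) -> Path:
    """Write one request payload as <directory>/<name>.json and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    path.write_text(json.dumps({"input": elements}, indent=2), encoding="utf-8")
    return path
```

Calling write_test_payload(Path("payload/tests"), name, elements) for each entry in SCENARIOS produces one file per scenario, and each file path can then be passed to --test-with in the run command.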

For best practices and limitations (for example, deterministic output, edge-case handling, and execution model constraints), see Considerations When Writing Custom Chunking Functions.

If you want to implement prompt-driven chunking logic by calling the LLM gateway from your function, see Use LLM Gateway Calls in Custom Chunking Functions.

Next Step

Deploy Custom Function to Data 360