Write and Validate a Custom Chunking Function

In Code Extension, function runs apply only to search index chunking, where your code controls how content is split before vectorization and indexing. When you select your custom function in the search index builder, Data 360 passes a list of document elements to your function, and your function must return a dictionary that contains the generated chunks. This topic focuses on writing that logic and validating it before deployment.

Available in: Developer, Enterprise, Performance, and Unlimited Editions. See Data 360 edition availability.

Permission Sets Needed
To write and validate a custom chunking function:
  • Data Cloud Architect

The function contract in brief:
  • Required input: request object with an input list.
  • Optional input metadata: fields such as page_number, transcript_fields, text_as_html, source_dmo_fields, prepend, and image metadata fields (image_base64, image_mime_type, image_type), when available.
  • Required output: top-level output list.
  • Required chunk fields: text, seq_no, chunk_type.
  • Optional output fields: citations.

Before you implement payload/entrypoint.py, define the logic your function applies to each DocElement list:

  1. Identify source patterns: List the content patterns you expect, for example, titles, section text, tables, lists, or mixed HTML blocks from your Data 360 source data.
  2. Define chunk boundaries: Decide exactly where one chunk starts and ends, for example, per heading, per section, or by token/length limits.
  3. Define context rules: Specify what must stay together in a chunk, for example, product name, specs, and link, to preserve retrieval quality.
  4. Define citation mapping: Decide which metadata fields to carry into citations, for example, FilePath__c, page_number, or source type, for traceability.
  5. Define validation scenarios: List the sample and org-derived payloads you use to verify output shape and chunk quality.
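
As an illustration of steps 2 and 3, the following sketch groups element text under the most recent title and starts a new chunk when a length limit would be exceeded. The helper name group_by_heading and the MAX_CHARS value are hypothetical, and the boundary rules are one possible choice rather than a recommended policy.

```python
from typing import Any

MAX_CHARS = 1000  # hypothetical length limit; tune it for your content


def group_by_heading(elements: list[dict[str, Any]], max_chars: int = MAX_CHARS) -> list[str]:
    """Group element text into chunk strings, starting a new chunk at each title."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    def flush() -> None:
        # Close the chunk being built, if it has any content.
        nonlocal current, current_len
        if current:
            chunks.append("\n".join(current))
        current, current_len = [], 0

    for element in elements:
        text = (element.get("text") or "").strip()
        if not text:
            continue  # skip empty elements
        element_type = (element.get("metadata") or {}).get("type", "text")
        # Boundary rule 1: a title starts a new chunk.
        # Boundary rule 2: adding this text would exceed the length limit.
        if element_type == "title" or (current and current_len + len(text) > max_chars):
            flush()
        current.append(text)
        current_len += len(text)
    flush()
    return chunks
```

In this sketch, a single element longer than max_chars is kept whole; splitting inside an element would need an additional rule.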

Encapsulate your custom chunking logic in a top-level callable in payload/entrypoint.py. In the initiated function package scaffold, the callable is:

  • request (SearchIndexChunkingV1Request): The typed request object containing input documents to chunk.
  • runtime (Runtime): Runtime context provided by the platform (required by the callable signature).
  • returns (SearchIndexChunkingV1Response): The typed response object containing generated chunks under output.
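
For orientation only, a callable with these parameters could be sketched as follows. The function name chunk and the helper get_input are hypothetical, and Any stands in for the SearchIndexChunkingV1Request, Runtime, and SearchIndexChunkingV1Response types because their import paths aren't shown in this topic. The signature in your generated scaffold takes precedence over this sketch.

```python
from typing import Any


def get_input(request: Any) -> list[dict[str, Any]]:
    """Read the input list whether the request is a dict or a typed object."""
    # Assumption: the request exposes input either by key or by attribute.
    if isinstance(request, dict):
        return list(request.get("input") or [])
    return list(getattr(request, "input", None) or [])


# Hypothetical callable name; keep the name and type hints from your scaffold.
def chunk(request: Any, runtime: Any) -> dict[str, Any]:
    elements = get_input(request)
    # Chunking logic goes here; this skeleton returns no chunks.
    _ = elements
    return {"output": []}
```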

The runtime reads the entry point file from payload/config.json and executes your callable from that file. Keep the callable signature that the initiated scaffold generates.

The callable receives request and runtime. The request object uses the SearchIndexChunkingV1Request schema, which contains one top-level key, input.

Request Key | Data Type | Required | Description
input | List of DocElement | Yes | List of document elements to chunk.

Each DocElement in the list is a dictionary representing a piece of content from the source document.

Key | Description
text | The text content of the element.
metadata | Information about the element (type, origin, and so on).

Example:
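
A DocElement might look like the following sketch, written as a Python dictionary. The values are illustrative only and aren't taken from a real payload.

```python
# Illustrative values only; real payloads come from your Data 360 source data.
doc_element = {
    "text": "Quarterly revenue grew 12% year over year.",
    "metadata": {
        "type": "text",
        "page_number": 3,
    },
}

# Elements are plain dictionaries, so standard lookups apply.
# The type key is optional, so fall back to text when it is missing.
element_type = doc_element["metadata"].get("type", "text")
```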

The fields inside metadata vary by element type.

The element type is specified by the type key inside metadata. Types are grouped into these categories.

Type | Description
title | Structural titles and section headers.
text | Standard blocks of text.
list_item | Individual items within a list.
table | Structured, tabular data where text contains extracted table text. When Code Extension parsing and preprocessing is enabled, metadata can also include table HTML.
code_snippet | Pre-formatted source code or command snippets.
image | Visual elements (charts, diagrams, pictures). Metadata includes image_base64.
page_metadata | Peripheral content such as headers, footers, and page numbers.
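
One way to act on the element type is to decide, per element, which text to chunk. The following sketch is a hypothetical policy, not platform behavior: it skips page_metadata elements and prefers table HTML over extracted table text when text_as_html is present. The helper name select_text is an assumption.

```python
from typing import Any, Optional


def select_text(element: dict[str, Any]) -> Optional[str]:
    """Pick the text to chunk for an element, or None to skip it."""
    metadata = element.get("metadata") or {}
    element_type = metadata.get("type", "text")  # type defaults to text
    if element_type == "page_metadata":
        return None  # policy choice: drop headers, footers, and page numbers
    if element_type == "table" and metadata.get("text_as_html"):
        return metadata["text_as_html"]  # keep table structure when available
    text = (element.get("text") or "").strip()
    return text or None
```

Whether to drop peripheral content or keep table HTML depends on your retrieval goals, so treat both rules as starting points.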

For simple file formats, for example, .html, .txt, .log, and .aspx, you can pass the entire file content as a single DocElement. In that case, text holds the full content and the metadata type is text.
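
When a whole file arrives as one DocElement, the function has to find its own boundaries inside text. The following sketch packs paragraphs into pieces up to a character limit. The helper name split_text, the blank-line paragraph separator, and the default limit are assumptions chosen for illustration.

```python
def split_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines, packing paragraphs into pieces up to max_chars."""
    pieces: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # ignore empty paragraphs
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # The next paragraph doesn't fit, so close the current piece.
            pieces.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

A paragraph longer than max_chars is kept whole in this sketch. HTML content would likely need tag-aware splitting instead.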

Use this as the complete metadata field reference.

Metadata Key | Data Type | Required | Description
type | Enum (text, title, table, image, list_item, code_snippet, page_metadata) | No | Element type. Defaults to text when omitted.
page_number | Integer | No | Page number in the source document.
transcript_fields | Object | No | Transcript metadata for audio/video content.
text_as_html | String | No | HTML representation of content (for example, table HTML).
source_dmo_fields | Dictionary (string -> string/int/float) | No | Source DMO field values as key-value pairs.
prepend | List of objects | No | Fields whose values are prepended before indexing.
image_base64 | String | No | Base64 image data for image-type elements.
image_mime_type | String | No | Image MIME type (for example, image/png).
image_type | String | No | Semantic image category (for example, diagram).

Nested object fields:

  • transcript_fields.speaker (String, optional)
  • transcript_fields.start_timestamp (Float, optional)
  • transcript_fields.end_timestamp (Float, optional)
  • prepend[].dmo_name (String, optional)
  • prepend[].field_name (String, optional)
  • prepend[].value (String, optional)

Input metadata by source object type:

  • Unstructured Data Model Objects (UDMOs): Input metadata sent to your function can include fields such as Col1__c through Col5__c, ContentType__c, DataSourceObject__c, DataSource__c, ETag__c, FilePath__c, InternalOrganization__c, KQ_FilePath__c, LastModified__c, Metadata__c, ParentId__c, ResolvedFilePath__c, and Size__c, as applicable.
  • Structured DMOs: Input metadata sent to your function can include fields such as datasourceobject__c and datasource__c when chunking from a specific field.

These are input fields provided in the request metadata when available. You don’t create these metadata fields in your function. Use available fields in your citations to trace chunks back to their source.
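
Carrying metadata into citations can be sketched as follows. The helper name build_citations and the CITATION_FIELDS selection are hypothetical. The sketch also assumes that a source field can appear either at the top level of metadata or under source_dmo_fields, so it checks both places, and it converts values to strings because citations hold string key-value pairs.

```python
from typing import Any

# Hypothetical selection of fields to carry into citations.
CITATION_FIELDS = ("FilePath__c", "page_number", "type")


def build_citations(metadata: dict[str, Any]) -> dict[str, str]:
    """Copy selected metadata fields into a string-to-string citations dict."""
    # Assumption: source fields may sit at the top level of metadata or
    # be nested under source_dmo_fields, so both places are checked.
    nested = metadata.get("source_dmo_fields") or {}
    citations: dict[str, str] = {}
    for field in CITATION_FIELDS:
        value = metadata.get(field, nested.get(field))
        if value is not None:
            citations[field] = str(value)  # citations values must be strings
    return citations
```

Fields that are missing from the request are left out rather than filled with placeholders, which keeps citations limited to values the platform actually supplied.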

Your function generates the output. It must return a dictionary with an output key. The value of output must be a list of Chunk dictionaries, in sequential order.

Use this table to decide which keys are mandatory.

Key | Required | Description
output | Yes | List of chunk dictionaries returned by your function.

Each chunk in the list must be a dictionary with these keys.

Key | Data Type | Required | Description
text | String | Yes | The text content of the chunk.
seq_no | Integer | Yes | The sequence number of the chunk, starting from 1. Must be a continuous, ordered sequence.
chunk_type | String | Yes | Chunk type. Use text.
citations | Dictionary of string key-value pairs | No | Metadata from source DocElements (for example, FilePath__c, page_number) to identify the origin of the chunk.

The initiated project sample can include additional keys that are ignored by the typed model. Keep your function output aligned with the response schema by always returning a valid output list with correctly ordered chunk objects.
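
A small local check can catch shape problems before you run the package validation. The following sketch tests the rules stated in this topic: a top-level output list, the three required chunk keys, seq_no values that count up from 1, and string-to-string citations. The function name validate_response is hypothetical, and the check is a convenience for local use, not a copy of the platform's own validation.

```python
from typing import Any

REQUIRED_CHUNK_KEYS = ("text", "seq_no", "chunk_type")


def validate_response(response: Any) -> list[str]:
    """Return a list of problems found in a chunking response (empty if none)."""
    if not isinstance(response, dict) or not isinstance(response.get("output"), list):
        return ["response must be a dictionary with an output list"]
    problems: list[str] = []
    for position, chunk in enumerate(response["output"], start=1):
        if not isinstance(chunk, dict):
            problems.append(f"chunk {position}: not a dictionary")
            continue
        missing = [key for key in REQUIRED_CHUNK_KEYS if key not in chunk]
        if missing:
            problems.append(f"chunk {position}: missing {', '.join(missing)}")
            continue
        # seq_no must start at 1 and increase by 1 with no gaps.
        if chunk["seq_no"] != position:
            problems.append(f"chunk {position}: seq_no is {chunk['seq_no']}, expected {position}")
        citations = chunk.get("citations", {})
        if not isinstance(citations, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in citations.items()
        ):
            problems.append(f"chunk {position}: citations must map strings to strings")
    return problems
```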

Use this template as a minimal baseline when you start authoring.

The following example creates one chunk per DocElement.
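
A minimal sketch of that approach follows. The callable name chunk is hypothetical, Any stands in for the scaffold's typed request, runtime, and response classes, and the code assumes the request exposes the input list by key or by attribute. Keep the signature and type hints from your own scaffold.

```python
from typing import Any


# Hypothetical callable name; keep the name and type hints from your scaffold.
def chunk(request: Any, runtime: Any) -> dict[str, Any]:
    """Return one chunk for each non-empty DocElement in the request."""
    # Assumption: the request exposes the input list by key or by attribute.
    if isinstance(request, dict):
        elements = request.get("input") or []
    else:
        elements = getattr(request, "input", None) or []

    output: list[dict[str, Any]] = []
    for element in elements:
        text = (element.get("text") or "").strip()
        if not text:
            continue  # skip empty elements so seq_no stays continuous
        metadata = element.get("metadata") or {}
        chunk_dict: dict[str, Any] = {
            "text": text,
            "seq_no": len(output) + 1,  # starts at 1, no gaps
            "chunk_type": "text",
        }
        if "page_number" in metadata:
            # citations values are strings, so convert the page number
            chunk_dict["citations"] = {"page_number": str(metadata["page_number"])}
        output.append(chunk_dict)
    return {"output": output}
```

Numbering chunks from the length of the output list, rather than from the element index, keeps seq_no continuous even when some elements are skipped.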

  1. Use the auto-generated sample payload payload/tests/test.json to run an initial validation of your custom logic in payload/entrypoint.py.

  2. Validate the package by running this command.

    Example:

    • --entrypoint ./payload/entrypoint.py (required): Path to your function entry point.
    • --test-with ./payload/tests/test.json (required): Path to the JSON file that provides test input for the function run.
    • --target-org <ORG_ALIAS> (optional): The alias or username of the Salesforce org where you want to run the code extension package. Required when your function testing includes callouts to generative or predictive models registered in your org. The org must have Data 360 enabled with the appropriate permissions.
    • --config-file <path> (optional): JSON file that supplies input for the run. If omitted, the command uses the default payload/config.json.
    • --dependencies <value> (optional): Comma-separated dependency overrides for local runs.
  3. If you need additional scenarios, create more test payloads in payload/tests/ by following the same request structure as test.json, and run the command again with the new file path in --test-with.

  4. Verify that the returned dictionary has the correct output shape, such as a list of chunks with text, seq_no, chunk_type, and optional citations, and that chunking logic behaves as expected.

  5. Refresh package metadata before packaging or deployment.

    Example:

    • --entrypoint ./payload/entrypoint.py (optional if you run the command from the package root): Path to your function entry point.
    • --dry-run (optional): Preview changes without modifying files.
    • --no-requirements (optional): Update config.json without changing requirements.txt.
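
For step 3, additional test payloads can be generated with a short script instead of being edited by hand. The following sketch assumes the payload file mirrors the request schema, with a single top-level input key; compare it against your generated test.json before relying on it. The file name edge_cases.json, the helper names, and the sample elements are hypothetical.

```python
import json
from pathlib import Path


def build_payload(elements: list[dict]) -> str:
    """Serialize DocElements under the top-level input key."""
    return json.dumps({"input": elements}, indent=2)


def write_payload(path: str, elements: list[dict]) -> None:
    """Write a test payload file, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(build_payload(elements), encoding="utf-8")


# Hypothetical edge-case scenario: an empty element next to a table element.
edge_case_elements = [
    {"text": "", "metadata": {"type": "text"}},
    {"text": "Model A 10 kg", "metadata": {"type": "table", "page_number": 2}},
]

# Uncomment to write the file, then pass its path to --test-with.
# write_payload("payload/tests/edge_cases.json", edge_case_elements)
```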

For best practices and limitations (for example, deterministic output, edge-case handling, and execution model constraints), see Considerations When Writing Custom Chunking Functions.

If you want to implement prompt-driven chunking logic by calling the LLM gateway from your function, see Use LLM Gateway Calls in Custom Chunking Functions.