Considerations When Writing Custom Code (Beta)
To write custom code with the code extension feature in Data 360, follow these best practices and keep in mind these architectural constraints and Spark API limitations.
Code extension is a pilot or beta service that is subject to the Beta Services Terms at Agreements - Salesforce.com or a written Unified Beta Agreement if executed by Customer, and applicable terms in the Product Terms Directory. Use of this pilot or beta service is at the Customer's sole discretion.
| Editions |
|---|
| Available in: Developer, Enterprise, Performance, and Unlimited Editions. See Data 360 edition availability. |
Keep your custom code logic in DataFrame and Dataset operations. When you write custom code for batch data transforms, use DataFrame operations (such as filter(), groupBy(), agg(), join(), and withColumn()) rather than iterating over data row by row. These operations are optimized for server-side execution and provide the best performance in Data 360's execution environment.
Prefer supported Spark-native expressions and built-in functions that run on the server over client-side iteration and custom UDFs (user-defined functions). Use functions from pyspark.sql.functions and other Spark-native APIs in your custom code. This approach maximizes performance and ensures compatibility with Data 360's execution model.
Keep your transformations free of side effects. The same execution plan can be analyzed or optimized multiple times. Eliminating side effects ensures your code produces consistent results when run with the same input data.
Treat code extension like a separate API connection: not all Spark features are available in code extension, so test your code thoroughly in a sandbox before deploying to production.
In your custom code, log important steps in your transformation logic, handle errors gracefully, and provide meaningful error messages. Review execution logs in the Data 360 UI to troubleshoot issues.
For more information, see Write and Validate Custom Scripts (Beta).
These features are not supported in the Data 360 execution environment:
- Custom UDFs
- Spark Listeners
- Spark Extensions
- Full access to configuration options
Code extension is built on DataFrame and Dataset APIs. RDDs are not supported. Use DataFrame operations instead.
Many SparkContext-era patterns, such as custom accumulators, arbitrary driver-side callbacks, and some listeners, don't map cleanly to Data 360's execution model. Use DataFrame and Dataset APIs and supported patterns instead.
Operations such as calling a service per row, writing to a database inside a map operation, or incrementing counters can behave unexpectedly due to retries or replanning. Call external services in a controlled manner, outside of row-level transformations.
If downstream processes need data, write results to a Data 360 object (DLO or DMO) rather than returning them to the driver with collect(). Large result sets can cause memory issues and performance problems.
Data 360's execution environment has different performance characteristics from in-process Spark. Extra serialization and network overhead mean that chatty patterns (many tiny actions) can hurt performance more than in in-process Spark. Design your code for efficient batch operations. Performance when you run locally with the SDK can differ from performance when the same code runs in production in Data 360.
Even printSchema() or schema analysis can trigger resolution paths that surface issues, such as duplicate columns, earlier than you expect. These operations also have performance implications in the Data 360 execution environment.