Building a Batch Retry Framework With BatchApexErrorEvent

Today we all depend on many cloud-based services to help us go about our lives smoothly. When those services fail us, things can get frustrating, especially when it’s not clear what was at the root of the issue or how to get back on track. In this article, we dig into the new BatchApexErrorEvent platform event along with a sample app and framework, to see how it can be used to build an error reporting and retry facility for Batch Apex.

Batch Apex jobs allow you to orchestrate the execution of code over a set of data in the background. Apex exceptions can be thrown intentionally, because of bugs in your code, or when limits are hit. By design, Batch Apex rolls back only the scope (chunk of work) it is executing at the time of the failure. This means the job as a whole is not affected, only the chunks that failed. To help us explore this and the new BatchApexErrorEvent feature, we are going to use a simple job that generates Invoices from a given set of Orders. The Invoice Generation job is started by clicking a button on the Orders list view.

The Apex Jobs page under Setup gives an overview of the job as it’s processing and after completion. The sample Order data has been deliberately seeded with records that will cause exceptions in the code. The UI below shows us that the platform split the 1000 Orders passed to the InvoiceGenerationJob into five chunks of work, of which two completed and three failed. The chunk size for the purposes of this illustration was set to 200, which means 600 Orders failed to process.

We can see that the UI tells us the reason why the first of the three failed chunks failed (200 out of the 600 failures), but what about the other two chunks? For these, the platform does email the job owner this information, but email is not suitable for centralized logging, which would give broader visibility and help resolve issues more quickly.

Improving error capture, visibility and retry

In Winter’19 the BatchApexErrorEvent standard platform event was introduced (in Beta at the time of writing). This event extends the above error reporting facilities with the ability to use Platform Events to listen (subscribe) to all job failures in a variety of ways using clicks or code (clicks are not supported in Beta). The fields on the event give rich access to the exception type, stack trace, affected scope (records) and job ID. You can review a full list of the available fields here.

BatchApexErrorEvent. An event record provides more granular error tracking than the Apex Jobs UI. It includes the record IDs being processed, exception type, exception message, and stack trace. You can also incorporate custom handling and retry logic for failures. You can invoke custom Apex logic from any trigger on this type of event, so Apex developers can build functionality like custom logging or automated retry handling.
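As a rough illustration of subscribing with code, the sketch below is a standalone trigger that writes the main event fields to the debug log. The trigger name is hypothetical and is not part of the sample app. The fields used (AsyncApexJobId, Phase, ExceptionType, Message, StackTrace and JobScope) are standard fields on BatchApexErrorEvent.

```apex
// Hypothetical standalone subscriber; not part of the sample app.
// Platform event triggers support the after insert event only.
trigger LogBatchApexErrors on BatchApexErrorEvent (after insert) {
    for (BatchApexErrorEvent evt : Trigger.new) {
        // JobScope holds the failed chunk's record IDs as a comma-separated string.
        List<String> recordIds = String.isBlank(evt.JobScope)
            ? new List<String>()
            : evt.JobScope.split(',');

        System.debug(LoggingLevel.ERROR,
            'Job ' + evt.AsyncApexJobId +
            ' failed in phase ' + evt.Phase +
            ' with ' + evt.ExceptionType + ': ' + evt.Message +
            ' (' + recordIds.size() + ' records in scope)');
        System.debug(LoggingLevel.ERROR, evt.StackTrace);
    }
}
```

A subscriber like this only receives events from Batch Apex classes that implement the Database.RaisesPlatformEvents marker interface.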

In this article, we are going to use Apex to log the errors received from this new event and provide a means to review them in the Lightning Experience Utility Bar. In addition to that we are going to provide a means for a user to retry (hopefully once issues have been addressed) only those parts of the jobs that failed. The GitHub repository associated with this article contains the full source code for the Apex code and Lightning components shown throughout this article. All classes and components are prefixed with brf (batch retry framework) for ease of recognition.

NOTE: The “Too many SOQL queries: 201” exception shown above is actually a limit exception. Limit exceptions cannot be caught with try/catch, so prior to this feature it was not possible to write error handling code that runs when they are thrown. Well now you can!

The Batch Job Failures view above provides a dropdown action to retry specifically the failed chunks of a completed job. The Bad Orders list view provides an easy way for the purposes of this demo to access the rows causing the problem! Let’s deliberately only delete one of them, then use the Retry action to ask the retry framework (more on this later) to rerun the processing over the previously affected records and see what happens.

If any errors still remain, the Batch Job Failures view continues to report them until everything is cleared out. Clicking the Refresh button causes it to check for fresh logs. In our case there are still two remaining. You can delete the other two bad Order records and retry again to clear things up. This framework could be enhanced further to use the new Streaming API Lightning Component to listen for log updates and automatically refresh the view.

Show me the code!

So let’s dig into how the above was achieved by walking through some key aspects of the sample code included here. First, let’s take a look at the InvoiceGenerationJob code. At first glance it looks very much like a standard Batch Apex class, right?

The execute method re-queries the records. This is a best practice that avoids working with stale records from the scope parameter, which matters for long running jobs. It also gives the retry framework we are building here a convenient assumption when retrying failed scopes, where only the scope record IDs are known: execute needs nothing from the scope beyond the IDs.
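The re-query pattern described above can be sketched as follows. This is not the sample app's InvoiceGenerationJob: the class name is hypothetical, the start query is simplified to all Orders, and invoice creation is left as a comment. To stay self-contained it implements Database.RaisesPlatformEvents directly rather than the framework's own interface.

```apex
// A sketch only; names other than the platform types are illustrative.
public class InvoiceGenerationJobSketch
        implements Database.Batchable<SObject>, Database.RaisesPlatformEvents {

    public Database.QueryLocator start(Database.BatchableContext ctx) {
        // Only IDs are needed here because execute re-queries each chunk.
        return Database.getQueryLocator([SELECT Id FROM Order]);
    }

    public void execute(Database.BatchableContext ctx, List<SObject> scope) {
        // Re-query by ID so the chunk works on fresh field values, whether the
        // scope came from the original run or from a later retry.
        Set<Id> orderIds = new Map<Id, SObject>(scope).keySet();
        List<Order> orders =
            [SELECT Id, OrderNumber, TotalAmount FROM Order WHERE Id IN :orderIds];
        for (Order ord : orders) {
            // Invoice creation for each Order would go here.
        }
    }

    public void finish(Database.BatchableContext ctx) { }
}
```

A job like this could be started with `Database.executeBatch(new InvoiceGenerationJobSketch(), 200);`, which matches the chunk size of 200 used in the earlier illustration.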

The produceSomeException method called from execute is used to create some chaos in our otherwise simple example.

The class implements the traditional Database.Batchable methods, as well as the handleErrors method required by the brf_BatchableErrorHandler interface (included in the sample code). The interface is defined as follows:

Note that this interface extends the new Database.RaisesPlatformEvents interface, which ensures that any Batch Apex class using the framework also implements Database.RaisesPlatformEvents. Database.RaisesPlatformEvents does not have any methods to implement; it is a marker interface that tells the platform to send the event. The handleErrors implementation is called by the framework code that handles the error event. More on this in just a moment!
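A minimal sketch of what such an interface might look like is shown here. The interface name, the handleErrors method name and the Database.RaisesPlatformEvents parent come from the description above. The parameter type and its fields are assumptions made for illustration, so check the repository for the actual signature.

```apex
// File 1: value class passed to the callback.
// Its shape is an assumption for this sketch.
public class brf_BatchableError {
    public Id asyncApexJobId;      // job that raised the event
    public List<Id> jobScope;      // record IDs in the failed chunk
    public String exceptionType;   // for example System.LimitException
    public String message;
    public String stackTrace;
}

// File 2: extending the marker interface means every class that
// implements this interface also asks the platform to raise events.
public interface brf_BatchableErrorHandler extends Database.RaisesPlatformEvents {
    void handleErrors(brf_BatchableError error);
}
```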

As mentioned above we are going to use Apex to subscribe to the BatchApexErrorEvent via an Apex Trigger. You can read more about subscribing to Platform Events this way here. Now just because this is one of those fancy new Platform Event triggers does not mean we should abandon our best practices! The actual trigger code is thus very small…

NOTE: The formal documentation for this feature has a more self contained example here.

Per best practice the above trigger delegates to the brf_BatchApexErrorEvents class, which encapsulates the behavior of dealing with the error events as they are received. A full walkthrough of the handle method is beyond the scope of this article. Principally though, it deals with storing the error information from the event in custom objects and making a callback to the Batch Apex class via the handleErrors method on the brf_BatchableErrorHandler interface covered earlier.
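The delegation described above can be sketched roughly as follows. The handler class name and its handle method come from the article. The trigger name, the constructor signature and the body of handle are assumptions. Only the event-decoding part is shown, with the logging and callback steps left as a comment.

```apex
// File 1: a thin trigger that delegates immediately (trigger name assumed).
trigger brf_BatchApexErrorEventTrigger on BatchApexErrorEvent (after insert) {
    new brf_BatchApexErrorEvents(Trigger.new).handle();
}

// File 2: handler sketch (constructor and body are assumptions).
public class brf_BatchApexErrorEvents {
    private List<BatchApexErrorEvent> events;

    public brf_BatchApexErrorEvents(List<BatchApexErrorEvent> events) {
        this.events = events;
    }

    public void handle() {
        // Resolve the Apex class behind each failed job with a single query.
        Set<Id> jobIds = new Set<Id>();
        for (BatchApexErrorEvent evt : events) {
            jobIds.add(evt.AsyncApexJobId);
        }
        Map<Id, AsyncApexJob> jobs = new Map<Id, AsyncApexJob>(
            [SELECT Id, ApexClass.Name FROM AsyncApexJob WHERE Id IN :jobIds]);

        for (BatchApexErrorEvent evt : events) {
            AsyncApexJob job = jobs.get(evt.AsyncApexJobId);
            if (job == null) {
                continue;
            }
            List<String> scopeIds = String.isBlank(evt.JobScope)
                ? new List<String>()
                : evt.JobScope.split(',');
            // A framework would store a log record here and then call
            // handleErrors on an instance of the job's Apex class.
            System.debug(job.ApexClass.Name + ': ' + scopeIds.size() + ' failed records');
        }
    }
}
```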

Decomposing BatchApexErrorEvent into Custom Objects

To allow users to review the errors, the above handler code receives the event and stores everything in two custom objects as shown below. The master-detail usage allows for some rollups on the information and ease of cleanup when needed. You can also see that the Order object has a lookup to the failure records for added visibility when viewing Orders.

Handling and retrying errors

The framework makes the distinction between code that should be executed at the time the error event is received vs when the retry action is invoked. The handleErrors and execute methods on the InvoiceGenerationJob Batch Apex class are mapped to these two operations. The following diagram shows how and when they get invoked by the framework.

When the handleErrors method is called, you can consider updating related records with an error status. For example, to make the errors more contextual, the code below associates them with the Order records. Note also that in this execution context (platform event subscription) the event trigger code is running as the “Automated Process” user and not the user that invoked the job (see here for more on this). The InvoiceGenerationJob.handleErrors method is shown below:
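As a sketch of the kind of logic a handleErrors implementation might contain, the helper below points each affected Order at a failure record. It is not the sample app's code. The class name, method signature and lookup field API name are all hypothetical. The field is set by name with put so that the sketch does not depend on a particular custom field existing at compile time.

```apex
// Hypothetical helper; the class, method and field names are assumptions.
public class OrderFailureTagger {
    // Assumed API name of the Order lookup to the failure record.
    private static final String FAILURE_LOOKUP = 'BatchJobFailure__c';

    public static void tagOrders(List<Id> failedOrderIds, Id failureRecordId) {
        List<Order> updates = new List<Order>();
        for (Id orderId : failedOrderIds) {
            Order ord = new Order(Id = orderId);
            ord.put(FAILURE_LOOKUP, failureRecordId);
            updates.add(ord);
        }
        // allOrNone = false, so one locked or invalid Order does not
        // prevent the remaining Orders from being tagged.
        Database.update(updates, false);
    }
}
```

Because this runs as the Automated Process user, the update is recorded against that user rather than the person who started the job.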

NOTE: So long as the Batch Apex class implementing the corresponding job has a default constructor, the above method is called automatically from within the brf_BatchApexErrorEvents.handle method (invoked by the event Apex Trigger).
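The default constructor requirement follows from how Apex creates an object when only the class name is known. The sketch below uses hypothetical class and method names and shows the standard Type.forName and newInstance calls that this kind of instantiation relies on.

```apex
// Hypothetical utility; shows why a no-argument constructor is needed.
public class JobInstantiationSketch {

    // Returns an instance of the named class, or null if the class cannot
    // be resolved or does not opt in to batch error events.
    public static Object newJobInstance(String apexClassName) {
        Type jobType = Type.forName(apexClassName);
        if (jobType == null) {
            return null; // class not found or not visible
        }
        // newInstance() invokes the no-argument constructor.
        Object instance = jobType.newInstance();
        return (instance instanceof Database.RaisesPlatformEvents) ? instance : null;
    }
}
```

A caller would then cast the result to the handler interface before invoking handleErrors on it.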

When the user selects the Retry action from the Batch Job Failures view, the framework automatically calls the InvoiceGenerationJob.execute method for the failed scopes via a generic Batch Apex job called brf_BatchableRetryJob. This again automatically instantiates the applicable Batch Apex class so long as it has a default constructor. Provided the code in the execute method re-queries the records it needs (as highlighted above), it does not need to care whether it is being called for the first time or as part of a retry.

Why use another Batch Apex job for Retry? The framework uses Batch Apex for retry because it allows for the same level of error handling as the original job (in case of repeat failures) and allows for greater scope and limits than, for example, trying to retry scopes in a synchronous Apex context.

The brf_BatchableRetryJob class also uses the framework to ensure that if the retry attempts still fail (due to new issues or original issues that have not yet been addressed), those failures continue to be correctly logged. There is no escape! The framework optimistically deletes old logs when submitting a retry job, so it is self-cleaning. Of course, if users feel the issues have been addressed, the logs can be manually deleted as well.
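The retry approach described above can be sketched as a generic job that replays failed record IDs through the original class's execute method. This is a hypothetical simplification and not the sample app's brf_BatchableRetryJob. The class name and constructor parameters are assumptions, and the log cleanup is omitted.

```apex
// Hypothetical sketch; the sample app's brf_BatchableRetryJob may differ.
public class RetryJobSketch
        implements Database.Batchable<SObject>, Database.RaisesPlatformEvents {

    private String originalClassName;  // for example 'InvoiceGenerationJob'
    private List<Id> failedRecordIds;  // IDs parsed from the logged scopes

    public RetryJobSketch(String originalClassName, List<Id> failedRecordIds) {
        this.originalClassName = originalClassName;
        this.failedRecordIds = failedRecordIds;
    }

    public Iterable<SObject> start(Database.BatchableContext ctx) {
        // Rebuild ID-only records; the original execute is expected to re-query.
        List<SObject> scope = new List<SObject>();
        for (Id recordId : failedRecordIds) {
            scope.add(recordId.getSObjectType().newSObject(recordId));
        }
        return scope;
    }

    public void execute(Database.BatchableContext ctx, List<SObject> scope) {
        // Requires a no-argument constructor on the original job class.
        Database.Batchable<SObject> original =
            (Database.Batchable<SObject>) Type.forName(originalClassName).newInstance();
        original.execute(ctx, scope);
    }

    public void finish(Database.BatchableContext ctx) { }
}
```

Because the sketch itself implements Database.RaisesPlatformEvents, a chunk that fails again during retry raises a fresh BatchApexErrorEvent. It could be submitted with `Database.executeBatch(new RetryJobSketch('InvoiceGenerationJob', failedIds), 200);`, where failedIds is a list of record IDs taken from the stored logs.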

Conclusion

The BatchApexErrorEvent event enables a rich set of possibilities for adding more visibility and control over how you handle errors in your Batch Apex jobs. While not yet supported in Beta, the goal is to support Process Builder and Flow as well, which would then allow admins to configure alerts and notifications. I am looking forward to seeing more features on the platform exposing more standard Platform Events. Meanwhile, thanks to the Apex team for leading the charge!

Notes

Support for Database.Stateful is not considered in the above. The challenge with retrying scopes from such jobs is reinstating the correct state for the failed scope. In such cases, the framework could be enhanced to allow the user to resubmit the original job providing the job was coded in such a way to support that scenario.
At the time of writing, only SObject-based Batch Apex jobs are supported by the framework. The BatchApexErrorEvent JobScope field will contain the toString output from other types, so it’s possible to extend support for this.
Automatic retry could easily be added by implementing the Apex Scheduler around the brf_BatchableRetryJob. In this case, the scope of this job could be extended to failed scopes across multiple jobs, not just one.
Carefully consider what logic you perform in the event handler (the handleErrors method if you are using this framework), as this is not running as a standard user: your logging and audit information will reflect the Automated Process user.
Make sure your Batch Apex class is at API version 44.0 or above.

As a beta feature, Batch Apex Error Events is a preview and isn’t part of the “Services” under your master subscription agreement with Salesforce. Use this feature at your sole discretion, and make your purchase decisions only on the basis of generally available products and features. Salesforce doesn’t guarantee general availability of this feature within any particular time frame or at all, and we can discontinue it at any time. This feature is for evaluation purposes only, not for production use. It’s offered as is and isn’t supported, and Salesforce has no liability for any harm or damage arising out of or in connection with it. All restrictions, Salesforce reservation of rights, obligations concerning the Services, and terms for related Non-Salesforce Applications and Content apply equally to your use of this feature.