In the first blog post in this series, we discussed how the Salesforce Pub/Sub API enables developers to build scalable and controllable integrations by leveraging event-driven architecture. We highlighted how Salesforce Data Cloud leverages Pub/Sub API to meet large-volume replication needs at scale. We also compared the performance of Pub/Sub API against other approaches and showed, through performance tests, that it scales system integrations more efficiently.

To maximize the performance, scalability, throughput, and reliability of API endpoints, it’s crucial to design and implement client applications following certain best practices. These best practices ensure that clients can handle high loads, maintain consistent performance, and recover gracefully from errors.

In this post, we’ll go over key recommendations from Salesforce experts to help you get the most out of eventing integrations.

Implement retry strategies with backoff

A backoff approach is often used with APIs to handle transient errors gracefully and improve overall system stability by waiting for a certain period before retrying a failed operation. Implementing retry strategies with backoff is crucial in event-based architectures to ensure that neither the client nor the server gets overwhelmed. These strategies come in several flavors, such as fixed backoff, incremental backoff, exponential backoff, and backoff with jitter; choose the one that best fits your client. In addition to backoff, limiting the number of retries, monitoring and logging retries to understand failure patterns, and degrading gracefully on persistent failure will help you build a more resilient and robust integration client that can handle transient failures without overwhelming either the client or the server.

Some tips on retries in times of failures and on handling API errors are documented in our developer documentation.

Flowchart representing fixed, incremental, and exponential backoff strategies
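As an illustration, here is a minimal Python sketch of the delay calculation behind the fixed, incremental, and exponential strategies shown above, with optional full jitter. The function name and default values are our own choices, not part of any Salesforce SDK:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  strategy: str = "exponential", jitter: bool = False) -> float:
    """Compute the wait time (in seconds) before retry number `attempt` (0-based)."""
    if strategy == "fixed":
        delay = base                      # same wait every time: 1s, 1s, 1s, ...
    elif strategy == "incremental":
        delay = base * (attempt + 1)      # grows linearly: 1s, 2s, 3s, ...
    else:  # exponential
        delay = base * (2 ** attempt)     # doubles each time: 1s, 2s, 4s, 8s, ...
    delay = min(delay, cap)               # never wait longer than the cap
    if jitter:
        # "Full jitter" picks a random wait in [0, delay] so that many clients
        # retrying at once don't all hit the server at the same instant.
        delay = random.uniform(0, delay)
    return delay
```

A retry loop would call this with the current attempt number, sleep for the returned duration, and give up once a maximum retry count is reached.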

Implement debugging features like trace ID or RPC ID right from the get-go

When designing your application, make sure that you’re planning for and implementing debugging features, such as adding a trace ID and supporting rpc-id or any other unique identifiers that can help speed up and simplify the debugging process. These IDs come in handy on both the client side and the server side when support cases are created with Salesforce.

Each RPC call response originating from Pub/Sub API includes an rpc-id field, which identifies the method execution in Salesforce. Salesforce Customer Support uses this rpc-id to troubleshoot issues. If a call fails, the exception includes the rpc-id; be sure to provide it when reaching out to support to fast-track a response. Even with successful calls, issues like slow event publishing can occur. If the issue isn’t originating from your code and you want to reach out to Salesforce, provide the rpc-id to Salesforce Customer Support for quicker assistance.

To help with troubleshooting errors that don’t originate from Pub/Sub API and don’t include the RPC ID, add the x-client-trace-id metadata to every RPC request that the client makes to Pub/Sub API as part of the authorization headers.
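For example, a small helper can append a freshly generated x-client-trace-id to the authorization metadata the client already sends on each call. This is a sketch: the auth header keys mentioned in the docstring follow the Pub/Sub API docs, while the helper itself is our own:

```python
import uuid


def call_metadata(auth_metadata):
    """Return per-call gRPC metadata with a unique x-client-trace-id appended.

    `auth_metadata` is the sequence of (key, value) pairs your client already
    sends for authorization (for Pub/Sub API: accesstoken, instanceurl, and
    tenantid, per the developer docs).
    """
    trace_id = str(uuid.uuid4())  # one fresh ID per RPC request
    return list(auth_metadata) + [("x-client-trace-id", trace_id)]
```

The result is passed as the `metadata` argument of each stub call; log the trace ID on the client side as well, so that it can be quoted in a support case.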

Explore streaming over unary for high throughput cases

The PublishStream method (see docs) uses bidirectional streaming. Clients don’t need to wait for publish responses before sending a new publish request: they can queue multiple publish batches, allowing for a higher publish rate by asynchronously publishing more events while previous publish operations are still being processed on the server. As a result, using the PublishStream method helps clients efficiently scale to handle a large volume of publish requests. Some considerations around event ordering, keeping the publish stream alive, and handling errors are documented in our developer docs. In the Summer ’24 release, we increased the timeout for retaining the client-server connection during publish to 30 minutes.

A flowchart showing that bidirectional streaming increases throughput by allowing the client to send requests and receive responses simultaneously
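As a sketch of the client side of this pattern, the generator below chunks a sequence of events into fixed-size batches that a PublishStream request iterator could yield without waiting for responses. The batch size is an assumption, and a real client would wrap each batch in a PublishRequest message and consume PublishResponses concurrently from the response iterator:

```python
def publish_batches(events, batch_size=200):
    """Yield fixed-size batches of events for a PublishStream request iterator.

    With bidirectional streaming, each yielded batch can be sent without
    waiting for the PublishResponse of the previous one; responses arrive
    asynchronously on the response side of the stream.
    """
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # hand the full batch to the stream
            batch = []
    if batch:
        yield batch              # flush the final partial batch
```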

Expect network connectivity failures for bidirectional streams

Clients should implement a retry mechanism to handle any network failures in their integrations. When developing your client, it’s crucial to consider the potential for network connectivity issues and design with failure scenarios in mind. A robust retry strategy will help ensure that these issues are effectively managed in most cases.
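A minimal reconnect loop might look like the following sketch. Here, `open_stream` stands in for whatever function opens and consumes your subscribe stream; a real gRPC client would pass `(grpc.RpcError,)` as the retryable exception types rather than `ConnectionError`, and the names and defaults are our own:

```python
import time


def run_stream_with_reconnect(open_stream, max_retries=5, base_delay=1.0,
                              retry_on=(ConnectionError,), sleep=time.sleep):
    """Open and consume a streaming call, reconnecting on dropped connections.

    open_stream: zero-argument callable that opens and drains the stream
    (e.g., a wrapper around stub.Subscribe) and raises when the stream drops.
    retry_on: exception types treated as transient network failures.
    sleep is injectable so the backoff can be tested without real waits.
    """
    for attempt in range(max_retries + 1):
        try:
            return open_stream()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            # Exponential backoff, capped at 30s, before reconnecting
            sleep(min(base_delay * (2 ** attempt), 30.0))
```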

Share a gRPC channel for multiple requests to the server

A gRPC channel operates over a single HTTP/2 connection, allowing multiple calls to be handled simultaneously on that connection. Creating a new gRPC channel for each call unnecessarily increases the time taken to process the request, since each call then requires multiple network round trips between the client and the server to establish a new connection. Therefore, it is recommended to share an existing gRPC channel across multiple requests to the server. You can send multiple concurrent requests per gRPC channel. When the number of active calls hits the connection’s stream limit, any additional calls are placed in a queue on the client side. These queued calls must wait for the active calls to finish before they can be processed.
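One way to sketch this pattern is a small cache that hands back the same channel for repeated requests to the same target. The `channel_factory` parameter stands in for `grpc.secure_channel(target, credentials)` and is injectable here purely so the sketch can be exercised without a live endpoint:

```python
class ChannelPool:
    """Cache one gRPC channel per target and reuse it for all calls."""

    def __init__(self, channel_factory):
        self._factory = channel_factory  # e.g., a wrapper around grpc.secure_channel
        self._channels = {}

    def get(self, target):
        # Reuse the existing HTTP/2 connection instead of opening a new one
        # (and paying the connection-setup round trips) for every call.
        if target not in self._channels:
            self._channels[target] = self._factory(target)
        return self._channels[target]
```

All stubs created on the returned channel share the same underlying connection, so concurrent calls are multiplexed over it rather than each opening a new connection.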

For managed subscriptions, commit often

When using managed subscriptions, our latest Pub/Sub API beta feature as of the Winter ’25 release, follow the best practice of committing often. Clients need to commit the replay ID of the last processed event so that the server can store it and resolve it when more events are requested. To ensure accuracy and stability, clients should commit often, ideally after every batch, even if the batch is empty; frequent commits keep the stream alive and the connection retained. Additionally, clients must commit within 30 minutes of receiving a ManagedFetchResponse to prevent the connection from closing.
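The commit-after-every-batch pattern can be sketched as follows. The field names mirror the beta proto’s CommitReplayRequest as we understand it, so treat them as assumptions and check pubsub_api.proto for your release; the helper functions are our own:

```python
import uuid


def build_commit(last_replay_id):
    """Build the commit fields to send after each ManagedFetchResponse batch."""
    return {
        "commit_request_id": str(uuid.uuid4()),  # lets you match the commit response
        "replay_id": last_replay_id,             # last successfully processed event
    }


def process_batch(events, last_replay_id):
    """Process one batch and always return a commit, even for an empty batch."""
    for event in events:
        # Process the event first, then advance the cursor to its replay ID.
        last_replay_id = event["replay_id"]
    return last_replay_id, build_commit(last_replay_id)
```

Committing even on empty batches is what keeps the server-side cursor fresh and the stream alive between bursts of events.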

Keepalive to keep the stream open

Keepalive mechanisms ensure persistent and reliable connections between the client and the server, so it’s crucial for the client to implement optimal keepalive behavior to maintain the integrity and continuity of event streaming. On a subscribe stream, if there are pending events, the stream remains open. However, if there are no pending events, the client must send a FetchRequest to keep the stream active. On the other hand, if there are pending events but no new events for the server to send, the server is responsible for keeping the stream alive by sending a keepalive signal.

For more information, see Subscribe Keepalive Behavior and ManagedSubscribe Keepalive Behavior.
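The client-side flow-control bookkeeping behind this can be sketched as a small counter: ask for `num_requested` events up front, and send another FetchRequest once they have all been delivered, even when there is nothing new to process. The class and method names here are our own:

```python
class FetchWindow:
    """Track outstanding requested events and signal when to send a FetchRequest."""

    def __init__(self, num_requested=100):
        self.num_requested = num_requested
        self.pending = 0  # events requested but not yet delivered

    def next_fetch_request(self):
        # Called at startup and whenever needs_fetch() becomes true; the dict
        # stands in for the FetchRequest message fields.
        self.pending += self.num_requested
        return {"num_requested": self.num_requested}

    def on_events(self, count):
        # Called as event batches arrive on the stream.
        self.pending -= count

    def needs_fetch(self):
        # With no pending events, the client must request more to keep
        # the stream active.
        return self.pending <= 0
```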

Conclusion

Salesforce Pub/Sub API is a powerful tool for building scalable and manageable integrations through event-driven architecture. The API is designed to efficiently handle large-volume replication and the eventing needs of all Salesforce platform events, as well as external events via Event Relay. Performance tests have demonstrated that Pub/Sub API is more efficient at scaling system integrations compared to other approaches.

In this post, we learned that following certain best practices when designing and implementing client applications is essential to maximizing the performance, scalability, throughput, and reliability of API endpoints. These practices help manage large volumes of requests, ensure consistent API performance, and enable smooth error recovery for clients when needed.


About the Author

Neha Ahlawat is a Product Management Director at Salesforce, where she is focused on the product areas of Metadata API, Tooling API, source tracking, and change sets. Follow her on X and LinkedIn.

Get the latest Salesforce Developer blog posts and podcast episodes via Slack or RSS.
