Improve Availability in Your Org

Think back to a time when your solution wasn’t working as expected. Maybe your users experienced poor performance when updating a record that had numerous Apex triggers, or an integration with the Salesforce REST API was overwhelmed by a high volume of calls due to an event. These experiences can range from a minor inconvenience to a complete service outage for your users. The good news: these limited-availability incidents are often preventable by thinking ahead and coding efficiently, with availability and resilience top of mind.

What is availability?

Availability is the percentage of time that a service successfully handles requests. Salesforce measures and tracks this percentage to make sure that our servers and services are highly available to every customer.

While server and service availability is critical to Salesforce, customer-experienced availability, which directly relates to how users experience Salesforce, is even more so. For example, can users log in to your org and update records quickly and reliably? Is data being updated through APIs efficiently without errors?

Customer-experienced availability is tricky to measure since every org, user, and company is unique. A user’s experience can vary greatly depending on many factors, such as Apex code and Lightning component design, geographic region, and network connection. Despite the variability of these factors, Salesforce considers availability a top priority as a part of our core value of trust.

Why should I care about availability?

Availability, or lack thereof, affects everyone: you, your users, and your company. A highly available org means that your users can carry out their responsibilities and support your company’s growth. An org that performs poorly or unreliably creates frustration and, in extreme cases, could harm your company financially or reputationally.

As a developer, how you code your Salesforce solution has a direct impact on your users’ experience. It’s important that your code not only satisfies functional and business requirements, but that your users can confidently interact with and execute the code you wrote, whenever they need it.

What common availability anti-patterns should I avoid?

As a developer, be on the lookout for these common anti-patterns and ask critical questions to avoid them.

Anti-pattern #1: Brittle integration designs

While it’s easy to focus on the “happy path” of an integration and code something simple, how does your integration hold up against exceptions or under pressure? Here are a few common questions to ask before coding an integration:

What is the expected volume of data that will go through the integration, and at what rate? The Salesforce REST API can insert or update up to 200 records in a single call. Making one API call to update 200 records is much faster and uses much less server capacity than making 200 API calls that update one record each.
Can your integration switch between the REST and Bulk API depending on the volume and rate of data coming in?
What if validation, trigger, or Flow logic changes in Salesforce in the future and then rejects the record update in the API?
What if Salesforce is down for maintenance? Can the integration hold pending API calls and reprocess them when Salesforce is available? Can the backlog efficiently handle the high volume of transactions?
How are unexpected errors handled? Are errors from the Salesforce API logged and reported for your team to investigate?

Addressing these questions up front in the integration design helps ensure a more reliable, resilient, and adaptable integration during unforeseen situations.
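To make the batching and backlog questions above concrete, here is a minimal Python sketch. It groups records into batches of at most 200 and keeps any failed batch in a backlog for later reprocessing. The `BatchingSender` class and the `send_batch` callable are hypothetical names for illustration only; `send_batch` stands in for whatever client code actually calls the Salesforce API and is assumed to raise an exception on failure.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

MAX_BATCH_SIZE = 200  # per-call record count mentioned in the questions above


def chunk(records: List[dict], size: int = MAX_BATCH_SIZE) -> Iterable[List[dict]]:
    """Yield successive batches of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


class BatchingSender:
    """Sends records in batches and holds failed batches for reprocessing."""

    def __init__(self, send_batch: Callable[[List[dict]], None]) -> None:
        # `send_batch` is a hypothetical wrapper around the real API call.
        # It is assumed to raise an exception when the call fails.
        self._send_batch = send_batch
        self.backlog: Deque[List[dict]] = deque()
        self.errors: List[str] = []  # recorded so the team can investigate

    def _try_send(self, batch: List[dict]) -> bool:
        """Send one batch; on failure, record the error and keep the batch."""
        try:
            self._send_batch(batch)
            return True
        except Exception as exc:
            self.errors.append(f"{type(exc).__name__}: {exc}")
            self.backlog.append(batch)
            return False

    def send(self, records: List[dict]) -> int:
        """Send all records; return the number of batches that succeeded."""
        return sum(self._try_send(batch) for batch in chunk(records))

    def reprocess_backlog(self) -> int:
        """Retry each backlogged batch once; return how many succeeded."""
        pending = len(self.backlog)
        return sum(self._try_send(self.backlog.popleft()) for _ in range(pending))
```

In this sketch, 450 records become three calls (200, 200, and 50 records) instead of 450 single-record calls. A real integration would also need to decide how long to wait before retrying and how to handle partial failures within a batch, which this example leaves out.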

Anti-pattern #2: Lack of Apex logic visibility in your “trigger framework”

It’s fantastic that so many developers are adopting the concept of a trigger framework to modularize and easily manage Apex logic. However, we’ve seen that this type of framework can also be a double-edged sword. For example:

Multiple development teams are working on Apex logic that runs in the same object’s before/after trigger. However, the teams don’t have visibility into each other’s work. As a result, the combined Apex logic consumes more resources than any one team expects and exceeds governor limits.
Orgs have logic spread across Flows, Process Builders, workflows, and Apex code. This consumes a significant amount of server capacity when executed and can cause logic conflicts and governor limit breaches, not to mention a maintenance nightmare with considerable tech debt.

To avoid conflict, ask the following questions whenever you or your team plan to add new logic or processes into your trigger framework:

What other logic will be running alongside what I’m about to code? How much of the governor limits are they consuming already, and how much can I use?
Can my logic be combined with other logic? For example, can we share any existing SOQL queries and add one or two more fields to the query? Can record field updates be bundled in other logic that has existing DML statements?
Will other developers or admins add more logic into the org, with or without triggers or Flows? How do we coordinate, so additional logic doesn’t conflict with what I do?
Can I do more with less? Saving one SOQL query or one DML statement might not look like much. However, if your org has thousands of transactions in an hour, you can save thousands of database queries and free up server capacity, which means your org is less likely to experience a performance degradation.
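The budgeting questions above can be worked through with simple arithmetic. The Python sketch below is only an illustration: the `LogicUnit` model, the handler names, and the usage figures are all hypothetical. The limits used (100 SOQL queries and 150 DML statements per synchronous Apex transaction) should be confirmed against the current Salesforce governor limits documentation.

```python
from typing import Dict, List, NamedTuple

# Synchronous per-transaction Apex limits; verify against current Salesforce docs.
SOQL_QUERY_LIMIT = 100
DML_STATEMENT_LIMIT = 150


class LogicUnit(NamedTuple):
    """One piece of logic that runs in the same transaction (hypothetical model)."""
    name: str
    soql_queries: int
    dml_statements: int


def remaining_budget(units: List[LogicUnit]) -> Dict[str, int]:
    """Return how many SOQL queries and DML statements are still available."""
    used_soql = sum(unit.soql_queries for unit in units)
    used_dml = sum(unit.dml_statements for unit in units)
    return {
        "soql": SOQL_QUERY_LIMIT - used_soql,
        "dml": DML_STATEMENT_LIMIT - used_dml,
    }


def queries_saved_per_hour(saved_per_transaction: int, transactions_per_hour: int) -> int:
    """Estimate database queries avoided each hour by trimming a transaction."""
    return saved_per_transaction * transactions_per_hour


# Illustrative figures only: two teams share one object's trigger.
existing = [
    LogicUnit("team_a_handler", soql_queries=40, dml_statements=20),
    LogicUnit("team_b_handler", soql_queries=55, dml_statements=10),
]
budget = remaining_budget(existing)  # {"soql": 5, "dml": 120}
```

With these made-up numbers, new logic has only five SOQL queries left to work with, which is a strong signal to share an existing query rather than add another. Saving one query per transaction at 5,000 transactions an hour avoids 5,000 queries an hour.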

Anti-pattern #3: Unstructured deployment during peak hours

The majority of availability incidents happen when changes are made to your org setup in production. You may think that these are caused by untested changes. However, the most common deployment-related issue comes from background jobs that run because of a metadata change. For example:

Changing custom fields on objects with millions of records causes a background job to run across the entire database table for those records.
Deploying new code invalidates the existing compiled Apex code and triggers a background Apex recompilation job.

These background jobs also consume server capacity that could be used to serve your users’ interactions and API calls. When they run during your org’s peak business hours, the risk of straining the server and impairing your users’ experience increases.

This problem can be further exacerbated by unstructured metadata changes, like manually changing a custom field through the Setup UI, then moving on to make another manual change on a different custom field.

Next time you plan to deploy code, consider how you can do it more efficiently, and ask the following questions:

When is the best time to deploy code that minimizes the risk of user disruption?
How can I bundle all my changes to be deployed at the same time? (Hint: Salesforce DX and DevOps Center!)
Are other teams also planning deployments? How can we coordinate to deploy efficiently without conflicting with each other?

Anti-pattern #4: Lack of meaningful debug logs or alerts

When developing a Salesforce solution, you’re serving the needs of business users, admins and the IT operations team. They’re running and managing your org, and it’s critical they know how the org is functioning and when to respond if errors occur. They need visibility into any complex logic you introduce into the org. The best way to provide that is through debug logging and ensuring that error alerts reach them.

Ask yourself these questions to give your admins and IT operations staff visibility into your code’s operations:

How can they know if my code is functioning as expected, and how much is my code being used by users?
What kinds of errors can occur in my code? How can I effectively alert admins and the IT operations team to those errors?
If they want to triage and diagnose potential issues, how can they see how my code is being executed without overwhelming them with every line of code?
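One way to approach these questions is with leveled logging: routine usage is logged at an informational level, errors are routed to an alert channel, and fine-grained detail is only emitted when someone turns on debugging. The Python sketch below uses the standard `logging` module to illustrate the idea. The `AlertHandler` class, the `build_logger` helper, and the `apply_discount` function are hypothetical, and the in-memory alert list is a stand-in for a real notification channel such as email or paging.

```python
import logging
from typing import List


class AlertHandler(logging.Handler):
    """Collects ERROR-and-above records; a stand-in for email or paging."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.alerts: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.alerts.append(self.format(record))


def build_logger(name: str, alert_handler: AlertHandler) -> logging.Logger:
    """Create a logger that records usage at INFO and alerts on ERROR."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # DEBUG detail stays off unless someone is triaging
    logger.propagate = False
    alert_handler.setFormatter(
        logging.Formatter("%(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(alert_handler)
    return logger


def apply_discount(logger: logging.Logger, order_id: str,
                   amount: float, percent: float) -> float:
    """Hypothetical business logic with usage, error, and debug logging."""
    logger.info("apply_discount called order_id=%s", order_id)  # usage signal
    if not 0 <= percent <= 100:
        # Errors carry enough context for triage without a line-by-line trace.
        logger.error("invalid discount order_id=%s percent=%s", order_id, percent)
        raise ValueError("percent must be between 0 and 100")
    result = amount * (100 - percent) / 100
    logger.debug("order_id=%s amount=%s result=%s", order_id, amount, result)
    return result
```

The design choice here is that each audience gets a different level: operations staff see only the errors, usage can be counted from the informational messages, and developers can switch the logger to `DEBUG` when they need step-by-step detail.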

Anti-pattern #5: “Fix it first” mentality during emergencies

Developers are often paged when an org encounters an issue that stops users from conducting key business functions. They typically focus on “fixing the problem” to bring service back to users. However, during an emergency availability incident, this can be detrimental.

During an incident, the priority order of operations should be to minimize business impact, recover system operations, then find and fix the root cause.

When asked to look at an issue during an incident, ask yourself:

How are users being impacted by the incident right now? How can I quickly minimize the impact on them?
Is the incident caused by a recent change? If yes, can I reverse the change?
How do I enable admins and the IT operations team to triage the incident? Can they gather valuable details about the incident, so I can jump straight into the issue and bring operations back online?

Once the incident is remediated, don’t forget to conduct a post-mortem to investigate the root cause and fix the underlying issue.

More tips on improving availability in your org

That’s a crash course on how you as a developer and your code can improve the availability of your org. We don’t expect you to be an expert right away, but you can begin improving availability today by:

Being curious about your org and its design. If you see an existing inefficiency, add it to the backlog to investigate.
Asking questions. If the requirements you’re given would lead to solutions that could create scalability issues in the future, proactively call out the potential availability risks during design review so that they’re covered as you write your code.
Staying up to date on best practices using Salesforce Help and the Salesforce Developers website. If you can’t find an answer, check out the Trailblazer Community.
Continuing to learn. Use Trailhead to stay current and familiarize yourself with the latest and greatest coming out of Salesforce.

We have several resources to help you along the way. Check out Salesforce Availability, a new website launched to help you improve the availability of your implementation. Additionally, the new Availability Help and Training covers availability best practices in more detail. We’re also working closely with Well-Architected to ensure these concepts are well-embedded into a single framework for all Salesforce professionals to follow. Be on the lookout as we roll out more tools and resources to help improve availability in the near future.

About the author

Jsun Pe is a Director of Product Management in Availability & Infrastructure Engineering Services at Salesforce, focusing on enabling Salesforce customers to build highly available and resilient Salesforce solutions. Since starting in the Salesforce ecosystem in 2009, he has witnessed the growth of the Salesforce Platform while earning his Salesforce Certified Technical Architect credential. Jsun helped to build technical practices for top consulting partners in Australia and New Zealand, then joined Salesforce in 2016. During this time, he discovered his interest in advanced architecture considerations in platform performance and availability, ultimately leading to his current role.