In the Salesforce Spring ’25 release, native Apex Zip support is generally available through the ZipReader and ZipWriter classes in the Compression namespace. This opens the door for many great use cases. One of these use cases is the ability to read and write Excel (XLSX) files natively on the platform using Apex only, without the need for any third-party application or integration.
In highly regulated enterprises, procuring third-party applications from new vendors or implementing new integrations can be very expensive and time-consuming. For common and relatively simple use cases like reading and processing an Excel file (think of email attachments or weekly file uploads), this might be too costly.
The ability to use Excel data anywhere in your business process automation allows you to streamline your data processes even further. In this blog post, we’ll discuss an open source Apex utility called Lightweight – XLSX Util that leverages these new Zip classes to read and create XLSX files.
Let’s start by looking at the high-level process flow.
Reading an XLSX file process flow
The reading process starts with an XLSX file that is stored in Salesforce as a ContentDocument, Document, or Attachment. The new ZipReader methods extract the multiple worksheet XML files from the XLSX file. The utility then processes these XML files and converts the Excel data to Apex data types. At that point, we have the data available in Apex, and we can then expose and use that data anywhere we want in Salesforce.
To understand how an Excel file works, lets dive into the Excel file structure.
The ECMA-376 standard
Excel (XLSX) files are actually Zip archives that contain a set of XML files that together form a workbook. These files include components like spreadsheets, styles, and other metadata. The Open Office ECMA-376 standard describes how these archives are structured.
An unzipped XLSX file contains folders and files with information about the structure, the document properties like author, title, and description, and most importantly, the actual data.
The data is stored in the xl/worksheets
folder. Each worksheet in a workbook is named sheetX.xml
(see docs), where X represents the order in the workbook.
The only other data-related file is the sharedStrings.xml
(see docs). This file contains a single array with all the unique strings in the entire workbook.
Data format
Now that we know in which files in the data is stored, let’s look at the format of a basic Excel sheet, represented in the example below.
In this example, there is a number value in cells A1
and B1
. Cell A2
has a text value and cell B2
contains a formula that calculates the sum of A1
and B1
.
For this worksheet, the values would be stored in an XML file that looks like this:
The first worksheet element is dimension
(see docs) This element contains the size of the worksheet, followed by the sheetData
(see docs) element which contains row and cell data.
The sheet data contains row
(see docs) elements that are ordered by the number stored in the r
attribute. Row elements contain cell elements named c
(see docs) that just like rows, store their name (A1, B2 etc.) in the r
attribute.
Cells have a t
attribute to indicate their data type (see docs), for example, string, boolean, number, or shared string. The cell data is stored in the v
child element of a cell (see docs). Depending on the data type, the data is either the actual cell value or a reference index in the shared strings array.
Formula cells contain both the formula and the pre-calculated value. The utility does not re-calculation formula values.
Converting ECMA-376 to Apex
The Lightweight – XLSX Util converts the Excel data structure to Apex in a way that is manageable, scalable, maintainable, and most importantly, user friendly for the implementing developers.
Excel data is made up of worksheets, columns, and rows. A worksheet is identified by it’s name, a column by a letter, and the row by a number. This means that a value on the first cell in the first row on the first worksheet has the format worksheet1.A.1
. The cell name would be A1
.
If we make this abstract, we get the format: worksheet.column.row
and worksheet.cellname
. The library can also output the format worksheet.row.column
to handle record data more easily.
Generally speaking, there are two types of data used in Excel: data tables or a set of specific cells.
- Using a multi-dimensional array: In Apex, data tables convert very well to a multi-dimensional arrays. The cell values are stored at the index of
worksheetIndex.columnIndex.rowIndex
. The Apex data type isObject[][][]
orList<List<List<Object>>>
(whichever coding style you prefer). This is perfect for large data volumes. - Using a list of maps: If you require specific values, it’s better to use a list of maps, where the map key matches the cell name and the map value contains the cell value. The output data type is
List<Map<String,Object>
. Each worksheet is a list item. This approach is slightly more complex to process, but better suited for smaller data volumes.
For performance reasons, all values are of the type Object
. It’s more efficient to cast the data at the point where it is used rather than casting all values at the time of reading. If you’re interested in in-depth performance analysis, please see the various links in the resources section below. The article Salesforce Apex Optimization: Maps vs Multi-Dimensional Arrays goes into a lot of detail on the differences.
Reading and parsing the XLSX file
XLSX files are actually Zip files. We use the getEntriesMap()
method (see docs) from the ZipReader class to get the zipped content. This method returns a map with all the files in the zip archive formatted like this: Map<
String
,
Compression.ZipEntry
>
.
The utility’s main parsing methods are a part of the Apex xlsx.Parse
class. The xlsx namespace allows for user-friendly and descriptive method names like xlsx.Parse.toArray()
and xlsx.Parse.toMap()
.
The following Apex snippet shows how an XLSX file stored as Document is queried and parsed to an array and map.
The array output looks like the output below. For readability purposes, the output is converted to JSON. Note that the JSON parser converts the Object data type to the String data type when rendering them (in a debug call, for instance), hence the numbers are represented as Strings in the example below.
The result for the map method looks like this:
Just like that, we have all the data from the Excel workbook at our fingertips in Apex.
Excel files can be very large. For more details on heap-efficient ways to process large amounts of data, see Salesforce Apex Optimization: Large Strings vs Heap Size and CPU Time. This article explains why it is most efficient to use the file entry maps as a the parse method arguments, and how to optimize the above example even further.
Practical applications
Now we know the technical principle, let’s look some additional methods and examples that can be useful when you implement this library in practice.
You might not always know the order or the exact name of the worksheets. You can use the xlsx.Parse.toWorksheetNameIndexMap()
method to create a map with all the worksheet names and indexes.
If you work with record data where each row in a worksheet represent a record, it makes more sense to have the columns and rows switched around. You can use the xlsx.Parse.toArrayInverted()
method to get the output in the worksheet.row.column
format.
There is a XLSX to sObject example and XLSX to CSV example in the examples folder in our GitHub repo.
Final notes
- The library is designed for handling cell data. For most scenarios, this is sufficient. However, more advanced Excel features, such as dropdown lists or pivot tables, may require additional custom code or extensions.
- You need to understand the data structure in advance. If the data structure changes often, you’ll need to write custom logic to dynamically determine things like data ranges, column positions, etc.
- The library doesn’t handle data type conversions. Casting data into specific data types is up to the implementing developer to handle.
- Be mindful of the governor limits: 6MB/10s for sync and 12MB/60s for async processes. File reading and processing times count against these limits. If you run into these limits, you might want to consider moving the file processing off-platform. Check the performance tests folder for benchmarks examples and results.
- Always speak to your technical architect to validate that using Apex to handle Excel data processing is architecturally the right solution.
Conclusion
We have shown an example of how the new Zip functionality can help with a real use case that was not possible before at this scale. The amazing Zippex library can now be retired after many years of loyal service to make way for the native solution.
Seeing this Excel example hopefully gives you an idea of what great things you can do with the new Zip functionality, and what trade-offs and performance optimizations you’ll deal with when translating an industry standard to Apex.
You can find the full open source project details here:
- Lightweight- XLSX Util repository
- Install the managed package in Production / Sandbox
- Install the unlocked package in Production / Sandbox
Resources
- GitHub: Lightweight – XSLX Util repository and package
- Documentation: Best Practices to resolve the Heap Size Error – Apex Governor Limits – CheckList
- Blog Post: Apex Zip Support Performance Test
- Blog Post: Salesforce Apex Optimization: Maps vs Multi-Dimensional Arrays
- Blog Post: Salesforce Apex Optimization: Large Strings vs Heap Size and CPU Time
- Reference: Wouter van Vugt – Open XML The Markup Explained
About the author
Justus van den Berg is a Technical Architect in the UK and Ireland organization at Salesforce with a passion for (Apex) development. He focuses on the “how” part of solving customer challenges.