Chris760

How to work with PDF (base64) data in APEX as String?

I have a regular old PDF file on my desktop, and that same PDF file is attached to a record in Salesforce.  Now, when I change the PDF's extension on my computer to ".txt", I can then open it and read the textual data that makes up the file.  I need to extract some data at the beginning of the ".txt" version of the file that details how many pages the PDF consists of, but I need to do this for many PDF files that are already in Salesforce.

I discovered the "String pdf = System.EncodingUtil.base64Encode(a.Body)" method, which apparently converts the base64 blob representation of the file into an unencoded string... but when I looked at the actual output, I realized that it was totally different compared to the ".txt" version of the PDF on my desktop.  It seems that I'm not encoding/decoding the attachment Body blob to the correct format, but I'm totally clueless as to what to convert it to, or what method I would need to use to convert it.

Does anyone have any ideas as to how I would convert the PDF data to whatever format it gets converted to when I just change the extension on my computer and view it in notepad?

Thanks!


Best Answer chosen by Chris760
Ray Guy
This isn't easy at all, unfortunately. What you're trying to do is view the bytes of the attachment converted into displayable characters where possible. When you open the file in a text editor, the editor ignores the binary values that don't make sense as display characters. Unfortunately, if you try to get there using Apex, Salesforce will stop you short, since there's no way to tell it to ignore binary values that aren't character data. Instead of ignoring them, it gives you an error when it encounters them. For example, you could try to convert the Attachment's Body field to a String directly using:

String pdf = attachment.Body.toString();

but it's going to give you an error saying the BLOB (binary data) is "not a valid UTF-8 string". What it means is that some of the values in the binary data don't map to any character it could put in a String (Salesforce uses UTF-8 string encoding), so it rejects the whole lot. Your text editor, on the other hand, just replaces these with whitespace and lets you view the valid ones.
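To see the failure for yourself, here is a minimal sketch you could run in Execute Anonymous. The sample bytes are made up for illustration (a typical PDF opening line followed by a few binary marker bytes), and I'm catching the generic Exception type rather than assuming which exception class gets thrown.

```apex
// Hypothetical sample: '255044462d312e34' is "%PDF-1.4" in hex, and
// 'e2e3cfd3' are binary bytes that do not form a valid UTF-8 sequence.
Blob sample = EncodingUtil.convertFromHex('255044462d312e340d25e2e3cfd30d');

String asText;
String failure;
try {
    asText = sample.toString();
} catch (Exception e) {
    // The whole conversion fails, not just the bytes that are invalid.
    failure = e.getMessage();
}
System.debug('asText=' + asText + ', failure=' + failure);
```

Even though the first eight bytes are perfectly readable, nothing comes back, because the conversion is all or nothing.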

You can turn binary data into a form that can be shown in a String, but that's what you have already with base64Encode. As you've found, in order to turn all the binary data into valid display characters, it encodes the whole thing, which makes it unreadable for your purposes.
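As a small illustration of why the base64 output looks nothing like the ".txt" view, here is a sketch that encodes just the text "%PDF-1.4" (a made-up stand-in for the first line of a PDF) and then decodes it again.

```apex
// Stand-in for the first eight bytes of a PDF file.
Blob firstLine = Blob.valueOf('%PDF-1.4');

// base64Encode re-expresses every 3 bytes as 4 characters from a 64-character alphabet.
String encoded = EncodingUtil.base64Encode(firstLine);
System.debug(encoded); // JVBERi0xLjQ=

// Decoding gets the original bytes back, and here they happen to be valid text.
Blob roundTrip = EncodingUtil.base64Decode(encoded);
System.debug(roundTrip.toString()); // %PDF-1.4
```

So the encoded string carries the same data, but no character in it lines up with a character in the original file.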

It would be great if you were able to step through the binary data in the Attachment Body one byte at a time, extract the ones that are normal characters and ignore the others. But Apex doesn't let you index into a Blob directly.

The convoluted workaround that some have used involves turning the binary data into base64-encoded format (as you have), and then hand-rolling a base64 decoder that picks through the encoded string piece by piece and gives you access to the individual decoded byte values. You would then be able to add your own logic to determine whether each byte is in the normal ASCII character range (0 to 127). (This appears to be how the values are stored in the first part of a PDF.)
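A rough sketch of that workaround might look like the following. The class and method names (Base64ByteReader, decode, printableOnly) are invented for illustration, it is not the code from the linked answer, and walking a large attachment one character at a time like this will be slow and heavy on CPU and heap limits.

```apex
public class Base64ByteReader {
    // Standard base64 alphabet; a character's position is its 6-bit value.
    private static final String ALPHABET =
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

    // Decodes a base64 string into a list of byte values (0 to 255).
    public static List<Integer> decode(String encoded) {
        List<Integer> bytes = new List<Integer>();
        Integer buffer = 0; // bits collected so far
        Integer bits = 0;   // how many bits of buffer are in use
        for (Integer i = 0; i < encoded.length(); i++) {
            Integer value = ALPHABET.indexOf(encoded.substring(i, i + 1));
            if (value < 0) {
                continue; // skips '=' padding and any line breaks
            }
            buffer = (buffer << 6) | value;
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                bytes.add((buffer >> bits) & 255);
                buffer = buffer & ((1 << bits) - 1); // keep only the leftover bits
            }
        }
        return bytes;
    }

    // Keeps only printable ASCII bytes (32 to 126) and returns them as text.
    public static String printableOnly(List<Integer> bytes) {
        List<Integer> kept = new List<Integer>();
        for (Integer b : bytes) {
            if (b >= 32 && b <= 126) {
                kept.add(b);
            }
        }
        return String.fromCharArray(kept);
    }
}
```

You would call it with something like Base64ByteReader.printableOnly(Base64ByteReader.decode(EncodingUtil.base64Encode(attachment.Body))), where attachment is whatever Attachment record you have queried.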

Here's a stackexchange question where the accepted answer delves into this a bit: http://salesforce.stackexchange.com/questions/860/mimic-mysql-aes-encrypt-in-apex/910#910
But it's pretty heavy code work I'm afraid and far from simple.

Ray
ForceClarity Ltd.

All Answers

Vinita_SFDC
Hello,

Please refer to the link below for encoding/decoding base64 in Apex with the help of the EncodingUtil class:

http://www.salesforcegeneral.com/salesforce-articles/base-64-encoding-in-apex.html
Chris760
Hi Ray,

Man, thanks a lot for taking the time to write that all out.  Really great information!  I went ahead and took a look at that stack exchange post, and it actually gave me an idea... to use the convertToHex(Blob) method to convert it to Hex first, and then use a While loop and a Map to convert it all to regular ASCII and it actually worked!  Soooooo stoked!  And there’s no way I would have been able to pull it off without the background and resources you provided, so much thanks. :)

Basically what I ended up doing (to be even more detailed) was I first converted the attachment's Body blob to a hex string.  Then I used the indexOf method to find the beginning and end of the substring that would contain the page count of each PDF (the page count is always sandwiched between the words “/Count ” and “/Kids”), except I converted those two search terms into hex and searched for their hexadecimal equivalents using indexOf.  That made the While loop WAY shorter since it didn’t have to convert everything, just the data I actually needed to see in ASCII.  But if you needed to convert the entire document, it wouldn’t be that much trouble… you’d just need to add the other characters to your hex-to-ASCII map and convert the whole thing, instead of just a substring like I did (I only included the numbers 0-9 in my map since I only extracted number values in my substring).
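For anyone who wants the general idea without the hard-coded hex strings, here is a minimal sketch. The class and method names (PdfHexSearch, toHex, indexOfAligned, between) are made up for this example. It also assumes the bytes between the two markers are plain ASCII, so that convertFromHex followed by toString can stand in for the hex-to-ASCII map.

```apex
public class PdfHexSearch {
    // Hex form of an ASCII search term, e.g. '/Count ' becomes '2f436f756e7420'.
    public static String toHex(String term) {
        return EncodingUtil.convertToHex(Blob.valueOf(term));
    }

    // First match that starts on a byte boundary (an even offset), or -1.
    // A match at an odd offset straddles two bytes and is not a real hit.
    public static Integer indexOfAligned(String hex, String termHex, Integer startAt) {
        Integer pos = hex.indexOf(termHex, startAt);
        while (pos >= 0 && Math.mod(pos, 2) != 0) {
            pos = hex.indexOf(termHex, pos + 1);
        }
        return pos;
    }

    // Text found between two markers, or null if either marker is missing.
    // Assumes the bytes in between are ASCII; toString() throws otherwise.
    public static String between(Blob body, String startTerm, String endTerm) {
        String hex = EncodingUtil.convertToHex(body);
        String startHex = toHex(startTerm);
        Integer startPos = indexOfAligned(hex, startHex, 0);
        if (startPos < 0) {
            return null;
        }
        Integer valueStart = startPos + startHex.length();
        Integer valueEnd = indexOfAligned(hex, toHex(endTerm), valueStart);
        if (valueEnd < 0) {
            return null;
        }
        return EncodingUtil.convertFromHex(hex.substring(valueStart, valueEnd)).toString();
    }
}
```

With an attachment a, PdfHexSearch.between(a.Body, '/Count ', '\n/Kids') would return the digits as a String when the file lays those two entries out that way. Note that this version converts the whole body to hex and searches all of it, rather than only the first and last 8000 characters.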

The only slight wall I ran into was governor limits -- not surprisingly.  The heap size was pretty massive since the query was calling full sized PDF documents in a list.  I finally found that limiting the query to 100 records allowed it to always update the entire batch without capping out, but there was no way I was gonna sit there and execute the code 400 times to update all the records in the org with PDF’s attached, so I made a quick BatchApex class to execute the code in groups of 100 records and then updated them all in one go.  I've pasted the code below in case you or anyone else might ever find it helpful.
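If you want to see how close you are to the heap limit while tuning the batch size, a small helper along these lines might be useful. HeapCheck and usedFraction are names invented for this sketch; the two Limits methods are the standard ones for heap usage.

```apex
public class HeapCheck {
    // Fraction of the transaction's heap limit currently in use (0 to 1).
    public static Decimal usedFraction() {
        Decimal used = Limits.getHeapSize();
        return used / Limits.getLimitHeapSize();
    }
}
```

Logging HeapCheck.usedFraction() after each attachment is converted to hex gives a rough picture of how much room a given batch size leaves.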

Below is my BatchApex class containing all the code:

global class batchFaxUpdate implements Database.Batchable<sObject>
{
    global Database.QueryLocator start(Database.BatchableContext BC)
    {
        String query = 'select Id, efaxapp__Total_Pages__c, App__c, Updated__c from efaxapp__Received_Fax__c where App__c = \'MyFax\' and efaxapp__Total_Pages__c < 2 and Updated__c = false';
        return Database.getQueryLocator(query);
    }

    global void execute(Database.BatchableContext BC, List<efaxapp__Received_Fax__c> scope)
    {
        // Keyed by Id so a fax with several PDF attachments is only updated once.
        map<Id,efaxapp__Received_Fax__c> processed = new map<Id,efaxapp__Received_Fax__c>();
        map<Id,efaxapp__Received_Fax__c> receivedFax = new map<Id,efaxapp__Received_Fax__c>(scope);
        // Hex byte -> ASCII character. Only the digits are needed for the page count.
        map<String,String> hexToAscii = new map<String,String>{'30'=>'0','31'=>'1','32'=>'2','33'=>'3','34'=>'4','35'=>'5','36'=>'6','37'=>'7','38'=>'8','39'=>'9'};

        for(Attachment a : [select Id, Body, ParentId, ContentType from Attachment where ContentType = 'application/pdf' AND ParentId IN :receivedFax.keySet()])
        {
            // Two hex characters per byte, so 8000 characters cover the first and last 4000 bytes.
            String pdf = System.EncodingUtil.convertToHex(a.Body);
            String header = pdf.left(8000);
            String footer = pdf.right(8000).reverse();
            Integer StartIndex = 0;
            Integer EndIndex = 0;
            String Count = '';
            String ConvertedOutput = null;

            // '2f436f756e7420' is "/Count " in hex; '0d' is the carriage return that ends the line.
            Integer MatchIndex = header.indexOf('2f436f756e7420');
            if(MatchIndex >= 0){
                StartIndex = MatchIndex+14;
                EndIndex = header.indexOf('0d',StartIndex+1);
                if(EndIndex > StartIndex) Count = header.substring(StartIndex,EndIndex);
            }
            else {
                // The footer is reversed, so the search terms are reversed too:
                // '374696b4f2a0' is a line feed plus "/Kids", '0247e657f634f2' is "/Count ".
                MatchIndex = footer.indexOf('374696b4f2a0');
                if(MatchIndex >= 0){
                    StartIndex = MatchIndex+12;
                    EndIndex = footer.indexOf('0247e657f634f2',StartIndex+1);
                    if(EndIndex > StartIndex) Count = footer.substring(StartIndex,EndIndex).reverse();
                }
            }

            // Convert the hex pairs to digits; give up if anything other than a digit turns up.
            // The length cap (9 digits) keeps Integer.valueOf from overflowing.
            Boolean valid = Count.length() > 0 && Count.length() <= 18 && Math.mod(Count.length(),2) == 0;
            StartIndex = 0;
            while(valid && StartIndex < Count.length()){
                String digit = hexToAscii.get(Count.substring(StartIndex,StartIndex+2));
                if(digit == null){
                    valid = false;
                } else {
                    ConvertedOutput = (ConvertedOutput == null ? digit : ConvertedOutput + digit);
                    StartIndex = StartIndex+2;
                }
            }

            efaxapp__Received_Fax__c r = receivedFax.get(a.ParentId);
            // Leave the existing page count alone when no valid count was found.
            if(valid){
                r.efaxapp__Total_Pages__c = Integer.valueOf(ConvertedOutput);
            }
            r.Updated__c = true;
            processed.put(r.Id,r);
        }

        update processed.values();
    }

    global void finish(Database.BatchableContext BC)
    {
    }
}


And below is my Execution statement (which I just ran in the ExecuteAnonymous window to initiate the BatchApex in groups of 100 records):

batchFaxUpdate b = new batchFaxUpdate();
database.executeBatch(b,100);



Anyway, thanks again Ray!  You rock!!
Mitesh Sura
Thanks Ray for such a great explanation! I am in the same boat, however I have no clue about the contents of the attachments I am trying to show on a VF page (rendered as PDF). They can contain text, images, or just about anything a PDF can hold.

So the whole idea is to get the PDF and show it within a VF page rendered as PDF. I have been trying to find a solution for many days now. Would the above solution still work in this scenario?