DonST

Convert a string to ASCII values?

In order to integrate with a legacy system from Salesforce, I need to generate a digest in a certain way. Luckily the methods in the Crypto class get me 99% there, but for one step of the process I need to convert the characters in a string into their ASCII code values in order to do some math with the results.  

The problem is that I can't find a simple way to do this in APEX. Conversions between strings and integers don't appear to work as they would in most other languages. I can use the convertToHex method in EncodingUtil to get the hex value for a string, but the result is yet another string, and Integer.valueOf doesn't accept a number base. In the worst case, I can write the conversion of a hex string to an integer myself, but I want to be sure there isn't a better answer.

I know this is a rather low-level task for APEX. Any suggestions?

rtuttle

Super old post, but I ran into the same issue and thought I'd post my solution in case anyone has to tackle this exact issue:

 

http://www.cloudywithachanceofcode.com/converting-string-to-decimal-character-array-in-apex/

 

 

Hope it helps someone.

 

-Richard

boBNunny

I found a way to see if a string is Unicode.  The same technique could be used to figure out the ASCII value for characters if desired.  The problem with "blob.valueOf" is that it sees Extended ASCII (128-255) as Unicode.  This of course isn't true.  Use this function and you will get back true if any part of your string is Unicode.  If you want the ASCII values, you can create a loop with the string to parse the values 2 at a time, and if one starts with e3, then the next character is the unicode character within that set.

 

 

 

public static Boolean isUnicodeString(string strInput) {
  Boolean rtn = false;
  if (strInput == null || strInput == '' || blob.valueOf(strInput).size() == strInput.length()) return rtn;

  string strHex = encodingUtil.convertToHex(blob.valueOf(strInput));
  if (!strHex.contains('e3')) return rtn;
  
  return true;
 }

rtuttle

I should have updated this. I completely modified the code to support Unicode characters. It basically gives you the code point numbers decoded from the UTF-8 bytes. Also, for my purposes I realized I don't have to convert them the way I was. I needed an equivalent of byte in Java, which I was able to work out just fine by converting the string to hex and then to integers. The code below is something I came up with while trying to figure it all out, and it might be useful to anyone who needs Unicode values for strings.

 

 

private static Map<String,Integer> hexMap = new Map<String,Integer>();
static {
	hexMap.put('0',0);
	hexMap.put('1',1);
	hexMap.put('2',2);
	hexMap.put('3',3);
	hexMap.put('4',4);
	hexMap.put('5',5);
	hexMap.put('6',6);
	hexMap.put('7',7);
	hexMap.put('8',8);
	hexMap.put('9',9);
	hexMap.put('A',10);
	hexMap.put('B',11);
	hexMap.put('C',12);
	hexMap.put('D',13);
	hexMap.put('E',14);
	hexMap.put('F',15);
	hexMap.put('a',10);
	hexMap.put('b',11);
	hexMap.put('c',12);
	hexMap.put('d',13);
	hexMap.put('e',14);
	hexMap.put('f',15);
}

/*  stringToCodePoint
 *  converts all strings to code point values (UTF8)
 *  which could be converted back to string values later
 */	
public static List<Integer> stringToCodePoint(String input) {
	String hex = EncodingUtil.convertToHex(Blob.valueOf(input));
	List<Integer> charList = new List<Integer>();
	Integer increment = 2;
	for(Integer i=0; i<hex.length(); i+=increment) {
		Integer out = 0;
		Integer c1 = (hexMap.get(hex.subString(i,i+1)) * 16) + (hexMap.get(hex.subString(i+1,i+2)));
		Integer c2 = 0;			
		Integer c3 = 0;
		Integer c4 = 0;
		if(c1 <128) {
			charList.add(c1);
			increment = 2;
			continue;
		}
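		if(c1 > 127 && c1 < 192) {
			// a continuation byte where a lead byte was expected; the continue that
			// followed this throw could never run, so it is dropped.
			// InvalidByteTypeException is assumed to be a custom exception declared elsewhere.
			throw new InvalidByteTypeException('error parsing hex, probably not a utf8 hex string');
		}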
		if(c1 > 193 && c1 < 224) {
			// first of 2
			increment = 4;				
		}
		if(c1 > 223 && c1 < 240) {
			// first of 3
			increment = 6;
		}
		if(c1 > 239 && c1 < 245) {
			// first of 4
			increment = 8;
		}
		
		c2 = (hexMap.get(hex.subString(i+2,i+3)) * 16) + (hexMap.get(hex.subString(i+3,i+4)));						
		if(increment == 4) {
			out = (c1 - 192) * 64 + c2 - 128;
		}
		if(increment == 6) {
			c3 = (hexMap.get(hex.subString(i+4,i+5)) * 16) + (hexMap.get(hex.subString(i+5,i+6)));
			out = (c1-224)*4096 + (c2-128)*64 + c3 - 128;
		}
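		if(increment == 8) {
			// a 4-byte sequence needs the third byte as well; it was previously left at 0
			c3 = (hexMap.get(hex.subString(i+4,i+5)) * 16) + (hexMap.get(hex.subString(i+5,i+6)));
			c4 = (hexMap.get(hex.subString(i+6,i+7)) * 16) + (hexMap.get(hex.subString(i+7,i+8)));
			out = (c1 - 240) * 262144 + (c2 - 128) * 4096 + (c3 - 128) * 64 + c4 - 128; 
		}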
		charList.add(out);
	}		
	return charList;
}
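
For the simpler "equivalent of byte in Java" case mentioned above, where only the raw UTF-8 byte values are needed rather than decoded code points, a minimal sketch might look like the following. The class and method names (HexByteSketch, stringToBytes) are made up for illustration and are not part of the code above; the lookup uses indexOf on a lowercase digit string because convertToHex emits lowercase hex.

```apex
public class HexByteSketch {
    // Hypothetical helper: returns each UTF-8 byte of the input as an Integer 0-255.
    public static List<Integer> stringToBytes(String input) {
        List<Integer> bytes = new List<Integer>();
        if (input == null || input == '') return bytes;

        String digits = '0123456789abcdef';   // position in this string = digit value
        String hex = EncodingUtil.convertToHex(Blob.valueOf(input));

        // Every byte is exactly two hex digits, so step through the string in pairs.
        for (Integer i = 0; i < hex.length(); i += 2) {
            Integer hi = digits.indexOf(hex.substring(i, i + 1));
            Integer lo = digits.indexOf(hex.substring(i + 1, i + 2));
            bytes.add(hi * 16 + lo);
        }
        return bytes;
    }
}
```

Plain ASCII input yields one value per character, for example HexByteSketch.stringToBytes('AB') gives (65, 66); any character that UTF-8 encodes in several bytes contributes one list entry per byte.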

 

 

boBNunny

It absolutely boggles my mind that SFDC hasn't created an ASC function to return the code for a character (0-65535). Java, .NET, and virtually every other language allow it inherently, but for some reason SFDC has limited the Integer conversion and not provided an alternative.
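
A note for later readers: newer Apex API versions expose character values directly on the String class, so the hex round trip is not always needed. The sketch below assumes an API version where String.getChars() and String.charAt() are available; the class and method names (CharCodeSketch, toCodes) are made up for illustration, and behavior for characters above 65535 is not covered here.

```apex
public class CharCodeSketch {
    // Hypothetical helper: returns the character value at each position of the input.
    public static List<Integer> toCodes(String input) {
        if (input == null) return new List<Integer>();
        return input.getChars();
    }
}
```

For example, CharCodeSketch.toCodes('Abc') returns (65, 98, 99), and 'A'.charAt(0) returns 65 for a single position.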

SOA Guy

Thank you rtuttle for that code snippet!

 

I have users entering Chinese characters into SFDC forms or copy/pasting in binary data from machine logs. Also, many web servers treat input values containing angle brackets as scripting attacks. When this data is sent via web callouts to an external web service, it never arrives. HTML or Base64 encoding is problematic because I don't control all the endpoints and they expect non-encoded strings. I need to strip all outbound non-printable characters that users can enter for reliable XML transport. Converting Chinese characters stumped me because I didn't understand how they can hex-encode into varying lengths.

 

Here's what I built that strips invalid XML and invalid Windows 1252 characters from a string using APEX.  I'm fairly new to APEX programming and would love to see someone improve this.

 

... and btw, this for loop in C# is one line of code:   "foreach (char c in xml)"

 

public with sharing class XmlTextCleaner {

    /// <summary>
    /// Remove illegal XML characters from a string.
    /// </summary>
    public static string SanitizeXmlString(string xml)
    {
        if ((null == xml) || (0 == xml.length()))
            return xml;

        String ret = '';

           (Apex code from above removed to fit within allowed posting size)

            for (Integer i=0; i < hex.length(); i += increment)
            {

           (Apex code from above removed to fit within allowed posting size)

            if (IsLegalXmlChar(out) && IsLegalWindows1252(out)) {
                  if (60 == out) // "<" is replaced with "["
                        charList.Add(91);
                  else if (62 == out) // ">" is replaced with "]"
                        charList.Add(93);
                  else
                        charList.Add(out);
            }
            }
            String s = String.fromCharArray(charList);

        System.debug('SanitizeXmlString: output=' + s);
        return s;
    }

 

 

    /// <summary>
    /// Whether a given character is allowed by XML 1.0.
    /// </summary>
    private static boolean IsLegalXmlChar(integer character)
    {
        return
            (
                character == 9  /* '\t' */ ||
                character == 10 /* '\n' */ ||
                character == 13 /* '\r' */ ||
                (character >= 32 && character <= 55295) ||
                (character >= 57344 && character <= 65533) ||
                (character >= 65536 && character <= 1114111)
            );
    }

       

 

    // from http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx
    private static boolean IsLegalWindows1252(integer character)
    {
        return
            (
                character == 9  /* '\t' */ ||
                character == 10 /* '\n' */ ||
                character == 13 /* '\r' */ ||
                (character >= 32 && character <= 255) ||
                /* 0x01-- */
                character == 338 /* LATIN CAPITAL LIGATURE OE */ ||
                character == 339 /* LATIN SMALL LIGATURE OE */ ||
                character == 352 /* LATIN CAPITAL LETTER S WITH CARON */ ||
                character == 353 /* LATIN SMALL LETTER S WITH CARON */ ||
                character == 376 /* LATIN CAPITAL LETTER Y WITH DIAERESIS */ ||
                character == 381 /* LATIN CAPITAL LETTER Z WITH CARON */ ||
                character == 382 /* LATIN SMALL LETTER Z WITH CARON */ ||
                character == 402 /* LATIN SMALL LETTER F WITH HOOK */ ||
                /* 0x02-- */
                character == 710 /* MODIFIER LETTER CIRCUMFLEX ACCENT */ ||
                character == 732 /* SMALL TILDE */ ||
                /* 0x2--- */
                character == 8211 /* EN DASH */ ||
                character == 8212 /* EM DASH */ ||
                character == 8216 /* LEFT SINGLE QUOTATION MARK */ ||
                character == 8217 /* RIGHT SINGLE QUOTATION MARK */ ||
                character == 8218 /* SINGLE LOW-9 QUOTATION MARK */ ||
                character == 8220 /* LEFT DOUBLE QUOTATION MARK */ ||
                character == 8221 /* RIGHT DOUBLE QUOTATION MARK */ ||
                character == 8222 /* DOUBLE LOW-9 QUOTATION MARK */ ||
                character == 8224 /* DAGGER */ ||
                character == 8225 /* DOUBLE DAGGER */ ||
                character == 8226 /* BULLET */ ||
                character == 8230 /* HORIZONTAL ELLIPSIS */ ||
                character == 8240 /* PER MILLE SIGN */ ||
                character == 8249 /* SINGLE LEFT-POINTING ANGLE QUOTATION MARK */ ||
                character == 8250 /* SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */ ||
                character == 8364 /* EURO SIGN */ ||
                character == 8482 /* TRADE MARK SIGN */
            );
    }

}

 

boBNunny

I have come up with 2 methods to assist in this requirement.  One tells you simply if there is Unicode or not and what the Unicode characters are, and the other actually tells you the Unicode character number for each position.  I keep these 2 in a Utility class that I can access from anywhere.

 

	public static String isUnicodeString(string strInput) {
        /*
        Created By: Robert Nunemaker
        Created On: 12/04/2008
        Purpose: isUnicodeString
            Accepts a string and returns a string of the characters within that are Unicode. 
            If the returned string is empty, then the string is all ASCII (0-255).     
        --------------------------------------------------------------------------
        Modified By:  
        Modified On:  
        Modification: 
        */

		string rtn = '';
		string strChar = '';
		string strHex = '';
		if (strInput == null || strInput == '' || blob.valueOf(strInput).size() == strInput.length()) return rtn;

		system.debug('Length of string: ' + strInput + ' = ' + strInput.length());
		for (integer nCol = 0; nCol <> strInput.length(); nCol++) {
			strChar = strInput.substring(nCol, (nCol + 1));
			system.debug('Testing Character: ' + strChar);
			strHex = encodingUtil.convertToHex(blob.valueOf(strChar));
			if (strHex.length() > 2 && strHex.substring(0, 2) <> 'c2' && strHex.substring(0, 2) <> 'c3' && strHex.substring(0, 2) <> 'c4') {
				system.debug('Unicode found - Hex equivalent: ' + strHex);
				rtn += strChar;
			}
		}

		//string strHex = encodingUtil.convertToHex(blob.valueOf(strInput));
		//if (!strHex.contains('e3')) return rtn;

		return rtn;
	}

	public static Integer[] StringToAscWCode(string strInput) {
        /*
        Created By: Robert Nunemaker
        Created On: 12/04/2008
        Purpose: StringToAscWCode
            Accepts a string and returns an array of Integer representations 
            of the AscII/Unicode character NUMBERS associated with each character.    
        --------------------------------------------------------------------------
        Modified By:  
        Modified On:  
        Modification: 
        */
        
	
		string strHex = '0123456789ABCDEF';
		LIST<Integer> codeLIST = new LIST<Integer>();
		Map<String,Integer> hexMAP = new Map<String,Integer>();
		for (integer nLoop = 0; nLoop < 16; nLoop++) {
			hexMAP.put(strHex.substring(nLoop, nLoop), nLoop);
		}
		
		// check for null/empty first so a null input never reaches Blob.valueOf
		if (strInput == null || strInput == '') return codeLIST;
		strHex = EncodingUtil.convertToHex(Blob.valueOf(strInput));

		LIST<Integer> charLIST = new List<Integer>();
		Integer increment = 2;
		for(Integer i = 0; i < strHex.length(); i += increment) {
			Integer out = 0;
			Integer c1 = (hexMAP.get(strHex.subString(i,i + 1)) * 16) + (hexMAP.get(strHex.subString(i + 1,i + 2)));
			Integer c2 = 0;			
			Integer c3 = 0;
			Integer c4 = 0;
			if(c1 <128) {
				charList.add(c1);
				increment = 2;
				continue;
			}
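			if(c1 > 127 && c1 < 192) {
				// a continuation byte where a lead byte was expected; the continue that
				// followed this throw could never run, so it is dropped.
				// InvalidArgumentException is assumed to be a custom exception declared elsewhere.
				throw new InvalidArgumentException('Error parsing HEX, probably not a UTF8 HEX string');
			}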
			if(c1 > 193 && c1 < 224) {
				// first of 2
				increment = 4;				
			}
			if(c1 > 223 && c1 < 240) {
				// first of 3
				increment = 6;
			}
			if(c1 > 239 && c1 < 245) {
				// first of 4
				increment = 8;
			}
			
			c2 = (hexMAP.get(strHex.subString(i + 2,i + 3)) * 16) + (hexMAP.get(strHex.subString(i + 3,i + 4)));						
			if(increment == 4) {
				out = (c1 - 192) * 64 + c2 - 128;
			}
			if(increment == 6) {
				c3 = (hexMAP.get(strHex.subString(i + 4,i + 5)) * 16) + (hexMAP.get(strHex.subString(i + 5,i + 6)));
				out = (c1-224)*4096 + (c2-128)*64 + c3 - 128;
			}
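			if(increment == 8) {
				// a 4-byte sequence needs the third byte as well; it was previously left at 0
				c3 = (hexMAP.get(strHex.subString(i + 4,i + 5)) * 16) + (hexMAP.get(strHex.subString(i + 5,i + 6)));
				c4 = (hexMAP.get(strHex.subString(i + 6,i + 7)) * 16) + (hexMAP.get(strHex.subString(i + 7,i + 8)));
				out = (c1 - 240) * 262144 + (c2 - 128) * 4096 + (c3 - 128) * 64 + c4 - 128; 
			}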
			charLIST.add(out);
		}		
		return charList;
	} 

 

rtuttle

LOL thread confusion, ignore this message ;)

 

boBNunny, I like the change you made to the map, that'll save some space for sure!

 

 

-Richard

rtuttle

We should get a utility class together for code share.  Anyone interested?

boBNunny

Thanks. I actually had a solution similar to yours, but you had a few efficiencies that I really liked, so (in the parlance of rap music) I "sampled" your code.

 

But I found I needed two different variations: one that just tells me whether the string has Unicode or not, and one that returns all of the actual values.

 

But your code was definitely welcome and a great contribution.

 

I think a community share of Utility class methods is a good idea.

 

Anyone else think so?

rtuttle

Hmm, from everything I read about Unicode while researching the code I wrote, I found that Unicode is backwards compatible with standard ASCII. What cases were you running into where you had to specifically figure out whether a string was Unicode?

 

The code should in theory take a standard ASCII character and just spit out the standard 1-byte value for it.

 

-Richard

boBNunny

Well, the problem arises if it's a character (Chinese, for example) that can't be represented in ASCII. Then it just shows as a ? if stripped. But also, we have a legacy system that can't accept Unicode, so if there is even 1 such character, the update to the legacy system will fail. Also, we have a requirement that Unicode be put into Localized fields mirroring those fields, so we can give the user a message telling them to transfer the values first.

rtuttle

Ahh I could see how you would need that.  I'm running into a similar problem so I might use your isUnicode method in an upcoming project if you don't mind.

 

 

 

boBNunny

That's the reason for these boards IMO.  Great to have a virtual community to depend on.  Glad I could help.

SOA Guy

Unicode issues aside, I had a web service callout failure where the user copy/pasted text into SFDC that contained a simple backspace character (0x08).  That is invalid according to the XML specification and the SOAP message could not be transported.

 

Another time, the SOAP data contained a printable European character that was not supported by the Windows codepage. It made it through the SOAP transports and then threw an exception deep within a 3rd party DLL from IBM.

 

It's much easier to clean the string BEFORE transport than try to diagnose where it failed.
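
For the narrow case described here, a stray control character such as backspace (0x08), a regex replace may be enough. This is only a sketch: the class and method names (ControlCharSketch, stripIllegalControls) are hypothetical, and it covers only the control characters below 32 that XML 1.0 forbids, keeping tab, line feed and carriage return. It does nothing about code page issues.

```apex
public class ControlCharSketch {
    // Hypothetical helper: removes control characters 0-8, 11, 12 and 14-31,
    // leaving tab (9), line feed (10) and carriage return (13) in place.
    public static String stripIllegalControls(String input) {
        if (input == null) return null;
        // '\\x00' in the Apex literal reaches the regex engine as \x00
        return input.replaceAll('[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]', '');
    }
}
```

A quick check, building the dirty string from character codes so the backspace is explicit:

```apex
String dirty = String.fromCharArray(new List<Integer>{ 65, 8, 66 });  // 'A', backspace, 'B'
System.debug(ControlCharSketch.stripIllegalControls(dirty));          // AB
```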

boBNunny

Agree. Sometimes you can have a hard space (character 160) that causes havoc with matching where trimming won't remove it, so you need to strip those too. Or even the wide dash that looks like a regular dash. Those you could fix during a matching call or even during storage. Another issue is when the Unicode version of an alphabetic or numeric character is used and it LOOKS like ASCII, but its code is more than 255.
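
The hard space and wide dash cases can be handled with plain string replacement before matching or storage. A minimal sketch, assuming only three look-alikes need mapping (character 160, the en dash 8211 and the em dash 8212 from the Windows-1252 list earlier in the thread); the class and method names (LookalikeSketch, normalize) are made up for illustration.

```apex
public class LookalikeSketch {
    // Hypothetical helper: maps a few look-alike characters to plain ASCII.
    public static String normalize(String input) {
        if (input == null) return null;
        // Build the targets from character codes so the source stays pure ASCII.
        String hardSpace = String.fromCharArray(new List<Integer>{ 160 });
        String enDash    = String.fromCharArray(new List<Integer>{ 8211 });
        String emDash    = String.fromCharArray(new List<Integer>{ 8212 });
        return input.replace(hardSpace, ' ')
                    .replace(enDash, '-')
                    .replace(emDash, '-');
    }
}
```

Full-width digits and letters that merely look like ASCII would need their own mapping table; this sketch does not attempt that.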

SOA Guy

I combined the ideas into this...

 

    public static LIST<Integer> StringToIntegerList(String strInput) {

        LIST<Integer> charLIST = new List<Integer>();

        if (strInput == null || strInput == '') return charLIST;

        string strHex = EncodingUtil.convertToHex(Blob.valueOf(strInput));
        if (strHex == null || strHex == '') return charLIST;

        // Build map to convert hex to decimal
        Map<String,Integer> hexMAP = new Map<String,Integer>();
        for (integer nLoop = 0; nLoop < 16; nLoop++)
            hexMAP.put('0123456789abcdef'.subString(nLoop, nLoop+1), nLoop);

        Integer increment = 2;
        for(Integer i = 0; i < strHex.length(); i += increment) {
            Integer out = 0;

            Integer c1 = (hexMAP.get(strHex.subString(i,i + 1)) * 16) + (hexMAP.get(strHex.subString(i + 1,i + 2)));
            Integer c2 = 0;
            Integer c3 = 0;
            Integer c4 = 0;
            if(c1 <128) {
                out = c1;
                increment = 2;
            }
            else
            {
                if(c1 > 193 && c1 < 224) {
                    // first of 2
                    increment = 4;
                }
                if(c1 > 223 && c1 < 240) {
                    // first of 3
                    increment = 6;
                }
                if(c1 > 239 && c1 < 245) {
                    // first of 4
                    increment = 8;
                }

                c2 = (hexMAP.get(strHex.subString(i + 2,i + 3)) * 16) + (hexMAP.get(strHex.subString(i + 3,i + 4)));
                if(increment == 4) {
                    out = (c1 - 192) * 64 + c2 - 128;
                }
                else if(increment == 6) {
                    c3 = (hexMAP.get(strHex.subString(i + 4,i + 5)) * 16) + (hexMAP.get(strHex.subString(i + 5,i + 6)));
                    out = (c1-224)*4096 + (c2-128)*64 + c3 - 128;
                }
                else if(increment == 8) {
                    // a 4-byte sequence needs the third byte as well; it was previously left at 0
                    c3 = (hexMAP.get(strHex.subString(i + 4,i + 5)) * 16) + (hexMAP.get(strHex.subString(i + 5,i + 6)));
                    c4 = (hexMAP.get(strHex.subString(i + 6,i + 7)) * 16) + (hexMAP.get(strHex.subString(i + 7,i + 8)));
                    out = (c1 - 240) * 262144 + (c2 - 128) * 4096 + (c3 - 128) * 64 + c4 - 128;
                }
            }

            // keep only characters that are legal in both XML 1.0 and Windows-1252
            if ((out != 0) && IsLegalXmlChar(out) && IsLegalWindows1252(out))
                charLIST.add(out);
        }
        return charList;
    }

boBNunny

That works, but for our purposes, we needed to simply know if it was Unicode or not.  Since this returns a list, we would have to traverse the list to find the numbers (my second method).  My first returns a string of the illegal characters.  So if it's not Unicode, the string length would be 0.

 

But, whatever works for anyone is great.  There's really no one way to do anything.

SOA Guy

I needed to make two minor tweaks to make the Map initialization loop work for me.

 

- The hex alpha values were lower case after convertToHex.

- The substring needs to span a string length of one, not zero.

 

Original:

string strHex = '0123456789ABCDEF';
Map<String,Integer> hexMAP = new Map<String,Integer>();
for (integer nLoop = 0; nLoop < 16; nLoop++) {
    hexMAP.put(strHex.substring(nLoop, nLoop), nLoop);
}

With both tweaks applied:

Map<String,Integer> hexMAP = new Map<String,Integer>();
for (integer nLoop = 0; nLoop < 16; nLoop++)
    hexMAP.put('0123456789abcdef'.subString(nLoop, nLoop+1), nLoop);