
Shortest way to normalize strings - replace accents and non-ASCII chars with 'normal' chars



Banks are really old-fashioned in the way they handle text data in files, even in IT systems built on the latest standards. In SEPA, for example, they still use their own restricted character set, which is not Unicode and is closer to old ASCII or even EBCDIC :(


In the US, it's not a big problem, but in European and Asian countries, this is mega-important.


In Java, the Normalizer and Pattern classes are great for this: Unicode text can be normalized down to ASCII in 2 or 3 instructions. In Apex, it is *very* long-winded.
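For comparison, here is a minimal sketch of the Java approach referred to above. It uses `java.text.Normalizer` to decompose accented letters (NFD) and a regex to drop the combining marks. The class name `AsciiFold` and the method name `stripAccents` are only for illustration. Note that ligatures such as Œ and Æ have no Unicode decomposition, so they would still need explicit handling.

```java
import java.text.Normalizer;

final class AsciiFold {
    // Decompose each accented letter into base letter + combining marks (NFD),
    // then remove every combining mark with the \p{M} character class.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file encoding-independent.
        String sample = "\u00C0\u00C9\u00CE\u00D4\u00DC \u00C7a";
        System.out.println(stripAccents(sample)); // prints "AEIOU Ca"
    }
}
```

The punctuation-to-space replacements in the Apex method would still be a separate step; this sketch only covers the accent folding.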


This is the normalization I use today:


private String clean(String in) {
    //String minmaj = "ÀÂÄÇÉÈÊËÎÏÛÜÔÖaàâäbcçdeéèêëfghiîïjklmnoôöpqrstuùûüvwxyz";        
    String acc = 'ÀÂÄÇÉÈÊËÎÏÌÛÜÙÔÖÒÑ' + '°()§<>%^¨*$€£`#,;./?!+=_@"' + '\'';        // plus Œ, Æ and &, handled separately below
    String maj = 'AAACEEEEIIIUUUOOON' + '                          ' + ' ';
    String out = '';                 
    for (Integer i = 0 ; i < in.length() ; i++) {
        String car = in.substring(i, i+1);
        Integer idx = acc.indexOf(car);
        if (idx != -1){
            out += maj.substring(idx, idx+1);
        } else if (car == 'Œ') {
            out += 'OE';
        } else if (car == '&') {
            out += 'ET';
        } else if (car == 'Æ') {
            out += 'AE';
        } else {
            out += car;
        }
    }
    return out;
}


Remember, this is to produce files where alignment is important, so I can't just replace Æ with AE without fixing the padding instructions (elsewhere).


This method uses too many instructions: has anyone got a better way of doing it, still in Apex?




Hi Codeizard,
I know Pattern and Matcher well.
What I really need is the equivalent of Java's Normalizer to go with them: has anyone got an idea?

David Waugh
@altius_rup, just wanted to say this normalization snippet helped me. I extended it to cover additional characters. Not sure if my extension is covered by your alternative strings that are commented out.

String accents = 'ÃÁÀÂÄÇÉÈÊËÎÏÌÍÚÛÜÙÓÔÕÖÒÑÝ' + '°()§<>%^¨*$€£`#;?!+=@©®™"';
String maj     = 'AAAAACEEEEIIIIUUUUOOOOONY' + '                         ';

Thanks. Would love to see more formal support for string normalization from Salesforce. Can't the Java implementation be lifted?