davehilary

‘Regex too complicated’ error for large volume of data (and a simple regex)

I am getting the ‘Regex too complicated’ error shown below when loading data into our org using the following process:

 

1) an email service to receive the CSV data,

2) an APEX class to split and validate the CSV data, and then

3) a set of @future calls to upsert the data.

 

The same data works in smaller volumes, but not beyond a certain threshold. This applies whether we reduce the number of rows, or reduce the width of certain columns of data by truncating them to 3000 characters (a small number of columns have 10,000 characters of text included). When we do either or both of these steps in any combination to reduce the file size, we don't get this problem. It’s not a problem with a specific badly formatted row either, because reducing the number of rows in various combinations always causes the problem to go away.

 

So we don’t believe it is actually a regex problem, because the regular expression is just finding commas to split up a comma separated file/string - i.e. it's very simple.

 

This is why we think there's an undocumented storage or capacity limit somewhere within the APEX processing that is being exceeded - but one that doesn't have a governor limit associated with it, or indeed an accurate error message. We think it is an erroneous error message - i.e. it's not to do with complicated regex – and that this error message is a symptom of another issue.

 

This error has occurred in code that has been stable to date. It appeared once the size of the files we're uploading grew beyond about 4600-4800KB, which seems to be the threshold beyond which this problem occurs. There seem to be some undocumented limits on the volume of data that can be processed using the solution architecture we've designed.

 

We want to be able to code around this problem, but unless we know exactly what the error is, any changes we make to our code may not actually fix the problem and result in wasted effort. So I don't want to start changing this until I know exactly which part of the solution needs to be changed!

 

I’ve raised this with Salesforce as a potential bug or to see if they could clarify any undocumented limits on processing large volume datasets using the process we’ve designed, but they seem to have decided it’s a developer issue so won’t help.

 

The error message is below:

 

Apex script unhandled exception by user/organization: 

Failed to invoke future method 'public static void PrepareCSV(String, String, String, Integer, Boolean)'

caused by: System.Exception: Regex too complicated

Class.futureClassToProcess.GetList: line 98, column 17
Class.futureClassToProcess.parseCSV: line 53, column 38
Class.futureClassToProcess.PrepareCSV: line 35, column 20 External entry point

 The relevant code snippet is below:

 

 

 

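public static list<List<String>> GetList(String Content)
        {
            // Replace doubled double quotes ("") with the placeholder text DBLQT.
            Content = Content.replaceAll(',"""',',"DBLQT').replaceall('""",','DBLQT",');
            Content = Content.replaceAll('""','DBLQT');
            List<List<String>> lstCSV = new List<List<String>>();
            Boolean Cont = true;
            // Split off up to 499 rows per pass. With a limit of 500, the last
            // element holds the unsplit remainder, which feeds the next pass.
            while (Cont == true){
                List<String> lstS = Content.Split('\r\n',500);
                if(lstS.size() == 500){
                    Content =lstS[499];
                    lstS.remove(499);
                }else{
                    // Fewer than 500 pieces: the input is fully consumed.
                    Cont = false;
                }
                lstCSV.add(lstS);
            }
            return lstCSV;
        }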

 

Any suggestions gratefully received as to whether we're missing something obvious, whether 4MB+ files just can't be processed this way, or whether this might actually be a SFDC APEX bug.

 

 

 

public static list<List<String>> GetList(String Content)
        {
            //Sanjeeb
            Log('GetList started.');
            Content = Content.replaceAll(',"""',',"DBLQT').replaceall('""",','DBLQT",');
            Log('Replacing DBLQT.');
            Content = Content.replaceAll('""','DBLQT');
            Log('Replacing DBLQT.');
            List<List<String>> lstCSV = new List<List<String>>();
            Boolean Cont = true;
            while (Cont == true){
                List<String> lstS = Content.Split('\r\n',500);
                Log('Split up to 500 rows.');
                //List<String> lstS = Content.Split('\r\n',1000);
                if(lstS.size() == 500){
                    Content =lstS[499];
                    lstS.remove(499);
                }else{
                    Cont = false;
                }
                lstCSV.add(lstS);
            }
            Log('GetList ends.');
            return lstCSV;
        }
Best Answer chosen by Admin (Salesforce Developers) 
davehilary

I got a response from my SFDC ISV partner technical support representative which I thought I'd post back to my own question since it's the only 'official' word I've received on this problem. He said:

 

"I’ve been doing some digging and the regex too complicated message is definitely based on the size of the files.  It looks like Email Services provides an entry point that allows developers to push in data sizes that far exceed the heap limits. The regex seems to be failing because of the heap supporting the regex. The only alternative is to cut the file sizes down or choose another integration approach."

 

He said this wasn't likely to change in the next release, although I still maintain it should provide a more meaningful error message.

 

In our use case, this means completely redesigning our integration solution. For anyone else, you should reconsider Email Services if you have large varying attachment sizes to be processed.

All Answers

gurumike

I've just run into this as well. I have an InboundEmailService that uses some regexes to process the email. My test code (which uses test data from an actual email) works just fine, but once in production the code generates an exception.

 

I suspect there's a governor limit being reached in the InboundEmailService environment which is not present in the test environment.

 

The following thread says that there's an undocumented limit on the number of string accesses performed by a regex:

 

http://community.salesforce.com/t5/Apex-Code-Development/Pattern-and-Matcher-Question/m-p/135986

 

gurumike

Thanks for posting your response from SFDC. I was able to work around my error by first using String.indexOf() and String.substring() to strip down my email to only the portion of the text in which I expect my regex to apply. This incidentally helped me simplify my regexes as well. In any case, I am no longer receiving the "Regex too complicated" message. I agree that the error message needs to be changed to more accurately reflect the true cause.
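
A minimal sketch of that narrowing step, written for Execute Anonymous. The marker strings, the sample email text, the pattern, and the helper name sliceBetween are all made up for illustration; only String.indexOf(), String.substring(), Pattern, and Matcher are standard Apex.

```apex
// Anonymous Apex sketch. The marker strings and the pattern are hypothetical;
// substitute whatever delimits the relevant text in your own emails.

// Returns the text between startMarker and endMarker, or the whole body
// if either marker is missing.
String sliceBetween(String body, String startMarker, String endMarker) {
    Integer startPos = body.indexOf(startMarker);
    if (startPos < 0) {
        return body;
    }
    startPos += startMarker.length();
    Integer endPos = body.indexOf(endMarker, startPos);
    if (endPos < 0) {
        return body;
    }
    return body.substring(startPos, endPos);
}

String emailBody = 'Long header...\nBEGIN ORDER\nOrder-Id: 12345\nEND ORDER\nLong footer...';

// The regex now only has to scan the short slice, not the whole email.
String section = sliceBetween(emailBody, 'BEGIN ORDER', 'END ORDER');
Matcher m = Pattern.compile('Order-Id: (\\d+)').matcher(section);
String orderId = m.find() ? m.group(1) : null;
System.debug(orderId);
```

This only helps when the interesting text is a small part of the input. It does nothing for a case where the regex has to cover the whole attachment.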

davehilary

We knew which area to apply the regex to: the entire 4MB CSV attachment! We tried alternative regexes and various combinations of functions, but it started to feel like rearranging the deckchairs on the Titanic. The only workaround that had an effect was cutting the volume down (to a level that was unacceptable to the business). That wasn't sustainable, so we're going to try a completely different approach.

Scott.M

Just ran into this myself. Does anyone know what the actual string size limit is that a regex can be applied to?

gurumike

I don't think there's a specific size limit for the string.  I remember reading in a post somewhere else that the true cause of the error is the number of times the regex code accesses the string.  In other words, it's a combination of string size and regex complexity.  Thus, a simple regex on a large string would cause the "too complicated" error.  Likewise, a very complex regex on a small string could cause the error.

Scott.M

Yikes! That's going to make it really hard to determine what size of strings to split into.

mariagus

Hi,

 

Your post was really useful for me. I had this issue when I tried to process a CSV file of nearly 4MB. But I couldn't ask customers to send me smaller files, so I had to find another way to do it.

 

Finally I developed a Batch Apex process with a custom Iterator. For further information: http://developer.financialforce.com/customizations/importing-large-csv-files-via-batch-apex/
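
As a rough illustration of the custom Iterator idea (not the code from the linked post), the sketch below walks the text one row at a time with indexOf() and substring(), so no regex is involved at all. The class name CsvRowIterator is hypothetical, and the sketch assumes no quoted field contains the row delimiter.

```apex
// Hypothetical row iterator: finds each row with indexOf() instead of
// calling split(), so the regex engine is never used.
// Assumption: no quoted field contains the row delimiter.
public class CsvRowIterator implements Iterator<String>, Iterable<String> {
    private String data;
    private String rowDelimiter;
    private Integer position = 0;

    public CsvRowIterator(String data, String rowDelimiter) {
        this.data = data;
        this.rowDelimiter = rowDelimiter;
    }

    // True while unread text remains.
    public Boolean hasNext() {
        return position < data.length();
    }

    // Returns the next row and advances past its delimiter.
    public String next() {
        Integer delimPos = data.indexOf(rowDelimiter, position);
        if (delimPos < 0) {
            delimPos = data.length();   // last row has no trailing delimiter
        }
        String row = data.substring(position, delimPos);
        position = delimPos + rowDelimiter.length();
        return row;
    }

    public Iterator<String> iterator() {
        return this;
    }
}

// Usage: collect the rows of a small sample.
CsvRowIterator rowIterator = new CsvRowIterator('a,b\r\nc,d\r\ne,f', '\r\n');
List<String> seen = new List<String>();
while (rowIterator.hasNext()) {
    seen.add(rowIterator.next());
}
System.debug(seen);
```

Because the class also implements Iterable<String>, it could presumably be returned from the start() method of a Database.Batchable<String> so that each batch receives a manageable chunk of rows. For that, it would need to be saved as its own top-level class rather than run anonymously.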

SFDCMatt

Just wanted to pile on and say that this error creeps up the other way too, if you're constructing a large file in Apex. I've got a process where we are picking up a bunch of sObject records, constructing a .csv file where each record = 1 line, and then posting that out to a web service. Works fine for up to about 1800 records. But if you go much beyond that, it fails silently with the Regex Too Complicated message.

 

The likely workaround for me is to cut the file up into smaller batches and use the Winter '13 feature to chain batch jobs until my queue of records to post to the service is at 0.

 

Would still love to get an official response on this thread or a resolution to this (or heck even a better error message).

James Kent

I took the FinancialForce code, which is a really good solution, and gave it a once-over; see the Apex Row Iteration Class post.

quietopus
Found a more generalized workaround for this and wanted to share.

As others have mentioned, there is a limit on the number of times a regex can match an input sequence (as of this writing).
If you are using Pattern and Matcher objects, you can periodically reset this limit by resetting the Matcher.

For example, while testing a client's project, I generated a Regex too complicated error using the following (actual inputs redacted for possible IP concerns):
 
Matcher m = Pattern.compile(<a really complicated regex>)
    	   .matcher(<a complicated input>);

for(Integer i = 0; i < 10000; ++i)
{ 
    Boolean match = m.matches();
}

The following did not generate an error:
 
Matcher m = Pattern.compile(<a really complicated regex>)
    	   .matcher(<a complicated input>);

for(Integer i = 0; i < 10000; ++i)
{ 
    m.reset(<a complicated input>);
    Boolean match = m.matches();
}

This makes sense, since the documentation (now) says that the access count limit is tied to the input sequence. Resetting the input sequence, of course, resets the limit.

I suspect that, internally, the String.split() method is doing something similar to the first example.

Here is a safeSplit() method, which follows the second approach:
 
/**
* Split a string of any size, while avoiding the dreaded 'Regex too complicated'
* error, which the String.split(String) method causes on some large inputs.
*
* Note that this method does not avoid other errors, such as those related to 
* excess heap size or CPU time.
*/
List<String> safeSplit(String inStr, String delim)
{
    Integer regexFindLimit = 100;
    Integer regexFindCount = 0;
    
    List<String> output = new List<String>();
    
    Matcher m = Pattern.compile(delim).matcher(inStr);
    
    Integer lastEnd = 0;

    while(!m.hitEnd())
    {
        while(regexFindCount < regexFindLimit && !m.hitEnd())
        {
            if(m.find())
            {
                output.add(inStr.substring(lastEnd, m.start()));  
                lastEnd = m.end();
            }
            else
            {
                output.add(inStr.substring(lastEnd));
                lastEnd = inStr.length();
            }
            
            regexFindCount++;
        }

        // Note: Using region() to advance instead of substring() saves 
        // drastically on heap size. Nonetheless, we still must reset the 
        // (unmodified) input sequence to avoid a 'Regex too complicated' 
        // error.
        m.reset(inStr);        
        m.region(lastEnd, m.regionEnd());
        
        regexFindCount = 0;
    }
    
    return output;
}

// Testing code
///////////////////

Integer numRepeats = 50000;
String bigInput = 'All work and no play makes Jack a dull boy.\r\n'.repeat(numRepeats);

// This generates a 'Regex too complicated' exception.
//
// List<String> a = bigInput.split('\r\n');

// This avoids a 'Regex too complicated' exception.
//
String[] a = safeSplit(bigInput, '\r\n');

System.assertEquals(numRepeats+1, a.size());

As noted, you are on your own regarding memory and CPU usage. It actually required some golfing to find a numRepeats value which triggers "Regex too complicated" while avoiding "String is too long.", "Apex heap size too large...", and "Apex CPU time limit exceeded". For brevity, I'll leave the implementation of safeSplit(String inStr, String delim, Integer  maxOutputs) as an exercise for the reader.

Also, it may be necessary to modify the regexFindLimit. The maximum appropriate value seems to depend loosely on the input sequence as well as the delimiter pattern. Worst-case, set it to 1 and pay for it in CPU time.
 
Ajit Kumar 19
@quietopus Nice one, it worked for me.
Aleksandrs Savkins
@quietopus thanks man! I added "if (Limits.getLimitCpuTime() - Limits.getCpuTime() <= 100) break;" to your code and now I am able to process my CSV with 150k lines in two executions.
Rick Paugh 3
@Aleksandrs Savkins, do you have an example of how you implemented that?
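
One way such a check might be wired in is sketched below. This is not Aleksandrs' actual code, only a guess at the shape of it: a variant of the safeSplit() loop that stops when little CPU time is left and records the offset to resume from. The class name GuardedSplitter, its fields, and the cpuReserveMs parameter are hypothetical.

```apex
// Hypothetical helper, not code posted in this thread: splits like
// safeSplit() but stops early when CPU time runs low and records where
// a later execution should resume.
public class GuardedSplitter {
    public List<String> rows = new List<String>();
    public Integer resumeAt = 0;       // offset to pass as startAt next time
    public Boolean finished = false;   // true once the whole input is consumed

    public void run(String inStr, String delim, Integer startAt, Integer cpuReserveMs) {
        Integer findsPerReset = 100;   // same idea as regexFindLimit above
        Integer findsSinceReset = 0;
        Integer lastEnd = startAt;

        Matcher m = Pattern.compile(delim).matcher(inStr);
        m.region(lastEnd, inStr.length());

        while (!finished) {
            // Bail out while there is still CPU time left to save progress.
            if (Limits.getLimitCpuTime() - Limits.getCpuTime() <= cpuReserveMs) {
                break;
            }
            // Periodically re-supply the input, as safeSplit() does.
            if (findsSinceReset >= findsPerReset) {
                m.reset(inStr);
                m.region(lastEnd, inStr.length());
                findsSinceReset = 0;
            }
            if (m.find()) {
                rows.add(inStr.substring(lastEnd, m.start()));
                lastEnd = m.end();
            } else {
                rows.add(inStr.substring(lastEnd));
                lastEnd = inStr.length();
                finished = true;
            }
            findsSinceReset++;
        }
        resumeAt = lastEnd;
    }
}

// Usage on a tiny input; a real caller would check finished and, if false,
// start another execution with startAt = resumeAt.
GuardedSplitter splitter = new GuardedSplitter();
splitter.run('a\r\nb\r\nc', '\r\n', 0, 100);
System.debug(splitter.rows);
```

How the second execution is started (a chained batch, a Queueable, another @future call) and how the input string is carried over to it are left open here, since the thread does not say.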