Developers Planet

James February 2016

std::regex fatal error

I'd like to think this isn't actually a bug in the standard library, but I'm running out of places to look.

The statement std::regex(expression), where expression is a std::string, causes a fatal memory access error.

expression is declared by the statement:

std::string expression = std::string("^(") +
    std::string("[\x09\x0A\x0D\x20-\x7E]|") + // ASCII
    std::string("[\xC2-\xDF][\x80-\xBF]|") + // non-overlong 2-byte
    std::string("\xE0[\xA0-\xBF][\x80-\xBF]|") + // excluding overlong
    std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte
    std::string("\xED[\x80-\x9F][\x80-\xBF]|") + // excluding surrogates
    std::string("\xF0[\x90-\xBF][\x80-\xBF]{2}|") + // planes 1-3
    std::string("[\xF1-\xF3][\x80-\xBF]{3}|") + // planes 4-15
    std::string("\xF4[\x80-\x8F][\x80-\xBF]{2}") + // plane 16

This regex was taken from http://www.w3.org/International/questions/qa-forms-utf-8 to test whether a byte sequence is valid UTF-8.

Is this actually a bug in the library, or am I missing something really tiny?

Compiled with Visual C++ (VS2015), if that happens to make a difference.

EDIT: I forgot to mention that one specific line breaks the code: std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte. Comment that out and it works fine. That line on its own causes the memory access error.


sln February 2016

If you write regex escapes in an ordinary string literal (one that does not use raw string syntax),
you have to escape the backslashes.

Example, new string:

std::string expression = std::string("^(") +
    std::string("[\\x09\\x0A\\x0D\\x20-\\x7E]|") + // ASCII
    std::string("[\\xC2-\\xDF][\\x80-\\xBF]|") + // non-overlong 2-byte
    std::string("\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|") + // excluding overlong
    std::string("[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|") + // straight 3-byte
    std::string("\\xED[\\x80-\\x9F][\\x80-\\xBF]|") + // excluding surrogates
    std::string("\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|") + // planes 1-3
    std::string("[\\xF1-\\xF3][\\x80-\\xBF]{3}|") + // planes 4-15
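    std::string("\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}") + // plane 16
    std::string(")*$"); // closing the group; assumed, since the listing in the question is cut off here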

When you don't escape them, the compiler interprets each \xHH sequence itself
and stores the single byte with that hex value in the string, so the regex engine never sees the escape.
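To see the difference the compiler makes, here is a minimal sketch using an ASCII value so the result is easy to inspect. The function names are hypothetical and only there for illustration.

```cpp
#include <string>

// Hypothetical helpers for illustration; the names are not from the post.

// Unescaped: the compiler turns \x41 into the single byte 0x41 ('A').
std::string compiler_decoded() { return "\x41"; }

// Escaped: the literal holds four characters: '\', 'x', '4', '1'.
// The regex engine is the one that later decodes the escape.
std::string engine_decoded() { return "\\x41"; }
```

Printing the size of each string (1 versus 4) is a quick way to check which form a given literal actually produced.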

And while the regex engine probably ends up with the right character either way,
it is better to pass the hex escape to the engine, so that you can see which character
breaks it (if one does).
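The raw syntax mentioned at the top is the other way to hand the escapes to the engine untouched. A minimal sketch, assuming C++11 raw string literals and the default ECMAScript grammar. It uses a small ASCII range instead of the UTF-8 ranges from the question, and the helper names are made up for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` consists only of the letters A to C.
bool only_a_to_c(const std::string& text) {
    // R"(...)" keeps the backslashes, so the regex engine parses \x41 and \x43.
    static const std::regex pattern(R"(^[\x41-\x43]+$)");
    return std::regex_match(text, pattern);
}

// The raw literal and the doubly escaped literal spell the same pattern text.
bool same_pattern_text() {
    return std::string(R"([\x41-\x43])") == std::string("[\\x41-\\x43]");
}
```

With this form the pattern in the source file reads the same as the pattern the engine receives, which makes it easier to spot the piece that misbehaves.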

Post Status

Asked in February 2016
Viewed 2,000 times
Voted 7
Answered 1 times

