Maryam Pashmi February 2016

Mysql query performance issues while using several REGEXP?

In my query I need to have several regular expression in order to filter the user name or email because I am not interesting on them. So, I have written this query in mysql and after running it , it took so many time to give back me result. I got problem wit performance. Moreover after running this query, even don't filtering my information in the correct way. I am not sure how can I improve my query in order to :

  • speed up the query response time
  • applying correct filtering for regular expression

I will be appreciated for any help.

select DISTINCT t.user, vg_product_id,t.`platform`, pd.`mail`,
substring_index(group_concat(p.ts ORDER BY p.ts DESC SEPARATOR ','), ',', 1) as sub_start_ts, 
substring_index(group_concat(t.`expires_at`ORDER BY t.`expires_at` DESC SEPARATOR ','), ',', 1) as expired_time 
from users u
inner join tariff_subs_info t on (t.`user` = u.`user_xmpp_login` 
                                  and t.`user` NOT REGEXP ('^([A-Za-z]{2,3}(produsero|usero)+[0-9]{1,3})$' or '(\w+|\d+)?test(\w+|\d+)?' )
                                  and t.vg_product_id REGEXP "^(europe?|usa?|unlimited?|basic?)([a-zA-Z0-9]+|\_)+(and?|ios?)+$" )
left  join plus_data pd on (u.`user_xmpp_login` = pd.`user`)
inner join purchase_log p on (p.purchase_id = t.purchase_id)
WHERE (pd.mail not like '%guerrillamail.com' 
       or pd.mail is null) 
group by 1,2 ORDER BY DATE(p.ts);

and this is my result:

noadstestuser   basic_XXX_ios   ios NULL    2015-10-26 14:00:32 2015-10-26 14:05:24
brusero2    unlimited_XX_ios    ios brusero2@yhx.yg 2015-11-03 15:41:57 2015-11-03 15:46:45
brusero3    bXX_uscios  ios brusero3@tb.fff 2015-11-03 15:43:53 2015-11-03 15:48:42
esusero1    unliXX_usc  ios esusero1@es.userr   2015-11-03 13:51:54 2015-11-03 13:56:41
esusero3    basic_X_i os    esusero3@yn.yyf 2015-11-03 13:55:08 2015-11-03 14:00:02
esusero4    basic_X ios esusero4@yn.ttx 2015-11-03 14:01:50 2015-1        

Answers


Arth February 2016

OK first off:

t.`user` NOT REGEXP (
  'usero pattern' or
  'test pattern'
) 

is incorrect.. you can't or together two string inputs.. you need:

    t.`user` NOT REGEXP ('usero pattern')
AND t.`user` NOT REGEXP ('test pattern')

You should probably test all the expressions directly on the relevant tables first to make sure the logic is there.. as mentioned by @WiktorStribizew the word and digit character sets may not be recognized.

Next I'm guessing you don't need DISTINCT and GROUP BY in your query, you can probably just drop the DISTINCT.

Unfortunately, your regular expressions are going to hit every row.. there's not a lot you can do about that.

If it's a one off query, you'd probably just have to eat the time. You should probably run an EXPLAIN to check nothing crazy is happening to be sure though.

If you are running the queries regularly I suggest you split out the pertinent information via your application as you save the rows and then index the resultant extra columns, it won't be normalised.. but it's probably a situation where the performance boost will make it worth it.

UPDATE

For example your first two rows could be saved as:

[user:'noadstestuser', is_usero:0, is_test:1],
[user:'brusero2',      is_usero:1, is_test:0]

Then your

    t.`user` NOT REGEXP ('usero pattern')
AND t.`user` NOT REGEXP ('test pattern')

becomes simply

    t.is_usero = 0 AND t.is_test = 0

Much faster.. and you can index these fields if it helps.

Of course you'll have to do the matching before you save each row and this may be easier using your application logic (e.g. PHP code).


Wiktor Stribi┼╝ew February 2016

MySQL REGEXP does not support \w and \d shorthand character classes. So, (\w+|\d+)? subpattern is invalid in MySQL. Since the ? quantifier makes the subpatterns optional (repeat one or zero times), you can remove them altogether.

Thus, or '(\w+|\d+)?test(\w+|\d+)?' will turn into and t.`user` NOT REGEXP ('test'), but it is equal in meaning to and t.`user` NOT LIKE '%test%'.

Next, ([a-zA-Z0-9]+|\_)+ is also very problematic since there are nested quantifiers (a + inside an alternation group that has another + quantifier applied). This is a classical scenario when catastrophical backtracking may occur. I suggest replacing this subpattern with [a-zA-Z0-9_]+ to match letters, digits or an underscore. Or its equivalent [[:alnum:]_]+.

Post Status

Asked in February 2016
Viewed 1,057 times
Voted 8
Answered 2 times

Search




Leave an answer