Mat February 2016

What regex can I use to split a string into words but keep phrases in round brackets together?

I'd like to split a string like this:

my_string = "I want to split this (these should stay together) correctly"

and have the following result:

["I", "want", "to", "split", "this", "(these should stay together)", "correctly"]

I tried this:

my_string.split(/(?=[^\(]){1,} (?=[^\)]){1,}/)

But the elements inside the round brackets get separated. How can I achieve this?

Answers


anubhava February 2016

You can split using this regex:

/ +(?![^()]*\))/


i.e.

my_string.split(/ +(?![^()]*\))/)

(?![^()]*\)) is a negative lookahead: it rejects a space that is followed by zero or more non-parenthesis characters and then a right parenthesis, so spaces inside (...) are never used as split points.
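To see the lookahead at work, here is a minimal sketch. The constant name SPLIT_OUTSIDE_PARENS and the nested test string are made up for illustration. The second call shows a limitation worth knowing: the pattern only looks for the next parenthesis to the right, so it presumably is not what you want when groups are nested.

```ruby
# Split on runs of spaces, unless the next parenthesis to the right is a
# closing one (which means the space sits inside a "(...)" group).
# SPLIT_OUTSIDE_PARENS is a hypothetical name used only in this sketch.
SPLIT_OUTSIDE_PARENS = / +(?![^()]*\))/

my_string = "I want to split this (these should stay together) correctly"
words = my_string.split(SPLIT_OUTSIDE_PARENS)
# => ["I", "want", "to", "split", "this", "(these should stay together)", "correctly"]

# Nested groups are not handled: the space after "(b" is followed by "(",
# not ")", so the lookahead lets the split happen there.
nested = "a (b (c d) e) f".split(SPLIT_OUTSIDE_PARENS)
# => ["a", "(b", "(c d) e)", "f"]
```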


sawa February 2016

split is the wrong tool here. Use scan.

my_string.scan(/\([^)]*\)|\S+/)
# => ["I", "want", "to", "split", "this", "(these should stay together)", "correctly"]

If a balanced pair of parentheses can be adjacent to other non-space characters, and you want to keep them together as one token, then you might want this one, which works more generally:

my_string.scan(/(?:\([^)]*\)|\S)+/)

In general, when the delimiters can be expressed in a simple pattern, use split. When the content can be expressed in a simple pattern, use scan.
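The difference between the two scan patterns only shows up when a parenthesized group touches other characters. A small sketch, using a made-up input string (the variable names are illustrative, not from the answer):

```ruby
# Hypothetical input where a "(...)" group is glued to surrounding text.
text = "call foo(a b)bar now"

# Pattern 1: a whole "(...)" group, or else a run of non-space characters.
# The run "foo(a" starts before the parenthesis, so the group gets cut.
simple = text.scan(/\([^)]*\)|\S+/)
# => ["call", "foo(a", "b)bar", "now"]

# Pattern 2: a token is one or more pieces, each piece being either a whole
# "(...)" group or a single non-space character.
general = text.scan(/(?:\([^)]*\)|\S)+/)
# => ["call", "foo(a b)bar", "now"]
```

Both patterns describe the content of a token rather than the delimiter, which is why scan fits better than split here.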


Cary Swoveland February 2016

It may be desirable to do it in two steps, to keep the regex simple:

first, middle, last = my_string.partition /\(.*\)/
[*first.split, middle, *last.split]
  #=> ["I", "want", "to", "split", "this", "(these should stay together)",
  #    "correctly"]

Another example:

first, middle, last = "x (x(x(x)x)x) x".partition /\(.*\)/
[*first.split, middle, *last.split]
  #=> ["x", "(x(x(x)x)x)", "x"]

But it fails here:

first, middle, last = "x (x)x(x) x".partition /\(.*\)/
[*first.split, middle, *last.split]
  #=> ["x", "(x)x(x)", "x"]

assuming ["x", "(x)", "x", "(x)", "x"] is desired.
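For that last input, one possible workaround is to drop partition and scan instead, treating each non-nested (...) group as a token and excluding parentheses from the plain-word alternative. This is only a sketch: the method name split_keeping_groups is hypothetical, and the pattern is a variation on the scan approach rather than anything from this answer. It does not handle nested parentheses such as "x (x(x(x)x)x) x".

```ruby
# Hypothetical helper: returns "(...)" groups and bare words as separate tokens.
# Assumes parentheses are balanced and never nested.
def split_keeping_groups(str)
  # Either a whole "(...)" group, or a run of characters that are
  # neither whitespace nor parentheses.
  str.scan(/\([^)]*\)|[^\s()]+/)
end

adjacent = split_keeping_groups("x (x)x(x) x")
# => ["x", "(x)", "x", "(x)", "x"]

original = split_keeping_groups("I want to split this (these should stay together) correctly")
# => ["I", "want", "to", "split", "this", "(these should stay together)", "correctly"]
```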


mudasobwa February 2016

Just for the sake of curiosity:

my_string.gsub(/\(.+?\)/) { |m| m.gsub ' ', ' ' }.split(/ +/)

Try to copy-paste the code above into IRB and stay tuned:

#⇒ ["I", "want", "to", "split", "this", 
#   "(these should stay together)", "correctly"]

:)

NB: This is a joke; please do not use it in production.

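As @sawa suggested, this is a kind of escaping: the spaces inside the parentheses are temporarily swapped for non-breaking spaces (U+00A0), which look the same but are not matched by / +/. To make this answer correct, everything should be converted back to normal spaces after the split:

    nbsp = "\u00A0" # non-breaking space, written as an escape so it is visible
    my_string.gsub(/\(.+?\)/) { |m| m.gsub(' ', nbsp) }
             .split(/ +/)
             .map { |s| s.gsub(nbsp, ' ') }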

