Search with Regex matching some strings but ignoring others that are identical

palomero · 16 August 2023 11:39

I am trying to retrieve as many instances as possible of a demonstrative in my corpus. The DEM may have multiple shapes/orthographic representations, such as “ina”, “inad”, “ena”, “enad” (all possibly followed by a comma or a whitespace), for which I am using the following regex:
(i|e)na(?:,|\s|d)

When I run the search, many of the results are matched correctly, but not all of them. For example, in the screenshot below, segment 492 is shown as a result (as it contains the string “ina”), but segments 494 and 495, which contain the strings “ina” and “ina,” respectively, are ignored.

Are there any solutions to this issue? Thanks in advance

hasloe · 16 August 2023 14:59

Hello,

Could it be that the last part of your regex is missing a quantifier? For example, like this:
(i|e)na(?:,|\s|d)*
zero or more of the last group, or this:
(i|e)na(?:,|\s|d)?
once or not at all from that group?
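
To make the effect of the quantifier concrete, here is a minimal sketch in Python. It assumes the search tool's regex engine behaves like Python's `re` module, which the thread does not confirm, and the sample segments are hypothetical rather than taken from the corpus. Without a quantifier the final group has to consume exactly one character, so an “ina” at the very end of a segment cannot match.

```python
import re

# Hypothetical sample segments for illustration; not taken from the original corpus.
segments = ["ina tawo", "kita ina", "ina, ena", "enad"]

original = re.compile(r"(i|e)na(?:,|\s|d)")   # trailing character is required
optional = re.compile(r"(i|e)na(?:,|\s|d)?")  # trailing character may be absent

results = {}
for seg in segments:
    # Record whether each pattern finds at least one match in the segment.
    results[seg] = (bool(original.search(seg)), bool(optional.search(seg)))
    print(f"{seg!r}: original={results[seg][0]}, optional={results[seg][1]}")
```

One caveat: once the group is optional, the pattern can also match inside longer words (for instance the “ina” in “final”). If that matters for the corpus, anchoring with word boundaries, e.g. `\b(i|e)nad?\b`, may be worth trying, assuming the tool supports `\b`.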

And if the first part of the last group, ?:, is meant to match any of those three characters, you could write the regex like this (with square brackets):
(i|e)na([?:,]|\s|d)*
or even continue the |-ing (and escape the ?):
(i|e)na(\?|:|,|\s|d)*
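
For comparison, a small sketch of how the two readings of `?:` differ, again assuming a regex flavour like Python's `re`. In that flavour `(?:` opens a non-capturing group, so `?` and `:` are not matched as literal characters there, whereas `[?:,]` is a character class that does match them. The test string is made up for illustration.

```python
import re

# (?:...) groups without capturing; the "?:" is syntax, not text to match.
non_capturing = re.compile(r"(i|e)na(?:,|\s|d)*")
# [?:,] is a character class matching a literal "?", ":" or ",".
char_class = re.compile(r"(i|e)na([?:,]|\s|d)*")

text = "ina?"  # hypothetical input

m1 = non_capturing.search(text)
m2 = char_class.search(text)

print(m1.group(0), m1.groups())  # matched text and captured groups
print(m2.group(0), m2.groups())
```

Here the first pattern stops at “ina” and captures only the vowel, while the second also consumes the “?” and stores it in a second capture group.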

Maybe this helps you find all the DEMs you wish to retrieve.

-Han