regex/capturing parentheses in multi-file, multi-layer search

system · 22 August 2013 04:38

Hi,

I can see that for regular expression matching in the multi-layer search I can use capturing parentheses to match a previous match within the same annotation. I can’t seem to figure out how this might work across columns or layers. Is it possible to capture the match from one annotation and match it in another? If not, it would be a very useful addition!

For instance, I would like to find an annotation ending in (\w+)$ in the context of that same word occurring at, say, the end of the next annotation \1$ (or a million other useful searches I can think of!).

If the single layer search could treat all annotations as being separated by a newline character, then such a search would be easy to implement.
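
To illustrate the idea outside ELAN, here is a minimal Python sketch. It is not ELAN's search engine; it assumes the annotation values of one tier are available as a plain list of strings (the function name and the sample data are made up for illustration) and that no annotation contains a newline itself.

```python
import re

# \b(\w+)$ captures the last word of an annotation ("line"). The lookahead
# requires the next line to end in that same word without consuming it,
# so every annotation in a longer chain is reported. re.MULTILINE lets $
# match at each line end; "." never matches a newline, so the lookahead
# only ever inspects the directly following annotation.
PATTERN = re.compile(r"\b(\w+)$(?=\n.*\b\1$)", re.MULTILINE)

def find_repeated_final_words(annotations):
    """Return (index, word) for each annotation whose last word is also
    the last word of the next annotation."""
    text = "\n".join(annotations)  # one annotation per line
    return [
        (text.count("\n", 0, m.start()), m.group(1))  # line number = annotation index
        for m in PATTERN.finditer(text)
    ]

# Hypothetical annotation values, in time order.
annotations = [
    "he ties the rope",
    "then he tests the rope",
    "and he climbs down",
]
found = find_repeated_final_words(annotations)
```

The leading \b is an addition to the pattern above: without it, a shorter tail of the final word (for example "rope" inside "grope") could be captured.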

Cheers,
Tom Honeyman

hasloe · 3 September 2013 16:10

Yes, this sounds like a useful extension. I don’t know how difficult it would be to build that in, but we’ll look into that. Some of what you want could maybe be achieved with the new variable match mode, if that could be combined with regular expressions.
I’ll add it to the wish list.

-Han

system · 9 September 2013 16:59

Hi Tom, you could consider using tiers with one word per annotation or use word-boundary regexp metacharacters to implement your “search as if there were line breaks between annotations” suggestion. I am tempted to also suggest N-gram search (with # as word wildcard) but cannot fit that into your specific example, so maybe my intuition is wrong here. However, there is another interesting new function that could help you:

Since the multicon project, complex multi layer search can use variables. If I assume that you have a word wise annotation tier where the final punctuation also is a separate annotation, you could search for:

$1 directly followed by “.” anywhere before $1 directly followed by “.”

In other words, two sentences ending with the same word. While you cannot do this in ELAN or Trova yet, the limitation is only in the user interface: at the moment, you cannot mix field types (variable, regexp, exact, substring) within one query because that would clutter your screen with buttons next to each field. The engine would allow it, though. So if you have an idea for the user interface, we could enable it. Note that using a variable in the first field of a query can make it much slower, because all annotations then have to be considered as values for that variable.

Another trick could be using both a word tier and a sentence tier, matching two arbitrary but adjacent sentences in the latter and any but identical words in the former. Add time overlap constraints to say that the words have to end when the sentences end, then you match “last words in sentences”. This works with all fields set to variable, already with the current version of ELAN and Trova: “In any sentence tier, $1 directly followed by $2; in any word tier $3 ends when $1 ends and later $3 ends when $2 ends”.

This query uses 2x2 fields and 3 different variable names, with $3 used twice. “Directly followed” means “0 annotations between”; “later” means “at any later point”. You can also add a constraint saying that the sentence tier and the word tier have to be connected, as siblings or parent-child.
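
For readers who want to see the logic of that query spelled out, here is a rough Python emulation. It is only a sketch of what the query expresses, not how the ELAN/Trova engine works internally; the Annotation type, the function name and the sample data are hypothetical, and both tiers are assumed to be sorted by time.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    start: int  # milliseconds
    end: int
    value: str

def same_final_word(sentences, words):
    """Return (sentence1, sentence2, word) for adjacent sentences whose
    final words (the words ending when the sentence ends) are identical."""
    # Index words by end time, so "$3 ends when $1 ends" becomes a lookup.
    word_ending_at = {w.end: w.value for w in words}
    hits = []
    # Pair each sentence with its direct successor ("0 annotations between").
    for s1, s2 in zip(sentences, sentences[1:]):
        w1 = word_ending_at.get(s1.end)
        w2 = word_ending_at.get(s2.end)
        if w1 is not None and w1 == w2:  # the same value plays the role of $3 twice
            hits.append((s1.value, s2.value, w1))
    return hits

sentences = [
    Annotation(0, 2000, "he tests the rope"),
    Annotation(2000, 4000, "he pulls the rope"),
    Annotation(4000, 6000, "he climbs down"),
]
words = [Annotation(*w) for w in [
    (0, 500, "he"), (500, 1000, "tests"), (1000, 1500, "the"), (1500, 2000, "rope"),
    (2000, 2500, "he"), (2500, 3000, "pulls"), (3000, 3500, "the"), (3500, 4000, "rope"),
    (4000, 4500, "he"), (4500, 5000, "climbs"), (5000, 6000, "down"),
]]
hits = same_final_word(sentences, words)
```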

system · 10 September 2013 01:08

Hi Eric,

Thanks very much (and thanks to Han for the original response too!).

I’ll have to play around with variable match. I hadn’t noticed that new feature (that seems to be happening a lot lately! Many new features). Is it greedy? i.e. if I search across > N multiple annotations, can I make it stop searching for any subsequent $1 if something else comes up in the meantime (e.g. a ‘.’)?

The actual search I am trying to do is a little more complex. I just simplified it for the question, because I was primarily concerned about the functionality. I am actually looking for instances of what’s called ‘tail-head linkage’. So for instance in a narrative text, the end of one sentence starts the next, e.g.:

“He ties the rope and tests it. Having tested it, he climbs down and…”

The words I am matching might be slightly different (e.g. ‘tests’ versus ‘tested’) but predictable in the language I am matching. And so I’m not just blind matching words at the end of sentences, and in fact I may be looking for portions of words with predictable mutations. And what gets repeated and where is a little variable too, so actually it could be a number of complex searches.

So your second trick doesn’t quite work for what I’m after either, but I can see how it works and might think of uses for it, thanks. Am I right in thinking that variable mode basically allows me equivalence/non-equivalence matching, and then I can fine tune this using the options in the drop down boxes? I can’t at any point also search for a string? The simplest version of the search that I actually want is that the last word from one sentence is the second last word from the next. Even better if I can actually specify that last word as it’s always the same. I can’t see how to do this using this trick.

As for the interface, that’s a tricky one. It is already quite busy. And I’m no fan of right-clicking to turn features on and off. It would be a bit of a kludge, but what about changing the input fields to combo boxes with text input where the drop down menu gave you options to set the match type?

Being a regex user, I’m happy enough only using regexps! But I realise that it’s not for everyone. It would be better in my opinion if ‘variable’ matches were incorporated into regexps, but I admit it would make simple equivalence matching pretty complex for the average user.

I can see that carrying over regex matched variables to other fields would be a nightmare to program too. e.g. if there are capturing parentheses in multiple fields then where do you start numbering the matches? And could a match in one field apply in another while a match in that other field applies in the first? Hence my suggestion of treating a single tier as being separated by newline characters (or null characters?) - that way it’s a non-complex search domain and you wouldn’t need to extract portions of one match and insert them into others. Of course, it would be a problem if people added newlines to their annotations… and you wouldn’t be able to search for aligned annotations on multiple tiers (or it would be a pain to code matching up the annotations again).

But what about using regex named capturing groups? e.g. (?<NAME>X). That way the user could explicitly design the search with respect to matching between fields. And only named groups would be accessible across fields? I think it’s only available from Java 7 onwards, though. I think you’d probably have to restrict named variables to one field and make the match available in others, or it would be too complex to program, and at any rate it could be very very slow.
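
A rough Python sketch of what I mean, purely as an illustration. Nothing here exists in ELAN: the function name is made up, and letting \k<NAME> in a second field refer to a group captured in the first field is my own hypothetical convention (borrowed from Java's named backreference syntax). Note that Python writes named groups as (?P<NAME>X) rather than (?<NAME>X).

```python
import re

def match_across_fields(first_pattern, second_template, first_text, second_text):
    """Match first_pattern against one annotation, then substitute its named
    captures into second_template and match that against another annotation."""
    m = re.search(first_pattern, first_text)
    if m is None:
        return False
    captured = m.groupdict()  # e.g. {"last": "rope"}
    # Replace each \k<name> with the literal text captured under that name.
    # re.escape keeps captured characters from being read as regex syntax.
    second_pattern = re.sub(
        r"\\k<(\w+)>",
        lambda ref: re.escape(captured[ref.group(1)]),
        second_template,
    )
    return re.search(second_pattern, second_text) is not None

# Field 1 defines the named group; field 2 only refers to it.
same = match_across_fields(
    r"(?P<last>\w+)$", r"\b\k<last>$",
    "he tests the rope", "he pulls the rope",
)
different = match_across_fields(
    r"(?P<last>\w+)$", r"\b\k<last>$",
    "he tests the rope", "he climbs down",
)
```

Restricting definitions to one field and references to the others, as here, sidesteps the numbering question, since names rather than positions identify the captures.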

system · 1 August 2014 05:41

On a different but related note to this, what about being able to define re-useable portions of regexp for use in a search pattern?

For instance, say I wanted to define a set of vowels or consonants relevant to what I’m searching for, and then re-use that portion of regexp multiple times across a search. It’s much cleaner to say define them:

$v = “[aeiou]”
$c = “[ptkmnb]”

and then reuse the variables in a match:

\b$c$c?$v$c$c?$v$c?\b

to look for words with two syllables.

Basically the setup would swap in the contents of the variable before matching the regexp. This would greatly simplify building complex regular expressions. The swapped in version would then be:

\b[ptkmnb][ptkmnb]?[aeiou][ptkmnb][ptkmnb]?[aeiou][ptkmnb]?\b

Of course, this is just a simple example, but the possibilities could be a lot more complex.
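
As a minimal sketch of the swapping step (in Python, and only my assumption of how it could work, not an existing ELAN feature), the function name and the $name convention below are hypothetical:

```python
import re

# User-defined fragments, keyed by variable name (without the $).
variables = {"v": "[aeiou]", "c": "[ptkmnb]"}

def expand(pattern, variables):
    """Replace each $name with its stored fragment before matching.
    A $ that is not followed by a known name (e.g. an end anchor) is kept."""
    return re.sub(
        r"\$(\w+)",
        lambda m: variables.get(m.group(1), m.group(0)),
        pattern,
    )

expanded = expand(r"\b$c$c?$v$c$c?$v$c?\b", variables)
has_two_syllables = re.search(expanded, "pata") is not None
```

One limitation of this naive version: a variable directly followed by a letter or digit (say $va) would be read as a different variable name, so a real implementation might want a delimited form such as ${v}.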

If the variables were stored against the search domain, then over time, a user could build a complex set of variables to help with searching within that search domain. Personally I would use it for all sorts of things from building custom character classes to defining verbal morphology, or more simply just for matching against controlled vocabularies for small classes of words (e.g. all the $pronouns, $demonstratives, etc).

system · 1 August 2014 05:53

If you like the idea, then I guess I’d also stretch it to include recursion, such that variables can exist in variables as well:

$v = “[aeiou]”
$c = “[ptkmnb]”
$disyllabic_word = “\b$c$c?$v$c$c?$v$c?\b”
$monosyllabic_word = “\b$c$c?$v$c?\b”

and so on…
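
Again just as a hedged sketch in Python (hypothetical names, not an ELAN feature): nested definitions can be resolved recursively, as long as circular definitions are detected so the expansion cannot loop forever.

```python
import re

definitions = {
    "v": "[aeiou]",
    "c": "[ptkmnb]",
    "disyllabic_word": r"\b$c$c?$v$c$c?$v$c?\b",
    "monosyllabic_word": r"\b$c$c?$v$c?\b",
}

def resolve(name, definitions, _active=()):
    """Return the definition of `name` with all nested $variables expanded."""
    if name in _active:
        # `name` is already being expanded further up the call chain.
        raise ValueError(f"circular definition: {name}")

    def substitute(m):
        inner = m.group(1)
        if inner not in definitions:
            return m.group(0)  # unknown name: leave the text untouched
        return resolve(inner, definitions, _active + (name,))

    return re.sub(r"\$(\w+)", substitute, definitions[name])

mono = resolve("monosyllabic_word", definitions)
di = resolve("disyllabic_word", definitions)
```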

hasloe · 3 August 2014 22:49

There is already an item on the wish list about storing regular expressions for later re-use, so that the same expressions don’t have to be entered over and over again. Your suggestion sounds like a more sophisticated version of this request. And it sounds useful, so we’ll add this to the existing request (without being able to tell if or when this can be implemented).

-Han

chrdoehler · 30 November 2018 05:48

Has this been implemented in one of the ELAN versions since this thread?

What I am looking for seems to be related. For example: I want to replace a with ä, but only word-finally or before t.

I search for a(\b|t)
In other software, I would then invoke my capture group when doing the replace. I would replace with: ä\1 or with ä$1

The second step does not work in ELAN, neither in the single file search nor in the multiple file search & replace.
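
For comparison, this is the behaviour I mean, shown with Python's re module rather than ELAN (the sample text is made up). Python writes the backreference in the replacement as \1; Java-style engines use $1.

```python
import re

text = "kata mat kan"

# a(\b|t): an "a" that is word-final (the group matches the empty string at
# a word boundary) or is followed by "t" (the group captures the "t").
# The replacement puts back whatever the group captured after the "ä".
result = re.sub(r"a(\b|t)", r"ä\1", text)
```

Here the "a" in "kan" is left alone because it is neither word-final nor followed by "t".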

hasloe · 3 December 2018 10:04

I’m afraid we haven’t been able to dedicate any time/resources to implementing either of these wishes (storing regex and supporting backreferences in search/replace) for the past few years. But these items are still on the list, together with a few other improvements to the search (and replace) engine.