Lexicon analyzer incorrectly infers targets for polysemous entries

Dear Divya,

In ELAN 6.9 (Apple Silicon), the new lexicon analyzer does not seem to infer the correct sense-level information for lexical entries associated with multiple senses, despite explicit statements in the lexicon connections. It appears to always retrieve information at the entry level, plus the first sense of the entry. Information from additional senses is not retrieved correctly.

I describe the problem below. There are a couple of workarounds, but I believe the incorrect inference is not the intended behavior of the new analyzer. An MWE (an .eaf file and an XML lexicon) can be downloaded here; the link is valid for 7 days.

Thank you for any help!

Best wishes,
Weijian

Set up

Lexicon analyzer:

source      target1         target2
wd (word)   mb (morpheme)   ge (gloss)

Lexicon connection:

tier_type    lexicon_field
mb           lexical-id
ge           sense/gloss
ps           sense/grammatical-category
lexical_id   id
sense_id     sense/id

Tier hierarchy:

tier                  stereotype             parent
mb (morpheme)         symbolic_subdivision   wd (word)
ge (gloss)            symbolic_association   mb
ps (part of speech)   symbolic_association   ge
lexical_id            symbolic_association   mb
sense_id              symbolic_association   ge

Example triggering incorrect behavior

For example, the entry ‘watch’ has two senses, and they have different glosses and grammatical categories.

  • sense_1: n., ‘timepiece’, sense_id = uuid_watch_sense_1
  • sense_2: v., ‘look’, sense_id = uuid_watch_sense_2

During interlinearization, when sense_2 is selected, the corresponding gloss ‘look’ is retrieved, but the analyzer retrieves n and uuid_watch_sense_1 instead of v and uuid_watch_sense_2.
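
The difference between the reported and the expected behavior can be sketched in a few lines of Python. This is only a toy model, not ELAN's code: the dictionary layout, the entry id uuid_watch and the function names infer_reported and infer_expected are assumptions made for illustration. The output keys follow the tier names from the set up above.

```python
# Toy model of the entry 'watch'. The layout and the entry id are
# hypothetical; only the sense ids, glosses and categories come from the report.
ENTRY = {
    "id": "uuid_watch",
    "senses": [
        {"id": "uuid_watch_sense_1", "gloss": "timepiece", "grammatical-category": "n"},
        {"id": "uuid_watch_sense_2", "gloss": "look", "grammatical-category": "v"},
    ],
}


def infer_reported(entry, chosen_gloss):
    """Reported behavior: the chosen gloss is kept, but the other
    sense-level fields are read from the first sense of the entry."""
    first = entry["senses"][0]
    return {
        "ge": chosen_gloss,
        "ps": first["grammatical-category"],
        "sense_id": first["id"],
    }


def infer_expected(entry, chosen_gloss):
    """Expected behavior: every sense-level field comes from the chosen sense."""
    sense = next(s for s in entry["senses"] if s["gloss"] == chosen_gloss)
    return {
        "ge": sense["gloss"],
        "ps": sense["grammatical-category"],
        "sense_id": sense["id"],
    }


print(infer_reported(ENTRY, "look"))
# {'ge': 'look', 'ps': 'n', 'sense_id': 'uuid_watch_sense_1'}
print(infer_expected(ENTRY, "look"))
# {'ge': 'look', 'ps': 'v', 'sense_id': 'uuid_watch_sense_2'}
```

In this model the two functions agree whenever an entry has exactly one sense, which is why single-sense entries are not affected.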

Workaround 1

One workaround is to extract all sense-pos combinations of the same lexical item into separate entries. For example, ‘water’ has two senses, separated into two lexical units:

  • water (entry_id = uuid_watch_1): n., ‘h2o’, sense_id = uuid_water_h2o
  • water (entry_id = uuid_watch_2): v., ‘give.h2o’, sense_id = uuid_water_give.h2o

During ambiguity selection, the analyzer infers part of speech and sense_id correctly as there is only one sense associated with each lexical entry.
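
For a larger lexicon, the splitting step could be scripted. The following is a minimal sketch that works on plain Python dictionaries rather than on the lexicon XML itself: the dictionary layout, the "form" key, the entry id uuid_water and the function name split_polysemous are all assumptions, and applying the idea to a real lexicon file would mean adapting it to that file's actual structure.

```python
import uuid


def split_polysemous(entry):
    """Return one single-sense entry per sense of `entry`.

    Entries with zero or one sense are returned unchanged. Each new
    entry gets a fresh id; the sense ids themselves are kept.
    """
    senses = entry["senses"]
    if len(senses) <= 1:
        return [entry]
    return [
        {
            "id": str(uuid.uuid4()),   # new entry-level id (hypothetical scheme)
            "form": entry["form"],     # same lexical form for every split entry
            "senses": [dict(sense)],   # copy of the single sense, id unchanged
        }
        for sense in senses
    ]


# Hypothetical in-memory version of the 'water' example.
water = {
    "id": "uuid_water",
    "form": "water",
    "senses": [
        {"id": "uuid_water_h2o", "gloss": "h2o", "grammatical-category": "n"},
        {"id": "uuid_water_give.h2o", "gloss": "give.h2o", "grammatical-category": "v"},
    ],
}

for new_entry in split_polysemous(water):
    sense = new_entry["senses"][0]
    print(new_entry["form"], sense["grammatical-category"], sense["id"])
```

Keeping the original sense ids means that sense_id annotations stay valid if the entries are merged back later; only the entry-level ids would need to be updated.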

Possible workaround 2

A second (possible) workaround, which does not require dividing up polysemous entries, is to configure more analyzers and manually retrieve the ‘additional’ information. While this is feasible for human-readable fields such as part of speech tags, it is not feasible for non-readable fields such as UUIDs.

Dear Weijian,

Thank you very much for the detailed description and the attached example file.
It helped me to reproduce the issue faster.
I tested the scenario with the attached file and could see that, when an entry has multiple senses, the analyzer only retrieves the first sense and its values.
You are right, the incorrect sense inference is indeed not the intended behaviour.

A check was missing in the inference flow to verify that the sense-level values are taken from the chosen sense.
I added this check, and with it the correct sense (id and grammatical-category) is now inferred. The solution will be part of the next release.

Thank you very much for pointing this out.

Best,
Divya

Dear Divya,

Thank you for fixing this!

I understand that a lot goes into the decision of when to release the next version. However, would it be possible for you to share a working version with the fix implemented? I’m about to start collecting data for a new project. As you can see, not having the fix entails a specific structure of the lexicon, which will have to be changed later.

We’d need macOS (Intel and Apple Silicon) and Windows versions. If it’s not too difficult, I could try building it from source…

Thank you so much either way!

Best,
Weijian

Dear Weijian,

The working version includes other new changes along with a Java version change, so it would require a thorough build procedure for all components. The next release is planned for the end of June or the beginning of July. While you wait, you could use the workaround with additional analyzers. I know it’s not easy with sense IDs, but it would be temporary.

Best,
Divya

Dear Divya,

Thanks for your suggestion! And it’s great to have a rough sense of the timeline.

Best,
Weijian