Lexicon analyzer incorrectly infers targets for polysemous entries

weijian · 7 April 2025 12:12

Dear Divya,

In ELAN 6.9 (Apple Silicon), the new lexicon analyzer does not seem to infer the correct sense-level information for lexical entries with multiple senses, despite explicit statements in the lexicon connections. It appears to always retrieve the entry-level information plus the first sense of the entry. Information from additional senses is not retrieved correctly.

I describe the problem below. There are a couple of workarounds, but I believe the incorrect inference is not the intended behavior of the new analyzer. An MWE (an .eaf file and an XML lexicon) can be downloaded here, valid for 7 days.

Thank you for any help!

Best wishes,
Weijian

Set up

Lexicon analyzer:

source	target1	target2
wd (word)	mb (morpheme)	ge (gloss)

Lexicon connection:

tier_type	lexicon_field
mb	lexical-id
ge	sense/gloss
ps	sense/grammatical-category
lexical_id	id
sense_id	sense/id

Tier hierarchy:

tier	stereotype	parent
mb (morpheme)	symbolic_subdivision	wd (word)
ge (gloss)	symbolic_association	mb
ps (part of speech)	symbolic_association	ge
lexical_id	symbolic_association	mb
sense_id	symbolic_association	ge

Example triggering incorrect behavior

For example, the entry ‘watch’ has two senses, and they have different glosses and grammatical categories.

sense_1: n., ‘timepiece’, sense_id = uuid_watch_sense_1
sense_2: v., ‘look’, sense_id = uuid_watch_sense_2

During interlinearization, when sense_2 is selected, the corresponding gloss ‘look’ is retrieved, but the parser retrieves n and uuid_watch_sense_1 instead of v and uuid_watch_sense_2.
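
To make the mismatch concrete, here is a minimal Python sketch of the behavior I observe versus the behavior I would expect. This is not ELAN code: the dictionary layout, the entry id `uuid_watch`, and the function names `infer_observed` and `infer_expected` are all hypothetical and only model the example above.

```python
# Hypothetical in-memory model of the 'watch' entry (not ELAN's data structures).
ENTRY = {
    "id": "uuid_watch",  # assumed entry id, for illustration only
    "senses": [
        {"id": "uuid_watch_sense_1", "gloss": "timepiece", "grammatical-category": "n"},
        {"id": "uuid_watch_sense_2", "gloss": "look", "grammatical-category": "v"},
    ],
}


def infer_observed(entry, chosen_gloss):
    """Model of the reported behavior: the gloss follows the selection,
    but the other sense-level fields are taken from the first sense."""
    first = entry["senses"][0]
    return {
        "ge": chosen_gloss,
        "ps": first["grammatical-category"],
        "sense_id": first["id"],
    }


def infer_expected(entry, chosen_gloss):
    """Model of the expected behavior: every sense-level field is taken
    from the sense whose gloss was selected."""
    sense = next(s for s in entry["senses"] if s["gloss"] == chosen_gloss)
    return {
        "ge": sense["gloss"],
        "ps": sense["grammatical-category"],
        "sense_id": sense["id"],
    }


observed = infer_observed(ENTRY, "look")
expected = infer_expected(ENTRY, "look")
```

In this sketch, `observed` pairs the gloss ‘look’ with n and uuid_watch_sense_1, while `expected` pairs it with v and uuid_watch_sense_2.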

Workaround 1

One workaround is to extract each sense-pos combination of the same lexical item into a separate entry. For example, ‘water’ has two senses, which are separated into two lexical units:

water (entry_id = uuid_water_1): n., ‘h2o’, sense_id = uuid_water_h2o
water (entry_id = uuid_water_2): v., ‘give.h2o’, sense_id = uuid_water_give.h2o

During ambiguity selection, the analyzer infers part of speech and sense_id correctly as there is only one sense associated with each lexical entry.
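
The restructuring this workaround requires can be sketched as follows. Again, this is only an illustration in Python under assumed names: the dictionary layout, the `form` key, and the function `split_by_sense` are hypothetical, and the numbered entry ids stand in for whatever fresh UUIDs a real lexicon would use.

```python
def split_by_sense(entry):
    """Return one single-sense entry per sense of a polysemous entry.

    The new entry ids are derived by appending a running number; a real
    lexicon would presumably assign fresh UUIDs instead.
    """
    return [
        {
            "id": f"{entry['id']}_{i}",
            "form": entry["form"],
            "senses": [dict(sense)],  # copy so the original entry is untouched
        }
        for i, sense in enumerate(entry["senses"], start=1)
    ]


# Hypothetical model of the polysemous 'water' entry.
water = {
    "id": "uuid_water",
    "form": "water",
    "senses": [
        {"id": "uuid_water_h2o", "gloss": "h2o", "grammatical-category": "n"},
        {"id": "uuid_water_give.h2o", "gloss": "give.h2o", "grammatical-category": "v"},
    ],
}

units = split_by_sense(water)
```

Each element of `units` has exactly one sense, so "the first sense" and "the selected sense" coincide. The cost is that the lexicon no longer records that the two units belong to one polysemous entry.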

Possible workaround 2

A second (possible) workaround, which does not require dividing up polysemous entries, is to configure more analyzers and manually retrieve the ‘additional’ information. While this is feasible for human-readable fields such as part of speech tags, it is not feasible for non-readable fields such as UUIDs.

divya.kanekal · 9 April 2025 08:20

Dear Weijian,

Thank you very much for the detailed description and the attached example file.
It helped me reproduce the issue faster.
I tested the scenario with the attached file and could see that, when there are multiple senses, the analyzer retrieves only the first sense and its values.
You are right, the incorrect sense inference is indeed not the intended behaviour.

A check was missing from the inference flow in the code to see whether the chosen sense is the correct sense.
I have added this check, and with it the correct sense (id and grammatical-category) is now inferred. The fix will be part of the next release.

Thank you very much for pointing this out.

Best,
Divya

weijian · 10 April 2025 09:00

Dear Divya,

Thank you for fixing this!

I understand that a lot goes into the decision of when to release the next version. However, would it be possible for you to share a working version with the fix implemented? I’m about to start collecting data for a new project. As you see, not having the fix entails a specific structure of the lexicon, which will have to be changed later.

We’d need macOS (Intel and Apple Silicon) and Windows versions. If it’s not too difficult, I could try building it from source…

Thank you so much either way!

Best,
Weijian

divya.kanekal · 14 April 2025 19:05

Dear Weijian,

The working version includes other new changes along with a Java version change, so it would require a thorough build procedure for all components. The next release is planned for the end of June or the beginning of July. While you wait, you could use the workaround with more analysers. I know it’s not easy with sense IDs, but it would be temporary.

Best,
Divya

weijian · 15 April 2025 07:26

Dear Divya,

Thanks for your suggestion! And it’s great to have a rough sense of the timeline.

Best,
Weijian