Here’s a topic that’s near and dear to our hearts here at Appen: spelling standardization.
If you’re providing training data for machine translation, speech recognition, or speech synthesis, it’s important to spell each word the same way every time it comes up; otherwise, you’re watering down your training data and confusing the language model.
Even if you’re not going that high-tech and you just want reliable search through your database of client questions or your fieldwork notes, spelling words consistently matters. Standardized spelling matters!
This is especially true for the kind of annotation we do at Appen, so we’re a little biased. Human-annotated data is, by definition, entered by human beings, and every person has their own dialect, habits, and style. Spelling each word the same way is vital to consistent, reliable data.
How hard can that be? Every word has a “correct” way it should be spelled, right? Just look it up in the dictionary if you’re not sure.
Oh boy. Follow us down the rabbit hole.
Dialects
Here’s a problem straight away: is it “standardisation” or “standardization”? This one’s region-based, so it’s not too hard to come up with the relevant spelling for your database. But some cases may be more complex than this – in Norwegian there are two entirely separate spelling systems (Bokmål and Nynorsk) intended to reflect different sets of dialects.
Usually this area isn’t too hard – you decide in advance which spelling convention to follow for your chosen language and dialect. Stray spellings from other systems can be identified through automated checks and post-editing.
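As an illustration of what such a check might look like – a minimal sketch, not Appen’s production tooling, and the word list here is hypothetical – a script can flag spellings that stray from the chosen convention:

```python
import re

# Hypothetical mapping from out-of-convention (British) spellings to the
# chosen (American) convention. A real list would be far larger.
UK_TO_US = {
    "standardisation": "standardization",
    "colour": "color",
    "centre": "center",
}

def flag_stray_spellings(text: str):
    """Return (found, suggested) pairs for spellings outside the convention."""
    return [
        (word, UK_TO_US[word])
        for word in re.findall(r"[a-z']+", text.lower())
        if word in UK_TO_US
    ]

print(flag_stray_spellings("Colour standardisation matters."))
# -> [('colour', 'color'), ('standardisation', 'standardization')]
```

Flagged items then go to a human post-editor rather than being corrected blindly.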
Register
Is it “gonna”, “goin’ a”, “gon’ to”, or “going to”? This one’s more difficult: “going to” is the formally correct option, but it can be a long way removed from the sounds an actual speaker produces. What if you later need to search for one of the less standard pronunciations of a phrase? And if you’re producing a speech database, how do you keep the pronunciations separate in your lexicon?
In some cases, the difference may be minimal enough that you can standardize to the dictionary form. In others it may be more sensible to adopt an informal representation.
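One way to get the best of both – a minimal sketch, assuming a hypothetical variant map rather than a fixed Appen convention – is to store each token both as spoken and in its standardized form:

```python
# Hypothetical map from informal variants to a canonical orthographic form.
CANONICAL = {
    "gonna": "going to",
    "goin' a": "going to",
    "gon' to": "going to",
}

def standardize(verbatim: str) -> dict:
    """Pair the as-spoken form with its standardized equivalent."""
    return {
        "verbatim": verbatim,
        "standard": CANONICAL.get(verbatim.lower(), verbatim),
    }

print(standardize("gonna"))
# -> {'verbatim': 'gonna', 'standard': 'going to'}
```

Searches can then run over the standardized field, while the pronunciation lexicon keeps the verbatim variants distinct.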
No matter how you choose to approach the subject, the conclusion is the same: standardization is vital.
Low-resource languages
It’s all well and good to refer to a dictionary, but some languages don’t have such handy arbiters of spelling. Appen has worked with Australian and Papua New Guinean languages with no written tradition at all, with languages such as KiSwahili, where many alternate spellings may be equally acceptable, and with languages where spelling reform is recent or incomplete. It can also be difficult to build a team in regions with fewer speakers or less ready access to the Internet.
The key here is often working with university researchers and linguists. At the same time, it’s important to achieve consensus on acceptable spellings through consultation with speakers of the language living in their communities. You may find your database contributes to giving speakers of the language new access to writing resources!
Codepoints
Even when the spelling of the word is totally clear, we can run into trouble. Take a look at these two words:
café | саfé
How many letters do these two words have in common? To a human, all of them. To a computer, only the “f”! The “c” and the “a” on the right come from the Cyrillic alphabet, and the “é” on the right is made of two characters (an “e” plus a combining accent) instead of one.
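You can see the mismatch by listing each word’s codepoints; here’s a quick sketch, with the strings built from escape sequences so the trick is visible:

```python
plain = "caf\u00E9"                    # Latin letters with a precomposed "é"
tricky = "\u0441\u0430f\u0065\u0301"   # Cyrillic "с" and "а", then "f", "e" + combining accent

for word in (plain, tricky):
    print(word, "->", " ".join(f"U+{ord(ch):04X}" for ch in word))

# café -> U+0063 U+0061 U+0066 U+00E9
# саfé -> U+0441 U+0430 U+0066 U+0065 U+0301
```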
Codepoint errors, as these are known, look just fine when you’re reading them, but a search of your database won’t find every instance of the word you searched for. It’s even more trouble when your database is training data for automatic speech recognition or speech synthesis – the alternate spelling might not show up in your lexicon, and the whole segment of audio could be discarded!
Okay, so it’s fairly unlikely that someone’s going to be entering Cyrillic characters in your Latin-alphabet database, but for some languages there really are ambiguous cases: identical to the human eye, but distinct to a computer. That’s the case for the two-character “é” shown above, and it’s widespread in many other writing systems too. In Arabic, for example, the letters also have equivalent “presentation form” codepoints in separate Unicode blocks, so the same letter may appear as ٻ or as ﭒ, and that invisible variation runs right through the Arabic alphabet.
So if these kinds of errors are invisible even to the human eye, what can be done to mitigate them? Standardization only works if everybody is working from the same sources, so the fix here is automated: simple scripts that normalize every incoming string to one agreed set of codepoints, adding a little automatic rigor to a human-centric process.
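For instance – a minimal sketch using Python’s standard unicodedata module, not a description of Appen’s internal tooling – Unicode normalization folds many of these variants together, and a separate check can catch mixed scripts, which normalization won’t fix:

```python
import unicodedata

# NFC composes "e" + combining accent into the single precomposed "é".
decomposed = "caf\u0065\u0301"
print(unicodedata.normalize("NFC", decomposed) == "caf\u00E9")   # True

# NFKC additionally folds Arabic presentation forms back to base letters.
isolated_form = "\uFB52"   # presentation form of the letter at U+067B
print(unicodedata.normalize("NFKC", isolated_form) == "\u067B")  # True

# Mixed scripts (like Cyrillic "с" in a Latin word) need their own check.
def mixed_scripts(word: str) -> bool:
    scripts = {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}
    return len(scripts) > 1

print(mixed_scripts("\u0441\u0430f\u00E9"))  # True: Cyrillic + Latin
```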
Quite a bit to take in, right?
These are just a few of the challenges you’ll face when working with transcripts and text databases. We hope you discovered some new things about the trials and tribulations of maintaining all this text. At Appen, we’ve helped clients all over the world tackle these issues. If you’d like to discuss how we can help you or your organization, we’d be happy to hear from you! Contact us here to get started.