Negative News Screening: Processing Foreign Languages

As we discussed in our latest adverse media post, negative news falls into an array of categories and subcategories, each of which needs to be screened properly to produce accurate and relevant results. But how do people look for and analyse adverse media in foreign languages? And why is this still so difficult?

Of course, not all firms need to process foreign languages when it comes to adverse media screening, but we have found it to be of the utmost importance to private banks and other companies whose client bases are international in scope.

Online translators such as Google Translate are far from perfect. Their goal is to improve general-purpose translation, not to handle the specialised topics that many customers need covered.

There is no comparison between technology that simply detects a language and offers a translation and a multilingual natural language processing (NLP) system. smartKYC’s multilingual NLP technology can catch nuances that other adverse media screening platforms miss.

If you’re considering other providers, let us explain smartKYC’s key point of differentiation. We reap the benefits of multilingual NLP and understand the gap between translated content and real meaning that opens up when another process, such as simple language detection and translation, is used. Our competitive advantage is that we do not have to rely on outside translators, and therefore have the knowledge and control to improve our product efficiently.

Let’s look at some examples of situations that could have been overlooked were it not for accurate multi-language processing.

Turkish Language

Turkish is a very tricky language for a variety of reasons: the word order can vary greatly, and it is also a suffix-based language. Take a look at these examples:

Cumhurbaşkanı Erdoğan ve eşi Ayda’yı ziyaret etti.

Google Translate:

President Erdoğan and his wife Ayda visited.

Correct translation: 

President Erdoğan and his wife visited Ayda.

Turkish allows words to be dropped from sentences. In this example, the name of President Erdoğan’s wife was dropped. Google Translate incorrectly interprets ‘Ayda’ as the name of the president’s wife rather than the name of the city they visited.

This is because English rules simply interpret the translated sentence as “Ayda is a political entity’s wife”. The pattern is not caught by Turkish rules because it is ambiguous in Turkish. It means that Google Translate would miss a sentence like “Cumhurbaşkanı Erdoğan ve eşi Emine Ayda’yı ziyaret etti” (“President Erdoğan and his wife Emine visited Ayda”). It also means that false positives are not caught, either.

Google Translate can recognise ambiguous patterns and handle them in various ways. But once the sentence has been translated into English, the pattern is not ambiguous at all, which leads to false positives.

O senede dayanılarak ihtiyati tedbir verildi

Google Translate:

Precautionary injunction was given based on that year

Correct translation: 

Provisional injunction was given based on that deed

Another issue is that, because Turkish is a suffix-based language, parsing words can be difficult at times. Suffixes are usually very small, mostly just one or two letters.

In the example above, the word “senede” could be “sene+de”, the Turkish word for “year” with the locative suffix, or “senet+e”, the Turkish word for “deed” (or “bond”) with the dative suffix. Online translators do not always choose the correct reading in these cases.
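The ambiguity of “senede” can be reproduced with a very small sketch. The two-entry lexicon, the suffix table and the single voicing rule below are illustrative assumptions, not a real Turkish morphological analyser and not a description of how any particular product works. The sketch builds every stem-plus-suffix combination and keeps the ones whose surface form matches the word.

```python
# Toy lexicon and suffix table (assumptions for illustration only).
STEMS = {"sene": "year", "senet": "deed"}
SUFFIXES = {"de": "locative", "e": "dative"}
VOWELS = set("aeıioöuü")


def attach(stem, suffix):
    """Join stem and suffix, voicing a final 't' to 'd' before a vowel.

    This is a deliberate simplification: real Turkish voicing and
    vowel harmony are more involved and do not apply to every stem.
    """
    if stem.endswith("t") and suffix[0] in VOWELS:
        stem = stem[:-1] + "d"
    return stem + suffix


def analyses(word):
    """Return every (stem, gloss, case) whose surface form equals `word`."""
    return [
        (stem, gloss, case)
        for stem, gloss in STEMS.items()
        for suffix, case in SUFFIXES.items()
        if attach(stem, suffix) == word
    ]


print(analyses("senede"))
```

Both readings come back, so the surface form alone cannot settle whether the sentence is about a year or a deed. Something else, such as the legal context of the sentence, has to make that choice.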

Arabic Language

Arabic is very complex to translate because it has many different written and spoken varieties, each used in different social and cultural contexts. These dialects often show many inconsistencies between the spoken and written language.

مدار الساعة – بركات الزيود- قررت الهيئة القضائية التاسعة والمختصة بالنظر في جنايات الفساد لدى محكمة بداية عمان وفي حكم جديد لها, إدانة رئيس مجلس إدارة شركة الفوسفات ومديرها التنفيذي السابق وليد الكردي, بجريمة استثمار الوظيفة في ستة عقود بمنجم الشيدية, وحبسه 18 سنة مع الأشغال الشاقة المؤقتة.

Google translation:

The Clock – Barakat Al-Zayoud- The ninth judicial body specialized in looking into corruption offenses at the Amman Court of First Instance decided, in a new ruling, to convict the Chairman of the Board of Directors of the Phosphate Company and its former CEO, Walid Al-Kurdi, with the crime of investing in six decades of employment in the Shaidiya mine, and imprisoned him for 18 years with works temporary hard.

Correct translation: 

The Clock – Barakat Al-Zayoud- The ninth judicial body specialized in looking into corruption offenses at the Amman Court of First Instance decided, in a new ruling, to convict the Chairman of the Board of Directors of the Phosphate Company and its former CEO, Walid Al-Kurdi, of the crime of abuse of office in six contracts at the Shaidiya mine, and imprisoned him for 18 years with temporary hard labour.

Here, we see a series of incorrect interpretations within a single sentence. For example, what is meant to say “abuse of office” is translated as “investing the job” by Google Translate. This is due to differences in dialect in Jordan.

This happens again with the word “عقود”. It has two different meanings, depending on the dialect, and can translate to both “decades” and “contracts”. Google Translate misses this and changes the duration of the crime to span sixty years, which is incorrect.

The added phrase “of employment” does not correspond to anything in the original snippet as it is understood in Arabic.

These departures from the original meaning change the severity of the crimes. Customers would therefore be flagging something incorrectly if they went by the Google translation.
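One simple way to choose between the two senses of “عقود” is to look at the words around it. The sketch below is a minimal illustration under stated assumptions: the cue lists are hypothetical and hand-picked for this example, and a production system would rely on far richer context than a handful of keywords.

```python
# Hypothetical cue words for each sense (assumptions for illustration).
# "contracts" cues: company, management, at-a-mine
# "decades" cues: since, during, the-past
SENSE_CUES = {
    "contracts": {"شركة", "إدارة", "بمنجم"},
    "decades": {"منذ", "خلال", "الماضية"},
}


def pick_sense(tokens):
    """Return the sense with the most cue words in `tokens`, else None."""
    present = set(tokens)
    scores = {sense: len(cues & present) for sense, cues in SENSE_CUES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # Refuse to guess when there is no evidence or the senses tie.
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


context = "استثمار الوظيفة في ستة عقود بمنجم الشيدية".split()
print(pick_sense(context))
```

With the phrase from the snippet, the mining cue selects the “contracts” sense. When no cue is present the function returns None rather than guessing, which mirrors the point that an ambiguity is better flagged than silently resolved.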

Issues with English Translation from Asian Languages

Latin languages have traditionally been easier for AI to screen because of the regular patterns in their structure. With non-Latin languages, issues arise consistently. Several recurring themes emerge when relying on English translations.

For example, with the Korean language there are two main difficulties:

  1. Word order often doesn’t indicate which word is the agent of an action and which is its recipient.
  2. Case markers assign these roles to words, but they are very frequently omitted, especially in article titles.

This can be seen in the following sentence: 

‘가사도우미 성추행‘ 김준기 회장 경찰조사 받는다

Google Translate:

‘Housekeeper Sexual Harassment’ Chairman Kim Jun-ki under investigation

Housekeeper sexually harassing chairman Kim Joon-ki under police investigation

Correct translation:

Chairman Kim Jun-ki who sexually harassed a housekeeper will be investigated by police (will receive police investigation).

The problem is that if a customer relies only on the English translation, the title loses the fact that it is the police who will be investigating the Chairman. Although this may seem obvious, it can cause serious problems when negative news screening looks for specific words.

Many other issues come up in Thai, Chinese and Japanese. In Japanese, names are problematic, which is an issue in general but is especially relevant for negative news screening. Most importantly, a single string of Chinese characters forming a name can be read in more than one way in Japanese, and this produces a variety of incorrect results. The list of differences and examples could go on indefinitely.

There are countless examples to draw on, but the important point is that without true multilingual NLP, gaps in translation quality arise and change negative news screening results.

Many businesses cannot assess translations because they do not know the languages they are trying to translate. Because of this language knowledge gap, businesses are often unaware of false positives or why they are happening.
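To illustrate the name-reading problem, the sketch below expands a name written in Chinese characters into every romanised candidate. The readings table is a small hand-picked assumption containing a few well-known alternative readings. A real system would need a full name dictionary, and the function name is hypothetical.

```python
from itertools import product

# Illustrative readings only (assumption): each written form maps to
# more than one common reading.
READINGS = {
    "東": ["Higashi", "Azuma"],
    "中島": ["Nakajima", "Nakashima"],
    "裕子": ["Yuko", "Hiroko"],
}


def candidate_names(family, given):
    """Return every romanised 'Family Given' combination for a written name."""
    family_readings = READINGS.get(family, [family])
    given_readings = READINGS.get(given, [given])
    return [f"{f} {g}" for f, g in product(family_readings, given_readings)]


print(candidate_names("東", "裕子"))
# ['Higashi Yuko', 'Higashi Hiroko', 'Azuma Yuko', 'Azuma Hiroko']
```

A single written name yields four plausible romanisations here, so a screen that searches only one of them in English-language sources could miss articles about the same person.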

The Road to Perpetual KYC

Many firms, and of course vendors, are talking about ‘Perpetual KYC’ these days, each with their own take on what it should look like. On the one hand, there is monitoring for as-it-happens changes in structured datasets such as watchlists and corporate registers; on the other, there is looking for genuinely new and pertinent facts in unstructured adverse media.

The latter is impossible without natural language processing, and we would strongly argue that to get it exactly right, with less ‘noise’ and fewer alerts, you must use true multilingual natural language processing. To understand more about our ‘Perpetual KYC’ offering in negative news screening, book your demo with smartKYC today.

Discover smartKYC

smartKYC’s adverse media screening software is the world’s most advanced multilingual semantic search engine, machine-reading all online media content for potential negative news about your clients, improving KYC processes and reducing risks. If you’re interested in learning more about smartKYC’s industry-leading multilingual NLP and how it can transform the efficiency and effectiveness of your KYC operations, book your demo today.