In KYC screening, entity resolution (ER) is extremely important because a large amount of Open Source Intelligence (OSINT) about any individual or company may be spread across different data sources, both structured and unstructured.
Challenges in using structured and unstructured data sets
Structured Data Sets
A company may use many different structured data sets for client KYC screening: aggregated watchlists, corporate registers and even internal lists. ER can be used to harmonise these databases and determine whether Client X really is the same person across several data sets. The difficulty in merging the information is that the same data isn’t always presented in the same way in each data set. A watchlist may display my name as “Hugo G D Chamberlain” whereas Companies House may display it as “Hugo George David Chamberlain.”
Companies like smartKYC use matching rules and tooling to tune their entity resolution algorithms to a chosen degree of sensitivity, so that they can be confident that Client X on one data set is the same as Client X on another.
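As a minimal sketch of one such rule (an illustrative assumption, not smartKYC's actual method; all function names are hypothetical), the two renderings of the name above can be treated as compatible when every name token either matches exactly or is an initial of its counterpart:

```python
def name_tokens(name: str) -> list:
    """Lower-case a name and split it into tokens, treating full stops as spaces."""
    return name.replace(".", " ").lower().split()


def tokens_compatible(a: str, b: str) -> bool:
    """Two tokens agree if they are equal, or if one is the initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def names_compatible(name_a: str, name_b: str) -> bool:
    """Hypothetical rule: same number of tokens, and each pair is compatible."""
    tokens_a, tokens_b = name_tokens(name_a), name_tokens(name_b)
    if len(tokens_a) != len(tokens_b):
        return False
    return all(tokens_compatible(x, y) for x, y in zip(tokens_a, tokens_b))


watchlist_name = "Hugo G D Chamberlain"
register_name = "Hugo George David Chamberlain"
print(names_compatible(watchlist_name, register_name))           # True
print(names_compatible("Hugo P D Chamberlain", register_name))   # False
```

A rule this loose would also accept “Hugo Gary Daniel Chamberlain”, which is why a name match alone is rarely enough and is normally combined with other attributes such as date of birth.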
Sometimes a single-source data set may even contain several instances of the same individual, each displayed as a completely separate entity. Companies House, for instance, has many duplicate, triplicate and other multiple records of single entities which are yet to be resolved. As it is largely self-declared by the individuals registering and filing their own companies, it is prone to mistakes and needless repetition.
Unstructured Data Sets
Although performing entity resolution on structured data is not without its challenges and requires sophisticated systems to do it properly, the real challenges come when trying to perform ER on unstructured data, such as online media and web results.
This is where fields of artificial intelligence (specifically Natural Language Processing, or NLP) are essential if the task is to be performed with any reasonable degree of sensitivity.
Some articles published online contain no identifying attributes about the individual or company they mention other than the name. Many do, however, and NLP makes it possible to structure this unstructured data and then use the structured information to perform the entity resolution.
For example, if an article contained a snippet which read: “Hugo Chamberlain (33), has been made a director of smartKYC,” NLP can be used to extract facts from this unstructured data as:
- Name: Hugo Chamberlain
- Age: 33 (at the time the snippet was published)
- Related Company: smartKYC
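The extraction step above can be sketched as follows. This is a deliberately simple stand-in: a production NLP pipeline would typically use trained named-entity recognition rather than a single hand-written pattern, and the pattern, function name and field names here are all hypothetical, tailored only to sentences shaped like the example snippet.

```python
import re
from typing import Optional

# Toy pattern for text shaped like:
#   "<Name> (<age>), has been made a director of <Company>"
DIRECTOR_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+) \((?P<age>\d{1,3})\),? "
    r"has been made a director of (?P<company>\w+)"
)


def extract_facts(text: str) -> Optional[dict]:
    """Return name, age and related company if the toy pattern matches, else None."""
    match = DIRECTOR_PATTERN.search(text)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "age": int(match.group("age")),
        "related_company": match.group("company"),
    }


snippet = "Hugo Chamberlain (33), has been made a director of smartKYC,"
print(extract_facts(snippet))
```

Anything that does not fit the pattern returns None, which illustrates why rule-based extraction alone does not scale to the variety of real news text.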
The company or bank doing the KYC screening could then compare this against the data it already holds on the client, as follows:
- Name: Hugo George David Chamberlain (First and last name are a perfect match)
- DOB: 27/10/1981 – The age reference is 33, while the date of publication is November 2015 (when the exact age would be 34), so this is a reasonable age match within a one-year tolerance at the time of publication
- Company Directorships: smartKYC Ltd – The company mentioned is a match.
In this simplified example the AI system could reasonably infer that this is the same individual, and the entity resolution is complete.
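The comparison in this example can be sketched as three independent checks. Everything here is an assumption made for illustration: the function and field names are hypothetical, the publication date is taken as 1 November 2015 because only the month is given, the age check allows a one-year tolerance, and company names are compared after dropping a small set of legal suffixes.

```python
from datetime import date

LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llp", "inc"}


def age_on(dob: date, on: date) -> int:
    """Age in whole years on a given date."""
    years = on.year - dob.year
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1
    return years


def first_last(name: str) -> tuple:
    """First and last name tokens, lower-cased."""
    tokens = name.lower().split()
    return tokens[0], tokens[-1]


def company_key(name: str) -> str:
    """Lower-cased company name without common legal suffixes."""
    return " ".join(t for t in name.lower().split() if t not in LEGAL_SUFFIXES)


def resolve(extracted: dict, client: dict, published: date, age_tolerance: int = 1) -> dict:
    """Run the name, age and company checks and combine them into one verdict."""
    known_companies = {company_key(c) for c in client["directorships"]}
    checks = {
        "name": first_last(extracted["name"]) == first_last(client["name"]),
        "age": abs(age_on(client["dob"], published) - extracted["age"]) <= age_tolerance,
        "company": company_key(extracted["related_company"]) in known_companies,
    }
    checks["same_entity"] = all(checks.values())
    return checks


extracted = {"name": "Hugo Chamberlain", "age": 33, "related_company": "smartKYC"}
client = {
    "name": "Hugo George David Chamberlain",
    "dob": date(1981, 10, 27),
    "directorships": ["smartKYC Ltd"],
}
print(resolve(extracted, client, published=date(2015, 11, 1)))
```

Requiring every check to pass is the strictest possible policy. A real system would more likely weight each attribute and compare a combined score against a tuned threshold.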
Unstructured data such as online media and news is, of course, the only OSINT which is truly fluid and constantly changing. Firms should therefore review these processes carefully if they want to run them at scale, quickly and in multiple languages, and so achieve the holy grail of KYC screening: true perpetual KYC.