I have nothing against Harrogate, well, not consciously at least. However, in recent months there has been much discussion of unconscious bias in many walks of life, including in the data used for algorithms. So let's consider the role that the preparation of data for regulatory uses might play and, where issues exist, what I have been able to do to mitigate their impact in major data programmes I have run. For example, what are the inbuilt assumptions of the input data models, and what happens if these precepts and conditions are stretched, strained, or even broken?
To look at just one example of many, take the common approach of removing words such as 'Limited' from company name data before performing a fuzzy match. The premise is that these words add little to the uniqueness of names and, if left in, may make a character-by-character match seem far better than it is. An extreme example would be matching 'A Limited' and 'B Limited'. If the word 'Limited' is left in, the algorithm will likely match the two names, as most of the characters in name A match their counterparts in name B. In contrast, a human observer will immediately note that they are probably totally different companies. So removing the word in this case is a sensible approach, and the Harrogate Limiteds will work okay. However, getting the full benefit of this technique in other countries depends on having a relevant list of words to remove, and that list will differ from one geography to another. Applying the technique equally across a global programme needs explicit effort and analysis - for example, I never thought I would have to learn the Vietnamese word for conglomerate! A similar approach is required for selecting relevant sets of abbreviations to expand or remove.
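To make the 'A Limited' and 'B Limited' point concrete, here is a minimal sketch in Python. It is an illustration only, not the matching logic from any programme I have run: the word lists are deliberately tiny and hypothetical, the names LEGAL_FORM_WORDS, strip_legal_words and similarity are my own inventions, and the standard library's difflib.SequenceMatcher stands in for whatever fuzzy matcher is actually in use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny lists of legal-form words per country code.
# A real programme would need researched and maintained lists per geography.
LEGAL_FORM_WORDS = {
    "GB": {"limited", "ltd", "plc", "llp"},
    "DE": {"gmbh", "ag", "kg"},
}


def strip_legal_words(name, country):
    """Lower-case the name and drop the legal-form words listed for the country."""
    words = re.findall(r"\w+", name.lower())
    stop_words = LEGAL_FORM_WORDS.get(country, set())
    return " ".join(w for w in words if w not in stop_words)


def similarity(a, b, country=None):
    """Character-level similarity between 0 and 1.

    If a country is given, its legal-form words are removed first;
    otherwise the names are compared as they are (lower-cased).
    """
    if country is not None:
        a, b = strip_legal_words(a, country), strip_legal_words(b, country)
    else:
        a, b = a.lower(), b.lower()
    return SequenceMatcher(None, a, b).ratio()


# With 'Limited' left in, two unrelated names look like a strong match.
print(similarity("A Limited", "B Limited"))                # about 0.89
# With the UK list applied, only 'a' and 'b' are compared.
print(similarity("A Limited", "B Limited", country="GB"))  # 0.0
# The UK list does nothing for German names, so the false match returns.
print(similarity("A GmbH", "B GmbH", country="GB"))        # about 0.83
print(similarity("A GmbH", "B GmbH", country="DE"))        # 0.0
```

The third call is the one that matters here: a word list built for one geography leaves the noise in place everywhere else, so the same model quietly performs worse outside the country it was designed for.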
So issues of bias do exist, although much can be done to resolve them, provided you have the experience to recognise them and you plan their mitigation early in the process. Otherwise, an attempt simply to roll out globally a financial crime model that was previously successful in Europe will not only suffer from bias, but will also probably fail to identify the intended targets because of the large volume of data "noise".