There has been a lot of talk about new EMR access monitoring systems. These systems leverage various types of machine learning and artificial intelligence algorithms to identify and rank suspicious behavior. However, parsing their claims is often difficult for two primary reasons: (i) there is no shared data set to evaluate these methods, and (ii) claims are made using different evaluation metrics.
Putting aside the issue of a shared data set for now, let's consider some of the different metrics used today (e.g., false positive rates, false negative rates, true positive rates, true negative rates, recall, precision, and accuracy, among others), and whether they tell the entire story about a system's quality.
To do that, let's consider the following example and an auditing system that uses a Boolean model, in which the system marks each access as either suspicious or not (i.e., not a probabilistic model).
The system audits 100 accesses in a day.
The system marks 10 as suspicious.
Of the 10 suspicious, 5 are actually inappropriate and 5 are actually appropriate.
Of the 90 not marked as suspicious, 7 are actually inappropriate (and not detected).
Given this example, the system would have the following metric values:
True Positives: 5
True Negatives: 83
False Positives: 5
False Negatives: 7
Accuracy is defined as the fraction of all accesses correctly classified as appropriate or inappropriate: (83 true negatives + 5 true positives) / 100 = 88%
Recall is defined as the number of inappropriate accesses detected over all inappropriate accesses that occurred: 5/(5 + 7) ≈ 42%
Precision is defined as the number of inappropriate accesses detected over all accesses the system thought are suspicious: 5/10 = 50%.
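The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name confusion_metrics and its return format are illustrative choices, not part of any particular auditing product.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, and precision from confusion-matrix counts.

    Recall and precision are returned as None when their denominator
    is zero, since the ratio is undefined in that case.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return {"accuracy": accuracy, "recall": recall, "precision": precision}


# Counts from the example: 100 accesses, 10 flagged, 5 of those truly
# inappropriate, and 7 inappropriate accesses missed.
metrics = confusion_metrics(tp=5, tn=83, fp=5, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Passing the counts for a different system (for example tp=0, tn=88, fp=0, fn=12) gives its metrics in the same format, with precision reported as None rather than as a misleading number.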
So how did the system do? Let’s compare it to a simple auditing system that never thinks any access is suspicious. It would have the following metric values:
True Positives: 0
True Negatives: 88
False Positives: 0
False Negatives: 12
Accuracy: (88 + 0)/100 = 88%
Recall: 0/12 = 0%
Precision: 0 / 0 or undefined
As this example shows, the simple auditing system has the same accuracy as the more advanced auditing system, even though it did not find any inappropriate activity. This result occurs because the prior distributions of the appropriate and inappropriate classes are not equal; there are many more appropriate accesses than inappropriate ones. This distribution skew can make simple (and bad) auditing systems look good. In the real world, the distributions are likely skewed even more (e.g., 99% to 1%), compounding this problem.
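To see how skew inflates accuracy, consider a system that never flags anything: its accuracy is simply the share of appropriate accesses. The sketch below illustrates this for the 88/12 split in the example and for a 99/1 split; the function name never_flag_accuracy is hypothetical and used only for illustration.

```python
def never_flag_accuracy(total, inappropriate):
    """Accuracy of a system that marks no access as suspicious.

    Every appropriate access is a true negative and every
    inappropriate access is a false negative, so accuracy equals
    the fraction of appropriate accesses.
    """
    true_negatives = total - inappropriate
    return true_negatives / total


acc_example = never_flag_accuracy(total=100, inappropriate=12)
acc_skewed = never_flag_accuracy(total=100, inappropriate=1)

print(f"12% inappropriate: accuracy = {acc_example:.0%}")
print(f" 1% inappropriate: accuracy = {acc_skewed:.0%}")
```

The rarer the inappropriate accesses, the better this do-nothing system looks on accuracy, even though its recall stays at zero.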
If accuracy is not a fair metric, what metrics should you consider? The F-1 score, which combines precision and recall as their harmonic mean, is one good alternative. An F-1 score close to 1 means the system finds most inappropriate behavior with good precision. In our example, the first auditing system has an F-1 score of about 0.45, while the simple system, which detects nothing, is conventionally scored as 0.
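As a rough illustration, the F-1 score can be computed directly from the counts using the identity F1 = 2PR / (P + R) = 2TP / (2TP + FP + FN). The count-based form sidesteps the undefined precision of the simple system. The function name f1_score and the choice to return 0 when there are no positives at all are assumptions made for this sketch.

```python
def f1_score(tp, fp, fn):
    """F-1 score (harmonic mean of precision and recall) from counts.

    Uses 2*TP / (2*TP + FP + FN), which equals 2*P*R / (P + R)
    whenever precision P and recall R are both defined.
    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


advanced_f1 = f1_score(tp=5, fp=5, fn=7)   # first auditing system
simple_f1 = f1_score(tp=0, fp=0, fn=12)    # never flags anything

print(f"advanced system F-1: {advanced_f1:.3f}")
print(f"simple system F-1:   {simple_f1:.3f}")
```

Unlike accuracy, which was 88% for both systems, the F-1 score separates them clearly.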
In the next post, we will discuss how to evaluate systems that use a probabilistic model to identify suspicious behavior (i.e., an access can be 70% suspicious and 30% not), and how the area under the receiver operating characteristic curve (AUC ROC) is a better metric and is robust to data skew.