from Microsoft Malware Protection Center
Most well-known antimalware tests today focus on broad-spectrum malware. In other words, tests include malware that is somewhat indiscriminate (not necessarily targeted) and at least somewhat prevalent, sometimes very prevalent. Typically, tests do not focus on specialized, highly targeted threats, and most avoid programs that walk the line between good and evil, such as adware and other programs that we call unwanted software as opposed to malware. The files in most test sets are ones that antimalware vendors agree a customer would never want and that are generally pervasive in the ecosystem.
The traditional test score counts each file equally. That is, if there are 100 files, each file is worth 1% of the test. In the real world, however, people don't encounter every piece of malware at the same rate. Some malware families are incredibly prevalent while others are not as pervasive. Likewise, some malware might focus on certain regions or languages and not affect other parts of the world. When it comes to real customer impact, not all malware has the same distribution or prevalence, yet traditionally scored tests treat every file as if it did.
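To make the difference concrete, here is a minimal sketch in Python that scores the same test results two ways: once counting every file equally, and once weighting each file by how often it is encountered. The field names (`encounters`, `detected`) and all of the numbers are hypothetical and chosen only for illustration; they are not taken from any real test.

```python
def equal_weight_score(results):
    """Traditional score: every file is worth the same share of the test."""
    detected = sum(1 for r in results if r["detected"])
    return detected / len(results)


def encounter_weight_score(results):
    """Weighted score: each file counts in proportion to its encounters."""
    total = sum(r["encounters"] for r in results)
    caught = sum(r["encounters"] for r in results if r["detected"])
    return caught / total


# Hypothetical results: the one missed file is by far the most common.
results = [
    {"file": "a", "encounters": 9000, "detected": False},
    {"file": "b", "encounters": 500, "detected": True},
    {"file": "c", "encounters": 300, "detected": True},
    {"file": "d", "encounters": 200, "detected": True},
]

print(equal_weight_score(results))      # 0.75
print(encounter_weight_score(results))  # 0.1
```

In this made-up case the product detects three of four files, so the traditional score is 75%, but the file it misses accounts for 90% of encounters, so only 10% of encounters would have been blocked.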
Collaborating to create a more applicable scoring model
Microsoft has been partnering with AV-Comparatives to create a scoring model that incorporates prevalence to represent true customer impact. At Virus Bulletin (VB) this year, Peter Stelzhammer (AV-Comparatives co-founder) and I presented this model. Today, AV-Comparatives is releasing the prevalence-weighted results from the most recent file detection test. This test compares detection rates of vendors against a very comprehensive malware set – 166k Portable Executable (PE) files.
After working with AV-Comparatives for many years, I have personally developed a great respect for the way they curate files for their tests. They work diligently to select files that are relevant and are not in that "unwanted" category (which vendors would lobby to dispute out of their test), and they are able to source hundreds of thousands of recent files for the test. That said, one thing we found is that, under a traditional scoring model, it is incredibly difficult to source exactly the number of files for each family that would match its prevalence in the ecosystem.
For one, many malware families rely on non-PE components to spread. Jenxcus is a good example: its VBS (Visual Basic script) component is one of the most frequently blocked files on our customers' computers. However, its PE component is seen comparatively rarely, so it's quite difficult to source enough Jenxcus PE files for a test to equate to that family's ecosystem prevalence. In addition, samples from some families might be easier to source than others (for example, because they are more readily found or submitted to public sources). These constraints make it practically impossible to select a test set that perfectly equates to the ecosystem.
Looking at the prevalence model
Enter the prevalence model. AMTSO, through the Realtime Threat List (RTTL), has been making strides lately to encourage vendors to share malware prevalence information with testers to help them build better test sets. While that effort has been moving toward critical mass and getting closer to the features it needs to work, Microsoft offered to sponsor AV-Comparatives and provide them with telemetry details from over 200 million computers in over 100 different countries to create a prevalence-weighted model.
The following chart shows how the test set stacks up to the ecosystem (number of files in comparison to ecosystem prevalence).
Figure 1: In general, files selected for the most prevalent malware families (those in the high category) were underrepresented using the traditional method of scoring and those in the low category were overrepresented.
When you drill down a layer to the highly prevalent malware families, you can start to see why the numbers don't line up. Some of the file infector families, like Sality and Virut, had a very large sample set (a great representation of the family in fact). However, other prevalent families, like Gamarue, Dorv, Jenxcus, and Sventore were underrepresented. Sventore was new – there was only one file to represent that family. Gamarue, Dorv, and especially Jenxcus didn't have nearly enough recent PE files available for the test to allow them to equate to their ecosystem prevalence.
Figure 2: Another example of the test scores not lining up.
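One way to express the comparison in these charts is as a ratio of a family's share of the test set to its share of the ecosystem. The sketch below assumes that definition; the function name, family names, and counts are hypothetical and do not come from the report. A ratio above 1 means the family is overrepresented in the test set, and a ratio below 1 means it is underrepresented.

```python
def representation_ratio(test_counts, ecosystem_counts):
    """Share of test files divided by share of ecosystem encounters, per family."""
    test_total = sum(test_counts.values())
    eco_total = sum(ecosystem_counts.values())
    ratios = {}
    for family, eco_count in ecosystem_counts.items():
        test_share = test_counts.get(family, 0) / test_total
        eco_share = eco_count / eco_total
        ratios[family] = test_share / eco_share
    return ratios


# Hypothetical numbers: FamilyB is common in the ecosystem,
# but only one file represents it in the test set.
test_counts = {"FamilyA": 80, "FamilyB": 1, "FamilyC": 19}
ecosystem_counts = {"FamilyA": 40, "FamilyB": 40, "FamilyC": 20}

for family, ratio in representation_ratio(test_counts, ecosystem_counts).items():
    print(family, round(ratio, 3))
```

With these made-up counts, FamilyA has twice the weight in the test set that it has in the ecosystem, while FamilyB, with a single file, carries only a small fraction of its real-world weight.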
The prevalence-weighted model takes into account the prevalence of the tested file, the malware family associated with the file, and the malware family's partition (high, moderate, low, very low) to calculate each file's impact on the test. This balances the score against the actual customer impact in the ecosystem.
For more details about the exact calculation method, you can see the AV-Comparatives report released today.
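As a rough illustration of how those three inputs could combine, the sketch below splits the total weight in three steps: first across partitions, then across the families within a partition, then across the files within a family. This is an assumed scheme for illustration only and is not the calculation used in the AV-Comparatives report. The function names, the `partition_share` values, and all family and file data are hypothetical.

```python
from collections import defaultdict


def file_weights(files, family_prevalence, family_partition, partition_share):
    """Return a weight per file; the weights sum to 1.

    files: list of (file_name, family, file_prevalence) tuples.
    family_prevalence: family -> ecosystem encounter count.
    family_partition: family -> partition name.
    partition_share: partition -> assumed share of total weight.
    """
    # Total file prevalence per family, over files present in the test set.
    family_file_total = defaultdict(float)
    for _, family, prevalence in files:
        family_file_total[family] += prevalence

    # Total family prevalence per partition, over families present in the test set.
    partition_family_total = defaultdict(float)
    for family in family_file_total:
        partition_family_total[family_partition[family]] += family_prevalence[family]

    weights = {}
    for name, family, prevalence in files:
        partition = family_partition[family]
        family_share = family_prevalence[family] / partition_family_total[partition]
        file_share = prevalence / family_file_total[family]
        weights[name] = partition_share[partition] * family_share * file_share

    # Normalize in case some partition has no files in the test set.
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def weighted_score(weights, detected_files):
    """Sum the weights of the files the product detected."""
    return sum(w for name, w in weights.items() if name in detected_files)


# Hypothetical inputs.
partition_share = {"high": 0.8, "low": 0.2}
family_prevalence = {"FamilyA": 600, "FamilyB": 400, "FamilyC": 100}
family_partition = {"FamilyA": "high", "FamilyB": "high", "FamilyC": "low"}
files = [
    ("a1", "FamilyA", 30),
    ("a2", "FamilyA", 10),
    ("b1", "FamilyB", 5),  # the only file for FamilyB
    ("c1", "FamilyC", 1),
    ("c2", "FamilyC", 1),
]

weights = file_weights(files, family_prevalence, family_partition, partition_share)
score = weighted_score(weights, {"a1", "a2", "c1", "c2"})  # b1 is missed
```

Under this scheme, a family represented by a single file (here the hypothetical `b1`) still carries its family's full share, about 0.32 of the test, so missing it costs far more than the 20% a five-file test would charge under equal weighting.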
The charts above show how the prevalence model balanced test scores to make them more accurately represent a vendor's detection capabilities. In essence, missed files were scored to represent the malware people were more likely to encounter, which is good information for consumers. However, prevalence-weighting the score can mean that vendors (at least those who monitor malware prevalence) might have very similar test scores. Therefore, additional context is probably needed to help consumers make decisions.
Geolocation is one context we analyzed. In the report, we broke down vendor test scores by country using each country's malware prevalence profile. There are some examples of vendors that did great in some countries and not so great in others. Scores didn't always line up with vendors that were co-located in the target region. If you're interested in a specific country, be sure to check out AV-Comparatives' regional maps in the report.
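A simplified way to picture the per-country breakdown: take one vendor's detection rate for each malware family and weight it by how prevalent that family is in each country. The sketch below assumes that simplification; the report's actual method may differ, and the country names, family names, and numbers are hypothetical.

```python
def country_scores(detection_rate, country_profiles):
    """Weight one vendor's per-family detection rates by each country's profile.

    detection_rate: family -> fraction of that family's files detected.
    country_profiles: country -> {family -> encounter count in that country}.
    """
    scores = {}
    for country, profile in country_profiles.items():
        total = sum(profile.values())
        blocked = sum(
            count * detection_rate.get(family, 0.0)
            for family, count in profile.items()
        )
        scores[country] = blocked / total
    return scores


# Hypothetical vendor: strong on FamilyA, weaker on FamilyB.
detection_rate = {"FamilyA": 1.0, "FamilyB": 0.5}
country_profiles = {
    "CountryX": {"FamilyA": 90, "FamilyB": 10},
    "CountryY": {"FamilyA": 20, "FamilyB": 80},
}

print(country_scores(detection_rate, country_profiles))
# {'CountryX': 0.95, 'CountryY': 0.6}
```

The same product scores 95% where FamilyA dominates and 60% where FamilyB dominates, which is the kind of regional spread described above.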
Organizations, especially those that might have special security concerns, might need other differentiation. After Peter and I presented at VB this year, we got lots of great feedback from people at the conference. One of the ideas we discussed was differentiating malware prevalence specifically affecting enterprises and even showing the differences between verticals. Other discussions centered around showing detection differences by type of threat – ransomware in comparison to information stealers, etc.
Prevalence is but one model that provides additional insight to help people make better-informed decisions when choosing their protection provider. Through partnerships like this one with AV-Comparatives and others in the industry, productive discussions and innovative models result in even better antimalware testing and provide greater benefits to consumers and enterprises alike.
Holly Stewart
MMPC