Getting a reputation: How SmartScreen looks at URLs

I'd like to talk a bit about how we determine the reputation of different URLs and IPs and use this to protect against spam, phishing, and other abuse in Internet Explorer and Windows Live Hotmail.

Let's start with a bit of background. When an abuser (a spammer, phisher, or malware distributor) attacks someone, they have to do two things. First, they deliver a communication (often a spam e-mail) that entices the victim. Second, they "seal the deal" by actually selling the product, stealing the personal information, or installing the malware. (The second part is sometimes referred to as "collecting the conversion.") Dick Craddock and I have talked about some of the steps we take to block abusers' initial communications in previous posts (Fighting the war on spam; Spam, phishing, and other annoyances; and Preventing spam and phishing using e-mail authentication). I'm going to talk about some of the work we do to keep abusers from "sealing the deal."

By far the most common way abusers collect their conversions is using webpages, like the ones shown here:


Sample malware webpage



Sample spam webpage


A number of technical steps go into displaying a webpage, and the reputation systems in SmartScreen® key in on all of them. Here's a quick rundown. Consider the webpage selling medications in the figure above; to visit it you can type the URL into your web browser (although the link is probably dead by now—SmartScreen forces abusers to move quickly):

http://canada-pharmacy.us/

Obviously SmartScreen's reputation systems learn that particular URLs are bad—that is the first step—but we go much further. Every URL is hosted on a domain. In this case the domain is "canada-pharmacy.us". Abusers will often host hundreds or thousands of individually abusive URLs on a single domain. With the right evidence, SmartScreen's reputation system will flag whole domains as abusive.
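To make the step from URLs to domains concrete, here is a minimal Python sketch. It is not SmartScreen's implementation: the helper names, the sample reports, and the threshold of three abusive URLs are all assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host part of a URL (hypothetical helper)."""
    return (urlsplit(url).hostname or "").lower()


def flag_domains(abusive_urls, min_bad_urls=3):
    """Flag a domain once enough distinct abusive URLs point at it.

    The threshold is an illustrative assumption, not a real policy.
    """
    counts = Counter(domain_of(u) for u in set(abusive_urls))
    return {d for d, n in counts.items() if d and n >= min_bad_urls}


# Hypothetical reports of individually abusive URLs.
reports = [
    "http://canada-pharmacy.us/",
    "http://canada-pharmacy.us/order?id=1",
    "http://canada-pharmacy.us/pills/cheap",
    "http://example.com/contact",
]
flagged = flag_domains(reports)
```

With these sample reports, only the domain that hosts three distinct abusive URLs crosses the threshold; a single report against another domain is not enough to flag it.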

URLs and domains are concepts that let humans refer to computers. But every computer that's directly on the Internet also has a numeric code, called its IP address, that lets other computers refer to it. For example, 109.22.33.142 might be the IP address of the computer that's running the web server that's hosting the canada-pharmacy.us domain. SmartScreen's reputation system tracks these as well and will mark specific web server IP addresses as abusive. SmartScreen will also generalize to other computers "in the neighborhood" of known bad ones. For example, IP addresses are often allocated in blocks, and it's likely that the person who owns 109.22.33.142 also owns 109.22.33.143 and .144 and .145. We use knowledge about how infrastructure is allocated (into subnets and Autonomous System Number (ASN) blocks), how message routing works, and more to figure out what other computers the abusers own, and to prevent those abusers from attacking Microsoft customers.
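The "neighborhood" idea can be sketched with Python's standard ipaddress module. This is only an illustration: treating the surrounding /24 block as the neighborhood is an assumption made here, and the function names are hypothetical. Real allocations vary in size, which is why subnet and ASN data matter.

```python
import ipaddress


def neighborhood(ip: str, prefix_len: int = 24):
    """Return the network block of the given size that contains ip.

    A /24 (256 addresses) is an assumed block size for illustration.
    """
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def is_near_known_bad(ip: str, known_bad, prefix_len: int = 24) -> bool:
    """True if ip shares a block with any known-bad address."""
    addr = ipaddress.ip_address(ip)
    return any(addr in neighborhood(bad, prefix_len) for bad in known_bad)


known_bad = ["109.22.33.142"]
print(neighborhood("109.22.33.142"))                  # the containing /24
print(is_near_known_bad("109.22.33.143", known_bad))  # same block
print(is_near_known_bad("109.22.34.1", known_bad))    # different block
```

Sharing a block is only a hint that two addresses have the same owner, so a sketch like this would feed a reputation score rather than block neighbors outright.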

DNS servers are another key to SmartScreen's reputation system. DNS servers translate the domain names in the URLs that you type into your browser into the IP addresses used by computers. SmartScreen assigns a lower reputation score to DNS servers that seem to know just a little bit too much about abusive domain names.
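One simple way to picture a DNS server that "knows too much" is to measure what fraction of the domains it answers for are already known to be abusive. The sketch below assumes exactly that; the server names, domain names, and the minimum sample size are made up for illustration, and the actual signals SmartScreen uses are not described here.

```python
def name_server_scores(ns_to_domains, abusive_domains, min_domains=5):
    """Map each name server to the share of its domains that are abusive.

    Servers with fewer than min_domains observed domains are skipped,
    since a ratio over a tiny sample says very little (an assumed rule).
    """
    scores = {}
    for ns, domains in ns_to_domains.items():
        domains = set(domains)
        if len(domains) < min_domains:
            continue
        scores[ns] = len(domains & abusive_domains) / len(domains)
    return scores


# Hypothetical observations: which name server answers for which domains.
observed = {
    "ns1.shady.example": ["a.example", "b.example", "c.example",
                          "d.example", "e.example"],
    "ns1.bigisp.example": ["f.example", "g.example", "h.example",
                           "i.example", "a.example"],
    "ns1.tiny.example": ["b.example"],
}
abusive = {"a.example", "b.example", "c.example", "d.example"}
scores = name_server_scores(observed, abusive)
```

Here the first server answers for four abusive domains out of five, the second for one out of five, and the third is left unscored because only one of its domains has been seen.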

Making it too expensive to abuse

The point of building reputation on all of these different types of Internet infrastructure is that it costs abusers money. For example, when we find a DNS server that an abuser owns, we give it a bad reputation, and they will then need to invest in a new DNS server. When we find an IP address provider that works with abusers, we proactively find the IPs that they're registering and keep an eye on them. This figure illustrates the increasing costs that abusers incur as we dig deeper into their infrastructure.





Conceptual cost pyramid for Internet abuse


Our goal is to set up a situation where abusers don't make enough money to make it worth their time to attack Microsoft customers, where they find that getting their message in front of our users is hard, and collecting conversions is harder still.

Building and maintaining reputation

Let me now focus in on one specific piece of the reputation system behind SmartScreen: the URL-based reputation system used to fight phishing. Keep in mind that this is just one of over a dozen interrelated systems that work together to help SmartScreen do its job in protecting customers.


Conceptual architectural diagram of phishing reputation


SmartScreen's reputation systems begin with telemetry feeds: reports from end users, data from third parties, traffic from URLs showing up in e-mail, logs from our services, etc. Some of these feeds contain billions of URLs per day. Other feeds contain URLs that a third party has certified to be known phishing sites, and still others contain little more than the fact that a URL has appeared in spam e-mail messages.


Reporting phishing and malware from Internet Explorer


 
Reporting phishing and spam from Hotmail

But we don't assign a bad reputation based on just a single piece of feedback; any given piece of feedback may be from an abuser, from a competitor, or it may be incorrect. Instead, we use a series of algorithms that combine all the data we have to produce the most accurate and comprehensive reputation database possible. Every input feed is different, and each is handled differently, but in general, we take every URL in every feed and use machine learning to predict the probability that the URL is abusive. At a high level, this involves examining each URL for suspicious substrings (for example, the word "pharmacy" in the URL), looking up the history of the URL (its associated domain, IPs, DNS servers, routers, subnets, ASNs), and combining these into tens of thousands of potentially predictive features for the URL. We then apply models based on machine learning, which pore over these features and separate the abusive URLs from the honest ones.
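As a toy illustration of turning a URL into features and then into a probability, here is a sketch that uses three features and a hand-set logistic model. Everything in it is an assumption: the feature names, the substring list, the weights, and the bias are invented, and a production system would learn its weights from labeled data over far more features.

```python
import math
from urllib.parse import urlsplit

# Assumed list of suspicious substrings, for illustration only.
SUSPICIOUS = ("pharmacy", "login", "verify")

# Hand-picked weights standing in for a trained model (assumption).
WEIGHTS = {"suspicious_substring": 1.5, "domain_flagged": 2.5, "ip_flagged": 2.0}
BIAS = -3.0


def extract_features(url, resolved_ip, bad_domains, bad_ips):
    """Combine the URL text and its infrastructure history into features."""
    host = (urlsplit(url).hostname or "").lower()
    return {
        "suspicious_substring": float(any(s in url.lower() for s in SUSPICIOUS)),
        "domain_flagged": float(host in bad_domains),
        "ip_flagged": float(resolved_ip in bad_ips),
    }


def abuse_probability(features):
    """Logistic model: squash a weighted sum of features into (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


bad_domains = {"canada-pharmacy.us"}
bad_ips = {"109.22.33.142"}
risky = abuse_probability(extract_features(
    "http://canada-pharmacy.us/", "109.22.33.142", bad_domains, bad_ips))
benign = abuse_probability(extract_features(
    "http://example.com/contact", "192.0.2.10", bad_domains, bad_ips))
```

In this sketch the first URL trips all three features and gets a high probability, while the second trips none and stays low. No single feature decides the outcome alone, which mirrors the point above about not trusting any one piece of feedback.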

Most of the time, we are confident enough in the findings of our machine learning engine that we can flag a URL as abusive based on this recommendation alone. Sometimes a URL is suspicious but we're not certain; we send many of these suspicious URLs to our analysts for final classification.
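The split between automatic flagging and analyst review amounts to two thresholds on the predicted probability. The sketch below assumes cutoffs of 0.95 and 0.60; those numbers and the label names are illustrative, not values from SmartScreen.

```python
def triage(probability, flag_at=0.95, review_at=0.60):
    """Route a URL by how confident the model is that it is abusive.

    Both cutoffs are assumed values for illustration.
    """
    if probability >= flag_at:
        return "flag"            # confident enough to act automatically
    if probability >= review_at:
        return "analyst_review"  # suspicious, but a human makes the call
    return "no_action"


for p in (0.99, 0.75, 0.10):
    print(p, triage(p))
```

Raising the first cutoff sends more URLs to analysts and lowers the risk of false positives; lowering it does the opposite.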

How SmartScreen reputation protects you

Conceptually, the work of SmartScreen's reputation systems results in a huge database of information about abuse on the Internet. We ship information from this database, on a near-real-time basis, into a large number of Microsoft products and services, including Windows Live Hotmail, Internet Explorer, Bing, AdCenter, Exchange, Microsoft Security Essentials, and more. Each of these services implements some of its safety features based on SmartScreen's reputations.

In the case of Hotmail, the results are used to determine if incoming e-mail messages should be delivered to our customers. Our goal is that Hotmail customers never see messages linking to known phishing, malware, and spam sources. In other scenarios, like when a customer types the URL for a known malware site into the address bar in Internet Explorer, SmartScreen provides a visual warning.
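On the consuming side, a mail filter could use such a database roughly as follows: pull the links out of a message and check each one against known-bad URLs and domains. This is a simplified sketch, not how Hotmail works; the regular expression, the function name, and the two verdict labels are assumptions.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple link matcher (assumption); real mail parsing is harder.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def message_verdict(body, bad_urls, bad_domains):
    """Return "block" if any link in the message has a bad reputation."""
    for raw in URL_PATTERN.findall(body):
        url = raw.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        host = (urlsplit(url).hostname or "").lower()
        if url in bad_urls or host in bad_domains:
            return "block"
    return "deliver"


bad_domains = {"canada-pharmacy.us"}
spam = "Cheap meds at http://canada-pharmacy.us/ today!"
ham = "Lunch menu is at http://example.com/menu."
print(message_verdict(spam, set(), bad_domains))
print(message_verdict(ham, set(), bad_domains))
```

Checking the domain as well as the exact URL means a message is caught even when the abuser rotates to a fresh page on a domain that is already flagged.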

 
Examples of SmartScreen reputation at work in Internet Explorer


False positives

It's worth noting that any nondeterministic filtering system can make mistakes. And, although they are rare, we take mistakes in SmartScreen very seriously, measuring them, managing them, and responding to them as quickly as possible. For more details on what to do if you think SmartScreen is making a mistake, try these resources:


Summary

SmartScreen's reputation systems bring together the telemetry, feedback, and protection of several of Microsoft's major Internet services and tools. As a result, each is safer than it would be if it had to fight abuse alone. For example, in the figure below, each color represents the size of the contribution of a different feed to SmartScreen's reputation database. Notice that no single feed accounts for more than about a quarter of the overall protection.


This is the kind of information I like to read and learn more about. If not for people like yourself taking the time to post such information, I for one would never have taken the time to learn so much about how this works. Thanks. Fabe
 

Nice find and interesting reading, thanks.
Good to see there is a lot going on in the background to make our web visits safer.
 

Interesting. Thanks John.
 
