Reconstructing Reconnaissance: HTTP Referers

by sconzo


With the shift to Intelligence Driven Security Operations, it’s important to understand the individual steps that are (generally) performed as part of a successful intrusion. Lockheed Martin has a great white paper on the subject of Intel Driven Defense. I’m not going to focus on why maturing from an alert-driven workflow to hunting and deriving your own Intelligence is a “Good Thing”™. Instead, we’ll be taking a look at one of the many ways we can create actionable Intelligence: specifically, detecting and reconstructing an adversary’s reconnaissance, and inferring the software, location, motivation and even capabilities of a potential adversary from it. This information is useful either before an incident (as a Minority Report style precognition) or after a breach, to figure out target selection, determine the initial document selection for a phish, see “passive” vulnerability detection/selection, or determine information leakage in general.

Everybody is aware that by visiting websites you leave pieces of information behind, and while it is possible to obfuscate this information, it’s not commonly done. Because reconnaissance is rarely camouflaged, we can use this to our advantage and put together a beginning picture of a potential adversary (or of any information consumer visiting a public website). In this post we’ll take a look at the type of information we can learn from visitors based solely on information in the HTTP Referer Header (from a browser request). All of this information is gathered passively (web server logs, Bro, NetWitness, etc.). If you can actively interrogate a web browser, you can gain much more information about the uniqueness of the person behind the software.

Let’s start by taking a look at a web request. We’ve done other profiling of request structure, covering some of the ways to determine if a user is trying to be deceitful about the software they’re using. Looking at this request:

Sample web request.


We can see a Referer Header. This is generally present when a user (or software) “clicks” on a hyperlink, directing the browser to a new URI, and it is used to notify the server where the client came from. Some software will scrape search engine results and populate this header, but most of the time it’s legitimate browser traffic. This is the first point of interest: it can show some level of interactivity by a user, which could be more interesting than just software farming information (this depends on what you currently care about). Doing link analysis based off Referer values to see who is directing traffic towards your organization is valuable, but I’m going to stay focused on search engine driven traffic.

Search engine choice can be used to infer a bit of information about the user behind the software. What region is the search engine located in? Region specific versions of search engines can help indicate the native language of the user. Did they choose Google HK over Google US? Maybe they’re a native Chinese speaker? Did the request come in with a Referer value that contains “yandex.ru”? Perhaps the user is a native Russian speaker. While I’m talking about language, did you know search engines leak some language information? Crazy? Yes. Awesome? You bet! For example, Google is nice enough to populate the hl value in some of their queries. This is used to denote the language of the user interface (UI), and it’s used to “promote search results in the same language as the user’s input language”. So, if we see a Referer from google.com.hk with an hl value of “zh-CN”, guess what, here are 2 data points that indicate a native Chinese speaker. But there’s more! As an aside, the c2off parameter is used to denote Simplified or Traditional Chinese characters. There’s also the lr parameter, used to restrict results to documents written in a specific language (it analyzes TLD, language meta tags, and primary and secondary languages used in the document), and cr, which restricts results to documents from a specific country (based on TLD and GeoIP location of the server’s IP). What’s even more amazing than all of this is that Google isn’t the only search engine to do this. Many other search engines (Bing, Yahoo, etc.) also use parameters to help with language selection, and they often use the same ones!

The following example will help clarify and illustrate these points on language.

http://www.google.com.hk/search?as_q=marketing+reports&hl=zh-CN&lr=en&newwindow=1&num=10&btnG=Google+%E6%90%9C

It’s a search from google.com.hk with a query of “marketing reports”, from something that has a UI language of zh-CN and is only looking for results in English. That’s a lot of information for 1 line of text! This can be paired with the Accept-Language value in the request to test the inferences above. If it’s also set to zh-CN (or gives it a high preference), this is another data point to help confirm that a native Chinese speaker is the user behind the keyboard.
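As a rough illustration of pulling those fields out of a Referer value, here is a minimal Python sketch using only the standard library. The function name `language_hints` and the dictionary keys are made up for this example, and checking `q` as a fallback for `as_q` is an assumption about how the query may be carried.

```python
from urllib.parse import urlsplit, parse_qs

def language_hints(referer):
    """Pull language-related fields out of a search-engine Referer URL."""
    parts = urlsplit(referer)
    # parse_qs decodes percent-escapes and '+'; every value is a list.
    params = parse_qs(parts.query)

    def first(name):
        return params.get(name, [None])[0]

    return {
        "engine_host": parts.hostname,           # e.g. a regional Google host
        "query": first("as_q") or first("q"),    # 'q' fallback is an assumption
        "ui_language": first("hl"),              # UI language
        "result_language": first("lr"),          # language restriction
        "result_country": first("cr"),           # country restriction
    }

referer = ("http://www.google.com.hk/search?as_q=marketing+reports"
           "&hl=zh-CN&lr=en&newwindow=1&num=10&btnG=Google+%E6%90%9C")
hints = language_hints(referer)
```

Comparing `hints["ui_language"]` with the request's Accept-Language value would be the natural next step.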

Let’s switch gears and look at all the interesting stuff you can learn from the query. There’s the obvious keyword matching that we can do (maybe you’re really interested if anybody is searching for information on your Executive team). But what else is there? There are search engine specific operators, including but not limited to: intitle, inurl, filetype, site, allinurl, and intext; the list goes on and on. Many other search engines also implement the same or similar operators. Combinations of the operators may tell a very interesting (and potentially targeted) story. Perhaps you run across a query that looks like:

site:organization.com filetype:xls email organization.com partner1.com partner2.com

It looks like it’s searching for information in organization.com (and associated subdomains and hosts), looking for Excel files that (likely) contain email addresses from organization.com as well as two partner companies (partner1.com and partner2.com). What about:

intitle:organization.com +“project x” roadmap

That one could be another way of searching pages on organization.com (based on site specific information in the HTML title section) and looking for information regarding a roadmap of project x. You can begin to see the capacity for pattern emergence of people looking for interesting and useful information on your external web presence.
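To make operator spotting concrete, here is a small Python sketch that lists the operators used in an already-decoded query string. The operator list is just the one named above, and `find_operators` is a hypothetical helper, not part of any published script.

```python
import re

# Operators named in the text; extend as needed.
OPERATORS = ("intitle", "inurl", "filetype", "site", "allinurl", "intext")

# \b keeps "inurl" from matching inside "allinurl" and "site" inside "website".
OPERATOR_RE = re.compile(r"\b(" + "|".join(OPERATORS) + r"):(\S+)", re.IGNORECASE)

def find_operators(query):
    """Return (operator, argument) pairs found in a decoded search query."""
    return [(op.lower(), arg) for op, arg in OPERATOR_RE.findall(query)]

pairs = find_operators(
    "site:organization.com filetype:xls email organization.com partner1.com partner2.com"
)
```

Counting which combinations show up over time is one way to start looking for the patterns mentioned above.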

Some other Google-specific URL parameters of interest are:

  • output – Specifies the XML construction of the search results. This could point to programmatic access of search results.
  • num – The number of results to return. The default is 10 and the max is 20; any change in this might be interesting.
  • start – The zero-based index of the first result on the current page. So num=20 and start=200 means the first result being returned is the 201st, or the 11th page of 20 results. High numbers here are very common in scraping traffic. As an aside, no more than 1000 results should be returned by Google, which means if start + num > 1000 you’ve got a defective scraper.
  • cx – Denotes that a custom search engine (within Google) was used.
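A quick Python sketch of flagging these parameters in a Referer follows. It assumes start is a zero-based offset of the first result, so a request reaches past the cap when start + num exceeds 1000. The flag names and the `paging_flags` function are invented for illustration.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_NUM = 10     # default page size, per the list above
MAX_NUM = 20         # maximum page size, per the list above
MAX_RESULTS = 1000   # result cap, per the list above

def paging_flags(referer):
    """Return a list of noteworthy observations about paging parameters."""
    params = parse_qs(urlsplit(referer).query)

    def as_int(name, default):
        try:
            return int(params[name][0])
        except (KeyError, ValueError):
            return default  # absent or non-numeric: fall back to the default

    num = as_int("num", DEFAULT_NUM)
    start = as_int("start", 0)

    flags = []
    if "output" in params:
        flags.append("output-set")            # possible programmatic access
    if "cx" in params:
        flags.append("custom-search-engine")
    if num != DEFAULT_NUM:
        flags.append("non-default-num")
    if num > MAX_NUM:
        flags.append("num-over-max")
    if start + num > MAX_RESULTS:             # assumption: start is an offset
        flags.append("past-result-cap")
    return flags
```

For example, `paging_flags("http://www.google.com/search?q=x&num=20&start=990")` reports a non-default page size and a request past the result cap.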

Any of the parameters listed under https://developers.google.com/custom-search/docs/xml_results#Advanced_Search_Query_Parameters can be used in place of the ‘-’ (minus, does not contain), ‘OR’, and ‘+’ (must contain) operators.

An excellent mapping is provided by Google:

 

(https://developers.google.com/custom-search/docs/xml_results#wsAdvancedSearch)


The next hurdle with looking at this type of information is being able to operationalize it effectively. I’ve uploaded 2 different ways to make effective use of this in an operational setting. The first is a Bro script. This script analyzes the Referer header, and if the request is inbound to your organization (the variables must be set in your Bro config) and a search engine (the list is expandable) appears in the Referer, a log entry is created. The log entry stores the Referer, the search engine found, and whether any search operators were used. This script will also look for user-supplied keywords of interest and flag those queries as well. The other script provided is a NetWitness Parser that will create an entry in risk.info, in addition to alert (for easy reporting), when it sees traffic that matches the above criteria. You can edit and add context to the parser to flag on keywords and search operators; these add additional attributes in risk.warning and risk.suspicious based on what is present in the Referer. However, it’ll be up to you to create an Informer report to pull in the GeoIP information and whether the request was inbound to your organization or not. Both outputs make for a relatively digestible and scalable daily report.
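For readers without Bro or NetWitness handy, the same flow can be approximated over web server logs. The Python sketch below is not a port of either uploaded script; the engine list, local domain, keyword list, and record layout are all placeholder assumptions, and matching engines by host label is a deliberately loose heuristic.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Placeholder configuration (assumptions, not values from the uploaded scripts).
SEARCH_ENGINES = {"google", "bing", "yahoo", "yandex"}
LOCAL_DOMAINS = ("organization.com",)
KEYWORDS = ("roadmap", "email")
OPERATOR_RE = re.compile(
    r"\b(intitle|inurl|filetype|site|allinurl|intext):", re.IGNORECASE)

def is_local(host):
    """True when the requested host belongs to one of our domains."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in LOCAL_DOMAINS)

def analyze(request_host, referer):
    """Return a log record for an inbound, search-driven request, else None."""
    if not referer or not is_local(request_host):
        return None
    parts = urlsplit(referer)
    labels = (parts.hostname or "").lower().split(".")
    engine = next((label for label in labels if label in SEARCH_ENGINES), None)
    if engine is None:
        return None
    params = parse_qs(parts.query)
    query = " ".join(params.get("q", []) + params.get("as_q", []))
    return {
        "referer": referer,
        "engine": engine,
        "operators": sorted({op.lower() for op in OPERATOR_RE.findall(query)}),
        "keywords": [k for k in KEYWORDS if k in query.lower()],
    }

record = analyze(
    "www.organization.com",
    "http://www.google.com/search?q=site%3Aorganization.com+filetype%3Axls+email",
)
```

Each non-None record maps roughly to one line of the daily report described above.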

As with anything, all feedback is welcome.

 


Passive Browser Fingerprinting

by sconzo



Is my network traffic lying to me? Most malware authors don’t seem to spend a lot of effort trying to blend into network traffic. I’m pretty sure the reason for this is “they don’t need to”. Why spend extra effort figuring out how to blend into network traffic when something as simple as the following HTTP request can sneak by undetected?

GET /statistics.html HTTP/1.1
Host: cuojshtbohnt.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1; sv:2; id: 1A698BE9-0211-5EB4-AFDC-644AA479D972) Gecko/20100101 Firefox/9.0.1

*Borrowed from: contagiodump.blogspot.com

That. works. ಠ_ಠ People had trouble finding it initially, people admitted it was hard to track and identify, and there was a lot of reliance on tracking it via User-Agent string or by domain. I’d be willing to place a beer (or two) on the line if anybody can successfully argue that this is any kind of legit Firefox User-Agent string. Compare that request with the following:

GET /logo.gif HTTP/1.1
Host: www.<redacted>.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

Yes, that is a legit Firefox connection and you’ll notice it looks absolutely nothing like the prior example. Aside from both possessing Host and User-Agent headers and being a GET request, they couldn’t be more different. It turns out people that write web browsers love reading RFCs and making them as valid and flexible as possible (thanks guys!). They also seem to favor consistency, except for the dudes writing MSIE. One of the things I (amongst many other people) have been preaching for years is to identify stuff that’s legitimate and then not look there for anything malicious. What if we had the technology to determine valid HTTP client requests (which we do), and used that information to begin looking at everything that wasn’t a valid request? This can be made a bit more interesting from the analytic side by looking at everything that says it’s browser X, but doesn’t appear to behave like X. The hard part is not coming up with a method of detection (there are quite a few), but collecting the data to verify the detection mechanism.

Luckily there are a few projects that analyze network traffic in a way that allows this kind of questioning. I’m sure I’m leaving some out, but these are the few that I’m familiar with (short business plug: we use, develop content for, and help with analytics on these): p0f, Bro IDS, and NetWitness. Each technology does things in a slightly different way, but each still leads to a solid analytic experience for asking different questions about your network.

p0f

It’s back with a vengeance, and I’m late to the game posting about it. Whatever, let’s look at how it works (strictly functionally, not algorithmically). If you dig through the README you’ll find a note like this:

The signature will be matched even if other headers appear in
between, as long as the list itself is matched in the specified
sequence.

That’s key. In other words, you specify which headers should appear in what order relative to one another, rather than by absolute position in the request. The logic determines if the request matches that order, and voilà, signature matched. There are several other really nice features of this implementation: you can look for inconsistencies between platform (OS) and browser, you have the ability to look for default values (this is important to gain accuracy), and the signature format is highly flexible. The only real drawback (as with most tools/products/whatever) is the lack of signatures. Overall it’s one of the best, if not the best (or maybe only), out-of-the-box implementations for passive browser fingerprinting.
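The relative-order idea is easy to try outside of p0f. The Python sketch below is not p0f's signature syntax or its matching code; it only shows the subsequence check, using the header order from the legit Firefox request shown earlier as the signature. The names `matches_in_order` and `FIREFOX_ORDER` are made up here.

```python
def matches_in_order(observed, signature):
    """True if every name in `signature` appears in `observed` in the same
    relative order; unrelated headers in between are ignored."""
    remaining = iter(observed)
    # `name in remaining` consumes the iterator up to the first match, so
    # each signature entry must be found after the previous one.
    return all(name in remaining for name in signature)

# Header order taken from the sample Firefox 9.0.1 request in this post.
FIREFOX_ORDER = ["Host", "User-Agent", "Accept", "Accept-Language",
                 "Accept-Encoding", "Accept-Charset", "Connection"]

# Extra headers in between do not break the match.
with_referer = ["Host", "User-Agent", "Accept", "Accept-Language",
                "Accept-Encoding", "Accept-Charset", "Referer", "Connection"]

# The suspicious request from the top of the post only sent these two.
suspicious = ["Host", "User-Agent"]
```

Here `matches_in_order(with_referer, FIREFOX_ORDER)` holds, while the two-header request does not match even though its User-Agent claims Firefox.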

Note: The next two technologies do not include this functionality out of the box, but there is some sample content floating around to allow for it.

Bro IDS

Bro is also benefiting from a relatively new yet major release. While it doesn’t ship with the ability to profile HTTP connections out of the box, Seth Hall has put some leg work in over the years to come up with a good way to do this, with a rewrite recently. This is a slightly different way of creating the request signatures, but it still has the notion of required headers and relative order along with optional headers. This method doesn’t take default values for headers, but you can get a surprising amount of accuracy just using header order (although the ability to say ‘not this header’ is important). Once again, you’ve got to have data to come up with good (accurate) signatures, but it’s refreshing that some are provided for you. The one downside to Bro, in its most current revision, is that it normalizes HTTP headers. While the normalization is useful for other analytics, in this case being case sensitive is a really good thing, as all major browsers upper-case the first letter of each word in the header name. Bro is proving to be an exceedingly agile platform for network monitoring in general and this could be a great place for analytics like this to live.
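To show how required, optional, and ‘not this header’ rules can combine with relative order, here is a Python sketch. It is not Seth Hall's Bro code or its signature format. The header order comes from the sample Firefox request in this post, but marking Accept-Charset as optional and UA-CPU as forbidden is purely an illustrative choice, not measured browser behavior.

```python
def check_signature(observed, ordered, forbidden=()):
    """Check observed header names against an ordered signature.

    ordered   - list of (header_name, is_required) in expected relative order
    forbidden - header names that must not appear at all
    Headers not mentioned in the signature are ignored. Names are compared
    case sensitively, so 'user-agent' does not satisfy 'User-Agent'.
    """
    present = set(observed)
    if present & set(forbidden):
        return False
    known = {name for name, _ in ordered}
    # Required headers are always expected; optional ones only if present.
    expected = [name for name, required in ordered
                if required or name in present]
    seen = [name for name in observed if name in known]
    return seen == expected

# Hypothetical signature for illustration only.
FIREFOX_LIKE = [("Host", True), ("User-Agent", True), ("Accept", True),
                ("Accept-Language", True), ("Accept-Encoding", True),
                ("Accept-Charset", False), ("Connection", True)]

sample = ["Host", "User-Agent", "Accept", "Accept-Language",
          "Accept-Encoding", "Accept-Charset", "Connection"]
ok = check_signature(sample, FIREFOX_LIKE, forbidden=("UA-CPU",))
```

Because the comparison is case sensitive, feeding this normalized (lower-cased) header names would make every request fail, which is the downside described above.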

NetWitness

NetWitness is in the same boat as Bro in that it doesn’t provide the functionality or signatures within the product; however, it’s another tool you can use to perform similar analysis of network traffic. Instead you (or somebody else) have to create the proper content to do the analysis. The real downside is there is no freely available reference to base this off of (for now, but keep reading).

Our Work

This has been an area that’s piqued my interest for quite a while, as I really like finding new ways to look for interesting behaviors in network traffic. When we were looking to solve this problem we took a bit of a different approach. Since we’re not big on duplicating efforts, we looked at the technologies we had available (at the time it was NetWitness and Bro) and figured out a way to solve this problem in those systems. Doing things a bit backwards, we decided to pick the signature and then create the mapping from signature to potential match. We started with a list of headers that we saw the majority of major browsers use, and used their relative order to determine the browser creating the traffic. Using the following headers: Host, Accept, Accept-Language, Accept-Charset, Accept-Encoding, Connection, User-Agent, UA-CPU, XMLHTTPRequest and Keep-Alive, along with the case requirement, we came up with a series of decision trees to determine the browser. We went with decision trees because they seemed a bit easier to manage than signatures. I’m not sure it will end up that way, but it’s been really easy to manage thus far. The header choice seems to lend itself to profiling HTTP 1.1 connections, but we’ve had some success with HTTP 1.0 requests as well. By adding more headers and some default values you can get increasingly accurate with identification.
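A toy version of the decision-tree idea, in Python, might look like the following. This is not our production tree: the only populated leaf is derived from the single Firefox request shown in this post, every other branch ends in "unknown", and the names `TREE`, `classify`, and `claims_mismatch` are invented for the sketch.

```python
def before(headers, first, second):
    """True when both headers are present and `first` precedes `second`."""
    if first not in headers or second not in headers:
        return False
    return headers.index(first) < headers.index(second)

# Each node is ((first, second), subtree_if_true, subtree_if_false);
# a plain string is a leaf. Real trees would hang other browsers off the
# "unknown" branches.
TREE = (("Host", "User-Agent"),
        (("User-Agent", "Accept"),
         (("Accept-Language", "Accept-Encoding"), "firefox-like", "unknown"),
         "unknown"),
        "unknown")

def classify(headers, node=TREE):
    """Walk the tree using relative header order and return the leaf label."""
    while not isinstance(node, str):
        (first, second), if_true, if_false = node
        node = if_true if before(headers, first, second) else if_false
    return node

def claims_mismatch(headers, user_agent):
    """True when the User-Agent claims Firefox but the header order disagrees."""
    return "Firefox/" in user_agent and classify(headers) != "firefox-like"

firefox_headers = ["Host", "User-Agent", "Accept", "Accept-Language",
                   "Accept-Encoding", "Accept-Charset", "Connection"]
```

The interesting output is the mismatch case: a request that says it is Firefox but lands on a different leaf.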

In the meantime I’ve uploaded a reference for a NetWitness parser that takes care of some of the Opera signatures. If you’re not familiar with the NetWitness parsing language: the parser does a basic check to ensure it’s in an HTTP session. The parser will then check for the presence of each header that we care about and note its relative position. When it encounters (what should be) the User-Agent header, it checks to see if Opera/ is present, and that is noted. At the end of the headers ([CR][LF][CR][LF]) there is logic to check the order of the headers and fall through to it being valid. If the request doesn’t match the pattern (signature) for Opera, and the parser saw Opera/ in the User-Agent, then it will cause an alert to be populated in the Alerts key.

It’s really awesome seeing other people looking at more flexible and creative methods of analytics; it makes me feel less crazy for trying stuff like this. The best part: in a couple of cases, the signatures matched up with my tree (that sounds odd). What a cool verification point tho. I hope to get some more data under our analytics and begin to contribute more back to the community.


Come see VisibleRisk at The SANS DFIR Summit

by Rocky DeStefano


 VisibleRisk Sponsoring SANS DFIR 2012

VisibleRisk is always happy to support the local community by sponsoring events that meet both our intellectual needs and our need to be around insanely intelligent people. SANS DFIR certainly accomplishes both of those directives. This year we are sponsoring the DFIR event in a couple of ways. For those of you attending the Summit, we are sponsoring breakfast on day 1. Additionally, we will have an information table set up on the 26th and 27th. We’ll be there to answer questions and simply contribute to the conversation. No sales, just peers: real practitioners to share what we can with you! As an added incentive for you to visit our booth (and to make up for the fact that we haven’t yet sold our souls and rented booth bunnies) we’ll have a give-away or two that you’ll enjoy.

Information Table - June 26-27 * 9:00am - 5:00pm
Sponsored Breakfast - June 26 * 7:00am

If you haven’t registered and want to attend, let me know and we can get you a 10% discount. It is always a worthwhile event. It is probably one of the most deeply technical events I’ve ever attended, and it is attended by people most of us would love to work with on a daily basis.

 

Follow us on Twitter @visiblerisk, @rockyd, @sooshie for more information during the Summit!

 

https://www.sans.org/forensics-incident-response-summit-2012/