Reconstructing Reconnaissance: HTTP Referers

by sconzo


With the shift to Intelligence-Driven Security Operations, it’s important to understand the individual steps that are (generally) performed as part of a successful intrusion. Lockheed Martin has a great white paper on the subject of Intel-Driven Defense. I’m not going to focus on why maturing from an alert-driven workflow to hunting and deriving your own Intelligence is a “Good Thing”™; instead we’ll take a look at one of the many ways to create actionable Intelligence. Specifically, we’ll detect and reconstruct a potential adversary’s reconnaissance, and from it infer their software, location, motivation, and even capabilities. This information is useful either before a breach (as a Minority Report-style precognition of an incident) or after one: to figure out target selection, to determine the initial document selection for a phish, to spot “passive” vulnerability detection/selection, or to measure information leakage in general.

Everybody is aware that visiting a website leaves pieces of information behind, and while it is possible to obfuscate this information, it’s not commonly done. Because reconnaissance is rarely camouflaged, we can use this to our advantage and put together a beginning picture of a potential adversary (or of all information consumers visiting a public website). In this post we’ll take a look at what we can learn about visitors based solely on the HTTP Referer header from a browser request. All of this information is gathered passively (web server logs, Bro, NetWitness, etc.); if you can actively interrogate a web browser, you can learn much more about the uniqueness of the person behind the software.

Let’s start by taking a look at a web request. We’ve done other profiling on request structure, covering some of the ways to determine whether a user is being deceitful about the software they’re using. Looking at this request:

Sample web request.

We can see a Referer header. This header is generally present when a user (or software) “clicks” on a hyperlink, directing the browser to a new URI, and it notifies the server where the client came from. Some software will scrape search engine results and populate this header, but most of the time it’s legitimate browser traffic. This is the first point of interest: it can show some level of interactivity by a user, which may be more interesting than software farming information (depending on what you currently care about). Doing link analysis on Referer values to see who is directing traffic toward your organization is valuable, but I’m going to stay focused on search-engine-driven traffic.
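
As a minimal sketch of that first bit of triage, here’s some Python that classifies a Referer value as search-engine traffic. The domain list is illustrative, not exhaustive; extend it with whatever engines you care about.

from urllib.parse import urlparse

# Illustrative starter list of search engine hosts; extend to taste.
SEARCH_ENGINE_DOMAINS = ("google.", "bing.com", "yahoo.", "yandex.", "baidu.com")

def is_search_referer(referer):
    # True when the Referer host matches a known search engine.
    host = urlparse(referer).netloc.lower()
    return any(marker in host for marker in SEARCH_ENGINE_DOMAINS)

print(is_search_referer("http://www.google.com.hk/search?q=marketing+reports"))  # True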

Search engine choice can be used to infer a bit of information about the user behind the software. What region is the search engine located in? Region-specific versions of search engines can help indicate the user’s native language. Did they choose Google HK over Google US? Maybe they’re a native Chinese speaker. Did the request come in with a Referer value that contains “yandex.ru”? Perhaps the user is a native Russian speaker. While I’m talking about language, did you know search engines leak some language information? Crazy? Yes. Awesome? You bet! For example, Google is nice enough to populate the hl value in some of their queries. This denotes the language of the user interface (UI), and it’s used to “promote search results in the same language as the user’s input language”. So, if we see a request from google.com.hk with an hl value of “zh-CN”, guess what: here are two data points that indicate a native Chinese speaker. But there’s more! As an aside, the c2coff parameter is used to enable or disable searching in Simplified and Traditional Chinese characters. There’s also the lr parameter, used to restrict results to documents written in a specific language (it analyzes the TLD, language meta tags, and the primary and secondary languages used in the document), and cr, which restricts results to documents from a specific country (based on the TLD and the GeoIP location of the server’s IP). What’s even more amazing than all of this: Google isn’t the only search engine to do this. Many other search engines (Bing, Yahoo, etc.) also use parameters to help with language selection, and they often use the same ones!

The following example will help clarify and illustrate these points on language.

http://www.google.com.hk/search?as_q=marketing+reports&hl=zh-CN&lr=en&newwindow=1&num=10&btnG=Google+%E6%90%9C

It’s a search on google.com.hk for “marketing reports”, from something with a UI language of zh-CN, looking only for results in English. That’s a lot of information for one line of text! This can be paired with the Accept-Language value in the request to gauge the inferences above. If it’s also set to zh-CN (or gives it a high preference), that’s another data point to help confirm that a native Chinese speaker is behind the keyboard.
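
Pulling those values out of a Referer is straightforward. Here’s a small Python sketch (the function name is my own) that extracts the language-related parameters discussed above:

from urllib.parse import urlparse, parse_qs

def language_hints(referer):
    # Collect the language-related parameters (hl, lr, cr) when present.
    params = parse_qs(urlparse(referer).query)
    return {k: params[k][0] for k in ("hl", "lr", "cr") if k in params}

ref = ("http://www.google.com.hk/search?as_q=marketing+reports"
       "&hl=zh-CN&lr=en&newwindow=1&num=10")
print(language_hints(ref))  # {'hl': 'zh-CN', 'lr': 'en'}

Comparing the result against the request’s Accept-Language header is then a one-line check.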

Let’s switch gears and look at all the interesting stuff you can learn from the query itself. There’s the obvious keyword matching we can do (maybe you’re really interested in whether anybody is searching for information on your executive team). But what else is there? There are search-engine-specific operators, including but not limited to: intitle, inurl, filetype, site, allinurl, intext, and the list goes on. Many other search engines implement the same or similar operators. Combinations of operators may tell a very interesting (and potentially targeted) story. Perhaps you run across a query that looks like:

site:organization.com filetype:xls email organization.com partner1.com partner2.com

It looks like it’s searching organization.com (and associated subdomains and hosts) for Excel files that (likely) contain email addresses from organization.com as well as two partner companies (partner1.com and partner2.com). What about:

intitle:organization.com +”project x” roadmap

That one could be another way of searching pages on organization.com (based on site-specific information in the HTML title section), looking for information about the roadmap of Project X. You can begin to see how patterns emerge of people looking for interesting and useful information across your external web presence.
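
To surface those stories at scale, tokenize the query out of the Referer and look for operators. A rough Python sketch (the operator list is only the subset mentioned above):

import re
from urllib.parse import urlparse, parse_qs

OP_RE = re.compile(r"\b(site|filetype|intitle|inurl|allinurl|intext):(\S+)")

def extract_operators(referer):
    # Return (operator, argument) pairs found in the search query.
    params = parse_qs(urlparse(referer).query)
    query = (params.get("q") or params.get("as_q") or [""])[0]
    return OP_RE.findall(query)

ref = "http://www.google.com/search?q=site:organization.com+filetype:xls+email"
print(extract_operators(ref))  # [('site', 'organization.com'), ('filetype', 'xls')]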

Some other search parameters (Google-specific) of interest are:

  • output – Specifies the XML construction of the search results. This could point to programmatic access of search results.
  • num – The number of results to return. The default is 10 and the max is 20; any change here might be interesting.
  • start – Specifies the index of the first result on the current page. So num=20 and start=180 means the first result returned is the 181st, i.e., you’re looking at the 10th page of 20 results. High numbers here are very common in scraping traffic. As an aside, Google returns no more than 1,000 results, which means if start + num > 1000 you’ve got a defective scraper.
  • cx – Denotes that a custom search engine (within Google) was used.
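
Turning those parameters into detection logic is mostly threshold checks. Here’s a hedged Python sketch; the specific thresholds are my own guesses, not established cut-offs.

from urllib.parse import urlparse, parse_qs

def scraper_signals(referer):
    # Flag parameter combinations that suggest programmatic scraping.
    params = {k: v[0] for k, v in parse_qs(urlparse(referer).query).items()}
    signals = []
    if "output" in params:
        signals.append("XML output requested (programmatic access?)")
    if "cx" in params:
        signals.append("custom search engine used")
    num = int(params.get("num", 10))
    start = int(params.get("start", 0))
    if num != 10:
        signals.append("non-default result count: num=%d" % num)
    if start >= 100:  # assumed threshold for "deep" paging
        signals.append("deep paging: start=%d" % start)
    if start + num > 1000:
        signals.append("past Google's 1,000-result cap (defective scraper)")
    return signals

print(scraper_signals("http://www.google.com/search?q=test&num=20&start=990&output=xml"))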

In addition, any of the keyword parameters listed under https://developers.google.com/custom-search/docs/xml_results#Advanced_Search_Query_Parameters can be used in place of the inline ‘-’ (minus, does not contain), ‘OR’, and ‘+’ (must contain) operators.

An excellent mapping is provided by Google: https://developers.google.com/custom-search/docs/xml_results#wsAdvancedSearch

The next hurdle with this type of information is operationalizing it effectively. I’ve uploaded two different ways to make effective use of it in an operational setting. The first is a Bro script. It analyzes the Referer header, and if the request is inbound to your organization (the variables must be set in your Bro config) and a search engine (the list is expandable) appears in the Referer, a log entry is created. The log entry stores the Referer, the search engine found, and any search operators used. The script will also look for user-supplied keywords of interest and flag those queries. The other script is a NetWitness parser that creates an entry in risk.info when it sees traffic matching the above criteria, in addition to an alert (for easy reporting). You can edit the parser to add context, flagging on keywords and search operators; it adds attributes in risk.warning and risk.suspicious based on what is present in the Referer. However, it’ll be up to you to create an Informer report to pull in the GeoIP information and whether the traffic was inbound to your organization. Both outputs make for a relatively digestible and scalable daily report.
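
If you’re not running Bro or NetWitness, the same logic is easy to approximate over plain web server logs. Here’s a minimal Python sketch that ties the earlier pieces together; the keyword watchlist is hypothetical and should be replaced with terms your organization cares about.

import re
from urllib.parse import urlparse, parse_qs

KEYWORDS = ("project x", "roadmap", "executive")  # hypothetical watchlist
ENGINES = ("google.", "bing.com", "yahoo.", "yandex.", "baidu.com")
OP_RE = re.compile(r"\b(?:site|filetype|intitle|inurl|allinurl|intext):\S+")

def triage(referer):
    # One pass over a single Referer: engine, query, operators, keyword hits.
    host = urlparse(referer).netloc.lower()
    if not any(e in host for e in ENGINES):
        return None  # not search-engine traffic
    params = parse_qs(urlparse(referer).query)
    query = (params.get("q") or params.get("as_q") or [""])[0].lower()
    return {
        "engine": host,
        "query": query,
        "operators": OP_RE.findall(query),
        "keyword_hits": [k for k in KEYWORDS if k in query],
    }

Feed every Referer from a day’s logs through triage() and keep the non-None results, and you end up with roughly the same digestible daily report.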

As with anything, all feedback is welcome.