Title: ICTR Cloud Efforts

Release Date: 2015-09-25

Document Date: 2011-07-01

Description: This GCHQ presentation from July 2011 discusses various technical aspects of “population-scale” datamining: see the Intercept article Profiled: From Radio to Porn, British Spies Track Web Users’ Online Identities, 25 September 2015.

Document: cloud-developers-exchange-july-2011-p1-normal.gif:
“ICTR Cloud Efforts”

developing “canonical” SIGINT
analytics, finding hard targets and
exploratory data analysis at scale

Data Mining Research - ICTR, GCHQ
Dr

cloud-developers-exchange-july-2011-p2-normal.gif:
Building a SIGINT toolbox for BIG DATA

• Cloud analytics for SIGINT canonical operations

- Aggregation - building Geo-Time profiles for Internet Presence

- All pairs association - alternate identifiers and Geo Associates

- Componentisation - identify interesting small or large groups

• Target discovery at population scale

- target discovery - discover unknown targets

- known target communications behaviour - modus operandi (MO)

- population scale bulk unselected events - all events for country or world

• Exploratory Data Analysis of Internet / Cyber Events

ICT

RESEARCH

This information is exempt from disclosure under the Freedom of Information Act 2000 and may be subject to exemption under other UK information legislation.

Refer disclosure requests to GCHQ on

UK TOP SECRET STRAP1 5EYES

NEXT GENERATION events

cloud-developers-exchange-july-2011-p3-normal.gif:
GCHQ Cloud Analytic Development

LET'S IMPLEMENT
CLOUD COMPUTING SO
I HAVE SOMETHING TO
TALK ABOUT AT THE
EXECUTIVE MEETING.

TELL THEM WE'RE
EVALUATING IT. THAT
WAY NEITHER OF US
NEEDS TO DO ANY
REAL WORK.

www.dilbert.com/strips/comic/2009-11-18/

In last few years Data Mining Research at GCHQ have:

- developed new population scale analytics for multi-petabyte cluster

- evaluated cloud for data marting, bulk association, graph analytics

- delivered operational benefit - population scale target discovery

cloud-developers-exchange-july-2011-p4-normal.gif:
Geo-Summaries for all Internet presence

• Building Geo-Time profiles for every Internet identifier we see

• Discovering targets using Modus Operandi

• Summarisation of “Geo Pattern of Life” for every Internet identifier

- Summarises how often each identifier seen in every country per week

- Massively reduces data volumes (trillions of events to billions of profiles)

Email=
Seen in: PK 17 times, UK 2 times

Week commencing   Seen
29/06/2009        UK,1
06/07/2009        UK,1
12/10/2009        PK,9
19/10/2009        PK,8

Perfect for MapReduce
IP-Geo for all Internet presence
Note scale of resulting profiles.
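
The weekly summarisation on this slide can be sketched as a map step that keys each presence event by (identifier, week commencing, country) and a reduce step that sums the counts. This is a minimal single-process sketch, not the implementation the slides refer to; the event tuple layout, the function names `map_event`, `reduce_counts` and `build_profiles`, and the toy identifier are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def week_commencing(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())

def map_event(event):
    """Map step: emit ((identifier, week, country), 1) for one presence event."""
    identifier, day, country = event
    return (identifier, week_commencing(day), country), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def build_profiles(events):
    """Group the reduced counts into one weekly profile per identifier."""
    profiles = defaultdict(dict)
    reduced = reduce_counts(map(map_event, events))
    for (identifier, week, country), count in reduced.items():
        profiles[identifier][(week, country)] = count
    return dict(profiles)

# Toy events (hypothetical): (identifier, date seen, country code).
events = [
    ("id-1", date(2009, 6, 30), "UK"),
    ("id-1", date(2009, 10, 13), "PK"),
    ("id-1", date(2009, 10, 14), "PK"),
]
profiles = build_profiles(events)
```

Because the key is small and the reduce step is a plain sum, the same two functions could in principle be handed to any MapReduce-style framework; the in-memory `Counter` here only stands in for the shuffle and reduce phases.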

cloud-developers-exchange-july-2011-p5-normal.gif:
Geo-profiling over all presence events

• Perfect for MapReduce

- counting the number of occurrences in a large collection of records

- “MapReduce: A Flexible Data Processing Tool” Dean and Ghemawat
Comms of the ACM, January 2010, 53(1), pages 72-77

• The Geo-Time summaries for all target identifiers can be used to
answer a number of questions:

- Where has this target identifier been?

- Which target identifiers match the following country travel pattern?

- Do anomalous Geo sightings indicate coordinated activity?

• When combined with domain knowledge, can be extremely powerful if
aggregated over all the data
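
One way to read the travel-pattern question above is as an ordered-subsequence match over each identifier's week-by-week country sequence. The sketch below assumes a profile is a dictionary keyed by (ISO week-commencing string, country code); that layout, the function names and the toy identifiers are hypothetical and not taken from the slides.

```python
def countries_in_order(profile):
    """Return the countries in week order, collapsing consecutive repeats."""
    sequence = []
    for _week, country in sorted(profile):
        if not sequence or sequence[-1] != country:
            sequence.append(country)
    return sequence

def matches_pattern(profile, pattern):
    """True if `pattern` occurs as an ordered subsequence of the country sequence."""
    remaining = iter(countries_in_order(profile))
    # `country in remaining` consumes the iterator up to the first match,
    # so later pattern entries must be found after earlier ones.
    return all(country in remaining for country in pattern)

def find_matches(profiles, pattern):
    """Return the identifiers whose profile matches the travel pattern."""
    return sorted(i for i, p in profiles.items() if matches_pattern(p, pattern))

# Toy profiles (hypothetical): {(week commencing, country): times seen}.
profiles = {
    "id-1": {("2009-06-29", "UK"): 1, ("2009-07-06", "UK"): 1, ("2009-10-12", "PK"): 9},
    "id-2": {("2009-06-29", "PK"): 3, ("2009-07-06", "UK"): 2},
}
matches = find_matches(profiles, ["UK", "PK"])
```

Here only `id-1` is seen in UK and then later in PK; `id-2` visits the same countries in the opposite order and is not returned.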

cloud-developers-exchange-july-2011-p6-normal.gif:
EVERY ASSOC & BotGraph:
bulk pairwise associations
and graph componentization

cloud-developers-exchange-july-2011-p7-normal.gif:
Large-scale community detection toolbox

• All pairwise correlation/association - build your graph

- EVERY ASSOC for TDI alternate identifier scoring

- BotGraph for webmail spam - Zhao et al Botgraph [NSDI 09]

- PROBABILITY CLOUD for handset Geo-Association scoring

• Graph Componentisation

- GCHQ MapReduce or Bagel implementation

- Open source MapReduce implementations (CMU Pegasus)

• Analysis pattern to identify sub-sets for deeper analysis

- Simple approach to make sense of huge datasets

- Detect communities of potential interest from massive datasets

- Rarely sufficient but essential first step in data volume reduction
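
Graph componentisation can be illustrated with iterative minimum-label propagation, a common way to express connected components as repeated passes over the edges. This is a small in-memory sketch under that assumption; it is not the GCHQ, Bagel or Pegasus code, and the function name and toy edge list are hypothetical.

```python
from collections import defaultdict

def connected_components(edges):
    """Label every node with the smallest node id in its component."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label = {node: node for node in neighbours}
    changed = True
    while changed:
        # Each pass plays the role of one propagation round; labels are
        # updated in place here rather than in synchronous batches.
        changed = False
        for node in neighbours:
            best = min([label[node]] + [label[n] for n in neighbours[node]])
            if best < label[node]:
                label[node] = best
                changed = True
    components = defaultdict(set)
    for node, root in label.items():
        components[root].add(node)
    # Largest component first, so small groups of interest sit at the tail.
    return sorted(components.values(), key=len, reverse=True)

# Toy undirected edge list (hypothetical).
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (5, 6), (7, 8)]
components = connected_components(edges)
```

Sorting by size makes the data-reduction step explicit: the largest component can be set aside and the small components passed on for deeper analysis, or the other way round.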

cloud-developers-exchange-july-2011-p8-normal.gif:
Large networks are dominated by Giant
Connected Component: this can help you

Leskovec, Lang, Dasgupta and
Mahoney Community Structure in large
networks: Natural cluster sizes and the
absence of large well-defined clusters
arXiv:0810.1355 (2008)

• Loosely connected
periphery

• Relatively small number
of disconnected small
components

• Giant connected
component dominates
large networks

cloud-developers-exchange-july-2011-p9-normal.gif:
Target Discovery at Population Scale

• We are describing a target discovery technique based on
known target communications behaviour applied to
population scale bulk unselected events

• target discovery - discover unknown targets

• known target communications behaviour - modus
operandi (MO)

• population scale - all the events we have for a country

• unselected events - not seeded on targets

cloud-developers-exchange-july-2011-p10-normal.gif:
Caveat Emptor

• Method has shown promise to discover phone groups of interest
undiscoverable by traditional analysis.

• “Find adversaries through their behaviour”

• Initial identification of candidates is pure target discovery

- not seeded on targets

- search for behaviour in massive events

• BUT it can only be used to effect if it is tied in with analyst
knowledge of other patterns of behaviour, possibly geo-related.

cloud-developers-exchange-july-2011-p11-normal.gif:
Critical Success Factors

• Technical expertise in data mining (ICTR)

• Good understanding of target MO and ability to follow up
new leads which are generated (Ops CT Analysts)

• Supporting IT infrastructure (SILVER LINING)

• Bulk access to relevant data sets (SILVER LINING)

- ICTR lacks bulk access to CULT WEAVE - had snapshot in 2007

- There were promising research lines: see SD conference 2007

cloud-developers-exchange-july-2011-p12-normal.gif:
Operational Data Mining - Key message

• A combination of technical data mining experts, SIGINT
developers, Operations analysts, appropriate data access
and suitable IT is needed to make target discovery happen

• In our experience to date, it’s not about tool development
but the development of new (and fragile) data mining
techniques by a critical mass of suitably skilled people!

• There are a set of cloud analytics that should form part of
a toolbox but even then their successful application is
likely to be as a result of collaboration with analysts



cloud-developers-exchange-july-2011-p13-normal.gif:
KARMA POLICE - correlation between
websites and internet IDs

Internet ID - IP - Web address:
correlation scored on statistics of IP

- KARMA POLICE QFD from ICTR

- EVERY POLICE QFD on cloud

Internet ID-website correlations form
a weighted bi-partite graph

- Links are weighted by KARMA
POLICE correlation scores

- Example graph showing correlations
between Internet IDs and websites

cloud-developers-exchange-july-2011-p14-normal.gif:
AWKWARD TURTLE - Cloud QFD

• What is a recommender system?

- Netflix - subscribers who like film X also like film Y

- Amazon - customers who like book X also like book Y

- GCHQ - Terrorists who like website X also like website Y

• MapReduce - vector of TDI scores for every website

- Vector dot product - “cosine similarity” measure

- Maximum degree TDI cut-off

- Target activity is being used as similarity measure

• Website-website correlations - found previously unknown
file hosting
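
The vector dot product and degree cut-off named above can be sketched as follows. Each website is treated as a vector of per-identifier scores, identifiers linked to more than a chosen number of websites are dropped, and pairs of websites are compared by cosine similarity. How the scores are produced is not described on the slide, so the input dictionary, the `max_degree` parameter and the function name are assumptions for illustration only.

```python
import math
from collections import defaultdict
from itertools import combinations

def website_similarities(scores, max_degree):
    """scores: {(identifier, website): score}. Returns {(site_a, site_b): cosine}."""
    by_identifier = defaultdict(dict)
    for (identifier, site), score in scores.items():
        by_identifier[identifier][site] = score
    # Degree cut-off: ignore identifiers linked to more than `max_degree` sites.
    kept = [sites for sites in by_identifier.values() if len(sites) <= max_degree]
    norms = defaultdict(float)   # squared vector length per website
    dots = defaultdict(float)    # dot product per website pair
    for sites in kept:
        for site, score in sites.items():
            norms[site] += score * score
        for a, b in combinations(sorted(sites), 2):
            dots[(a, b)] += sites[a] * sites[b]
    return {
        (a, b): dot / math.sqrt(norms[a] * norms[b])
        for (a, b), dot in dots.items()
        if norms[a] > 0 and norms[b] > 0
    }

# Toy scores (hypothetical): u3 touches three sites and is cut at max_degree=2.
scores = {
    ("u1", "x"): 1.0, ("u1", "y"): 1.0,
    ("u2", "x"): 1.0,
    ("u3", "x"): 1.0, ("u3", "y"): 1.0, ("u3", "z"): 1.0,
}
similarities = website_similarities(scores, max_degree=2)
```

The cut-off also bounds the work per identifier: pair generation is quadratic in the number of sites an identifier touches, so removing very high-degree identifiers keeps the pairwise step tractable.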

ICT

RESEARCH

UK TOP SECRET STRAP1 SEYES^j

NEXT GENERATION

nrj eventscloud-developers-exchange-july-2011-p15-normal.gif:
Recommender Systems

• We have currently only used very simple techniques

• Body of active research

- Netflix prize stimulated ☺

• Interested in seeing more statistical inference and large-scale
modelling

- Potential for long term research

• Behavioural targeting

- Cf Google and Yahoo ad serving to subscriber profile

cloud-developers-exchange-july-2011-p16-normal.gif:
Query term graph

• Given a search term, which other search terms are
related?

• Build Query term graph (MapReduce):

- Nodes are queries

- Directed edges between nodes if a machine searches for one term
then the other within 5 minutes

- Edge weighted according to frequency of search pattern

• Boldi, Bonchi, Castillo, Donato, Gionis and Vigna The
query-flow graph: Model and applications CIKM 08

• Gionis Efficient Tools for Mining Large Graphs MLG 10
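
The graph construction described on this slide can be sketched as a group-by-machine step followed by a scan over each machine's time-ordered searches. The sketch assumes events arrive as (machine, timestamp in seconds, query) tuples and that every ordered pair of distinct queries inside the 5-minute window counts, not only consecutive ones; both are assumptions, as are the function name and the toy data.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # the 5-minute window stated on the slide

def build_query_graph(events):
    """Return a Counter {(earlier_query, later_query): edge weight}."""
    by_machine = defaultdict(list)
    for machine, timestamp, query in events:
        by_machine[machine].append((timestamp, query))
    edges = Counter()
    for searches in by_machine.values():
        searches.sort()  # time order within one machine
        for i, (t1, q1) in enumerate(searches):
            for t2, q2 in searches[i + 1:]:
                if t2 - t1 > WINDOW_SECONDS:
                    break  # later searches are even further away
                if q1 != q2:
                    edges[(q1, q2)] += 1
    return edges

# Toy search events (hypothetical).
events = [
    ("m1", 0, "antique tables"),
    ("m1", 120, "card tables"),
    ("m1", 1000, "pizza"),
    ("m2", 50, "antique tables"),
    ("m2", 200, "card tables"),
]
graph = build_query_graph(events)
```

The edge weight is simply how many times the ordered pair was observed, which matches the slide's "frequency of search pattern"; any normalisation of those weights would be a further design choice.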

cloud-developers-exchange-july-2011-p17-normal.gif:
Ranking in Query Term graph - PageRank

• Small component from
full query term graph

All terms to do with
different types of
antiques

Red nodes are top 5
PageRank scores

cloud-developers-exchange-july-2011-p18-normal.gif:
Personalised PageRank (PPR)

• Red node is seed
node - Victorian Card
Tables

Yellow nodes are top 5
Personalised
PageRank scores

Nodes with high PR
score also score highly

cloud-developers-exchange-july-2011-p19-normal.gif:
Normalised PageRank

► Want to find nodes with high Personalised PageRank score, q
compared to its PageRank score, p

► p and q are both (stationary) probability distributions on the
same set so KL-divergence comes to mind

$$\mathrm{KL}(q \,\|\, p) = \sum_i q_i \log \frac{q_i}{p_i}$$

► We can rank the nodes based on their contribution to this
sum, $q_i \log \frac{q_i}{p_i}$

► This is the Normalised PageRank score
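
The score defined on this slide can be sketched with a plain power-iteration PageRank run twice: once with a uniform restart distribution to obtain p, and once restarting only at the seed node to obtain q. The damping factor of 0.85, the iteration count, the dangling-node handling and the toy graph are assumptions; the slides do not state which settings were used.

```python
import math

def pagerank(out_links, teleport, damping=0.85, iterations=100):
    """Power iteration; `teleport` is the restart distribution over all nodes."""
    nodes = list(out_links)
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: hand its mass back via the restart distribution.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
        rank = new_rank
    return rank

def normalised_pagerank(p, q):
    """Per-node contribution q_i * log(q_i / p_i) to KL(q || p)."""
    return {n: q[n] * math.log(q[n] / p[n]) if q[n] > 0 else 0.0 for n in p}

# Toy directed graph (hypothetical): "hub" is the high-degree node.
out_links = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": ["c"],
}
uniform = {n: 1.0 / len(out_links) for n in out_links}
seed_only = {n: 1.0 if n == "d" else 0.0 for n in out_links}

p = pagerank(out_links, uniform)    # PageRank
q = pagerank(out_links, seed_only)  # Personalised PageRank, seed "d"
npr = normalised_pagerank(p, q)
```

In this toy graph the hub has the largest PageRank, yet its contribution to the divergence is negative because its personalised score is lower than its global score, so it drops out of the top of the normalised ranking.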

cloud-developers-exchange-july-2011-p20-normal.gif:
Normalised PageRank score

• Red node same seed

Yellow nodes are top 5
Normalised PageRank
scores

Nodes with very high
PageRank scores no
longer dominate

cloud-developers-exchange-july-2011-p21-normal.gif:
Comments on Normalised PageRank

• Could go N-hops from seed node

- Have to set pizza node degree limit

- N-hop with pizza limit is standard contact chaining method

• Normalised PageRank deals with high degree nodes

- High degree nodes tend to have high PageRank

- Must score very highly on PPR to score well in Normalised PR

• Shown results within small component

• Evaluate Normalised PR for seed term in Giant Connected
Component of Query Term Graph using Bagel
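
For comparison with Normalised PageRank, the N-hop approach with a node degree limit can be sketched as a breadth-first expansion that records high-degree nodes but does not chain through them. Exactly how the limit is applied in practice is not stated on the slide, so that behaviour, the function name and the toy graph are assumptions.

```python
from collections import deque

def n_hop(neighbours, seed, hops, degree_limit):
    """Return {node: hop distance} for nodes reached within `hops` of `seed`."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # hop budget used up
        if node != seed and len(neighbours[node]) > degree_limit:
            continue  # reached, but not chained through
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

# Toy undirected graph (hypothetical): "pizza" has degree 4.
neighbours = {
    "seed": ["a", "pizza"],
    "a": ["seed", "b"],
    "b": ["a"],
    "pizza": ["seed", "x", "y", "z"],
    "x": ["pizza"],
    "y": ["pizza"],
    "z": ["pizza"],
}
reached = n_hop(neighbours, "seed", hops=2, degree_limit=3)
```

The hard cut-off is what Normalised PageRank avoids: here a node one above the limit contributes nothing beyond itself, whereas the normalised score down-weights high-degree nodes gradually.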

cloud-developers-exchange-july-2011-p22-normal.gif:
“GCHQ” seed query term

Rank Query NPR
1 Free People Check 0.791
2= Jobs At Chanel 0.721
2= Peter Wright (Arabic) 0.721
4 GCSE Bitesize Science 0.670
5 MI6 0.652
20 SKS 0.038
22 Foreign & Commonwealth Office 0.034
37 MI5 0.010
47= MI6 James Bond 0.009
47= Mr 0.009
47= MI8 0.009
72 KGB 0.008
110 Wikileaks 0.003

cloud-developers-exchange-july-2011-p23-normal.gif:
Comments on Query Term Graph

• Query term graph is very noisy, as are all our Internet
Events meta-data graphs

• Some promising results in finding similar queries but
essential that results are interpreted by analysts

• Large amount of research to do

- Clustering / Sessionising /... [lots of commercially motivated work]

- Query chains - Banana -> Apple different intent to iPod -> Apple

- Understanding the search behaviour of targets

• Normalised PageRank insights may be generally useful

cloud-developers-exchange-july-2011-p24-normal.gif:
Exploratory Data Analysis of Large-Scale
Internet Events - gap in understanding

• Relevance to Cyber and SIGINT - what is normal in the
statistics of internet behaviour at large scale?

- Can we measure or model the salient features of large-scale
Internet communications meta-data?

- Can we identify behaviours associated with target activity (be that
human, machine or collective BotNet activity) that are detectable?

• GORDIAN KNOT (Network Defence) vs SIGINT feeds

- Understand the potential of GORDIAN KNOT for Cyber EDA

- What’s the gap between GORDIAN KNOT and SIGINT data?

cloud-developers-exchange-july-2011-p25-normal.gif:
Internet/Cyber EDA - FY 11/12

• Fingerprint web browsing sessions

- Can we ID a user based on their browsing habits?

• Is the Internet Regional?

- Hypothesis: “Internet is becoming more regionalised. Any
machines communicating over long distances are of greater
interest”

- Does the data support this?

- Can we characterise the activity and significance of long distance
communications?

- Applications to Cyber, but also potentially to other Intelligence
questions

cloud-developers-exchange-july-2011-p26-normal.gif:
Internet/Cyber EDA - FY 11/12

• Attempt to identify malicious sites in the HTTP graph

- “BadRank” - given set of known “bad” web sites, can we identify
associated sites that either point in same direction, or are reached
from initial sites

- Identify loosely connected components - bits that aren’t closely
tied in by association with Google et al.

- Subgraph detection - if we have an approximate idea of how a user
reaches a malicious web site, can we identify this pattern and
similar others in the HTTP graph? [SANDIA work]

cloud-developers-exchange-july-2011-p27-normal.gif:
Internet/Cyber EDA - FY 11/12

• FIVE ALIVE - carry out EDA on the netflow dataset
created by TR-FSP

- FIVE ALIVE is a bulk store of IP flow records, coupled with some
very simple analytics that summarize and visualize IP activity

- The main challenge here is to deal with the size of the dataset;
current work in TR-FSP has revolved around looking at subsets of
the data but it would be interesting to work on the dataset as a
whole

cloud-developers-exchange-july-2011-p28-normal.gif:
[Figure: log10 number of IPs with given in-degree]

BLOOD HOUND - ICTR-NE

• Detect electronic attack - aim to detect
distributed and automated behaviour

• Idea from IDA/CCS SCAMP 2009

- 'Using degree distributions to detect
internet traffic anomalies' Scheinerman

• Detect multiple IPs with same degree:

- in-degree (distributed hacking/port
scanning)

- out-degree (DDOS/bot tasking)

• Graph: peak at in-degree ~ 10^1.8 = 63

- Appears to be some sort of hacking
activity

- Dictionary attack: cycling through range
of IPs on network, making 63 GET
requests to each

- Trying 63 combinations of URL, with the
intent of getting a MySQL setup script
(basic exhaust)
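
The idea of spotting many IPs that share the same degree can be sketched with a histogram of in-degrees and a naive spike test. Here in-degree is taken to mean the number of inbound requests per destination IP, and a spike is any degree whose IP count is well above that of its neighbouring degrees; both definitions, the `factor` threshold and the toy traffic are assumptions and stand in for whatever method the cited Scheinerman work actually uses.

```python
import math
from collections import Counter

def in_degree_histogram(requests):
    """requests: iterable of (source_ip, destination_ip).
    Returns {in_degree: number of destination IPs with that in-degree}."""
    in_degree = Counter(dst for _src, dst in requests)
    return Counter(in_degree.values())

def spikes(histogram, factor=5.0):
    """Degrees whose IP count exceeds `factor` times the neighbouring average."""
    flagged = []
    for degree, count in sorted(histogram.items()):
        around = (histogram.get(degree - 1, 0) + histogram.get(degree + 1, 0)) / 2
        if count > factor * max(around, 1.0):
            flagged.append(degree)
    return flagged

# Toy traffic (hypothetical): one scanner sends 63 requests to each of 20 hosts,
# plus background servers with in-degrees 1 to 10.
requests = []
for host in range(20):
    requests += [("scanner", f"victim-{host}")] * 63
for host in range(10):
    requests += [(f"client-{k}", f"server-{host}") for k in range(host + 1)]

histogram = in_degree_histogram(requests)
flagged = spikes(histogram)
log10_peak = math.log10(flagged[0])  # position of the peak on a log10 axis
```

On a log10 axis the flagged degree sits at about 1.8, which is consistent with the slide's reading of the peak, since 10^1.8 rounds to 63.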

cloud-developers-exchange-july-2011-p29-normal.gif:
Summary

• Pattern-based data mining - unknown target discovery

- Bulk unselected events - population scale - all events for country

- Operational data mining - hard target discovery - real results

- Target modus operandi - behavioural based discovery

• Selector-based data mining - unprecedented scale

- Relationship scoring within multi-modal communications network

• Exploratory Data Analysis of Large-Scale Internet Events

- Gap in understanding of events at Internet Scale

- How can BIG DATA analytics contribute to Cyber target discovery?
