Title: GCHQ Analytic Cloud Challenges

Description: This GCHQ presentation dated 14 May 2012 describes the agency receiving “more than 50 billion events [metadata records] per day” and some of the tools available for analysing that mass of data: see the Intercept article Profiled: From Radio to Porn, British Spies Track Web Users’ Online Identities, 25 September 2015.

Document: gchq-analytic-cloud-challenges-p1-normal.gif:
GCHQ Analytic Cloud Challenges

Innovation Lead for Data, Analytics &
Visualisation Engineering

This information is exempt from Disclosure under the Freedom of Information Act 2000 and may be subject
to exemption under other UK information legislation. Refer disclosure requests to GCHQ on I

14/5/2012

UK TOP SECRET STRAP 1

gchq-analytic-cloud-challenges-p2-normal.gif:
Approximately 50 Billion Events Per
Day from 250+ x10G bearers

• Scale Target for May 2012 is
approximately 320 x 10G with
a capacity for 100 BEPD

Volume by site

Typical retention is six
months, sometimes reduced
to cope with volume increase

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

2

gchq-analytic-cloud-challenges-p3-normal.gif:

TDB Events PC OV-1 • Events High Level Concept Graphic

Purpose

A high level view on
GCHQ event (unselected
meta-data) capabilities
and uses

gchq-analytic-cloud-challenges-p4-normal.gif:
Query Focused Dataset

What are QFDs

• A database designed for answering a
single (or few) analytic question

• Initially adopted from
research, engineering tasked to
mature and scale ASAP

• The engineering has been
incremental based on priority

• Library of components and design
patterns developed

• Scale using appropriately sized
database instances, federation
using service tiers

• Deployment of new instances and
updates streamlined

• Corporate support

• New QFDs still being produced by
engineering and others using the
standard components

• Some now provide interactive
access to datasets generated by
HADOOP batch analytics

QFD Facts and Figures

• Approximately 100 QFD instances
deployed for 16 different QFDs +
more for UI etc

• Each with events (or results)
appropriate for the questions they
answer

• Hardware shared depending on
QFD, all storage is shared

• Driven by the need for a flexible
platform at large scale with low
overall cost

• Redhat Enterprise Linux, Oracle DB
(DB needs are simple)

• Latest generation HP BL456c G7
blades, 24 core, 64GB RAM, EVA
storage ~200TB usable

• Most QFD instances have 70TB
usable capacity

• Total storage 15PB raw, 11PB usable

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

4

gchq-analytic-cloud-challenges-p5-normal.gif:
A selection of the major QFDs

Name: AUTOASSOC
Description: Bulk unselected TDI-TDI correlations with confidence scores.
Questions Answered: What other TDIs belong to your target? What technologies your target is using?
Physical Architecture: 2+1 instances, each 50/70TB storage

Name: Evolved Mutant Broth
Description: Identify when certain TDIs appear in traffic which indicate target usage and their location. Telephony and C2C data provide a converged view.
Questions Answered: Where has my target been? What kind of communications devices has my target been using?
Physical Architecture: 10+5 instances, each 70TB storage

Name: Hard Assoc
Description: Provide strongly correlated selectors for both C2C and Telephony traffic taken from TDIs appearing in the same packet.
Questions Answered: Are there any alternative C2C or Telephony selectors for my target?
Physical Architecture: 3+2 instances, each 70TB storage

Name: HRMap
Description: Host-referrer relationships: information about how people get to websites, including links followed and direct accesses.
Questions Answered: How do people get to my website of interest and where do they go to next? What websites have been visited from a given IP?
Physical Architecture: 5+3 instances, each 70TB storage

Name: KARMA POLICE
Description: Which TDIs have been seen at approximately the same time, and from the same computer, as visits to websites.
Questions Answered: Which websites your target visits, and when/where those visits occurred. Who visits suspicious websites, and when/where those visits occurred. Which other websites are visited by people who visit a suspicious website. Which IP address and web browser were being used by your target when they visited a website.
Physical Architecture: 11+7 instances, each 70TB storage, 3+1 correlator instances

Name: SOCIAL ANTHROPOID
Description: Converged comms events allowing you to see who your targets have communicated with via phone, over the internet, or using converged channels (e.g. sending emails from a phone or making voice calls over the internet).
Questions Answered: What communications your target is engaged in. Who has your target been communicating with. What communications have occurred using a particular locator (IP address, cell tower, etc).
Physical Architecture: [illegible]+3 instances, each 70TB storage

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

5

gchq-analytic-cloud-challenges-p6-normal.gif:
HADOOP Clusters

Facts & Figures

• Three batch analytic clusters
(HADOOP clouds)

• Each contains unique, unselected
events, reference and results data

• Redhat Enterprise Linux, Apache
HADOOP (Map Reduce)

• Each with ~900 data nodes, plus
UI/ingest/egress servers

• Each node 8 core, 64GB
RAM, 6x1TB

• Each 6PB raw storage, 1.5PB for
events (3x replication and results)

• The three clouds could, in
theory, store 24 Trillion events @140
BEPD and 6 months retention period
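
As a rough cross-check of the figures on this slide, the sketch below recomputes the raw storage per cluster, the retention implied by 24 trillion events at 140 BEPD, and the average event size that capacity would imply. It assumes decimal units (1 PB = 10^15 bytes) and reads "1.5PB for events" as the per-cluster event capacity after replication; neither assumption is stated in the source, and the function names are illustrative only.

```python
# Back-of-envelope check of the slide's figures.
# Assumptions (not from the source): decimal units, and "1.5PB for events"
# meaning usable event capacity per cluster after 3x replication.
TB = 10**12
PB = 10**15

CLUSTERS = 3                         # "Three batch analytic clusters"
NODES_PER_CLUSTER = 900              # "~900 data nodes"
DISKS_PER_NODE = 6                   # "6x1TB"
DISK_BYTES = 1 * TB
EVENT_BYTES_PER_CLUSTER = 1.5 * PB   # "1.5PB for events"
TOTAL_EVENTS = 24 * 10**12           # "24 Trillion events"
EVENTS_PER_DAY = 140 * 10**9         # "140 BEPD"


def raw_bytes_per_cluster() -> int:
    """Raw disk per cluster from node count and disks per node."""
    return NODES_PER_CLUSTER * DISKS_PER_NODE * DISK_BYTES


def retention_days() -> float:
    """Days of events that fit if TOTAL_EVENTS arrive at EVENTS_PER_DAY."""
    return TOTAL_EVENTS / EVENTS_PER_DAY


def implied_bytes_per_event() -> float:
    """Average stored event size implied by the capacity figures."""
    return CLUSTERS * EVENT_BYTES_PER_CLUSTER / TOTAL_EVENTS


if __name__ == "__main__":
    print(raw_bytes_per_cluster() / PB, "PB raw per cluster")
    print(round(retention_days(), 1), "days of retention")
    print(implied_bytes_per_event(), "bytes per event")
```

Under these assumptions the node count gives about 5.4 PB raw per cluster (close to the stated 6PB, given the approximate node count), about 171 days of retention (a little under six months), and an average of roughly 190 bytes per stored event.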

Usage

• 200 capability developer accounts

• Many more users able to access
applications

• Guiding light > 150 users

• First Contact, Cloudy Cobra > 50

• Jobs during working hours

• Operationally ad-hoc
(experimental, search, one time
use)

• Development of new capability

• Scheduled jobs run overnight

• Sustained, operational applications

• Increasing expectation for fully
supported, automated jobs such as
Rumour Mill

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

6

gchq-analytic-cloud-challenges-p7-normal.gif:
Physical Overview

Private Cloud

• Data Nodes

• HADOOP Master Nodes (Job Tracker & Name Node)

Edge Nodes connect to Computer Hall Network

• Ingest Nodes

• Application Nodes (provide user interface - web)

• Innovation Nodes (cap dev login to servers)

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

7

gchq-analytic-cloud-challenges-p8-normal.gif:
Data Ingest

Ingest Design Goals

• Store all data received, i.e.
don't risk losing important
fields by normalising to single
file format.

• Partition by directory structure

• Partition by data type to
simplify common queries.

• Partition by ingest date to
allow incremental analytics.

• Partition by security
compartment.

• Horizontal scaling by adding
new hardware.
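
The partitioning goals above can be illustrated with a small sketch that maps a record's security compartment, data type and ingest date to a directory. The ordering of the levels, the root path and every name here are hypothetical; the slide states only that partitioning is done by directory structure along these three dimensions.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath
from typing import List


def partition_path(root: str, compartment: str, data_type: str,
                   ingest_date: date) -> PurePosixPath:
    """Build <root>/<compartment>/<data_type>/<YYYY-MM-DD> (assumed layout)."""
    return PurePosixPath(root) / compartment / data_type / ingest_date.isoformat()


def incremental_inputs(root: str, compartment: str, data_type: str,
                       start: date, end: date) -> List[PurePosixPath]:
    """List one directory per ingest day in [start, end], inclusive.

    An incremental job only needs the directories for days it has not
    processed yet, which is what date partitioning makes cheap.
    """
    days = (end - start).days
    return [
        partition_path(root, compartment, data_type, start + timedelta(days=i))
        for i in range(days + 1)
    ]


if __name__ == "__main__":
    for p in incremental_inputs("/events", "open", "tlv",
                                date(2012, 3, 25), date(2012, 3, 26)):
        print(p)
```

Putting the compartment at the top level is one possible choice: it would let file and directory permissions gate a whole subtree, while the data type and date levels narrow what a query has to scan.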

Data Formats

• TLV - Tag Length Value

• Comma Delimited

• Fixed Position

• Single Line & Multi-Line

• Multiple Variations

• Actor Action

• Many sources and many types
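
For readers unfamiliar with the first format in this list, a generic Tag Length Value stream can be decoded as sketched below. The field widths (a 1-byte tag followed by a 2-byte big-endian length) are an assumption chosen for illustration; the slide does not describe the actual encoding, and the function name is hypothetical.

```python
import struct
from typing import Iterator, Tuple

# Assumed header layout: 1-byte tag, 2-byte big-endian length (3 bytes total).
HEADER = struct.Struct(">BH")


def iter_tlv(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (tag, value) pairs from a buffer of back-to-back TLV records."""
    offset = 0
    while offset < len(buf):
        if offset + HEADER.size > len(buf):
            raise ValueError("truncated TLV header")
        tag, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        end = offset + length
        if end > len(buf):
            raise ValueError("truncated TLV value")
        yield tag, buf[offset:end]
        offset = end


if __name__ == "__main__":
    sample = bytes([1, 0, 2]) + b"hi" + bytes([2, 0, 0])
    for tag, value in iter_tlv(sample):
        print(tag, value)
```

Because each record carries its own length, a reader can skip tags it does not recognise without losing the rest of the record, which fits the stated goal of storing all data received rather than normalising to a single format.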

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

gchq-analytic-cloud-challenges-p9-normal.gif:
Analytic use of data

Security

• Data access controlled across 2
dimensions:

• Usability: Operational (standard
events) or Controlled (special
purpose e.g. test)

• Classification: Multiple buckets -
Open (TSS2), a few compartments,
general CIOs Sensitive

• Capability developer access
restricted by file/directory
permissions

• User facing applications apply fine
grained security on per record basis

Silver Library

• Abstraction layer to separate analytic
development from storage
dependencies

• Pluggable parsers for new formats

• Record parsers for event data held in
HDFS

• Applications still need to know field
labels (function calls)

• Applications often add additional
abstractions

• Input and Output handlers to read
and write data based on type

• Utility functions to aid development

• Simple search and filter routines

• Events Product Centre maintained
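
The idea of an abstraction layer with pluggable parsers and simple filter routines can be sketched generically as a registry keyed by data type. This is not the Silver Library API, which the slide does not show; the registry, decorator, record fields and the "csv_demo" type are all hypothetical names used for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]
Parser = Callable[[str], Record]

# Registry mapping a data-type name to the parser that understands it.
_PARSERS: Dict[str, Parser] = {}


def register_parser(data_type: str) -> Callable[[Parser], Parser]:
    """Decorator that plugs a parser in for one data type."""
    def decorator(fn: Parser) -> Parser:
        _PARSERS[data_type] = fn
        return fn
    return decorator


@register_parser("csv_demo")
def parse_csv_demo(line: str) -> Record:
    """Parse a hypothetical three-field comma-delimited line."""
    timestamp, source, kind = line.rstrip("\n").split(",")
    return {"timestamp": timestamp, "source": source, "kind": kind}


def read_records(data_type: str, lines: Iterable[str]) -> Iterator[Record]:
    """Input handler: pick the parser by type, hiding the storage format."""
    parser = _PARSERS[data_type]
    for line in lines:
        yield parser(line)


def simple_filter(records: Iterable[Record], field: str,
                  value: str) -> Iterator[Record]:
    """Keep records whose named field equals the given value."""
    return (r for r in records if r.get(field) == value)


if __name__ == "__main__":
    lines = ["1,a,x\n", "2,b,y\n"]
    print(list(simple_filter(read_records("csv_demo", lines), "kind", "y")))
```

As the slide notes for the real library, callers of such a layer still need to know the field labels ("kind" in this sketch), even though they no longer depend on how the underlying files are laid out.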

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

9

gchq-analytic-cloud-challenges-p10-normal.gif:
Example Analytics

• Glorified grep driven by GUI - find events that contain
user search term

Guiding Light

• All information types/volumes of traffic on bearer

Golden Axe

• Generated list of suspected clone mobile phones (IMEI
grey list)

Tribal Carnem

•Uses Radius logs to identify & collect activity for IP session

Public Anemone

•Geolocation based on web-based map searches

Epic Fail

•Identifies careless use of TOR networks

Sterling Moth

• IP summarisation tool using C2C presence events

Foghorn

• Find non-targets using targets' machines

And many more....

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

gchq-analytic-cloud-challenges-p11-normal.gif:
Rich Visualisation

Desktop applications for rich
visualisation and analysis - Eclipse RCP
LOOKING GLASS client platform, FIRE
ENGINE question-based federated
access to events and reference data
sources
Pilots

- NGCC Contact chaining with
C2C/converged event data

- QFD Federator

- NG Geo: Converged geo events
analysis

Next stage to learn from pilots and
build the tool for everyone - ALPHA
CENTAURI

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

11

gchq-analytic-cloud-challenges-p12-normal.gif:
Rumour Mill - Push Analytics

Rumour Mill is a dashboard that
will:

- Enable analysts to prioritise new
work as it arrives from customers by
easily finding out "what does GCHQ
already know"

- Enable analysts to monitor existing
work to spot when something
happens that would change their
priorities

First level results, list of questions
against identifiers. Simple yes/no
answers

Click on a yes to the second level -
drills into the detail
Many questions are derived from
cloud based analytics run each day
against the current identifier list

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

12

gchq-analytic-cloud-challenges-p13-normal.gif:
Sharing & Collaboration

Other SIA and foreign partners

Data (bulk & query) and technology exchange

Two major components:

• Web user interfaces (VAIL) on GCHQ servers but
accessible from the partner site. Interactive query of
QFDs. Allow exposure of GCHQ tradecraft.

• Brokering services. Sustained access for interactive query
of GCHQ data integrated into partner tools.

26/3/2012 Ref: 18171507

UK TOP SECRET STRAP 1

13

gchq-analytic-cloud-challenges-p14-normal.gif:
Future Options for Event
Processing

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

14

gchq-analytic-cloud-challenges-p15-normal.gif:
Key Challenges

Affordable, continuing
scale of our capabilities

• All dimensions, cost, power, cooling, space, storage, bandwidth, processing etc

• As we share with more partners, the access demands increase

• Enable more delivery by collaborating with others

Enabling our analysts to
cope with complexity
and volume of data

• Need to streamline understood workflows to save analysts time

• Need to push data to analysts, sometimes with low latency

• Enable next level analysis, behavioural, pattern of life, noticing
change, predictions

Complexity and pace of
analytic development

• Suitably skilled people are hard to find in-house and externally

• Must be able to cope with mixed maturity capabilities

• Workload separation, processing contention, cross site analytics

Agility, resilience and
maturity of platforms

• Keep pace with development, respond to community demands

• Understand and match the evolving business expectations of maturity

• Support appropriately, agreed strategy for resilience

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

15

gchq-analytic-cloud-challenges-p16-normal.gif:
Approach to challenges

Deploy more of the
same while maturing

Explore new
technology

Know the value of our
data

Understand usage of
capabilities

Stay flexible and
increase collaboration

• Suitably refreshed, will answer immediate scaling challenge

• Reduce support burden and streamline new capability development

• Implement improvements to existing architecture

• Non-relational, distributed databases; QFD consolidation & convergence

• Research initiatives (A)SEM/HAKIM/STREAMING, and commercial offerings

• Collaboration opportunities CLOUDBASE/ACCUMULO

• Don't generate/ingest data we don't use, filter or de-dupe upstream

• Be smart about retention, be smart about need for interactive availability

• Distil the raw data to generate rich information sets for analysts

• Automate the simple workflows with summaries and push analytics

• Optimise capabilities for their use and access load

• Mature, provide resilience and support as appropriate

• Enable collaboration and lower the entry bar for capability developers

• Stay vendor neutral, continue to use open standards, open software

• Enhance the benefit for both research & engineering of working together

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

16

gchq-analytic-cloud-challenges-p17-normal.gif:
Understand usage of capabilities
Use Case Class Mapping

Large proportion of queries for
most of the analysts

Target, Unknown
Query

Target Based Discovery

TugattaMd

Development

Target Tracking

Small proportion of queries for
some of the analysts

Target, Unknown
Query

Anomaly Detection

Behaviour-based
Discovery

Target, Known

Query

• Developed jointly between GCHQ
and NSA to understand

- the benefits of our current
capabilities

- where respective strengths and
weaknesses exist

• Provides a clear set of drivers for
architectural evolution

- Missing capabilities

- Suboptimal use of capabilities

- Opportunities for collaboration and
reuse

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

17

gchq-analytic-cloud-challenges-p18-normal.gif:

Existing coverage of use cases

Known Target, Known
Query

Streaming, Rumour Mill and QFD capabilities (possibly assisted by cloud analytic)
Could select data subset based on target, could pre-calculate results
Large usage and automation suggests need for optimised capability

Known Target, Unknown
Query

Not possible with existing QFD capabilities

Needs new analytic or more indexes on target selected data

Large usage and needs suggest a new capability

Unknown Target, Known
Query

Core QFD capability, possibly populated by cloud analytic
• Full unselected data set required but what level of interactive query?
• Is it acceptable that older data be made available non-interactively?

Unknown Target, Unknown
Query

• Use batch analytic platform for low level search or new analytic
Could a heavily indexed store provide a responsive capability?

• Possibly only over a subset of data - recent data only?

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

18

gchq-analytic-cloud-challenges-p19-normal.gif:
Value assessment

What measures of value?

• Duplicated over time or across bearers

• Value decreases with age

• Value decreases after processing (SPAM, summaries)

• Use by analysts and analytics

What actions to take?

• Do filtering upstream or don't generate

• Investigate use of streaming capabilities for filtering or sampling

• Discard immediately after first stage processing

• Age off and discard more selectively

• Periodically evaluate data types to understand usage

Need to agree possibilities with operations and experiment

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

19

gchq-analytic-cloud-challenges-p20-normal.gif:
Technology candidates
QFD consolidation and HAKIM

Existing QFD drawbacks

• Some duplication between QFDs (~10%)

• Need new QFDs to answer new questions

Need a consolidated database with multiple indexes and flexible additions

• HAKIM is a research prototype to do just that

• Unification of data, associated data kept together

• Quick and flexible addition of new data types and indexes

• Scalable and cost effective

Other candidates exist and some HAKIM components could be replaced

• Oracle DB or distributed, non-relational database?

• Convergence with HADOOP stack HBASE/ACCUMULO

Engineering is working with research to develop to the next stage

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

20

gchq-analytic-cloud-challenges-p21-normal.gif:
Technology candidates
The "skinny" cloud (Laurel)

What is it?

• A batch analytic HADOOP cluster

• Contains all the data but for a short retention period

What would it be used for?

• An ideal place to do incremental analytics

• Answers the cross-site analytic problem

• Could be dedicated to sustained usage

• Provides some resilience by duplicating recent data

What are the challenges?

• Can we transfer and ingest all the data

Engineering could potentially build this using on-order kit

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

21

gchq-analytic-cloud-challenges-p22-normal.gif:
Technology candidates
The GCHQ "core" (Hardy)

What is it?

• A HADOOP cluster with Map/Reduce and interactive query/analytics capabilities

• Probably using CLOUDBASE/ACCUMULO and reusing NSA knowledge

What would it be used for?

• A place for data and summaries promoted from the bulk stores

• Known Target, Known Query and some Known Target, Unknown Query

• Optimised for major use case and suitable for data sharing

• Provides some resilience by duplicating important data

What are the challenges?

• GCHQ has limited expertise with CLOUDBASE/ACCUMULO

• The promotion analytics and criteria are not developed

Engineering could potentially build this using on-order kit

26/3/2012 Ref: 18156927

UK TOP SECRET STRAP 1

22

gchq-analytic-cloud-challenges-p23-normal.gif:

TDB Events PC SV-1 • Skinny cloud, GCHQ Core and QFD consolidation

[Diagram: skinny cloud, GCHQ Core and consolidated QFD; the remaining diagram labels and purpose text are illegible in the source scan]

Document Date: 2012-05-14

Release Date: 2015-09-25

Document Path: https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p1-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p2-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p3-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p4-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p5-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p6-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p7-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p8-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p9-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p10-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p11-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p12-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p13-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p14-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p15-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p16-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p17-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p18-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p19-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p20-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p21-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p22-normal.gif
https://edwardsnowden.com/wp-content/uploads/2015/10/gchq-analytic-cloud-challenges-p23-normal.gif

Article Link: https://theintercept.com/2015/09/25/gchq-radio-porn-spies-track-web-users-online-identities/
