Title: GCHQ Analytic Cloud Challenges
Release Date: 2015-09-25
Document Date: 2012-05-14
Description: This GCHQ presentation dated 14 May 2012 describes the agency receiving “more than 50 billion events [metadata records] per day” and some of the tools available for analysing that mass of data: see the Intercept article Profiled: From Radio to Porn, British Spies Track Web Users’ Online Identities, 25 September 2015.
Document: gchq-analytic-cloud-challenges-p1-normal.gif:
GCHQ Analytic Cloud Challenges
Innovation Lead for Data, Analytics &
Visualisation Engineering
This information is exempt from Disclosure under the Freedom of Information Act 2000 and may be subject
to exemption under other UK information legislation. Refer disclosure requests to GCHQ on I
14/5/2012
UK TOP SECRET STRAP 1gchq-analytic-cloud-challenges-p2-normal.gif:
Approximately 50 Billion Events Per
Day from 250+ xlOG bearers
• Scale Target for May 2012 is
approximately 320 x 10G with
a capacity for 100 BEPD
Volume by site
Typical retention is six
months, sometimes reduced
to cope with volume increase
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
2gchq-analytic-cloud-challenges-p3-normal.gif:
U« KOHI Itiw-
TDB Events PC OV-1 • Events High Level Concept Graphic
Purpose
A high level view on
GCHQ event (unselected
meta-data) capabilities
and uses
OV-1« Hgt» L«««l Oj-»*•«»:««•! CofK«|rt ******** VJ 14*1 o«m«f tewagai
IM*C**» »IHM« gchq-analytic-cloud-challenges-p4-normal.gif:
Query Focused Dataset
What are QFDs
• A database designed for answering a
single (or few) analytic question
• Initially adopted from
research, engineering tasked to
mature and scale ASAP
• The engineering has been
incremental based on priority
• Library of components and design
patterns developed
• Scale using appropriately sized
database instances, federation
using service tiers
• Deployment of new instances and
updates streamlined
• Corporate support
• New QFDs still being produced by
engineering and others using the
standard components
• Some now provided interactive
access to datasets generated by
HADOOPS batch analytics
QFD Facts and Figures
• Approximately 100 QFD instances
deployed for 16 different QFDs ♦
more for Ul etc
• Each with events (or results)
appropriate for the questions they
answer
• Hardware shared depending on
QFD, all storage is shared
• Driven by the need for a flexible
platform at large scale with low
overall cost
• Redhat Enterprise Linux, Oracle DB
(DB needs are simple)
• Latest generation HP BL456c 67
blades, 24 core, 64GB RAM, EVA
storage "200TB usable
• Most QFD instances have 70TB
usable capacity
• Total storage 15PB raw, 11PB usable
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
4gchq-analytic-cloud-challenges-p5-normal.gif:
A selection of the major QFDs
Name Description Questions Answered Physical &ude*Cheh
AUTOASSOC BuU unsdccted TDI-TDI correlations with confidence scores. What other TC*s belong to your target ? What technologies your target is using ? 2»1 instances, each SO70TB storage
Evolved Mutant Broth Identify »hen certain TOIs appear in traffic which indicate target usage and their location. Telephony and C2C data provide a converged view. Where has my target been? What kind of communications devices has my target been using? 1D»S instances, each 70TB storage
Hard Assoc Provide strongly correlated selectors for both C2C and Telephony traffic taken from TDIs appearing n the same packet Are there any alternative C2C or Telephony selectors for my target? 3»2 instances, each 70TB storage
HRMap Host-rcferrer relationships • information about how people get to websites, including Inks folowed and direct accesses. How do people get to my website of interest and where do they go to next ? What wcbsrtcs have been visaed from a gwen IP? S*3 instances, each 70TB storage
KARMA POUCE Whtch TDIs ha»« been seen at approximately the same time, and from the same computer, as voits to websites. Which websites your target visits, and »hen/where those visits occurred. Who visas suspicious websites, and when/wticre those visits occurred. Which other websites are vented by people who visit a suspicious website. Which IP address and web browser were being used by your target when they vented a website. 11»7 instances, each 70TB storage, 3»1 correlator instances
SOCIAL ANTHROPOID Converged comms events allowing you to see who your targets have communicated with via phone, over the nternet, or usng converged channels (e.g. sending emails from a phone or makng voice cals over the internet). What commiaiications your target is engaged in. Who has your target been communicating with. What commimications have occurred using a particular locator (IP address, cell tower, etc). &»3 instances, each 70TB storage
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
5gchq-analytic-cloud-challenges-p6-normal.gif:
HADOOP Clusters
Facts & Figures
• Three batch analytic clusters
(HADOOP clouds)
• Each contains unique, unselected
events, reference and results data
• Redhat Enterprise Linux, Apache
HADOOP (Map Reduce)
• Each with ^900 data nodes, plus
Ul/ingest/egress servers
• Each node 8 core, 64GB
RAM, 6x1TB
• Each 6PB raw storage, 1.5PB for
events (3x replication and results)
• The three clouds could, in
theory, store 24 Trillion events @140
BEPD and 6 months retention period
Usage
• 200 capability developer accounts
• Many more users able to access
applications
• Guiding light > 150 users
• First Contact, Cloudy Cobra > 50
• Jobs during working hours
• Operationally ad-hoc
(experimental, search, one time
use)
• Development of new capability
• Schedules Jobs run overnight
• Sustained, operational applications
• Increasing expectation for fully
supported, automated jobs such as
Rumour Mill
26/3/2012 Ref: 18171S07
UK TOP SECRET STRAP 1
6gchq-analytic-cloud-challenges-p7-normal.gif:
Physical Overview
Private Cloud
• Data Nodes
• HADOOP Master Nodes (Job Track & Name Node)
Edge Nodes connect to Computer Hall Network
• Ingest Nodes
• Application Nodes (provide user interface - web)
• Innovation Nodes (cap dev login to servers)
26/3/2012 Ref: 18171S07
UK TOP SECRET STRAP 1
7
ID^o j>4>gchq-analytic-cloud-challenges-p8-normal.gif:
Data Ingest
Ingest Design Goals
• Store all data received, i.e.
don't risk losing important
fields by normalising to single
file format.
• Partition by directory structure
• Partition by data type to
simplify common queries.
• Partition by ingest date to
allow incremental analytics.
• Partition by security
compartment.
• Horizontal scaling by adding
new hardware.
Data Formats
• TLV- Tag Length Value
• Comma Delimited
• Fix Position
• Single Line & Multi-Line
• Multiple Variations
• Actor Action
• Many sources and many types
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1gchq-analytic-cloud-challenges-p9-normal.gif:
Analytic use of data
Security
• Data access controlled across 2
dimensions:
• Usability: Operational (standard
events) or Controlled (special
purpose e.g. test)
• Classification: Multiple buckets -
Open (TSS2), a few compartments,
general CIOs Sensitive
• Capability developer access
restricted by file/directory
permissions
• User facing applications apply fine
grained security on per record basis
Silver Library
• Abstraction layer to separate analytic
development from storage
dependencies
• Pluggable parsers for new formats
• Record parsers for event data held in
HDFS
• Applications still need to know field
labels (function calls)
• Applications often add additional
abstractions
• Input and Output handlers to read
and write data based on type
• Utility functions to aid development
• Simple search and filter routines
• Events Product Centre maintained
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
9gchq-analytic-cloud-challenges-p10-normal.gif:
Example Analytics
•Glorified grep driven by GUI - find eventi that contain
uler search term
Guiding Light
• Ml information types/vdumesof traffic on bearer
Golden Axe
•Generated list of suspected clone mobSe phones (IMEI
grey list)
Tribal Carnem
•Uses Radius logs to identify & collect activity for IP session
Public Anemone
•Geolocation based on web-based map searches
Epic Fail
•Identifies careless use of TOR networks
Sterling Moth
• IP summarisation tool using c2C presence events
Foghorn
• find non-targets using targets machines
And many more....
26/3/2012 Ref: 18171S07
UK TOP SECRET STRAP 1gchq-analytic-cloud-challenges-p11-normal.gif:
Rich Visualisation
Desktop applications for rich
visualisation and analysis - Eclipse RCP
LOOKING GLASS client platform, FIRE
ENGINE question-based federated
access to events and reference data
sources
Pilots
- NGCC Contact chaining with
C2C/converged event data
- QFD Federator
- NG Geo: Converged geo events
analysis
Next stage to learn from pilots and
build the tool for everyone - ALPHA
CENTAURI
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
11gchq-analytic-cloud-challenges-p12-normal.gif:
Rumour Mill - Push Analytics
Rumour Mill is a dashboard that
will:
- Enable analysts to prioritise new
work as it arrives from customers by
easily finding out "what does GCHQ
already know"
- Enable analysts to monitor existing
work to spot when something
happens that would change their
priorities
First level results, list of questions
against identifiers. Simple yes/no
answers
Click on a yes to the second level -
drills into the detail
Many questions are derived from
cloud based analytics run each day
against the current identifier list
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
12gchq-analytic-cloud-challenges-p13-normal.gif:
Sharing & Collaboration
Other SIA and foreign partners
Data (bulk & query) and technology exchange
Two major components:
• Web user interfaces (VAIL) on GCHQ servers but
accessible from the partner site. Interactive query of
QFDs. Allow exposure of GCHQtradecraft.
• Brokering services. Sustained access for interactive query
of GCHQ data integrated into partner tools.
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
13gchq-analytic-cloud-challenges-p14-normal.gif:
Future Options for Event
Processing
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
14gchq-analytic-cloud-challenges-p15-normal.gif:
Key Challenges
Affordable, continuing
scale of our capabilities
• All dimensions, cost, power, cooling, space, storage, bandwidth, processing etc
• As we share with more partners, the access demands increase
• Enable more delivery by collaborating with others
Enabling our analysts to
cope with complexity
and volume of data
• Need to streamline understood workflows to save analysts time
• Need to push data to analysts, sometimes with low latency
• Enable next level analysis, behavioural, pattern of life, noticing
change, predictions
Complexity and pace of
analytic development
• Suitably skilled people are hard to find m-house and externally
• Must be able to cope with mixed maturity capabilities
• Workload separation, processing contention, cross site analytics
Agility, resilience and
maturity of platforms
• Keep pace with development, respond to community demands
• Understand and match the evolving business expectations of maturity
• Support appropriately, agreed strategy for resilience
26/3/2012 Ref: 181S6927
UK TOP SECRET STRAP 1
ISgchq-analytic-cloud-challenges-p16-normal.gif:
Approach to challenges
Deploy more of the
same while maturing
Explore new
technology
Know the value of our
data
Understand usage of
capabilities
Stay flexible and
increase collaboration
• Suitably refreshed, will answer Immediate scaling challenge
• Reduce support burden and streamline new capability development
• Implement Improvements to existing architecture
• Non-relational, distributed databases; QFD consolidation & convergence
• Research initiatives (A)SEM/HAKIM/STREAMING, and commercial offerings
• Collaboration opportunities CLOUDBASE/ACCUMULO
• Don't generate/ingest data we don't use, filter or de-dupe upstream
• Be smart about retention, be smart about need for Interactive availability
• Distil the raw data to generate rich information sets for analysts
• Automate the simple workflows with summaries and push analytics
• Optimise capabilities for their use and access load
• Mature, provide resilience and support as appropriate
• Enable collaboration and lower the entry bar for capability developers
• Stay vendor neutral, continue to use open standards, open software
• Enhance the benefit for both research & engineering of working together
26/3/2012 Ref; 181S6927
UK TOP SECRET STRAP 1
16gchq-analytic-cloud-challenges-p17-normal.gif:
Understand usage of capabilities
Use Case Class Mapping
Large proportion of queries for
most of the analysts
Target, Unknown
Query
Target Based Discovery
TugattaMd
Development
Target Tracking
Small proportion of queries for
some of the analysts
Target, Unknown
Query
Anomaly Detection
Behaviour-based
Discovery
Target. Known
Query
• Developed jointly between GCHQ
and NSA to understand
- the benefits of our current
capabilities
- where respective strengths and
weaknesses exist
• Provides a clear set of drivers for
architectural evolution
- Missing capabilities
- Suboptimal use of capabilities
- Opportunities for collaboration and
reuse
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
17gchq-analytic-cloud-challenges-p18-normal.gif:
v—
Existing coverage of use cases
Known Target, Known
Query
Streaming. Rumour Mill and QFD capabilities (possibly assisted by cloud analytic)
Could select data subset based on target, could pre-calculate results
Large usage and automation suggests need for optimised capability
Known Target, Unknown
Query
Not possible with existing QFD capabilities
Needs new analytic or more Indexes on target selected data
Large usage and needs suggest a new capability
Unknown Target, Known
Query
Core QFD capability, possibly populated by cloud analytic
1 Full unselected data set required but what level of Interactive query?
1 Is It acceptable that older data be made available non-lnteractively?
Unknown
Target, Unknown Query
1 Use batch analytic platform for low level search or new analytic
Could a heavily indexed store provide a responsive capability?
• Possibly only over a subset of data - recent data only?
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
18gchq-analytic-cloud-challenges-p19-normal.gif:
Value assessment
What measures of value?
• Duplicated over time or across bearers
• Value decreases with age
• Value decreases after processing (SPAM, summaries)
• Use by analysts and analytics
What actions to take?
• Do filtering upstream or don't generated
• Investigate use of streaming capabilities for filtering or sampling
• Discard immediately after first stage processing
• Age off and discard more selectively
• Periodically evaluate data types to understand usage
Need to agree possibilities with operations and experiment
26/3/2012 Ref: 181S6927
UK TOP SECRET STRAP 1
19gchq-analytic-cloud-challenges-p20-normal.gif:
Technology candidates
QFD consolidation and HAKIM
Existing QFD drawbacks
• Some duplication between QFDs (~10%)
• Need new QFDs to answer new questions
Need a consolidated database with multiple indexes and flexible additions
• HAKIM is a research prototype to do just that
• Unification of data, associated data kept together
• Quick and flexible addition of new data types and indexes
• Scalable and cost effective
Other candidates exist and some HAKIM components could be replaced
• Oracle DB or distributed, non-relational database?
• Convergence with HADOOP stack HBASE/ACCUMULO
Engineering is working with research to develop to the next stage
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
20gchq-analytic-cloud-challenges-p21-normal.gif:
Technology candidates
The "skinny" cloud (Laurel)
What is it?
• A batch analytic HADOOP cluster
• Contains all the data but for a short retention period
What would it be used for?
• An ideal place to do incremental analytics
• Answers the cross-site analytic problem
• Could be dedicated to sustained usage
• Provides some resilience by duplicating recent data
What are the challenges?
• Can we transfer and ingest all the data
Engineering could potential build this using on-order kit
26/3/2012 Ref: 181S6927
UK TOP SECRET STRAP 1
21gchq-analytic-cloud-challenges-p22-normal.gif:
Technology candidates
The GCHQ "core" (Hardy)
What is it?
• A HADOOP cluster with Map/Reduce and interactive query/analytics capabilities
• Probably using CLOUDBASE/ACCUMULO and reusing NSA knowledge
What would it be used for?
• A place for data and summaries promoted from the bulk stores
• Known Target, Know Query and some Known Target, Unknown Query
• Optimised for major use case and suitable for data sharing
• Provides some resilience by duplicating important data
What are the challenges?
• GCHQ has limited expertise with CLOUDBASE/ACCUMULO this technology
• The promotion analytics and criteria are not developed
Engineering could potential build this using on-order kit
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
22gchq-analytic-cloud-challenges-p23-normal.gif:
SM TO* NCXr
TDB Events PC SV-1 • Skinny cloud, GCHQ Core and QFD consolidation
Clc-KJ
I AH tt* data at
ttW* MWJld
Q -rv »UfOM ,
Bu»W« 1"
IftHAButW. w c*«a H **£
E»tntt
AJ E*««9 k*
Scociv E«#N T,i#a* C«nUd«MM Qlt>*
Cp«r«to'A Trial XARIMi
On* #o«*n
Ot«Y
HiAZU
Palpóse
Sbcwa possitk (KW a'ch tortee V* GCHQ Everra
He«* conpororrs
Sknrry douJ afi ir* dsra *cm Cor üw bu*, beW rv ***
Un* A |im:m tú do dttly. «»uwfmuUl arutplc» en bufe
e*«ni5 Penen so*/. on>y &*>wanpacr/ckefe a u**6t* COpy c/ Un dilA fur a Wo»! 3tym pacied
QMng SOTO remenee
GCHQ Co«* - VJuatfe «ven» ar* seccied |NSA A
GCHQ MMk) tf/ ¿a* )laa ¿nd ocekd «rito ir* co* Tr*
con» e fKt irtnwvy a p9U? *H toveb anar* 1c*. r*rt>ec an
hUracf «* du«> focotfrxy Ir» AAJtkn to ir* f*w w\íftt
wrmíTfi W b?4d and sbp«*d » USA lor1mygtwn
hio tbíar icott Al tbo inpcrjfi: ova«* día rad Vw tr*
nxms rvtjnxr pened pPOttfng ro« lenco.
Censo Malee Qf D • TN> non stojo Qf Os. ow
moro teaiei« aro rrctowtu heneen
8V1 «no*w *r*e A
M(>llWd U4I7*1} 14 4* t*
0*mi

Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p1-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p2-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p3-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p4-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p5-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p6-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p7-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p8-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p9-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p10-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p11-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p12-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p13-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p14-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p15-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p16-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p17-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p18-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p19-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p20-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p21-normal.gif)
Download Document
GCHQ Analytic Cloud Challenges (gchq-analytic-cloud-challenges-p22-normal.gif)