Graph-Based Entity Resolution

Sep 4

Graph-based Entity Resolution (ER) in cybersecurity is a sophisticated data management and analysis technique focusing on identifying, linking, and merging disparate data records that refer to the same real-world "entity" within a cybersecurity context. Unlike traditional database matching, which might struggle with variations, inconsistencies, and the sheer volume of data, graph-based ER excels at uncovering hidden relationships and building a unified, accurate view of these entities.

What is an "Entity" in Cybersecurity?

In cybersecurity, an "entity" can be almost anything that generates or is involved in data:

Individuals: Users, administrators, threat actors, employees, and customers. Usernames, email addresses, real names, IP addresses, device IDs, physical addresses, etc., can represent these.
Devices: Laptops, servers, mobile phones, IoT devices, network routers, firewalls, and endpoints might be identified by MAC addresses, IP addresses, hostnames, serial numbers, operating system versions, and more.
Software/Applications: Specific applications, processes, malware strains, vulnerabilities (CVEs).
Network Elements: IP addresses, domain names, subnets, open ports, network segments.
Indicators of Compromise (IOCs): Malicious file hashes, command and control (C2) server addresses, phishing URLs.
Threat Actors/Groups: Named APT groups, ransomware gangs, and individual cybercriminals.
Attack Techniques: Specific tactics, techniques, and procedures (TTPs) defined by frameworks like MITRE ATT&CK.

How Graph-based Entity Resolution Works:

The core idea is to represent all the collected cybersecurity data as a graph.

Nodes (Entities): Each distinct piece of information (e.g., an IP address, a username, a file hash) becomes a "node" in the graph.
Edges (Relationships): The connections or relationships between these pieces of information become "edges." For example, an edge might link:

A user to the device they logged into.
A device to an IP address it used.
An IP address to a malicious domain it communicated with.
A malware sample to a known threat actor group.
A vulnerability to a compromised device.
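The node-and-edge model described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's data model: the class name SecurityGraph, the relation labels (LOGGED_INTO, USED_IP, COMMUNICATED_WITH), and the sample identifiers are all hypothetical.

```python
from collections import defaultdict


class SecurityGraph:
    """Minimal undirected graph with typed nodes and labelled edges."""

    def __init__(self):
        self.nodes = {}                     # node id -> node type
        self.adjacency = defaultdict(set)   # node id -> {(relation, neighbor id)}

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Store both directions so traversal works from either endpoint.
        self.adjacency[source].add((relation, target))
        self.adjacency[target].add((relation, source))

    def neighbors(self, node_id):
        return sorted(target for _, target in self.adjacency[node_id])


graph = SecurityGraph()
graph.add_node("alice", "user")
graph.add_node("laptop-17", "device")
graph.add_node("10.0.0.5", "ip")
graph.add_node("bad.example", "domain")

# Hypothetical relationships mirroring the examples in the list above.
graph.add_edge("alice", "LOGGED_INTO", "laptop-17")
graph.add_edge("laptop-17", "USED_IP", "10.0.0.5")
graph.add_edge("10.0.0.5", "COMMUNICATED_WITH", "bad.example")

print(graph.neighbors("laptop-17"))  # ['10.0.0.5', 'alice']
```

A production system would more likely use a graph database or a dedicated graph library; the adjacency dictionary here only shows the shape of the data.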

The Resolution Process:

Graph-based ER then uses a combination of techniques to determine if different nodes or sets of nodes refer to the same underlying real-world entity:

Data Ingestion and Normalization: Data from diverse, often disparate sources (e.g., SIEM logs, endpoint detection and response (EDR) systems, threat intelligence feeds, HR databases, identity management systems) is collected and standardized into a common format.
Feature Extraction: Relevant attributes are extracted from each record. These could be names, addresses, timestamps, device IDs, login patterns, etc.
Matching Algorithms: This is where the "resolution" happens. Graph-based methods go beyond simple exact matches, using:

Deterministic Matching: Rule-based approaches where if specific attributes match exactly (e.g., two records have the same email address and date of birth), they are considered the same entity.
Probabilistic Matching: Statistical methods that assign a probability score to the likelihood that two records refer to the same entity, even with variations (e.g., "John Smith" vs. "J. Smith," or slight variations in the timestamps recorded for an IP address). Machine learning models are often used here.
Graph Algorithms: This is the distinguishing factor. Graph traversal and analysis algorithms are applied to identify connections that indicate shared identity. For example:

Community Detection: Grouping highly interconnected nodes, suggesting they belong to the same entity or a closely related cluster (e.g., multiple accounts sharing the same unique device ID or login pattern).
Pathfinding: Tracing connections to see if seemingly unrelated nodes are linked through a series of intermediaries (e.g., an alert on one device, then a series of network hops, leading to a compromised server – all potentially linked to the same attack campaign).
Centrality Measures: Identifying "highly connected" nodes that might be critical to an entity's profile or attack chain (e.g., a specific C2 server with which many compromised devices communicate).

Merging and Unification: Once identified as the same entity, the various records are linked or merged to create a comprehensive "golden record" or unified entity profile. This profile is continuously enriched as new data comes in.
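The four steps above can be sketched end to end in plain Python. This is a simplified illustration under stated assumptions: the sample records are invented, a string-similarity threshold from the standard library's difflib stands in for a real probabilistic model, connected components (computed with union-find) stand in for richer graph algorithms, and the "keep the longest value" merge rule is an arbitrary choice made for brevity.

```python
from difflib import SequenceMatcher

# Invented sample records from hypothetical sources.
records = [
    {"id": "r1", "name": "John Smith", "email": "John.Smith@Example.com", "device": "dev-1"},
    {"id": "r2", "name": "J. Smith", "email": "john.smith@example.com ", "device": None},
    {"id": "r3", "name": "Jon Smith", "email": None, "device": "dev-1"},
    {"id": "r4", "name": "Mary Jones", "email": "mary@example.com", "device": "dev-2"},
]


def normalize(record):
    """Step 1: bring free-text fields into a common format."""
    clean = dict(record)
    clean["name"] = record["name"].strip().lower()
    clean["email"] = record["email"].strip().lower() if record["email"] else None
    return clean


def is_match(a, b, threshold=0.85):
    """Steps 2-3: compare extracted features of two records."""
    # Deterministic rule: identical normalized email address.
    if a["email"] and a["email"] == b["email"]:
        return True
    # Similarity rule (stand-in for a probabilistic model):
    # a shared device plus a sufficiently similar name.
    similarity = SequenceMatcher(None, a["name"], b["name"]).ratio()
    same_device = a["device"] is not None and a["device"] == b["device"]
    return same_device and similarity >= threshold


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge(members):
    """Step 4: build one unified profile per cluster (assumed survivorship rule)."""
    golden = {"source_ids": sorted(m["id"] for m in members)}
    for field in ("name", "email", "device"):
        values = [m[field] for m in members if m[field]]
        golden[field] = max(values, key=len) if values else None
    return golden


def resolve(raw_records):
    cleaned = [normalize(r) for r in raw_records]
    parent = {r["id"]: r["id"] for r in cleaned}
    # Each match is an edge; connected components become entities.
    for i, a in enumerate(cleaned):
        for b in cleaned[i + 1:]:
            if is_match(a, b):
                parent[find(parent, a["id"])] = find(parent, b["id"])
    clusters = {}
    for r in cleaned:
        clusters.setdefault(find(parent, r["id"]), []).append(r)
    return [merge(members) for members in clusters.values()]


profiles = resolve(records)
for profile in profiles:
    print(profile)
```

Note that r2 and r3 never match each other directly; they end up in the same profile only because both match r1. That transitive linking is what the graph view adds over pairwise comparison. Comparing every pair is quadratic in the number of records, so larger datasets usually need a blocking step first.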

Benefits in Cybersecurity:

Holistic Threat View: Connects seemingly disparate alerts, logs, and threat intelligence into a cohesive narrative, making it easier to understand the full scope of an attack.
Improved Threat Detection: By unifying identities, security teams can detect subtle patterns of malicious activity that would be missed if data were analyzed in silos (e.g., an attacker using multiple temporary accounts or devices).
Faster Incident Response: Provides a complete picture of compromised assets, affected users, and attack pathways, enabling quicker and more effective containment and remediation.
Enhanced Fraud Detection: Crucial for identifying fraudulent accounts, transactions, or identities by linking various data points that might appear different but belong to the same fraudulent actor.
Better Asset Management: Creates an accurate, real-time inventory of all devices, users, and applications within an organization, improving vulnerability management and compliance.
Contextualized Threat Intelligence: This technology integrates external threat intelligence (IOCs, known threat actor TTPs) with internal data, providing richer investigation context.
Reduced False Positives: The system can better differentiate between legitimate and malicious activity by connecting more data points, reducing analyst alert fatigue.
Predictive Security: By understanding historical attack patterns and linked entities, organizations can better predict and proactively defend against future threats.

Graph-based entity resolution transforms raw, fragmented cybersecurity data into actionable intelligence by revealing the underlying relationships and ensuring that "who" or "what" is involved in a security event is accurately identified, regardless of how many different data points refer to it.

ThreatNG, as an all-in-one external attack surface management, digital risk protection, and security ratings solution, applies the principles of graph-based entity resolution to view an organization's external security posture comprehensively. Although the ThreatNG documentation does not explicitly use the term "graph-based entity resolution," the functionalities described strongly indicate its underlying use. By performing purely external unauthenticated discovery without connectors, ThreatNG acts as a powerful data collector, feeding a graph model where various data points become nodes and their observed relationships form edges.

ThreatNG's External Discovery and Graph-based Entity Resolution

ThreatNG's ability to perform external discovery serves as the initial data ingestion phase for its underlying graph. Every discovered asset, whether a domain, subdomain, IP address, mobile app, or code repository, becomes a node. The process of linking these discovered elements directly demonstrates entity resolution in action. For example:

Domains and Subdomains: When ThreatNG discovers a primary domain, it enumerates its subdomains. Each subdomain is a node, and the link to the primary domain is an edge, establishing a clear hierarchical relationship. This resolves blog.example.com and shop.example.com as parts of the example.com entity.
IP Addresses and ASNs: Discovered IP addresses are linked to their respective Autonomous System Numbers (ASNs) and country locations. This resolves multiple IP addresses as belonging to the same organizational network entity.
Mobile Apps and Marketplaces: ThreatNG discovers mobile apps in various marketplaces. Despite being on different platforms, an app found on both Google Play and Apple App Store is resolved as the same mobile application entity. The link between the app and its presence in multiple marketplaces forms the edges.
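The linking of subdomains to a parent domain and of IP addresses to an ASN can be illustrated with a short sketch. It is not ThreatNG's implementation. The helper parent_domain, the edge labels, and the sample data are assumptions, and the "last two labels" rule is deliberately naive (suffixes such as co.uk would need a public suffix list). The IP addresses and AS numbers come from ranges reserved for documentation.

```python
from collections import defaultdict


def parent_domain(hostname):
    # Naive assumption: the registrable domain is the last two labels.
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


# Hypothetical discovery results.
hostnames = ["blog.example.com", "shop.example.com", "Example.com", "api.example.org"]
ip_to_asn = {
    "192.0.2.10": "AS64500",
    "192.0.2.11": "AS64500",
    "198.51.100.7": "AS64501",
}

edges = []
for host in hostnames:
    root = parent_domain(host)
    if host.lower() != root:
        edges.append((host.lower(), "SUBDOMAIN_OF", root))
for ip, asn in ip_to_asn.items():
    edges.append((ip, "ANNOUNCED_BY", asn))

# Group sources under the entity they resolve to.
members = defaultdict(list)
for source, _, target in edges:
    members[target].append(source)

print(members["example.com"])  # ['blog.example.com', 'shop.example.com']
print(members["AS64500"])      # ['192.0.2.10', '192.0.2.11']
```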

External Assessment and Graph-based Entity Resolution

ThreatNG's external assessment capabilities rely on connecting these disparate nodes to form a complete picture of an entity's risk. This is where the graph becomes crucial for understanding relationships that inform various susceptibility scores:

Web Application Hijack Susceptibility & Subdomain Takeover Susceptibility: These scores are derived from analyzing external attack surface and digital risk intelligence, including Domain Intelligence. This involves linking discovered web applications and subdomains (nodes) to their DNS records, SSL certificate statuses, and other factors (more nodes). The graph reveals that a particular web application entity is associated with specific DNS configurations, and vulnerabilities found within those configurations are linked to that web application entity, impacting its susceptibility score. For example, if staging.example.com has an outdated DNS record pointing to a deprovisioned server (a dangling DNS entry), the graph links this subdomain entity to the outdated record entity and attributes the resulting subdomain takeover susceptibility to the example.com entity.
BEC & Phishing Susceptibility: This score integrates Domain Intelligence (DNS Intelligence, Domain Name Permutations, Web3 Domains, Email Intelligence) and Dark Web Presence (Compromised Credentials). The graph reveals that a domain entity is associated with particular email security configurations (DMARC, SPF, DKIM records). If compromised credentials related to that domain are found on the dark web, these are linked as attributes of the same organizational entity, increasing its phishing susceptibility. For instance, if example.com is found to have poor SPF records and multiple employee credentials for example.com are found in DarCache Rupture, the graph connects these findings to the example.com entity, driving a higher BEC & Phishing Susceptibility score.
Data Leak Susceptibility: This score combines Cloud and SaaS Exposure, Dark Web Presence, Domain Intelligence, Sentiment, and Financials. The graph links detected open cloud buckets or compromised credentials to the owning organization's entity. If a specific AWS S3 bucket belonging to example.com is found to be publicly exposed, the graph resolves this bucket as an asset of example.com, contributing to its data leak susceptibility.
Mobile App Exposure: ThreatNG discovers mobile apps in marketplaces and investigates their content for exposed access credentials, security credentials, and platform-specific identifiers. The graph treats a mobile app in a marketplace as an entity, and any sensitive data discovered within its code, such as an embedded AWS Access Key ID, is linked as a critical vulnerability to that app entity and, by extension, to the organization that owns it. If the mobile app for 'Example Retail' found on Google Play contains a hardcoded AWS Access Key ID, the graph resolves this access key as a sensitive data point tied to the mobile app entity, which is in turn tied to the 'Example Retail' organization, thus increasing the mobile app exposure score for 'Example Retail'.
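A recurring pattern in these examples is that a finding on one asset is attributed, through ownership edges, to the organization behind it. The sketch below shows that roll-up as a simple walk up the ownership chain. The asset names, the owned_by mapping, and the finding descriptions are hypothetical, and no actual scoring formula is implied.

```python
# Hypothetical ownership edges: child asset -> owning entity.
owned_by = {
    "staging.example.com": "example.com",
    "s3://example-backups": "example.com",
    "example.com": "Example Retail",
    "example-retail-app": "Example Retail",
}

# Hypothetical findings attached directly to individual assets.
findings = {
    "staging.example.com": ["dangling DNS record"],
    "s3://example-backups": ["publicly readable bucket"],
    "example-retail-app": ["hardcoded AWS Access Key ID"],
}


def owner_chain(asset):
    """Follow ownership edges upward, guarding against cycles."""
    chain, seen = [], {asset}
    while asset in owned_by:
        asset = owned_by[asset]
        if asset in seen:
            break
        seen.add(asset)
        chain.append(asset)
    return chain


def findings_for(entity):
    """Collect findings on the entity itself and on everything it owns."""
    rolled_up = []
    for asset, issues in findings.items():
        if asset == entity or entity in owner_chain(asset):
            rolled_up.extend((asset, issue) for issue in issues)
    return sorted(rolled_up)


for asset, issue in findings_for("Example Retail"):
    print(f"{asset}: {issue}")
```

Querying "example.com" returns only the two findings beneath that domain, while querying "Example Retail" also includes the mobile app finding.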

Reporting and Continuous Monitoring

The unified view provided by graph-based entity resolution is fundamental to ThreatNG's reporting. The various reports (Executive, Technical, Prioritized, Security Ratings, Inventory, Ransomware Susceptibility, U.S. SEC Filings) are not just lists of findings but present an interconnected view of risks, precisely because the underlying data has been resolved into coherent entities. Continuous monitoring leverages this graph by constantly updating the nodes and edges as new data is collected, identifying new relationships or changes to existing ones, such as a newly exposed sensitive port or a recent ransomware event tied to a monitored entity.
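One simple way to picture continuous monitoring over a graph is as a comparison of edge sets between two snapshots. The sketch below assumes edges are stored as (source, relation, target) tuples; the function name diff_snapshots and the sample data are hypothetical.

```python
def diff_snapshots(previous_edges, current_edges):
    """Return the edges that appeared or disappeared between two snapshots."""
    previous, current = set(previous_edges), set(current_edges)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


yesterday = [
    ("example.com", "HAS_SUBDOMAIN", "blog.example.com"),
    ("blog.example.com", "EXPOSES_PORT", "443"),
]
# A newly exposed sensitive port shows up as a new edge.
today = yesterday + [("blog.example.com", "EXPOSES_PORT", "3389")]

changes = diff_snapshots(yesterday, today)
print(changes["added"])    # [('blog.example.com', 'EXPOSES_PORT', '3389')]
print(changes["removed"])  # []
```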

Investigation Modules and Graph-based Entity Resolution

ThreatNG's investigation modules are prime examples of how graph-based entity resolution powers detailed cybersecurity investigations:

Domain Intelligence: This module gathers extensive data, including DNS records, email intelligence, and WHOIS information. The graph links all these attributes to a specific domain entity. For example, if investigating example.com, the graph would resolve all its associated subdomains, their respective HTTP responses, server headers, known vulnerabilities on specific ports (e.g., exposed SSH or RDP ports), and any Web Application Firewalls discovered. This holistic view is impossible without resolving all these discrete pieces of information into a single, cohesive entity profile. Suppose an analyst investigates example.com and sees that its admin.example.com subdomain (resolved as part of example.com) has an exposed RDP port (determined as a vulnerability of that subdomain). The graph links these findings to the central example.com entity, allowing for targeted investigation and remediation.
Sensitive Code Exposure: This module discovers public code repositories and searches their contents for sensitive data like API keys, access tokens, and cloud credentials. Each discovered repository, file, and sensitive credential becomes a node. The graph links these sensitive findings directly to the code repository entity and, by extension, to the organization that owns it. If ThreatNG discovers a GitHub repository belonging to "Example Corp" (resolved as a "code repository" entity for "Example Corp") containing an AWS Secret Access Key (resolved as a "sensitive credential" entity), the graph links the two, immediately highlighting a severe code secret exposure for "Example Corp."
Mobile Application Discovery: Beyond finding apps, this module delves into the app's contents for exposed credentials. The graph identifies the mobile app as an entity. Then it links any discovered sensitive access credentials (e.g., a Facebook Access Token) or security credentials (e.g., an RSA Private Key) directly to that app entity. This allows analysts to see not just that an app is exposed, but how it's exposed due to sensitive data.

Intelligence Repositories (DarCache) and Graph-based Entity Resolution

ThreatNG's DarCache repositories are pre-populated graphs of known threat intelligence, which are then integrated and resolved against an organization's discovered assets.

DarCache Vulnerability (NVD, EPSS, KEV, Verified PoC Exploits): This repository provides a comprehensive graph of known vulnerabilities. Each CVE's CVSS score, severity, attack complexity, interaction, vector, and impact are nodes. The relationships between these vulnerability attributes and their associated EPSS scores (likelihood of exploitation), KEV status (actively exploited), and verified PoC exploits are established. When ThreatNG discovers a vulnerability on an organization's asset (e.g., an exposed server with a known CVE), the system resolves this finding against the DarCache Vulnerability graph to enrich the context. For instance, if ThreatNG identifies CVE-2023-XXXX on a web server belonging to "Example Corp", it checks DarCache Vulnerability. If this CVE has a high EPSS score and is listed in KEV, the graph-based entity resolution links these critical threat intelligence attributes to the "Example Corp" server entity, immediately flagging it as a high-priority risk due to active exploitation and high likelihood of being weaponized.
DarCache Rupture (Compromised Credentials): This is a graph of compromised credentials. When ThreatNG performs external discovery and finds email addresses or usernames associated with an organization, it resolves these against DarCache Rupture. If a match is found, the compromised credential entity is linked to the organizational user entity, directly impacting scores like BEC & Phishing Susceptibility and Data Leak Susceptibility.
DarCache Ransomware: This tracks over 70 ransomware gangs and their activities. If ThreatNG discovers evidence of ransomware activity or mentions related to a monitored organization on the dark web, the system resolves this information against DarCache Ransomware, linking the specific ransomware gang entity to the organization's entity, thus increasing its Breach & Ransomware Susceptibility.
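The enrichment step described for DarCache Vulnerability, where a discovered CVE is joined with its EPSS score and KEV status, can be sketched as a lookup plus a prioritization rule. Everything here is an assumption made for illustration: the CVE identifiers and scores are made up, the 0.5 threshold is arbitrary, and only the "high" tier (high EPSS and KEV-listed) comes from the text above; the "medium" and "low" tiers are added to make the example complete.

```python
# Hypothetical local copies of intelligence data (IDs and scores are made up).
epss_scores = {"CVE-0000-0001": 0.92, "CVE-0000-0002": 0.03}
kev_listed = {"CVE-0000-0001"}

# Hypothetical findings from external discovery.
discovered = [
    {"asset": "web-1.example.com", "cve": "CVE-0000-0001"},
    {"asset": "web-2.example.com", "cve": "CVE-0000-0002"},
    {"asset": "web-3.example.com", "cve": "CVE-0000-0003"},  # no intel available
]


def enrich(finding, epss_threshold=0.5):
    """Attach EPSS and KEV context to a finding and derive a priority."""
    cve = finding["cve"]
    score = epss_scores.get(cve)
    in_kev = cve in kev_listed
    high_epss = score is not None and score >= epss_threshold
    if in_kev and high_epss:
        priority = "high"
    elif in_kev or high_epss:
        priority = "medium"   # assumed tier, not from the source text
    else:
        priority = "low"      # assumed tier, not from the source text
    return {**finding, "epss": score, "kev": in_kev, "priority": priority}


enriched = [enrich(f) for f in discovered]
for item in enriched:
    print(item["asset"], item["cve"], item["priority"])
```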

Synergies with Complementary Solutions

While ThreatNG is a powerful standalone solution, its graph-based approach to entity resolution allows it to work effectively with other cybersecurity tools, enhancing the overall security ecosystem.

Security Information and Event Management (SIEM) / Security Orchestration, Automation, and Response (SOAR) Platforms: ThreatNG's resolved entity information and risk scores could feed into a SIEM/SOAR. For example, if ThreatNG identifies a high-risk mobile app exposure with exposed API keys, this resolved entity and its associated risk could trigger an alert in the SIEM. A SOAR playbook could then automatically initiate actions like scanning internal code repositories for similar keys or revoking the exposed keys, leveraging the entity resolution provided by ThreatNG. If ThreatNG identifies a critical vulnerability (resolved from DarCache Vulnerability) on a public-facing server belonging to a "critical asset" entity (resolved by ThreatNG's discovery), this high-priority alert can be fed into a SIEM. The SIEM, now having a richer, resolved context, can correlate it with internal logs from an EDR solution to see if any internal systems have attempted to exploit this specific vulnerability or if anomalous activity originated from that server.
Vulnerability Management Platforms: A vulnerability management platform could ingest ThreatNG's detailed vulnerability intelligence, enriched by EPSS and KEV data. This allows the platform to prioritize remediation efforts based on the external attack surface context provided by ThreatNG's entity resolution. For example, if ThreatNG resolves a publicly exposed web server entity as having a vulnerability actively exploited in the wild (from KEV), the vulnerability management platform can use this context to elevate its priority over an equally severe but less externally exposed vulnerability.
Identity and Access Management (IAM) Solutions: ThreatNG's ability to discover compromised credentials (DarCache Rupture) and email intelligence can inform IAM solutions. If ThreatNG finds that a specific user's credentials have appeared on the dark web, this information can be pushed to the IAM system to trigger a password reset or multi-factor authentication enforcement for that user entity.
Network Detection and Response (NDR) Tools: ThreatNG's IP and Domain Intelligence can provide external context to NDR tools. If an NDR tool detects suspicious traffic to an IP address that ThreatNG has resolved as a known C2 server (from its intelligence repositories), the tool has immediate external threat intelligence to enhance its detection capabilities.

ThreatNG's graph-based entity resolution transforms raw data into a coherent, interconnected map of an organization's external cybersecurity landscape. This allows for unparalleled visibility, more accurate risk assessment, and a stronger defense against external threats.

Threat NG Staff
