Primary Source Collection
What is Primary Source Collection in Cybersecurity?
Primary source collection is the process of gathering raw, original, and uninterpreted data directly from the point of origin within a digital environment or the public internet. In cybersecurity, this involves capturing information directly from systems, network traffic, hardware sensors, or human actors without relying on secondary analysis, third-party reports, or "repackaged" intelligence.
The goal of primary source collection is to establish a "ground truth." By looking at the original evidence, security professionals can ensure the highest level of forensic integrity and accuracy, reducing the risk of making decisions based on outdated, biased, or incorrectly attributed information.
Types of Primary Sources in Cybersecurity
To build a complete picture of an organization’s risk, security teams collect data from several different primary "vantage points."
Endpoint Telemetry: Raw data harvested directly from workstations, servers, and mobile devices. This includes process logs, file system changes, and registry modifications.
Network Traffic (PCAP): The direct capture of data packets as they move across a network. Packet captures are often the most direct evidence of how an attacker moved laterally or what data was exfiltrated, although encrypted traffic can limit what the payloads reveal.
System and Application Logs: Original audit trails generated by operating systems, databases, and web servers that record exactly who accessed what and when.
Malware Samples: The actual malicious code found on a compromised system. Analyzing the original binary is a primary source activity used to reverse-engineer an attacker's tactics.
External Reconnaissance Data: Information gathered from the public internet using unauthenticated discovery. This includes DNS records, open ports, and leaked credentials found on the dark web.
Human Intelligence (HUMINT): Direct interactions or observations of threat actor behavior in forums, chat rooms, or through social engineering "stings."
Why Primary Source Collection is Critical for Security
Relying on primary sources rather than secondary summaries provides several strategic advantages for a modern Security Operations Center (SOC).
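Forensic Admissibility: For legal or regulatory proceedings, primary data, such as an original disk image or an unedited log, is often required as evidence, and it must be accompanied by a documented chain of custody showing that it has not been altered since collection.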
Eliminating Third-Party Bias: Secondary sources often interpret data based on their own algorithms or risk models. Primary collection allows an organization to apply its own specific business context to the raw facts.
Real-Time Detection: Waiting for a third-party report can take days or weeks. Collecting primary data directly from the attack surface allows for immediate alerts the moment a configuration changes or a new vulnerability appears.
Accurate Attribution: By looking at raw network headers and original code snippets, investigators can more accurately identify the specific "fingerprints" of a threat actor group.
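The chain-of-custody requirement mentioned above usually starts with recording a cryptographic digest at collection time, so that later copies can be checked against it. The following is a minimal sketch, assuming the evidence is a file on disk; the function names and the record fields are illustrative and do not follow any particular forensic standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: str, collector: str) -> dict:
    """Build a hypothetical chain-of-custody entry for one evidence file."""
    return {
        "file": Path(path).name,
        "size_bytes": Path(path).stat().st_size,
        "sha256": sha256_of_file(path),
        "collected_by": collector,
        "collected_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def verify(path: str, record: dict) -> bool:
    """True if the file still matches the digest recorded at collection."""
    return sha256_of_file(path) == record["sha256"]
```

Calling `verify` before each analysis step gives a simple, repeatable check that the working copy is still identical to what was originally collected.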
Common Challenges in Primary Source Collection
While primary data is the most accurate, it is often the most difficult to manage due to the sheer scale of modern digital environments.
Data Volume and "Noise": Raw logs and network captures generate massive amounts of data. Organizations must use sophisticated filtering to find the "signal" within the noise.
Storage Costs: Keeping primary data for long-term forensic analysis can become expensive, requiring a strategy for tiered storage and data aging.
Privacy and Compliance: Collecting primary data, especially from endpoints or human sources, can trigger privacy concerns and must be handled in accordance with regulations like GDPR or CCPA.
Attribution Accuracy: Without "Legal-Grade Attribution," primary data from the public web can sometimes be misattributed to the wrong organization, leading to false positives.
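As a rough illustration of the filtering and data-aging points above, the sketch below reduces raw SSH authentication log lines to the source addresses that exceed a failure threshold, and assigns a storage tier by record age. It assumes OpenSSH-style "Failed password" messages; the threshold and the tier boundaries are arbitrary example values, not recommendations.

```python
import re
from collections import Counter

# Matches OpenSSH-style failure lines, with or without "invalid user".
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def noisy_sources(lines, threshold=3):
    """Return {source_ip: failures} for sources at or above the threshold."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


def storage_tier(age_days, hot_days=30, warm_days=180):
    """Pick a storage tier from record age (example boundaries only)."""
    if age_days <= hot_days:
        return "hot"
    if age_days <= warm_days:
        return "warm"
    return "cold"
```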
Frequently Asked Questions
What is the difference between a primary and secondary source in cybersecurity?
A primary source is the raw evidence, such as a server log or a malware file. A secondary source is a report or an article written by someone else who analyzed that primary source. For example, a CISA alert about a vulnerability is a secondary source, while the actual exploit code is a primary source.
Is OSINT a primary source?
It depends on how it is gathered. If you are looking at a raw DNS record or a post on a hacker forum yourself, you are performing primary source collection. If you are reading a summary of that forum post in a weekly threat report, you are using a secondary source.
How does primary source collection help with "Shadow IT"?
Primary source collection is often the only way to find Shadow IT. By using unauthenticated external discovery to scan the internet for your organization's brand or IP ranges, you find assets that aren't in your official internal inventory.
Do I need an agent to collect primary data?
Not always. While endpoint telemetry often requires an agent, much of the most valuable primary data—such as DNS records, cloud storage configurations, and public-facing subdomains—can be collected using agentless, "outside-in" discovery methods.
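A small example of agentless collection is resolving a hostname from the outside, which needs nothing installed on the target. The sketch below uses Python's standard `socket.getaddrinfo`; the `resolver` parameter is an assumption added here so the function can be exercised without network access.

```python
import socket


def resolve_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # Name did not resolve; treat as "no addresses observed".
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```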
Why is primary data important for SEC reporting?
SEC rules adopted in 2023 require public companies to disclose material cybersecurity incidents on Form 8-K, generally within four business days of determining that an incident is material. Relying on primary sources ensures that the information provided to the board and regulators is current and technically validated, rather than based on old or third-party estimates.
Establishing Ground Truth: ThreatNG and Primary Source Collection
Primary source collection is the foundation of a proactive security strategy, ensuring that decisions are based on raw, uninterpreted data directly from the attack surface. ThreatNG provides an all-in-one platform for External Attack Surface Management (EASM), Digital Risk Protection (DRP), and Security Ratings, automating the collection and validation of this original evidence. By viewing the organization through an adversary's lens, the platform provides the "ground truth" needed to secure the modern digital estate.
External Discovery: Mapping the Raw Attack Surface
The core of primary source collection is finding every asset associated with an organization. ThreatNG uses a purely external, unauthenticated discovery engine that requires no internal agents or connectors, allowing it to find 65 percent of the digital footprint typically missed by internal inventories.
Recursive Attribute Extraction: The discovery engine starts with a primary domain and iteratively extracts associated attributes, such as IP ranges, subdomains, and cloud-hosted assets. This ensures every Fully Qualified Domain Name (FQDN) is accounted for.
Shadow IT Identification: The platform finds "unknown unknowns," such as forgotten development sites or rogue marketing portals, directly from public records and global scans.
Multi-Cloud and SaaS Discovery: ThreatNG actively hunts for unmanaged cloud storage (such as AWS S3 buckets and Azure Blob Storage containers) and unsanctioned SaaS applications used by employees, providing primary evidence of unauthorized data silos.
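The recursive extraction described above can be pictured as a breadth-first walk that starts at a seed domain and keeps following newly found attributes. This is only a sketch of the general idea, not ThreatNG's engine: the `lookup` callable stands in for whatever DNS, certificate, or registry queries return related assets, and the depth limit is an assumed safeguard.

```python
from collections import deque


def discover(seed, lookup, max_depth=3):
    """Walk outward from a seed asset, returning every asset reached.

    lookup(asset) is a caller-supplied function that returns an iterable
    of related assets (subdomains, IPs, and so on) for one asset.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        asset, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for related in lookup(asset):
            if related not in seen:
                seen.add(related)
                queue.append((related, depth + 1))
    return seen
```

The `seen` set keeps the walk from looping when records point back at each other, for example when an IP address resolves back to the seed domain.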
External Assessment: Detailed Validation of Primary Risks
ThreatNG goes beyond simple identification by conducting in-depth, automated assessments to determine whether a primary finding is truly exploitable. These technical validations are translated into objective A-F security ratings.
Subdomain Takeover Susceptibility: The platform identifies "dangling DNS" records where a CNAME points to an inactive third-party service. For example, if test.example.com points to a deleted GitHub Pages site, ThreatNG performs a specific validation check to confirm if an attacker can currently claim that resource. If successful, it provides primary proof of a hijackable domain.
Web Application Hijack Susceptibility: The system assesses subdomains for the presence of critical security headers, such as Content-Security-Policy (CSP). For example, when a production portal is missing a CSP, ThreatNG validates that nothing restricts an injected script from sending data to an external domain. A missing CSP does not create a cross-site scripting (XSS) flaw by itself, but it removes a control that would limit the impact of one.
Non-Human Identity (NHI) Exposure: This assessment quantifies the risk from high-privilege machine identities. For example, the platform identifies exposed system credentials or API keys that allow an attacker to bypass traditional authentication layers.
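Two of the checks above can be sketched from data that has already been collected: a CNAME target plus the HTTP body it serves, and a set of response headers. This is an illustrative sketch, not the platform's validation logic. The GitHub Pages marker is the message commonly cited for unclaimed sites and may change, and the list of required headers is an assumed baseline.

```python
# Hosting suffix -> text commonly served when no site is claimed (assumed).
FINGERPRINTS = {
    "github.io": "There isn't a GitHub Pages site here.",
}

# Assumed baseline of headers to look for; adjust to local policy.
REQUIRED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
)


def takeover_candidate(cname_target, http_body):
    """True if the CNAME points at a known service showing its 'unclaimed' page."""
    target = cname_target.rstrip(".").lower()
    for suffix, marker in FINGERPRINTS.items():
        on_service = target == suffix or target.endswith("." + suffix)
        if on_service and marker in http_body:
            return True
    return False


def missing_security_headers(headers):
    """Return required headers absent from a response (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]
```

A `True` result only marks a candidate. Confirming that the resource can actually be claimed is a separate step that depends on the hosting provider.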
High-Fidelity Investigation Modules
Specialized investigation modules allow security teams to perform granular forensic reconnaissance into specific primary sources, such as public code repositories and social media chatter.
Sensitive Code Exposure: This module scans public repositories, such as GitHub, for leaked secrets. A critical example is finding hardcoded database connection strings or RSA private keys accidentally committed to a public project. This is primary evidence of a "master key" leak that an attacker would use for initial access.
Social Media Investigation Module (SMIM): This module investigates the "Human Attack Surface." For instance, it identifies employees most susceptible to social engineering or monitors public forums like Reddit for chatter about internal security flaws, providing primary intelligence on adversary intent.
Technology Stack Investigation: ThreatNG uncovers nearly 4,000 unique technologies used across the attack surface. A detailed example is identifying an outdated Nginx version or a vulnerable WordPress plugin, enabling teams to prioritize remediation based on the specific software version currently running.
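The kind of leaked-secret detection described above is often approximated with pattern matching over file contents. The sketch below is a deliberately small example with two assumed patterns (PEM private key headers and database URLs that embed a password); real scanners use far larger rule sets and additional checks to cut false positives.

```python
import re

# Illustrative patterns only; not an exhaustive or production rule set.
PATTERNS = {
    "private_key": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "db_connection_string": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?)://"
        r"[^\s:@/]+:[^\s@/]+@[^\s/]+"
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```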
Intelligence Repositories: The DarCache Ecosystem
Primary source collection is supported by DarCache, a series of intelligence repositories that provide real-world context for the discovered technical findings.
DarCache Rupture: This repository stores organizational email addresses from third-party data breaches. It provides primary evidence of which accounts are most at risk of credential stuffing and account takeover.
DarCache Ransomware: This engine tracks the tactics of over 100 ransomware gangs. It allows an organization to see if its exposed ports or technologies match the preferred entry points of active adversary groups.
DarCache Vulnerability: This strategic risk engine correlates discovered technologies with the CISA Known Exploited Vulnerabilities (KEV) catalog and verified exploits, so that remediation focuses on vulnerabilities that are actively being weaponized.
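Correlating findings with a known-exploited list amounts to a set membership test followed by a sort. The sketch below assumes findings are dictionaries with `asset` and `cve` keys and that the KEV identifiers have already been loaded into a set; both shapes are assumptions for illustration, and the CVE identifiers in any sample data would be placeholders.

```python
def prioritize(findings, kev_ids):
    """Flag findings whose CVE is known-exploited and sort those first."""
    annotated = [
        dict(finding, known_exploited=finding["cve"] in kev_ids)
        for finding in findings
    ]
    # False sorts before True, so "not known_exploited" puts KEV matches first.
    return sorted(
        annotated,
        key=lambda f: (not f["known_exploited"], f["asset"]),
    )
```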
Continuous Monitoring and Strategic Reporting
Because the attack surface can change at any moment, ThreatNG provides ongoing vigilance and executive-ready reporting to ensure the security posture remains defensible.
Real-Time DarcUpdates: The platform monitors for "configuration drift" 24/7. If a new open port is detected or a security header is removed from a production site, the system issues an immediate alert based on the latest primary data.
SEC Filing Report: This capability automatically parses Form 10-K and 8-K filings to extract and benchmark an organization’s cybersecurity risk disclosures against its actual external security posture.
External GRC Assessment: Technical findings are mapped directly to compliance frameworks and regulations such as NIST CSF, ISO 27001, and GDPR. For example, an open database port maps to the "Protect" and "Detect" functions of the NIST CSF, demonstrating continuous due diligence.
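The configuration-drift alerts described above boil down to comparing the latest external snapshot with the previous one. The following is a minimal sketch under assumed data shapes: each snapshot is a dictionary holding a set of open ports and a set of security header names. It does not represent how the platform stores or compares its data.

```python
def detect_drift(previous, current):
    """Return alert strings for newly opened ports and removed headers."""
    alerts = []
    for port in sorted(current["open_ports"] - previous["open_ports"]):
        alerts.append(f"new open port: {port}")
    for header in sorted(previous["headers"] - current["headers"]):
        alerts.append(f"security header removed: {header}")
    return alerts
```

For example, a snapshot that gains port 5432 and loses `Content-Security-Policy` would produce one alert for each change, while two identical snapshots produce an empty list.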
Cooperation with Complementary Solutions
ThreatNG acts as an external intelligence layer that enhances the effectiveness of other security investments through proactive cooperation and the sharing of primary data.
Complementary Solutions for SIEM and XDR: Validated external intelligence—such as a confirmed dangling DNS record or a leaked administrative credential—is fed into a SIEM. This allows internal analysts to prioritize alerts related to those specific at-risk assets, reducing the "hidden tax" of false positives.
Complementary Solutions for SOAR: A high-priority finding, such as an active phishing domain, can trigger an automated SOAR playbook to block the domain’s IP address at the firewall and simultaneously alert the brand protection team to initiate a legal takedown.
Complementary Solutions for CASB and IAM: When the SaaSqwatch module identifies an unsanctioned cloud application, this primary data is used by a CASB to enforce security policies. If an admin credential appears in the DarCache Rupture database, a complementary IAM solution can automatically force a password reset.
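Handing a validated finding to a SIEM or SOAR tool typically means serializing it as a structured event. The sketch below shows one way to do that with JSON; the field names and the `source` value are hypothetical and would need to match whatever schema the receiving system expects.

```python
import json
from datetime import datetime, timezone


def to_siem_event(asset, finding_type, severity, observed_at=None):
    """Serialize one external finding as a JSON event (hypothetical schema)."""
    observed_at = observed_at or datetime.now(timezone.utc)
    event = {
        "timestamp": observed_at.isoformat(),
        "source": "external-discovery",
        "asset": asset,
        "finding_type": finding_type,
        "severity": severity,
    }
    # sort_keys gives stable output, which simplifies downstream diffing.
    return json.dumps(event, sort_keys=True)
```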
Common Questions About Primary Source Collection
How does ThreatNG find assets without internal agents?
The platform uses a purely external, unauthenticated discovery process that mimics an attacker's reconnaissance steps. It scans public records, domain registries, and open cloud buckets to find every host associated with an organization.
Why is primary source data better than a third-party report?
Third-party reports often use "repackaged" or old data that may lack context. ThreatNG collects raw, original evidence directly from your specific attack surface, ensuring that the information is current, accurate, and relevant to your actual environment.
What is the benefit of DarChain Attack Path Modeling?
DarChain transforms isolated technical vulnerabilities into a narrative. Instead of a flat list of bugs, it visually illustrates how an attacker could chain an abandoned subdomain to a leaked API key to gain initial access to mission-critical systems.
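Chaining findings into a path is, in general terms, a graph search problem. The sketch below is not DarChain's implementation. It is a generic breadth-first search over an assumed dictionary of "this finding enables that one" edges, returning the shortest chain from an entry point to a target.

```python
from collections import deque


def attack_path(edges, start, target):
    """Return the shortest list of steps from start to target, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in edges.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None
```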
Can ThreatNG assist with SEC reporting mandates?
Yes. The platform correlates primary technical data with public disclosures in SEC filings. This ensures that the cybersecurity narrative provided to the board and regulators is technically validated and accurate.

