Data for
AI
Security

High-quality labeled datasets, adversarial samples, and structured threat intelligence — purpose-built for teams training the next generation of AI security models.

50M+
Labeled Samples
12
Threat Categories
99.4%
Label Accuracy

AI security
teams are
data-starved

Building threat detection models requires massive amounts of labeled security data — the kind that's nearly impossible to collect, clean, and annotate at scale without specialized expertise.

Most teams resort to synthetic data, small academic benchmarks, or proprietary silos that don't generalize. ShieldSet solves this.

01
Scarce Ground Truth
Real-world malware, phishing, and intrusion samples are rare and fragmented across dozens of siloed repositories.
02
Poor Label Quality
Academic benchmarks contain errors, bias, and stale threat signatures that degrade model performance in production.
03
Adversarial Gaps
Models trained without adversarial samples fail the moment threat actors adapt. Defense requires knowing how attackers think.

What
We Offer

03 Categories

How
It
Works

From raw threat telemetry to model-ready datasets — ShieldSet handles the entire pipeline so your team can focus on building, not wrangling data.

01
Ingest & Normalize

We continuously collect threat telemetry from honeypots, dark web monitoring, partner feeds, and proprietary sensors, then normalize everything into a consistent schema.

STIX/TAXII · JSON · Parquet
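As a rough illustration of what normalizing into a consistent schema can look like, the sketch below maps records from two differently shaped feeds onto one record type. The `ThreatRecord` fields and both input layouts are assumptions made up for this example; they are not ShieldSet's actual schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ThreatRecord:
    """Hypothetical unified schema; field names are illustrative only."""
    source: str
    indicator: str
    category: str
    observed_at: str  # ISO 8601 timestamp in UTC


def from_honeypot(raw: dict) -> ThreatRecord:
    # Assumed honeypot layout: epoch seconds in "ts", attacker IP in "src_ip".
    observed = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    return ThreatRecord("honeypot", raw["src_ip"], raw["kind"].lower(), observed.isoformat())


def from_partner_feed(raw: dict) -> ThreatRecord:
    # Assumed partner layout: ISO timestamp in "seen", indicator value in "ioc".
    observed = datetime.fromisoformat(raw["seen"]).astimezone(timezone.utc)
    return ThreatRecord("partner", raw["ioc"], raw["label"].lower(), observed.isoformat())


records = [
    from_honeypot({"ts": 1700000000, "src_ip": "203.0.113.7", "kind": "Intrusion"}),
    from_partner_feed({"seen": "2023-11-14T22:13:20+00:00", "ioc": "evil.example", "label": "Phishing"}),
]
for record in records:
    print(asdict(record))
```

Both inputs end up with the same field names, lowercase categories, and UTC timestamps, which is the property downstream training code usually relies on.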
02
Expert Annotation

Every sample is reviewed by a team of certified security researchers. Multi-pass labeling with adversarial disagreement resolution ensures >99.4% accuracy.

MITRE ATT&CK · CVE · Multi-label
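One way to picture multi-pass labeling with disagreement resolution is a majority vote that escalates samples when annotators disagree too much. The sketch below is a generic illustration, not ShieldSet's documented procedure; the function name and the two-thirds threshold are assumptions.

```python
from collections import Counter
from typing import List, Optional


def resolve_label(votes: List[str], min_agreement: float = 2 / 3) -> Optional[str]:
    """Return the majority label, or None when agreement is too low.

    None signals that the sample should go to another review pass.
    The 2/3 threshold is an illustrative assumption.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


print(resolve_label(["ransomware", "ransomware", "trojan"]))  # ransomware
print(resolve_label(["ransomware", "trojan", "worm"]))        # None
```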
03
Versioned Delivery

Datasets are versioned, checksummed, and delivered via API or direct download. Every update includes a diff log so your pipelines stay reproducible.

REST API · S3 · Versioned
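A checksummed download can be verified locally before it enters a training pipeline. The sketch below assumes the published checksum is a SHA-256 hex digest, which the text above does not specify; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_hex: str) -> bool:
    # Assumption: the expected value is a SHA-256 hex digest.
    # Normalize case and whitespace because published digests vary.
    return sha256_of(path) == expected_hex.strip().lower()
```

A pipeline would typically call `verify_download` right after fetching a file and refuse to continue when it returns `False`.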
04
Continuous Updates

The threat landscape never sleeps. ShieldSet datasets are refreshed with new samples weekly — your models stay current without rebuilding from scratch.

Weekly Refresh · Delta Feeds
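To show how a delta feed can keep a local copy current without a full re-download, here is a minimal sketch. The delta layout (`removed` and `upserted` keys) is an assumption for illustration; the actual delta format is not described here.

```python
def apply_delta(snapshot: dict, delta: dict) -> dict:
    """Return a new snapshot with a delta applied; the input is not mutated.

    Assumed delta layout: {"removed": [ids], "upserted": {id: record}}.
    """
    removed = set(delta.get("removed", []))
    updated = {key: value for key, value in snapshot.items() if key not in removed}
    updated.update(delta.get("upserted", {}))
    return updated


current = {"a": 1, "b": 2}
weekly_delta = {"removed": ["a"], "upserted": {"b": 3, "c": 4}}
print(apply_delta(current, weekly_delta))  # {'b': 3, 'c': 4}
```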

Threat
Categories
Covered

01
Malware Classification
Ransomware, trojans, worms, rootkits — PE binaries with dynamic + static features.
8.2M samples
02
Phishing Detection
URLs, email headers, HTML content, and screenshot features for phishing identification.
12.4M samples
03
Intrusion Detection
Network flows and host logs from simulated and real-world intrusion events.
6.1M events
04
Vulnerability Intel
CVE-linked exploit samples, patch diffs, and severity context at machine scale.
3.8M records
05
Social Engineering
Spearphishing lures, pretexting scripts, and vishing transcripts with intent labels.
2.1M samples
06
Supply Chain
Dependency confusion, typosquatting, and package tampering indicators.
940K packages
07
C2 Traffic
Labeled command-and-control communication patterns across known RAT families.
4.4M flows
08
Threat Actor TTPs
MITRE ATT&CK-aligned TTP chains mapped to nation-state and eCrime groups.
6,200 groups

Cybersecurity
Datasets

Expert-labeled, production-ready datasets for AI security teams. Log in to view full metadata and download options.

Malware &
Threat
Detection

Comprehensive labeled datasets for training malware detection and classification models — sourced from active honeypots, dark web monitoring, and proprietary sensors, continuously refreshed.

Malware Classification
PE Binary Analysis
Ransomware, trojans, worms, rootkits with dynamic and static feature extraction. 8.2M labeled PE binaries with behavioral tags.
8.2M samples
Adversarial Samples
Evasion-Tested Examples
Adversarial examples crafted to bypass production detection systems, enabling robust model training against adaptive threat actors.
22M+ samples
MITRE ATT&CK Mapped
TTP-Aligned Intelligence
Every record mapped to MITRE ATT&CK tactics and techniques, enabling precise model training for structured threat detection.
6,200 groups
Threat Intelligence
APT & IOC Feeds
Structured intelligence across APT campaigns, indicators of compromise, and nation-state TTP chains — refreshed weekly.
18M+ records

Network
Intrusion
& Traffic

Normalized pcap-derived features, labeled flow records, and C2 traffic patterns for network anomaly detection models — covering lateral movement, exfiltration, and C2 beaconing.

Network Intrusion
Labeled Intrusion Events
Network flows and host logs from real and simulated intrusion events covering lateral movement, privilege escalation, and data exfiltration.
6.1M events
C2 Traffic Analysis
Command & Control Patterns
Labeled C2 communication flows across known RAT families — enabling detection of beaconing, tunneling, and encrypted C2 channels.
4.4M flows

Phishing
& Fraud
Data

Multi-modal phishing datasets spanning URL analysis, email headers, HTML content, and screenshot features — plus social engineering lures for spearphishing and BEC campaigns.

Phishing Detection
Multi-Modal Phishing Data
URLs, email headers, raw HTML, and rendered screenshot features — enabling multi-modal phishing classifiers with 12.4M labeled samples.
12.4M samples
Social Engineering
Spearphishing & Pretexting
Spearphishing lures, pretexting scripts, and vishing transcripts with fine-grained intent labels for social engineering detection models.
2.1M samples

Vulnerability
& Exploit
Data

CVE-linked exploit samples, patch diffs, and severity context at machine scale — enabling AI models that predict exploitability, prioritize patching, and detect active exploitation.

Vulnerability Intel
CVE-Linked Exploit Data
3.8M CVE-linked records with exploit proof-of-concept samples, patch diffs, and CVSS-enriched severity context for vulnerability prioritization models.
3.8M records

Supply
Chain
Risk Data

Indicators of supply chain compromise spanning dependency confusion attacks, typosquatting packages, and package tampering events — enabling AI-driven software supply chain security.

Supply Chain
Package Risk Intelligence
Dependency confusion, typosquatting, and package tampering indicators across 940K malicious or suspicious packages — tagged by attack type and severity.
940K packages

Built for
Teams
Moving Fast

You're building the next generation of security tooling — but acquiring and labeling threat data shouldn't eat your runway. ShieldSet gives early-stage security companies immediate access to production-grade datasets so you can focus on building your product, not your data pipeline.

Ship faster
Skip months of collection
Get API access in minutes and integrate labeled threat data directly into your training pipeline. Skip the collection, cleaning, and annotation work entirely.
Stay lean
Pay-as-you-go pricing
Pay-as-you-go and Growth plans scale with your usage. No enterprise contracts, no procurement headaches, no minimums.
Stay current
Weekly dataset refreshes
Weekly dataset refreshes mean your models train on the latest threat signatures — without rebuilding your data infrastructure from scratch.

Rigorous Data
for Serious
Research

Academic benchmarks in cybersecurity are often outdated, small-scale, or contaminated with labeling errors. ShieldSet provides research labs with large-scale, reproducible datasets that reflect the current threat landscape.

Reproducibility
Versioned & checksummed
Every dataset is versioned and checksummed. Cite a specific version and other researchers can reproduce your exact experimental setup.
Label Quality
>99.4% accuracy
>99.4% label accuracy across all datasets — validated by certified security researchers with multi-pass review and disagreement resolution.
Documentation
Publication-ready
Datasets come with full schema documentation, label taxonomies, and provenance metadata — everything you need for a rigorous methods section.
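For a methods section, it can help to record exactly which dataset version an experiment used. The sketch below writes a small manifest; the field names and the version string are assumptions, and only the dataset ID comes from the Quick Start example.

```python
import json


def build_manifest(dataset_id: str, version: str, sha256_hex: str) -> str:
    """Serialize the dataset pin for an experiment as stable, sorted JSON."""
    manifest = {
        "dataset_id": dataset_id,
        "version": version,    # hypothetical version string
        "sha256": sha256_hex,  # assumes a SHA-256 checksum is published
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


print(build_manifest("malware-classification-v3", "3.2.0", "ab" * 32))
```

Committing this file next to the experiment code lets other researchers fetch the same version and confirm the checksum before rerunning.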

Enterprise-
Grade Data
at Scale

Large security platforms need data that scales with their operations — custom schemas, high-frequency updates, SLA guarantees, and dedicated support. ShieldSet Enterprise is purpose-built for exactly this.

Customization
Custom datasets
Don't see exactly what you need? Our team builds custom datasets to your specification — mapped to your threat model and label taxonomy.
Freshness
15-minute updates
Enterprise customers receive data refreshes every 15 minutes — keeping detection models current against rapidly evolving threats.
Reliability
SLA guarantee
Uptime and delivery SLAs backed by contract. Your data pipelines are mission-critical — we treat them that way.
Support
Dedicated team
A named support team that knows your use case — not a ticket queue. A direct line to people who understand your data architecture.

Data for
National
Security

Government agencies and defense contractors require data that meets strict provenance, compliance, and operational security requirements. ShieldSet works with public sector teams to deliver structured threat intelligence that meets federal standards.

Compliance
Provenance documentation
Full chain-of-custody documentation for every dataset — supporting compliance and audit requirements in regulated environments.
Interoperability
STIX/TAXII compatible
Data delivered in STIX/TAXII formats compatible with existing government threat intelligence platforms and SIEM integrations.
Delivery
Custom delivery options
Air-gapped delivery, on-premise licensing, and custom ingestion pipelines available for sensitive operational environments.
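For readers unfamiliar with STIX, the sketch below builds a generic STIX 2.1 indicator object as plain JSON. It illustrates the standard's required indicator properties only; the exact shape of ShieldSet's STIX output is not specified here, and the domain and name values are placeholders.

```python
import json
import uuid
from datetime import datetime, timezone


def make_indicator(domain: str, name: str) -> dict:
    """Build a minimal STIX 2.1 indicator for a domain name."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    # Escape backslashes and single quotes for the STIX pattern string literal.
    escaped = domain.replace("\\", "\\\\").replace("'", "\\'")
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "name": name,
        "pattern": f"[domain-name:value = '{escaped}']",
        "pattern_type": "stix",
        "valid_from": now,
    }


print(json.dumps(make_indicator("evil.example", "Placeholder phishing domain"), indent=2))
```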

Quick
Start

Get your first dataset in under 5 minutes. No setup required — just an API key and a few lines of code.

Step 1 — Get your API key
Sign up for a free account and copy your API key from the dashboard. No credit card required.
Step 2 — Install the SDK
pip install shieldset
Step 3 — Pull your first dataset
import shieldset as ss

# Initialize with your API key
client = ss.Client(api_key="your_api_key_here")

# List available datasets
datasets = client.datasets.list()

# Pull a dataset
df = client.datasets.pull(
    dataset_id="malware-classification-v3",
    format="parquet",
    limit=10000
)

print(df.head())
REST API — curl
curl -X GET https://api.shieldset.com/v1/datasets \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
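The same request can be made from Python's standard library without the SDK. This is a sketch under assumptions: the `SHIELDSET_API_KEY` environment variable name is hypothetical, and the response body is returned as raw bytes because its format is not documented here.

```python
import os
import urllib.request

API_URL = "https://api.shieldset.com/v1/datasets"  # endpoint from the curl example


def build_list_request(api_key: str) -> urllib.request.Request:
    """Build the GET request with the Bearer token header."""
    return urllib.request.Request(
        API_URL,
        method="GET",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def list_datasets_raw() -> bytes:
    # Read the key from the environment instead of hardcoding it in source.
    api_key = os.environ["SHIELDSET_API_KEY"]  # hypothetical variable name
    with urllib.request.urlopen(build_list_request(api_key), timeout=30) as response:
        return response.read()
```

Calling `list_datasets_raw()` performs the network request; `build_list_request` alone is side-effect free and convenient for unit tests.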

Documentation

Complete reference for the ShieldSet REST API and Python SDK.

Authentication

All API requests must include a valid API key passed as a Bearer token in the Authorization header.

Authorization: Bearer ss_live_xxxxxxxxxxxxxxxx

Generate and manage API keys from your dashboard. Keys are prefixed by plan: ss_free_, ss_live_, or ss_ent_.

Never expose your API key in client-side code or public repositories. Rotate compromised keys immediately from your dashboard.

Key types

Prefix | Plan | Permissions
ss_free_ | Free | Read (limited to 5 pulls, 1 dataset)
ss_live_ | Pay As You Go / Growth | Read (all datasets)
ss_ent_ | Enterprise | Read + custom dataset access + SLA
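Because keys are prefixed by plan, client code can infer the plan before making a request. The helper below is a hypothetical convenience, not part of the ShieldSet SDK; only the three prefixes and plan names come from the table above.

```python
PLAN_BY_PREFIX = {
    "ss_free_": "Free",
    "ss_live_": "Pay As You Go / Growth",
    "ss_ent_": "Enterprise",
}


def plan_for_key(api_key: str) -> str:
    """Map an API key to its plan name using the documented prefixes."""
    for prefix, plan in PLAN_BY_PREFIX.items():
        if api_key.startswith(prefix):
            return plan
    # Keep the key itself out of the error message so it cannot leak into logs.
    raise ValueError("unrecognized API key prefix")


print(plan_for_key("ss_live_xxxxxxxxxxxxxxxx"))  # Pay As You Go / Growth
```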

Simple,
Transparent
Pricing

Start free, scale when you're ready. No surprises, no lock-in.

Free
$0
Forever
Instant API access, no credit card required. Perfect for exploring our datasets before committing.
  • 5 dataset pulls
  • Access to 1 dataset
  • Instant API key on signup
  • Community support
Pay As You Go
$10
Per dataset pull
Only pay for what you use. No monthly commitment. Ideal for variable or unpredictable data needs.
  • Unlimited dataset pulls
  • Access to all datasets
  • Instant API key
  • Email support
Enterprise
Custom
Pricing
Purpose-built data solutions for large security platforms. Custom datasets, volume pricing, and dedicated support.
  • Unlimited dataset pulls
  • Custom datasets to spec
  • 15-minute data updates
  • SLA guarantee
  • Dedicated support

All plans include access to ShieldSet's high-quality labeled cybersecurity datasets, adversarial samples, and structured threat intelligence — purpose-built for AI security teams.


Get
In
Touch

Whether you're evaluating ShieldSet for your team or ready to get started, we'd love to hear from you. We respond to all serious inquiries within one business day.

General
hello@shieldset.com
General inquiries, partnerships, and media.
Sales
sales@shieldset.com
Enterprise pricing and custom dataset discussions.
Support
support@shieldset.com
Technical support for API and dataset access.
Privacy
privacy@shieldset.com
Data privacy and compliance inquiries.

We Make
Threat Data
Work Harder

ShieldSet is a remote-first data engineering company on a mission to eliminate the data bottleneck holding back AI security teams. We continuously collect, annotate, and deliver production-grade cybersecurity datasets — so the teams building the next generation of threat detection can spend their time building, not wrangling data.

We're a small team with deep roots in both data engineering and offensive security. We've seen firsthand how poor data quality kills model performance in production, and we built ShieldSet to fix that — with expert annotation, rigorous versioning, and a relentless focus on keeping datasets current.

The definitive source for cybersecurity training data.

We exist to give every AI security team access to the reliable, up-to-date data they need to build models that work in production. We take data quality personally, and are proud to be the platform the industry turns to for cybersecurity AI data.

Remote First
Outcomes, not offices
We believe great work happens when talented people have autonomy and flexibility. We hire for excellence and trust our team to deliver — regardless of where they're based. What matters is the work.
Learning & Growth
Always getting sharper
The threat landscape evolves constantly, and so do we. We invest in continuous learning, share knowledge openly across the team, and genuinely believe our collective expertise is our biggest competitive advantage.
Inclusion
Small team, mighty culture
We're a diverse team that treats everyone with respect and takes inclusion seriously — not as a policy, but as a foundation for building something we're all proud of. Every voice here matters.
Health coverage
Health, dental, and vision insurance for you and your dependents.
Flexible time off
Paid time off, sick leave, and company holidays, and we encourage you to actually use them.
Remote-first
Fully remote roles. Work from wherever you do your best work.
Performance bonus
Bonus plan tied to individual and company goals — we share in the wins together.
ShieldSet swag
Quality gear for the team. Because working in security should feel like it.
Sales Engineer
Remote · Full-time · $120k–$160k

Join the
ShieldSet
Team

We're a remote-first company hiring people who care deeply about data quality, security, and building things that work. If that's you, we want to hear from you.

Don't see your role?
We're always open to great people
Send us your resume and tell us how you'd contribute.

Insights
from the
Team

Perspectives on AI security, data quality, and the evolving threat landscape — written by the people building ShieldSet.

Let's Build
Something
Together

ShieldSet partners with organizations at the intersection of cybersecurity and data. Whether you produce threat intelligence that complements our datasets, or you're a security platform looking to integrate our data, we want to talk.

Data Providers
Sell your data through ShieldSet
If you produce proprietary threat telemetry, malware samples, or network intelligence, we can help you structure, annotate, and distribute it to AI teams — with revenue sharing that works for you.
Technology Partners
Integrate ShieldSet into your platform
Security platforms, SOAR vendors, and AI tooling providers can integrate ShieldSet data directly via API or white-label licensing. We offer flexible arrangements for platform-level partnerships.
Research Partners
Collaborate on the hardest problems
We partner with research institutions and national labs on data collection, annotation methodology, and dataset benchmarking initiatives. Joint publications and dataset co-creation welcome.
Tell us about your organization and what kind of partnership you have in mind. We respond to all serious inquiries within two business days.

Privacy
Policy

Last updated: June 1, 2025

Terms of
Service

Last updated: June 1, 2025


Apply for
Role

Fill out the form below. We review every application carefully and respond within 5 business days.