Most security tools weren’t designed for the scale or complexity of cloud investigations. Mitiga’s Forensic Data Lake was.
It collects and retains over 1,000 days of cloud, SaaS, and identity logs without agents, without SIEM dependencies, and without moving your data out of region. It also enriches those logs with configuration snapshots to keep context accurate, even as environments change.
In this walkthrough, Mitiga Field CISO Brian Contos explains how the system works, what types of logs are collected, and why object-level logging in services like AWS S3 is essential for understanding attack flow and breach impact.
Watch the video or read the transcript below.
Transcript
Welcome back to Mitiga Minutes.
Let’s talk about the Mitiga Forensic Data Lake. First, it’s a completely distributed data lake where we aggregate, enrich, and store data. It’s highly scalable and extensible, and the architecture is built on top of Databricks. We’ve optimized it for about a thousand days of storage per customer.
Our data is collected agentlessly, meaning we use APIs and webhooks across cloud, identity, and SaaS. We do integrate with solutions that have their own agents, such as EDRs used for workload monitoring, but we don’t require any agents ourselves.
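To make the agentless model concrete, here’s a minimal sketch of pulling recent AWS activity purely over an API, with no agent on any workload. This is only an illustration using boto3 and CloudTrail’s LookupEvents call, not Mitiga’s collector.

```python
# Illustrative only: an agentless pull of recent AWS activity via the
# CloudTrail API. This is a sketch, not Mitiga's collection pipeline.
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")  # credentials come from the environment

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# LookupEvents is paginated; walk the last hour of management events.
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(StartTime=start, EndTime=end):
    for event in page["Events"]:
        # Each event carries who did what, when, and from where.
        print(event["EventTime"], event["EventName"], event.get("Username", "n/a"))
```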
Also, we have no SIEM dependencies for collection, storage, or analysis. Plus, you keep your data where it is—it’s in your regions. You don’t need to worry about regulatory issues related to moving data to different regions or the egress costs of moving data outside your environment.
Our advanced analytics are applied to your data using a combination of AI and human intelligence gleaned from our incident response teams. This includes data correlation, thousands of purpose-built breach detections, anomaly detection across both human and non-human behaviors, volumetric and temporal analysis, and pattern discovery—along with fully integrated threat hunting and a ton of other capabilities covered in different videos.
You can see a bunch of the adapters we have in this current environment. But if I want to add a new adapter, it’s as simple as clicking “Add Adapter.” Let me search for a few here—for example, maybe I want to add Dropbox. Or maybe I’d like to add Wiz. And maybe a SIEM like Splunk. It’s that easy. Search for it, select it, install it.
Now let’s talk a little about some of the adapters, like AWS. We’ve collected 1.6 terabytes since it was connected to Mitiga. Azure: 557.7 gigabytes. Azure Active Directory: 60.8 gigabytes. Box: just under 500 megabytes. Chronicle: we haven’t received anything from it yet.
Let’s swing back over to AWS. It says we have seven resources and twelve data flows. What’s that all about?
Let’s go into Cloud Resources. This is a collection of cloud resources gathered across all your accounts. What’s really cool is that it’s completely automatic. As your environment changes, Mitiga automatically detects new accounts and onboards them, so you always have the data you need.
Let’s search our cloud resources for AWS. We pull up our seven rows. Within those rows, we can see different areas where the information is coming from—staging, sandbox, and production.
Now let’s move from Cloud Resources to Data Flows. Within Data Flows, I can look at a few different areas: just AWS, Office 365, Azure, a combination of those flows—or in this case, I’ll just stick with the twelve AWS data flows.
I want to dive into the various log sources we’re getting from AWS. Looking from the bottom up, we have Cloud Discovery Services. This is Mitiga’s proprietary configuration collection. Before we even collect logs, we take ongoing snapshots of your configuration.
This allows us to contextualize all the rest of the logs, and we keep these snapshots because configurations are constantly changing. Context must be relevant to the time of the log. If you try to contextualize logs from a year ago—or even a few days ago—based on today’s configuration, it would be a useless mess. It’s kind of like using a GPS to navigate Las Vegas today based on your location details from Los Angeles last week.
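As an illustration of the idea (this is not Mitiga’s Cloud Discovery Services, just a sketch using boto3), a point-in-time configuration snapshot could be as simple as capturing the resources that matter along with a timestamp:

```python
# Sketch of a point-in-time configuration snapshot. Illustrative only;
# Mitiga's Cloud Discovery Services is its own proprietary mechanism.
import json
from datetime import datetime, timezone

import boto3

snapshot = {
    "captured_at": datetime.now(timezone.utc).isoformat(),
    # Security groups as they exist right now
    "security_groups": boto3.client("ec2").describe_security_groups()["SecurityGroups"],
    # IAM roles as they exist right now
    "iam_roles": boto3.client("iam").list_roles()["Roles"],
}

# Persist the snapshot so logs from this moment can later be interpreted
# against the configuration that was actually in effect at the time.
with open(f"config-snapshot-{snapshot['captured_at']}.json", "w") as fh:
    json.dump(snapshot, fh, default=str)
```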
Next, we’ve got CloudTrail logs. These are basic baseline logs. Most folks working in cloud security are aware of CloudTrail and are probably collecting it. CloudTrail contains AWS activity and very basic security logs—things like API calls, caller, time of call, source IP, request parameters, response elements, and user actions from the AWS Management Console, SDKs, or command-line tools.
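For reference, here’s a trimmed, representative CloudTrail record showing those fields; all of the values below are made up.

```python
# A trimmed, representative CloudTrail record (all values are made up).
cloudtrail_event = {
    "eventTime": "2024-05-01T12:34:56Z",        # time of the call
    "eventSource": "s3.amazonaws.com",          # service that was called
    "eventName": "ListBuckets",                 # the API call
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.10",          # where the call came from
    "userAgent": "aws-cli/2.15.0",              # console, SDK, or CLI
    "userIdentity": {                           # the caller
        "type": "IAMUser",
        "userName": "example-user",
    },
    "requestParameters": None,                  # inputs to the call
    "responseElements": None,                   # what AWS returned
}
```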
While important, CloudTrail on its own doesn’t give you much in the way of advanced activity detail, such as access-level logs, which you 100% need to detect and investigate access.
Next are DNS logs. These are Route 53 Resolver query logs (Route 53 in AWS speak, because DNS runs on port 53): DNS queries originating from within your Amazon Virtual Private Cloud, or VPC.
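Here’s roughly what a single Resolver query log record looks like; the values are made up, and the field names are approximate rather than an exact copy of the AWS schema.

```python
# A representative Resolver query log record (values made up; field names
# are approximate and may differ slightly from the exact AWS schema).
resolver_query = {
    "query_timestamp": "2024-05-01T12:35:02Z",
    "vpc_id": "vpc-0abc123",            # the VPC the query originated from
    "srcaddr": "10.0.1.25",             # instance that made the query
    "srcport": "54123",
    "query_name": "suspicious-c2.example.com.",
    "query_type": "A",
    "rcode": "NOERROR",                 # DNS response code
    "answers": [{"Rdata": "198.51.100.7", "Type": "A", "Class": "IN"}],
}
```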
A couple of other log flavors here: EKS, as in Elastic Kubernetes Service, which tracks your Kubernetes control plane and cluster activity. Then we’ve got VPC Flow Logs: all the IP traffic entering and exiting your VPC.
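A VPC Flow Log record in the default format is a single space-separated line; this small sketch splits one made-up record into named fields.

```python
# A single VPC Flow Log record in the default format (values made up),
# split into its named fields.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

record = "2 123456789012 eni-0a1b2c3d 10.0.1.25 198.51.100.7 49152 443 6 10 8420 1714566900 1714566960 ACCEPT OK"

flow = dict(zip(FIELDS, record.split()))
print(flow["srcaddr"], "->", flow["dstaddr"], flow["action"])  # 10.0.1.25 -> 198.51.100.7 ACCEPT
```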
The last one I want to hit on is S3 object-level logs. These record object-level API actions like GetObject, PutObject, and DeleteObject within an S3 bucket. These are really important.
Let’s go a little deeper on this one. First of all, why do we want to collect object logs at all? Well, it’s the only way to see what a threat actor actually accessed. Without S3 object-level logs, if you suspect a breach, you have zero visibility into its actual impact.
Consider extortion. If you have an extortion incident, you have no way to know if the attacker really has all your data, some of your data, or none at all. And sometimes, they lie when they’re extorting you. You need evidence-based results—not assumption-based.
S3 object-level logs are also useful for detecting attacks in the early stages, as you’ll have visibility into suspicious activity.
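Note that S3 object-level (data event) logging isn’t enabled by default. Assuming an existing trail, turning it on for a bucket with boto3 looks roughly like this; the trail and bucket names are placeholders.

```python
# Sketch: enable S3 object-level (data event) logging on an existing
# CloudTrail trail. Trail and bucket names are placeholders.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.put_event_selectors(
    TrailName="my-trail",  # placeholder
    EventSelectors=[
        {
            "ReadWriteType": "All",            # log GetObject, PutObject, DeleteObject, ...
            "IncludeManagementEvents": True,   # keep the baseline CloudTrail events too
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # Trailing slash scopes this to all objects in the bucket.
                    "Values": ["arn:aws:s3:::my-sensitive-bucket/"],
                }
            ],
        }
    ],
)
```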
Now, that was a lot of information about logs, and that’s just some of the AWS logs Mitiga stores and analyzes. The list also includes metadata, GuardDuty, and Elastic Load Balancing, not to mention logs from GCP like activity audit logs, Azure Blob Storage logs, identity, and SaaS. You get the idea.
There are a ton of configuration, log, and supporting data sources, types, and formats that are all collected, stored, enriched, and analyzed by Mitiga in the distributed Forensic Data Lake.
For more information, check out mitiga.io and request a demo.