Build a fast, free, and effective Threat Hunting/Incident Response Console with Windows Event Forwarding and PowerBI
Monitoring your network and gathering massive amounts of data has become easier and easier. Many guides exist on how to gather data, and lots of companies have "enterprise grade" Security Information and Event Management products that can ingest terabytes of data. But what seems to be missing from most environments is the ability to apply context to the data they get, or a knowledge of why certain artifacts that get gathered are important.
Doing Incident Response as a consultant gives you a unique perspective on monitoring, as you get to experience a wide variety of networks in varying stages of overall "health" during what is the worst possible time for their owners. Things that can normally slide under the radar, like slightly unhealthy patch management systems or systems that have operational issues with monitoring agents, suddenly become insurmountable obstacles to diagnosing and solving the security issue on the network. Another common finding is that while the company may be collecting data from servers and Domain Controllers, due to licensing costs they are not getting data from desktop endpoints, which are often both the entry point in an intrusion and where a lot of the data access/exfiltration and lateral movement story happens. These issues are universal in companies big and small, with massive or minimal investments in their IT departments, and often aren't revealed until an incident occurs. This leaves the IT and Incident Response teams in a position where they have to correct visibility issues when their infrastructure may not be working optimally, and rapidly triage the data from potentially hundreds of thousands of endpoints while learning how to apply context to the events they are seeing.
Learning from these incidents, and the requirements inherent to them (ability to deploy tools and get data rapidly, use only built in tools, has to be usable and deployable by people who probably haven't slept in a week) I developed an Incident Response dashboard that I liked so much I personally used it to "hunt" on all the engagements in the later part of my Incident Response Consultant tenure. Many of the customers liked it so much that they have kept it in their environments to use for proactive threat hunting and log analysis. Using the built in Windows Event Forwarding components of Windows, some PowerShell scripts, and PowerBI desktop, you can create a fast, free, and effective console for diagnosing problems and finding Indicators of Attack in your network. My hope is to make security investigation much more accessible to everyone by showing how to do this quickly and without a massive monetary investment.
We'll go into detail on how to deploy it and what to look for below, but here's a preview of what you get at the end, to hopefully convince you to finish reading and deploy.
Console with PowerBI data slicers of the collected events:
Rapid ability to zoom in on selected indicators and remove other data:
If you've been lucky enough to not yet have to work on a major Incident Response, I can't stress enough how valuable it is to be able to remove non-relevant data and quickly get to which machines require deeper level analysis. If you're working an incident where the adversary is still live on the network and you are fighting the "Attacker's Time to Goal" metric, or just trying to get the issue as fully investigated as possible so you can move to an enlightened remediation effort, being able to start with a very wide net of key indicators can save days of investigation time - especially if the incident is on a globally distributed network where getting "hands on keyboards" in a remote office may be difficult.
Event logs may be older technology, and the world certainly has more advanced ways to analyze a system, but logs still remain the first place most investigators look on a system, and one many organizations still don't have fully under control. A great example of this is the NotPetya malware outbreak. NotPetya caused massive amounts of economic damage and downtime to companies, but it also used the incredibly low tech old trick of running wevtutil cl security to clear event logs. This means that the computers that were about to go offline forever would have been sending a very strong signal, in the form of an Event ID 1102 Event Log Cleared event, to any central logging that there was something going on that needed more investigation. Even when we talk about "advanced" and "fileless" malware, there are event log traces - some of the most advanced fileless malware out there still leaves a persistence point in the form of a new service or scheduled task, which is going to leave us a log entry to investigate. Advanced attackers can bypass or tamper with logs, but that doesn't absolve you from monitoring them. Learning what to look for and making sure you have common techniques well monitored and practiced with your Red Team is like leveling up in a game or a martial arts rank - you still have to practice the fundamentals. A trend I see all too often is that attackers will just go ahead and be noisy because they know nobody is watching, or that it will take long enough to start a response effort that they will have already finished their task. Using the fact that Event Logs are ubiquitous and will be encountered by the attacker at every stage of their operation can potentially allow us to (along with some other hardening techniques) use the entire network as a giant sensor. The attacker will have to make a choice between leaving logs or creating a signal by removing or editing them, which can give network defenders hope in a battle that often seems stacked against them.
Setup and Configuration
This solution got named WEFFLES (Windows Event Logging Forensic Logging Enhancement Services) when I first created it and (perhaps unfortunately) the name has stuck, so we'll be using that term to refer to it in the rest of the post. WEFFLES is designed to be small and lightweight, both for speed of getting something deployed during an Incident Response and also for the sake of being sustainable in an environment going forward. It's not necessary to be familiar with the underlying technology of Windows Event Forwarding to set up the solution, as it's scripted out for you, but if you would like to learn more about it before starting you can read my "Monitoring What Matters" post or watch my Microsoft Virtual Academy session on the topic.
Requirements for deploying WEFFLES:
- Active Directory - we need to be able to create and link a GPO that will apply to all of the machines we want in scope of monitoring. I would hope this would include desktops, servers, and domain controllers for the sake of completeness, but the flexibility to link the GPO that enables Windows Event Forwarding to a testing Organizational Unit is also a great way to start.
- A server to act as the Windows Event Collector - I recommend using a dedicated server as the collector, for performance and security reasons. The server does not have to be massive in spec though, even if you have a lot of endpoints you plan to have checking in to it. The log data should not go over 10GB for even large organizations (500k endpoints is my biggest WEFFLES deployment so far) and the solution exports data to CSV files for safekeeping, which are quite small. The main performance need on a collector is memory to hold the log files. We scope the size of the event log as 1GB, as it acts only as a holding place before the events get exported to CSV in this solution. If you want a larger event log, the general rule of thumb is that you need: the amount of memory required to run Windows and do things like backups + the specified event log size.
- PowerBI Desktop - The console/data slicer itself is built using PowerBI Desktop. If you'd rather use another data slicer or the most widely used incident response tool on the planet (Microsoft Excel) the output weffles.csv file can be loaded into many different tools. There is a pre-built weffles.pbix PowerBI Desktop file in the GitHub repo that allows you to use the same data slicer console view I show in this post.
All the build files are located in the WEFFLES GitHub repo, and I'll explain how to use them in the rest of the post.
The first thing we need is a Group Policy object to configure Windows Event Forwarding. You will need to change the "SubscriptionManagers" Fully Qualified Domain Name setting to the FQDN of your collector server, unless of course your collector server is named wec.contoso.com. The rest of the URL tells the clients which port to check in to, and how often to do it. The Refresh=60 setting on the end of the URL tells clients to call back to the Windows Event Collector server every 60 seconds to see if there are any changes to subscriptions. I've deployed this to large networks with that setting and not had issues, but you can feel free to scale it back to 5 minutes (300 seconds) or whatever works for your investigation to save on traffic. The "Log Access" portion of the GPO gives the Network Service account read access to the Security Event Log. We do this so we can run WEF in a least privilege mode where nothing runs as administrator. To figure out what your Log Access setting should be, run wevtutil gl security on example machines in your domain. Copy the existing entries if you don't want to change anything, and then append (A;;0x1;;;NS) to the end. This setting replaces what currently exists on the system, so if you have an account granted access that you don't want to break, take the extra time to run the wevtutil gl security command first. (These steps are explained in great detail in my MVA session on WEF.)
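To make the two GPO values concrete, here is a small Python sketch that assembles them as strings. The Server=...,Refresh=... layout is the usual source-initiated SubscriptionManager format on port 5985; the helper names and the sample SDDL string are illustrative assumptions and not taken from the WEFFLES repo, so substitute the channelAccess value that wevtutil gl security prints on your own machines.

```python
# Read-only ACE for the Network Service account, as given in the post.
NETWORK_SERVICE_READ_ACE = "(A;;0x1;;;NS)"


def subscription_manager_url(collector_fqdn: str,
                             refresh_seconds: int = 60,
                             port: int = 5985) -> str:
    """Build the SubscriptionManager value that clients use to find the collector."""
    return (
        f"Server=http://{collector_fqdn}:{port}"
        f"/wsman/SubscriptionManager/WEC,Refresh={refresh_seconds}"
    )


def append_network_service(existing_sddl: str) -> str:
    """Append the Network Service read ACE unless it is already present."""
    if NETWORK_SERVICE_READ_ACE in existing_sddl:
        return existing_sddl
    return existing_sddl + NETWORK_SERVICE_READ_ACE


if __name__ == "__main__":
    # Default 60 second check-in, then a 5 minute (300 second) variant.
    print(subscription_manager_url("wec.contoso.com"))
    print(subscription_manager_url("wec.contoso.com", refresh_seconds=5 * 60))

    # Hypothetical existing value; copy the real one from your environment.
    sample_sddl = "O:BAG:SYD:(A;;0xf0005;;;SY)(A;;0x5;;;BA)"
    print(append_network_service(sample_sddl))
```

Because the GPO setting replaces the existing value rather than merging with it, the helper only ever appends to what you pass in and never drops existing entries.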
The next setting we need in the GPO is to start the Windows Remote Management service. There's a difference between starting this service and configuring a listener - just starting the service doesn't make port 5985 available for attackers. Remember, we're setting up the systems in "push" mode where they just send data to WEF, and nothing is reaching in to them. We need this GPO because the service was not automatically started on 2008R2 and Windows 7 machines. If you have a more modern environment where it's running everywhere already, you may still want to use a GPO to ensure it stays that way.
There are two options presented here for setting the start type of the service, one is Group Policy Preferences, the other is using the System Services setting. I've usually used Preferences in production deployments to accommodate legacy systems, but I tested the other way in my lab and it did work if you want to go that way.
Once you have your GPO created and linked where you need it to be, copy the downloaded files from the WEFFLES GitHub Repo to your collector server. Execute the wefsetup.ps1 script, which will take care of everything for you - including importing the coreevents.xml subscription.
WEFFLES uses the EventLogWatcher script from CodePlex to output the CSV file, and it's kicked off via a Scheduled Task at system startup, so reboot the machine now. The next part takes a while to "cook", so have patience and maybe walk away for 10 minutes as the subscriptions start to work and the script starts to parse the logs.
After the machine has rebooted, you can check the status of Windows Event Forwarding by opening Event Viewer and checking the "Subscriptions" area. You should see the "CoreEvents" subscription we imported, and if you right click and check runtime status you can see which machines have checked in.
If you navigate to the Forwarded Events Log, you should see some of those Core Events in the log.
Browse to the c:\weffles directory, and you should see a bookmarks.stream file and weffles.csv - that means everything is working!
At this point you could just open weffles.csv in Excel and use the Data features there to filter and search, as a lot of IR professionals will be used to doing.
I prefer copying weffles.csv to my own laptop, however, and using PowerBI. If you create a c:\weffles directory on your machine and copy the weffles.pbix from the GitHub repo and the weffles.csv from your environment to it, you should be able to open weffles.pbix (assuming you installed PowerBI Desktop), click "Refresh", and it will pull the data from your environment into my example slicers.
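Since weffles.csv is a plain CSV, it can also be summarized with a few lines of script when neither Excel nor PowerBI is handy. The Python sketch below is one way to do that; the column names EventID and Computer are assumptions made for illustration, so check the header row of your own weffles.csv and adjust them.

```python
import csv
import io
from collections import Counter


def load_events(csv_file):
    """Read every row of an open CSV file into a list of dicts keyed by header."""
    return list(csv.DictReader(csv_file))


def count_by(rows, column):
    """Count how many rows carry each value of the given column."""
    return Counter(row.get(column, "") for row in rows)


def machines_with_event(rows, event_id,
                        id_column="EventID", machine_column="Computer"):
    """Return the sorted, de-duplicated machines that logged the given Event ID."""
    wanted = str(event_id)
    return sorted({row.get(machine_column, "")
                   for row in rows if row.get(id_column) == wanted})


if __name__ == "__main__":
    # Made-up sample data standing in for weffles.csv. For the real file use
    # something like: open(r"c:\weffles\weffles.csv", newline="")
    sample = io.StringIO(
        "EventID,Computer\n"
        "1102,ws01.contoso.com\n"
        "7045,ws01.contoso.com\n"
        "1102,srv02.contoso.com\n"
    )
    rows = load_events(sample)
    print(count_by(rows, "EventID").most_common())
    print(machines_with_event(rows, 1102))
```

This is the same "start wide, then narrow" motion the slicers give you: count first to see what is there, then pull the list of machines behind one indicator.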
What we're looking for, and why
The goal is to collect very specific and purposeful pieces of data, so we can triage them effectively. Have a well-defined goal in mind behind each piece of data you are collecting. Here are the ones I like to use to start with.
Indicators of Attack/Persistence Creation - the "Core Events" subscription
Security Event Log 1102 - Security Event Log was cleared
There are some events that should always escalate to a deeper level forensic analysis, and clearing the Security Event Log is one of those. You will likely encounter some Benign Positives as you do this that are admins doing it for one reason or another, but the benefit far outweighs the annoyance. This can also be an effective way to "reeducate" individuals with administrative access who may be practicing unsafe behaviors.
System Event Log 7045 - New Service creation event
One of the most basic and popular ways for malware to become persistent on a system is to create a new service - even in the case of "fileless" and "in memory" malware. A very common example of this is new service creations that launch a PowerShell cradle to download code from a web server. New service creations can lead you to "possibly unwanted programs" as well, which in the case of an investigation means remote administration tools - used by the attacker. Psexec and several screen sharing tools are well documented examples of tools used by attackers just as much as authorized admins.
Security Event Log 4720 - New local user created
Why bring malware when you can just bring a new user? Since most networks haven't implemented lateral movement mitigations, adding a new local account on something like a SQL server means that the attacker can use access to any system on the network and RDP directly to their goal, and not utilize a trojan or malware that may be more likely to be discovered than an account named "sqladmin."
Scheduled Task Operational Log 106/200 - Scheduled Task was registered or executed
Just like services, the scheduled task is a very popular persistence method - but it also has the bonus capability of lateral movement as tasks can be registered remotely. You can see a lot of interesting attacker behavior in scheduled tasks, including using rundll32.exe to execute code.
Why these versions of the Event IDs when there may be more verbose or newer ones? I picked the Event IDs that are most often on by default and available in the most networks, as the original goal was Incident Response. 7045s for service creation are a great example of that - there's a far more verbose and useful version of that event in Event ID 4697, however I have never seen that reliably available "in the wild." (As a good example of that, my work laptop which should be the best possible scenario has lots of 7045 events but no 4697s.) There's no Time Machine for IR yet, so we have to go with what is most likely on already without an Audit Policy change. If your environment reliably has the more verbose events available and you're not taking the solution from company to company doing IR, feel free to substitute them in as they will help you investigate.
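The Core Events discussed above can be written down as a small lookup table. This Python sketch keys each entry by log and Event ID together, because an Event ID is only meaningful alongside the log it came from. The channel name Microsoft-Windows-TaskScheduler/Operational is the Scheduled Task Operational Log; the function names are illustrative and are not part of WEFFLES.

```python
from collections import defaultdict

# (log channel, Event ID) -> what the event tells us, per the list above.
CORE_EVENTS = {
    ("Security", 1102): "Security Event Log was cleared",
    ("System", 7045): "New service created",
    ("Security", 4720): "New local user created",
    ("Microsoft-Windows-TaskScheduler/Operational", 106): "Scheduled task registered",
    ("Microsoft-Windows-TaskScheduler/Operational", 200): "Scheduled task executed",
}


def describe_event(channel, event_id):
    """Return the Core Event description, or a marker if it is not one of them."""
    return CORE_EVENTS.get((channel, int(event_id)), "Not a Core Event")


def ids_by_channel():
    """Group the Core Event IDs by the log they are collected from."""
    grouped = defaultdict(list)
    for channel, event_id in CORE_EVENTS:
        grouped[channel].append(event_id)
    return {channel: sorted(ids) for channel, ids in grouped.items()}


if __name__ == "__main__":
    print(describe_event("System", 7045))
    for channel, ids in ids_by_channel().items():
        print(channel, ids)
```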
If you're a Red Team reading this, you may have noticed some of these things look like your Tools, Techniques, and Procedures - and the real attackers know that. If you have finished an operation please make sure you communicate which systems you accessed and clean up after yourself. I have investigated several cases where the initial response to my questions was "that was the red team" until we looked closer and it was in fact an attacker who had piggybacked on an artifact left by the Red Team on the machine. (Blue Teams also need to make sure they don't get used to "Red Team Tricks" and spend time deconflicting activities, as more and more attackers are emulating Penetration Tests and this can waste valuable investigation and attacker disruption time.)
Whether you are investigating a compromise, attempting to figure out "what does that service account that has been in Domain Admins for 10 years actually do," or troubleshooting operational issues in your environment, you're likely to have already defined a list of accounts or computers that are "interesting." In an IR event, this is usually a compromised high value account (such as a Domain Admin level service account) that has been identified as behaving suspiciously, and the goal is to see what it has accessed, and what other credentials or systems it may have gained access to. Versus gathering all the logon events of every account, we can build and later modify Windows Event forwarding subscriptions with the accounts that are "in scope" for particular logon events.
4624 - Successful Logon
This event is a treasure trove of information. 4624 events show you not only who logged in, but where they did it from, what process was associated with it, and what type of authentication was used. By looking at 4624 events you're going to learn a lot of things about credential exposure and unsafe admin behaviors as well as being able to track attackers.
Knowing your Logon Types helps a ton here, they are a numerical value in the field that will tell you how the action occurred.
- Logon Type 2: Interactive, aka a full session, often a hands on keyboard kind, except it can also mean RunAs.
- Logon Type 3: Network logon, like file shares
- Logon Type 4: Batch or Scheduled Task logon (this not only leaves credentials in memory, but on disk in the LSA secrets)
- Logon Type 5: Service logon (this not only leaves credentials in memory, but on disk in the LSA secrets)
- Logon Type 7: Unlocking the system
- Logon Type 8: Cleartext logon, which means something like Plaintext Auth to an IIS server or it can mean CredSSP logons - meaning the Cleartext is local to the machine, not over the network. Investigate the package type and the source for this one to figure out what it really means.
- Logon Type 10: Remote Desktop logon - if you see a service account doing RDP logons you either have an attacker, or some admins that need threat landscape education.
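When triaging exported 4624 events outside the console, it helps to translate the numeric value into words. The Python sketch below encodes only the Logon Types listed above; the function name and output wording are illustrative choices.

```python
# Logon Type number -> meaning, limited to the types described in the list above.
LOGON_TYPES = {
    2: "Interactive (hands on keyboard, or RunAs)",
    3: "Network (for example file shares)",
    4: "Batch or Scheduled Task",
    5: "Service",
    7: "Unlock",
    8: "Cleartext (check the package type and source)",
    10: "Remote Desktop",
}

# Types the list above flags as leaving credentials on disk in the LSA secrets.
SECRETS_ON_DISK = {4, 5}


def describe_logon(logon_type):
    """Turn the Logon Type field of a 4624 event into a readable label."""
    try:
        code = int(logon_type)
    except (TypeError, ValueError):
        return f"Unrecognized logon type: {logon_type!r}"
    label = LOGON_TYPES.get(code, "Not covered in this list")
    note = " [credentials also on disk in LSA secrets]" if code in SECRETS_ON_DISK else ""
    return f"Logon Type {code}: {label}{note}"


if __name__ == "__main__":
    # CSV fields arrive as strings, so both "5" and 5 are accepted.
    for value in ("2", "5", 10, "9"):
        print(describe_logon(value))
```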
4625 - Unsuccessful Logon
Failed logons are not normally something I track as attacker behavior (in the day and age of Mimikatz, we just assume they have the credentials they want versus brute forcing them). They are more something to gather in preparation for remediation activities - which machines have failed logons when that mystery account is finally removed from Domain Admins. There are some exceptions to this rule however, like the Qakbot malware that gets discussed a bit later in the post.
4648 - Logon was Attempted with Explicit Credentials
This event is why I cite that you must get logs from workstations as well as servers - if a 4648 Logon with Explicit Credentials is performed it gets logged on the originating machine. Meaning that if an attacker runs maliciousscript.cmd with a RunAs against a Domain Controller you would likely just see a network logon from the account on the DC - something that blends in far too easily. This will also allow you to see things like mapped drives with admin credentials, which is a popular way of exploring and exfiltrating data among attackers.
Once you define who is "interesting" for your investigation, you can edit the InterestingAccounts.xml subscription to add their account names to it and then import it with wecutil cs "InterestingAccounts.xml"
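As a rough illustration of what such a filter expresses, the Python sketch below builds an Event Log XPath expression that selects the three logon Event IDs for a list of account names. The exact layout of InterestingAccounts.xml in the repo may differ; the function name, the choice of the TargetUserName field, and the sample account names are assumptions for this example.

```python
LOGON_EVENT_IDS = (4624, 4625, 4648)


def interesting_accounts_xpath(accounts, field="TargetUserName",
                               event_ids=LOGON_EVENT_IDS):
    """Build an XPath filter matching the logon events for the named accounts."""
    names = [name.strip() for name in accounts if name and name.strip()]
    if not names:
        raise ValueError("at least one account name is required")
    for name in names:
        # Quotes would break out of the XPath string literal, so refuse them.
        if "'" in name or '"' in name:
            raise ValueError(f"unsupported character in account name: {name!r}")

    id_clause = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    account_clause = " or ".join(
        f"Data[@Name='{field}']='{name}'" for name in names
    )
    return f"*[System[({id_clause})]] and *[EventData[{account_clause}]]"


if __name__ == "__main__":
    # Hypothetical "in scope" accounts for an investigation.
    print(interesting_accounts_xpath(["bob", "svc_sql"]))
```

Keeping the account list as data makes it easy to regenerate the filter as the scope of the investigation changes.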
Now that we have our data, we can start to "hunt" in the environment.
Here are some examples of things I would look for in an investigation.
By clicking the radio button next to the 1102 event, I can filter out the machines that have had their Security Event logs cleared. I would want to flag these for deeper analysis.
Here I am filtering by Logon Type 5, "Log on as a Service." This is a service running as Bob's domain admin account, which is a bad sign from a credential exposure perspective, but also because we already knew Bob's account was compromised.
This is a good example of an Extremely Shady machine. It has a PowerShell cradle as a persistence mechanism, and two not-really-legit ups services on it.
Incident Response is always a bit more art than science, and practice makes perfect. You'll probably make some assumptions in the beginning that might lead to false positives and make you feel silly, but don't worry, and don't lose your enthusiasm. Use the console to understand how your environment works, and make it a regular activity versus just an IR activity. Data is power, and you can help operations teams as well as security teams with this.
Flexibility: add what you need, when you need it.
Let's say you're having an outbreak of Qakbot, a malware which can cause mass lockouts on your network as it steals and brute forces credentials and performs lateral movement over WMI. You'll have scheduled task and service creations to show you the persistence points, but ultimately you're going to need to track down which accounts are being locked out and from where to solve the mystery. We can add a new subscription for account lockouts, and have it propagate to the entire domain and get us information in a matter of minutes - just from importing a new subscription. We're using an account lockout as an example here, but the steps apply to almost all Event IDs - meaning you can add what you care about tracking, depending on the circumstances you're trying to diagnose.
We start with an event log entry for a lockout - I just generated this one by doing Run as a Different User and failing logons for Bob, so you don't have to be having an active malware outbreak to have an event log entry to practice from.
We need to pull data from a few fields to get our subscriptions to work.
The easy pieces of data to gather are which Event Log the message you're looking for lives in, and which Event ID.
We're going to need to pull out EventID 4740 from the Security log via a Windows Event Forwarding Subscription - this one is easy, so feel free to make it via the GUI if you want, but we're going to get more flexibility and the ability to pull existing events if we write it manually. This is what the subscription looks like when made manually. You can see the relevant parts - which log it came from, what EventID, and that we want the existing events as well as new ones. (Don't worry, the .xml is in the GitHub repo, you don't have to type it from the screenshot.)
After a few minutes, if everything goes correctly, you'll start to see 4740 Account Lockout events in Forwarded Events.
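For readers without the screenshot, the query portion of a subscription like this is a QueryList that names the log and filters on the Event ID. The Python sketch below generates that fragment with the standard library; it is only the query, not the full subscription file from the repo (which also carries settings such as ReadExistingEvents for pulling existing events), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_query_list(channel: str, event_ids: Iterable[int]) -> str:
    """Return a QueryList XML string selecting the given Event IDs from one log."""
    ids = list(event_ids)
    if not ids:
        raise ValueError("at least one Event ID is required")

    id_filter = " or ".join(f"EventID={event_id}" for event_id in ids)

    query_list = ET.Element("QueryList")
    query = ET.SubElement(query_list, "Query", {"Id": "0", "Path": channel})
    select = ET.SubElement(query, "Select", {"Path": channel})
    select.text = f"*[System[({id_filter})]]"

    return ET.tostring(query_list, encoding="unicode")


if __name__ == "__main__":
    # Account lockouts live in the Security log as Event ID 4740.
    print(build_query_list("Security", [4740]))
```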
Our next step is to get them into our weffles.csv file so we can see them in the PowerBI Console. To do that, we're going to need to investigate the XML view to decide what parts of the Event Record we want to add to our CSV, so click on the Details tab and then the XML view radio button. Lockouts will be fairly straightforward, as they share Event Data with events we're already collecting, but here you can see the "guts" of how Event Viewer knows what to display for you. (My goal in walking you through this manually is to teach you how the underlying technology works, but if you're in a hurry or have a very complicated event Kurt Falde has written a script to automate some of the subscription creation for you.)
The relevant parts from an investigation standpoint are the TargetUserName (who was being locked out) and SubjectUserName (who performed the locking out).
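To show how those named fields sit inside the XML view, here is a Python sketch that pulls them out of an event record. The namespace is the standard Windows event schema; the sample record is a trimmed, made-up 4740 event (names and values are hypothetical), and WEFFLES itself does this work in PowerShell rather than Python.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XML view of Windows events.
EVENT_NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed, made-up lockout event for illustration only.
SAMPLE_4740 = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4740</EventID>
    <Channel>Security</Channel>
    <Computer>dc01.contoso.com</Computer>
  </System>
  <EventData>
    <Data Name="TargetUserName">bob</Data>
    <Data Name="SubjectUserName">DC01$</Data>
  </EventData>
</Event>"""


def event_fields(event_xml):
    """Flatten an event record into a dict of Event ID, computer, and named data."""
    root = ET.fromstring(event_xml)
    fields = {
        "EventID": root.findtext("e:System/e:EventID", default="", namespaces=EVENT_NS),
        "Computer": root.findtext("e:System/e:Computer", default="", namespaces=EVENT_NS),
    }
    # Each <Data Name="..."> element becomes one key/value pair.
    for data in root.findall("e:EventData/e:Data", EVENT_NS):
        name = data.get("Name")
        if name:
            fields[name] = data.text or ""
    return fields


if __name__ == "__main__":
    record = event_fields(SAMPLE_4740)
    print(record["TargetUserName"], "was locked out; reported by", record["SubjectUserName"])
```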
Sidenote: see how the Subject User is a computer? That's because in AD Computers are people too, or at least Users. The "Domain Users" group contains computer objects as well as User objects, which is an interesting thing to know from a security perspective. An interview question I like to ask is "what is the difference between Domain Users, Authenticated Users, and Everyone groups?"
We need to edit the wefflesreporting.ps1 script, and the first place we need to go is the $EventLogQuery line - since you've seen XPath filters it should be clear what this is, but this is where we are cherry picking the Event IDs from the Forwarded Events Log we'd like to have in the CSV. We need to tack the 4740 EventID on the end of that.
After that, scroll down a bit and you can see a large block of $EventObj | Add-Member lines. These are where the relevant parts of Event Logs get selected as fields we are able to add as text to our event log entries in the CSV file. Since the TargetUserName and SubjectUserName are needed for other events we are already collecting, we can see they are already available for us.
We need to add a new entry for the 4740 events to the script so they show up in the CSV, so scroll down to the area where we're adding new $EventRecordXml items to our CSV. You'll see a lot of them here in my original file, so you can feel free to cut and paste (I'm not judging) just be careful to only have $EventRecordXml items that actually exist in the EventID you are looking for to avoid errors. (And also avoid cut and paste errors that wind up with you having 17 ElseIf statements and other errors like I did before.)
So now we have the item in the script, which is being run by our scheduled task. Since the scheduled task runs at System Startup, we need to get it to reload to recognize that we made the change.
You can either go old school and just reboot the machine, or from an elevated PowerShell prompt run Stop-ScheduledTask -TaskName "WEF Parsing Task" followed by Start-ScheduledTask -TaskName "WEF Parsing Task" to restart the task.
After that, give it a few minutes to catch up with adding the new events, and you should be able to open up the weffles.csv file and see the new Event IDs in it.
And if we refresh the data in PowerBI, we can now rapidly get to which accounts are being locked out on which machines - which in the scenario of a Qakbot infection, is an important thing to do quickly.
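The same question can be asked of the CSV directly. This Python sketch counts lockout rows per account and machine pair; the column names EventID, TargetUserName, and Computer are assumptions made for illustration, so match them to the headers in your own weffles.csv.

```python
import csv
import io
from collections import Counter

LOCKOUT_EVENT_ID = "4740"


def lockouts_by_account_and_machine(rows, id_column="EventID",
                                    account_column="TargetUserName",
                                    machine_column="Computer"):
    """Count 4740 rows for each (account, machine) pair."""
    pairs = Counter()
    for row in rows:
        if row.get(id_column) == LOCKOUT_EVENT_ID:
            key = (row.get(account_column, ""), row.get(machine_column, ""))
            pairs[key] += 1
    return pairs


if __name__ == "__main__":
    # Made-up rows standing in for the exported CSV.
    sample = io.StringIO(
        "EventID,TargetUserName,Computer\n"
        "4740,bob,dc01.contoso.com\n"
        "4740,bob,dc01.contoso.com\n"
        "4740,alice,dc02.contoso.com\n"
        "7045,,ws01.contoso.com\n"
    )
    counts = lockouts_by_account_and_machine(csv.DictReader(sample))
    for (account, machine), total in counts.most_common():
        print(f"{account} on {machine}: {total}")
```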
The process is really simple and useful for many different scenarios. Maybe you have a particular application with an error code that matters to you, or maybe you want to enhance the WEFFLES logging to include things like Event ID 400 for PowerShell downgrade attacks to detect things like the PS>Attack Tool. The idea is that you have the power of Windows Event Forwarding applying to massive amounts of endpoints, with the ability to rapidly deploy new detections, and the great data slicer that is PowerBI to make hunting through the stacks of data easier. If you need some more ideas of things you might want to monitor, check out my Monitoring What Matters and Tracking Lateral Movement posts.
- I talk about Event ID 1102 a lot. So much that when the Bad Rabbit malware, which was very similar to NotPetya, happened, my mum emailed me simply "Event ID 1102." So if you learn nothing else from this blog, my mum wants you to monitor 1102 Events.
- Kurt Falde for helping me find the right tools for my crazy idea
- Sean Metcalf, Valentine Reid, Devon Kerr, Lee Holmes, Carlos Perez, Jacob Soo, and Swift On Security for testing/reviewing my code
- The companies I have visited who gladly tested my code in production but would probably prefer to remain nameless. :)
- Greg Linares for reviewing the very first draft of this all the way back in 2016 and encouraging me that it was a good idea