Large PCAP File Analysis 101 with Gigasheet, GreyNoise, and Google
Imagine it's your first day on the job as a junior security analyst and your assignment is to analyze a large packet capture (PCAP) file that was collected from a monitoring port configured on one of the core switches at a remote site. The company you work for has not made significant investments in security technology, so you don't have a lot of enterprise-grade tools in your arsenal to begin your assignment. All you know is that an employee’s machine has been behaving abnormally for a few days but the antivirus software running on the employee’s computer has not detected any malicious files or programs. The employee does not want to have his computer re-imaged because he is concerned about losing important files, so your boss decided to collect network traffic from the local area network where the employee works to try to identify the root cause of the problem.
Your boss asks you to begin analyzing the packet capture, which he stored in your team’s network file share. You scan the file share, find the PCAP file, move it to your laptop, and open it with Wireshark. You are ready to put your packet analysis skills to work when suddenly Wireshark crashes. After several attempts to open the file, your computer keeps freezing and crashing. You conclude that the file is too large to open with Wireshark and turn to Google to look for alternatives. You learn about Tshark, the command-line version of Wireshark.
You could hunt around for the right commands to read the PCAP file using Tshark, but the results will be endless lines of text on your screen. Tshark filters can help you make sense of the data, but by now you've spent several hours just trying to open and analyze a large PCAP file.
Luckily for all you junior security analysts out there, there is a simpler way to analyze packet captures, one that does not require learning complicated command-line tools and syntax. Any security analyst, regardless of their level of experience, can apply these techniques.
In this blog post, we will show you how to analyze large PCAP files using Gigasheet, the big data spreadsheet built for cybersecurity, and GreyNoise, a provider of internet-wide scan and attack data. Here we illustrate the power of Gigasheet by analyzing a sample packet capture file from Stratosphere Lab, which contains network traffic associated with malware.
Step 1: Convert PCAP file to CSV
UPDATE: Gigasheet now supports raw PCAP file analysis! Upload a big PCAP and Gigasheet will extract some standard fields from it into a clean sheet.
Gigasheet allows you to upload and analyze huge CSV and log files (you can request your free account here). The first step in the process is to convert the PCAP file to CSV format. In this example, we use Tshark to export all packets in a 274 MB PCAP file, gigasheet.pcap, into a CSV file.
The Tshark commands below read the gigasheet.pcap file and extract the packet number, timestamp, source and destination IP addresses, protocol, length, and other OSI-Layer 7 information to the gigasheet-csv.csv file.
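The exact commands used for this post are not reproduced here, so the following is only a sketch of one way to build such an export. It assembles a Tshark command line from Python and prints it. The field names are our assumption, chosen to match the seven columns described in this post, and the spelling of the two column fields (_ws.col.Protocol and _ws.col.Info) can differ between Wireshark versions, so check them against your installation.

```python
import shlex

# Assumed field list: Tshark field names chosen to match the seven columns
# described in this post. They are not taken from the original commands.
FIELDS = [
    "frame.number",         # packet number
    "frame.time_relative",  # seconds since the start of the capture
    "ip.src",               # source IP address
    "ip.dst",               # destination IP address
    "_ws.col.Protocol",     # protocol column as Wireshark displays it
    "frame.len",            # packet length in bytes
    "_ws.col.Info",         # Info column
]


def build_tshark_command(pcap_path, fields=FIELDS):
    """Return the Tshark argument list that prints the chosen fields as CSV."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields"]
    for field in fields:
        cmd += ["-e", field]
    # Emit a header row, comma separators, and double-quoted values.
    cmd += ["-E", "header=y", "-E", "separator=,", "-E", "quote=d"]
    return cmd


command = build_tshark_command("gigasheet.pcap")
# Print a shell-ready line; the redirect writes the rows to gigasheet-csv.csv.
print(shlex.join(command) + " > gigasheet-csv.csv")
```

Quoting every value (quote=d) matters here because the Info column often contains commas, which would otherwise shift cells into the wrong columns.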
The resulting CSV file is 208 MB and contains over 2 million rows!
Step 2: Upload CSV File to Gigasheet
The next step is to log in to Gigasheet, upload the CSV file, and begin analyzing the data. We do not know much about the specific malware contained within the PCAP file. All we know is that the file contains traffic associated with malware, but we don’t know the malware type, ports, or protocols used to communicate outbound, or the IP address(es) of the infected system(s).
Upon uploading and processing the CSV file, Gigasheet displays seven columns:
- Column A: Packet number
- Column B: Timestamp
- Column C: Source IP address
- Column D: Destination IP address
- Column E: Protocol
- Column F: Packet length
- Column G: Information
Gigasheet makes it easy to convert time to different formats, such as Coordinated Universal Time (UTC). By default, Wireshark displays timestamps as relative time (seconds since the beginning of the capture). We therefore need to normalize the time displayed in Column B, Timestamp, which Gigasheet can do using the Time Cleanup function.
Gigasheet allows you to enrich data with intelligence from popular threat intelligence providers. In this example, we'll use GreyNoise (you can sign up for a free API token here). We'll run the enrichment feature on Column D, the destination IP addresses, to identify any IPs that GreyNoise has observed in the past as being malicious, noisy, or suspicious.
After the enrichment, Gigasheet creates a new column, H, which contains the GreyNoise response. We'll use the Group feature on Column H to bucket the data. Here we see three unique values:
We can expand the Invalid group and see that it contains private IP addresses (RFC 1918), while “never_observed” is self-explanatory: GreyNoise has not seen the address scanning the internet, so we treat it as benign for now. The “noise_old” value is of interest, so we expand that group to look for anything that may help with our analysis (after all, our sample file is from 2015).
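To make the normalization concrete, here is a minimal sketch of the arithmetic involved: add the relative offset to the moment the capture started. This is not how Gigasheet's Time Cleanup function is implemented, and the capture start time below is a made-up value used only for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(capture_start, seconds_since_start):
    """Convert a relative capture offset in seconds to an absolute UTC datetime."""
    return capture_start + timedelta(seconds=float(seconds_since_start))


# Hypothetical capture start time, used only for illustration.
start = datetime(2015, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# A packet seen 90.5 seconds into the capture.
print(to_utc(start, "90.5").isoformat())
```

In practice the start time would come from the capture itself, for example from the absolute timestamp of the first packet.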
We can now see that 22.214.171.124 is the IP address that GreyNoise classified as "noise". We can further group by Column D, or destination IP address, to identify other interesting IP addresses that we may need to consider in our analysis, but there are no other IP addresses in this group.
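As an aside on the Invalid bucket: private addresses are never routed on the public internet, so there is nothing for GreyNoise to say about them. A small sketch like the one below could separate them out before enrichment. The helper name and the sample addresses are ours, and the check covers only the three RFC 1918 ranges.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address falls inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


# Sample destinations for illustration only.
destinations = ["10.0.2.2", "8.8.8.8", "192.168.1.10", "172.32.0.1"]
public_only = [ip for ip in destinations if not is_rfc1918(ip)]
print(public_only)
```

Only the addresses left in public_only would be worth sending to an enrichment service.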
Since we hit a dead end, we reset all the columns and start from the beginning. The next step is to identify all the protocols in the capture by grouping by Column E, the protocol, which shows the following:
Let’s start by looking at DNS traffic by expanding the DNS group. We can now see some odd DNS queries against a Google DNS server for a .info domain, most of which are originating from the same system, 10.0.2.107.
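For readers who want to reproduce the grouping outside Gigasheet, it amounts to counting rows per distinct value in a column. The sketch below does that with Python's standard library. The header names and the three sample rows are made up for illustration and are not taken from the real capture.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the seven-column layout described earlier.
SAMPLE = """No,Time,Source,Destination,Protocol,Length,Info
1,0.000,10.0.2.107,8.8.8.8,DNS,72,Standard query 0x1a2b A example.info
2,0.050,8.8.8.8,10.0.2.107,DNS,88,Standard query response 0x1a2b No such name
3,0.100,10.0.2.107,10.0.2.2,TCP,66,49200 > 80 [SYN]
"""


def count_by(rows, column):
    """Count how many rows share each distinct value in the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Equivalent of grouping by the protocol column.
print(count_by(rows, "Protocol").most_common())

# Equivalent of grouping the DNS rows by source IP address.
dns_rows = [row for row in rows if row["Protocol"] == "DNS"]
print(count_by(dns_rows, "Source").most_common())
```

For a file with millions of rows you would pass an open file to csv.DictReader and count while iterating rather than building a list first.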
We can further group by Column C, or source IP address, to identify any other hosts that may be communicating with internal or external DNS servers:
Doing so revealed two internal hosts communicating with DNS servers:
Let’s go back to the DNS group and filter Column C, source IP address, for a value equal to 10.0.2.2. Let’s expand the DNS group to see the actual connections:
We deduce that 10.0.2.2 is an internal DNS server that is responding to DNS queries from 10.0.2.107, which narrows down our search.
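The reasoning behind that deduction can be sketched as a simple tally: a host that only sends responses is acting as a server, and a host that only sends queries is a client. The rows below are made up for illustration, and the check relies on the assumption that the Information text for a response starts with "Standard query response".

```python
from collections import defaultdict

# Made-up (source IP, Information) pairs standing in for DNS rows.
DNS_ROWS = [
    ("10.0.2.107", "Standard query 0x0001 A abcdefg.info"),
    ("10.0.2.2", "Standard query response 0x0001 No such name A abcdefg.info"),
    ("10.0.2.107", "Standard query 0x0002 A hijklmn.info"),
    ("10.0.2.2", "Standard query response 0x0002 No such name A hijklmn.info"),
]


def dns_roles(rows):
    """Tally how many DNS queries and responses each source address sent."""
    roles = defaultdict(lambda: {"queries": 0, "responses": 0})
    for source, info in rows:
        is_response = info.startswith("Standard query response")
        roles[source]["responses" if is_response else "queries"] += 1
    return dict(roles)


for host, counts in dns_roles(DNS_ROWS).items():
    print(host, counts)
```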
Now, let’s reset Column C and focus our attention on the DNS group. The next step is to split Column G, Information, using a space as the separator, which will let us apply more granular filters. Gigasheet can do this in just two steps:
- Apply the Split Column function
- Enter a space as the character to split by
The Information column is now split across columns H through L. Next, we want to identify all the DNS queries that the suspected victim attempted to resolve. To do this, we can filter Column C for any value equal to 10.0.2.107, which displays all connections originated by the victim machine.
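The split and filter can be sketched in a few lines. We assume here that a query's Information text has five space-separated parts (the words "Standard" and "query", the transaction ID, the record type, and the queried name), which is why the name ends up in the fifth of the new columns. The sample rows are made up.

```python
# Made-up (source IP, Information) pairs standing in for DNS rows.
DNS_ROWS = [
    ("10.0.2.107", "Standard query 0x0001 A abcdefg.info"),
    ("10.0.2.2", "Standard query response 0x0001 No such name A abcdefg.info"),
    ("10.0.2.107", "Standard query 0x0002 A hijklmn.info"),
]


def queried_names(rows, victim, separator=" "):
    """Return the names queried by the victim, taken from the split Information text."""
    names = []
    for source, info in rows:
        parts = info.split(separator)
        # Keep only rows from the victim that look like a five-part query.
        if source == victim and len(parts) == 5 and parts[:2] == ["Standard", "query"]:
            names.append(parts[4])
    return names


print(queried_names(DNS_ROWS, "10.0.2.107"))
```

Responses have more than five parts, so the length check also keeps them out of the result.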
Now, let’s focus our attention on Column L. We can see that the DNS queries are all for the same top-level domain (.info) and that the domain names appear to be randomly generated seven-letter strings.
A quick Google search for one of these domains (qmfgvms[.]info) returns many interesting results. One of the top results is from Joesandbox.com, a popular malware sandboxing service. Here we can see that the domain was identified as being associated with a malicious Windows executable:
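A pattern like this is easy to express as a regular expression. The sketch below flags names made of exactly seven letters followed by .info. It is a crude heuristic of our own, not a real detector for algorithmically generated domains: it cannot tell a random string from an ordinary seven-letter word. Apart from qmfgvms.info, the sample names are made up.

```python
import re

# Exactly seven ASCII letters followed by ".info".
SEVEN_LETTER_INFO = re.compile(r"[a-z]{7}\.info")


def matches_pattern(name):
    """Return True if the name is a seven-letter label under .info."""
    normalized = name.lower().rstrip(".")
    return SEVEN_LETTER_INFO.fullmatch(normalized) is not None


names = ["qmfgvms.info", "www.google.com", "news.info", "QMFGVMS.INFO."]
for name in names:
    print(name, matches_pattern(name))
```

A list of names flagged this way is a starting point for the kind of lookups described next, not proof of infection on its own.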
Scrolling down, we can see some familiar domain names:
If we go back to the Google search results, we can see another interesting website:
This website includes a list of domains consistent with the ones identified in Gigasheet. Furthermore, the author of this website attributes these domains to “Shifu”.
A quick Google search for Shifu leads to a write-up and analysis by FireEye's research team. We can now confirm that Shifu is a Trojan and that the host with IP address 10[.]0[.]2[.]107 was infected at the time the packet capture was recorded.