No-Code Data Science For Threat Hunting

In its simplest form, data science is the process of extracting meaningful and actionable information from massive sets of structured, semi-structured, and unstructured data. While the discipline has been around since the early 1960s, it has increased in popularity in recent years with the advent of powerful cloud-based data analysis tools, such as Tableau and Microsoft Power BI. Many industries rely on data science to make business decisions, such as finance, payments, and transportation. Similarly, in cybersecurity, data science can complement rules- and behavioral-based analysis tools to help detect more covert patterns or behaviors and allow for quicker response and more thoughtful decision-making.

No Code Data Science for Threat Hunting

Many data analysis and manipulation tools are available on the market today. Some are commercial, others are open source, some are easy to use, and some have a significantly steep learning curve. This blog shows you how to use Gigasheet for performing basic threat hunting. It demonstrates Gigasheet's easy-to-use data analysis features and functions by completing a Pandas data science lab developed by Mike Sconzo and David Dorsey, all in a few simple steps.


Pandas is an open-source tool for data analysis, manipulation, and visualization built on Python. Developed in 2008 at AQR Capital Management, the project has had over 2,000 contributors since its inception and is widely used in academic and commercial domains. In its most straightforward application, the tool works by ingesting an input file (i.e., CSV, JSON, or HTML) from which it creates a 2-dimensional data frame or table to store the file's data. It then applies some user-defined processing and returns the result on the screen as either text or plot.


If you want to follow along, you can sign up for a free Gigasheet account, or you can see it in action by watching this short tutorial.




File Sampling, Creation, Input, and Verification

The lab intends to analyze a large conn.log Zeek (formerly Bro) file using Pandas. But before you can start analyzing data, the lab instructions direct you to create a sample of the conn.log file because it is too large to fit in memory. The tool reads about 300k randomly selected lines and writes them to a new file.


Next, Pandas imports the new file into memory to create a data frame and verifies that the data was loaded correctly using some head() and tail() commands.


Pandas Threat Hunting File Import

Pandas Zeek Logs
Figure 1 - File Sampling & Creation with Pandas

Pandas for Threat Hunting

Figure 2 - File Input & Verification with Pandas

No Code With Gigasheet

First off, with Gigasheet we don't need to sample the input file. We don't even need to download the file locally and unzip it before importing it to Gigasheet. We can copy and paste the file's URL into Gigasheet's link import function, and Gigasheet does the rest. The file we will analyze in Gigasheet contains 22 million lines (as opposed to 300k analyzed with Pandas).


File Upload & Verification with Gigasheet
Figure 3 - File Upload & Verification with Gigasheet

Data Summarization and Types

Next, the lab instructs you to run a describe() function to obtain a numeric summarization of all numerical columns and a dtypes function to see the type of information stored in each column.

Threat Hunting Data Summarization and Types with Pandas
Figure 4 - Data Summarization and Types with Pandas

How do we do the same in Gigasheet? We open the uploaded file to see all the columns and rows. That's it!


Data Summarization In Gigasheet
Figure 5 - Data Summarization and Types with Gigasheet

Converting Column Types, Exploring Data Values, and Remove Columns


Next, the lab directs you to change the ts (timestamp) column to a readable date-time format. The timestamp in Bro files is in Unix time format, which is difficult for humans to discern. In Pandas, you change the ts column format using the to_datetime() function. However, the lab instructions do not specify how to select a time zone. It then uses the describe() function again to verify that the format conversion was successful and removes the original ts column from the data frame.

Convert Column Types and Remove Columns with Pandas
Figure 6 - Convert Column Types and Remove Columns with Pandas

In Gigasheet, we apply the Cleanup Unix Time to the ts column (or column labeled A in Figure 7) and select a timezone, which creates a new column with the desired time format and zone. We can then uncheck the ts column to remove it from view or delete it forever.


 Convert Unix Time To UTC with Gigasheet
Figure 7 - Convert Column Types with Gigasheet

Row Selection

Next, the lab instructs you to run several functions to select all the rows that contain SSL traffic and identify unique originator bytes values. It does this in three steps using conn_df and conn_df.head functions.


Row Selection In Pandas
Figure 8 - Row Selection in Pandas

In Gigasheet, we can apply a filter to the service column to look for all values matching SSL and then group by the origin_bytes column to see all the unique originator bytes values. We can even sort in ascending or descending order with one click.

Column Filters
Figure 9 - Row Selection in Gigasheet

Final Exercise


For the final exercise, the lab challenges you to use all the techniques described to find the unique source ports and IP addresses associated with all HTTP connections over any ports other than 80. It even gives bonus points for the number of connections for each source IP address.


In Gigasheet, we can apply a filter to the service column to look for values equal to HTTP. We can then use a second filter on the id.resp_p column (or responding host's port) to look for values NOT matching port 80, which will display all HTTP connections over any port except 80. To find all unique source ports, we can group the connections by the id.orig_h column (or originating host's IP address), and lastly, we can re-group by the id.orig_port column (or originator's source port) to find all unique source ports.


Final Exercise in Gigasheet
Figure 10 – Final Exercise in Gigasheet