• Pete Hugh

How To Open Large CSV Files

For years now, security analysts have relied on spreadsheets to work with security data and analyze large CSV files.

Sure, there are other tools out there. SIEMs, BI solutions, and other tools (security-related and otherwise) promise to help analysts process security data. But as valuable as these tools are, it’s hard to beat the simplicity and flexibility of a good ol’ fashioned spreadsheet.

Unfortunately, there’s a problem. Analysts rely on spreadsheets for a lot of their work... but the sheer volume of security data has risen to the point where standard office software can no longer cope.

What does an analyst do when their preferred method no longer works? They look for a workaround. (Spoiler: you can create a free Gigasheet account to try your own file!)

How to Open Big Files in Excel

15 years ago, Excel spreadsheets were limited to 65,536 rows. Then, with the release of MS Office 2007, the number of rows supported rose to 1,048,576.

And then… nothing.

For the last 12 years, as the volume of data that security analysts work with has risen exponentially, the number of rows supported by Excel hasn’t budged an inch. So, for practically every security analyst on the planet, opening large security CSVs in Excel has become a huge headache.

How to open a large csv file

So, how do you open large CSV files in Excel? Essentially, there are two options:

Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or,

  1. Find an Excel add-in that supports CSV files with a higher number of rows.

Not a great choice, is it?

Some years ago, Microsoft maintained a product called Power Query, which made it possible to open CSV or XLS files of up to 50 million rows. Unfortunately, the add-in came with its own limitations:

  • It required a lot of setup time and forced analysts to learn its proprietary M formula language.

  • It was only available for analysts working on Windows operating systems. If you needed to open large CSV files on a Mac, you were out of luck.

  • It was only available for MS Excel 2010 and 2013 and is no longer supported or updated by Microsoft.

Finally, and perhaps most damningly of all, it still only allowed security analysts to open CSV files of up to 50 million rows. In today’s big data society, that just isn’t enough.

Convert CSV to MS Access

If your CSV is too big for Excel, what’s the next step? Naturally, to use a CSV database program — most likely, MS Access — to open big CSV files.

Opening large CSV files in MS Access is about as easy as it gets:

  1. Create a new database file.

  2. Name the database and save it somewhere appropriate.

  3. Choose File → Get External Data → Import.

  4. Select your CSV file.

  5. Click import.

And that’s it, your CSV file is open.

But here’s the problem. Opening a CSV file in MS Access doesn’t allow a security analyst to work freely with data as they would in Excel. It’s much harder to ‘eyeball’ data for insights and trends, and you don’t have access to simple spreadsheet functions like filters and lookups. Instead, you’re forced to use a more complicated set of features and formula syntax to create queries.

Not only do these skills take time to learn, they are also prone to errors. And, unlike a spreadsheet, it’s often hard to tell when you’ve made a mistake in a query when working with a database. In the worst cases, an incorrectly written query may give you the ‘wrong’ answer — and you’ll never even realize you made a mistake.

And, yet again, we run into a size limitation. MS Access can open much larger CSVs than a spreadsheet can, but its capacity isn’t unlimited. The maximum size of an individual ACCDB file is 2GB.

For security analysts working with datasets that run tens to hundreds of millions of rows, this may not be enough.

Import CSV Files CSV Database Software

If you don’t like using office-style databases — or don’t have access to them — another option is to import large CSV files to a relational SQL database.

SQL stands for Structured Query Language, a standard programming language used to query data across one or more databases. SQL databases are tremendously powerful and boast impressive performance even at a very large scale. However, they come with their own set of challenges.

Building your own SQL database requires a specific set of skills that are far from easy to develop. Unless you are a programmer on the side, undertaking this type of project is often out of reach for a busy security analyst. You could wait for a data science or other engineering team to help with the project, but that’s also not ideal - especially if you’re responding to a security incident. You could also hire a development firm to build the SQL database for you, but it’s likely to be a costly project.

And here’s something else to consider. Even if you did build a perfect SQL database, you would still run into performance issues once you start working with seriously big CSV files. You could solve that problem by hosting your database using a business cloud service, but then the project becomes even more costly.

And, really, none of this solves the original problem. Security analysts use spreadsheets to open large CSV files because it affords them a lot of flexibility. Ultimately, no matter how powerful an SQL database you build, it will never be as easy to work with as a spreadsheet.

What About Business Intelligence Tools?

Often, SQL databases are used in conjunction with a business intelligence (BI) tool. These tools can make the process of querying and visualizing large datasets much simpler, and some can even rival spreadsheets in their flexibility.

Combining a well-constructed database with a BI tool is certainly a viable option for a security analyst needing to view large CSV files. However, there are still two considerations:

  1. The cost of building and maintaining an SQL database and paying for a BI tool is substantial.

  2. As powerful as they can be, BI tools still run into performance issues when working with extremely large databases.

Some BI tools allow analysts to import CSV files directly. But, yet again, performance issues and size limitations abound when working with very large CSV files.

How To Open CSV files in Python

At this point, it has likely become clear that — for our long-suffering security analyst — workarounds don’t get the job done. If you want to open big CSV files (potentially running to hundreds of millions of rows) you need to take a completely different approach.

One such option is to use Python, or another similarly powerful coding or scripting language.

Python is a general-purpose programming language that, among other uses, has historically seen a lot of uptake in the scientific and mathematical communities. Its high-performance nature and built-in library of useful modules make Python an extremely powerful tool for interrogating and visualizing huge datasets.

And Python is far from the only option. As you can see from discussions like this one, there are many ways to use programming languages and custom scripts to interrogate even the largest CSV files. However, they all run into the same problems: time and complexity.

As a security analyst, do you have the capacity (or inclination) to learn one or more complex programming languages just to analyze large CSV files? Even if you do, will you ever be completely confident that your custom-written scripts and tools are watertight?

Ultimately, while Python and other scripting languages are undoubtedly an option, they don’t fulfill our criteria.

Security analysts don’t need to be full-on data scientists or programming experts - they need to be security experts! Security experts need a simple, powerful solutions that allow them to work with huge CSV files just as easily as they would with a smaller file, using a universal spreadsheet-like application.

How to Open Very Large CSV Files

If you’ve been waiting for the pitch, here it is.

Gigasheet is a no-code, cybersecurity analyst workbench that allows analysts to work efficiently with even the largest security datasets.

No longer will you be pressured to be a ‘unicorn analyst’ who can code, manage databases, and perform data science tasks. With Gigasheet, you can open CSV files of up to a BILLION rows, and work with them just as easily as you’d work with a much smaller file in Excel or Google Sheets.

Even better, Gigasheet is specifically designed for security data. It understands IP addresses, time and date fields, and other common security data — so you won’t be stuck trying to reformat, split, or concatenate columns just to answer basic questions.

On the surface, Gigasheet is a web-based, billion-row spreadsheet. Behind the scenes, it provides a high-performance big data analytics platform built specifically for security analysts.

We’re looking for beta testers right now to help us make Gigasheet the best it can be. We want to know exactly what problems you face as a security analyst so we can make sure Gigasheet is equipped to solve as many of them as possible.

If you’d like to help us make Gigasheet the ideal solution to your problems, create a free account.