Columnar formats, such as Apache Parquet, offer significant compression savings and are much easier to scan, process, and analyze than row-oriented formats such as CSV. In this article, we show you how to convert your CSV data to Parquet with AWS Glue.
What is a columnar format?
CSV files, log files, and other character-separated files store data in rows. Each row of data has a certain number of fields, all separated by a delimiter, e.g. a comma or a space. But under the hood, these formats are still just strings, and there is no easy way to scan just a single column of a CSV file.
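The difference between the two layouts can be sketched in plain Python. The data set and column names below are made up for illustration; the point is that pulling one column out of row-oriented text requires parsing every row, while a columnar layout keeps each column's values together:

```python
# Hypothetical three-column data set, stored row-oriented as CSV text.
csv_rows = [
    "id,name,score",
    "1,alice,90",
    "2,bob,85",
    "3,carol,97",
]

# Row-oriented: reading just the "score" column still means parsing
# every character of every row to find the delimiter positions.
header = csv_rows[0].split(",")
score_index = header.index("score")
scores_from_csv = [row.split(",")[score_index] for row in csv_rows[1:]]

# Column-oriented: each column's values are stored contiguously, so a
# single column can be read without touching the others at all.
columns = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "score": [90, 85, 97],
}
scores_from_columns = columns["score"]  # one lookup, no parsing
```

The contiguous, homogeneous column storage is also what makes formats like Parquet compress so well.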
This can be a problem with services like AWS Athena, which can run SQL queries on data stored in CSV and other delimited files. Even if you only ask for a single column, Athena must scan the entire contents of the file. Athena’s only fee is based on the amount of data scanned, so running up the bill by processing unnecessary data is not the best idea.
The solution is a true columnar format. Columnar formats store data by column, much like a traditional relational database. Each column’s values are stored together, and because the data within a column is much more homogeneous, it compresses better. They are not exactly human-readable, but the applications that process them handle them just fine. And because there is less data to scan, they are actually much faster to process.
Because Athena only needs to scan one column to perform a select on that column, it drastically reduces costs, especially for larger data sets. If you have 10 columns in each file and only scan one, that is a 90% cost saving just from switching to Parquet.
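The arithmetic behind that claim can be spelled out. The rate and data set size below are made-up illustrative numbers, not actual AWS pricing:

```python
# Illustrative numbers only, not real AWS pricing: assume a hypothetical
# per-GB scan rate and a 100 GB data set with 10 equal-width columns.
price_per_gb = 0.005   # assumed rate, for illustration
dataset_gb = 100
num_columns = 10

# CSV: Athena scans the whole file regardless of the columns selected.
csv_cost = dataset_gb * price_per_gb

# Parquet: selecting one of ten equal columns scans roughly a tenth of
# the data (before compression, which typically shrinks it further).
parquet_cost = (dataset_gb / num_columns) * price_per_gb

savings = 1 - parquet_cost / csv_cost
print(f"Savings: {savings:.0%}")
```

In practice the saving depends on how wide the selected columns are relative to the rest, so treat 90% as an idealized best case for a 10-column table.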
Convert automatically with AWS Glue
AWS Glue is a tool from Amazon that converts data sets between formats. It is primarily used as part of a pipeline to process data stored in delimited and other formats and load it into databases for use with Athena. Although it can be set up to run automatically, you can also run it manually, and with a little tweaking it can be used to convert CSV files to the Parquet format.
Switch to the AWS Glue console and select “Get started”. Click “Add crawler” from the sidebar and create a new crawler. The crawler is configured to crawl data in S3 buckets and import it into a database for use in the conversion.
Give your crawler a name and choose to import data from a data store. Select S3 (although DynamoDB is another option) and enter the path to a folder that contains your files. If you only have one file you want to convert, put it in its own folder.
You will then be prompted to create an IAM role for your crawler to act as. Create the role, then select it from the list. You may need to press the refresh button next to the list for it to appear.
Select a database for the crawler to output to. If you have used Athena before, you can use your custom database, but if not, the default should work fine.
If you want to automate the process, you can give your crawler a schedule so that it runs regularly. If not, select manual mode and run it yourself from the console.
Once it is created, run the crawler to import data into the database you selected. If everything worked, you should see your file imported with the correct schema. The data types for each column are assigned automatically based on the source input.
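To get a feel for what automatic type assignment means, here is a toy sketch of per-column type inference. The real Glue crawler uses far more sophisticated classifiers; the function and sample rows below are invented for illustration:

```python
# Toy sketch of per-column type inference, loosely in the spirit of
# what a Glue crawler does when it assigns types to CSV columns.
def infer_type(values):
    """Return 'bigint', 'double', or 'string' for a column of CSV strings."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    if all(is_int(v) for v in values):
        return "bigint"
    if all(is_float(v) for v in values):
        return "double"
    return "string"

# Two sample CSV rows; zip(*rows) turns them into columns.
rows = [["1", "3.5", "alice"], ["2", "4.0", "bob"]]
schema = [infer_type(col) for col in zip(*rows)]
```

This is why it is worth checking the crawler’s output: a column of numeric-looking strings (say, zero-padded IDs) can be assigned a numeric type when you actually wanted a string.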
Once your data is in the AWS system, you can convert it. Switch to the “Jobs” tab of the Glue console and create a new job. Give it a name, add your IAM role, and select “A proposed script generated by AWS Glue” as the script the job runs.
Select your table on the next screen, then select “Change schema” to indicate that this job performs a conversion.
Then you need to select “Create tables in your data target”, specify Parquet as the format, and enter a new target path. Make sure this is an empty location with no other files in it.
Then you can edit the schema mapping for your file. The default is a one-to-one mapping of CSV columns to Parquet columns, which is probably what you want, but you can change it if you need to.
Create the job, and you will be taken to a page that lets you edit the Python script the job runs. The default script should work fine, so press “Save” and return to the Jobs tab.
In our testing, the script always failed unless the IAM role was specifically granted permission to write to the output location we specified. You may need to edit the permissions manually from the IAM Management Console if you encounter the same problem.
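If you hit that error, the fix is to attach a policy to the job’s IAM role allowing writes to the target path. The bucket name and prefix below are placeholders for illustration; substitute your own output location:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::example-output-bucket/parquet-output/*"
    }
  ]
}
```

Depending on your setup, the role may also need “s3:ListBucket” on the bucket itself; start minimal and add permissions as the job’s error messages demand.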
Otherwise, click “Run” and your script will start. It may take a minute or two to process, but you should see the status in the info panel. When it is done, you will see a new file created in S3.
This job can be configured to run off triggers set by the crawler that imports the data, so the whole process can be automated from start to finish. If you are importing server logs to S3 in this way, this can be an easy method of converting them to a more useful format.