Convert CSV file to Apache Parquet... with Drill
A very common use case when working with Hadoop is to store and query simple files (CSV, TSV, ...), and then convert them into a more efficient format, such as Apache Parquet, to get better performance and more compact storage.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. Apache Parquet has the following characteristics:
- Self-describing
- Columnar format
- Language-independent
How to convert CSV files into Parquet files?
You can do this in code, as shown in the ConvertUtils sample/test class. A simpler way is to use Apache Drill, which lets you save the result of a query as Parquet files. The following steps show how to convert a simple CSV file into a Parquet file using Drill.
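As a rough preview of the idea, the small Python helper below builds the kind of Drill SQL that saves a query result as Parquet: it sets the `store.format` session option and issues a CREATE TABLE AS statement over the CSV file. This is only a sketch under stated assumptions. The function name `build_drill_ctas`, the file path, the table name, and the column names are hypothetical. It also assumes the CSV has no header row (so Drill exposes each row as a `columns` array) and that the default writable `dfs.tmp` workspace is available.

```python
def build_drill_ctas(csv_path, table_name, column_names, workspace="dfs.tmp"):
    """Return Drill SQL that saves a headerless CSV as a Parquet table.

    All argument values are illustrative; adapt them to your own files.
    """
    # Drill exposes each row of a headerless CSV as an array named `columns`,
    # so every field is selected by position and given a readable alias.
    select_list = ",\n".join(
        f"  columns[{i}] AS `{name}`" for i, name in enumerate(column_names)
    )
    return (
        # Tell Drill to write CTAS results as Parquet files.
        "ALTER SESSION SET `store.format` = 'parquet';\n"
        # The table is written into the (assumed writable) workspace.
        f"CREATE TABLE {workspace}.`{table_name}` AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM dfs.`{csv_path}`;"
    )


if __name__ == "__main__":
    # Hypothetical example: a three-column CSV stored at /tmp/people.csv.
    print(build_drill_ctas("/tmp/people.csv", "people_parquet",
                           ["id", "name", "city"]))
```

The printed statements can be pasted into a Drill shell session. Every value selected from `columns` is a string, so in practice you would likely wrap fields in CAST expressions to give the Parquet columns proper types.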