I only use ORC tables in Hive, and while trying to understand some performance issues I wanted to make sure my tables were properly compressed. Well, the compressed attribute turned out to be false for all my tables, although I was pretty sure I had set everything up correctly, so I dug and experimented a bit.

I generated an easy to compress data set and loaded it into a few different tables with different options.

```bash
# create 1 csv, 500MB of easy to compress data
yes '1,longish string which will compress really well' | head -n 10000000 > /tmp/source.csv
hdfs dfs -copyFromLocal /tmp/source.csv /tmp/compressiontest/source.csv
```

Then I loaded this data into two tables, one compressed and one uncompressed, directed by the compression setting.

```sql
CREATE EXTERNAL TABLE sourcedata (id INT, s STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

CREATE TABLE shouldbecompressed (id INT, s STRING)

CREATE TABLE shouldbeuncompressed (id INT, s STRING)

INSERT INTO shouldbecompressed SELECT * FROM sourcedata

INSERT INTO shouldbeuncompressed SELECT * FROM sourcedata

SELECT COUNT(*) FROM shouldbeuncompressed
```

I still have compressed:false, but what happens on disk?

Hum, apparently both tables are compressed? It turned out that I had forgotten about an ORC parameter (orc.compress), set by default to ZLIB for me. The other valid values are SNAPPY or NONE. So let's try again:

```sql
CREATE TABLE shouldreallybecompressed (id INT, s STRING)

CREATE TABLE shouldreallybeuncompressed (id INT, s STRING)
LOCATION '/tmp/shouldreallybeuncompressed'

INSERT INTO shouldreallybecompressed SELECT * FROM sourcedata
SELECT COUNT(*) FROM shouldreallybecompressed

INSERT INTO shouldreallybeuncompressed SELECT * FROM sourcedata
SELECT COUNT(*) FROM shouldreallybeuncompressed
```

So indeed, the uncompressed table is less compressed, but it is still a far cry from the 500MB I expected. Long story short, ORC does some compression on its own, and the parameter orc.compress is just a cherry on top. On a side note, using SNAPPY instead of ZLIB the data size was 197k instead of 44k.

To look even deeper, hive on the command line has an option --orcfiledump, which will give some metadata about an ORC file.

```bash
hive --orcfiledump /tmp/shouldbecompressed/000007_0
# This is the buffer size, nothing to do with actual data size
hive --orcfiledump /tmp/shouldreallybeuncompressed/000000_0
```

Long story short, the output of desc extended regarding compression is useless.

This example was a bit artificial as the source file was very compressible. With another source file, more random, generated as follows:

```bash
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold | head -c 500000k | awk '' > source
```
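The awk program in that last command is empty as published. Purely as a guess at what it might have been, assuming the goal was to turn the random text into the same id,s CSV layout used earlier (the actual script is not in the post):

```bash
# Hypothetical reconstruction: prefix each 80-character line produced by
# fold with its line number, so the output matches the (id, s) CSV schema.
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold | head -c 500000k \
  | awk '{print NR "," $0}' > source.csv
```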
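For completeness, the compressed:false attribute discussed above comes from the table metadata. A minimal sketch of how to look at it from the Hive CLI or Beeline, assuming the table names from the experiment:

```sql
-- Dumps the table metadata; the Detailed Table Information section
-- contains the compressed attribute, which stayed false in the tests above.
DESCRIBE EXTENDED shouldbecompressed;

-- Same information, laid out in a more readable way.
DESCRIBE FORMATTED shouldbecompressed;
```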
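The CREATE TABLE statements above are shown without their storage clauses. As a hedged sketch, assuming ORC tables with orc.compress set explicitly (the default ZLIB for the compressed one, NONE for the uncompressed one), the second pair of tables could be written like this; only the LOCATION path of the uncompressed table actually appears in the post:

```sql
-- Sketch only: explicit ORC storage with the orc.compress table property.
-- ZLIB is the default codec mentioned above; NONE disables the ORC codec.
CREATE TABLE shouldreallybecompressed (id INT, s STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

CREATE TABLE shouldreallybeuncompressed (id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldreallybeuncompressed'
TBLPROPERTIES ('orc.compress' = 'NONE');
```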
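The on-disk sizes quoted above (197k with SNAPPY versus 44k with ZLIB) can be checked directly on HDFS. A small sketch, assuming the table directories sit under the /tmp paths used in the post; the path of the compressed table is an assumption, as it could also live under the Hive warehouse directory:

```bash
# Summarize the size of each table directory in human-readable units.
hdfs dfs -du -h /tmp/shouldreallybecompressed
hdfs dfs -du -h /tmp/shouldreallybeuncompressed

# Or list the individual ORC files, like the ones passed to orcfiledump above.
hdfs dfs -ls /tmp/shouldbecompressed
```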