I have an external table that uses the OpenCSVSerde. It points to one big file (4.9 GB). The file comes from a customer, so I have no control over its size or the number of files.
If I run counts on the file with TEZ, it reports about 7 million records. With MR it reports 38 million records, which is the correct number.
I also ran a query that joins tlog to some other tables and inserts the result into a Parquet table. With TEZ, only the 7 million rows get inserted, but with MR all 38 million rows get inserted.
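Roughly, the insert looked like this (tlog_parquet and item_dim are made-up names for illustration, not the actual schema):

-- sketch of the kind of insert that was run; table names are placeholders
insert into table tlog_parquet
select t.*
from tlog t
join item_dim i
  on t.Item = i.Item;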
I ran ANALYZE on the table, but it didn't change anything. In fact, the ANALYZE statement also came back with 7 million rows when run in TEZ.
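For reference, it was the standard statistics statement, something along these lines:

-- standard Hive table-level statistics
analyze table tlog compute statistics;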
Any ideas on why this is happening?
This is the table structure:
create external table if not exists tlog (
  tLogRow string,
  Transaction string,
  TransactionDate string,
  Item string,
  Site string,
  Shopper string,
  SaleType string,
  ReturnFlag string,
  SalesQty string,
  Currency string,
  COGS string,
  SalesValue string,
  Voucher string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
stored as textfile
location "wasb://someplace.blob.core.windows.net/tlog/"
tblproperties ("skip.header.line.count"="1", "serialization.null.format" = "");
The counts I tried were:
select count(*) from tlog;
select count(transaction) from tlog;
select count(*) from (select * from tlog) a;
All return 7 million in TEZ, but 38 million in MR.
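For anyone wanting to reproduce this, the engine can be switched per session with hive.execution.engine, so the same count can be run both ways back to back:

-- same query, two engines, in one Hive session
set hive.execution.engine=tez;
select count(*) from tlog;

set hive.execution.engine=mr;
select count(*) from tlog;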
Likewise, when I process the data into a new table, TEZ only picks up the 7 million records, while MR processes the entire 38 million records into the new table.
The output was several hundred lines long, but the counts added up to 7 million.
I also ran it in MR and got 38 million when I summed up all the counts.
I'm not an expert in Azure, but try with HDFS storage first: copy the file from wasb to the cluster's local HDFS and run the same test. It might be an issue with big files.
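Something like this should do for the test (paths are examples; dfs runs a Hadoop fs command from inside the Hive CLI, and hadoop distcp from a shell would be faster for a 4.9 GB file):

-- copy the big file from blob storage into cluster HDFS
dfs -cp wasb://someplace.blob.core.windows.net/tlog/ /tmp/tlog_hdfs/;

-- clone the table definition against the HDFS copy
-- (double-check that the serde and skip.header.line.count carried over;
--  otherwise reuse the full DDL above with the new location)
create external table tlog_hdfs like tlog location '/tmp/tlog_hdfs/';

-- compare this count under both TEZ and MR
select count(*) from tlog_hdfs;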