TEZ not returning all rows

New Contributor

Hi All,

I have an external table that uses the OpenCSVSerde. It points to one big file (4.9 GB). The file comes from a customer, so I have no control over the size or number of files.

If I run counts on the file in Tez, it says there are about 7 million records. In MR there are 38 million records, which is the correct count.

I also ran a query that joins this table to some other tables and inserts into a Parquet table. With Tez, only the 7 million rows get inserted, but with MR, all rows get inserted.

I ran analyze on the table, but it didn't change anything. In fact, the analyze statement itself came back with 7 million rows when run in Tez.

Any ideas on why this is happening?

Thanks!

John

Re: TEZ not returning all rows

@John Aherne

I don't think this is an issue with either Tez or MR. I believe it's some cache issue.

Is this a partitioned table? If yes, can you drop all the partitions and insert the data again?

Re: TEZ not returning all rows

New Contributor

No, the table is not partitioned. It is a plain external table that points to a folder with one big file.

Re: TEZ not returning all rows

@John Aherne

Can you please provide the details below?

1) The output of describe formatted <table name>

2) The count query you are running

Re: TEZ not returning all rows

New Contributor

This is the table structure:

create external table if not exists tlog
( tLogRow string,
Transaction string,
TransactionDate string,
Item string,
Site string,
Shopper string,
SaleType string,
ReturnFlag string,
SalesQty string,
Currency string,
COGS string,
SalesValue string,
Voucher string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar'     = '"',
   'escapeChar'    = '\\'
)
stored as textfile
location "wasb://someplace.blob.core.windows.net/tlog/"
tblproperties("skip.header.line.count"="1","serialization.null.format" = "");

The counts I tried were:

select count(*) from tlog;

select count(transaction) from tlog;

select count(*) from (select * from tlog) a;

All three return 7 million in Tez, but 38 million in MR.

Also, when I process the data into a new table, Tez only loads 7 million records, while MR loads all 38 million records into the new table.
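For anyone reproducing this comparison, the following is a minimal sketch of running the same count under both engines in a single session. It assumes the engine is switched with the standard `hive.execution.engine` setting (the post does not say how it was done), and it turns off `hive.compute.query.using.stats` so that `count(*)` scans the file rather than being answered from table statistics, which is one way to check the "cache" theory raised earlier in the thread.

```sql
-- Force count(*) to read the data instead of using stored table statistics
SET hive.compute.query.using.stats=false;

-- Count under Tez
SET hive.execution.engine=tez;
SELECT count(*) FROM tlog;

-- Same query, same session, under MapReduce
SET hive.execution.engine=mr;
SELECT count(*) FROM tlog;
```

If the two numbers still differ with statistics disabled, stale metadata is unlikely to be the explanation, which would match the fact that `count(transaction)` also differs here.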

Re: TEZ not returning all rows

Can you please run the Hive query below on both MR and Tez and send me the output?

SELECT Site,count(1) FROM tlog group by Site;

Re: TEZ not returning all rows

New Contributor

The output is several hundred lines long, but in Tez the counts add up to 7 million.

I also ran it in MR, and got 38 million when I summed up all the counts.
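Since the per-site output is too long to compare by eye, one option is to save each engine's result and diff the two in Hive. This is only a sketch: the table names `site_counts_tez` and `site_counts_mr` are hypothetical, and it assumes you have permission to create managed tables in the current database.

```sql
-- Per-site counts as seen by Tez
SET hive.execution.engine=tez;
CREATE TABLE site_counts_tez AS
SELECT Site, count(1) AS cnt FROM tlog GROUP BY Site;

-- Per-site counts as seen by MapReduce
SET hive.execution.engine=mr;
CREATE TABLE site_counts_mr AS
SELECT Site, count(1) AS cnt FROM tlog GROUP BY Site;

-- Sites that are missing under Tez or have a different count.
-- Note: a NULL Site never matches in the join, so it will always be listed.
SELECT m.Site, m.cnt AS mr_cnt, t.cnt AS tez_cnt
FROM site_counts_mr m
LEFT JOIN site_counts_tez t ON m.Site = t.Site
WHERE t.cnt IS NULL OR t.cnt <> m.cnt;
```

Whether Tez drops whole sites or undercounts every site somewhat may hint at whether a contiguous part of the file is being skipped.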

Re: TEZ not returning all rows

Expert Contributor

I'm not an expert in Azure, but try with HDFS storage first. Just copy the file from wasb to the cluster's local HDFS and run the same test. It might be an issue with big files.
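A rough sketch of that test from the Hive CLI, using its built-in `dfs` command. The HDFS directory `/tmp/tlog_hdfs` and the table name `tlog_hdfs` are made up for illustration, and the cluster needs enough free HDFS space for the 4.9 GB file.

```sql
-- Copy the file from blob storage into the cluster's HDFS
dfs -mkdir -p /tmp/tlog_hdfs;
dfs -cp wasb://someplace.blob.core.windows.net/tlog/* /tmp/tlog_hdfs/;

-- Same schema and SerDe as tlog, but reading from HDFS
CREATE EXTERNAL TABLE tlog_hdfs LIKE tlog LOCATION '/tmp/tlog_hdfs';

-- Check that skip.header.line.count and the SerDe properties carried over
DESCRIBE FORMATTED tlog_hdfs;

-- Repeat the count under both engines
SET hive.execution.engine=tez;
SELECT count(*) FROM tlog_hdfs;

SET hive.execution.engine=mr;
SELECT count(*) FROM tlog_hdfs;
```

If both engines agree on the HDFS copy, that points toward how Tez reads the large file from wasb rather than toward the SerDe or the data itself.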
