Hive Streaming API is 10x slower in 5.6.0 comparing with 5.4.2


 

I recently set up a new CDH 5.6.0 cluster and migrated all my Hive data from the previous CDH 5.4.2 cluster.

The migration was simple:

1. Copied all files from hdfs://old_hdfs/user/hive/warehouse to hdfs://new_hdfs/user/hive/warehouse

2. Dumped the metadata from the old metastore MySQL database and restored it into the new metastore MySQL database (and also updated the URLs stored in the metadata from the old name node to the new name node)

 

 

 

The new cluster has more powerful infrastructure than the old one, with more CPUs and memory. I also set up HA for HDFS to make sure it is ready for production use.

Everything looked good until I found that my Hive streaming program was running at really low throughput on the CDH 5.6.0 cluster compared with the 5.4.2 one.

My streaming program reads data from Kafka and writes it to Hive with the HCatalog Streaming API, following https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest.

 

The code looks similar to the following:

 

StreamingConnection connection2 = hiveEP.newConnection(true);
DelimitedInputWriter writer2 =
                     new DelimitedInputWriter(fieldNames, ",", hiveEP);
// Fetch the batch from the same connection that was just opened
TransactionBatch txnBatch2 = connection2.fetchTransactionBatch(10, writer2); // step1
 
 
///// Batch 1 - First TXN
txnBatch2.beginNextTransaction(); //step2
txnBatch2.write("21,Venkat Ranganathan".getBytes());
txnBatch2.write("22,Bowen Zhang".getBytes());
txnBatch2.commit(); //step3
 

When writing 10 rows in a batch:

1. In 5.4.2, beginNextTransaction() and commit() took roughly 40 ms and 11 ms, respectively.

2. In 5.6.0, they took roughly 250 ms and 174 ms.

 

Why is streaming performance so much slower in 5.6.0? Does anyone have an idea how to resolve this?

 

 

 

 

 

 

 

 


Re: Hive Streaming API is 10x slower in 5.6.0 comparing with 5.4.2

I ran some tests today. Here is my test code:

 

    private void writeBatch(List<String> list) throws Exception {

        StreamingConnection connection = hiveEP.newConnection(true, conf);
        StrictJsonWriter writer = new StrictJsonWriter(hiveEP, conf);
        long tsStart = System.currentTimeMillis();

        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);

        long tsFetch = System.currentTimeMillis();

        txnBatch.beginNextTransaction();

        long tsNextTx = System.currentTimeMillis();

        for (String item : list) {
            try {
                txnBatch.write(item.getBytes());
            } catch (Exception e) {
                System.err.println(Thread.currentThread().getName()+" write item failed:" + item);
                e.printStackTrace();
                continue;
            }
        }


        long tsWrite = System.currentTimeMillis();

        txnBatch.commit();

        long tsCmt = System.currentTimeMillis();

        txnBatch.close();
        long tsBtClose = System.currentTimeMillis();

        long elpTotal = tsCmt - tsStart;
        long elpFetch = tsFetch - tsStart;
        long elpNextTx = tsNextTx - tsFetch;
        long elpWrite = tsWrite - tsNextTx;
        long elpCommit = tsCmt - tsWrite;

        System.out.println(Thread.currentThread().getName()+" write batch, size:" + list.size() + " total:" + elpTotal
                + "ms, " +
                "fetch:" + elpFetch +
                "ms, " +
                "nextTx:" + elpNextTx + "ms, write:" + elpWrite + "ms, commit:" + elpCommit + "ms, close:" + (tsBtClose - tsCmt) + "ms");
    }
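The same per-phase measurement can be factored into a small helper so that timings accumulate across several transactions instead of a single pass. This is only a sketch: `PhaseTimer` and `Step` are hypothetical names, not part of the Hive API, and `Thread.sleep` stands in for the real `beginNextTransaction()`, `write()` and `commit()` calls. It uses `System.nanoTime()`, which is intended for measuring elapsed time, rather than `System.currentTimeMillis()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class PhaseTimer {
    /** A unit of work that may throw, e.g. a call into the streaming API. */
    interface Step {
        void run() throws Exception;
    }

    // Accumulated elapsed nanoseconds per phase, in first-seen order.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the step and adds its elapsed time to the named phase, even if it throws. */
    void time(String phase, Step step) throws Exception {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            elapsedNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total milliseconds recorded for a phase, or 0 if the phase was never timed. */
    long millis(String phase) {
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos.getOrDefault(phase, 0L));
    }

    /** Formats the totals in the same "phase:NNms" style as the test output. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (String phase : elapsedNanos.keySet()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(phase).append(':').append(millis(phase)).append("ms");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        PhaseTimer timer = new PhaseTimer();
        // Placeholder work: replace each sleep with the corresponding streaming call.
        for (int txn = 0; txn < 3; txn++) {
            timer.time("nextTx", () -> Thread.sleep(5));
            timer.time("write", () -> Thread.sleep(10));
            timer.time("commit", () -> Thread.sleep(5));
        }
        System.out.println(timer.report());
    }
}
```

Accumulating over several transactions makes it easier to tell a one-off cost (for example, a first call that opens files) from a cost paid on every transaction.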

 

The DDL:

 

create table if not exists test_counter(
	 name string,
	 ts bigint,
	 sid string,
	 value int
)  partitioned by (date int) clustered by (sid) into 3 buckets stored as orc tblproperties ("orc.compress"="SNAPPY", "transactional"="true");

 

I tested the performance by writing a list of 179491 rows to the table 'test_counter'. The results are shown below:

 

Test result on CDH 5.4.2:

Thread-2 write batch, size:179491 total:19292ms, fetch:902ms, nextTx:12ms, write:18119ms, commit:259ms, close:52ms

 

Test result on CDH 5.6.0:

Thread-2 write batch, size:179491 total:31428ms, fetch:700ms, nextTx:1228ms, write:28922ms, commit:578ms, close:91ms

 

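The results show that:

1. TransactionBatch.beginNextTransaction() is about 100 times slower (12 ms vs. 1228 ms), while StreamingConnection.fetchTransactionBatch() itself is not slower (902 ms vs. 700 ms)

2. TransactionBatch.write() is about 1.6 times slower (18119 ms vs. 28922 ms)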
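To double-check which phases actually regressed, the per-phase ratios can be computed directly from the two log lines above. The following is a minimal sketch; the class name `PhaseSlowdown` and its methods are made up for illustration, and the only inputs are the two output lines quoted in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhaseSlowdown {
    // Matches "<phase>:<millis>ms" pairs such as "fetch:902ms".
    // "size:179491" is skipped because it is not followed by "ms".
    private static final Pattern PHASE = Pattern.compile("(\\w+):(\\d+)ms");

    /** Extracts phase name -> elapsed milliseconds from one log line. */
    static Map<String, Long> parse(String logLine) {
        Map<String, Long> phases = new LinkedHashMap<>();
        Matcher m = PHASE.matcher(logLine);
        while (m.find()) {
            phases.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return phases;
    }

    /** Returns newRun/oldRun per phase; values above 1.0 mean the new run is slower. */
    static Map<String, Double> slowdown(Map<String, Long> oldRun, Map<String, Long> newRun) {
        Map<String, Double> ratios = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : oldRun.entrySet()) {
            Long newMs = newRun.get(e.getKey());
            // Skip phases missing from the new run or with a zero baseline.
            if (newMs != null && e.getValue() > 0) {
                ratios.put(e.getKey(), (double) newMs / e.getValue());
            }
        }
        return ratios;
    }

    public static void main(String[] args) {
        String cdh542 = "Thread-2 write batch, size:179491 total:19292ms, fetch:902ms, "
                + "nextTx:12ms, write:18119ms, commit:259ms, close:52ms";
        String cdh560 = "Thread-2 write batch, size:179491 total:31428ms, fetch:700ms, "
                + "nextTx:1228ms, write:28922ms, commit:578ms, close:91ms";

        Map<String, Double> ratios = slowdown(parse(cdh542), parse(cdh560));
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            System.out.printf("%-7s %.2fx%n", e.getKey(), e.getValue());
        }
    }
}
```

Comparing ratios per phase rather than eyeballing the totals makes it clearer where the extra time goes: a large ratio on a phase with a small absolute time matters mostly when that phase runs once per transaction.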

 

I also compared the performance of the HDDs that HDFS runs on in the old and new clusters, and it showed no difference.

 

What could be the reason for the slowdown? Could it be HDFS HA?