08-09-2017 09:48 AM
I have little experience in Hive and am currently learning Spark with Scala; I am working with HDP 2.6. I am curious to know whether Hive on Tez is really faster than SparkSQL. I searched many forums for test results, but they compared older versions of Spark and most of them were written in 2015. The main points are summarized below:
- ORC will do about the same as Parquet does in Spark
- The Tez engine will give performance comparable to the Spark engine
- Joins are better/faster in Hive than in Spark

I feel that Hortonworks supports Hive more than Spark, and Cloudera vice versa. Initially I thought Spark would be faster than anything because of its in-memory execution, but after reading some articles I learned that the existing Hive is also being improved with new concepts like Tez, ORC, and LLAP.

We currently run PL/SQL on Oracle and are migrating to big data since volumes are increasing. My requirements are ETL-style batch processing; the data details involved in every weekly batch run are listed below, and the data volume will grow considerably soon:
- Input/lookup data are in csv/text formats and are loaded into tables
- Two input tables, each with 5 million rows and 30 columns
- 30 lookup tables used to generate each column of the output table, which contains around 10 million rows and 220 columns
- Multiple joins, both inner and left outer, since many lookup tables are used

Kindly advise which of the two methods below I should choose for better performance, with good readability and easy inclusion of minor column updates for future production deployment.

Method 1:
- Hive on Tez with ORC tables
- Python UDF through the TRANSFORM option (a sketch follows below)
- Joins with performance tuning such as map join

Method 2:
- SparkSQL with Parquet format, converted from the text/csv input
- Scala for UDFs
- Hopefully we can perform multiple inner and left outer joins in Spark
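For reference, here is a minimal sketch of the kind of Python streaming script Method 1 would use with TRANSFORM. The column names and the derivation logic are hypothetical placeholders, not the actual schema; the point is only the contract Hive imposes: tab-delimited rows arrive on stdin and tab-delimited rows must be written to stdout.

#!/usr/bin/env python
# transform_columns.py -- hypothetical sketch of a Hive TRANSFORM script.
# Hive would invoke it roughly like this (column lists are placeholders):
#   ADD FILE transform_columns.py;
#   SELECT TRANSFORM (id, amount, code)
#   USING 'python transform_columns.py'
#   AS (id, amount, derived_flag)
#   FROM input_table;
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    id_col, amount, code = fields[0], fields[1], fields[2]
    # placeholder derivation: flag rows whose code starts with 'A'
    derived_flag = '1' if code.startswith('A') else '0'
    print('\t'.join([id_col, amount, derived_flag]))

By default each output column comes back to Hive as a string, so any casting to the ORC table's column types still happens on the Hive side.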
Labels:
- Apache Hive
- Apache Spark
- Apache Tez
06-05-2017 12:31 PM
Hi, I am facing a Broken pipe issue in Hive while streaming a larger number of rows (33000) into Python after multiple joins. The same script works fine up to 7656 rows.

0: jdbc:hive2://xxx..xx.xx.xxx:10000> insert overwrite table test.transform_inner_temp select * from test.transform_inner_temp_view2 limit 7656;
INFO : Session is already open
INFO : Dag name: insert overwrite table test.transf...7656(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_xxxxxxxxxxx_xxxx)
INFO : Map 1: 0(+2)/2 Reducer 2: 0/1
INFO : Map 1: 1(+1)/2 Reducer 2: 0(+1)/1
INFO : Map 1: 1(+1)/2 Reducer 2: 0(+1)/1
INFO : Map 1: 1(+1)/2 Reducer 2: 0(+1)/1
INFO : Map 1: 1(+1)/2 Reducer 2: 0(+1)/1
INFO : Map 1: 1(+1)/2 Reducer 2: 0(+1)/1
INFO : Map 1: 2/2 Reducer 2: 0(+1)/1
INFO : Map 1: 2/2 Reducer 2: 1/1
INFO : Loading data to table test.transform_inner_temp from hdfs://az-xxxxxxxxxxx.com:8020/apps/hive/warehouse/test.db/transform_inner_temp/.hive-staging_hive_2017-06-05_10-58-09_863_348122508386369397-929/-ext-10000
INFO : Table test.transform_inner_temp stats: [numFiles=1, numRows=7656, totalSize=4447368, rawDataSize=4439712]
No rows affected (19.867 seconds)
0: jdbc:hive2://xxx.xx.xxx.xxxx:10000> insert overwrite table test.transform_inner_temp select * from test.transform_inner_temp_view2;
INFO : Session is already open
INFO : Dag name: insert overwrite tabl...orm_inner_temp_view2(Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_xxxxxxxxxx_xxxxx)
INFO : Map 1: -/-
INFO : Map 1: 0/2
INFO : Map 1: 0/2
INFO : Map 1: 0(+2)/2
INFO : Map 1: 0(+2)/2
INFO : Map 1: 1(+1)/2
INFO : Map 1: 1(+1)/2
INFO : Map 1: 1(+1)/2
INFO : Map 1: 1(+1)/2
INFO : Map 1: 1(+1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-1)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-2)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
INFO : Map 1: 1(+1,-3)/2
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1491563674918_58393_1_00, diagnostics=[Task failed, taskId=task_1491563674918_58393_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
... 17 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]: An error occurred while reading or writing to your custom script. It may have crashed with an error.
at org.apache.hadoop.hive.ql.exec.ScriptOperator.process(ScriptOperator.java:456)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:170)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:555)
... 18 more
Caused by: java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
at org.apache.hadoop.hive.ql.exec.ScriptOperator.process(ScriptOperator.java:425)
... 24 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1491563674918_58393_1_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
I could see the same kind of issue, though with a join condition alone, in the link below: https://community.hortonworks.com/questions/11025/hive-query-issue.html I tried the solution suggested there by setting hive.vectorized.execution.enabled=false; but the issue remains the same in my case. I thought the issue might be in the multiple joins, so I wrote all the join results into a single table and tried to stream that into Python, and the issue is still the same. If the script were wrong, it shouldn't work fine with the smaller number of rows, so I hope the script is fine and the issue is something with the Hive setup. I have put a lot of effort into searching the internet but couldn't find a similar issue. Please provide your suggestions/solutions. Apache Hive (version 1.2.1000.2.5.3.0-37) with beeline.
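A frequent cause of this particular error (a common pattern, not something confirmed from these logs) is that the Python script itself dies on a row that only appears in the larger result set; once the child process exits, Hive's writer gets the Broken pipe when it keeps streaming rows to it, which would also explain why the 7656-row LIMIT succeeds. A defensive variant of the streaming script, sketched below with placeholder logic, keeps the pipe alive and reports the offending row to stderr, which is captured in the Tez task attempt logs:

#!/usr/bin/env python
# Hypothetical defensive TRANSFORM script: never let one bad row kill
# the process, because a dead child is what produces the Broken pipe
# on the Hive side. Per-row failures are reported to stderr instead,
# which ends up in the Tez task attempt logs.
import sys

for line_no, line in enumerate(sys.stdin, 1):
    try:
        fields = line.rstrip('\n').split('\t')
        # ... the real per-row transformation would go here ...
        print('\t'.join(fields))
    except Exception as e:
        # skip (or NULL-fill) the row rather than crashing the pipe
        sys.stderr.write('row %d failed: %s\n' % (line_no, e))

With logging like this in place, the task attempt's stderr usually shows which row breaks the original script, for example a row with fewer columns than expected or an unexpected character encoding that only occurs past the first few thousand rows.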
Labels:
- Apache Hive