Member since: 10-13-2016
Posts: 68
Kudos Received: 10
Solutions: 3

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 943 | 02-15-2019 11:50 AM
 | 2807 | 10-12-2017 02:03 PM
 | 341 | 10-13-2016 11:52 AM
10-01-2019
05:11 AM
I am trying to run a Hive query with pyspark. I am using Hortonworks, so I need to use the Hive Warehouse Connector. Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set up some memory configuration (set hive.tez.container.size=8192). For these statements to take effect, they need to run in the same session as the main query, and that is my issue. I tried 2 ways. The first one was to generate a new Hive session for each query, with a properly set up URL, e.g. url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant':
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so it does not work. The second way is to set spark.sql.hive.thriftServer.singleSession=true globally in the Spark Thrift Server. This does seem to work, but I find it a shame to limit the global Spark Thrift Server for the benefit of one application only. Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, and hopefully to one session?
09-04-2019
10:30 PM
Thanks, you nailed it indeed: set hiveconf:tez.am.container.reuse.enabled=false; did the trick.
05-03-2019
01:50 PM
This query throws an NPE. The tasks with NPEs are retried and most of the time (but not always) end up succeeding. I could not find a smaller query showing my problem, so I give here my full query:
select
    s.ts_utc as sent_dowhour
    , o.ts_utc as open_dowhour
    , sum(count(s.ts_utc)) over(partition by s.ts_utc) as sent_count
from vault.sent s
left join open o on
    o.id = s.id
group by 1, 2
My guess is that the construction sum(count(...)) over(partition by ...) has issues. When it fails, this is the output I get:
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1556016846110_42971_7_03, diagnostics=
» Task failed, taskId=task_1556016846110_42971_7_03_000221, diagnostics=
» TaskAttempt 0 failed, info=
» Error: Error while running task ( failure ) : attempt_1556016846110_42971_7_03_000221_0:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:304)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:318)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
... 16 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:378)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:795)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363)
... 19 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.persistence.PTFRowContainer.first(PTFRowContainer.java:115)
at org.apache.hadoop.hive.ql.exec.PTFPartition.iterator(PTFPartition.java:114)
at org.apache.hadoop.hive.ql.udf.ptf.BasePartitionEvaluator.getPartitionAgg(BasePartitionEvaluator.java:200)
at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateFunctionOnPartition(WindowingTableFunction.java:155)
at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:538)
at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:349)
at org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:123)
at org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:994)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:940)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:927)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1050)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:850)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:724)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:790)
... 20 more
Semantically my query is valid (and indeed sometimes succeeds), so what is going on?
Note: HDP 3.1, Hive 3, ORC tables, ORC intermediate results, Tez.
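For reference, a possible rewrite I could fall back to (an untested sketch, not a confirmed fix) is to pre-aggregate in a subquery and only then apply the window function, so the count is no longer nested inside sum(...) over(...):
-- Untested sketch: aggregate first, then window over the pre-aggregated rows.
select
    sent_dowhour
    , open_dowhour
    , sum(cnt) over(partition by sent_dowhour) as sent_count
from (
    select
        s.ts_utc as sent_dowhour
        , o.ts_utc as open_dowhour
        , count(s.ts_utc) as cnt
    from vault.sent s
    left join open o on
        o.id = s.id
    group by s.ts_utc, o.ts_utc
) agg;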
02-15-2019
11:50 AM
The ODBC driver does not support all syntax niceties (no CTEs), and if there is a syntax error it will output a completely irrelevant message, which adds a lot to the confusion. To see the actual error, you need to enable ODBC logging and look at the log files.
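For example (an illustrative rewrite only), a query written with a CTE generally has to be rephrased as a derived table before the driver accepts it:
-- CTE form, not understood by the ODBC driver:
with init as (select 1 as lic, 2 as cpg) select * from init;
-- Equivalent derived-table form:
select * from (select 1 as lic, 2 as cpg) as init;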
02-15-2019
11:17 AM
@Anika S 2 years later, I have the same issue. Did you manage to fix it? If so, how?
01-28-2019
07:07 AM
Context: Hive 3, HDP 3.1. Tests done with Python/ODBC (official HDP driver) under Windows and Linux. I ran the following queries:
1) "select ? as lic, ? as cpg"
2) "select * from (select ? as lic, ? as cpg) as t"
3) "with init as (select ? as lic, ? as cpg) select * from init"
1) and 2) work fine and give me the expected result. 3) gives me a ParseException:
Error while compiling statement: FAILED: ParseException line 1:21 cannot recognize input near '?' 'as' 'lic' in select clause (80) (SQLPrepare)
The exact same statements run with Java/JDBC work fine. Note that 2) looks like a workaround for 3), but it works for this tiny example, not for bigger queries. Is there something I can do to have ODBC working as expected? Alternatively, where can I find the limits of the ODBC driver? For full context, the full test code is as follows:
import pyodbc

cnxnstr = 'DSN=HiveProd'
cnxn = pyodbc.connect(cnxnstr, autocommit=True)
cursor = cnxn.cursor()
queries = [
    "with init as (select ? as lic, ? as cpg) select * from init",
    "select 2 * ? as lic, ? as cpg",
    "select * from (select ? as lic, ? as cpg) as t",
]
for q in queries:
    print("\nExecuting " + q)
    try:
        cursor.execute(q, '1', '2')
    except pyodbc.ProgrammingError as e:
        print(e)
        continue
Labels: Apache Hive
01-11-2019
05:26 AM
In addition to these steps: restart the Ambari server (we had one instance where it looked like the application was OK but the alert was cached and kept being displayed), and check your YARN logs. If there is not enough memory for YARN, the service will not be able to start.
01-11-2019
05:24 AM
1 Kudo
You will lose some job history, but nothing else and certainly no data, so it should not be an issue.
01-10-2019
08:50 AM
2 Kudos
It worked for me eventually after cleaning up *everything*:
- destroying the app and cleaning HDFS as explained here: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/data-operating-system/content/remove_ats_hbase_before_switching_between_clusters.html
- cleaning ZooKeeper: zookeeper-client rmr /atsv2-hbase-unsecure
Finally, restarting *all* YARN services from Ambari did the trick.
01-07-2019
05:01 AM
I got it working by:
- cleaning up the HDFS directories of hbase-ats
- cleaning up the ZooKeeper nodes related to ats-hbase
I hope there are better ways, but that is the only one I found that worked.
11-22-2018
06:37 PM
@Aditya Sirna, you are right, HBase runs as a service (is_hbase_system_service_launch is true). I am giving examples with nodeN, which are the names of my data nodes; this is based on what I see right now and makes it easier to understand. The region server (node5) tries to report for duty but fails: it tries to connect to node1:17020, but port 17020 is only open on node5. On node1 the HBase master tried to start, but stopped because it apparently cannot find the active namenode:
Failed get of master address: java.io.IOException: Can't get master address from ZooKeeper; znode data == null
I will look into ZooKeeper, it seems to ring a bell. I have 2 questions if you don't mind:
- how do you start a YARN service on a specific node?
- how does the timelinereader know where to connect?
In any case thanks, you gave me some ideas to carry on.
11-22-2018
03:07 PM
Hello, I have a new HDP 3.0.1 installation with ats-hbase running embedded (with a proper queue configured, as per the documentation). At the end of all tasks (seen with the Hive compactor, Oozie steps), I have hundreds of lines with:
org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
ending up with:
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1542872934100_0068
org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Call From null to prod-nl-dpnode3.dmdelivery.local:33602 failed on socket timeout exception
Looking at /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager I have a lot of lines with:
Call exception, tries=7, retries=7, started=8194 ms ago, cancelled=false, msg=Call to xxxxx/192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: prod-nl-dpnode1.dmdelivery.local/192.168.36.161:17020, details=row 'prod.timelineservice.entity,hive!yarn-cluster!xxxx-34-compactor-vault.contact.license_name=lectiva!^?�����@@!^?����d��^?���!MAPREDUCE_TASK_ATTEMPT!^?�����!attempt_1542205428050_2307_m_000461_0,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1542294270073, seqNum=-1
Looking at /var/log/hadoop-yarn/yarn/hadoop-yarn-timelinereader, I see:
Connection refused: dpnode1/192.168.36.161:17020
Indeed, there is no HBase on dpnode1. HBase does run on dpnode5 (or another one, depending on YARN restart), but in any case the timelinereader does not know which server to reach and always goes to one seemingly hardcoded hostname. How can I tell YARN to use the right node to connect to HBase? Thanks,
11-13-2018
06:55 AM
Eventually, after a restart of everything (not only the services seen as requiring a restart) it went OK.
11-01-2018
10:57 AM
Hello, I installed a new (not an update) HDP 3.0.1 and seem to have many issues with the timeline server.
1) The first weird thing is that the YARN tab in Ambari keeps showing this error:
ATSv2 HBase Application
The HBase application reported a 'STARTED' state. Check took 2.125s
2) The second issue seems to be with Oozie. Running a job, it starts but stalls with the following log repeated hundreds of times:
2018-11-01 11:15:37,842 INFO [Thread-82] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
Then with:
2018-11-01 11:15:37,888 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Exception while publishing configs on JOB_SUBMITTED Event for the job : job_1541066376053_0066
org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.publishConfigsOnJobSubmittedEvent(JobHistoryEventHandler.java:1254)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1414)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out 3) In hadoop-yarn-timelineserver-${hostname}.log I see, repeated many times: 2018-11-01 11:32:47,715 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1541066376053_0144_2 (TEZ_DAG_ID): 6 4) In hadoop-yarn-timelinereader-${hostname}.log I see, repeated many times: Thu Nov 01 11:34:10 CET 2018, RpcRetryingCaller{globalStartTime=1541068444076, pause=1000, maxAttempts=4}, java.net.ConnectException: Call to /192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.x.x:17020
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:145)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
... 3 more
Caused by: java.net.ConnectException: Call to /192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.x.x:17020
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165)
and indeed, there is nothing listening on port 17020 on 192.168.x.x.
5) I cannot find on any server a process named ats-hbase; this might be the reason for everything else. The queue setting is just yarn_hbase_system_service_queue_name=default, which has no limit that would prevent HBase from starting. I am sure that something is very wrong here, and any help would be appreciated.
10-29-2018
05:38 AM
I was using Oozie 4.2.0 (HDP 2.6) and am now trying Oozie 4.3.1 (HDP 3.0). One major difference is that a Java action could in the past read its jar from the local filesystem, but it looks like this is not possible anymore. The Java action is basic:
<java xmlns="uri:oozie:workflow:0.5">
<job-tracker>http://something.local:8050</job-tracker>
<name-node>hdfs://HdfsNameService</name-node>
<main-class>io.JsonPoster</main-class>
<file>file:///opt/jsonposter/jsonposter.jar</file>
</java>
The error I get is quite clear:
org.apache.oozie.action.ActionExecutorException: UnsupportedOperationException: Accessing local file system is not allowed
I know I could put the jar on HDFS, but I am trying to avoid that for now (because it used to work and all our deployments are done via rpm). I am ready to take the responsibility of keeping all jars in sync across all datanodes. I already set oozie.service.HadoopAccessorService.supported.filesystems to *, with no effect. Is there a way to tell Oozie that yes, I am happy for it to read the local FS?
04-12-2018
01:21 PM
@rtrivedi Thanks for your answer, but I believe that's not the issue. I tried a lot of variations with the Hive `delete jar` command, to no avail:
delete jar hdfs:///myudfs/myfunc.jar;
list jar; -- gives a localised jar
delete jar $localised_jar;
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';
And I end up having the same error again.
04-12-2018
09:59 AM
I created my own (generic) UDF, which works very well when added in Hive:
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';
After a while I wanted to update my UDF, so I created a new jar with the same name and put it on HDFS, overwriting the old jar. Lo and behold, I cannot use my function again! It does not matter if I first do a:
drop function if exists myfunc;
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';
From beeline, I get one of these error messages:
java.io.IOException: Previous writer likely failed to write hdfs://ip-10-0-10-xxx.eu-west-1.compute.internal:8020/tmp/hive/hive/_tez_session_dir/0de6055d-190d-41ee-9acb-c6b402969940/hmyfunc.jar Failing because I am unlikely to write too.
or
org.apache.hadoop.hive.ql.metadata.HiveException: Default queue should always be returned.Hence we should not be here.
Looking at the logs, it looks like Hive is localising the jar file (good), but as a session is reused, if the new jar does not match the jar already present in the localised directory, Hive will complain and apparently wait indefinitely. If my understanding is correct, is there a way to tell Tez not to reuse any of the current sessions? If my understanding is not correct, is there a way to do what I want? Context: hdp 2.6.0.3, no LLAP, on AWS. Thanks,
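For what it is worth, a possible workaround I am considering (an untested sketch; the versioned file name myfunc-v2.jar is made up) is to never overwrite the jar in place but to publish it under a new name, so a reused session can never hold a stale localised copy of the same path:
drop function if exists myfunc;
-- hypothetical versioned jar name, uploaded alongside the old one instead of overwriting it
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc-v2.jar';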
Labels: Apache Hive
03-09-2018
12:35 PM
I have a query, always failing with the following error:
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173) [...]
The query itself is quite a small MERGE (other, much bigger queries work flawlessly):
MERGE INTO summary dst USING (
SELECT
e.id1
, e.id2
, e.id3
, e.name
, e.subject
FROM
mailing e
) src
ON
dst.id1 = src.id1
AND dst.id2 = src.id2
AND dst.id3 = src.id3
WHEN MATCHED
THEN UPDATE SET
name = src.name
, subject=src.subject
The source table has 1.7M rows (50M on disk), the destination has 75M rows (1.5GB on disk).
Both are ACID tables, ORC.
On the image, map 1 is the one with the issue, and I cannot understand why it has only one task. Naively I would think that more tasks would each have a smaller load and would work better, but I did not manage to do that.
Note that I have already maxed out all the memory parameters; I cannot do more on those:
yarn-site/yarn.nodemanager.resource.memory-mb = 24064
yarn-site/yarn.scheduler.minimum-allocation-mb = 1024
yarn-site/yarn.scheduler.maximum-allocation-mb = 24064
mapred-site/mapreduce.map.memory.mb = 4096
mapred-site/mapreduce.reduce.memory.mb = 8192
mapred-site/mapreduce.map.java.opts = 3276
mapred-site/mapreduce.reduce.java.opts = 6553
hive-site/hive.tez.container.size = 4096
Is there a way to increase the number of tasks in the mapper, or another way to avoid this out of memory error?
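For reference, this is the kind of change I have in mind (a sketch only; it assumes the Tez split-grouping settings are what drive the number of map tasks here):
-- Assumption: smaller grouped splits should yield more, smaller map tasks for this statement.
set tez.grouping.min-size=16777216;    -- 16 MB
set tez.grouping.max-size=134217728;   -- 128 MB
-- then run the same MERGE INTO summary ... statement as above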
Labels: Apache Hive, Apache Tez
01-16-2018
06:46 AM
@Jordan Moore Not really relevant to the question, but no, this is not the point. The use case here is data export, where some clients have their own BI tools, processes and so on. They just need the data: CSV files in a zip. Other clients do not have this in place and access this data differently.
... View more
01-15-2018
06:08 AM
The zip file is the output of the process, not to be read in HDFS anymore; it will just end up being downloaded and sent to a user. In this context using zip makes sense, as I am only looking at *compressing* multiple CSV files together, not reading them afterwards. Using beeline with formatted output is what I do currently, but I end up downloading multiple gigabytes locally, compressing and re-uploading. This is a waste and could actually fill my local disks up. Using coalesce in Spark is the best option I found, but the compression step is still not easy. Thanks!
... View more
01-09-2018
01:45 PM
1 Kudo
My end goal is to run a few Hive queries, get one CSV file (with headers) per query, compress all those files together into one zip (not gzip or bzip, unfortunately, as it needs to open natively under Windows) and hopefully get the zip back into HDFS. My current solution (CTAS) ends up creating one directory per table, with possibly multiple files under it (depending on the number of reducers and the presence/absence of UNION). I can also easily generate a header file per table with only one line in it. Now how do I put all that together? The only option I could find involves doing all the processing locally (hdfs dfs -getmerge followed by an actual zip command). This adds a lot of overhead and could technically fill up the local disk. So my questions are:
- is there a way to concatenate files inside HDFS without getting them locally?
- is there a way to compress a bunch of files together (not individually) into a zip, inside HDFS?
Thanks
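To make the current approach concrete, this is roughly what one per-query CTAS step looks like (a simplified sketch; export_t and the selected columns are made-up names):
-- Simplified sketch: each query becomes a delimited text table, i.e. a directory in HDFS
-- that may contain several files.
CREATE TABLE export_t
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS
SELECT id, email, lang
FROM vault.contact;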
Tags: Data Processing, HDFS
Labels: Apache Hadoop
12-11-2017
01:54 PM
HDP 2.6.0, the HMS has 6GB of memory, the metastore database is MySQL. After a few days the server hosting the HMS has its CPU at 100%, Hive queries are slow, and looking at the GC logs the HMS is constantly having stop-the-world events. Restarting the metastore 'fixes' the problem for a few days. I have found a few JIRA tickets related to memory leaks: https://issues.apache.org/jira/browse/HIVE-15551 or https://issues.apache.org/jira/browse/HIVE-13749 . Is it a known issue in HDP 2.6.0? Is it known whether the latest HDP version fixes it? Thanks,
Labels: Apache Hive
10-24-2017
04:21 AM
Then I interpret this setting as "if there is too much data, let's use it all instead of pruning it" and am very confused 🙂 I suppose it's due to Hive's internal implementation, as you said.
... View more
10-23-2017
07:41 PM
I indeed see:
INFO [HiveServer2-Handler-Pool: Thread-107]: optimizer.RemoveDynamicPruningBySize (RemoveDynamicPruningBySize.java:process(61)) - Disabling dynamic pruning for: TS. Expected data size is too big: 1119008712
So if I understand well, this has to do with event size and not data size? I did try setting the value very high to enable pruning; pruning did indeed occur, but locking all partitions timed out. Will post an explain ASAP.
... View more
10-23-2017
07:32 PM
I added this explain.
... View more
10-23-2017
05:05 AM
hive.tez.dynamic.partition.pruning: already globally true
hive.optimize.ppd: true by default, I explicitly set it to true
hive.optimize.index.filter: false by default, I set it to true
I set hive.tez.bucket.pruning to true as well. I think that my issue is related to https://community.hortonworks.com/questions/142167/why-not-set-hivetezdynamicpartitionpruningmaxdatas.html
Thanks for your help!
... View more
10-20-2017
09:55 AM
Context: I have an issue with a MERGE statement which does not use the partitions of the destination table. Looking for solutions, I stumbled upon this JIRA ticket, which introduced 3 new configuration options (in Hive 0.14):
hive.tez.dynamic.partition.pruning: default true
hive.tez.dynamic.partition.pruning.max.event.size: default 1*1024*1024L
hive.tez.dynamic.partition.pruning.max.data.size: default 100*1024*1024L
Now I wonder: why should I not just set these variables to the maximum value possible, to make sure that partition pruning always happens? Pruning is disabled if the data size is too big, but I find that counter-intuitive, as not pruning will massively increase the data size. Cheers,
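To make the question concrete, this is the kind of session-level override I am wondering about (a sketch only; the values are arbitrary large bounds, not recommendations):
set hive.tez.dynamic.partition.pruning=true;
-- arbitrary large thresholds so that pruning is never disabled because of size
set hive.tez.dynamic.partition.pruning.max.event.size=1073741824;   -- 1 GB
set hive.tez.dynamic.partition.pruning.max.data.size=107374182400;  -- 100 GB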
Labels: Apache Hive, Apache Tez
10-20-2017
05:53 AM
Thanks @Eugene Koifman. You are of course right; the issue is that I updated the MERGE text without reflecting the edits in the CREATE (now fixed in the question). I am indeed using the partitions in the MERGE:
...
ON dst.license_name = src.license_name AND dst.campaign_id = src.campaign_id
... but as far as I can tell, the pruning does not happen.
10-19-2017
12:27 PM
I have 2 tables with the same structure:
CREATE TABLE IF NOT EXISTS src (
-- DISTINCT field (+ partitions)
id BIGINT
-- other fields
, email STRING
, domain STRING
, lang STRING
, mobile_nr STRING
, custom_fields STRING
, groups array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>
, ts_utc TIMESTAMP
, sys_schema_version INT
, sys_server_ipv4 BIGINT
, sys_server_name STRING
)
PARTITIONED BY (
license_name STRING
, campaign_id INT
)
CLUSTERED BY (id)
INTO 64 BUCKETS
STORED AS ORC
One is a source table (basically recreated from scratch with new data for each merge, so needs to be fully reprocessed everytime) one is the destination table, which will grow. Both tables have the same partitions and bucket definitions. When I EXPLAIN the MERGE statement, which has a join on the partitions and the bucketed field, I cannot see any partition pruning happening. set hive.merge.cardinality.check=false;
set hive.tez.exec.print.summary=true;
set tez.user.explain=true;
explain MERGE INTO
-- default.2steps_false_64_1
vault.contact
dst
USING default.2steps_2steps_false_64_1 src
ON
dst.license_name = src.license_name
AND dst.campaign_id = src.campaign_id
AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
AND dst.updated_on_utc < src.ts_utc
THEN UPDATE SET
-- other fields
email = src.email
, city = src.city
, lang = src.lang
, mobile_nr = src.mobile_nr
, custom_fields = src.custom_fields
, groups = src.groups
, updated_on_utc = src.ts_utc
, sys_schema_version = src.sys_schema_version
, sys_server_ipv4 = src.sys_server_ipv4
, sys_server_name = src.sys_server_name
WHEN NOT MATCHED THEN INSERT VALUES (
src.id
, src.email
, src.city
, src.lang
, src.mobile_nr
, src.custom_fields
, src.groups
, src.ts_utc
, src.ts_utc
, NULL -- deleted_on
, src.sys_schema_version
, src.sys_server_ipv4
, src.sys_server_name
, src.license_name
, src.campaign_id
)
;
+-----------------------------------------------------------------------------------------------------------------------------+--+
| Explain
+-----------------------------------------------------------------------------------------------------------------------------+--+
| Vertex dependency in root stage
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 5 (SIMPLE_EDGE)
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
| Reducer 4 <- Reducer 2 (SIMPLE_EDGE)
|
| Stage-5
| Stats-Aggr Operator
| Stage-0
| Move Operator
| partition:{}
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Stage-3
| Dependency Collection{}
| Stage-2
| Reducer 3
| File Output Operator [FS_904]
| compressed:true
| Statistics:Num rows: 496014 Data size: 166660704 Basic stats: COMPLETE Column stats: PARTIAL
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Select Operator [SEL_901]
| | outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15"]
| | Statistics:Num rows: 496014 Data size: 166660704 Basic stats: COMPLETE Column stats: PARTIAL
| |<-Reducer 2 [SIMPLE_EDGE]
| Reduce Output Operator [RS_900]
| key expressions:_col0 (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
| Map-reduce partition columns:UDFToInteger(_col0) (type: int)
| sort order:+
| Statistics:Num rows: 496014 Data size: 257927280 Basic stats: COMPLETE Column stats: PARTIAL
| value expressions:_col1 (type: bigint), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col8 (type: timestamp), _col9 (type: timestamp), _col10 (type: timestamp), _col11 (type: int), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: bigint)
| Select Operator [SEL_899]
| outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15"]
| Statistics:Num rows: 496014 Data size: 257927280 Basic stats: COMPLETE Column stats: PARTIAL
| Filter Operator [FIL_906]
| predicate:((_col13 = _col29) and (_col14 = _col30) and (_col0 = _col18) and (_col7 < _col25)) (type: boolean)
| Statistics:Num rows: 496014 Data size: 226182384 Basic stats: COMPLETE Column stats: PARTIAL
| Merge Join Operator [MERGEJOIN_916]
| | condition map:[{"":"Right Outer Join0 to 1"}]
| | keys:{"0":"license_name (type: string), campaign_id (type: bigint), id (type: bigint)","1":"license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)"}
| | outputColumnNames:["_col0","_col7","_col8","_col9","_col13","_col14","_col17","_col18","_col19","_col20","_col21","_col22","_col23","_col24","_col25","_col26","_col27","_col28","_col29","_col30"]
| | Statistics:Num rows: 11904348 Data size: 25284835152 Basic stats: COMPLETE Column stats: PARTIAL
| |<-Map 1 [SIMPLE_EDGE]
| | Reduce Output Operator [RS_889]
| | key expressions:license_name (type: string), campaign_id (type: bigint), id (type: bigint)
| | Map-reduce partition columns:license_name (type: string), campaign_id(type: bigint), id (type: bigint)
| | sort order:+++
| | Statistics:Num rows: 129102910 Data size: 16525280556 Basic stats: COMPLETE Column stats: PARTIAL
| | value expressions:updated_on_utc (type: timestamp), created_on_utc (type: timestamp), deleted_on_utc (type: timestamp), ROW__ID (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
| | TableScan [TS_887]
| | ACID table:true
| | alias:dst
| | Statistics:Num rows: 129102910 Data size: 16525280556 Basic stats: COMPLETE Column stats: PARTIAL
| |<-Map 5 [SIMPLE_EDGE]
| Reduce Output Operator [RS_890]
| key expressions:license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)
| Map-reduce partition columns:license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)
| sort order:+++
| Statistics:Num rows: 11904348 Data size: 29935728348 Basic stats: COMPLETE Column stats: PARTIAL
| value expressions:email (type: string), city (type: string), lang (type: string), mobile_nr (type: string), custom_fields (type: string), groups (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), ts_utc (type: timestamp), sys_schema_version (type: int), sys_server_ipv4 (type: bigint), sys_server_name (type: string), campaign_id (type: int)
| TableScan [TS_888]
| alias:src
| Statistics:Num rows: 11904348 Data size: 29935728348 Basic stats: COMPLETE Column stats: PARTIAL
| Reducer 4
| File Output Operator [FS_897]
| compressed:true
| Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Select Operator [SEL_895]
| | outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
| | Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| |<-Reducer 2 [SIMPLE_EDGE]
| Reduce Output Operator [RS_894]
| Map-reduce partition columns:_col0 (type: bigint)
| sort order:
| Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| value expressions:_col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col7 (type: timestamp), _col10 (type: int), _col11 (type: bigint), _col12 (type: string), _col13 (type: string), _col14 (type: int)
| Select Operator [SEL_893]
| outputColumnNames:["_col0","_col1","_col10","_col11","_col12","_col13","_col14","_col2","_col3","_col4","_col5","_col6","_col7"]
| Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| Filter Operator [FIL_907]
| predicate:(_col13 is null and _col14 is null and _col0 is null) (type: boolean)
| Statistics:Num rows: 1 Data size: 456 Basic stats: COMPLETE Column stats: PARTIAL
| Please refer to the previous Merge Join Operator [MERGEJOIN_916]
| Stage-4
| Stats-Aggr Operator
| Stage-1
| Move Operator
| partition:{}
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Please refer to the previous Stage-3
|
Other explain with hive.explain.user=true 0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set hive.merge.cardinality.check=false;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- set hive.tez.dynamic.partition.pruning=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- set hive.tez.dynamic.partition.pruning.max.data.size=107374182400; -- 100GB
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set hive.tez.exec.print.summary=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set tez.user.explain=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set hive.explain.user=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> explain MERGE INTO
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- default.2steps_false_64_1
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> vault.contact
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> dst
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> USING default.2steps_2steps_false_64_1 src
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> ON
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> dst.license_name = src.license_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> AND dst.campaign_id = src.campaign_id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> AND dst.id = src.id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> AND dst.license_name = 'baarn'
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- On match: keep latest loaded
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> WHEN MATCHED
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> AND dst.updated_on_utc < src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> THEN UPDATE SET
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- other fields
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> email = src.email
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , domain = src.domain
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , lang = src.lang
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , mobile_nr = src.mobile_nr
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , custom_fields = src.custom_fields
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , groups = src.groups
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , updated_on_utc = src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , sys_schema_version = src.sys_schema_version
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , sys_server_ipv4 = src.sys_server_ipv4
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , sys_server_name = src.sys_server_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> WHEN NOT MATCHED THEN INSERT VALUES (
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> src.id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.email
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.domain
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.lang
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.mobile_nr
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.custom_fields
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.groups
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , NULL -- deleted_on
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.sys_schema_version
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.sys_server_ipv4
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.sys_server_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.license_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> , src.campaign_id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> )
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> ;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Explain
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Vertex dependency in root stage
| Map 2 <- Map 1 (BROADCAST_EDGE)
| Reducer 3 <- Map 2 (SIMPLE_EDGE)
| Reducer 4 <- Map 2 (SIMPLE_EDGE)
|
| Stage-5
| Stats-Aggr Operator
| Stage-0
| Move Operator
| partition:{}
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Stage-3
| Dependency Collection{}
| Stage-2
| Reducer 3
| File Output Operator [FS_1461]
| compressed:true
| Statistics:Num rows: 496014 Data size: 123507486 Basic stats: COMPLETE Column stats: PARTIAL
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Select Operator [SEL_1458]
| | outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15"]
| | Statistics:Num rows: 496014 Data size: 123507486 Basic stats: COMPLETE Column stats: PARTIAL
| |<-Map 2 [SIMPLE_EDGE]
| Reduce Output Operator [RS_1457]
| key expressions:_col0 (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
| Map-reduce partition columns:UDFToInteger(_col0) (type: int)
| sort order:+
| Statistics:Num rows: 496014 Data size: 79362240 Basic stats: COMPLETE Column stats: PARTIAL
| value expressions:_col1 (type: bigint), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col8 (type: timestamp), _col9 (type: timestamp), _col10 (type: timestamp), _col11 (type: int), _col12 (type: bigint), _col13 (type: string), _col15 (type: bigint)
| Select Operator [SEL_1456]
| outputColumnNames:["_col0","_col1","_col10","_col11","_col12","_col13","_col15","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
| Statistics:Num rows: 496014 Data size: 79362240 Basic stats: COMPLETE Column stats: PARTIAL
| Filter Operator [FIL_1463]
| predicate:((_col13 = 'baarn') and (_col13 = _col29) and (_col14 = _col30) and (_col0 = _col18) and (_col7 < _col25)) (type: boolean)
| Statistics:Num rows: 496014 Data size: 179061054 Basic stats: COMPLETE Column stats: PARTIAL
| Map Join Operator [MAPJOIN_1474]
| | condition map:[{"":"Right Outer Join0 to 1"}]
| | HybridGraceHashJoin:true
| | keys:{"Map 2":"license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)","Map 1":"license_name (type: string), campaign_id (type: bigint), id (type: bigint)"}
| | outputColumnNames:["_col0","_col7","_col8","_col9","_col13","_col14","_col17","_col18","_col19","_col20","_col21","_col22","_col23","_col24","_col25","_col26","_col27","_col28","_col29","_col30"]
| | Statistics:Num rows: 11904348 Data size: 24153922092 Basic stats: COMPLETE Column stats: PARTIAL
| |<-Map 1 [BROADCAST_EDGE]
| | Reduce Output Operator [RS_1446]
| | key expressions:license_name (type: string), campaign_id (type: bigint), id (type: bigint)
| | Map-reduce partition columns:license_name (type: string), campaign_id (type: bigint), id (type: bigint)
| | sort order:+++
| | Statistics:Num rows: 621448 Data size: 79546063 Basic stats: COMPLETE Column stats: PARTIAL
| | value expressions:updated_on_utc (type: timestamp), created_on_utc (type: timestamp), deleted_on_utc (type: timestamp), ROW__ID (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
| | TableScan [TS_1443]
| | ACID table:true
| | alias:dst
| | Statistics:Num rows: 621448 Data size: 79546063 Basic stats: COMPLETE Column stats: PARTIAL
| |<-TableScan [TS_1444]
| alias:src
| Statistics:Num rows: 11904348 Data size: 29935728348 Basic stats: COMPLETE Column stats: PARTIAL
| Reduce Output Operator [RS_1451]
| Map-reduce partition columns:_col0 (type: bigint)
| sort order:
| Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| value expressions:_col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col7 (type: timestamp), _col10 (type: int), _col11 (type: bigint), _col12 (type: string), _col13 (type: string), _col14 (type: int)
| Select Operator [SEL_1450]
| outputColumnNames:["_col0","_col1","_col10","_col11","_col12","_col13","_col14","_col2","_col3","_col4","_col5","_col6","_col7"]
| Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| Filter Operator [FIL_1464]
| predicate:(_col13 is null and _col14 is null and _col0 is null) (type: boolean)
| Statistics:Num rows: 1 Data size: 361 Basic stats: COMPLETE Column stats: PARTIAL
| Please refer to the previous Map Join Operator [MAPJOIN_1474]
| Reducer 4
| File Output Operator [FS_1454]
| compressed:true
| Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Select Operator [SEL_1452]
| | outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
| | Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
| |<- Please refer to the previous Map 2 [SIMPLE_EDGE]
| Stage-4
| Stats-Aggr Operator
| Stage-1
| Move Operator
| partition:{}
| table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
| Please refer to the previous Stage-3
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>
Could anybody confirm or refute, from this explain, whether the partitions are properly pruned? HDP 2.6, small (4 nodes) AWS cluster.
Labels: Apache Hive
10-12-2017
02:03 PM
The answer pointed to at https://community.hortonworks.com/questions/57795/how-to-fix-under-replicated-blocks-fasly-its-take.html is the right one. Those are undocumented features in Hadoop 2.7, but they can be set up and used, and now I do see that replication is sped up.