Member since: 10-13-2016
Posts: 68
Kudos Received: 9
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 512 | 02-15-2019 11:50 AM
 | 1614 | 10-12-2017 02:03 PM
 | 164 | 10-13-2016 11:52 AM
11-18-2019
03:18 AM
I have a merge statement and was looking at how to make it faster. Inside the using part of the statement there is a row_number() function used for deduplication. In the logs I see:
INFO physical.Vectorizer (:()) - Reduce vectorized: false
INFO physical.Vectorizer (:()) - Reduce notVectorizedReason: PTF operator: ROW_NUMBER not in supported functions [avg, count, dense_rank, first_value, last_value, max, min, rank, row_number, sum]
This log statement does not seem right: ROW_NUMBER not in [row_number]? For my peace of mind I tried both uppercase and lowercase row_number, but it made no difference. Is there anything I could do to get vectorisation and row_number working together? This is with Hive 3.1.0 from HDP 3.1.4.
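For reference, the shape of the statement is roughly the following (table and column names are made up for illustration; only the row_number() deduplication in the using clause matters):
merge into target_table dst
using (
  select id, payload
  from (
    select
      id
    , payload
    , row_number() over (partition by id order by load_ts desc) as rn
    from staging_table
  ) dedup
  where rn = 1
) src
on dst.id = src.id
when matched then update set payload = src.payload
when not matched then insert values (src.id, src.payload);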
11-15-2019
02:11 AM
Consider this example.
Preparation:
create temporary table opens as (
select stack(1,
1 , cast ( '2019-11-13 08:07:28' as timestamp)
) as (id , load_ts )
);
Queries: these just count the number of rows, with filters that always match, and sometimes a sort by. 1 is always the expected result.
select count(*) from ( select * from opens) t;
select count(*) from ( select * from opens sort by id) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' ) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' sort by id) t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' ) t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t;
The last query (sort by and <= on timestamp) returns 0 rows. I believe this is the cause of other issues I have, where rows go missing in queries with the timestamp filter (but without the explicit sort by). Note that if I use a CTE for opens instead of a temporary table, the issue does not appear. I tried workarounds (inverting the order of the operands, adding not or not not) to no avail. One thing that did work is to explicitly cast the string to a timestamp:
select count(*) from ( select * from opens where load_ts <= cast ( '2019-11-13 09:07:00' as timestamp) sort by id) t;
That might be good practice anyway, but there is still a discrepancy in how >= and <= are handled, or in how sort by works. Note: this is Hive from HDP 3.1.4, without LLAP.
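For comparison, this is roughly what I mean by the CTE variant, which does return the expected 1 for me:
with opens as (
select stack(1,
1 , cast ( '2019-11-13 08:07:28' as timestamp)
) as (id , load_ts )
)
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t;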
11-14-2019
05:43 AM
Since the upgrade from HDP 3.1.0 to 3.1.4, I have an issue in Hive that I do not understand. Note that I am only using ORC transactional tables.
For instance this (simplified) query:
with cte as (
select
e.type
, c.json
, c.id
from event e
join contact as c on c.id=e.contact_id
)
select
type
, id
, cv.customfield
from cte
lateral view outer
json_tuple(cte.json, 'customfield') cv AS `customfield`
It worked perfectly before the upgrade.
Now, even though the CTE returns a certain number of rows, adding the lateral view silently drops rows from
the resultset, without any error, and there is no extra where clause outside the CTE (in my real example, the query returns 66 rows without the lateral view, but only 19 with it).
Another extremely surprising thing is that inside the CTE there is an if statement, for instance: `if(contact.is_deleted is null, 'true', 'false')`. If I replace the `is null` with `is not distinct from null`, which should be perfectly valid, no rows are returned by the CTE.
I tried quite a few variations:
- I extracted one row from the contact table to create a second table with only one contact (`create table contact2 as select * from contact where id=42`). I get the exact same behaviour, so I ruled out corrupted data.
- If I replace the event table with a static CTE (`select stack(1, ...)`) I get the result I expect.
- If I remove the lateral view, I get the number of rows I expect (as long as I do not use `is not distinct from`).
- If I create and use a temporary table instead of a CTE, the outcome does not change.
I am completely at a loss: I have no idea why this happens, nor how I can trust Hive.
I cannot replicate the error by generating manual data so I cannot give a (not) working example.
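For what it is worth, the static-CTE variant I mentioned above (the one that does return the rows I expect) looks roughly like this, with made-up values:
with cte as (
select stack(1,
'sometype', '{"customfield": "somevalue"}', 42
) as (type, json, id)
)
select
type
, id
, cv.customfield
from cte
lateral view outer
json_tuple(cte.json, 'customfield') cv AS `customfield`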
11-05-2019
05:31 AM
Hello,
I upgraded from HDP 3.1.0 to 3.1.4. It went relatively well, except for a few queries with lateral views. An example would be:
with j as (
select 1 as d, '{"relatienummer": 42, "notrelevant": 1}' as rn
union all
select 2, null
union all
select 3, '{"notrelevant": 1}' as rn
union all
select 4, '{"relatienummer": 42, "notrelevant": 1}' as rn
)
select
d
from j
lateral view json_tuple(rn, 'relatienummer') cv as `relatienummer`
This example is a bit small and actually works; the same type of query based on a few joined ORC tables fails.
The query succeeds according to Hive but returns 0 rows. If I remove the `lateral view` (`outer` or not) I get all the rows; if I add the lateral view (even if I do not use it in the select part) I get 0 rows.
There is nothing in the logs, and the query worked perfectly before the upgrade.
If instead of using a CTE (or a subquery) I were to make a temporary table, and then use this temporary table, then the lateral view would work.
Any idea what could happen and how to fix this?
10-01-2019
05:11 AM
I am trying to run a Hive query with pyspark. I am using Hortonworks, so I need to use the Hive WarehouseConnector. Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI:
set hive.query.name=something relevant
or to set up some memory configuration:
set hive.tez.container.size = 8192
For these statements to take effect, they need to run in the same session as the main query, and that is my issue. I tried 2 ways. The first one was to generate a new Hive session for each query, with a properly set up url, e.g.:
url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
This works well for the first query, but the same url is reused for the next one (even if I try to manually delete builder and hive), so it does not work. The second way is to set spark.sql.hive.thriftServer.singleSession=true globally in the Spark thrift server. This does seem to work, but I find it a shame to constrain the global Spark thrift server for the benefit of one application only. Is there a way to achieve what I am looking for? Maybe there is a way to pin a query to one executor, and hopefully to one session?
09-04-2019
10:30 PM
Thanks, you nailed it indeed: set hiveconf:tez.am.container.reuse.enabled=false; did the trick.
05-03-2019
01:50 PM
This query outputs NPEs. The tasks with NPEs are retried, and most of the time (but not always) they end up succeeding. I could not find a smaller query showing my problem, so here is my full query:
select
s.ts_utc as sent_dowhour
, o.ts_utc as open_dowhour
, sum(count(s.ts_utc)) over(partition by s.ts_utc) as sent_count
from vault.sent s
left join open o on
o.id=s.id
group by 1, 2
My guess is that the construction sum(count(...)) over(partition by ...) has issues. When it fails, this is the output I get:
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1556016846110_42971_7_03, diagnostics=
» Task failed, taskId=task_1556016846110_42971_7_03_000221, diagnostics=
» TaskAttempt 0 failed, info=
» Error: Error while running task ( failure ) : attempt_1556016846110_42971_7_03_000221_0:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:304)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:318)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
... 16 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:378)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:795)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363)
... 19 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.persistence.PTFRowContainer.first(PTFRowContainer.java:115)
at org.apache.hadoop.hive.ql.exec.PTFPartition.iterator(PTFPartition.java:114)
at org.apache.hadoop.hive.ql.udf.ptf.BasePartitionEvaluator.getPartitionAgg(BasePartitionEvaluator.java:200)
at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateFunctionOnPartition(WindowingTableFunction.java:155)
at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:538)
at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:349)
at org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:123)
at org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:994)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:940)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:927)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1050)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:850)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:724)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:790)
... 20 more
Semantically my query is valid (and indeed it sometimes succeeds), so what is going on?
Note: HDP 3.1, Hive 3, ORC tables, ORC intermediate results, Tez.
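For reference, a rewrite I am considering, to avoid nesting the aggregate inside the window function (same logic, with the count moved to a subquery):
select
  sent_dowhour
, open_dowhour
, sum(cnt) over (partition by sent_dowhour) as sent_count
from (
  select
    s.ts_utc as sent_dowhour
  , o.ts_utc as open_dowhour
  , count(s.ts_utc) as cnt
  from vault.sent s
  left join open o on o.id = s.id
  group by s.ts_utc, o.ts_utc
) t;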
02-15-2019
01:31 PM
Short version: how can I get the difference in seconds between 2 timestamps via the ODBC driver?
Long version: using ODBC for a simple query (note that I use cast(... as timestamp) to get a standalone example; the actual query runs against a table with timestamp data):
select unix_timestamp(cast('2019-02-01 01:02:03' as timestamp)) as tto
I got the error message:
unix_timestamp is not a valid scalar function or procedure call
I could not find any configuration option that would change this. Native query is disabled (because I am using prepared statements), and other functions work fine. My guess is that unix_timestamp() (without a parameter) is deprecated, and the driver is a bit enthusiastic about preventing use of the function. To work around the problem, I cast the timestamp as bigint instead of using the unix_timestamp function:
select cast(cast('2019-02-01 01:02:03' as timestamp) as bigint)
This works fine! But when I try to get the difference of 2 timestamps:
select cast(cast('2019-02-01 01:02:03' as timestamp) as bigint) - cast(cast('2019-02-01 01:02:03' as timestamp) as bigint)
I get the message: Operand types SQL_WCHAR and SQL_WCHAR are incompatible for the binary minus operator (but only for complex queries, not if the query consists solely of this select). The driver will accept a diff between 2 timestamps, but then I end up with an interval type, which I cannot convert back to seconds. I would consider these bugs in the ODBC driver, but I cannot contact Hortonworks because I am not a paying customer, and I cannot contact Simba either, for the same reason. Any idea how I could get around this?
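For context, the real query boils down to this shape (table and column names here are made up), and this is where the minus operator gets rejected:
select
  cast(open_ts as bigint) - cast(sent_ts as bigint) as diff_seconds
from some_table_with_timestamps;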
02-15-2019
11:50 AM
The ODBC driver does not support all syntax niceties (no CTEs), and if there is a syntax error it will output a completely irrelevant message, which adds a lot to the confusion. To see the actual error, you need to enable ODBC logging and look at the log files.
02-15-2019
11:17 AM
@Anika S 2 years later, I have the same issue. Did you manage to fix it? If so, how?
01-28-2019
12:17 PM
I want to use the new KafkaStorageHandler, which looks awesome. The only thing is that the Avro in Kafka is not standard (I am looking at you, Confluent), so I need to use my own serde (well, the serde from Confluent). I added to Hive, without error, the relevant jar which contains io.confluent.kafka.streams.serdes.avro.GenericAvroSerde:
add jar hdfs:///tmp/kafka-streams-avro-serde-5.1.0.jar;
If I now try to create the external table:
CREATE EXTERNAL TABLE click ( ... )
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
"kafka.topic" = "click", "kafka.bootstrap.servers"="kafka:9092"
,"kafka.serde.class"="io.confluent.kafka.streams.serdes.avro.GenericAvroSerde"
);
Hive bails out and says:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: java.lang.ClassNotFoundException: io.confluent.kafka.streams.serdes.avro.GenericAvroSerde (state=08S01,code=1)
How can I tell Hive where to find this class? Thanks,
01-28-2019
07:07 AM
Context: Hive3, HDP 3.1. Tests done with Python/odbc (official HDP driver) under Windows and Linux. I ran the following queries:
"select ? as lic, ? as cpg" "select * from (select ? as lic, ? as cpg) as t" "with init as (select ? as lic, ? as cpg) select * from init", 1) and 2) work fine, and give me the expected result. 3 gives me a ParseException :
Error while compiling statement: FAILED: ParseException line 1:21 cannot recognize input near '?' 'as' 'lic' in select clause (80) (SQLPrepare)") The exact same statements ran with java/jdbc work fine. Note that 2) looks like is a workaround for 3) but it works for this tiny example, not for bigger queries. Is there something I can do to have ODBC working as expected? Alternatively, where can I find the limits of the ODBC driver? For full context, the full test code is as follow: cnxnstr = 'DSN=HiveProd'
cnxn = pyodbc.connect(cnxnstr, autocommit=True)
cursor = cnxn.cursor()
queries = [
"with init as (select ? as lic, ? as cpg) select * from init",
"select 2 * ? as lic, ? as cpg",
"select * from (select ? as lic, ? as cpg) as t",
]
for q in queries:
print("\nExecuting " + q)
try:
cursor.execute(q, '1', '2')
except pyodbc.ProgrammingError as e:
print(e)
continue
01-11-2019
05:26 AM
In addition to these steps: restart the Ambari server (we had one instance where it looked like the application was OK but the alert was cached and kept being displayed), and check your YARN logs. If there is not enough memory for YARN, the service will not be able to start.
01-11-2019
05:24 AM
1 Kudo
You will lose some job history, but nothing else and certainly no data, so it should not be an issue.
01-10-2019
08:50 AM
2 Kudos
It worked for me eventually after cleaning up *everything*:
- destroying the app and cleaning HDFS as explained here: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/data-operating-system/content/remove_ats_hbase_before_switching_between_clusters.html
- cleaning ZooKeeper: zookeeper-client rmr /atsv2-hbase-unsecure
Finally, restarting *all* YARN services from Ambari did the trick.
01-07-2019
05:01 AM
I got it working by:
- cleaning up the HDFS directories of hbase-ats
- cleaning up the ZooKeeper nodes related to hbase-hdfs
I hope there are better ways, but that is the only one I found that worked.
12-19-2018
01:35 PM
In short: I have a working Hive on HDP3, which I cannot reach from pyspark running under YARN (on the same HDP). How do I get pyspark to find my tables? spark.catalog.listDatabases() only shows default, and no query I run shows up in my Hive logs. This is my code, with Spark 2.3.1:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
settings = []
conf = SparkConf().setAppName("Guillaume is here").setAll(settings)
spark = (
SparkSession
.builder
.master('yarn')
.config(conf=conf)
.enableHiveSupport()
.getOrCreate()
)
print(spark.catalog.listDatabases())
Note that `settings` is empty. I thought this would be sufficient, because in the logs I see:
loading hive config file: file:/etc/spark2/3.0.1.0-187/0/hive-site.xml
and, more interestingly:
Registering function intersectgroups io.x.x.IntersectGroups
This is a UDF I wrote and added to Hive manually, so there is some sort of connection being made. The only output I get (apart from logs) is:
[Database(name=u'default', description=u'default database', locationUri=u'hdfs://HdfsNameService/apps/spark/warehouse')]
I understand that I should set `spark.sql.warehouse.dir` in settings. Whether I set it to the value I find in hive-site.xml, to the path of the database I am interested in (it is not in the default location), or to its parent, nothing changes. I put many other config options in settings (including the thrift uris), with no change. I have also seen that I should copy hive-site.xml into the spark2 conf dir. I did that on all nodes of my cluster, with no change. My command to run is:
HDP_VERSION=3.0.1.0-187 PYTHONPATH=.:/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip SPARK_HOME=/usr/hdp/current/spark2-client HADOOP_USER_NAME=hive spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip --files /etc/hive/conf/hive-site.xml ./subjanal/anal.py
11-22-2018
06:37 PM
@Aditya Sirna, you are right, HBase runs as a service (is_hbase_system_service_launch is true). I am giving an example with nodeN, which are the names of my data nodes; this is based on what I see right now and makes it easier to understand. The region server (node5) tries to report for duty but fails: it tries to connect to node1:17020, but port 17020 is only open on node5. On node1 the HBase master tried to start, but stopped because it apparently cannot find the active namenode:
Failed get of master address: java.io.IOException: Can't get master address from ZooKeeper; znode data == null
I will look into ZooKeeper, it seems to ring a bell. I have 2 questions if you don't mind:
- how do you start a YARN service on a specific node?
- how does the timelinereader know where to connect?
In any case thanks, you gave me some ideas to carry on with.
11-22-2018
03:07 PM
Hello, I have a new HDP 3.0.1 installation with ats-hbase running embedded (with the proper queue configured, as per the documentation). At the end of all tasks (seen with the Hive compactor and Oozie steps), I have hundreds of lines with:
org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
ending with:
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1542872934100_0068
org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Call From null to prod-nl-dpnode3.dmdelivery.local:33602 failed on socket timeout exception
Looking at /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager I have a lot of lines with:
Call exception, tries=7, retries=7, started=8194 ms ago, cancelled=false, msg=Call to xxxxx/192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: prod-nl-dpnode1.dmdelivery.local/192.168.36.161:17020, details=row 'prod.timelineservice.entity,hive!yarn-cluster!xxxx-34-compactor-vault.contact.license_name=lectiva!^?�����@@!^?����d��^?���!MAPREDUCE_TASK_ATTEMPT!^?�����!attempt_1542205428050_2307_m_000461_0,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1542294270073, seqNum=-1
Looking at /var/log/hadoop-yarn/yarn/hadoop-yarn-timelinereader, I see:
Connection refused: dpnode1/192.168.36.161:17020
Indeed, there is no HBase on dpnode1. HBase does run on dpnode5 (or another node, depending on the YARN restart), but in any case the timelinereader does not know which server to reach, and always goes to one seemingly hardcoded hostname. How can I tell YARN to use the right node to connect to HBase? Thanks,
11-13-2018
06:55 AM
Eventually, after a restart of everything (not only the services seen as requiring a restart) it went OK.
11-01-2018
10:57 AM
Hello, I installed a new (not an upgrade) HDP 3.0.1 and seem to have many issues with the timeline server.
1) The first weird thing is that the YARN tab in Ambari keeps showing this error:
ATSv2 HBase Application
The HBase application reported a 'STARTED' state. Check took 2.125s
2) The second issue seems to be with Oozie. Running a job, it starts but stalls, with the following log repeated hundreds of times:
2018-11-01 11:15:37,842 INFO [Thread-82] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
Then with:
2018-11-01 11:15:37,888 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Exception while publishing configs on JOB_SUBMITTED Event for the job : job_1541066376053_0066
org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.publishConfigsOnJobSubmittedEvent(JobHistoryEventHandler.java:1254)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1414)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
3) In hadoop-yarn-timelineserver-${hostname}.log I see, repeated many times:
2018-11-01 11:32:47,715 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1541066376053_0144_2 (TEZ_DAG_ID): 6
4) In hadoop-yarn-timelinereader-${hostname}.log I see, repeated many times:
Thu Nov 01 11:34:10 CET 2018, RpcRetryingCaller{globalStartTime=1541068444076, pause=1000, maxAttempts=4}, java.net.ConnectException: Call to /192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.x.x:17020
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:145)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
... 3 more
Caused by: java.net.ConnectException: Call to /192.168.x.x:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.x.x:17020
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165)
and indeed, there is nothing listening on port 17020 on 192.168.x.x.
5) I cannot find on any server a process named ats-hbase; this might be the reason for everything else. The queue setting is just yarn_hbase_system_service_queue_name=default, which has no limit that would prevent HBase from starting. I am sure that something is very wrong here, and any help would be appreciated.
10-29-2018
05:38 AM
I was using Oozie 4.2.0 (HDP 2.6) and am now trying Oozie 4.3.1 (HDP 3.0). One major difference is that a Java action could in the past read its jar from the local filesystem, but this does not seem to be possible anymore. The Java action is basic:
<java xmlns="uri:oozie:workflow:0.5">
<job-tracker>http://something.local:8050</job-tracker>
<name-node>hdfs://HdfsNameService</name-node>
<main-class>io.JsonPoster</main-class>
<file>file:///opt/jsonposter/jsonposter.jar</file>
</java>
The error I get is quite clear:
org.apache.oozie.action.ActionExecutorException: UnsupportedOperationException: Accessing local file system is not allowed
I know I could put the jar on HDFS, but I am trying to avoid that for now (because it used to work and all our deployments are done via rpm). I am ready to take responsibility for keeping all jars in sync across all datanodes. I already set oozie.service.HadoopAccessorService.supported.filesystems to *, with no effect. Is there a way to tell Oozie that yes, I am happy for it to read the local FS?
04-12-2018
01:21 PM
@rtrivedi Thanks for your answer, but I believe that's not the issue. I tried a lot of variations with the Hive delete command, to no avail:
delete jar hdfs:///myudfs/myfunc.jar;
list jar; -- gives a localised jar
delete jar $localised_jar;
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';
And I end up with the same error again.
04-12-2018
09:59 AM
I created my own (generic) udf, which works very well when added in hive:
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';
After a while I wanted to update my udf, so I created a new jar with the same name, and put it in hdfs by overwriting the old jar. Lo and behold, I cannot use my function again! It does not matter if I do first a:
drop function if exists myfunc;
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';
From beeline, I got one of these error message:
java.io.IOException: Previous writer likely failed to write hdfs://ip-10-0-10-xxx.eu-west-1.compute.internal:8020/tmp/hive/hive/_tez_session_dir/0de6055d-190d-41ee-9acb-c6b402969940/hmyfunc.jar Failing because I am unlikely to write too.
or
org.apache.hadoop.hive.ql.metadata.HiveException: Default queue should always be returned.Hence we should not be here.
Looking at the logs, it seems Hive is localising the jar file (good), but since a session is reused, if the new jar does not match the jar already present in the localised directory, Hive complains and apparently waits indefinitely. If my understanding is correct, is there a way to tell Tez not to reuse any of the current sessions? If my understanding is not correct, is there a way to do what I want? Context: HDP 2.6.0.3, no LLAP, on AWS. Thanks,
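For what it is worth, the kind of switch I am hoping for would be something along these lines, run before recreating the function (I am not sure this is the right property, or that it has the scope I need):
-- assumption: disabling Tez container reuse for this session avoids the stale localised jar
set tez.am.container.reuse.enabled=false;
drop function if exists myfunc;
CREATE FUNCTION myfunc AS 'io.company.hive.udf.myfunc' USING JAR 'hdfs:///myudfs/myfunc.jar';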
03-09-2018
12:35 PM
I have a query, always failing with the following error:
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173) [...]
The query itself is quite a small MERGE (other, much bigger queries work flawlessly):
MERGE INTO summary dst USING (
SELECT
e.id1
, e.id2
, e.id3
, e.name
, e.subject
FROM
mailing e
) src
ON
dst.id1 = src.id1
AND dst.id2 = src.id2
AND dst.id3 = src.id3
WHEN MATCHED
THEN UPDATE SET
name = src.name
, subject=src.subject
The source table has 1.7M rows (50M on disk), the destination has 75M rows, (1.5GB on disk).
Both are ACID tables, ORC.
On the image, map 1 is the one with the issue, and I cannot understand why it has only one task. Naively I would think that more tasks would each have a smaller load and would work better, but I did not manage to do that.
Note that I have already maxed out all memory parameters, I cannot go higher on those:
yarn-site/yarn.nodemanager.resource.memory-mb = 24064
yarn-site/yarn.scheduler.minimum-allocation-mb = 1024
yarn-site/yarn.scheduler.maximum-allocation-mb = 24064
mapred-site/mapreduce.map.memory.mb = 4096
mapred-site/mapreduce.reduce.memory.mb = 8192
mapred-site/mapreduce.map.java.opts = 3276
mapred-site/mapreduce.reduce.java.opts = 6553
hive-site/hive.tez.container.size = 4096
Is there a way to increase the number of tasks in the mapper, or another way to avoid this out of memory error?
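For reference, the kind of knob I was hoping for is something like the Tez split grouping, set before the MERGE (values are illustrative, and I do not know whether these settings apply to the table scan feeding the merge):
-- smaller split groups should yield more map tasks (values are illustrative)
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;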
01-16-2018
06:46 AM
@Jordan Moore Not really relevant to the question, but no, this is not the point. The use case here is data export, where some clients have their own BI tools, processes and so on. They just need the data: CSVs in a zip file. Other clients do not have this in place and access this data differently.
01-15-2018
06:08 AM
The zip file is the output of the process, not meant to be read in HDFS anymore - it will just end up being downloaded and sent to a user. In this context using zip makes sense, as I am only looking at *compressing* multiple CSVs together, not reading them afterwards. Using beeline with formatted output is what I do currently, but I end up downloading multiple gigs locally, compressing and re-uploading. This is a waste and could actually fill up my local disks. Using coalesce in Spark is the best option I found, but the compression step is still not easy. Thanks!
01-12-2018
02:57 PM
Hello, I am running HDP 2.6 on AWS with 8 nodes. Here are my relevant settings:
yarn.nodemanager.resource.memory-mb=24gb
yarn.scheduler.maximum-allocation-mb=24gb
yarn.scheduler.minimum-allocation-mb=1gb
tez container size=4gb
I have 32GB per datanode and I see in my monitoring that only 13GB (max) is used, but when running a big Hive query I still receive:
TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
Could you help me make sense of this? Thanks
01-09-2018
01:45 PM
My end goal is to run a few Hive queries, get 1 CSV file (with headers) per query, compress all those files together into one zip (not gzip or bzip, unfortunately: it needs to open natively under Windows) and, hopefully, get the zip back into HDFS. My current solution (CTAS) ends up creating one directory per table, with possibly multiple files under it (depending on the number of reducers and the presence/absence of UNION). I can also easily generate a header file per table with only one line in it. Now, how do I put all of that together? The only option I could find implies doing all the processing locally (hdfs dfs -getmerge followed by an actual zip command). This adds a lot of overhead and could technically fill up the local disk. So my questions are:
- is there a way to concatenate files inside HDFS without fetching them locally?
- is there a way to compress a bunch of files together (not individually) into a zip, inside HDFS?
Thanks
- Tags:
- Data Processing
- HDFS
12-11-2017
02:12 PM
HDP 2.6.0, the HMS has 6GB of memory, and the metastore database itself is MySQL. After a few days the server hosting the HMS has its CPU 100% used, Hive queries are slow, and looking at the GC logs the HMS is constantly having stop-the-world events. Restarting the metastore 'fixes' the problem for a few days. I have found a few JIRA links related to memory leaks: https://issues.apache.org/jira/browse/HIVE-15551 and https://issues.apache.org/jira/browse/HIVE-13749. Is this a known issue in HDP 2.6.0? Is it known whether the latest HDP version fixes it? Thanks,
- Tags:
- hivemetastore