Member since: 10-28-2020
Posts: 500
Kudos Received: 31
Solutions: 35
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 124 | 03-28-2024 09:51 AM
 | 213 | 03-20-2024 03:54 AM
 | 378 | 03-14-2024 06:29 AM
 | 605 | 03-12-2024 04:00 AM
 | 1095 | 02-18-2024 09:59 AM
03-12-2024
07:46 AM
@Leopold It is disabled for external tables because data in HDFS can change without Hive knowing about it. Unfortunately, I do not see a way to enforce a fetch task for a query with an aggregate function.
03-12-2024
04:00 AM
1 Kudo
@Leopold I just checked, and your observation is correct: for external tables, it does not use a fetch task. In the logs, I see the following message:
2024-03-12 10:48:37,247 INFO org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.
Enabling "hive.fetch.task.aggr=true" will still help, though, by avoiding the Reducer phase used for the final aggregation; the query becomes a Map-only job.
03-12-2024
02:38 AM
@Leopold Provided we have column stats available, Hive can use a fetch task to perform a simple aggregation such as max() instead of launching a Map job. Try hive.fetch.task.aggr=true. This property is disabled by default.
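As a minimal sketch of trying this from a Beeline session (the sales table and amount column are hypothetical placeholders, not from the original thread):
set hive.fetch.task.aggr=true;
-- gather column statistics so the optimizer has them available
analyze table sales compute statistics for columns;
-- a simple aggregate with no GROUP BY, such as max(), may now be served
-- by a fetch task instead of launching a Map job
select max(amount) from sales;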
02-26-2024
05:21 AM
@frbelotto It does not require any installation. You just need to provide the path to the driver JAR file.
02-24-2024
05:31 PM
1 Kudo
My bad. If you are using the Cloudera JDBC jar, the driver class should be com.cloudera.hive.jdbc.HS2Driver. As we are talking about Kerberos authentication, you should first obtain a Kerberos ticket on the client machine (e.g. with kinit), and use a jdbc_url as follows:
import jaydebeapi
# Path to the Cloudera Hive JDBC driver JAR
jar_file = '/path/to/hive-jdbc.jar'
# Fill in {server}, {port} and {principal} with your HiveServer2 host, port and Kerberos principal
jdbc_url = 'jdbc:hive2://{server}:{port}/default;principal={principal}'
# Connect to Hive; username/password are left empty because Kerberos handles authentication
conn = jaydebeapi.connect('com.cloudera.hive.jdbc.HS2Driver', jdbc_url, ['', ''], jar_file)
cursor = conn.cursor()
02-18-2024
09:59 AM
1 Kudo
@frbelotto I have not tried pyhive; I think it requires additional modules if you want to connect using a ZooKeeper quorum. But you could use the jaydebeapi Python module to connect to Hive 3. It works with any type of connection string (Knox/ZK). You would need the Hive JDBC driver, which you can download from here. An example of how to use the jaydebeapi module to connect to Hive:
import jaydebeapi
# Connection parameters
jdbc_url = 'jdbc:hive2://knox.host:8443/default' # JDBC URL for HiveServer2
username = 'your-username'
password = 'your-password'
jar_file = '/path/to/hive-jdbc-driver.jar' # Path to the Hive JDBC driver JAR file
# Establish connection to Hive
conn = jaydebeapi.connect(
'org.apache.hive.jdbc.HiveDriver',
jdbc_url,
[username, password],
jar_file
)
# Create cursor
cursor = conn.cursor()
# Execute Hive query
cursor.execute('SELECT * FROM hive_table')
# Fetch results
result = cursor.fetchall()
# Close cursor and connection
cursor.close()
conn.close()
01-21-2024
06:45 AM
@narasimha8177 Is it happening for all the jobs? Could you check the utilization of the yarn.nodemanager.local-dirs (YARN NodeManager Local Directories) directories? You must have defined a path for this under the YARN configuration; the localized resources get stored under that location. Try deleting all the contents of the usercache directories on all DataNodes and resubmit the job. While deleting the contents of the usercache directories, make sure that no job is in a running state; otherwise, take a downtime to perform this. We need to understand why the localization fails: either the source file is missing, or the target location is not in good shape.
01-16-2024
10:55 AM
@narasimha8177 I think HADOOP-12252 fixes this issue, but that fix is not available in HDP 2.5. Do check the disk usage on all your NodeManager hosts: df -Th. If you notice any directory 100% utilized, clear some files and make sure all the directories are readable/writable.
01-16-2024
07:49 AM
@bulbcat Hive has a KILL QUERY <queryId> command to achieve what you requested here. The feature was added through the upstream JIRAs HIVE-17483 and HIVE-20549. However, this command is not present in the Hive version you mentioned above; it was added in Hive 3.0, which is available in Cloudera CDP clusters.
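For illustration, a minimal sketch of the syntax (the query ID below is a made-up placeholder; the real one can be taken from the HiveServer2 web UI or the query's log output):
-- terminate a running query from any Beeline session connected to the same HiveServer2
KILL QUERY "hive_20240116074900_0000aaaa-bbbb-cccc-dddd-eeeeffff0000";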
12-27-2023
09:48 AM
2 Kudos
Hive 3.0 introduced an option to re-attempt a failed Hive query in case the first run fails. It only makes sense if whatever caused the previous failure has been fixed, so below we discuss ways to configure this once, without having to intervene after each failure event. The following Hive properties enable query re-execution and should be enabled out of the box:
hive.query.reexecution.enabled=true;
hive.query.reexecution.strategies=overlay,reoptimize,recompile_without_cbo,reexecute_lost_am;
Re-execution strategies:
Overlay
Using this method, we can set a Hive property that should be applied on the re-execution. It works by adding a configuration subtree as an overlay to the actual Hive settings (reexec.overlay.*):
set reexec.overlay.{hive_property}=new_value
Every Hive setting that has the prefix "reexec.overlay" will be set for all re-executions. For example, if our Hive queries fail with OOM while performing map joins, which can occur when we do not have correct stats for the tables, we could try disabling hive.auto.convert.join for the next attempt:
set reexec.overlay.hive.auto.convert.join=false;
set hive.query.reexecution.strategies=overlay;
Reoptimize
Throughout the execution of a query, the system actively monitors the real count of rows passing through each operator. This recorded information is leveraged in subsequent re-planning stages, potentially leading to a more optimized query plan. Instances where this becomes essential include:
- Absence of statistics.
- Inaccurate statistics.
- Scenarios involving numerous joins.
To enable this, use:
set hive.query.reexecution.strategies=overlay,reoptimize
set hive.query.reexecution.stats.persist.scope=query
hive.query.reexecution.stats.persist.scope provides an option to persist the runtime stats at different levels:
- query: only used during the re-execution
- hiveserver2: persisted in HS2 until it is restarted
- metastore: persisted in the metastore and loaded on HiveServer2 startup
Avoid setting it to "metastore" due to the bug discussed in HIVE-26978.
recompile_without_cbo
When CBO fails during the compilation phase, Hive falls back to the legacy optimizer, but in many cases it is unable to correctly recreate the AST. HIVE-25792 helps recompile the query without CBO in case it fails.
reexecute_lost_am
Re-executes the query if it failed because the Tez AM node got decommissioned.
Some relevant configurations:
Configuration | Default | Description
---|---|---
hive.query.reexecution.always.collect.operator.stats | false | Enable to gather runtime statistics for all queries.
hive.query.reexecution.enabled | true | Feature enabler.
hive.query.reexecution.max.count | 1 | Number of re-executions that may happen.
hive.query.reexecution.stats.cache.batch.size | -1 | If runtime stats are stored in the metastore, the maximal batch size per round during load.
hive.query.reexecution.stats.cache.size | 100000 | Size of the runtime statistics cache, in OperatorStat entries; a query plan consists of ~100.
hive.query.reexecution.stats.persist.scope | query | Scope at which runtime statistics are persisted: query (only used during the re-execution), hiveserver (persisted during the lifetime of the HiveServer2), or metastore (persisted in the metastore and loaded on HiveServer2 startup).
hive.query.reexecution.strategies | overlay,reoptimize,recompile_without_cbo,reexecute_lost_am | Re-execution plugins; currently overlay and reoptimize are supported.
runtime.stats.clean.frequency | 3600s | Frequency at which the timer task runs to remove outdated runtime stat entries.
runtime.stats.max.age | 3days | Stat entries older than this are removed.
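As a minimal sketch of how these pieces fit together in a Beeline session (the orders/customers join is a hypothetical query that might fail with an OOM on its first attempt; all properties are the ones discussed above):
set hive.query.reexecution.enabled=true;
set hive.query.reexecution.strategies=overlay,reoptimize;
set hive.query.reexecution.stats.persist.scope=query;
-- applied only to the re-execution: disable map-join auto conversion
set reexec.overlay.hive.auto.convert.join=false;
-- if the first attempt fails, Hive re-runs the query with the overlay settings applied
select o.order_id, c.name
from orders o
join customers c on o.customer_id = c.customer_id;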