06-06-2023
10:14 AM
Setting mapreduce.output.basename also works, because setOutputName assigns that same property. Code snippet from ParquetOutputFormat:

protected static void setOutputName(JobContext job, String name) {
    job.getConfiguration().set("mapreduce.output.basename", name);
}

Job configuration:

Configuration conf = getConf();
conf.set("mapreduce.output.basename", "parquet_output");

Output:

[hive@c1757-node3 ~]$ hdfs dfs -ls /tmp/parquet-sample
Found 4 items
-rw-r--r-- 2 hive supergroup 0 2023-06-06 17:08 /tmp/parquet-sample/_SUCCESS
-rw-r--r-- 2 hive supergroup 271 2023-06-06 17:08 /tmp/parquet-sample/_common_metadata
-rw-r--r-- 2 hive supergroup 1791 2023-06-06 17:08 /tmp/parquet-sample/_metadata
-rw-r--r-- 2 hive supergroup 2508 2023-06-06 17:08 /tmp/parquet-sample/parquet_output-m-00000.parquet
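
To illustrate how the base name ends up in the file name shown above, here is a minimal Python sketch. It is not Hadoop code: the function name unique_output_file is hypothetical, and the pattern (base name, task type m or r, five-digit partition number, extension) is inferred from the listing above.

```python
def unique_output_file(basename: str, task_type: str, partition: int, extension: str = "") -> str:
    """Mimic the <basename>-<m|r>-<5-digit partition><extension> naming pattern.

    Hypothetical helper for illustration only; it does not call Hadoop.
    """
    if task_type not in ("m", "r"):
        raise ValueError("task_type must be 'm' (map) or 'r' (reduce)")
    # The partition number is zero-padded to five digits, as in the listing.
    return f"{basename}-{task_type}-{partition:05d}{extension}"


if __name__ == "__main__":
    print(unique_output_file("parquet_output", "m", 0, ".parquet"))
```

With the base name parquet_output, a map-only job writing partition 0 would produce the name seen in the HDFS listing.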
06-06-2023
05:13 AM
Since the output file is .parquet, I assume you are using ParquetOutputFormat in the MR job config. In that case, the setOutputName method (inherited from FileOutputFormat) sets the base name of the output file.

References:
https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.2/org/apache/parquet/hadoop/ParquetOutputFormat.html
https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#setOutputName(org.apache.hadoop.mapreduce.JobContext,%20java.lang.String)
06-06-2023
04:37 AM
It is not possible to add an aux jar directly from CM. Follow the documents below, depending on the requirement:
https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/using-hiveql/topics/hive_create_place_udf_jar.html
https://docs.cloudera.com/cdw-runtime/1.5.0/integrating-hive-and-bi/topics/hive_setup_jdbcstoragehandler_edb.html
04-20-2023
10:47 PM
It works as expected. Please see the code snippet below:

>>> columns = ["language","users_count"]
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.write.csv("/tmp/test")
>>> df2=spark.read.csv("/tmp/test/*.csv")
>>> df2.show()
+------+------+
| _c0| _c1|
+------+------+
|Python|100000|
| Scala| 3000|
| Java| 20000|
+------+------+
04-20-2023
05:31 AM
From the error, I can see that the query failed in MoveTask. Because the load statement targets a partitioned table, MoveTask may also be loading the partitions. Along with the HS2 logs, the HMS logs for the corresponding time period will give a better idea of the root cause of the failure. If it is just a timeout issue, increase the client socket timeout value (for example hive.metastore.client.socket.timeout, if the timeout occurs between HS2 and HMS).
10-13-2022
02:46 AM
@Sunil1359 Compilation time might be higher if the table has a large number of partitions or if the HMS process is slow when the query runs. Please check the following for the corresponding time period to find the root cause:

- HS2 log
- HMS log
- HMS jstack

With the Tez engine, queries run in the form of a DAG. In the compilation phase, once semantic analysis is completed, the plan is generated based on the query you submitted. explain <your query> shows the plan of the query. Once the plan is generated, the DAG is submitted to YARN and runs according to the plan. As part of the DAG, split generation, input file reads, shuffle fetches, etc. are taken care of, and the end result is transferred to the client.
04-25-2022
11:15 PM
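Hi, From the shell, find the files that need to be deleted and save their paths in a temp file, as below:

#!/bin/sh
# Collect HDFS paths older than 3 days into the file "toDelete".
today=$(date +%s)
hdfs dfs -ls /file/Path/ | while read -r line; do
    # Column 6 is the modification date, column 8 is the path.
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Skip the "Found N items" header line, which has no path column.
    [ -z "${filePath}" ] && continue
    difference=$(( (today - $(date -d "${dir_date}" +%s)) / (24*60*60) ))
    if [ "${difference}" -gt 3 ]; then
        echo "${filePath}" >> toDelete
    fi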
done

Then run the delete command for each saved path from Python, for example with subprocess.call or the sh library:

import subprocess

with open('toDelete', 'r') as paths:
    for each in paths:
        # Strip the trailing newline before passing the path to the command.
        subprocess.call(["hadoop", "fs", "-rm", "-f", each.strip()])

Alternatively, you can use the HDFS FileSystem API from PySpark:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()

def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    # The second argument (True) enables recursive deletion.
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
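
The age check done in the shell loop can also be sketched in plain Python, which makes it easy to test against sample listing text before deleting anything. This is a minimal sketch under assumptions: the function name paths_older_than is hypothetical, the input is assumed to be the text printed by hdfs dfs -ls (eight whitespace-separated columns, with date and time in columns 6 and 7), and the cutoff compares full timestamps rather than whole days as the shell script does.

```python
from datetime import datetime, timedelta


def paths_older_than(ls_output: str, days: int, now: datetime) -> list:
    """Return the paths in `hdfs dfs -ls` output modified more than `days` days before `now`.

    Hypothetical helper; assumes the listing columns are:
    permissions, replication, owner, group, size, date, time, path.
    """
    cutoff = now - timedelta(days=days)
    old_paths = []
    for line in ls_output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays intact.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips the "Found N items" header and blank lines
        modified = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        if modified < cutoff:
            old_paths.append(fields[7])
    return old_paths


SAMPLE_LISTING = """Found 3 items
-rw-r--r--   2 hive supergroup          0 2022-04-25 10:00 /file/Path/new.txt
-rw-r--r--   2 hive supergroup        271 2022-04-20 09:30 /file/Path/old.txt
drwxr-xr-x   - hive supergroup          0 2022-04-10 08:00 /file/Path/old_dir
"""

if __name__ == "__main__":
    for path in paths_older_than(SAMPLE_LISTING, 3, datetime(2022, 4, 25, 23, 15)):
        print(path)
```

The returned paths could then be passed to subprocess.call or to a FileSystem delete call, once the list has been reviewed.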
04-22-2022
04:24 AM
Hi, The script below removes files that are older than 3 days from the HDFS path:

#!/bin/sh
# Remove HDFS paths older than 3 days.
today=$(date +%s)
hdfs dfs -ls /file/Path/ | while read -r line; do
    # Column 6 is the modification date, column 8 is the path.
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Skip the "Found N items" header line, which has no path column.
    [ -z "${filePath}" ] && continue
    difference=$(( (today - $(date -d "${dir_date}" +%s)) / (24*60*60) ))
    if [ "${difference}" -gt 3 ]; then
        hdfs dfs -rm -r "${filePath}"
    fi
done

The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured. To delete the files without moving them to the trash folder, use the -skipTrash option.
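
If the delete is driven from Python, the trash behaviour can be made an explicit choice when the command is assembled. A minimal sketch, assuming a hypothetical helper named build_rm_command; it only builds the argument list and does not run anything.

```python
def build_rm_command(path: str, skip_trash: bool = False) -> list:
    """Build the argument list for a recursive HDFS delete.

    Hypothetical helper for illustration; pass the result to subprocess.call.
    """
    command = ["hdfs", "dfs", "-rm", "-r"]
    if skip_trash:
        # Bypass the trash folder and delete the data immediately.
        command.append("-skipTrash")
    command.append(path)
    return command


if __name__ == "__main__":
    print(build_rm_command("/file/Path/old_dir", skip_trash=True))
```

Keeping skip_trash off by default means a mistaken path can still be recovered from the trash folder when trash is configured.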
02-17-2022
09:22 AM
Hi, I tried the same in 3.1.0.0-78, and it works as described in the documentation:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_surrogate_keys.html

Could you please share the output of beeline -u <hiveserver jdbc uri> -e "set -v"? Also, the error stack trace looks like it is coming from the NN. Please check your NN logs for further information. The following JIRA is also related:
https://issues.apache.org/jira/browse/HIVE-21238
02-15-2022
12:04 AM
Hi, HiveServer2 runs as a Java process. When running timestamp-related UDFs in Hive, the default behavior is to use the system's timezone information to convert timestamp values. Please find an example below:

INFO : Executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e): SELECT current_timestamp()
INFO : Completed executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e); Time taken: 0.092 seconds
INFO : OK
+--------------------------+
| _c0 |
+--------------------------+
| 2022-02-15 13:17:16.204 |
+--------------------------+
1 row selected (2.413 seconds)
[hive@c2757-node3 ~]$ date
Tue Feb 15 13:17:26 IST 2022
[hive@c2757-node3 ~]$ timedatectl | grep "Time zone"
Time zone: Asia/Kolkata (IST, +0530)

To make Hive use a specific timezone for timestamp functions, follow the steps below:

1. In Cloudera Manager, go to Home > Hive > Configuration. Under "Client Java Configuration Options", append " -Duser.timezone=UTC" to the text string (mind the leading space if you append to the end of existing options).
2. Under "Java Configuration Options for HiveServer2", append the same " -Duser.timezone=UTC" to the end of the text string.
3. Save the configuration, restart all HiveServer2 instances, and select Actions -> "Deploy Client Configuration" in Cloudera Manager.

To confirm that the new configuration is working, see the test outputs below.

Before the change (the system's default timezone is IST):

INFO : Executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e): SELECT current_timestamp()
INFO : Completed executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e); Time taken: 0.092 seconds
INFO : OK
+--------------------------+
| _c0 |
+--------------------------+
| 2022-02-15 13:17:16.204 |
+--------------------------+
1 row selected (2.413 seconds)
[hive@c2757-node3 ~]$ date
Tue Feb 15 13:17:26 IST 2022
[hive@c2757-node3 ~]$ timedatectl | grep "Time zone"
Time zone: Asia/Kolkata (IST, +0530)

After the change to UTC:

[hive@c2757-node3 ~]$ timedatectl | grep "Time zone"
Time zone: Asia/Kolkata (IST, +0530)
[hive@c2757-node3 ~]$ date
Tue Feb 15 13:28:05 IST 2022
INFO : Executing command(queryId=hive_20220215075824_6a58bdcb-b1b6-470d-9202-26ccfc60f521): SELECT current_timestamp()
INFO : Completed executing command(queryId=hive_20220215075824_6a58bdcb-b1b6-470d-9202-26ccfc60f521); Time taken: 0.079 seconds
INFO : OK
+--------------------------+
| _c0 |
+--------------------------+
| 2022-02-15 07:58:24.785 |
+--------------------------+
1 row selected (2.24 seconds)

In the same way, you can configure Hive to use other timezones.
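
As a sanity check on the outputs above, the offset between the two results can be reproduced in plain Python. This is a minimal sketch: the helper name ist_to_utc is hypothetical, IST is modelled as a fixed +05:30 offset, and the wall-clock time 13:28:24.785 IST is an assumption derived from the UTC result and the date output shown above.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for Asia/Kolkata (IST, +0530).
IST = timezone(timedelta(hours=5, minutes=30), "IST")


def ist_to_utc(local: datetime) -> datetime:
    """Interpret a naive datetime as IST wall-clock time and convert it to UTC."""
    return local.replace(tzinfo=IST).astimezone(timezone.utc)


if __name__ == "__main__":
    # Assumed IST wall-clock time at which the second query ran.
    wall_clock = datetime(2022, 2, 15, 13, 28, 24, 785000)
    # Trim microseconds to milliseconds to match the Hive output format.
    print(ist_to_utc(wall_clock).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
```

The printed value matches the current_timestamp() result returned after the change, which is 5 hours 30 minutes behind the system clock.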