About ggangadharan

ggangadharan · ‎06-06-2023

mapreduce.output.basename also works since as part of setoutput name assigning the same. Code snippet form ParquetOutputFormat. protected static void setOutputName(JobContext job, String name) { job.getConfiguration().set("mapreduce.output.basename", name); } JOB CONF - Configuration conf = getConf(); conf.set("mapreduce.output.basename","parquet_output"); Output [hive@c1757-node3 ~]$ hdfs dfs -ls /tmp/parquet-sample Found 4 items -rw-r--r-- 2 hive supergroup 0 2023-06-06 17:08 /tmp/parquet-sample/_SUCCESS -rw-r--r-- 2 hive supergroup 271 2023-06-06 17:08 /tmp/parquet-sample/_common_metadata -rw-r--r-- 2 hive supergroup 1791 2023-06-06 17:08 /tmp/parquet-sample/_metadata -rw-r--r-- 2 hive supergroup 2508 2023-06-06 17:08 /tmp/parquet-sample/parquet_output-m-00000.parquet

ggangadharan · ‎06-06-2023

Since the output file is .parquet , hope you're using ParquetOutputFormat in the MR job config. In that case ParquetOutputFormat.setOutputname method will help to set the base name of the output file. Ref - https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.2/org/apache/parquet/hadoop/ParquetOutputFormat.html https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#setOutputName(org.apache.hadoop.mapreduce.JobContext,%20java.lang.String)

ggangadharan · ‎06-06-2023

It is not possible to add an aux jar directly from CM. Follow the below documents depending on the requirement. https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/using-hiveql/topics/hive_create_place_udf_jar.html https://docs.cloudera.com/cdw-runtime/1.5.0/integrating-hive-and-bi/topics/hive_setup_jdbcstoragehandler_edb.html

ggangadharan · ‎04-20-2023

It's working expected. Please find the below code snippet >>> columns = ["language","users_count"] >>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")] >>> df = spark.createDataFrame(data).toDF(*columns) >>> df.write.csv("/tmp/test") >>> df2=spark.read.csv("/tmp/test/*.csv") d>>> df2.show() +------+------+ | _c0| _c1| +------+------+ |Python|100000| | Scala| 3000| | Java| 20000| +------+------+

ggangadharan · ‎04-20-2023

From the error could see the query failed in MoveTask. MoveTask can be loading the partitions as well since the load statement belongs to the partitioned table, Along with HS2 logs HMS logs for the corresponding time period gives a better idea to identify the root cause of the failure. If it's just timeout issue, increase client socket timeout value.

ggangadharan · ‎10-13-2022

@Sunil1359 Compilation might be higher if the table has a large number of partitions or if the HMS process is slow when the query runs. Please check the below on the corresponding time period to find the root cause. HS2 log HMS log HMS jstack In Tez engine queries will run in the form of DAG. In the compilation phase, once the semantic analysis process is completed, the plan will be generated depending on the query you submitted. explain <your query> gives the plan of the query. Once the plan is generated DAG will be submitted to yarn and the DAG will run depending on the plan. As part of DAG, Split generation, input file read, shuffle fetch ..etc will be taken care and the end result will be transferred to the client.

ggangadharan · ‎04-25-2022

Hi, From shell find the files that needs to be deleted and save them in a temp file like below, #!/bin/sh today=`date +'%s'` hdfs dfs -ls /file/Path/ | while read line ; do dir_date=$(echo ${line} | awk '{print $6}') difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) )) filePath=$(echo ${line} | awk '{print $8}') if [ ${difference} -gt 3 ]; then echo -e "$filePath" >> toDelete fi done Then execute arbitrary shell command using form example subprocess.call or sh library so something like below import subprocess file = open('toDelete', 'r') for each in file: subprocess.call(["hadoop", "fs", "-rm", "-f", each]) Also, you can use hdfs fs API in PySpark like below, from pyspark.sql import SparkSession spark = SparkSession.builder.appName('abc').getOrCreate() def delete_path(spark, path): sc = spark.sparkContext fs = (sc._jvm.org .apache.hadoop .fs.FileSystem .get(sc._jsc.hadoopConfiguration()) ) fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True) delete_path(spark, "Your/hdfs/path")

ggangadharan · ‎04-22-2022

Hi , The below source code removes files that are older than 3 days from the HDFS path #!/bin/sh today=`date +'%s'` hdfs dfs -ls /file/Path/ | while read line ; do dir_date=$(echo ${line} | awk '{print $6}') difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) )) filePath=$(echo ${line} | awk '{print $8}') if [ ${difference} -gt 3 ]; then hdfs dfs -rm -r $filePath fi done hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured. To ignore moving the file to trash folder use skipTrash option.

ggangadharan · ‎02-17-2022

Hi, Tried the same in 3.1.0.0-78. It's working as expected in the document. https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_surrogate_keys.html Could you please share beeline -u <hiveserver jdbc uri> -e "set -v" output. Also, the error stack trace looks like it's coming from NN. Please check your NN logs for further information. Also could see the below JIRA regarding this, https://issues.apache.org/jira/browse/HIVE-21238

ggangadharan · ‎02-15-2022

Hi, Hiveserver2 runs as a java process, When running timestamp-related UDF functions in Hive, the default behavior is to use the system's timezone information to convert timestamp values. PFB INFO : Executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e): SELECT current_timestamp() INFO : Completed executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e); Time taken: 0.092 seconds INFO : OK +--------------------------+ | _c0 | +--------------------------+ | 2022-02-15 13:17:16.204 | +--------------------------+ 1 row selected (2.413 seconds) [hive@c2757-node3 ~]$ date Tue Feb 15 13:17:26 IST 2022 [hive@c2757-node3 ~]$ timedatectl | grep "Time zone" Time zone: Asia/Kolkata (IST, +0530) To make Hive return a specific timezone with the timestamp function. please follow the steps below: 1. Go to the Cloudera Manager home page > Hive > Configuration Under "Client Java Configuration Options", append " -Duser.timezone=UTC" in the text string (be mindful of the leading space in front, if you append to the end of existing options). 2 . Under "Java Configuration Options for HiveServer2", append the same thing to the end of the text string " -Duser.timezone=UTC". 3 . Save the configuration, then restart any HiveServer2 instances, and select Actions -> "Deploy Client Configuration" through Cloudera Manager. To confirm the new configuration is working, see below test outputs: Before the change (system's default timezone is IST): INFO : Executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e): SELECT current_timestamp() INFO : Completed executing command(queryId=hive_20220215131716_33c6c90b-207a-43fe-9bcf-4a52cd04de3e); Time taken: 0.092 seconds INFO : OK +--------------------------+ | _c0 | +--------------------------+ | 2022-02-15 13:17:16.204 | +--------------------------+ 1 row selected (2.413 seconds) [hive@c2757-node3 ~]$ date Tue Feb 15 13:17:26 IST 2022 [hive@c2757-node3 ~]$ timedatectl | grep "Time zone" Time zone: Asia/Kolkata (IST, +0530) After the change to UTC, [hive@c2757-node3 ~]$ timedatectl | grep "Time zone" Time zone: Asia/Kolkata (IST, +0530) [hive@c2757-node3 ~]$ date Tue Feb 15 13:28:05 IST 2022 INFO : Executing command(queryId=hive_20220215075824_6a58bdcb-b1b6-470d-9202-26ccfc60f521): SELECT current_timestamp() INFO : Completed executing command(queryId=hive_20220215075824_6a58bdcb-b1b6-470d-9202-26ccfc60f521); Time taken: 0.079 seconds INFO : OK +--------------------------+ | _c0 | +--------------------------+ | 2022-02-15 07:58:24.785 | +--------------------------+ 1 row selected (2.24 seconds) Like this, you can mention hive to use other timezones.

Online	Offline
Last Visited	‎01-06-2025 12:29 AM

Member Since	‎09-16-2021 02:45 AM
Last Visited	‎01-06-2025 12:29 AM
Posts	336
Kudos received	53

Cloudera Community

Re: Hive Job - OutOfMemoryError: Java heap space

Re: Insert into table test values('a', 'b'); not w...

Re: how to drop partition table using date_add fun...

Re: Issue with Hive HQL insert query - KryoExcepti...

Re: Error when do an alter table change column on ...

Re: Change default output filename part-r-00000.sn...

Re: Change default output filename part-r-00000.sn...

Re: unable add hive aux jars from CM

Re: Unable to convert a pyspark dataframe to CSV

Re: Error with socket timeout in CDP hive 3.1.3 wh...

Re: Tez query profiling

Re: HDFS dir cleanup which older than 7 days in py...

Re: HDFS dir cleanup which older than 7 days in py...

Re: Apache Hive "surrogate_key" error

Re: Can we change default hive/hbase timestamp fro...