Member since: 09-16-2021
Posts: 144
Kudos Received: 6
Solutions: 17
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 226 | 11-06-2023 03:10 AM |
| | 115 | 10-30-2023 07:17 AM |
| | 187 | 10-27-2023 12:07 AM |
| | 230 | 10-10-2023 10:57 AM |
| | 222 | 10-10-2023 10:50 AM |
06-13-2023 12:12 AM
Please share the output of the commands below to identify the exact output record details:
explain formatted <query>
explain extended <query>
explain analyze <query>
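For context, a minimal sketch of the three variants against a hypothetical table web_logs (substitute your actual query):

-- Plan emitted with additional formatting (JSON) for easier parsing
explain formatted select country, count(*) from web_logs group by country;
-- Plan plus extended metadata such as table and file locations
explain extended select country, count(*) from web_logs group by country;
-- Executes the query and annotates the plan with actual row counts
explain analyze select country, count(*) from web_logs group by country;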
06-06-2023 09:53 PM
Use the query below to fetch the table location directly from the HMS backend database:

select "DBS"."NAME" as DB_NAME, "TBLS"."TBL_NAME", "SDS"."LOCATION"
from "DBS"
join "TBLS" on "DBS"."DB_ID" = "TBLS"."DB_ID" AND "TBLS"."TBL_TYPE" != 'VIRTUAL_VIEW'
join "SDS" on "TBLS"."SD_ID" = "SDS"."SD_ID";

To query the same information from Hive, I would recommend using the JDBC Storage Handler. In CDP these tables are created by default in sysdb, so you can use them directly:

select dbs.name as db_name, tbls.tbl_name, sds.location
from dbs
join tbls on dbs.db_id = tbls.db_id and tbls.tbl_type != 'VIRTUAL_VIEW'
join sds on tbls.sd_id = sds.sd_id;
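If you go the sysdb route, the tables live in the sys database, so a Beeline session would look roughly like the sketch below (the WHERE filter and table name are only illustrative):

use sys;
select dbs.name as db_name, tbls.tbl_name, sds.location
from dbs
join tbls on dbs.db_id = tbls.db_id and tbls.tbl_type != 'VIRTUAL_VIEW'
join sds on tbls.sd_id = sds.sd_id
where tbls.tbl_name = 'my_table';  -- hypothetical table name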
06-06-2023 09:37 PM
Once the data has been read from the database, you don't need to write the same data to a file (i.e. CSV). Instead, you can write directly into a Hive table using the DataFrame APIs. Once the data has been loaded, you can query it from Hive.

df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")

Ref - https://spark.apache.org/docs/2.4.7/sql-data-sources-hive-tables.html

Sample code snippet:

df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://<server name>:5432/<DBNAME>") \
.option("dbtable", "\"<SourceTableName>\"") \
.option("user", "<Username>") \
.option("password", "<Password>") \
.option("driver", "org.postgresql.Driver") \
.load()
df.write.mode('overwrite').saveAsTable("<TargetTableName>")
From Hive:
INFO : Compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10); Time taken: 0.591 seconds
INFO : Executing command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
.
.
.
+------+
| _c0 |
+------+
| 122 |
+------+
06-06-2023 10:14 AM
mapreduce.output.basename also works, since setOutputName assigns that same property under the hood. Code snippet from ParquetOutputFormat:

protected static void setOutputName(JobContext job, String name) {
  job.getConfiguration().set("mapreduce.output.basename", name);
}

Job conf:

Configuration conf = getConf();
conf.set("mapreduce.output.basename", "parquet_output");

Output:

[hive@c1757-node3 ~]$ hdfs dfs -ls /tmp/parquet-sample
Found 4 items
-rw-r--r-- 2 hive supergroup 0 2023-06-06 17:08 /tmp/parquet-sample/_SUCCESS
-rw-r--r-- 2 hive supergroup 271 2023-06-06 17:08 /tmp/parquet-sample/_common_metadata
-rw-r--r-- 2 hive supergroup 1791 2023-06-06 17:08 /tmp/parquet-sample/_metadata
-rw-r--r-- 2 hive supergroup 2508 2023-06-06 17:08 /tmp/parquet-sample/parquet_output-m-00000.parquet
06-06-2023 05:13 AM
Since the output file is .parquet, I hope you're using ParquetOutputFormat in the MR job config. In that case, the ParquetOutputFormat.setOutputName method will help set the base name of the output file.
Ref -
https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.2/org/apache/parquet/hadoop/ParquetOutputFormat.html
https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#setOutputName(org.apache.hadoop.mapreduce.JobContext,%20java.lang.String)
06-06-2023 04:37 AM
It is not possible to add an aux jar directly from CM. Follow the below documents depending on the requirement:
https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/using-hiveql/topics/hive_create_place_udf_jar.html
https://docs.cloudera.com/cdw-runtime/1.5.0/integrating-hive-and-bi/topics/hive_setup_jdbcstoragehandler_edb.html
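As a rough sketch of the UDF-jar route from the first document (the function name, class, and HDFS path below are placeholders):

-- Register the jar and function from HiveQL; the jar must already be uploaded to HDFS
create function my_db.my_udf as 'com.example.udf.MyUDF' using jar 'hdfs:///user/hive/udfs/my-udf.jar';
-- Verify the function is visible
show functions like 'my_db.my_udf';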
04-20-2023 10:47 PM
It's working as expected. Please find the code snippet below.
>>> columns = ["language","users_count"]
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.write.csv("/tmp/test")
>>> df2=spark.read.csv("/tmp/test/*.csv")
>>> df2.show()
+------+------+
| _c0| _c1|
+------+------+
|Python|100000|
| Scala| 3000|
| Java| 20000|
+------+------+
04-20-2023 05:31 AM
From the error, I can see the query failed in MoveTask. MoveTask can also be loading the partitions, since the load statement targets a partitioned table. Along with the HS2 logs, the HMS logs for the corresponding time period will give a better idea of the root cause of the failure. If it's just a timeout issue, increase the client socket timeout value.
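For reference, the relevant property is hive.metastore.client.socket.timeout. A sketch of raising it for a single session is below, assuming session-level overrides are allowed in your environment (otherwise set it through the Hive service configuration); the value is only an example:

-- Hypothetical value; adjust to your workload
set hive.metastore.client.socket.timeout=1800s;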
10-13-2022 02:46 AM
@Sunil1359 Compilation time might be higher if the table has a large number of partitions or if the HMS process is slow when the query runs. Please check the below for the corresponding time period to find the root cause:
HS2 log
HMS log
HMS jstack

With the Tez engine, queries run in the form of a DAG. In the compilation phase, once the semantic analysis process is completed, the plan is generated for the query you submitted; explain <your query> gives the plan of the query. Once the plan is generated, the DAG is submitted to YARN and runs according to the plan. As part of the DAG, split generation, input file reads, shuffle fetches, etc. are taken care of, and the end result is transferred to the client.
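For instance, a quick way to see how many partitions the compiler has to enumerate and to inspect the generated plan (the table name and filter are hypothetical):

-- List the partitions of the table involved in the slow query
show partitions sales_db.transactions;
-- Review the plan produced for that query
explain select count(*) from sales_db.transactions where ds = '2022-10-01';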
04-25-2022 11:15 PM
Hi, from the shell, find the files that need to be deleted and save them in a temp file like below:

#!/bin/sh
today=`date +'%s'`
hdfs dfs -ls /file/Path/ | grep '^-' | while read line ; do   # keep only files; skips the "Found N items" header
dir_date=$(echo ${line} | awk '{print $6}')
difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))
filePath=$(echo ${line} | awk '{print $8}')
if [ ${difference} -gt 3 ]; then
echo -e "$filePath" >> toDelete
fi
done

Then execute an arbitrary shell command using, for example, subprocess.call or the sh library, something like below:

import subprocess
file = open('toDelete', 'r')
for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each.strip()])

Also, you can use the Hadoop FileSystem API from PySpark like below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")