Member since 09-16-2021
      
- Posts: 421
- Kudos Received: 55
- Solutions: 39
        My Accepted Solutions
| Title | Views | Posted | 
|---|---|---|
| | 226 | 10-22-2025 05:48 AM |
| | 309 | 09-05-2025 07:19 AM |
| | 791 | 07-15-2025 02:22 AM |
| | 1344 | 06-02-2025 06:55 AM |
| | 1617 | 05-22-2025 03:00 AM |
			
    
	
		
		
Posted on 10-10-2023 10:15 AM
Please share the complete error stack trace. With respect to "The table doesn't have partitions": make sure the partition metadata in the Hive metastore is in sync with the data on HDFS.
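A minimal sketch of the sync check, assuming a hypothetical table name (my_partitioned_table):

-- List the partitions currently registered in the metastore
SHOW PARTITIONS my_partitioned_table;

-- Register partition directories that exist on HDFS but are missing from the metastore
MSCK REPAIR TABLE my_partitioned_table;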
						
					
			
    
	
		
		
Posted on 10-10-2023 10:06 AM
Grafana is a popular open-source platform for monitoring and observability, and it is commonly associated with telemetry data visualization, especially when integrated with time-series databases like Prometheus, InfluxDB, or Elasticsearch. However, Grafana is not limited to telemetry data, and it can be used with a wide range of data sources, including HDFS and Hive tables. Here are some options for using Grafana for data visualization beyond telemetry:

- Hive Data Sources: Grafana has built-in support for various data sources and offers plugins for connecting to databases and data lakes. You can configure Grafana to connect to Hive as a data source and visualize data stored in Hive tables (a sample panel query is sketched after this list).
- HDFS Data Sources: While Grafana primarily focuses on time-series data, you can still use it to visualize data stored in HDFS by connecting it to Hadoop-related data sources or by exporting HDFS data to another data store (e.g., Elasticsearch, InfluxDB) that Grafana supports.
- SQL Databases: Grafana can connect to traditional relational databases using SQL data sources. If you have data stored in SQL databases, you can use Grafana to create dashboards and visualizations.
- Log Data: Grafana can be used for log data analysis and visualization. You can integrate it with tools like Loki (for log aggregation) and explore log data in dashboards.
- Custom Plugins: If you have a unique data source or a specific format, you can develop custom data source plugins for Grafana to connect to your data and visualize it as needed.
- API Data: Grafana supports data sources that expose data through APIs. You can connect to REST APIs, GraphQL APIs, and other web services to visualize data.
- Mixed Data Sources: Grafana allows you to create dashboards that combine data from multiple sources, making it versatile for various visualization needs.

While Grafana is flexible and can be used with a wide range of data sources, it's important to consider the nature of your data and the specific visualization requirements. Depending on your use case, you may need to choose the most suitable data source, data format, and visualization options within Grafana to achieve your desired results.
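For illustration, this is the kind of aggregate query a Grafana panel backed by a Hive data source might run; the table and column names are hypothetical:

-- Daily event counts that a time-series panel could chart
SELECT
  event_date,
  COUNT(*) AS events
FROM app_events
GROUP BY event_date
ORDER BY event_date;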
						
					
			
    
	
		
		
Posted on 10-10-2023 09:58 AM
When users query a Hive table partitioned on a specific column (in your case, "source system name") but do not include a filter condition on that partition column, Hive may need to scan all partitions of the table to retrieve the relevant data. This can lead to less efficient query performance, as it requires reading unnecessary data from multiple partitions.

In your scenario, where you perform frequent insert overwrites to keep only the current data, the table may not grow drastically in terms of total data volume. However, if users frequently query the table without specifying the partition column condition, it can still result in increased query processing time and resource utilization. To improve query efficiency in this situation, you have a few options:

- Partition Pruning: Encourage users to include the partition column condition in their queries. Hive has built-in partition pruning optimization, which allows it to skip unnecessary partitions when the partition column condition is provided (see the sketch at the end of this reply).
- Materialized Views: If certain common query patterns exist, consider creating materialized views that pre-aggregate or pre-filter data based on those patterns. This can significantly speed up queries that align with the materialized views.
- Optimize Data Layout: Ensure that the data is stored efficiently, and consider using columnar storage formats like ORC or Parquet, which can improve query performance.

Ultimately, the choice of optimization strategy depends on the specific usage patterns and requirements of your users. It's essential to monitor query performance and understand your users' query behavior to determine which optimization approaches are most effective.
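As a hedged sketch of the difference, assuming a hypothetical table partitioned on source_system_name:

-- No filter on the partition column: Hive may scan every partition
SELECT * FROM sales_snapshot
WHERE amount > 100;

-- Filter on the partition column: Hive prunes to a single partition
SELECT * FROM sales_snapshot
WHERE source_system_name = 'crm'
  AND amount > 100;

-- EXPLAIN DEPENDENCY lists the partitions a query would actually read
EXPLAIN DEPENDENCY
SELECT * FROM sales_snapshot
WHERE source_system_name = 'crm';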
						
					
			
    
	
		
		
Posted on 10-09-2023 06:27 AM
							 Could you kindly provide the DDL and a sample dataset to facilitate a more in-depth explanation? 
						
					
			
    
	
		
		
Posted on 10-03-2023 02:07 PM
@yucai Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
						
					
			
    
	
		
		
Posted on 10-03-2023 02:15 AM
It seems that the query involves dynamic partitioning, but the dynamic partition column is not included in either the SELECT statement or the Common Table Expression (CTE). Please add the dynamic partition column 'date' to the SELECT statement and validate the query in Beeline.
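A minimal sketch of the expected shape, using hypothetical table and column names; 'date' is the dynamic partition column from the question:

-- Enable dynamic partitioning for the session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION (`date`)
SELECT
  col1,
  col2,
  `date`  -- the dynamic partition column goes last in the SELECT list
FROM source_table;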
						
					
			
    
	
		
		
Posted on 10-03-2023 02:07 AM
It appears that you're currently following a two-step process: writing data to a Parquet table and then using that Parquet table to write to an ORC table. You can streamline this by writing the data directly into the ORC table, eliminating the need to write the same data to a Parquet table before reading it back.

References:
- https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html
- https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/developing-spark-applications/topics/spark-loading-orc-data-predicate-push-down.html
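A minimal sketch of the direct write, runnable from spark-sql or Beeline; the table and column names are hypothetical:

-- Target table stored directly as ORC
CREATE TABLE IF NOT EXISTS target_orc (
  id BIGINT,
  payload STRING
) STORED AS ORC;

-- Load straight into the ORC table, skipping the intermediate Parquet step
INSERT OVERWRITE TABLE target_orc
SELECT id, payload
FROM source_data;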
						
					
			
    
	
		
		
Posted on 10-03-2023 01:58 AM
You can use the following HQL query to find the sum of new cases for each continent and identify the country with the highest number of cases in each continent, along with the corresponding case count:

WITH ContinentSum AS (
  SELECT
    continent,
    SUM(new_cases) AS total_new_cases
  FROM
    sample_table_test
  GROUP BY
    continent
),
CountryMaxCases AS (
  SELECT
    continent,
    location AS country,
    MAX(total_case) AS max_cases
  FROM
    sample_table_test
  GROUP BY
    continent, location
),
RankedCountries AS (
  SELECT
    continent,
    country,
    max_cases,
    ROW_NUMBER() OVER (PARTITION BY continent ORDER BY max_cases DESC) AS rn
  FROM
    CountryMaxCases
)
SELECT
  cs.continent,
  cs.total_new_cases,
  cm.country,
  cm.max_cases
FROM
  ContinentSum cs
JOIN
  RankedCountries cm
ON
  cs.continent = cm.continent
WHERE
  cm.rn = 1;

This query first calculates the sum of new cases for each continent in the ContinentSum CTE (Common Table Expression). The CountryMaxCases CTE then computes each country's total case count, and RankedCountries orders the countries within each continent by that count. Finally, the query joins the top-ranked country per continent (rn = 1) to the continent totals to produce the desired output.

Sample result set for the shared data:

+---------------+---------------------+--------------+---------------+
| cs.continent  | cs.total_new_cases  |  cm.country  | cm.max_cases  |
+---------------+---------------------+--------------+---------------+
| Asia          | 25.0                | Afghanistan  | 25.0          |
+---------------+---------------------+--------------+---------------+
						
					
			
    
	
		
		
Posted on 09-30-2023 11:39 AM
To pinpoint the root cause, kindly provide a few samples of the data.
						
					
			
    
	
		
		
Posted on 09-29-2023 05:27 AM
To gain a better understanding of the issue, kindly provide jstack output from the HiveServer2 (HS2) process at 30-second intervals until the query completes.
						
					
					... View more