About ggangadharan

ggangadharan · ‎10-30-2023

It appears that the JSON data contains multiple application entries within a single line, presented as struct data. This format makes schema creation challenging. To address this, you can leverage Spark to flatten the schema and store the data in Hive. This enables you to query the data conveniently from either Hive or Spark. Read the data JSON data df = spark.read.json("/user/hive/app_data_sample_data.json") First, explode the "app" array to separate rows from pyspark.sql.functions import col, explode, lit, struct exploded_df = df.select( explode(col("apps.app")).alias("app") ) Flatten and transform the exploded DataFrame # Flatten and transform the exploded DataFrame flattened_df = exploded_df.select( col("app.id").alias("id"), col("app.user").alias("user"), col("app.name").alias("name"), col("app.queue").alias("queue"), col("app.state").alias("state"), col("app.finalstatus").alias("finalstatus"), col("app.progress").alias("progress"), col("app.trackingui").alias("trackingui"), col("app.trackingurl").alias("trackingurl"), col("app.diagnostics").alias("diagnostics"), col("app.clusterid").alias("clusterid"), col("app.applicationtype").alias("applicationtype"), col("app.applicationtags").alias("applicationtags"), col("app.priority").alias("priority"), col("app.startedtime").alias("startedtime"), col("app.launchtime").alias("launchtime"), col("app.finishedtime").alias("finishedtime"), col("app.elapsedtime").alias("elapsedtime"), col("app.amcontainerlogs").alias("amcontainerlogs"), col("app.amhosthttpaddress").alias("amhosthttpaddress"), col("app.amrpcaddress").alias("amrpcaddress"), col("app.masternodeid").alias("masternodeid"), col("app.allocatedmb").alias("allocatedmb"), col("app.allocatedvcores").alias("allocatedvcores"), col("app.reservedmb").alias("reservedmb"), col("app.reservedvcores").alias("reservedvcores"), col("app.runningcontainers").alias("runningcontainers"), col("app.memoryseconds").alias("memoryseconds"), col("app.vcoreseconds").alias("vcoreseconds"), col("app.queueusagepercentage").alias("queueusagepercentage"), col("app.clusterusagepercentage").alias("clusterusagepercentage"), col("app.preemptedresourcemb").alias("preemptedresourcemb"), col("app.preemptedresourcevcores").alias("preemptedresourcevcores"), col("app.numnonamcontainerpreempted").alias("numnonamcontainerpreempted"), col("app.numamcontainerpreempted").alias("numamcontainerpreempted"), col("app.preemptedmemoryseconds").alias("preemptedmemoryseconds"), col("app.preemptedvcoreseconds").alias("preemptedvcoreseconds"), col("app.logaggregationstatus").alias("logaggregationstatus"), col("app.unmanagedapplication").alias("unmanagedapplication"), col("app.amnodelabelexpression").alias("amnodelabelexpression"), struct( lit("lifetime").alias("type"), lit("unlimited").alias("expirytime"), lit(-1).alias("remainingtimeinseconds") ).alias("timeouts") ) Validate the flattened DataFrame flattened_df.show(truncate=False) If the data looks good , save the data as table. flattened_df.write.mode('overwrite').saveAsTable("app_data") Query form hive (beeline) +---------------------------------+----------------+-------------------------------------------+-------------------+-----------------+-----------------------+--------------------+----------------------+----------------------------------------------------+----------------------------------------------------+---------------------+---------------------------+----------------------------------------------------+--------------------+-----------------------+----------------------+------------------------+-----------------------+----------------------------------------------------+-----------------------------+------------------------+------------------------+-----------------------+---------------------------+----------------------+--------------------------+-----------------------------+-------------------------+------------------------+--------------------------------+----------------------------------+-------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------------------------+ | app_data.id | app_data.user | app_data.name | app_data.queue | app_data.state | app_data.finalstatus | app_data.progress | app_data.trackingui | app_data.trackingurl | app_data.diagnostics | app_data.clusterid | app_data.applicationtype | app_data.applicationtags | app_data.priority | app_data.startedtime | app_data.launchtime | app_data.finishedtime | app_data.elapsedtime | app_data.amcontainerlogs | app_data.amhosthttpaddress | app_data.amrpcaddress | app_data.masternodeid | app_data.allocatedmb | app_data.allocatedvcores | app_data.reservedmb | app_data.reservedvcores | app_data.runningcontainers | app_data.memoryseconds | app_data.vcoreseconds | app_data.queueusagepercentage | app_data.clusterusagepercentage | app_data.preemptedresourcemb | app_data.preemptedresourcevcores | app_data.numnonamcontainerpreempted | app_data.numamcontainerpreempted | app_data.preemptedmemoryseconds | app_data.preemptedvcoreseconds | app_data.logaggregationstatus | app_data.unmanagedapplication | app_data.amnodelabelexpression | app_data.timeouts | +---------------------------------+----------------+-------------------------------------------+-------------------+-----------------+-----------------------+--------------------+----------------------+----------------------------------------------------+----------------------------------------------------+---------------------+---------------------------+----------------------------------------------------+--------------------+-----------------------+----------------------+------------------------+-----------------------+----------------------------------------------------+-----------------------------+------------------------+------------------------+-----------------------+---------------------------+----------------------+--------------------------+-----------------------------+-------------------------+------------------------+--------------------------------+----------------------------------+-------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------------------------+ | application_282828282828_12717 | xyz | xyz-4b6bdae2-1a0c-4772-bd8e-0d7454268b82 | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12717/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0 | 282828282828 | aquaman | ABC,xyz_20221107070124_2beb5d90-24c7-4b1b-b977-3c9af1397195,userid=dummy | 0 | 1667822485626 | 1667822485767 | 1667822553365 | 67739 | http://dingdong:8042/node/containerlogs/container_e65_282828282828_12717_01_000001/xyz | dingdong:8042 | dingdong:46457 | dingdong:8041 | -1 | -1 | -1 | -1 | -1 | 1264304 | 79 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | succeeded | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} | | application_282828282828_12724 | xyz | xyz-94962a3e-d230-4fd0-b68b-01b59dd3299d | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12724/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0 | 282828282828 | aquaman | ZZZ_,xyz_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy | 0 | 1667822585231 | 1667822585437 | 1667822631435 | 46204 | http://ding:8042/node/containerlogs/container_e65_282828282828_12724_01_000002/xyz | ding:8042 | ding:46648 | ding:8041 | -1 | -1 | -1 | -1 | -1 | 5603339 | 430 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | time_out | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} | | application_282828282828_12736 | xyz | xyz-1a9c73ef-2992-40a5-aaad-9f0688bb04f4 | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12736/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0 | 282828282828 | aquaman | BLAHBLAH,xyz_20221107070609_8d261352-3efa-46c5-a5a0-8a3cd745d180,userid=dummy | 0 | 1667822771170 | 1667822773663 | 1667822820351 | 49181 | http://dong:8042/node/containerlogs/container_e65_282828282828_12736_01_000001/xyz | dong:8042 | dong:34266 | dong:8041 | -1 | -1 | -1 | -1 | -1 | 1300011 | 89 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | succeeded | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} | | application_282828282828_12735 | xyz | xyz-d5f56a0a-9c6b-4651-8f88-6eaff5953777 | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12735/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0 | 282828282828 | aquaman | HAHAHA_,xyz_20221107070605_a082d9d8-912f-4278-a2ef-5dfe66089fd7,userid=dummy | 0 | 1667822766897 | 1667822766999 | 1667822796759 | 29862 | http://dung:8042/node/containerlogs/container_e65_282828282828_12735_01_000001/xyz | dung:8042 | dung:42765 | dung:8041 | -1 | -1 | -1 | -1 | -1 | 669695 | 44 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | succeeded | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} | +---------------------------------+----------------+-------------------------------------------+-------------------+-----------------+-----------------------+--------------------+----------------------+----------------------------------------------------+----------------------------------------------------+---------------------+---------------------------+----------------------------------------------------+--------------------+-----------------------+----------------------+------------------------+-----------------------+----------------------------------------------------+-----------------------------+------------------------+------------------------+-----------------------+---------------------------+----------------------+--------------------------+-----------------------------+-------------------------+------------------------+--------------------------------+----------------------------------+-------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------------------------+

ggangadharan · ‎10-27-2023

The error message "Unknown HS2 problem when communicating with Thrift server" typically indicates that there is an issue when trying to communicate with the Hive Server 2 (HS2) through its Thrift interface. This error can occur for various reasons, and troubleshooting it may require some investigation. Here are some common steps to help resolve this issue: Check Hive Server Status: Ensure that the Hive Server 2 is up and running and that it's reachable from your client. You can check its status and logs to see if there are any errors or issues reported. Network Connectivity: Verify that there are no network-related issues that might be preventing your client from connecting to the Hive Server. Check firewalls, network configurations, and any potential network interruptions. Hive Configuration: Review the Hive server's configuration to ensure it's correctly set up. Pay attention to security configurations, like authentication and authorization settings. Thrift Protocol Version: Ensure that the Thrift protocol version used by your client matches the version supported by the Hive Server. Mismatched protocol versions can lead to communication problems. Client-Side Issues: Check the client application or code you are using to interact with Hive. Ensure that it's properly configured and making the correct requests to the Hive Server. Logs and Error Messages: Examine the logs and error messages in more detail to get specific information about what might be causing the problem. This can help pinpoint the issue. Server Version Compatibility: Ensure that the client and server components (Hive client and Hive Server) are compatible in terms of versions. Incompatible versions can lead to communication issues. Authentication and Authorization: If your Hive server is configured with authentication and authorization, ensure that you have the necessary permissions and credentials to access it. Load and Resource Constraints: Check if the Hive Server is under heavy load or if there are resource constraints that might be affecting its ability to respond to client requests. Driver and Libraries: Ensure that you are using the correct driver or libraries for your client application. If you're using JDBC or ODBC, make sure the corresponding driver is installed and configured correctly. If you continue to face issues after performing these checks, it may be necessary to provide more specific error messages or details about your environment to diagnose the problem further.

ggangadharan · ‎10-26-2023

Please follow the below article and validate the same. https://my.cloudera.com/knowledge/How-to-configure-HDFS-and-Hive-to-use-different-JCEKS-and?id=326056

ggangadharan · ‎10-26-2023

It seems that you are facing a situation where Query 1 returns results, Query 2 (with an additional field) does not return results, but when using SELECT *, results are returned, and when trimming all the condition fields, results are also returned. This behavior can be attributed to the way you've constructed your queries: Query 1: This query specifies certain conditions and fields, which may match records in your database. Query 2: In Query 2, you've added an additional field (af.unq_id_src_stm) to the SELECT statement. This change in the SELECT clause can affect the results returned. It's possible that the additional field is causing the query not to match any records due to the way the data is structured or the filter conditions. Using SELECT *: When you use SELECT *, it selects all fields in the result set, and it may include fields that are necessary for the join conditions or other aspects of the query. By selecting all fields, you are getting the complete result set. Trimming Condition Fields: If you trim or remove condition fields, it can affect the filter criteria, and as a result, the query may return results that were previously excluded by the conditions. To resolve the issue in Query 2, you may need to carefully review the additional field you added and ensure it doesn't unintentionally affect the join conditions or filter criteria. Additionally, ensure that the data you are querying contains the values specified in the conditions and the new field. You should also consider whether the additional field is really needed for your analysis. If it's not necessary, you can remove it to get the results you expect

RangaReddy · ‎10-24-2023

I think you don't have sufficient resources to run the job for queue root.hdfs. Verify is there any pending running jobs/application in the root.hdfs queue from Resource Manager UI. If it is running kill those if it is not required. And also verify from spark side you have given less resource to test it.

VidyaSargur · ‎10-16-2023

@DaveNepal, Thank you for your participation in the Cloudera Community. I'm happy to see you resolved your issue. Could you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?

ggangadharan · ‎10-13-2023

In Hadoop, you can use the Hadoop Distributed File System (HDFS) shell commands to remove files that meet certain criteria, such as being older than a certain number of days or greater than a certain number of files in a folder. You can achieve this using HDFS shell commands in a shell script. Here's how you can do it: To remove all files greater than 100 files in a folder: hadoop fs -count -q -h <folder_path>: This command retrieves a count of files in the specified folder, along with their sizes and other information. awk '$2 > 100 {print $3}': This awk command filters the output to select only those file paths where the file count is greater than 100. xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths provided by awk and deletes those files using hadoop fs -rm To remove all files older than 10 days in a folder: hadoop fs -ls <folder_path> | awk -v cutoff=$(date -d "10 days ago" +%s) '{if ($6 < cutoff) print $8}' | xargs -I {} hadoop fs -rm {} hadoop fs -ls <folder_path>: This command lists the files in the specified folder. awk -v cutoff=$(date -d "10 days ago" +%s) '{if ($6 < cutoff) print $8}': This awk command calculates the timestamp for 10 days ago and compares it to the modification timestamps of the files. It selects files with modification timestamps older than 10 days. xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths provided by awk and deletes those files using hadoop fs -rm.

ggangadharan · ‎10-10-2023

In Hive, you can achieve a similar result as the UNPIVOT operation in SQL Server by using the LATERAL VIEW and lateral VIEW OUTER explode functions to split the columns into rows. Here's how you can convert your SQL Server query to Hive: SELECT x, check AS y, split AS z FROM dbo.tbl1 LATERAL VIEW OUTER explode(array(1, y2, y3, y4, y5, y6, y7, y8, y9, y10)) tbl AS split; In this Hive query: LATERAL VIEW OUTER explode is used to split the values from columns y2 to y10 into separate rows. The AS clause assigns aliases to the columns, where split corresponds to the values from the UNPIVOTed columns, check corresponds to the column name (y), and x remains unchanged. This query will produce a result similar to the UNPIVOT operation in SQL Server, where the values from columns y2 to y10 are split into separate rows along with their corresponding x and y values. In Hive, you can achieve a similar result as the PIVOT operation in SQL Server by using conditional aggregation along with CASE statements. Here's how you can convert your SQL Server query to Hive: SELECT * FROM ( SELECT a, b, c, cbn_TYPE FROM tbl2 ) SRC LEFT JOIN ( SELECT a, SUM(CASE WHEN cbn_TYPE = 'ONE TQ FOUR' THEN TOTAL_AMOUNT ELSE 0 END) AS ONE_TQ_FOUR, SUM(CASE WHEN cbn_TYPE = 'going loss' THEN TOTAL_AMOUNT ELSE 0 END) AS going_loss, SUM(CASE WHEN cbn_TYPE = 'COSTS LEAVING team sales' THEN TOTAL_AMOUNT ELSE 0 END) AS COSTS_LEAVING_team_sales, SUM(CASE WHEN cbn_TYPE = 'profit' THEN TOTAL_AMOUNT ELSE 0 END) AS profit, SUM(CASE WHEN cbn_TYPE = 'check money' THEN TOTAL_AMOUNT ELSE 0 END) AS check_money FROM tbl2 GROUP BY a ) PIV ON SRC.a = PIV.a; In this Hive query: We first create an intermediate result set (PIV) that calculates the sums for each cbn_TYPE using conditional aggregation (SUM with CASE statements). The LEFT JOIN is used to combine the original source table (SRC) with the aggregated result (PIV) based on the common column a. The result will have columns a, b, c, and the pivoted columns ONE_TQ_FOUR, going_loss, COSTS_LEAVING_team_sales, profit, and check_money, similar to the PIVOT operation in SQL Server. This query essentially performs a manual pivot operation in Hive by using conditional aggregation to calculate the sums for each cbn_TYPE and then joining the results back to the original table. In Hive, you can use the CASE statement to achieve the same result as the SQL Server expression NULLIF(ISNULL(abc.Tc, 0) + ISNULL(abc.YR, 0), 0). Here's the equivalent Hive query: SELECT CASE WHEN (abc.Tc IS NULL AND abc.YR IS NULL) OR (abc.Tc + abc.YR = 0) THEN NULL ELSE abc.Tc + abc.YR END AS result FROM your_table AS abc; In this Hive query: We use the CASE statement to conditionally calculate the result. If both abc.Tc and abc.YR are NULL, or if their sum is equal to 0, we return NULL. Otherwise, we return the sum of abc.Tc and abc.YR. This query replicates the behavior of the NULLIF(ISNULL(abc.Tc, 0) + ISNULL(abc.YR, 0), 0) expression in SQL Server, providing a Hive-compatible solution for achieving the same result.

ggangadharan · ‎10-10-2023

Please share the complete stack-trace to get better context. To perform an INSERT OVERWRITE operation on a Hive ACID transactional table, you need to ensure that you have the right configuration and execute the query correctly. Here are the steps and configurations: Enable ACID Transactions: Make sure your table is created with ACID properties. You can specify it during table creation like this: CREATE TABLE my_table ( -- Your table schema here ) STORED AS ORC TBLPROPERTIES ('transactional'='true'); If your table is not already transactional, you may need to create a new transactional table with the desired schema. Set Hive ACID Properties: You should set some Hive configuration properties to enable ACID transactions if they are not already set: SET hive.support.concurrency=true; SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; SET hive.compactor.initiator.on=true; SET hive.compactor.worker.threads=1; -- Number of compactor threads depending on the number of managed tables and usgae Perform the INSERT OVERWRITE: Use the INSERT OVERWRITE statement to replace the data in the table: INSERT OVERWRITE TABLE my_table SELECT ... FROM ... Ensure that the SELECT statement fetches the data you want to overwrite with. You can use a WHERE clause or other filters to specify the data you want to replace. Enable Auto-Compaction : You can enable auto-compaction to periodically clean up small files created by ACID transactions. REF - https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/ch02s05s01.html https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_acid_operations.html https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_hive_data_compaction.html

ggangadharan · ‎10-10-2023

Please share some sample data to provide a more accurate solution

Online	Offline
Last Visited	‎01-21-2026 07:19 PM

Member Since	‎09-16-2021 02:45 AM
Last Visited	‎01-21-2026 07:19 PM
Posts	423
Kudos received	55

Cloudera Community

Re: HWC on CDP 7.3.1 with Spark 3.5

Re: Using Hadoop Iceberg catalog with Hive engine ...

Re: Where can I find the Maven repository for HDP ...

Re: Failed with exception java.io.IOException:org....

Re: Hive on TEZ memory footprint and Impala stats...

Re: How to parse Yarn Metrics via REST ( api v49 )...

Re: Unable to connect to HIVE from CLI though the ...

Re: Unable to create a hive table from the .csv fi...

Re: Self Join result weird

Re: Spark application stuck at ACCEPTED state at s...

Re: SQOOP import of "image" data type into hive

Re: How to remove all files greater than 100 files...

Re: SQL to Hive Query Conversion

Re: [How?] Hive INSERT OVERWRITE to Transaction Ta...

Re: Why I'm getting Null values in hive table thou...