Member since
09-16-2021
330
Posts
52
Kudos Received
23
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 239 | 11-10-2024 11:19 PM |
 | 372 | 10-25-2024 05:02 AM |
 | 1946 | 09-10-2024 07:50 AM |
 | 700 | 09-04-2024 05:35 AM |
 | 1555 | 08-28-2024 12:40 AM |
10-24-2023
09:36 PM
I think the root.hdfs queue does not have sufficient free resources to run the job. In the Resource Manager UI, check whether any applications are pending or running in the root.hdfs queue, and kill the ones that are not required. Also, on the Spark side, try submitting the job with fewer resources to test whether it can start.
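If you prefer a script to the UI, the check above can be sketched against the ResourceManager REST API (`/ws/v1/cluster/apps`). This is a minimal sketch under assumptions: the base URL passed in is a placeholder for your ResourceManager address, the cluster accepts unauthenticated HTTP (no Kerberos/SPNEGO), and the queue name format (`root.hdfs` versus `hdfs`) may depend on your scheduler.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_base_url, queue, states=("ACCEPTED", "RUNNING")):
    """Build the ResourceManager REST URL that lists applications in a queue."""
    query = urlencode({"queue": queue, "states": ",".join(states)})
    return f"{rm_base_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload):
    """Return (id, state, allocatedMB, allocatedVCores) for each app in a response."""
    # The RM returns {"apps": null} when nothing matches, so guard both levels.
    apps = (payload.get("apps") or {}).get("app") or []
    return [
        (a.get("id"), a.get("state"), a.get("allocatedMB"), a.get("allocatedVCores"))
        for a in apps
    ]


def list_queue_apps(rm_base_url, queue):
    """Fetch and summarize the applications; needs network access to the RM."""
    with urlopen(apps_url(rm_base_url, queue)) as resp:
        return summarize_apps(json.load(resp))
```

Calling `list_queue_apps` with your ResourceManager address and `"root.hdfs"` would show which applications currently hold or wait for resources in that queue.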
10-16-2023
09:34 PM
@DaveNepal, Thank you for your participation in the Cloudera Community. I'm happy to see you resolved your issue. Could you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?
10-13-2023
04:32 AM
In Hadoop, you can use HDFS shell commands in a shell script to remove files that meet certain criteria, such as being older than a certain number of days, or to act when a folder holds more than a certain number of files. Here's how you can do it:
To remove files when a folder holds more than 100 files: hadoop fs -count <folder_path> prints one line for the folder with the directory count, file count, content size, and path, so the second column tells you whether the folder holds more than 100 files (with -q, the quota columns come first and the file count moves to the sixth column). The command does not list individual files, so its output cannot be piped into hadoop fs -rm directly. To delete the surplus, list the files with hadoop fs -ls <folder_path>, decide which ones to keep (for example, the 100 most recent), and pass the remaining paths to xargs -I {} hadoop fs -rm {}.
To remove all files older than 10 days in a folder: hadoop fs -ls <folder_path> | awk -v cutoff="$(date -d "10 days ago" +%Y-%m-%d)" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $8}' | xargs -I {} hadoop fs -rm {}
hadoop fs -ls <folder_path>: This command lists the files in the specified folder. Column 6 is the modification date in yyyy-MM-dd form and column 8 is the path.
awk -v cutoff="$(date -d "10 days ago" +%Y-%m-%d)" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $8}': date -d (GNU date syntax) computes the date 10 days ago in the same yyyy-MM-dd form, and awk compares it with column 6 as a string, which orders correctly for this format. NF >= 8 skips the "Found N items" header line and $1 !~ /^d/ skips directories. Paths that contain spaces are not handled, because only column 8 is printed.
xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths provided by awk and deletes those files using hadoop fs -rm.
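The age filter can also be sketched in Python, which makes it easier to test before deleting anything. This is a minimal sketch, assuming the usual eight-column `hadoop fs -ls` layout (permissions, replication, owner, group, size, date, time, path); the function name `paths_older_than` is a hypothetical helper, and it only selects paths rather than removing them.

```python
from datetime import datetime, timedelta


def paths_older_than(ls_output, days, now=None):
    """Return file paths from `hadoop fs -ls` output modified more than `days` ago."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    old_paths = []
    for line in ls_output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays intact.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip the "Found N items" header and directories
        modified = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        if modified < cutoff:
            old_paths.append(fields[7])
    return old_paths
```

You could feed it the captured output of `hadoop fs -ls <folder_path>`, review the returned list, and only then pass the paths to `hadoop fs -rm`.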
10-10-2023
10:57 AM
1 Kudo
In Hive, you can achieve a result similar to the UNPIVOT operation in SQL Server by using LATERAL VIEW (or LATERAL VIEW OUTER) with the explode function to split the columns into rows. Here's how you can convert your SQL Server query to Hive:
SELECT x, split AS z
FROM dbo.tbl1
LATERAL VIEW OUTER explode(array(1, y2, y3, y4, y5, y6, y7, y8, y9, y10)) tbl AS split;
In this Hive query: LATERAL VIEW OUTER explode turns each element of the array into its own row, so every input row produces one output row per listed value. The AS clause names the generated column: split holds the unpivoted value (returned as z), and x remains unchanged. explode over an array returns only the values, so the name of the source column (the y column of the UNPIVOT result) is not available. If you need it, explode a map instead, for example explode(map('y2', y2, 'y3', y3)) tbl AS y, z, which returns the key and the value as two columns. The array is kept as posted, so check that its first element, the literal 1, is what you intended. This query will produce a result similar to the UNPIVOT operation in SQL Server, where the values from columns y2 to y10 are split into separate rows along with their corresponding x value.
In Hive, you can achieve a result similar to the PIVOT operation in SQL Server by using conditional aggregation with CASE expressions. Here's how you can convert your SQL Server query to Hive:
SELECT *
FROM (
SELECT a, b, c, cbn_TYPE
FROM tbl2
) SRC
LEFT JOIN (
SELECT
a,
SUM(CASE WHEN cbn_TYPE = 'ONE TQ FOUR' THEN TOTAL_AMOUNT ELSE 0 END) AS ONE_TQ_FOUR,
SUM(CASE WHEN cbn_TYPE = 'going loss' THEN TOTAL_AMOUNT ELSE 0 END) AS going_loss,
SUM(CASE WHEN cbn_TYPE = 'COSTS LEAVING team sales' THEN TOTAL_AMOUNT ELSE 0 END) AS COSTS_LEAVING_team_sales,
SUM(CASE WHEN cbn_TYPE = 'profit' THEN TOTAL_AMOUNT ELSE 0 END) AS profit,
SUM(CASE WHEN cbn_TYPE = 'check money' THEN TOTAL_AMOUNT ELSE 0 END) AS check_money
FROM tbl2
GROUP BY a
) PIV
ON SRC.a = PIV.a;
In this Hive query: We first create an intermediate result set (PIV) that calculates the sums for each cbn_TYPE using conditional aggregation (SUM with CASE expressions). The LEFT JOIN combines the source rows (SRC) with the aggregated result (PIV) on the common column a. Because of SELECT *, the result has the SRC columns a, b, c, and cbn_TYPE, followed by the PIV columns: a again, then the pivoted columns ONE_TQ_FOUR, going_loss, COSTS_LEAVING_team_sales, profit, and check_money. List the columns explicitly if you don't want a returned twice. This query essentially performs a manual pivot in Hive by calculating the sums for each cbn_TYPE and then joining the results back to the original table.
In Hive, you can use a CASE expression together with COALESCE to achieve the same result as the SQL Server expression NULLIF(ISNULL(abc.Tc, 0) + ISNULL(abc.YR, 0), 0). Here's the equivalent Hive query:
SELECT
CASE
WHEN COALESCE(abc.Tc, 0) + COALESCE(abc.YR, 0) = 0 THEN NULL
ELSE COALESCE(abc.Tc, 0) + COALESCE(abc.YR, 0)
END AS result
FROM your_table AS abc;
In this Hive query: COALESCE(col, 0) plays the role of ISNULL(col, 0), replacing a NULL with 0 before the addition. If the resulting sum is 0 (which includes the case where both abc.Tc and abc.YR are NULL), we return NULL. Otherwise, we return the sum. Adding the raw columns without COALESCE would not be equivalent: abc.Tc + abc.YR evaluates to NULL whenever only one of the two columns is NULL, whereas the SQL Server expression returns the non-NULL value in that case.
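To see why the NULL handling matters for the NULLIF/ISNULL expression, here is a small Python model of the two behaviours, with None standing in for SQL NULL. It is only an illustration of the semantics, not Hive code; the function names are hypothetical.

```python
def sqlserver_nullif_sum(tc, yr):
    """Model of NULLIF(ISNULL(Tc, 0) + ISNULL(YR, 0), 0); None stands for NULL."""
    total = (0 if tc is None else tc) + (0 if yr is None else yr)
    return None if total == 0 else total


def raw_sum_case(tc, yr):
    """Model of a CASE that adds the raw columns, where NULL + x is NULL."""
    total = None if tc is None or yr is None else tc + yr
    if (tc is None and yr is None) or total == 0:
        return None
    return total


# The two agree unless exactly one input is NULL.
for tc, yr in [(2, 3), (2, -2), (None, None), (5, None), (None, 7)]:
    print(tc, yr, sqlserver_nullif_sum(tc, yr), raw_sum_case(tc, yr))
```

The rows with exactly one None are the ones where a CASE built on the raw sum diverges from the SQL Server expression, which is why the NULL-to-zero substitution has to happen before the addition.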
10-10-2023
10:50 AM
Please share the complete stack trace so we have better context. To perform an INSERT OVERWRITE operation on a Hive ACID transactional table, you need to ensure that you have the right configuration and execute the query correctly. Here are the steps and configurations:
Enable ACID Transactions: Make sure your table is created with ACID properties. You can specify it during table creation like this:
CREATE TABLE my_table (
-- Your table schema here
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
If your table is not already transactional, you may need to create a new transactional table with the desired schema.
Set Hive ACID Properties: You should set some Hive configuration properties to enable ACID transactions if they are not already set. The two compactor properties are normally configured on the Hive Metastore service rather than per session:
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1; -- Number of compactor threads; size it to the number of managed tables and their usage
Perform the INSERT OVERWRITE: Use the INSERT OVERWRITE statement to replace the data in the table:
INSERT OVERWRITE TABLE my_table
SELECT ...
FROM ...
Ensure that the SELECT statement fetches the data you want to overwrite with. INSERT OVERWRITE replaces the existing contents of the table (or of the target partition), so a WHERE clause in the SELECT only filters the rows that are written.
Enable Auto-Compaction: You can enable auto-compaction to periodically clean up small files created by ACID transactions.
REF -
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/ch02s05s01.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_acid_operations.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_hive_data_compaction.html
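If you run the overwrite from a script, the session settings and the statement can be generated together so they are always submitted in the same session. This is a minimal sketch: `overwrite_script` and the `staging_table` name in the usage note are hypothetical, and only the two session-level properties are included because the compactor properties are usually set on the service side.

```python
ACID_SESSION_SETTINGS = {
    "hive.support.concurrency": "true",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
}


def overwrite_script(target_table, select_sql, settings=ACID_SESSION_SETTINGS):
    """Return HiveQL text that sets the session properties, then overwrites the table."""
    lines = [f"SET {key}={value};" for key, value in settings.items()]
    lines.append(f"INSERT OVERWRITE TABLE {target_table}")
    # Normalize the trailing semicolon so the statement ends with exactly one.
    lines.append(select_sql.strip().rstrip(";") + ";")
    return "\n".join(lines)
```

For example, `overwrite_script("my_table", "SELECT * FROM staging_table")` returns text you could save to a file and pass to `beeline -f`.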
10-10-2023
10:35 AM
Please share some sample data so that we can provide a more accurate solution.
10-10-2023
10:29 AM
1 Kudo
Basic pyspark launch command for HWC in JDBC_CLUSTER mode:
pyspark --master yarn --jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar --py-files /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.1.8.0-801.zip --conf spark.sql.hive.hiveserver2.jdbc.url='jdbc:hive2://c3757-node2.coelab.cloudera.com:2181,c3757-node3.coelab.cloudera.com:2181,c3757-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' --conf spark.datasource.hive.warehouse.read.mode='JDBC_CLUSTER' --conf spark.datasource.hive.warehouse.load.staging.dir='/tmp' --conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
To append data to an existing Hive ACID table, ensure that you specify the save mode as 'append'. Example:
Using Python version 2.7.5 (default, Jun 28 2022 15:30:04)
SparkSession available as 'spark'.
>>> from pyspark_llap import HiveWarehouseSession
>>> hive = HiveWarehouseSession.session(spark).build()
>>> df=hive.sql("select * from spark_hwc.employee")
23/10/10 17:20:00 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
23/10/10 17:20:08 INFO rule.HWCSwitchRule: Registering Listeners
>>> df.write.mode("append").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", "spark_hwc.employee_new").save()
>>>
>>>
>>> hive.sql("select count(*) from spark_hwc.employee_new").show()
23/10/10 17:22:04 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
+---+
|_c0|
+---+
| 5|
+---+
>>>
To overwrite data in an existing Hive ACID table, ensure that you specify the save mode as 'overwrite'. Example:
>>> df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", "spark_hwc.employee_new").save()
>>>
To write to a new Hive ACID table, there's no need to specify the save mode explicitly. HWC will automatically create the new ACID table based on the DataFrame's structure and internally trigger the LOAD DATA INPATH command.
Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/integrating-hive-and-bi/topics/hive-read-write-operations.html
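Since the launch command carries many options, it can help to assemble it programmatically. The sketch below only builds the argument list from the options shown in this post; `hwc_pyspark_command` and `as_shell` are hypothetical helper names, and the jar, zip, and JDBC URL you pass in must match your own cluster.

```python
import shlex


def hwc_pyspark_command(jar, py_zip, jdbc_url, read_mode="JDBC_CLUSTER", staging_dir="/tmp"):
    """Assemble the pyspark argument list for a Hive Warehouse Connector session."""
    conf = {
        "spark.sql.hive.hiveserver2.jdbc.url": jdbc_url,
        "spark.datasource.hive.warehouse.read.mode": read_mode,
        "spark.datasource.hive.warehouse.load.staging.dir": staging_dir,
        "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
        "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
    }
    args = ["pyspark", "--master", "yarn", "--jars", jar, "--py-files", py_zip]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def as_shell(args):
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(a) for a in args)
```

Quoting matters here because the ZooKeeper-based JDBC URL contains semicolons, which a shell would otherwise treat as command separators.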
10-10-2023
10:15 AM
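Please share the complete error stack trace. Regarding the message "The table doesn't have partitions": make sure the partition directories in HDFS and the partition metadata in the Hive Metastore are in sync. If partition directories exist in HDFS but are missing from the metastore, MSCK REPAIR TABLE <table_name> can register them.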
10-10-2023
10:06 AM
Grafana is a popular open-source platform for monitoring and observability, and it is commonly associated with telemetry data visualization, especially when integrated with time-series databases like Prometheus, InfluxDB, or Elasticsearch. However, Grafana is not limited to telemetry data, and it can be used with a wide range of data sources, including data that lives in HDFS and Hive tables. Here are some options for using Grafana for data visualization beyond telemetry:
Hive Data Sources: Grafana connects to data stores through data source plugins. As far as I know, Hive is not among the data sources that ship with Grafana, so check whether a Hive-compatible plugin is available for your version; otherwise, copy the Hive data you want to chart into a store that Grafana supports directly.
HDFS Data Sources: While Grafana primarily focuses on time-series data, you can still use it to visualize data stored in HDFS by connecting it to Hadoop-related data sources or by exporting HDFS data to another data store (e.g., Elasticsearch, InfluxDB) that Grafana supports.
SQL Databases: Grafana can connect to traditional relational databases using SQL data sources. If you have data stored in SQL databases, you can use Grafana to create dashboards and visualizations.
Log Data: Grafana can be used for log data analysis and visualization. You can integrate it with tools like Loki (for log aggregation) and explore log data in dashboards.
Custom Plugins: If you have a unique data source or a specific format, you can develop custom data source plugins for Grafana to connect to your data and visualize it as needed.
API Data: Through plugins, Grafana can read data exposed by REST APIs, GraphQL APIs, and other web services.
Mixed Data Sources: Grafana allows you to create dashboards that combine data from multiple sources, making it versatile for various data visualization needs.
While Grafana is flexible, it's important to consider the nature of your data and the specific visualization requirements. Depending on your use case, you may need to choose the most suitable data source, data format, and visualization options within Grafana to achieve your desired results.
10-10-2023
09:58 AM
When users query a Hive table partitioned on a specific column (in your case, "source system name") but do not include a filter condition on that partition column in their queries, Hive may need to scan all partitions of the table to retrieve the relevant data. This can lead to less efficient query performance, as it requires reading unnecessary data from multiple partitions.
In your scenario, where you perform frequent insert overwrites to keep only the current data, the table may not grow drastically in terms of total data volume. However, if the users frequently query the table without specifying the partition column condition, it can still result in increased query processing time and resource utilisation. To improve query efficiency in this situation, you have a few options:
Partition Pruning: Encourage users to include the partition column condition in their queries. Hive has built-in partition pruning optimization, which allows it to skip unnecessary partitions when the partition column condition is provided.
Materialized Views: If certain common query patterns exist, consider creating materialized views that pre-aggregate or pre-filter data based on those patterns. This can significantly speed up queries that align with the materialized views.
Optimize Data Layout: Ensure that the data is stored efficiently, and consider using columnar storage formats like ORC or Parquet, which can improve query performance.
Ultimately, the choice of optimization strategy depends on the specific usage patterns and requirements of your users. It's essential to monitor query performance and understand your users' query behavior to determine which optimization approaches are most effective.
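The effect of partition pruning can be illustrated with a toy model. This is not how Hive is implemented; it is a minimal sketch in which a table is a dict from partition value to row count, and the partition values and counts in the usage note are made up for illustration.

```python
def partitions_read(partition_rows, source_system=None):
    """Return (partitions, rows) a scan touches, with or without a partition filter."""
    if source_system is None:
        selected = dict(partition_rows)  # no filter: every partition is read
    else:
        # filter on the partition column: only the matching partition is read
        selected = {k: v for k, v in partition_rows.items() if k == source_system}
    return sorted(selected), sum(selected.values())
```

With a hypothetical table `{"crm": 1000, "erp": 5000, "web": 250}`, an unfiltered query touches all three partitions and 6250 rows, while a query filtered on `erp` touches one partition and 5000 rows.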