Member since: 09-16-2021
Posts: 144
Kudos Received: 6
Solutions: 17
My Accepted Solutions
Views | Posted |
---|---|
226 | 11-06-2023 03:10 AM |
115 | 10-30-2023 07:17 AM |
187 | 10-27-2023 12:07 AM |
230 | 10-10-2023 10:57 AM |
221 | 10-10-2023 10:50 AM |
10-10-2023
10:57 AM
1 Kudo
In Hive, you can achieve a result similar to SQL Server's UNPIVOT by using LATERAL VIEW (or LATERAL VIEW OUTER) with the explode function to turn columns into rows. Here's how you can convert your SQL Server query to Hive:

SELECT x, y, z
FROM dbo.tbl1
LATERAL VIEW OUTER explode(map('y2', y2, 'y3', y3, 'y4', y4, 'y5', y5, 'y6', y6, 'y7', y7, 'y8', y8, 'y9', y9, 'y10', y10)) tbl AS y, z;

In this Hive query:
LATERAL VIEW OUTER explode turns the values from columns y2 to y10 into separate rows.
Exploding a map of column name to column value returns two columns: y holds the name of the source column, z holds its value, and x remains unchanged.
The exploded columns must share a data type, so cast them if they differ.
SQL Server's UNPIVOT omits rows whose value is NULL; add WHERE z IS NOT NULL if you need the same behavior.

This query will produce a result similar to the UNPIVOT operation in SQL Server, where the values from columns y2 to y10 are split into separate rows along with their corresponding x and y values.

In Hive, you can achieve a result similar to SQL Server's PIVOT by using conditional aggregation with CASE expressions. Here's how you can convert your SQL Server query to Hive:

SELECT *
FROM (
SELECT a, b, c, cbn_TYPE
FROM tbl2
) SRC
LEFT JOIN (
SELECT
a,
SUM(CASE WHEN cbn_TYPE = 'ONE TQ FOUR' THEN TOTAL_AMOUNT ELSE 0 END) AS ONE_TQ_FOUR,
SUM(CASE WHEN cbn_TYPE = 'going loss' THEN TOTAL_AMOUNT ELSE 0 END) AS going_loss,
SUM(CASE WHEN cbn_TYPE = 'COSTS LEAVING team sales' THEN TOTAL_AMOUNT ELSE 0 END) AS COSTS_LEAVING_team_sales,
SUM(CASE WHEN cbn_TYPE = 'profit' THEN TOTAL_AMOUNT ELSE 0 END) AS profit,
SUM(CASE WHEN cbn_TYPE = 'check money' THEN TOTAL_AMOUNT ELSE 0 END) AS check_money
FROM tbl2
GROUP BY a
) PIV
ON SRC.a = PIV.a;

In this Hive query:
We first create an intermediate result set (PIV) that calculates the sums for each cbn_TYPE using conditional aggregation (SUM with CASE expressions).
The LEFT JOIN combines the original source table (SRC) with the aggregated result (PIV) on the common column a.
The result has columns a, b, c, and the pivoted columns ONE_TQ_FOUR, going_loss, COSTS_LEAVING_team_sales, profit, and check_money, similar to the PIVOT operation in SQL Server.

This query essentially performs a manual pivot in Hive: conditional aggregation calculates the sums for each cbn_TYPE, and the join brings the results back to the original table.

In Hive, you can use COALESCE inside a CASE expression to achieve the same result as the SQL Server expression NULLIF(ISNULL(abc.Tc, 0) + ISNULL(abc.YR, 0), 0). Here's the equivalent Hive query:

SELECT
CASE
WHEN COALESCE(abc.Tc, 0) + COALESCE(abc.YR, 0) = 0 THEN NULL
ELSE COALESCE(abc.Tc, 0) + COALESCE(abc.YR, 0)
END AS result
FROM your_table AS abc;

In this Hive query:
COALESCE replaces a NULL in abc.Tc or abc.YR with 0, just as ISNULL does in SQL Server, so a NULL in only one column does not turn the whole sum into NULL.
If the sum is 0 (which includes the case where both columns are NULL), we return NULL.
Otherwise, we return the sum.

This replicates the behavior of NULLIF(ISNULL(abc.Tc, 0) + ISNULL(abc.YR, 0), 0) in SQL Server with functions that Hive supports.
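To sanity-check the NULL handling of the SQL Server expression NULLIF(ISNULL(abc.Tc, 0) + ISNULL(abc.YR, 0), 0), here is a minimal sketch that evaluates its portable form in an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, on the assumption that COALESCE and NULLIF behave the same way for these inputs; the table and the sample values are made up for illustration. Whether your Hive release offers NULLIF directly depends on its version, so check before relying on it.

```python
import sqlite3

# Portable form of NULLIF(ISNULL(Tc, 0) + ISNULL(YR, 0), 0):
# COALESCE stands in for SQL Server's ISNULL.
EXPR = "NULLIF(COALESCE(Tc, 0) + COALESCE(YR, 0), 0)"


def evaluate(rows):
    """Evaluate EXPR for each (Tc, YR) pair in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE abc (id INTEGER PRIMARY KEY, Tc INTEGER, YR INTEGER)"
        )
        conn.executemany("INSERT INTO abc (Tc, YR) VALUES (?, ?)", rows)
        cur = conn.execute(f"SELECT {EXPR} FROM abc ORDER BY id")
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


# Hypothetical sample pairs: both NULL, one NULL, sum of zero, ordinary sum.
samples = [(None, None), (None, 5), (3, -3), (2, 3)]
results = evaluate(samples)
for pair, result in zip(samples, results):
    print(pair, "->", result)
```

The case to watch is the one where only one column is NULL: the expression should still return the other column's value rather than NULL.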
10-10-2023
10:50 AM
Please share the complete stack trace so we have better context.

To perform an INSERT OVERWRITE operation on a Hive ACID transactional table, you need the right configuration and a correctly formed query. Here are the steps and configurations:

Enable ACID Transactions: Make sure your table is created with ACID properties. You can specify them during table creation like this:

CREATE TABLE my_table (
-- Your table schema here
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

If your table is not already transactional, you may need to create a new transactional table with the desired schema.

Set Hive ACID Properties: Set the following Hive configuration properties to enable ACID transactions if they are not already set:

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1; -- Number of compactor threads; size it to the number of managed tables and their usage

Perform the INSERT OVERWRITE: Use the INSERT OVERWRITE statement to replace the data in the table:

INSERT OVERWRITE TABLE my_table
SELECT ...
FROM ...

Ensure that the SELECT statement returns exactly the data the table should contain afterwards. You can use a WHERE clause or other filters to narrow down that data.

Enable Auto-Compaction: You can enable auto-compaction to periodically clean up the small files created by ACID transactions.

REF -
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/ch02s05s01.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_acid_operations.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_hive_data_compaction.html
10-10-2023
10:35 AM
Please share some sample data to provide a more accurate solution
10-10-2023
10:29 AM
1 Kudo
Basic pyspark launch command for HWC in JDBC_CLUSTER mode (the same options apply to spark-submit):

pyspark --master yarn \
--jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar \
--py-files /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.1.8.0-801.zip \
--conf spark.sql.hive.hiveserver2.jdbc.url='jdbc:hive2://c3757-node2.coelab.cloudera.com:2181,c3757-node3.coelab.cloudera.com:2181,c3757-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' \
--conf spark.datasource.hive.warehouse.read.mode='JDBC_CLUSTER' \
--conf spark.datasource.hive.warehouse.load.staging.dir='/tmp' \
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \
--conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator

To append data to an existing Hive ACID table, specify the save mode as 'append'.

Example:

Using Python version 2.7.5 (default, Jun 28 2022 15:30:04)
SparkSession available as 'spark'.
>>> from pyspark_llap import HiveWarehouseSession
>>> hive = HiveWarehouseSession.session(spark).build()
>>> df=hive.sql("select * from spark_hwc.employee")
23/10/10 17:20:00 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
23/10/10 17:20:08 INFO rule.HWCSwitchRule: Registering Listeners
>>> df.write.mode("append").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", "spark_hwc.employee_new").save()
>>>
>>>
>>> hive.sql("select count(*) from spark_hwc.employee_new").show()
23/10/10 17:22:04 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
+---+
|_c0|
+---+
| 5|
+---+
>>>

To overwrite the data in an existing Hive ACID table, specify the save mode as 'overwrite'.

Example:

>>> df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", "spark_hwc.employee_new").save()
>>>

When the target Hive ACID table does not exist yet, there's no need to specify the saveMode explicitly for either an append or an overwrite. HWC will automatically create the new ACID table based on the DataFrame's structure and internally trigger the LOAD DATA INPATH command.

Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/integrating-hive-and-bi/topics/hive-read-write-operations.html
10-10-2023
10:15 AM
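Please share the complete error stack trace. Regarding the message "The table doesn't have partitions": make sure the partition directories in HDFS and the partition metadata in the Hive Metastore are in sync (for example, MSCK REPAIR TABLE registers partition directories that exist in HDFS but are missing from the metastore).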
10-10-2023
10:06 AM
Grafana is a popular open-source platform for monitoring and observability, and it is commonly associated with telemetry data visualization, especially when integrated with time-series stores like Prometheus, InfluxDB, or Elasticsearch. However, Grafana is not limited to telemetry data, and it can be used with a wide range of data sources, which can include data that originates in HDFS and Hive tables. Here are some options for using Grafana beyond telemetry:

Hive Data Sources: Grafana connects to databases and data lakes through data source plugins. If a plugin that can reach Hive (for example over JDBC or ODBC) is available in your environment, you can configure it as a data source and visualize data stored in Hive tables. Check the plugin catalog for your Grafana version before planning around this.

HDFS Data Sources: While Grafana primarily focuses on time-series data, you can still use it to visualize data stored in HDFS by connecting it to Hadoop-related data sources or by exporting HDFS data to another data store (e.g., Elasticsearch, InfluxDB) that Grafana supports.

SQL Databases: Grafana can connect to traditional relational databases using SQL data sources. If you have data stored in SQL databases, you can use Grafana to create dashboards and visualizations.

Log Data: Grafana can be used for log data analysis and visualization. You can integrate it with tools like Loki (for log aggregation) and explore log data in dashboards.

Custom Plugins: If you have a unique data source or a specific format, you can develop a custom data source plugin for Grafana to connect to your data and visualize it as needed.

API Data: Grafana supports data sources that expose data through APIs. You can connect to REST APIs, GraphQL APIs, and other web services to visualize data.

Mixed Data Sources: Grafana allows you to create dashboards that combine data from multiple sources.

While Grafana is flexible, it's important to consider the nature of your data and the specific visualization requirements. Depending on your use case, you may need to choose the most suitable data source, data format, and visualization options within Grafana to achieve your desired results.
10-10-2023
09:58 AM
When users query a Hive table partitioned on a specific column (in your case, "source system name") but do not include a filter condition on that partition column in their queries, Hive may need to scan all partitions of the table to retrieve the relevant data. This can lead to less efficient query performance, as it requires reading unnecessary data from multiple partitions.

In your scenario, where you perform frequent insert overwrites to keep only the current data, the table may not grow drastically in terms of total data volume. However, if the users frequently query the table without specifying the partition column condition, it can still result in increased query processing time and resource utilisation.

To improve query efficiency in this situation, you have a few options:

Partition Pruning: Encourage users to include the partition column condition in their queries. Hive has built-in partition pruning optimization, which allows it to skip unnecessary partitions when the partition column condition is provided.

Materialized Views: If certain common query patterns exist, consider creating materialized views that pre-aggregate or pre-filter data based on those patterns. This can significantly speed up queries that align with the materialized views.

Optimize Data Layout: Ensure that the data is stored efficiently, and consider using columnar storage formats like ORC or Parquet, which can improve query performance.

Ultimately, the choice of optimization strategy depends on the specific usage patterns and requirements of your users. It's essential to monitor query performance and understand your users' query behavior to determine which optimization approaches are most effective.
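The effect of partition pruning can be illustrated with a small toy model in Python. This is only a sketch of the idea, not how Hive implements it: the table is a plain dictionary keyed by a hypothetical source system name, and the partition names, rows, and the scan helper are all made up for illustration.

```python
# Hypothetical table: one partition per source system, each holding a few rows.
partitions = {
    "crm": [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}],
    "erp": [{"id": 3, "amount": 30}],
    "web": [{"id": 4, "amount": 40}, {"id": 5, "amount": 50}],
}


def scan(table, source_system=None, predicate=lambda row: True):
    """Return (matching_rows, partitions_read).

    With a filter on the partition column, only that partition is read.
    Without it, every partition has to be read and filtered row by row.
    """
    if source_system is not None:
        selected = {source_system: table.get(source_system, [])}
    else:
        selected = table
    rows = [row for part in selected.values() for row in part if predicate(row)]
    return rows, len(selected)


# Query with the partition column in the filter: one partition is read.
pruned_rows, pruned_reads = scan(partitions, source_system="erp")

# Query that filters only on a regular column: all partitions are read.
full_rows, full_reads = scan(partitions, predicate=lambda r: r["amount"] >= 30)

print("partitions read with partition filter:", pruned_reads)
print("partitions read without partition filter:", full_reads)
```

On a real table, running EXPLAIN on the query is one way to check whether the partition filter is actually being applied.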
10-09-2023
06:27 AM
Could you kindly provide the DDL and a sample dataset to facilitate a more in-depth explanation?
10-04-2023
02:33 AM
1 Kudo
In Hive, there is no built-in data type that directly corresponds to the SQL IMAGE data type. Hive primarily deals with structured data types like strings, numbers, and complex types such as arrays, maps, and structs.

To store binary data like images in Hive, you typically use the BINARY data type, or store the data as STRING if you want to represent it in base64-encoded form. A BINARY column holds the raw bytes. A STRING column does not, so you need to handle the encoding and decoding yourself; Hive provides the base64() and unbase64() functions for this. Here's an example of how you might store binary image data in Hive using the BINARY data type:

CREATE TABLE image_data (
image_id INT,
image_content BINARY
);

When you insert data into this table, make sure the ingestion path delivers the image as raw bytes rather than as text; otherwise the stored value will not match the source.

If retaining the image data in its original file format is crucial, you may want to consider storage that is specifically designed for binary data, such as a distributed file system or a binary data storage service. Hive, being primarily designed for structured data, may not be the best choice for this use case.
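For the base64-in-STRING approach, the encode and decode steps can be sketched in Python. This only shows the round trip on the client side, under the assumption that the encoded text is what ends up in the STRING column; the helper names and the sample payload are hypothetical.

```python
import base64


def to_base64_text(image_bytes: bytes) -> str:
    """Encode raw image bytes as ASCII text suitable for a STRING column."""
    return base64.b64encode(image_bytes).decode("ascii")


def from_base64_text(encoded: str) -> bytes:
    """Decode the stored text back into the original image bytes."""
    return base64.b64decode(encoded)


# Made-up sample payload: the PNG signature followed by a few arbitrary bytes.
original = b"\x89PNG\r\n\x1a\n" + bytes(range(8))

encoded = to_base64_text(original)
restored = from_base64_text(encoded)

print("encoded length:", len(encoded))
print("round trip preserved bytes:", restored == original)
```

The encoded text is larger than the raw bytes (roughly a third more), which is worth factoring in when choosing between STRING and BINARY.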
10-03-2023
02:15 AM
It seems that the query involves dynamic partitioning, but the dynamic partition column is not included in either the select statement or the Common Table Expression (CTE). Please add the dynamic partition column 'date' to the select statement and validate it in Beeline.