Member since
09-16-2021
423
Posts
55
Kudos Received
39
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| 1298 | 10-22-2025 05:48 AM | |
| 1378 | 09-05-2025 07:19 AM | |
| 2337 | 07-15-2025 02:22 AM | |
| 3198 | 05-22-2025 03:00 AM | |
| 2004 | 05-19-2025 03:02 AM |
01-09-2024
03:40 AM
As I was already using the Hadoop Credential Provider, I found a solution that does not require decrypting the password as follows: PySpark code: # Spark session
spark = SparkSession.builder \
.config("spark.yarn.keytab=/etc/security/keytabs/<APPLICATION_USER>.keytab") \
.appName('SPARK_TEST') \
.master("yarn") \
.getOrCreate()
credential_provider_path = 'jceks://hdfs/<PATH>/<CREDENTIAL_FILE>.jceks'
credential_name = 'PASSWORD.ALIAS'
# Hadoop credential
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set('hadoop.security.credential.provider.path',credential_provider_path)
credential_raw = conf.getPassword(credential_name)
for i in range(credential_raw.__len__()):
password = password + str(credential_raw.__getitem__(i)) The important point above is the .config() line in SparkSession. You must enter the keytab to access the password. Otherwise you will get the encrypted value. I can't say that I'm very happy with being able to directly manipulate the password value in the code. I would like to delegate this to some component in a way that the programmer does not have direct access to the password value. Maybe what I'm looking for is some kind of authentication provider, but for now the solution above works for me.
... View more
11-29-2023
10:40 PM
Thanks. I was able to resolve it by updating the location in hive meta store. /usr/bin/hive --service metatool -updateLocation new-location old-location
... View more
11-21-2023
09:37 AM
The error you're encountering (OperationalError: TExecuteStatementResp(status=TStatus(statusCode=3, ...) indicates that there was an issue during the execution of the Hive query. The specific error message within the response is Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Here are a few steps you can take to troubleshoot and resolve the issue: Check Hive Query Logs: Review the Hive query logs to get more details about the error. The logs might provide information about the specific query or task that failed, including any error messages or stack traces. You can find the logs in the Hive logs directory. The location may vary based on your Hadoop distribution and configuration. Inspect Query Syntax: Double-check the syntax of your Hive SQL query. Ensure that the query is valid and properly formed. Sometimes, a syntax error can lead to execution failures. Verify Hive Table Existence: Confirm that the Hive table you're querying actually exists. If the table or the specified database is missing, it can lead to errors. Check Permissions: Verify that the user running the Python query has the necessary permissions to access and query the Hive table. Lack of permissions can result in execution errors. Examine Tez Configuration: If your Hive queries use the Tez execution engine, check the Tez configuration. Ensure that Tez is properly configured on your cluster and that there are no issues with the Tez execution. Look for Resource Constraints: The error message mentions TezTask, so consider checking if there are any resource constraints on the Tez execution, such as memory or container size limitations. Update Python Library: Ensure that you are using a compatible version of the Python library for interacting with Hive (e.g., pyhive or pyhive[hive]). Updating the library to the latest version might help resolve certain issues. Test with a Simple Query: Simplify your query to a basic one and see if it executes successfully. This can help isolate whether the issue is specific to the query or a more general problem. After reviewing the logs and checking the mentioned aspects, you should have more insights into what might be causing the error. If the issue persists, consider providing more details about the Hive query and the surrounding context, so we can offer more targeted assistance.
... View more
11-21-2023
06:30 AM
To achieve your goal of loading data from all the latest files in each folder into a single DataFrame, you can collect the file paths from each folder in a list and then load the data into the DataFrame outside the loop. Here's a modified version of your code: import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val static_path = "/user/hdfs/test/partition_date="
val hours = 3
// Creating list of each folder.
val paths = (0 until hours)
.map(h => currentTs.minusHours(h))
.map(ts => s"${static_path}${ts.toLocalDate}/hour=${ts.getHour}")
.toList
// Collect the latest file paths from each folder in a list
val latestFilePaths = paths.flatMap { eachfolder =>
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
val pathstatus = fs.listStatus(new Path(eachfolder))
val currpathfiles = pathstatus.map(x => (x.getPath.toString, x.getModificationTime))
val latestFilePath = currpathfiles
.filter(_._1.endsWith(".csv"))
.sortBy(_._2)
.reverse
.headOption
.map(_._1)
latestFilePath
}
// Load data from all the latest files into a single DataFrame
val df = spark.read.format("csv").load(latestFilePaths: _*)
// Show the combined DataFrame
df.show() In this modified code: latestFilePaths is a list that collects the latest file path from each folder. Outside the loop, spark.read.format("csv").load(latestFilePaths: _*) is used to load data from all the latest files into a single DataFrame. Now, df will contain data from all the latest files in each folder, and you can perform further operations or analysis on this combined DataFrame.
... View more
11-21-2023
06:10 AM
In Hive, metadata related to tables and columns is typically stored in the 'hive' database, specifically within the 'TBLS' and 'COLUMNS_V2' tables in the 'metastore' database. It is not recommended for users to query the metadata directly. Instead, users can leverage the 'sys' database tables. Here is a modified query that utilizes the 'hive' database tables: sql USE sys;
-- Get the count of columns for all tables
SELECT
t.tbl_name AS TABLE_NAME,
COUNT(c.column_name) AS COLUMN_COUNT
FROM
tbls t
JOIN
columns_v2 c
ON
t.tbl_id = c.cd_id
GROUP BY
t.tbl_name; Explanation: The 'sys.tbls' table contains information about tables, while the 'sys.columns_v2' table contains information about columns. We join these tables on the 'TBL_ID' and 'CD_ID' columns to retrieve information about columns for each table. The 'COUNT(c.COLUMN_NAME)' expression calculates the count of columns for each table. This query provides a list of tables along with the count of columns for each table, using the 'sys' database tables."
... View more
11-21-2023
05:43 AM
The error message indicates that there is an inconsistency between the expected schema for the column 'db.table.parameter_11' and the actual schema found in the Parquet file 'hdfs:/path/table/1_data.0.parq'. The column type is expected to be a STRING, but the Parquet schema suggests that it is an optional int64 (integer) column. To resolve this issue, you'll need to investigate and potentially correct the schema mismatch. Here are some steps you can take: Verify the Expected Schema: Check the definition of the 'db.table.parameter_11' column in the Impala metadata or Hive metastore. Ensure that it is defined as a STRING type. Inspect the Parquet File Schema: You can use tools like parquet-tools to inspect the schema of the Parquet file directly. Run the following command in the terminal: bash parquet-tools schema 1_data.0.parq Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema. Compare Expected vs. Actual Schema: Compare the expected schema for 'db.table.parameter_11' with the actual schema found in the Parquet file. Identify any differences in data types. Investigate Data Inconsistencies: If there are data inconsistencies, investigate how they might have occurred. It's possible that there was a schema evolution or a mismatch during the data writing process. Resolve Schema Mismatch: Depending on your findings, you may need to correct the schema mismatch. This could involve updating the metadata in Impala or Hive to match the actual schema or adjusting the Parquet file schema. Update Impala Statistics: After resolving the schema mismatch, it's a good practice to update Impala statistics for the affected table. This can be done using the COMPUTE STATS command in Impala: This step ensures that Impala has up-to-date statistics for query optimization. Here's a high-level example of what the Parquet schema inspection might look like: parquet-tools schema 1_data.0.parq Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema. If the data type in the Parquet schema is incorrect, you may need to investigate how the data was written and whether there were any issues during that process. Correcting the schema mismatch and updating Impala statistics should help resolve the issue.
... View more
11-08-2023
01:17 AM
1 Kudo
To add to the point of @ggangadharan, there are lots of good articles/posts why the float and even the double datatype has these problems. Note that this is not Hive / Hadoop or Java specific. https://stackoverflow.com/questions/3730019/why-not-use-double-or-float-to-represent-currency https://dzone.com/articles/never-use-float-and-double-for-monetary-calculatio https://www.red-gate.com/hub/product-learning/sql-prompt/the-dangers-of-using-float-or-real-datatypes Miklos
... View more
10-30-2023
07:17 AM
It appears that the JSON data contains multiple application entries within a single line, presented as struct data. This format makes schema creation challenging. To address this, you can leverage Spark to flatten the schema and store the data in Hive. This enables you to query the data conveniently from either Hive or Spark. Read the data JSON data df = spark.read.json("/user/hive/app_data_sample_data.json") First, explode the "app" array to separate rows from pyspark.sql.functions import col, explode, lit, struct
exploded_df = df.select(
explode(col("apps.app")).alias("app")
) Flatten and transform the exploded DataFrame # Flatten and transform the exploded DataFrame
flattened_df = exploded_df.select(
col("app.id").alias("id"),
col("app.user").alias("user"),
col("app.name").alias("name"),
col("app.queue").alias("queue"),
col("app.state").alias("state"),
col("app.finalstatus").alias("finalstatus"),
col("app.progress").alias("progress"),
col("app.trackingui").alias("trackingui"),
col("app.trackingurl").alias("trackingurl"),
col("app.diagnostics").alias("diagnostics"),
col("app.clusterid").alias("clusterid"),
col("app.applicationtype").alias("applicationtype"),
col("app.applicationtags").alias("applicationtags"),
col("app.priority").alias("priority"),
col("app.startedtime").alias("startedtime"),
col("app.launchtime").alias("launchtime"),
col("app.finishedtime").alias("finishedtime"),
col("app.elapsedtime").alias("elapsedtime"),
col("app.amcontainerlogs").alias("amcontainerlogs"),
col("app.amhosthttpaddress").alias("amhosthttpaddress"),
col("app.amrpcaddress").alias("amrpcaddress"),
col("app.masternodeid").alias("masternodeid"),
col("app.allocatedmb").alias("allocatedmb"),
col("app.allocatedvcores").alias("allocatedvcores"),
col("app.reservedmb").alias("reservedmb"),
col("app.reservedvcores").alias("reservedvcores"),
col("app.runningcontainers").alias("runningcontainers"),
col("app.memoryseconds").alias("memoryseconds"),
col("app.vcoreseconds").alias("vcoreseconds"),
col("app.queueusagepercentage").alias("queueusagepercentage"),
col("app.clusterusagepercentage").alias("clusterusagepercentage"),
col("app.preemptedresourcemb").alias("preemptedresourcemb"),
col("app.preemptedresourcevcores").alias("preemptedresourcevcores"),
col("app.numnonamcontainerpreempted").alias("numnonamcontainerpreempted"),
col("app.numamcontainerpreempted").alias("numamcontainerpreempted"),
col("app.preemptedmemoryseconds").alias("preemptedmemoryseconds"),
col("app.preemptedvcoreseconds").alias("preemptedvcoreseconds"),
col("app.logaggregationstatus").alias("logaggregationstatus"),
col("app.unmanagedapplication").alias("unmanagedapplication"),
col("app.amnodelabelexpression").alias("amnodelabelexpression"),
struct(
lit("lifetime").alias("type"),
lit("unlimited").alias("expirytime"),
lit(-1).alias("remainingtimeinseconds")
).alias("timeouts")
) Validate the flattened DataFrame flattened_df.show(truncate=False) If the data looks good , save the data as table. flattened_df.write.mode('overwrite').saveAsTable("app_data") Query form hive (beeline) +---------------------------------+----------------+-------------------------------------------+-------------------+-----------------+-----------------------+--------------------+----------------------+----------------------------------------------------+----------------------------------------------------+---------------------+---------------------------+----------------------------------------------------+--------------------+-----------------------+----------------------+------------------------+-----------------------+----------------------------------------------------+-----------------------------+------------------------+------------------------+-----------------------+---------------------------+----------------------+--------------------------+-----------------------------+-------------------------+------------------------+--------------------------------+----------------------------------+-------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------------------------+
| app_data.id | app_data.user | app_data.name | app_data.queue | app_data.state | app_data.finalstatus | app_data.progress | app_data.trackingui | app_data.trackingurl | app_data.diagnostics | app_data.clusterid | app_data.applicationtype | app_data.applicationtags | app_data.priority | app_data.startedtime | app_data.launchtime | app_data.finishedtime | app_data.elapsedtime | app_data.amcontainerlogs | app_data.amhosthttpaddress | app_data.amrpcaddress | app_data.masternodeid | app_data.allocatedmb | app_data.allocatedvcores | app_data.reservedmb | app_data.reservedvcores | app_data.runningcontainers | app_data.memoryseconds | app_data.vcoreseconds | app_data.queueusagepercentage | app_data.clusterusagepercentage | app_data.preemptedresourcemb | app_data.preemptedresourcevcores | app_data.numnonamcontainerpreempted | app_data.numamcontainerpreempted | app_data.preemptedmemoryseconds | app_data.preemptedvcoreseconds | app_data.logaggregationstatus | app_data.unmanagedapplication | app_data.amnodelabelexpression | app_data.timeouts |
+---------------------------------+----------------+-------------------------------------------+-------------------+-----------------+-----------------------+--------------------+----------------------+----------------------------------------------------+----------------------------------------------------+---------------------+---------------------------+----------------------------------------------------+--------------------+-----------------------+----------------------+------------------------+-----------------------+----------------------------------------------------+-----------------------------+------------------------+------------------------+-----------------------+---------------------------+----------------------+--------------------------+-----------------------------+-------------------------+------------------------+--------------------------------+----------------------------------+-------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------------------------+
| application_282828282828_12717 | xyz | xyz-4b6bdae2-1a0c-4772-bd8e-0d7454268b82 | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12717/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0
| 282828282828 | aquaman | ABC,xyz_20221107070124_2beb5d90-24c7-4b1b-b977-3c9af1397195,userid=dummy | 0 | 1667822485626 | 1667822485767 | 1667822553365 | 67739 | http://dingdong:8042/node/containerlogs/container_e65_282828282828_12717_01_000001/xyz | dingdong:8042 | dingdong:46457 | dingdong:8041 | -1 | -1 | -1 | -1 | -1 | 1264304 | 79 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | succeeded | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} |
| application_282828282828_12724 | xyz | xyz-94962a3e-d230-4fd0-b68b-01b59dd3299d | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12724/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0
| 282828282828 | aquaman | ZZZ_,xyz_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy | 0 | 1667822585231 | 1667822585437 | 1667822631435 | 46204 | http://ding:8042/node/containerlogs/container_e65_282828282828_12724_01_000002/xyz | ding:8042 | ding:46648 | ding:8041 | -1 | -1 | -1 | -1 | -1 | 5603339 | 430 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | time_out | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} |
| application_282828282828_12736 | xyz | xyz-1a9c73ef-2992-40a5-aaad-9f0688bb04f4 | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12736/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0
| 282828282828 | aquaman | BLAHBLAH,xyz_20221107070609_8d261352-3efa-46c5-a5a0-8a3cd745d180,userid=dummy | 0 | 1667822771170 | 1667822773663 | 1667822820351 | 49181 | http://dong:8042/node/containerlogs/container_e65_282828282828_12736_01_000001/xyz | dong:8042 | dong:34266 | dong:8041 | -1 | -1 | -1 | -1 | -1 | 1300011 | 89 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | succeeded | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} |
| application_282828282828_12735 | xyz | xyz-d5f56a0a-9c6b-4651-8f88-6eaff5953777 | root.users.dummy | finished | succeeded | 100.0 | history | http://dang:8088/proxy/application_282828282828_12735/ | session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0
| 282828282828 | aquaman | HAHAHA_,xyz_20221107070605_a082d9d8-912f-4278-a2ef-5dfe66089fd7,userid=dummy | 0 | 1667822766897 | 1667822766999 | 1667822796759 | 29862 | http://dung:8042/node/containerlogs/container_e65_282828282828_12735_01_000001/xyz | dung:8042 | dung:42765 | dung:8041 | -1 | -1 | -1 | -1 | -1 | 669695 | 44 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | succeeded | false | | {"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1} |
+---------------------------------+----------------+-------------------------------------------+-------------------+-----------------+-----------------------+--------------------+----------------------+----------------------------------------------------+----------------------------------------------------+---------------------+---------------------------+----------------------------------------------------+--------------------+-----------------------+----------------------+------------------------+-----------------------+----------------------------------------------------+-----------------------------+------------------------+------------------------+-----------------------+---------------------------+----------------------+--------------------------+-----------------------------+-------------------------+------------------------+--------------------------------+----------------------------------+-------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------------------------+
... View more
10-27-2023
12:07 AM
The error message "Unknown HS2 problem when communicating with Thrift server" typically indicates that there is an issue when trying to communicate with the Hive Server 2 (HS2) through its Thrift interface. This error can occur for various reasons, and troubleshooting it may require some investigation. Here are some common steps to help resolve this issue: Check Hive Server Status: Ensure that the Hive Server 2 is up and running and that it's reachable from your client. You can check its status and logs to see if there are any errors or issues reported. Network Connectivity: Verify that there are no network-related issues that might be preventing your client from connecting to the Hive Server. Check firewalls, network configurations, and any potential network interruptions. Hive Configuration: Review the Hive server's configuration to ensure it's correctly set up. Pay attention to security configurations, like authentication and authorization settings. Thrift Protocol Version: Ensure that the Thrift protocol version used by your client matches the version supported by the Hive Server. Mismatched protocol versions can lead to communication problems. Client-Side Issues: Check the client application or code you are using to interact with Hive. Ensure that it's properly configured and making the correct requests to the Hive Server. Logs and Error Messages: Examine the logs and error messages in more detail to get specific information about what might be causing the problem. This can help pinpoint the issue. Server Version Compatibility: Ensure that the client and server components (Hive client and Hive Server) are compatible in terms of versions. Incompatible versions can lead to communication issues. Authentication and Authorization: If your Hive server is configured with authentication and authorization, ensure that you have the necessary permissions and credentials to access it. Load and Resource Constraints: Check if the Hive Server is under heavy load or if there are resource constraints that might be affecting its ability to respond to client requests. Driver and Libraries: Ensure that you are using the correct driver or libraries for your client application. If you're using JDBC or ODBC, make sure the corresponding driver is installed and configured correctly. If you continue to face issues after performing these checks, it may be necessary to provide more specific error messages or details about your environment to diagnose the problem further.
... View more