Member since: 03-06-2020
Posts: 366
Kudos Received: 40
Solutions: 31

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 223 | 03-19-2024 09:23 AM
 | 261 | 03-04-2024 05:33 AM
 | 282 | 03-04-2024 03:36 AM
 | 201 | 02-20-2024 11:30 PM
 | 365 | 02-06-2024 06:13 AM
03-04-2024
03:01 AM
1 Kudo
@muneeralnajdi The issue you're encountering with the Hive external table, where queries fail when using COUNT(*) or WHERE clauses, suggests the custom input format is not actually being used during query execution. A plain SELECT * can succeed through Hive's simple fetch path, while COUNT(*) and filtered queries launch a distributed job; if that job falls back to the default input format, it fails reading the mixed files.

1. Ensure the custom input format is used: Verify that the custom input format (CustomAvroContainerInputFormat) is correctly configured and loaded in the Hive environment. Confirm that the JAR containing the custom input format class is added to the Hive session or cluster, and that there are no errors or warnings during the JAR loading process.
2. Check the table properties: Ensure the custom input format class is correctly specified in the table definition (INPUTFORMAT), with no typos or syntax errors.
3. Test with basic queries: Start with SELECT * to confirm the custom input format can read the Avro files (you indicated this part works). If basic queries succeed but aggregations or filters fail, the input format may be incompatible with certain Hive execution paths.
4. Consider alternative approaches: If troubleshooting the custom input format does not resolve the issue, consider pre-processing the data to separate Avro and JSON files into different directories or partitions, or use other techniques such as external scripts or a custom SerDe to handle different file formats within the same directory.

Regards, Chethan YM
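A hedged sketch of steps 1 and 2. The class name comes from the question; the JAR path, table name, location, and schema URL are placeholders to adapt. The `hive.input.format` override is a commonly suggested workaround when split combining bypasses a per-table input format, not a confirmed fix for this case:

```sql
-- Make the JAR with the custom class visible to the Hive session (path is hypothetical).
ADD JAR hdfs:///user/hive/aux-jars/custom-avro-input.jar;

-- Some execution paths combine splits via CombineHiveInputFormat, which can
-- bypass a per-table custom input format. Forcing plain HiveInputFormat is
-- worth trying if SELECT * works but COUNT(*)/WHERE queries fail.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Table definition naming the custom input format explicitly
-- (package prefix and schema URL are assumptions).
CREATE EXTERNAL TABLE mixed_events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT  'com.example.hive.CustomAvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/mixed-events'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/event.avsc');
```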
03-04-2024
02:31 AM
1 Kudo
@yagoaparecidoti To estimate how much data the new DataNode 7 will receive after an HDFS rebalance, we need to consider the current data distribution across the existing DataNodes and how the rebalancing algorithm will redistribute it.

- Even data distribution: The rebalancing process aims to achieve an even distribution of data blocks across all DataNodes in the cluster. This means HDFS will attempt to redistribute existing data blocks among all DataNodes, including the new DataNode 7, to balance storage utilization.
- Redistribution strategy: HDFS analyzes the current data distribution and determines an optimal redistribution strategy. This may involve moving some data blocks from existing DataNodes to DataNode 7, but it's unlikely that all data from all existing DataNodes will be moved to the new DataNode.
- Optimization and efficiency: HDFS aims to minimize data movement and achieve a balanced state with minimal disruption. The rebalancing algorithm considers factors such as network bandwidth, disk I/O, and cluster performance when choosing the most efficient redistribution strategy.

Given these considerations, it's difficult to provide an exact estimate without knowing the specific cluster configuration and balancer settings. However, DataNode 7 will likely receive a portion of the existing data blocks from the other DataNodes to help achieve a balanced distribution across the cluster. Regards, Chethan YM
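A rough back-of-the-envelope sketch, not the actual balancer algorithm: it assumes the balancer only moves blocks until every node is within a threshold (default 10%) of the cluster-average utilization, so the empty newcomer is filled at most to roughly (average − threshold) of its capacity:

```python
# Illustrative estimate only; the real HDFS balancer also weighs bandwidth,
# block placement policy, and rack topology.

def estimate_incoming_bytes(nodes, new_node_capacity, threshold=0.10):
    """nodes: list of (capacity_bytes, used_bytes) for the existing DataNodes."""
    total_capacity = sum(c for c, _ in nodes) + new_node_capacity
    total_used = sum(u for _, u in nodes)
    avg_util = total_used / total_capacity          # cluster-average utilization
    # The new node starts empty; balancing stops once it reaches roughly
    # (avg_util - threshold) * capacity, since it is then "within threshold".
    target_util = max(avg_util - threshold, 0.0)
    return target_util * new_node_capacity

# Example: six existing 10 TB nodes at 80% full, one empty 10 TB newcomer.
TB = 10**12
existing = [(10 * TB, 8 * TB)] * 6
incoming = estimate_incoming_bytes(existing, 10 * TB)
print(incoming / TB)  # roughly 5.9 TB under these assumptions
```

This is only a lower-bound style heuristic; with a smaller `-threshold` passed to `hdfs balancer`, the newcomer would receive proportionally more data.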
03-04-2024
02:14 AM
1 Kudo
@BrianChan
1. Cluster average utilization calculation: The cluster average utilization during HDFS rebalancing is typically calculated from the configured capacity of the cluster, i.e. the total storage capacity allocated to HDFS in the cluster's configuration settings.
2. Individual utilization calculation: Individual utilization during rebalancing is usually calculated from the sum of DFS used and remaining space for each DataNode. This gives an accurate picture of how much storage each DataNode is currently using and how much space remains for additional data.
3. Difference in file moving size: The gap between the initially reported file moving size and the actual size in the balancer log can arise from several factors: changes in data distribution across DataNodes while the rebalance runs, optimizations performed by the balancer algorithm, or adjustments made based on real-time cluster conditions and performance.
4. Exceeding the DataNode balancing bandwidth: Although the balancing bandwidth setting limits how much data each DataNode transfers per second during rebalancing, actual consumption can exceed this limit in some circumstances. Network congestion, variations in transfer rates, or balancer-side optimizations can all push bandwidth usage past the configured value.

Regards, Chethan YM
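The two utilization figures from points 1 and 2 can be sketched numerically. This is a simplified model (the formulas shown are the ones described above, with made-up byte counts), used here only to show how the balancer classifies over-utilized nodes against the average plus a threshold:

```python
# Simplified model of balancer utilization math; numbers are invented.

def utilizations(datanodes):
    """datanodes: dict of name -> (dfs_used_bytes, remaining_bytes)."""
    total_used = sum(u for u, _ in datanodes.values())
    total_capacity = sum(u + r for u, r in datanodes.values())
    cluster_avg = total_used / total_capacity                  # point 1
    per_node = {n: u / (u + r) for n, (u, r) in datanodes.items()}  # point 2
    return cluster_avg, per_node

nodes = {"dn1": (80, 20), "dn2": (40, 60), "dn3": (30, 70)}
avg, per = utilizations(nodes)
# Nodes more than 10 percentage points above the average become balancer
# sources; nodes more than 10 points below become destinations.
over = [n for n, u in per.items() if u > avg + 0.10]
under = [n for n, u in per.items() if u < avg - 0.10]
```

Under this model `dn1` (80% full against a 50% average) is the only source, which is why the balancer's initially reported moving size is just an estimate: the sets of sources and destinations are recomputed as iterations proceed.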
03-04-2024
02:06 AM
1 Kudo
@Shivakuk When you replace a disk in an HDFS cluster, especially if it's a DataNode disk, the Hadoop system should handle data replication and rebalancing automatically. This means that once the new disk is added and the DataNode is back online, HDFS will redistribute the data across the cluster to maintain the configured replication factor. If data was wiped during or after the disk replacement process, it's critical to investigate why this occurred and take measures to prevent data loss in the future. Ensure that proper backup and recovery procedures are in place, and consider implementing data mirroring or replication to minimize the risk of data loss due to hardware failures. Regards, Chethan YM
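To confirm that re-replication completed after the disk replacement, these stock Hadoop CLI commands can be run (output locations vary by distribution; this is a suggested check, not a required procedure):

```shell
# Overall block health; a healthy cluster reports 0 missing and
# 0 under-replicated blocks once re-replication has caught up.
hdfs fsck /

# Per-DataNode configured capacity and DFS-used figures, to confirm the
# replaced disk's volume is registered and receiving blocks again.
hdfs dfsadmin -report
```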
02-21-2024
01:44 AM
1 Kudo
Hi, The error message you've provided indicates a problem with the agent's ability to send heartbeats to the master. This can occur for various reasons, such as network issues, firewall settings, or misconfiguration.

- Check that the master server is up and reachable from the agent host.
- Check the network and firewall settings on both systems.
- Re-check the agent configuration files for the correct hostnames, port numbers, etc.

Regards, Chethan YM
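The checks above can be run from the agent host roughly like this. The hostname is a placeholder, and 7182 is the default Cloudera Manager agent heartbeat port; substitute your own values:

```shell
# 1. Is the master reachable at all?
ping -c 3 cm-server.example.com

# 2. Is the heartbeat port open through any firewalls? (7182 is the default.)
nc -vz cm-server.example.com 7182

# 3. Does the agent config point at the right server host/port?
grep -E 'server_host|server_port' /etc/cloudera-scm-agent/config.ini

# 4. Restart the agent and watch its log for heartbeat errors.
systemctl restart cloudera-scm-agent
tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log
```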
02-20-2024
11:30 PM
Hi @Timo , In Apache Hadoop, the directories where HDFS DataNodes and YARN NodeManagers store their data and logs are configured with the "dfs.datanode.data.dir" and "yarn.nodemanager.local-dirs" properties respectively. To prevent DataNodes and NodeManagers from writing to the root-vg volume when disks fail, ensure these properties point only to directories on healthy disks or storage volumes.

-> Configure HDFS DataNode data directories: Set "dfs.datanode.data.dir" in "hdfs-site.xml" to the directories where DataNodes should store their data, listing only directories on healthy disks or storage volumes.

-> Configure YARN NodeManager local directories: Set "yarn.nodemanager.local-dirs" in "yarn-site.xml" to the directories where NodeManagers should store their local data and logs. Again, make sure these directories are on healthy disks or storage volumes.

Regards, Chethan YM
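A sketch of those two settings; the `/data/diskN` mount points are hypothetical and should be replaced with your actual data-disk mounts. The failed-volumes property is a standard HDFS setting that lets a DataNode keep running (on its remaining listed disks, never the root volume) when one disk dies:

```xml
<!-- hdfs-site.xml: list only dedicated data-disk mounts, never root-vg paths -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk1/dfs/dn,/data/disk2/dfs/dn</value>
</property>
<!-- Tolerate one failed volume instead of shutting the DataNode down -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>

<!-- yarn-site.xml: same idea for NodeManager local data -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/disk1/yarn/local,/data/disk2/yarn/local</value>
</property>
```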
02-06-2024
06:13 AM
Hi @Sokka , I think it's possible, can you try the below? (Note: a <join> is only valid with a matching <fork>, so the second action transitions straight back to the decision node here.)

"""
<workflow-app name="Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-04f4"/>
    <kill name="Kill">
        <message>Error performing the action. Error message [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <!-- First Hive action to read data -->
    <action name="hive-04f4" cred="hive2">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <jdbc-url>jdbc:hive2://host:10000/default</jdbc-url>
            <script>${wf:appPath()}/hive-04f4.sql</script>
            <!-- Set output property to be used in the next action -->
            <capture-output/>
        </hive2>
        <ok to="loop-decision"/>
        <error to="Kill"/>
    </action>
    <!-- Decision node to determine whether to execute the next action -->
    <decision name="loop-decision">
        <switch>
            <!-- If output is not null, execute the next action -->
            <case to="hive-1c24">${wf:actionData('hive-04f4')['output'] != null}</case>
            <!-- If output is null, end the workflow -->
            <default to="End"/>
        </switch>
    </decision>
    <!-- Second Hive action to write data -->
    <action name="hive-1c24" cred="hive2">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <jdbc-url>jdbc:hive2://host:10000/default</jdbc-url>
            <script>${wf:appPath()}/hive-1c24.sql</script>
            <!-- Use output from the previous action as an input parameter -->
            <param>input=${wf:actionData('hive-04f4')['output']}</param>
        </hive2>
        <ok to="loop-decision"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
"""

The <decision> node (loop-decision) contains a <switch> with a single <case> that checks whether the output of the first Hive action (hive-04f4) is not null. If it is not null, the workflow proceeds to the second Hive action (hive-1c24); if it is null, the <default> path is taken and the workflow ends. Regards, Chethan YM
11-23-2023
02:52 AM
Hello @MinhTruong
> Did you expand the timeframe on the right side of the screen?
> From where did you run the query, impala-shell or Hue?
> When it failed, what was the error?
Regards, Chethan YM
09-13-2023
07:02 AM
Hello, Do you have any other concerns about the above response? Have you tried it to confirm? Regards, Chethan YM
09-13-2023
07:00 AM
Hello @hebamahmoud If your issue has been resolved by any of the above responses, could you accept it as a solution? Regards, Chethan YM