Member since: 09-15-2015
Posts: 457
Kudos Received: 507
Solutions: 90
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 15641 | 11-01-2016 08:16 AM
 | 11073 | 11-01-2016 07:45 AM
 | 8526 | 10-25-2016 09:50 AM
 | 1911 | 10-21-2016 03:50 AM
 | 3790 | 10-14-2016 03:12 PM
11-13-2015
01:27 PM
5 Kudos
Under-replicated blocks will be prioritized, queued and replicated according to the logic in UnderReplicatedBlocks.java: "Keep prioritized queues of under replicated blocks.
Blocks have replication priority, with priority QUEUE_HIGHEST_PRIORITY indicating the highest priority.
Having a prioritised queue allows the BlockManager to select which blocks to replicate first - it tries to give priority to data that is most at risk or considered most valuable." The following method assigns the priority of an under-replicated block:
/** Return the priority of a block
 * @param block an under-replicated block
 * @param curReplicas current number of replicas of the block
 * @param decommissionedReplicas number of decommissioned replicas of the block
 * @param expectedReplicas expected number of replicas of the block
 * @return the priority for the blocks, between 0 and ({@link #LEVEL}-1)
 */
private int getPriority(Block block,
                        int curReplicas,
                        int decommissionedReplicas,
                        int expectedReplicas) {
  assert curReplicas >= 0 : "Negative replicas!";
  if (curReplicas >= expectedReplicas) {
    // Block has enough copies, but not enough racks
    return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
  } else if (curReplicas == 0) {
    // If there are zero non-decommissioned replicas but there are
    // some decommissioned replicas, then assign them highest priority
    if (decommissionedReplicas > 0) {
      return QUEUE_HIGHEST_PRIORITY;
    }
    // all we have are corrupt blocks
    return QUEUE_WITH_CORRUPT_BLOCKS;
  } else if (curReplicas == 1) {
    // only one replica - risk of loss, highest priority
    return QUEUE_HIGHEST_PRIORITY;
  } else if ((curReplicas * 3) < expectedReplicas) {
    // there are less than a third as many replicas as requested;
    // this is considered very under-replicated
    return QUEUE_VERY_UNDER_REPLICATED;
  } else {
    // add to the normal queue for under-replicated blocks
    return QUEUE_UNDER_REPLICATED;
  }
}
The queues are ordered as follows:
- QUEUE_HIGHEST_PRIORITY: the blocks that must be replicated first, i.e. blocks with only one copy, or blocks with zero live copies but a copy on a node being decommissioned. These blocks are at risk of loss if the disk or server on which they remain fails.
- QUEUE_VERY_UNDER_REPLICATED: blocks that are very under-replicated compared to their expected replication factor. Currently that means the ratio of actual to expected replicas is less than 1:3. These blocks may not be at immediate risk, but they are clearly considered "important".
- QUEUE_UNDER_REPLICATED: blocks that are also under-replicated, but whose actual:expected ratio is good enough that they do not need to go into the QUEUE_VERY_UNDER_REPLICATED queue.
- QUEUE_REPLICAS_BADLY_DISTRIBUTED: there are at least as many copies of a block as required, but they are not adequately distributed across racks. Loss of a rack/switch could take all copies offline.
- QUEUE_WITH_CORRUPT_BLOCKS: blocks that are corrupt and for which no non-corrupt copies are currently available. The policy here is to keep those corrupt blocks replicated, but to give blocks that are not corrupt higher priority.
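As a rough, standalone illustration of how the replica counts map to these queues (this is a minimal sketch, not the actual HDFS class; the queue constants are re-declared locally just for the example):

// Standalone sketch of the same decision logic, for illustration only.
public class BlockPriorityDemo {
  static final int QUEUE_HIGHEST_PRIORITY = 0;
  static final int QUEUE_VERY_UNDER_REPLICATED = 1;
  static final int QUEUE_UNDER_REPLICATED = 2;
  static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3;
  static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

  static int getPriority(int curReplicas, int decommissionedReplicas, int expectedReplicas) {
    if (curReplicas >= expectedReplicas) {
      return QUEUE_REPLICAS_BADLY_DISTRIBUTED;   // enough copies, but badly distributed
    } else if (curReplicas == 0) {
      return decommissionedReplicas > 0
          ? QUEUE_HIGHEST_PRIORITY               // only copies live on decommissioning nodes
          : QUEUE_WITH_CORRUPT_BLOCKS;           // all remaining copies are corrupt
    } else if (curReplicas == 1) {
      return QUEUE_HIGHEST_PRIORITY;             // a single live copy - highest risk
    } else if ((curReplicas * 3) < expectedReplicas) {
      return QUEUE_VERY_UNDER_REPLICATED;        // less than a third of the requested copies
    } else {
      return QUEUE_UNDER_REPLICATED;             // mildly under-replicated
    }
  }

  public static void main(String[] args) {
    // 1 live copy out of 3 expected -> QUEUE_HIGHEST_PRIORITY (0)
    System.out.println(getPriority(1, 0, 3));
    // 3 live copies out of 10 expected -> QUEUE_VERY_UNDER_REPLICATED (1)
    System.out.println(getPriority(3, 0, 10));
    // 0 live copies, 1 decommissioned copy -> QUEUE_HIGHEST_PRIORITY (0)
    System.out.println(getPriority(0, 1, 3));
  }
}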
11-12-2015
01:36 PM
Honestly, I was thinking of a data ingestion node and a temp folder 🙂 Maybe the NFS Gateway would be an option; however, it's not really made for lots of large files, and I still have to consider network failures.
11-12-2015
07:04 AM
Team decided on SFTP for now. We'll look into Nifi for the prod. system, so I am definitely looking forward to file chunking, node affinity, etc. for Nifi.
11-11-2015
03:15 PM
Thanks a lot for the input. I agree Nifi is a great fit for this use case and brings a lot of features out of the box. Thanks for filing the Jiras regarding Nifi resume and node affinity 🙂
11-11-2015
10:00 AM
1 Kudo
Your beeline command is fine and should work. Could you please check your Namenode and Hive logs to see if there are any Kerberos-related issues? I have seen clusters with a green status in Ambari whose log files were nonetheless full of Kerberos authentication failures.
11-11-2015
09:23 AM
Hi, I am looking for a list of tools that can be used to transfer "large" amounts of data (1 TB++; file sizes usually around 10-20 GB, mostly CSV) from different machines in the company's network into HDFS. Sometimes the data storage is far away; let's say we need to transfer data from Europe to the US. How do these tools handle network failures and other errors? What are the options and what are their drawbacks (e.g. bottlenecks with copyFromLocal, etc.)?
- Distcp?
- Nifi?
- SFTP / copyFromLocal?
- Flume?
Direct vs. indirect ingestion? (storage -> edge -> HDFS vs. storage -> HDFS) I'd push local data (meaning within one data center) directly to HDFS, but in other cases provide an edge node for data ingestion. How does Nifi handle network failures? Is there something like FTP's resume method? What are your experiences? Thanks! Jonas
11-11-2015
06:20 AM
2 Kudos
You are using a POST request; I think that is why it's not working - try PUT. I usually follow these steps when I install my cluster with blueprints:
1. Upload blueprint
2. Update repositories
3. Create cluster (upload host mapping)
To update the repositories you can use the following commands.
HDP-UTILS:
curl -H "X-Requested-By: ambari" -X PUT -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/stacks/HDP/versions/2.2/operating_systems/redhat6/repositories/HDP-UTILS-1.1.0.20 -d @repo_payload
Payload:
{
  "Repositories": {
    "base_url": "http://c6601.ambari.apache.org/HDP-UTILS-1.1.0.20/repos/centos6",
    "verify_base_url": true
  }
}
HDP:
curl -H "X-Requested-By: ambari" -X PUT -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/stacks/HDP/versions/2.2/operating_systems/redhat6/repositories/HDP-2.2 -d @repo_payload
Payload:
{
  "Repositories": {
    "base_url": "http://c6601.ambari.apache.org/HDP/centos6/2.x/updates/2.2.4.2",
    "verify_base_url": true
  }
}
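For completeness, steps 1 and 3 look roughly like this (a sketch against the standard Ambari blueprint endpoints; the blueprint name, cluster name, and payload file names are placeholders):

# 1. Upload the blueprint (blueprint name and file are placeholders)
curl -H "X-Requested-By: ambari" -X POST -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/blueprints/my-blueprint -d @blueprint.json

# 3. Create the cluster by posting the host mapping that references the blueprint
curl -H "X-Requested-By: ambari" -X POST -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/clusters/mycluster -d @hostmapping.json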
11-10-2015
08:59 PM
1 Kudo
Great article. Thanks for sharing 🙂
11-09-2015
10:24 PM
2 Kudos
It sounds like you want to pull out a lot of data, so I would definitely not use Ranger's REST API (I am not even sure Ranger allows audit export via the REST API). If you are using DB audits, which we do not recommend for production systems, you can connect Tableau directly to your DB (check the supported drivers => http://www.tableau.com/support/drivers). Personally I'd go with one of the following solutions:
A) Enable HDFS audit logs (a good idea in general), put a Hive table on top of the audit logs and use Tableau's Hive connector to retrieve and visualize/analyze the data. (Check this out => http://kb.tableau.com/articles/knowledgebase/hadoop-hive-connection)
B) If you are using SolrCloud, you could query the Solr index and export the relevant data into a format that is supported by Tableau (this could be done by Nifi, etc.). Unfortunately there is no Tableau Solr driver yet, as far as I know.
11-09-2015
06:08 PM
Check out my answer in this post => http://community.hortonworks.com/questions/2421/ambari-append-to-env-templates-eg-hadoop-env-clean.html#answer-2440 Let me know if you have any questions 🙂