Member since: 09-15-2015
Posts: 457
Kudos Received: 507
Solutions: 90
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 15641 | 11-01-2016 08:16 AM
 | 11073 | 11-01-2016 07:45 AM
 | 8526 | 10-25-2016 09:50 AM
 | 1911 | 10-21-2016 03:50 AM
 | 3790 | 10-14-2016 03:12 PM
11-13-2015
01:27 PM
5 Kudos
Under-replicated blocks will be prioritized, queued and replicated according to the logic in UnderReplicatedBlocks.java: "Keep prioritized queues of under replicated blocks.
Blocks have replication priority, with priority QUEUE_HIGHEST_PRIORITY indicating the highest priority.
Having a prioritised queue allows the BlockManager to select which blocks to replicate first - it tries to give priority to data that is most at risk or considered most valuable." The following method assigns the priority of an under-replicated block:
/** Return the priority of a block
 * @param block an under-replicated block
 * @param curReplicas current number of replicas of the block
 * @param decommissionedReplicas number of decommissioned replicas of the block
 * @param expectedReplicas expected number of replicas of the block
 * @return the priority for the blocks, between 0 and ({@link #LEVEL}-1)
 */
private int getPriority(Block block,
                        int curReplicas,
                        int decommissionedReplicas,
                        int expectedReplicas) {
  assert curReplicas >= 0 : "Negative replicas!";
  if (curReplicas >= expectedReplicas) {
    // Block has enough copies, but not enough racks
    return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
  } else if (curReplicas == 0) {
    // If there are zero non-decommissioned replicas but there are
    // some decommissioned replicas, then assign them highest priority
    if (decommissionedReplicas > 0) {
      return QUEUE_HIGHEST_PRIORITY;
    }
    // all we have are corrupt blocks
    return QUEUE_WITH_CORRUPT_BLOCKS;
  } else if (curReplicas == 1) {
    // only one replica - risk of loss, highest priority
    return QUEUE_HIGHEST_PRIORITY;
  } else if ((curReplicas * 3) < expectedReplicas) {
    // there are less than a third as many replicas as requested;
    // this is considered very under-replicated
    return QUEUE_VERY_UNDER_REPLICATED;
  } else {
    // add to the normal queue for under-replicated blocks
    return QUEUE_UNDER_REPLICATED;
  }
}
The queues are ordered as follows:
- QUEUE_HIGHEST_PRIORITY: the blocks that must be replicated first, i.e. blocks with only one copy, or blocks with zero live copies but a copy on a node being decommissioned. These blocks are at risk of loss if the disk or server on which they remain fails.
- QUEUE_VERY_UNDER_REPLICATED: blocks that are very under-replicated compared to their expected replication factor. Currently that means the ratio of actual to expected replicas is less than 1:3. These blocks may not be at immediate risk, but they are clearly considered "important".
- QUEUE_UNDER_REPLICATED: blocks that are also under-replicated, but whose actual:expected ratio is good enough that they do not need to go into the QUEUE_VERY_UNDER_REPLICATED queue.
- QUEUE_REPLICAS_BADLY_DISTRIBUTED: there are at least as many copies of a block as required, but they are not adequately distributed across racks. Loss of a rack/switch could take all copies offline.
- QUEUE_WITH_CORRUPT_BLOCKS: blocks that are corrupt and for which no non-corrupt copies are currently available. The policy here is to keep those corrupt blocks replicated, but to give blocks that are not corrupt higher priority.
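As a rough, standalone illustration of how the replica counts map to these queues (this is a minimal sketch, not the actual HDFS class; the queue constants are re-declared locally just for the example):

// Standalone sketch of the same decision logic, for illustration only.
public class BlockPriorityDemo {
  static final int QUEUE_HIGHEST_PRIORITY = 0;
  static final int QUEUE_VERY_UNDER_REPLICATED = 1;
  static final int QUEUE_UNDER_REPLICATED = 2;
  static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3;
  static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

  static int getPriority(int curReplicas, int decommissionedReplicas, int expectedReplicas) {
    if (curReplicas >= expectedReplicas) {
      return QUEUE_REPLICAS_BADLY_DISTRIBUTED;   // enough copies, but badly distributed
    } else if (curReplicas == 0) {
      return decommissionedReplicas > 0
          ? QUEUE_HIGHEST_PRIORITY               // only copies live on decommissioning nodes
          : QUEUE_WITH_CORRUPT_BLOCKS;           // all remaining copies are corrupt
    } else if (curReplicas == 1) {
      return QUEUE_HIGHEST_PRIORITY;             // a single live copy - highest risk
    } else if ((curReplicas * 3) < expectedReplicas) {
      return QUEUE_VERY_UNDER_REPLICATED;        // less than a third of the requested copies
    } else {
      return QUEUE_UNDER_REPLICATED;             // mildly under-replicated
    }
  }

  public static void main(String[] args) {
    // 1 live copy out of 3 expected -> QUEUE_HIGHEST_PRIORITY (0)
    System.out.println(getPriority(1, 0, 3));
    // 3 live copies out of 10 expected -> QUEUE_VERY_UNDER_REPLICATED (1)
    System.out.println(getPriority(3, 0, 10));
    // 0 live copies, 1 decommissioned copy -> QUEUE_HIGHEST_PRIORITY (0)
    System.out.println(getPriority(0, 1, 3));
  }
}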
11-12-2015
01:36 PM
Honestly, I was thinking of a data ingestion node and a temp folder 🙂 Maybe the NFS Gateway would be an option; however, it's not really made for lots of large files, and I still have to consider network failures.
11-12-2015
07:04 AM
Team decided on SFTP for now. We'll look into Nifi for the prod. system, so I am definitely looking forward to file chunking, node affinity, etc. for Nifi.
11-11-2015
03:15 PM
Thanks a lot for the input. I agree Nifi is a great fit for this use case and brings a lot of features out of the box. Thanks for filing the Jiras regarding Nifi resume and node affinity 🙂
11-11-2015
10:00 AM
1 Kudo
Your beeline command is fine and should work. Could you please check your Namenode and Hive logs to see if there are any Kerberos-related issues? I have seen clusters with a green status in Ambari whose log files were nonetheless full of Kerberos authentication failures.
11-11-2015
09:23 AM
Hi, I am looking for a list of tools that can be used to transfer "large" amounts of data (1 TB++; file sizes usually around 10-20 GB, mostly CSV) from different machines in the company's network into HDFS. Sometimes the data storage is far away; let's say we need to transfer data from Europe to the US. How do these tools handle network failures and other errors? What are the options and what are their drawbacks (e.g. bottlenecks with copyFromLocal, etc.)?
- Distcp?
- Nifi?
- SFTP / copyFromLocal?
- Flume?
Direct vs. indirect ingestion? (storage -> edge -> HDFS vs. storage -> HDFS) I'd push local data (meaning within one data center) directly to HDFS, but in other cases provide an edge node for data ingestion. How does Nifi handle network failures? Is there something like FTP's resume method? What are your experiences? Thanks! Jonas
11-11-2015
06:20 AM
2 Kudos
You are using a POST request; I think that is why it's not working - try PUT. I usually follow these steps when I install my cluster with blueprints:
1. Upload blueprint
2. Update repositories
3. Create cluster (upload host mapping)
To update the repositories you can use the following commands.
HDP-UTILS:
curl -H "X-Requested-By: ambari" -X PUT -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/stacks/HDP/versions/2.2/operating_systems/redhat6/repositories/HDP-UTILS-1.1.0.20 -d @repo_payload
Payload:
{
  "Repositories": {
    "base_url": "http://c6601.ambari.apache.org/HDP-UTILS-1.1.0.20/repos/centos6",
    "verify_base_url": true
  }
}
HDP:
curl -H "X-Requested-By: ambari" -X PUT -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/stacks/HDP/versions/2.2/operating_systems/redhat6/repositories/HDP-2.2 -d @repo_payload
Payload:
{
  "Repositories": {
    "base_url": "http://c6601.ambari.apache.org/HDP/centos6/2.x/updates/2.2.4.2",
    "verify_base_url": true
  }
}
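For completeness, steps 1 and 3 look roughly like this (a sketch against the standard Ambari blueprint endpoints; the blueprint name, cluster name, and payload file names are placeholders):

# 1. Upload the blueprint (blueprint name and file are placeholders)
curl -H "X-Requested-By: ambari" -X POST -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/blueprints/my-blueprint -d @blueprint.json

# 3. Create the cluster by posting the host mapping that references the blueprint
curl -H "X-Requested-By: ambari" -X POST -u admin:<PASSWORD> http://c6601.ambari.apache.org:8080/api/v1/clusters/mycluster -d @hostmapping.json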
11-10-2015
08:59 PM
1 Kudo
Great article. Thanks for sharing 🙂
11-09-2015
10:24 PM
2 Kudos
It sounds like you want to pull out a lot of data, so I would definitely not use Ranger's REST API (I am not even sure Ranger allows audit export via the REST API). If you are using DB audits, which we do not recommend for production systems, you can connect Tableau directly to your DB (check the supported drivers => http://www.tableau.com/support/drivers). Personally I'd go with one of the following solutions:
A) Enable HDFS audit logs (a good idea in general), put a Hive table on top of the audit logs and use Tableau's Hive connector to retrieve and visualize/analyze the data. (Check this out => http://kb.tableau.com/articles/knowledgebase/hadoop-hive-connection)
B) If you are using SolrCloud, you could query the Solr index and export the relevant data into a format that is supported by Tableau (this could be done by Nifi, etc.). Unfortunately there is no Tableau Solr driver yet, as far as I know.
11-09-2015
06:08 PM
Check out my answer in this post => http://community.hortonworks.com/questions/2421/ambari-append-to-env-templates-eg-hadoop-env-clean.html#answer-2440 Let me know if you have any questions 🙂