Member since
09-29-2015
286
Posts
601
Kudos Received
60
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11454 | 03-21-2017 07:34 PM |
| | 2882 | 11-16-2016 04:18 AM |
| | 1608 | 10-18-2016 03:57 PM |
| | 4265 | 09-12-2016 03:36 PM |
| | 6213 | 08-25-2016 09:01 PM |
10-23-2015
12:47 AM
6 Kudos
Microsoft Azure Sizing Details (http://azure.microsoft.com/en-us/pricing/details/virtual-machines/#Linux)
You need to size and price the machines and the storage separately. Do not use A8 machines; use A10 or A11 instead. A8 is backed by InfiniBand, which is more expensive and unnecessary for Hadoop. The D series is recommended if you need solid-state drives. Either option will need attached Blob Storage: the 382 GB local disk that comes with the VM is only for temporary storage. Blob Storage disks come in 1023 GB sizes, and each VM size has a maximum number of data disks that can be attached. For example, A10 VMs can attach at most 16 x 1023 GB. See the following for more details:
https://msdn.microsoft.com/library/azure/dn197896.aspx
https://azure.microsoft.com/en-us/documentation/articles/cloud-services-sizes-specs/
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-size-specs/
According to Microsoft, Page Blob Storage is recommended over Block Blob Storage (see http://azure.microsoft.com/en-us/pricing/details/storage/). If performance is a must, especially with Kafka and Storm, use Premium storage rather than Standard. UPDATED: this has been converted into an article with more up-to-date information; see https://community.hortonworks.com/articles/22376/recommendations-for-microsoft-azure-hdp-deployment-1.html
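To make the disk math concrete, here is a quick back-of-envelope estimate (it assumes the 16 x 1023 GB per-A10-VM limit above and HDFS replication factor 3; the node count is a placeholder):
# Rough raw and usable capacity for A10 worker nodes
DISKS_PER_VM=16
DISK_GB=1023
NODES=10   # placeholder cluster size
RAW_GB=$((DISKS_PER_VM * DISK_GB * NODES))
echo "Raw attached Blob Storage: ${RAW_GB} GB"
echo "Approximate usable HDFS capacity at replication factor 3: $((RAW_GB / 3)) GB"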
10-23-2015
12:39 AM
5 Kudos
How to set up a cluster in AWS? What type of storage is supported for HDFS? EBS? EMR?
EBS is supported, and it is recommended mainly for mission-critical data, that is, data that must be (mostly) always available. You can use ephemeral storage, which will be faster, but if a node goes down you won't be able to restore that data, and since AWS (and other cloud providers) have been known to lose entire regions, you can and will lose your whole cluster. EBS volumes become available again when the region comes back online; ephemeral storage does not. However, EBS is also very pricey and you may not want to pay for that option. Another option is to use ephemeral storage but set up backup routines to S3 so you can restore to a point in time (a DistCp sketch follows after the storage tips below); you can also use EBS and still back up to S3. The main reason EBS is often not recommended for HDFS is cost, but it is supported. For HBase workloads, use i2 instances. Only use d2 instances for storage-density workloads (with sequential reads); they give you a lot of locally attached storage and the throughput is quite good.
Other storage tips:
hs1.8xlarge for Hadoop with ephemeral storage.
i2 for HBase.
d2.8xlarge for compute-intensive HBase plus data-intensive storage.
EBS is very expensive and its scaling is not so linear; it depends on how many storage array fabrics you mesh to under the covers.
The instance (ephemeral) storage on AWS would only be used for DataNode HDFS, so the loss of an instance is less of a concern, and it also gives much better performance.
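For the backup-to-S3 approach mentioned above, a minimal sketch using DistCp (the bucket name and paths are placeholders; it assumes the s3a connector and credentials are configured):
# Dated copy of a critical HDFS path to S3 so you can restore to a point in time
SNAPSHOT_DATE=$(date +%Y-%m-%d)
hadoop distcp -update hdfs:///data/critical s3a://my-backup-bucket/hdfs-backups/critical/${SNAPSHOT_DATE}
# Restore by reversing source and target, e.g.:
# hadoop distcp s3a://my-backup-bucket/hdfs-backups/critical/2015-10-20 hdfs:///data/critical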
10-22-2015
06:07 PM
Good note. Unfortunately, SQL Developer does not recognize the generic Apache Hive JDBC driver. Also, if you need to add special connection properties for SSL, Kerberos, or LDAP authentication, SQL Developer will not work.
Use SQL Workbench/J, RazorSQL, or SQuirreL SQL instead.
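For those clients, illustrative HiveServer2 JDBC URLs with the extra properties (the hostname, principal, and truststore path are placeholders):
# Kerberos-authenticated connection
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"
# SSL-enabled connection
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;ssl=true;sslTrustStore=/path/to/truststore.jks;trustStorePassword=changeit"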
10-22-2015
03:02 PM
2 Kudos
Sometimes when running Hive on Tez queries such as "select * from table", large output files are created that may swamp your local disk.
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex failed, vertexName=Map 1, vertexId=vertex_1444941691373_0130_1_00, diagnostics=[Task failed, taskId=task_1444941691373_0130_1_00_000007, diagnostics=[TaskAttempt 1 failed,
info=[Error: Failure while running task:org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1444941691373_0130_1_00_000007_1_10003_0/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:402)......], ....
Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:106, Vertex vertex_1444941691373_0130_1_01 [Map 3] killed/failed due to:null] DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2 SQLState: 08S01 ErrorCode: 2
This may indicate that your disk is filling up. Check where your yarn.nodemanager.local-dirs parameter is pointing and increase the disk space there.
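A quick way to confirm where the local dirs point and how full they are (the /hadoop/yarn/local path below is only the common HDP default; substitute your configured value):
# Locate the configured YARN local dirs and check free space
grep -A1 yarn.nodemanager.local-dirs /etc/hadoop/conf/yarn-site.xml
df -h /hadoop/yarn/local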
General tuning tips
General tuning done when debugging errors where too few, or too many, mappers are being run:
Adjust the input split size to vary the amount of data fed to the mappers, e.g. mapreduce.input.fileinputformat.split.minsize=67108864 and mapreduce.input.fileinputformat.split.maxsize=671088640 (see the session-level sketch after this list).
Change tez.grouping.max-size to a lower number to get more mappers.
(Credit: Terry Padget) Read "How Tez Initial Parallelism Works" and adjust tez.grouping.min-size, tez.grouping.max-size, and tez.grouping.split-count accordingly.
The number of map tasks for a Hadoop job is typically controlled by the input data size and the split size. There is some overhead to starting and stopping each map task, so performance suffers if a job creates a large number of map tasks and most or all of them run for only a few seconds. To reduce the number of map tasks for a Hadoop job, complete one or more of the following steps:
Increase the block size. In HDP, the default value for the dfs.block.size parameter is 128 MB. Typically, each map task processes one block, or 128 MB. If your map tasks have very short durations, you can speed up your Hadoop jobs by using a larger block size and fewer map tasks. For best performance, ensure that the dfs.block.size value matches the block size of the data that is processed on the distributed file system (DFS).
Assign each mapper more data to process. If your input consists of many small files, Hadoop jobs likely generate one map task per small file, regardless of the dfs.block.size value. This causes many map tasks, with each mapper doing very little work. Combine the small files into larger files and have the map tasks process those, resulting in fewer map tasks doing more work.
Raise the minimum split size. The mapreduce.input.fileinputformat.split.minsize parameter in mapred-site.xml specifies the minimum data input size that a map task processes. The default value is 0. Assign this parameter a value close to the dfs.block.size value and, as necessary, repeatedly double it until you are satisfied with the MapReduce behavior and performance. Note: override mapreduce.input.fileinputformat.split.minsize as needed on individual Hadoop jobs; changing the default to something other than 0 can have unintended consequences for your other Hadoop jobs.
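Putting the above together, a minimal sketch of applying these settings at the session level for a single Hive-on-Tez query (the values and table name are illustrative, not tuned recommendations):
# Session-level split and grouping settings; adjust to your data sizes
hive -e "
set mapreduce.input.fileinputformat.split.minsize=67108864;
set mapreduce.input.fileinputformat.split.maxsize=671088640;
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;
select count(*) from my_table;
"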
10-21-2015
10:14 PM
4 Kudos
The KeySecure key management platform has different integration mechanisms: its own Network Attached Encryption (NAE) API and the OASIS-standard Key Management Interoperability Protocol (KMIP) API, each of which can be used directly and/or optionally fronted with SOAP or REST web service interfaces. Voltage offers an alternative KMS to Ranger KMS, and Voltage KMS also works with HDFS encryption. Voltage KMS uses stateless key management, but it can also work with a Hardware Security Module (HSM) such as SafeNet. SafeNet is a hardware security module; Ranger KMS would have to be configured with a proxy to store the Encryption Zone Keys (EZKs) in SafeNet instead of a database, and Voltage KMS is the only solution for this so far. Long and short: Voltage is an alternative KMS to Ranger KMS, while SafeNet cannot be used as a direct alternative to Ranger KMS because it is an HSM and would need proxy software or a KMS in between.
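Whichever KMS ends up backing HDFS encryption, the encryption zone keys are created and attached to a zone in the same way; a minimal sketch (the key name and path are placeholders):
# Create an EZK in the configured KMS
hadoop key create myEZKey -size 256
# Create an empty directory and mark it as an encryption zone using that key
hdfs dfs -mkdir -p /secure/zone1
hdfs crypto -createZone -keyName myEZKey -path /secure/zone1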
10-21-2015
10:04 PM
Can I use Voltage or SafeNet/KeySecure as the key management solution for the encryption zone keys needed for Transparent Data Encryption?
10-21-2015
09:17 PM
3 Kudos
This usually happens when an old repo was installed first and was not cleaned up before trying to install Ambari 2.1.2. You need to clean everything and reinstall; because of the leftover old repos, some Python scripts will be old and others will be missing. Clean the repos:
yum repolist | grep ambari
yum clean all
yum clean dbcache
yum clean metadata
yum makecache
rpm --rebuilddb
yum history new
If there are repos other than Ambari 2.1.2 (or whichever version you want), remove them:
ambari-server stop
ambari-server reset
ambari-agent stop
rm -rf /etc/yum.repos.d/ambari.repo
yum erase ambari-server and/or yum erase ambari-agent
Then get the right repo and download it on all nodes (see the Ambari documentation for your OS). For Ambari 2.1.2 on CentOS 6:
wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.2/ambari.repo -O /etc/yum.repos.d/ambari.repo
Then follow the instructions in "Completely Clean and Reinstall Ambari" to complete the cleanup.
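As a final sanity check before reinstalling (not part of the original steps), it may help to confirm that only the intended repo is visible and that yum will pull the expected version:
# Verify the enabled Ambari repo and the version yum will install
yum repolist enabled | grep -i ambari
yum info ambari-server | grep -i -E 'version|repo'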
10-21-2015
09:00 PM
1 Kudo
I'm getting the issue below while installing services using Ambari 2.1.2:
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 87, in action_create
raise Fail("Applying %s failed, parent directory %s doesn't exist" % (self.resource, dirname))
resource_management.core.exceptions.Fail: Applying File['/var/lib/ambari-agent/tmp/changeUid.sh'] failed, parent directory /var/lib/ambari-agent/tmp doesn't exist
Error: Error: Unable to run the custom hook script ['/usr/bin/python2.6', '/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-992.json', '/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-992.json', 'INFO', '/var/lib/ambari-agent/tmp']
Note that /var/lib/ambari-agent/tmp does exist.
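For reference, a few checks that may be worth running on the failing host, since ownership, free space, or SELinux can make an existing directory unusable to the agent (suggested diagnostics, not from the original post):
# Check ownership/permissions, free space, and SELinux mode for the agent tmp directory
ls -ld /var/lib/ambari-agent /var/lib/ambari-agent/tmp
df -h /var/lib/ambari-agent
getenforce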
Labels:
- Apache Ambari
10-21-2015
07:25 PM
1 Kudo
This capability encrypts the intermediate files generated during the merge and shuffle phases. It is enabled by setting the mapreduce.job.encrypted-intermediate-data job property to true.
The relevant properties and their defaults from mapred-default.xml are shown below (override them in mapred-site.xml or per job):
<property>
<name>mapreduce.job.encrypted-intermediate-data</name>
<value>false</value>
<description>Encrypt intermediate MapReduce spill files or not
default is false</description>
</property>
<property>
<name>mapreduce.job.encrypted-intermediate-data-key-size-bits</name>
<value>128</value>
<description>Mapreduce encrypt data key size default is 128</description>
</property>
<property>
<name>mapreduce.job.encrypted-intermediate-data.buffer.kb</name>
<value>128</value>
<description>Buffer size for intermediate encrypt data in kb
default is 128</description>
</property>
NOTE: Currently, enabling encrypted intermediate data spills restricts the number of attempts for the job to 1. It is only available in MR2.
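Alternatively, a minimal sketch of enabling it for a single job from the command line rather than cluster-wide (the example jar path and input/output paths are placeholders and assume an HDP layout):
# Enable encrypted intermediate spills for just this job
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount \
  -D mapreduce.job.encrypted-intermediate-data=true \
  /tmp/wc-in /tmp/wc-out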
10-21-2015
06:57 PM
2 Kudos
There is a recommendation that local disks should be encrypted for intermediate data; I need more information on this. Why is this so? How do we propose to encrypt the disks?
Is this because Tez stores intermediate data on local disks?
MapReduce also stores data on local disk via the "mapreduce.cluster.local.dir" parameter, so that has to be encrypted too, right? What are the best practices for encrypting the local disks used for intermediate data, and what manual effort is involved?
Is Hadoop Encrypted Shuffle enough?
Labels:
- Cloudera Navigator Encrypt