Member since: 09-21-2015
Posts: 31
Kudos Received: 59
Solutions: 9
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2609 | 06-01-2016 12:10 PM
 | 4841 | 03-08-2016 06:19 PM
 | 2396 | 01-19-2016 06:18 PM
 | 2023 | 12-15-2015 03:18 PM
 | 4581 | 12-03-2015 10:53 PM
12-15-2015
03:18 PM
1 Kudo
It appears you cannot resolve mirrorlist.centos.org via DNS from your virtual machine. Does the following return a result?

nslookup mirrorlist.centos.org

If not, I expect you have configured the VM with a Host-Only adapter, which will not allow the VM to access the internet.
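If the lookup fails, one way to confirm the adapter theory from the host is sketched below; this assumes the VM runs under VirtualBox ("MyVM" is a placeholder name), so adjust for your hypervisor.

# List the VM's network adapters and their attachment type (NAT, Bridged, Host-Only)
VBoxManage showvminfo "MyVM" | grep -i nic

Switching the adapter to NAT or Bridged (and restarting the VM) should give the VM internet access again.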
12-12-2015
07:48 AM
2 Kudos
Here is the mini cluster project: hadoop-mini-clusters. Here is Dhruv's testing project: iot-integration-tester.
12-03-2015
11:02 PM
4 Kudos
FWIW, XFS is the default filesystem in RHEL 7, so I expect an uptick in new clusters using it.
12-03-2015
10:53 PM
3 Kudos
Hello Mike,

Check that /tmp is not mounted with the noexec flag on that node:

sudo mount | grep /tmp

If so, remounting without that option should fix this. If removing noexec isn't an option, you can control the directory Java uses for temporary storage through the java.io.tmpdir system property. Give the following a try, replacing the directory with your home directory or another filesystem without the noexec flag:

hbase -Djava.io.tmpdir=/some/other/writable/directory shell
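For completeness, a minimal sketch of the remount approach (assuming you are able to change mount options on that node; update /etc/fstab as well if the change should survive a reboot):

# Remount /tmp with exec enabled for the current boot
sudo mount -o remount,exec /tmp

# Verify the noexec flag is gone
mount | grep /tmp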
12-03-2015
08:52 PM
4 Kudos
DefaultResourceCalculator only takes memory into account. Here is a brief explanation of what you are seeing (relevant part bolded).

Pluggable resource-vector in YARN scheduler

The CapacityScheduler has the concept of a ResourceCalculator – a pluggable layer that is used for carrying out the math of allocations by looking at all the identified resources. This includes utilities to help make decisions such as: Does this node have enough resources of each resource-type to satisfy this request? How many containers can I fit on this node? How should a list of nodes with varying available resources be sorted?

There are two kinds of calculators currently available in YARN – the DefaultResourceCalculator and the DominantResourceCalculator. **The DefaultResourceCalculator only takes memory into account when doing its calculations.** This is why CPU requirements are ignored when carrying out allocations in the CapacityScheduler by default. All the math of allocations is reduced to just examining the memory required by resource-requests and the memory available on the node that is being looked at during a specific scheduling-cycle.

You can find more on this topic on our blog: managing-cpu-resources-in-your-hadoop-yarn-clusters
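As a quick way to see which calculator a cluster is using, the following sketch greps the CapacityScheduler configuration; the /etc/hadoop/conf path is an assumption based on a typical HDP layout.

# Show the configured ResourceCalculator; if the property is absent or set to
# DefaultResourceCalculator, only memory is considered during allocation.
grep -A 1 'yarn.scheduler.capacity.resource-calculator' \
  /etc/hadoop/conf/capacity-scheduler.xml

Setting that property to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator makes the CapacityScheduler take CPU (vcores) into account as well.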
11-09-2015
04:01 PM
1 Kudo
I don't necessarily agree with this answer. We could avoid needing to change ownership by leveraging proxy users. I hope to find time to write a patch to demonstrate this. I'd also be interested in how many clusters are actually Kerberos-enabled; I expect it's lower than you think. Data ownership does matter and provides at least rudimentary controls when the user does not or cannot enable Kerberos.
11-05-2015
02:20 PM
When writing data to HDFS with the PutHDFS NiFi processor, the data is owned by "anonymous". I'm trying to find a good way to control the ownership of data landed via this processor. I looked into Remote Owner and Remote Group; however, those require that the NiFi server is running as the "hdfs" user, which seems like a bad idea to me. I'm curious why this processor doesn't leverage Hadoop Proxy Users rather than requiring that the NiFi server run as hdfs. Any other workarounds? My initial thought was to stage the data in HDFS with NiFi and use Falcon to move it to its final location; however, this seems like overkill for users who simply want to ingest the data directly into its final location. Am I missing something obvious here?
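For reference, Hadoop-side impersonation (proxy users) is driven by the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups properties in core-site.xml. The sketch below checks for them; the "nifi" service account is an assumption for illustration, not an existing PutHDFS capability.

# Check whether a hypothetical "nifi" proxy user is already configured
hdfs getconf -confKey hadoop.proxyuser.nifi.hosts
hdfs getconf -confKey hadoop.proxyuser.nifi.groups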
11-03-2015
11:56 PM
1 Kudo
A demo article has been added here: creating-hbase-hfiles-from-an-existing-hive-table
11-03-2015
11:53 PM
10 Kudos
Hive HBase Generate HFiles

Demo scripts are available at: https://github.com/sakserv/hive-hbase-generatehfiles

Below is an example of leveraging the Hive HBaseStorageHandler for HFile generation. This pattern provides a means of taking data already stored in Hive, exporting it as HFiles, and bulk loading the HBase table from those HFiles.

Overview

The HFile generation feature was added in HIVE-6473. It adds the following properties, which are then leveraged by the Hive HBaseStorageHandler:

hive.hbase.generatehfiles - set to true to generate HFiles
hfile.family.path - the path in HDFS where the HFiles are written

Note that for hfile.family.path, the final subdirectory MUST match the column family name. The scripts in the repo called out above can be used with the Hortonworks Sandbox to test and demo this feature.

Example

The following is an example of how to use this feature. The scripts in the repo above implement the steps below. It is assumed that the user already has data stored in a Hive table; for the sake of this example, the following table was used:

CREATE EXTERNAL TABLE passwd_orc(userid STRING, uid INT, shell STRING)
STORED AS ORC
LOCATION '/tmp/passwd_orc';

First, decide on the HBase table and column family name. We want to use a single column family. For the example below, the HBase table name is "passwd_hbase" and the column family name is "passwd". Below is the DDL for the HBase table created through Hive. A couple of notes:

userid is the row key; :key is special syntax in hbase.columns.mapping
each column (qualifier) is mapped in the form column family:column (qualifier)

CREATE TABLE passwd_hbase(userid STRING, uid INT, shell STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,passwd:uid,passwd:shell');

Next, generate the HFiles for the table. Again, note that hfile.family.path is where the HFiles will be generated, and the final subdirectory name MUST match the column family name.

SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/passwd_hfiles/passwd;
INSERT OVERWRITE TABLE passwd_hbase SELECT DISTINCT userid, uid, shell FROM passwd_orc CLUSTER BY userid;

Finally, load the HFiles into the HBase table:

export HADOOP_CLASSPATH=`hbase classpath`
yarn jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/passwd_hfiles passwd_hbase

The data can now be queried from Hive or HBase.
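As a quick sanity check after the bulk load (a sketch; the table and column family names follow the example above):

# Scan a few rows directly from HBase
echo "scan 'passwd_hbase', {LIMIT => 5}" | hbase shell

# Or query the same table through the Hive storage handler
hive -e "SELECT * FROM passwd_hbase LIMIT 5;"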
11-03-2015
11:48 PM
This shows promise as well. I plan to give this a try soon. However, the accepted answer avoids needing to go from ORC back to CSV, so it gets the win. 🙂