Member since 09-21-2015 | 31 Posts | 59 Kudos Received | 9 Solutions
03-09-2017
03:56 PM
2 Kudos
OVERVIEW

Docker supplies multiple storage drivers to manage the mutable and immutable layers of images and containers. Many options exist, each with its own pros and cons. Out of the box, docker uses devicemapper with loop-lvm. The loop-lvm storage driver is not recommended for production, but requires zero setup to use. When attempting to increase the base size of the mutable layer, it was observed that docker client operations slow down. The alternative of using smaller base sizes causes failures due to out-of-storage conditions. The intent of this article is to outline the testing that was performed to determine sane defaults for the docker storage driver options.

TESTING

The following testing methodology was used:
1. Build the centos6 image with different combinations of base sizes and storage drivers (build)
2. Create a container from the image (run)
3. Stop the container (stop)
4. Remove the container (rm)
5. Stop docker
6. Delete/reprovision the docker graph storage location
7. Repeat

The following scenarios were tested:
- loop-lvm (xfs)
- direct-lvm (ext4)
- direct-lvm (xfs)
- btrfs
- zfs
- overlay
- aufs

The following base sizes were tested:
- 25GB
- 50GB
- 100GB
- 250GB

The following container operation counts were tested:
- 1
- 10
- 25
- 50
- 100

The tests were run on the following hardware:
- Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, 4 cores
- 12GB memory
- 1x 1TB SATA (OS + Docker)

OS details:
- CentOS 7.2.1511
- Kernel: 3.10.0-327.4.5.el7.x86_64
- docker 1.9.1

Due to docker issue 17653, cgroupfs must be used instead of systemd as the cgroup driver on CentOS 7.2:
--exec-opt native.cgroupdriver=cgroupfs

LOOP-LVM

Notes: loop-lvm requires no up-front storage configuration and uses /var/lib/docker by default. In these tests, the docker cache directory was reconfigured to use a separate XFS mount on a SATA drive.

Example setup (optional if the OS disk is on SATA):

mkdir -p /docker/loop-xfs # path to the filesystem on SATA
Docker command:

/usr/bin/docker daemon --graph=/docker/loop-xfs \
  --storage-driver=devicemapper \
  --storage-opt dm.basesize=${BASESIZE}G \
  --storage-opt dm.loopdatasize=5000G \
  --storage-opt dm.loopmetadatasize=1000GB
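As a quick check (paths from the setup above; docker info field names may vary slightly by version), docker info reports the active driver and, for loop-lvm, the sparse loop files that live under the graph directory:

docker info | grep -iE 'storage driver|loop file|data space|metadata space'   # driver, loop file paths, and space usage
ls -lh /docker/loop-xfs/devicemapper/devicemapper/                            # the data and metadata sparse files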
DIRECT-LVM

Notes: direct-lvm requires that a logical volume (or volumes) be provisioned on the docker daemon node. The logical volume is then converted to a thinpool so that docker images and containers can be provisioned with minimal storage usage. The docker-storage-setup script typically handles the logical volume setup for RHEL/CentOS when installing from the EPEL yum repos. However, when installing from the main docker repo to get the latest version of docker, this script is not included. The docker-storage-setup script is not strictly required, as the underlying LVM commands and docker configuration can be extracted from it. The instructions below do not include auto-expansion of the logical volumes, which is an additional feature supported by docker-storage-setup. The direct-lvm approach allows for using ext4 or xfs; both were tested.

Example setup:

pvcreate -ffy /dev/sda4
vgcreate vg-docker /dev/sda4
lvcreate -L 209708s -n docker-poolmeta vg-docker
lvcreate -l 60%FREE -n docker-pool vg-docker <<< "y"
lvconvert -y --zero n -c 512K --thinpool vg-docker/docker-pool --poolmetadata vg-docker/docker-poolmeta
Docker command (ext4):

/usr/bin/docker daemon --storage-driver=devicemapper \
  --storage-opt dm.basesize=${BASESIZE}G \
  --storage-opt dm.thinpooldev=/dev/mapper/vg--docker-docker--pool \
  --storage-opt dm.fs=ext4
Docker command (xfs):

/usr/bin/docker daemon --storage-driver=devicemapper \
  --storage-opt dm.basesize=${BASESIZE}G \
  --storage-opt dm.thinpooldev=/dev/mapper/vg--docker-docker--pool \
  --storage-opt dm.fs=xfs
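A quick way to confirm the thinpool is wired up correctly (volume group and pool names from the setup above; docker info output varies slightly by version):

lvs -a vg-docker                                        # docker-pool should appear as a thin pool with its tdata/tmeta volumes
docker info | grep -iE 'pool name|backing filesystem'   # Pool Name should report vg--docker-docker--pool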
BTRFS

Notes: The docker btrfs option requires a btrfs filesystem, which has mixed support depending on the OS distribution. Note that btrfs does not honor the dm.basesize setting. Each image and container is represented as a btrfs subvolume. As a result, the usable storage for docker is the total amount of storage available in the btrfs filesystem.

Example setup:

yum install -y btrfs-progs
modprobe btrfs
mkfs.btrfs -f /dev/sda4
mkdir -p /docker/btrfs
mount /dev/sda4 /docker/btrfs
Docker command:

/usr/bin/docker daemon --graph /docker/btrfs --storage-driver=btrfs
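To verify the filesystem and see the per-image/container subvolumes docker creates (paths from the setup above):

btrfs filesystem show /docker/btrfs          # confirm the filesystem backing the graph directory
btrfs subvolume list /docker/btrfs | head    # image layers and containers show up as subvolumes
docker info | grep -i 'storage driver'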
ZFS

Notes: The docker zfs storage driver requires a zfs zpool to be created and mounted on the partition or disk where docker data should be stored. Snapshots (read-only) and clones (read-write) are used to manage the images and containers. zfs does not honor, or even allow, the dm.basesize setting. As a result, the usable storage for docker is the total available space in the zpool. Running zfs on RHEL/CentOS requires installing an unsigned kernel module. On modern PCs this is a problem, as modprobe will fail when UEFI Secure Boot is enabled. Secure Boot MUST be disabled via the UEFI or BIOS menu, depending on the system board manufacturer.

Example setup:
yum -y localinstall --nogpgcheck https://download.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
yum -y localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm
yum -y install kernel-devel zfs
modprobe zfs
mkdir -p /docker/zfs
zpool destroy -f zpool-docker
zpool create -f zpool-docker /dev/sda4
zfs create -o mountpoint=/docker/zfs zpool-docker/docker
Docker command:

/usr/bin/docker daemon --graph=/docker/zfs \
  --storage-driver=zfs
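To confirm the zpool and the datasets docker creates (names from the setup above):

zpool status zpool-docker                              # pool health and backing device
zfs list -r zpool-docker                               # docker creates a dataset per layer/container under zpool-docker/docker
docker info | grep -iE 'storage driver|parent dataset'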
OVERLAYFS

OverlayFS is a modern union filesystem that is similar to AUFS. It is layered on top of an existing filesystem such as ext4 or xfs. OverlayFS promises to be fast, but currently cannot be used with RPM on RHEL/CentOS 6 images or hosts. This issue is fixed in yum-utils-1.1.31-33.el7; however, it requires that all images be upgraded to the RHEL/CentOS 7.2 base image. Originally, OverlayFS was tested, but not a single image could be successfully built using either ext4 or xfs as the backing filesystem, so no results are available from that round of testing to prove its speed. Additional testing will be conducted in the future when image upgrades are feasible. OverlayFS also exhibits abnormally high inode usage, so increasing the number of inodes on the backing filesystem is necessary.

As a follow up, OverlayFS now functions properly with RHEL/CentOS 7.2 based images. However, it was discovered that it does not honor the base size or the graph storage location. The instructions below and the tools have been updated to reflect these discoveries.

Example setup:

modprobe overlay
# create the backing filesystem with extra inodes
mkfs -t ext4 -N 131072000 /dev/sda4
rm -rf /var/lib/docker/*
mount /dev/sda4 /var/lib/docker
Docker command:

/usr/bin/docker daemon --storage-driver=overlay \
  --storage-opt dm.fs=ext4
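Given the inode concerns called out above, it is worth checking inode headroom on the backing filesystem and confirming the driver once the daemon is up:

df -i /var/lib/docker                                      # IFree should stay comfortably high as images are built
docker info | grep -iE 'storage driver|backing filesystem'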
AUFS

AUFS is the original union filesystem used with docker. It is no longer recommended for production. AUFS requires a custom-built kernel with AUFS support. As a result, this option was not tested. OverlayFS is being touted as the replacement.

RESULTS

This section contains the results of the testing described previously in this document.

BASE SIZE AND DRIVER IMPACT ON BUILD TIMES

Build times are erratic, making it difficult to truly assess the impact of the various base size and driver combinations. Median values were used in an attempt to normalize results. The following graph shows the build times of the base size and driver combinations. Note that ZFS and BTRFS do not honor the base size parameter, therefore the size listed is for the entire backing filesystem. Summary: BTRFS was consistently faster than all other drivers, but is not yet recommended for production. Direct-LVM leveraging ext4 provided the most flexibility with minimal impact from base size and is supported in production.

DRIVER TYPE IMPACT ON OPERATIONS

After building the image, the next steps were to run, stop, and remove a container based on the base image. Below are the results of those actions using a 250GB base size. The following drills down into each of the operation types to show the relative differences between storage drivers. Very little impact was found for all of the direct filesystem based approaches. However, when using the loop-lvm xfs backed driver, stop times were considerably higher. This aligns with the problem statement that the loop-lvm approach is slower at larger base sizes.

BASE SIZE IMPACT ON OPERATIONS

The following is a breakdown of the impact on operations as the base size increases, for storage drivers that support supplying a base size. btrfs and zfs were not tested, as base size is not honored by those drivers.

loop-lvm xfs: As seen below, base size has a direct impact on the amount of time needed to stop a container. No other operations are impacted by the base size.

direct-lvm ext4: Increasing the base size does not significantly impact direct filesystem approaches. Below outlines the operation times across base image sizes for the direct-lvm ext4 approach.

direct-lvm xfs: Increasing the base size does not significantly impact direct filesystem approaches. Below outlines the operation times across base image sizes for the direct-lvm xfs approach.

PARALLEL CONTAINER OPERATIONS IMPACT

It is possible to execute docker run, stop, and remove operations in parallel; however, very little benefit is gained by doing so, and it complicates scheduling of containers. The exception is the stop operation, which is responsible for the bulk of the time needed to deprovision containers. Running the stop operation in parallel will reduce the overall time needed to run, stop, and remove containers (a minimal sketch appears at the end of this section).

OVERLAYFS RESULTS

OverlayFS was compared to the currently recommended storage driver, LVM Direct ext4.

BASE SIZE AND DRIVER IMPACT ON BUILD TIMES

As seen below, OverlayFS is nearly twice as fast for build operations as LVM Direct ext4. As previously mentioned, build times are erratic due to all of the downloads required; however, OverlayFS consistently beat the closest competitor.

PARALLEL CONTAINER OPERATIONS IMPACT

OverlayFS is faster at most operations, but only marginally. A breakdown of each operation follows to show the relative difference between OverlayFS and LVM Direct ext4.
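The parallel stop called out above can be approximated with a one-liner; the parallelism of 8 and the 10 second timeout are illustrative values, not ones taken from the tests:

docker ps -q | xargs -r -P 8 -n 1 docker stop -t 10   # stop all running containers in parallel
docker ps -aq | xargs -r docker rm                    # then remove them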
SUMMARY

Below is a summary of the pros and cons of each of the storage drivers tested:
Loop-LVM XFS
Pros:
- No configuration required
- Decent performance at small base sizes

Cons:
- Poor performance at larger base sizes
- Not recommended for production

Direct-LVM Ext4

Pros:
- More performant than xfs at build, run, and stop operations
- Consistent performance for all tested base sizes

Cons:
- Requires dedicated storage, as LVM logical volumes
- Slightly slower than xfs for remove operations

Direct-LVM XFS

Pros:
- More performant than ext4 at remove operations
- Consistent performance for all tested base sizes

Cons:
- Requires dedicated storage, as LVM logical volumes
- Slower than ext4 at build, run, and stop operations

btrfs

Pros:
- No need to manage base size; docker can use all the space in the filesystem
- Most performant for build operations

Cons:
- Not recommended for production
- Requires dedicated storage, as a btrfs filesystem

zfs

Pros:
- No need to manage base size; docker can use all the space in the zpool

Cons:
- Not recommended for production
- Requires disabling UEFI Secure Boot at the system level
- Requires dedicated storage, as a zfs filesystem

overlayfs

Pros:
- Claims to be fast and efficient
- The "modern" union filesystem

Cons:
- Not yet production ready
- Not supported with RPM + CentOS 6 (could not properly test due to this issue)
- Potential fix available for RHEL/CentOS 7.2+ images

AUFS

Pros:
- The original

Cons:
- Requires a custom kernel (could not properly test due to this issue)
Tags: Design & Architecture, docker, FAQ
03-08-2017
04:03 PM
This could occur if you have an overloaded NodeManager and the liveness monitor expiration has occurred. Are you seeing any NodeManagers in a LOST state? What does resource consumption look like on your NodeManagers when this occurs?
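If it helps, the state of each NodeManager (including LOST ones) can be checked from the command line:

yarn node -list -all    # lists every NodeManager and its state (RUNNING, LOST, UNHEALTHY, etc.)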
06-01-2016
06:15 PM
You are correct: use LVM for OS disks, but not data disks. In the end, the filesystem choice doesn't make a huge difference. ext4 everywhere would simplify the overall design and allow for the ability to resize filesystems online in the future. Allocating a larger amount of storage to the OS filesystems does simplify the install. Otherwise, during the Ambari install wizard, you need to go through each service's configuration and change "/var/log" to one of the data disk mount points (e.g. /opt/dev/sdb in the example above). If you allocated more storage to the OS (and subsequently made /usr, say, 30GB and /var/log 200GB), you would not have to change as much during the Ambari install. Either approach is viable, so I would suggest discussing with your OS admin team to see if they have a preference. Also note that I'm referring to daemon logs (namenode, resource manager, etc.) that end up in /var/log, versus application logs. The yarn settings you show above are for the yarn application logs and local scratch space. You want to follow that same pattern in production.
06-01-2016
12:10 PM
3 Kudos
The HDP documentation around filesystem selection is outdated; ext4 and XFS are both fine choices today. You can use LVM for the OS filesystems. This provides a nice way to shuffle space around on your 2x 300GB OS drives as needed. XFS is perfectly fine here, so you can let RHEL use the default. However, note that XFS filesystems cannot be shrunk, whereas with LVM + ext4, filesystems can be expanded and shrunk while online. This is a big gap for XFS. For the datanode disks, do not use RAID or LVM. You want each individual disk mounted as a separate filesystem. You then provide HDFS with a comma-separated list of mount points, and HDFS will handle spreading data and load across the disks. If you have 24 data disks per node, you should have 24 filesystems configured in HDFS (a sketch follows below). XFS is a good choice here, since resizing is unlikely to come into play. Also keep in mind that /var/log and /usr have specific needs. /var/log can grow to hundreds of GBs, so moving this logging to one of the data disks may be necessary. The HDP binaries are installed to /usr/hdp and, depending on which components you are installing, could use as much as 6GB per HDP release. Keep this in mind, as sufficient space is needed here for upgrades. Hope that helps.
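As a rough sketch of the per-disk layout described above (device names and mount points are illustrative, not a recommendation for your specific hardware):

# one filesystem per data disk, no RAID/LVM
mkfs.xfs -f /dev/sdb
mkdir -p /grid/0
mount /dev/sdb /grid/0
echo '/dev/sdb /grid/0 xfs defaults,noatime 0 0' >> /etc/fstab
# repeat for each disk, then point dfs.datanode.data.dir at the comma-separated list, e.g.:
#   /grid/0/hadoop/hdfs/data,/grid/1/hadoop/hdfs/data,...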
05-23-2016
07:52 PM
One point of clarification: the Secondary NameNode is not used for High Availability. It was poorly named and only provides checkpointing capabilities. You need to enable NameNode HA (which replaces the Secondary NameNode with a Standby NameNode) for failover to work. Ambari has a wizard to assist in enabling NameNode HA. Once NameNode HA is enabled, jobs will continue if the active NameNode fails.
05-21-2016
11:18 PM
3 Kudos
When installing HDB/HAWQ on Sandbox, it is necessary to relocate the default Ambari postgres database to a postgres instance running on a different port. The following script performs the move in a mostly automated fashion. When prompted by ambari-server setup, select option 4 for the database configuration and fill in the details. Note that this is only intended for Sandbox. Please do not use in production.

#!/usr/bin/env bash
#
# Change as needed
#
PGPORT=12346
PGDATA=/var/lib/pgsql/ambari
AMBARI_WEB_USER=admin
AMBARI_WEB_PW=admin
AMBARI_DB_NAME=ambari
AMBARI_DB_USER=ambari
AMBARI_DB_PW=bigdata
#
# Variables
#
PG_INIT_PATH=/etc/init.d/postgresql
DB_BKUP_DIR=/tmp/ambari-db-backup
AMBARI_PROPS=/etc/ambari-server/conf/ambari.properties
#
# Main
#
echo -e "\n#### Stopping ambari-server"
ambari-server stop
echo -e "\n#### Creating the pgpass file"
echo "*:*:*:$AMBARI_DB_USER:$AMBARI_DB_PW" >> $HOME/.pgpass
chmod 600 $HOME/.pgpass
echo -e "\n#### Creating database backup directory"
if [ -d $DB_BKUP_DIR ]; then
rm -rf $DB_BKUP_DIR
fi
mkdir -p $DB_BKUP_DIR
chmod 777 $DB_BKUP_DIR
echo -e "\n#### Backing up ambari-server databases"
pg_dump -U $AMBARI_DB_USER -w -f $DB_BKUP_DIR/ambari.sql $AMBARI_DB_NAME
echo -e "\n#### Attempting to stop postgres on port $PGPORT, if running"
service postgresql.${PGPORT} stop
echo -e "\n#### Setting up new postgres data directory"
if [ -d $PGDATA ]; then
rm -rf $PGDATA
fi
mkdir -p $PGDATA
chown postgres:postgres $PGDATA
echo -e "\n#### Creating new init script"
sed -e 's|^PGPORT=.*|PGPORT='$PGPORT'|g' -e 's|^PGDATA=.*|PGDATA='$PGDATA'|g' $PG_INIT_PATH > ${PG_INIT_PATH}.${PGPORT}
chmod 775 ${PG_INIT_PATH}.${PGPORT}
echo -e "\n#### Initializing new postgres instance on port $PGPORT"
service postgresql.${PGPORT} initdb
echo -e "\n#### Modify postgres config to listen on all interfaces"
sed -i "s|^#\?listen_addresses.*|listen_addresses = '*'|g" $PGDATA/postgresql.conf
echo -e "\n#### Copy existing pg_hba.conf"
cp /var/lib/pgsql/data/pg_hba.conf $PGDATA/pg_hba.conf
echo -e "\n#### Starting new postgres instance on port $PGPORT"
service postgresql.${PGPORT} start
echo -e "\n#### Creating the ambari db"
su - postgres -c "psql -p $PGPORT -c 'CREATE DATABASE ambari;' -d postgres"
echo -e "\n#### Creating the ambari db user role"
su - postgres -c "psql -p $PGPORT -c \"CREATE ROLE $AMBARI_DB_USER LOGIN PASSWORD '$AMBARI_DB_PW';\" -d ambari"
echo -e "\n#### Restoring ambari database backup"
su - postgres -c "psql -p $PGPORT -f $DB_BKUP_DIR/ambari.sql -d ambari"
echo -e "\n#### Updating jdbc config for ambari-server"
grep -v "server.jdbc" $AMBARI_PROPS >${AMBARI_PROPS}.nojdbc
echo "server.jdbc.port=$PGPORT" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.rca.driver=org.postgresql.Driver" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.rca.url=jdbc:postgresql://localhost:${PGPORT}/ambari" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.driver=org.postgresql.Driver" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.user.name=$AMBARI_DB_USER" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.postgres.schema=ambari" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.hostname=localhost" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.rca.user.passwd=/etc/ambari-server/conf/password.dat" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.rca.user.name=$AMBARI_DB_USER" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.url=jdbc:postgresql://localhost:${PGPORT}/ambari" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.user.passwd=/etc/ambari-server/conf/password.dat" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.database=postgres" >> ${AMBARI_PROPS}.nojdbc
echo "server.jdbc.database_name=ambari" >> ${AMBARI_PROPS}.nojdbc
cp ${AMBARI_PROPS}.nojdbc $AMBARI_PROPS
echo -e "\n#### Stopping existing postgres instance"
service postgresql stop
echo -e "\n#### Running ambari-server setup"
ambari-server setup
echo -e "\n#### Starting ambari-server"
service ambari-server start
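After the script completes, a couple of optional sanity checks (the port value comes from the variables at the top of the script):

su - postgres -c "psql -p 12346 -d ambari -c '\dt'" | head   # tables should be present in the relocated database
ambari-server status                                         # confirm ambari-server came up against the new instance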
Tags: Ambari, hdb, Issue Resolution, postgres, Sandbox, Sandbox & Learning
03-30-2016
12:05 PM
8 Kudos
As has been mentioned in this thread, there is no native C# HBase client. However, there are several options for interacting with HBase from C#:

- C# HBase Thrift client - Thrift allows for defining service endpoints and data models in a common format and using code generators to create language-specific bindings. HBase provides a Thrift server and definitions, and there are many examples online for creating a C# HBase Thrift client: hbase-thrift-csharp
- Marlin - Marlin is a C# client for interacting with Stargate (the HBase REST API) that ultimately became hbase-sdk-for-net. I have not personally tested this against HBase 1.x+, but considering it uses Stargate, I expect it should work. If you are planning to use Stargate and implement your own client, which I would recommend over Thrift, make sure to use protobufs to avoid the JSON serialization overhead. Using an HTTP based approach also makes it much easier to load balance requests over multiple gateways.
- Phoenix Query Server - Phoenix is a SQL skin on HBase, and Phoenix Query Server is a REST API for submitting SQL queries to Phoenix. Here is some example code, however, I have not yet tested it: hdinsight-phoenix-sharp
- Simba HBase ODBC Driver - Uses ODBC to connect to HBase. I've heard positive feedback on this approach, especially from tools like Tableau. This is not open source and requires purchasing a license.
- Future: Phoenix ODBC Driver - I've been told a Phoenix ODBC driver is in the works. Unfortunately, no ETA.

What we really need is an Entity Framework or LINQ based framework, as that's how C# developers expect to interact with backend data sources. At one point, a member of the community began developing Linq2Hive, but the project appears to be no more. It may be possible to leverage Linq2Hive and the HBaseStorageHandler, but that seems like a really poor pattern. 🙂 I'm sure there are others, but hopefully this helps.
03-08-2016
06:19 PM
3 Kudos
I expect you just need to add a port forwarding rule to forward 9090 from your host to the VM. Right Click the VM -> Settings -> Network -> Port Forwarding and validate a rule exists for 9090. If that wasn't the issue, check /etc/hosts on your host to ensure there isn't an old sandbox entry in there.
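If you prefer the command line, the same NAT rule can be added with VBoxManage (the VM name shown is an example; use the name reported by VBoxManage list vms):

VBoxManage controlvm "Hortonworks Sandbox" natpf1 "nifi,tcp,,9090,,9090"     # while the VM is running
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "nifi,tcp,,9090,,9090"    # while the VM is powered off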
01-29-2016
08:40 PM
Unfortunately, I have not found any archetype to meet these requirements, but the need is there. The project below seems to meet #4 on the list as a starting point: sparkjava-archetypes
01-19-2016
06:18 PM
1 Kudo
From what I understand, the Eclipse plugin has not been maintained as new versions of Hadoop have been released. It appears that the command to start the DataNode is missing a required argument:

Usage: java DataNode [regular | rollback]
regular : Normal DataNode startup (default).
rollback : Rollback a standard or rolling upgrade.
Refer to HDFS documentation for the difference between standard
and rolling upgrades.

The Apache HDT (Hadoop Development Tools) project had plans to fix this, but unfortunately, it has been retired due to lack of contributions. http://hdt.incubator.apache.org/
One option to consider would be to ditch the Eclipse Plugin and leverage "mini clusters" to provide a similar development experience, but without the need to connect to an external cluster or leverage the Eclipse plugin. https://wiki.apache.org/hadoop/HowToDevelopUnitTests
Another option would be to leverage the hadoop-mini-clusters project that I maintain. It simplifies the use of mini clusters by wrapping them in a common Builder pattern. https://github.com/sakserv/hadoop-mini-clusters
Hope that helps.
12-15-2015
03:18 PM
1 Kudo
It appears you cannot resolve mirrorlist.centos.org via DNS from your virtual machine. Does the following return a result?

nslookup mirrorlist.centos.org

If not, I expect you have configured the VM with a Host-Only adapter, which will not allow the VM to access the internet.
12-12-2015
07:48 AM
2 Kudos
Here is the mini cluster project: hadoop-mini-clusters. Here is Dhruv's testing project: iot-integration-tester
12-03-2015
11:02 PM
4 Kudos
FWIW, XFS is the default in RHEL 7, so I expect an uptick in new clusters.
12-03-2015
10:53 PM
3 Kudos
Hello Mike, check that /tmp is not mounted with the noexec flag on that node:

sudo mount | grep /tmp

If so, remounting without that option should fix this. If removing noexec isn't an option, you can control the directory Java uses for temporary storage through the java.io.tmpdir system property. Give the following a try, replacing the directory with your home directory or another filesystem without the noexec flag:

hbase -Djava.io.tmpdir=/some/other/writable/directory shell
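For reference, a remount that drops noexec until the next reboot (make the matching change in /etc/fstab if it should persist):

sudo mount -o remount,exec /tmp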
12-03-2015
08:52 PM
4 Kudos
DefaultResourceCalculator only takes memory into account. Here is a brief explanation of what you are seeing (the relevant part is the description of the DefaultResourceCalculator below).

Pluggable resource-vector in YARN scheduler

The CapacityScheduler has the concept of a ResourceCalculator – a pluggable layer that is used for carrying out the math of allocations by looking at all the identified resources. This includes utilities to help make the following decisions:
- Does this node have enough resources of each resource-type to satisfy this request?
- How many containers can I fit on this node, when sorting a list of nodes with varying resources available?

There are two kinds of calculators currently available in YARN – the DefaultResourceCalculator and the DominantResourceCalculator. The DefaultResourceCalculator only takes memory into account when doing its calculations. This is why CPU requirements are ignored when carrying out allocations in the CapacityScheduler by default. All the math of allocations is reduced to just examining the memory required by resource-requests and the memory available on the node that is being looked at during a specific scheduling-cycle. You can find more on this topic on our blog: managing-cpu-resources-in-your-hadoop-yarn-clusters
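For reference, the calculator is controlled by the yarn.scheduler.capacity.resource-calculator property in capacity-scheduler.xml (the path below is the usual HDP location; adjust for your layout). Switching it to the DominantResourceCalculator makes vcores count in the allocation math:

grep -A1 'yarn.scheduler.capacity.resource-calculator' /etc/hadoop/conf/capacity-scheduler.xml
# to consider CPU as well as memory, set the value to:
#   org.apache.hadoop.yarn.util.resource.DominantResourceCalculator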
11-09-2015
04:01 PM
1 Kudo
I don't necessarily agree with this answer. We could avoid needing to change ownership through leveraging proxy users. I hope to find time to write a patch to demonstrate this. I'd also be interested in how many clusters are actually kerberos enabled. I expect it's lower than you think. Data ownership does matter and provides at least rudimentary controls when the user does not or can not enable Kerberos.
11-05-2015
02:20 PM
When writing data to HDFS with the PutHDFS NiFi processor, the data is owned by "anonymous". I'm trying to find a good way to control the ownership of data landed via this processor. I looked into Remote Owner and Remote Group, however, those require that the NiFi server is running as the "hdfs" user. This seems like a bad idea to me. I'm curious why this processor doesn't leverage Hadoop Proxy Users, versus enforcing that the NiFi server runs as hdfs? Any other workarounds? My initial thought was to stage the data in HDFS with NiFi and use Falcon to move it to its final location, however, this seems overkill for users that simply want to ingest the data into its final location. Am I missing something obvious here?
11-03-2015
11:56 PM
1 Kudo
Demo article has been added here: creating-hbase-hfiles-from-an-existing-hive-table
11-03-2015
11:53 PM
10 Kudos
Hive HBase Generate HFiles

Demo scripts available at: https://github.com/sakserv/hive-hbase-generatehfiles

Below is an example of leveraging the Hive HBaseStorageHandler for HFile generation. This pattern provides a means of taking data already stored in Hive, exporting it as HFiles, and bulk loading the HBase table from those HFiles.

Overview

The HFile generation feature was added in HIVE-6473. It adds the following properties that are then leveraged by the Hive HBaseStorageHandler:
- hive.hbase.generatehfiles - set to true to generate HFiles
- hfile.family.path - path in HDFS in which to put the HFiles

Note that for hfile.family.path, the final subdirectory MUST MATCH the column family name. The scripts in the repo called out above can be used with the Hortonworks Sandbox to test and demo this feature.

Example

The following is an example of how to use this feature. The scripts in the repo above implement the steps below. It is assumed that the user already has data stored in a Hive table; for the sake of this example, the following table was used:

CREATE EXTERNAL TABLE passwd_orc(userid STRING, uid INT, shell STRING)
STORED AS ORC
LOCATION '/tmp/passwd_orc';
First, decide on the HBase table and column family name. We want to use a single column family. For the example below, the HBase table name is "passwd_hbase" and the column family name is "passwd". Below is the DDL for the HBase table created through Hive. A couple of notes:
- userid is the row key; :key is special syntax in the hbase.columns.mapping
- each column (qualifier) is in the form column family:column (qualifier)

CREATE TABLE passwd_hbase(userid STRING, uid INT, shell STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,passwd:uid,passwd:shell');
Next, generate the HFiles for the table. Couple of notes again:
- hfile.family.path is where the HFiles will be generated
- the final subdirectory name MUST match the column family name

SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/passwd_hfiles/passwd;
INSERT OVERWRITE TABLE passwd_hbase SELECT DISTINCT userid,uid,shell FROM passwd_orc CLUSTER BY userid;
Finally, load the HFiles into the HBase table:

export HADOOP_CLASSPATH=`hbase classpath`
yarn jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/passwd_hfiles passwd_hbase
The data can now be queried from Hive or HBase.
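A couple of quick checks that the HFiles were generated and the bulk load landed the data (paths and names from the example above):

hdfs dfs -ls /tmp/passwd_hfiles/passwd                  # check the generated HFiles (before completebulkload, which moves them)
echo "scan 'passwd_hbase', {LIMIT => 5}" | hbase shell  # sample a few rows from the loaded table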
Tags: bulkload, Data Ingestion & Streaming, HBase, Hive
11-03-2015
11:48 PM
This shows promise as well. I plan to give this a try soon. However, the accepted answer avoids needing to go from ORC back to CSV, so it gets the win. 🙂
11-03-2015
11:47 PM
While I've yet to use this on the large table, it worked very well on a small sample. There were some gotchas that aren't explicitly called out anywhere. I will put together a guide and post it to AH, and link it back here when ready. I've scripted out an example of using this feature here: https://github.com/sakserv/hive-hbase-generatehfiles Thanks!
11-02-2015
08:49 PM
Looking for approaches for loading HBase tables if all I have is the data in an ORC backed Hive table. I would prefer a bulk load approach, given there are several hundred million rows in the ORC backed Hive table. I found the following, anyone have experience with Hive's HBase bulk load feature? Would it be better to create a CSV table and CTAS from ORC into the CSV table, and then use ImportTsv on the HBase side? HiveHBaseBulkLoad Any experiences here would be appreciated.
Labels: Apache HBase, Apache Hive
10-19-2015
06:24 PM
IMO, this doesn't look bad at all. While you could tune the young generation size a bit higher to lessen these, the amount of time spent in GC is pretty low, so it's unlikely to have any impact on long term performance. We'd also need to see entries for the Perm Gen and Old Gen to determine what impact increasing the young gen would have. Let's break it down: This is a Young Generation collection, also known as a minor collection. The total heap used by the young generation hovers around 135MB, which aligns with your setting. The size of the young gen before GC is hovering around 130MB (sometimes less, as heap needs for objects will determine when the GC is needed). After GC, the heap is 1MB, meaning cleanup went very well and most objects are short lived for this application. Ultimately, these collections took on average 5ms (.005 seconds) each, roughly every 8 seconds (8000 ms), which is well under 0.1% of the total run time of the application and perfectly fine.
10-14-2015
06:18 PM
1 Kudo
btw, you can avoid needing to specify the hwx jetty repo by using a single repo group:

<repositories>
<repository>
<id>public.repo.hortonworks.com</id>
<name>Public Hortonworks Maven Repo</name>
<url>http://repo.hortonworks.com/content/groups/public/</url>
</repository>
</repositories>
10-14-2015
06:14 PM
I'm running into something similar with the HDP repos, but not the maven central repos. I haven't found an answer yet but will continue to research; it could be a bug, or it could be the HDP repos. I'll follow up when I have more.
10-14-2015
01:25 PM
5 Kudos
This is a bug in Intellij 14.1 (and many earlier versions). See IDEA-102693 which includes a zip with the fixed maven plugin jars. Replace your intellij jars with those from the zip file. If that doesn't work, take a look at your idea.log (sudo find / -name idea.log to locate it) for any exceptions and research those and/or post your stack trace here.
10-08-2015
07:10 PM
Can you elaborate? Do you see the actual password in the header or the Base64 encoded string? Basic Auth provides no security with regard to user/password. Base64 encoding is used to handle special characters that could invalidate the entire header.
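A quick illustration of that point – the Base64 value in the Authorization header is trivially reversible, so it is encoding, not encryption (the credentials shown are made up):

echo -n 'admin:secret' | base64        # YWRtaW46c2VjcmV0  -> this is what goes into "Authorization: Basic ..."
echo 'YWRtaW46c2VjcmV0' | base64 -d    # admin:secret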
10-01-2015
12:14 AM
3 Kudos
The ambari user must be configured for sudo access, but the required access can be restricted. Take careful note of the ambari user's PATH (echo $PATH as ambari); you may need to change the full paths of the sudo entries below (mkdir, cat, etc.), as Ambari doesn't use fully qualified paths when executing commands. Note that the Defaults lines need to appear after other Defaults lines in /etc/sudoers (easiest to just put this at the end).

# Add Sudo Rules
ambari ALL=(ALL) NOPASSWD:SETENV: /bin/su hdfs *, /usr/bin/su hdfs *, /bin/su ambari-qa *, /usr/bin/su ambari-qa *, /bin/su ranger *, /usr/bin/su ranger *, /bin/su zookeeper *, /usr/bin/su zookeeper *, /bin/su knox *, /usr/bin/su knox *, /bin/su falcon *, /usr/bin/su falcon *, /bin/su ams *, /usr/bin/su ams *, /bin/su flume *, /usr/bin/su flume *, /bin/su hbase *, /usr/bin/su hbase *, /bin/su spark *, /usr/bin/su spark *, /bin/su accumulo *, /usr/bin/su accumulo *, /bin/su hive *, /usr/bin/su hive *, /bin/su hcat *, /usr/bin/su hcat *, /bin/su kafka *, /usr/bin/su kafka *, /bin/su mapred *, /usr/bin/su mapred *, /bin/su oozie *, /usr/bin/su oozie *, /bin/su sqoop *, /usr/bin/su sqoop *, /bin/su storm *, /usr/bin/su storm *, /bin/su tez *, /usr/bin/su tez *, /bin/su atlas *, /usr/bin/su atlas *, /bin/su yarn *, /usr/bin/su yarn *, /bin/su kms *, /usr/bin/su kms *, /bin/su mysql *, /usr/bin/su mysql *, /usr/bin/yum, /usr/bin/zypper, /usr/bin/apt-get, /bin/mkdir, /usr/bin/mkdir, /usr/bin/test, /bin/ln, /usr/bin/ln, /bin/chown, /usr/bin/chown, /bin/chmod, /usr/bin/chmod, /bin/chgrp, /usr/bin/chgrp, /usr/sbin/groupadd, /usr/sbin/groupmod, /usr/sbin/useradd, /usr/sbin/usermod, /bin/cp, /usr/bin/cp, /usr/sbin/setenforce, /usr/bin/test, /usr/bin/stat, /bin/mv, /usr/bin/mv, /bin/sed, /usr/bin/sed, /bin/rm, /usr/bin/rm, /bin/kill, /usr/bin/kill, /bin/readlink, /usr/bin/readlink, /usr/bin/pgrep, /bin/cat, /usr/bin/cat, /usr/bin/unzip, /bin/tar, /usr/bin/tar, /usr/bin/tee, /bin/touch, /usr/bin/touch, /usr/bin/hdp-select, /usr/bin/conf-select, /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh, /usr/lib/hadoop/bin/hadoop-daemon.sh, /usr/lib/hadoop/sbin/hadoop-daemon.sh, /sbin/chkconfig gmond off, /sbin/chkconfig gmetad off, /etc/init.d/httpd *, /sbin/service hdp-gmetad start, /sbin/service hdp-gmond start, /usr/sbin/gmond, /usr/sbin/update-rc.d ganglia-monitor *, /usr/sbin/update-rc.d gmetad *, /etc/init.d/apache2 *, /usr/sbin/service hdp-gmond *, /usr/sbin/service hdp-gmetad *, /sbin/service mysqld *, /usr/bin/python2.6 /var/lib/ambari-agent/data/tmp/validateKnoxStatus.py *, /usr/hdp/current/knox-server/bin/knoxcli.sh *, /usr/hdp/*/ranger-usersync/setup.sh, /usr/bin/ranger-usersync-stop, /usr/bin/ranger-usersync-start, /usr/hdp/*/ranger-admin/setup.sh *, /usr/hdp/*/ranger-knox-plugin/disable-knox-plugin.sh *, /usr/hdp/*/ranger-storm-plugin/disable-storm-plugin.sh *, /usr/hdp/*/ranger-hbase-plugin/disable-hbase-plugin.sh *, /usr/hdp/*/ranger-hdfs-plugin/disable-hdfs-plugin.sh *, /usr/hdp/current/ranger-admin/ranger_credential_helper.py, /usr/hdp/current/ranger-kms/ranger_credential_helper.py
Defaults exempt_group = ambari
Defaults !env_reset,env_delete-=PATH
Defaults: ambari !requiretty
You will also want to manually install Ambari Agent on all nodes and modify the agent configuration BEFORE starting Ambari Agent for the first time. Otherwise, the agent will start as root and you will need to manually fix the ownership of several directories (don't let Ambari Server install the agents via ssh). In /etc/ambari-agent/conf/ambari-agent.ini, modify the run_as_user and server hostname properties (a sketch follows below) and start the agents. Use the manual registration option when going through the cluster install wizard. I've used the above several times now with success.
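A minimal sketch of the agent-side changes (the server hostname is an example; run_as_user may need to be added under the [agent] section if it is not already present):

sed -i 's|^run_as_user=.*|run_as_user=ambari|' /etc/ambari-agent/conf/ambari-agent.ini
sed -i 's|^hostname=.*|hostname=ambari-server.example.com|' /etc/ambari-agent/conf/ambari-agent.ini
ambari-agent start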