Member since: 05-13-2016
Posts: 27
Kudos Received: 9
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1519 | 10-16-2016 09:14 PM |
04-27-2018 03:08 PM
The HDP 2.5/2.6 repos include the required tools to implement HDFS-FUSE. Note, though, that as of today this runs only in "user space," and as such does not perform as well as something like the NFS Gateway option. There is also currently no Ambari plugin to manage the service. I recently wrote an article that walks through this on the current HDP release: Using HDFS-FUSE for POSIX directory mounts
04-05-2018 04:41 PM
1 Kudo
Operational Challenges

Legacy applications that cannot leverage the REST interface for HDFS are limited in how they can access data in Hadoop storage. One way to deal with this is a mountable filesystem for Hadoop. Native HDFS access has not yet been added to the Linux kernel, so we need to use "user space" capabilities such as FUSE to provide this functionality:
https://en.wikipedia.org/wiki/Filesystem_in_Userspace

But before we begin, we need to understand some limitations of the HDFS-FUSE implementation:

- Because this runs in user space (vs. kernel space), performance is not on par with what native API implementations can provide
- For large data transfers, continuous ingest, and in particular streaming operations, users should look at solutions like Apache NiFi
- And last but not least, HDFS-FUSE does not currently support "append" operations on a file
So with those limitations understood, let's begin getting things set up on our cluster.

Installation Procedure

To begin, we need to install the packages from the HDP repositories.
This article is going to focus on HDP 2.6.4, but the same holds true for earlier releases (HDP 2.5.0 has been tested and works similarly)
We are also going to assume that users have been added to the cluster and have access to both local directories and HDFS storage.

As root (or with elevated privileges), install the requisite packages:

[root@scratch ~]# yum install hadoop-hdfs-fuse

NOTE: You may need to validate that both PATH and LD_LIBRARY_PATH include the locations of the requisite Hadoop and Java libraries and executables.
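If using the Oracle JDK and HDP 2.6.4, they might look similar to this (a minimal sketch; the JDK symlink and native-library directories below are assumptions to adjust for your environment):

# Point PATH and LD_LIBRARY_PATH at the Java and Hadoop binaries/libraries.
# /usr/java/default is the Oracle JDK RPM symlink; HDP paths assume a default install.
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:/usr/hdp/current/hadoop-client/bin:$PATH
# libhdfs lives under the HDP native dir; libjvm.so under the JDK server dir
export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH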
Now we need to create our mount point on the Linux filesystem that users can access. This example uses a single active NameNode; the mount command follows the same syntax as the /etc/fstab entry shown in the Addendum below.

[root@scratch ~]# mkdir -p /hadoop-fuse
[root@scratch ~]# hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> /hadoop-fuse

Now we will upload a file through the Ambari Files View first to validate our access (a simple CSV file is sufficient), and verify that we can see the file from the command line.

[demouser@scratch ~]$ ls -l /hadoop-fuse/user/demouser
total 2
-rw-r--r-- 1 demouser hadoop 2641 Apr 3 14:45 sample_color.csv

Now we can copy a file from our local filesystem into our HDFS user directory.

[demouser@scratch ~]$ cp test.csv /hadoop-fuse/user/demouser

We can verify that both files are visible in Ambari as well as from the command line.
[demouser@scratch ~]$ ls -l /hadoop-fuse/user/demouser
total 3
-rw-r--r-- 1 demouser hadoop 2641 Apr 3 14:45 sample_color.csv
-rw-r--r-- 1 demouser hadoop 118 Apr 3 14:46 test.csv

We will confirm that our Linux user can see the contents of the file that was uploaded via Ambari.

[demouser@scratch ~]$ head /hadoop-fuse/user/demouser/sample_color.csv
f_name,l_name,color,color_date,color_value
Jim,Smith,Blue,2016-03-10 12:01:01,2.0
Ben,Johnson,Brown,2016-03-10 12:01:01,3.0
Roy,Wallace,Black,2016-03-10 12:01:01,4.0
Gary,Kingston,White,2016-03-10 12:01:01,5.0
Shawn,Marty,Green,2016-03-10 12:01:01,6.0
Larry,Stonewall,Yellow,2016-03-11 12:01:01,7.0
Freddy,VanLeap,Purple,2016-03-11 12:01:01,8.0
Scott,Benson,Red,2016-03-11 12:01:01,2.0
Martha,Tory,Orange,2016-03-11 12:01:01,3.0

And finally, we will verify the contents of the file uploaded via the command line by previewing it in Ambari. And if we want to test out creating a HIVE table from our data set loaded via the Linux command line (test.csv was loaded in using just the copy command):
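A hedged sketch of such a table definition, issued from the command line with beeline (the HiveServer2 host, the test_csv table/directory names, and the assumption that test.csv shares the five-column layout of sample_color.csv are all illustrative):

# Give the table its own directory so it does not pick up the other file
[demouser@scratch ~]$ hdfs dfs -mkdir /user/demouser/test_csv
[demouser@scratch ~]$ hdfs dfs -mv /user/demouser/test.csv /user/demouser/test_csv/
[demouser@scratch ~]$ beeline -u "jdbc:hive2://<hiveserver2_host>:10000" -n demouser -e "
CREATE EXTERNAL TABLE test_csv (
  f_name STRING,
  l_name STRING,
  color STRING,
  color_date TIMESTAMP,
  color_value DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demouser/test_csv';"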
And now we can inspect (sample) the data in our table.
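For example, a quick sample from the same beeline session (table name as assumed above):

[demouser@scratch ~]$ beeline -u "jdbc:hive2://<hiveserver2_host>:10000" -n demouser -e "SELECT * FROM test_csv LIMIT 5;"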
Addendum

Once you are satisfied with the location and permissions, you can mount this at boot via /etc/fstab, or from a secondary startup script (e.g. rc.local, if it is enabled on CentOS/RHEL 7+). It is best to wait until the NameNode is up and running before you proceed with this automation; a sketch of such a guard appears at the end of this post.

hadoop-fuse-dfs#dfs://<name_node_hostname>:<namenode_port> <mount_point> fuse allow_other,usetrash,rw 2 0

For more information on how this works, see the Apache Hadoop page for Mountable HDFS: https://wiki.apache.org/hadoop/MountableHDFS

My next article on this will cover how this can work with NameNode HA and a secured cluster using Kerberos (authentication) and LDAP (Ranger authorization).
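As a hedged illustration of that "wait for the NameNode" advice (assumes the /etc/fstab entry above and an HDFS client configured on the host):

#!/bin/bash
# Block until HDFS reports that safe mode is off, then mount the fstab entry.
until hdfs dfsadmin -safemode get 2>/dev/null | grep -q 'OFF'; do
  sleep 10
done
mount /hadoop-fuse  # resolves the mount point against /etc/fstab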
03-06-2017 09:27 PM
Can you share more about which version of Spark you are using? HDP added support for ORC in Spark 1.4.
Please see the following article: https://hortonworks.com/blog/bringing-orc-support-into-apache-spark/

Here is a bit of code that shows how this works (the ORC path is a placeholder):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = sqlContext.read.format("orc").load("/path/to/orc_data")
10-16-2016 09:14 PM
1 Kudo
At this time, Cloudbreak cannot deploy a stand-alone HDF 2.0 cluster by itself. That will be available and fully supported in the next release. Be aware that when you install HDP 2.5 via the Cloudbreak AMI on the AWS Marketplace, you will only get OpenJDK 1.7 and not 1.8, which is required for HDF 2.0. You will need to add an edge node, upgrade it to JDK 1.8, and then remove the 1.7 JDK before you can install HDF using the (unsupported/demo) method described here: Demo Ambari service to deploy/manage NiFi on HDP
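A hedged sketch of that JDK swap on the edge node (OpenJDK package names as on CentOS/RHEL; verify against your repositories):

[root@edge ~]# yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
[root@edge ~]# yum remove -y java-1.7.0-openjdk
[root@edge ~]# java -version   # should now report 1.8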
10-05-2016 09:00 PM
I can confirm that the workaround does enable applications/tools like Excel and Tableau to function normally.
08-09-2016 01:13 PM
Good info. The concurrency metric you state is in line with other recommendations I have seen when dealing with large numbers of updates. Thank you!
08-09-2016 01:11 PM
Thanks Scott! Is there a published working example or set of metrics to help admins determine the frequency or identify a threshold? I see several generalized statements but not many test numbers. Maybe a follow-up HCC article in the making.
08-08-2016 08:19 PM
3 Kudos
All, if I turn on ACID for HIVE, is there a performance impact? And if so, are there best practices to mitigate or address it, such as Tez tuning or adjusting the number of mappers? The goal is to update tables/records and identify whether it makes more sense to do nightly batches vs. incremental updates throughout the day.
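For reference, by "turn on ACID" I mean the commonly documented Hive transaction settings, along these lines (illustrative values, typically set in hive-site.xml via Ambari):

hive.support.concurrency=true
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true
hive.compactor.worker.threads=1
hive.enforce.bucketing=true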
Labels:
- Apache Hive
05-13-2016 06:44 PM
Page 20 of the PDF explains how to enable additional logging: ODBC user guide for HIVE
05-13-2016 05:43 PM
I don't believe the password for hive is "hue". Did you try maria_dev/maria_dev for the username and password?