Member since: 05-13-2016 | Posts: 27 | Kudos Received: 9 | Solutions: 1
04-05-2018
04:41 PM
1 Kudo
Operational Challenges
Legacy applications that cannot leverage the REST interface for HDFS can be limited in how they access data in storage. One way to deal with this is a mountable filesystem for Hadoop. Since the Linux kernel does not yet include native HDFS support, we need to use "user space" capabilities such as FUSE to provide this functionality.
https://en.wikipedia.org/wiki/Filesystem_in_Userspace
Before we begin, we need to understand some limitations of the HDFS-FUSE implementation:
- Because it runs in user space (rather than kernel space), performance is not on par with what native API implementations can provide
- For large data transfers, continuous ingest, and in particular streaming operations, users should look at solutions like Apache NiFi
- Last but not least, HDFS-FUSE does not currently support "append" operations on a file
With those limitations understood, let's begin getting things set up on our cluster.
Installation Procedure
To begin with, we need to install the packages from the HDP repositories.
This article focuses on HDP 2.6.4, but the same holds true for earlier releases (HDP 2.5.0 has been tested and works similarly).
We are also going to assume that users have been added to the cluster and have access to both their local directories and HDFS storage. As root (or with elevated privileges), install the requisite package:
[root@scratch ~]# yum install hadoop-hdfs-fuse
NOTE: You may need to validate that both "PATH" and "LD_LIBRARY_PATH" point to the locations of the requisite Hadoop and Java libraries and executables. If using the Oracle JDK and HDP 2.6.4, they might look similar to the sketch below. We also need to create a mount point on the Linux filesystem that users can access and mount HDFS onto it; this example uses a single active NameNode.
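The following is only an illustrative sketch: the JAVA_HOME value, the /usr/hdp/current client paths, and the NameNode host/port placeholders are assumptions rather than values from a specific cluster, and the /hadoop-fuse mount point simply matches the paths used later in this article.
[root@scratch ~]# export JAVA_HOME=/usr/java/default
[root@scratch ~]# export PATH=$PATH:$JAVA_HOME/bin:/usr/hdp/current/hadoop-client/bin
[root@scratch ~]# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server:/usr/hdp/current/hadoop-client/lib/native
[root@scratch ~]# mkdir -p /hadoop-fuse
[root@scratch ~]# hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> /hadoop-fuse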
Now we will upload a file through the Ambari Files View first to validate our access (a simple CSV file is sufficient), then verify that we can see the file from the command line:
[demouser@scratch ~]$ ls -l /hadoop-fuse/user/demouser
total 2
-rw-r--r-- 1 demouser hadoop 2641 Apr 3 14:45 sample_color.csv
Now we can copy a file from our local filesystem into our HDFS user directory:
[demouser@scratch ~]$ cp test.csv /hadoop-fuse/user/demouser
We can verify that both files are visible in Ambari as well as from the command line.
[demouser@scratch ~]$ ls -l /hadoop-fuse/user/demouser
total 3
-rw-r--r-- 1 demouser hadoop 2641 Apr 3 14:45 sample_color.csv
-rw-r--r-- 1 demouser hadoop 118 Apr 3 14:46 test.csv
We will confirm that our Linux user can see the contents of the file that was uploaded via Ambari:
[demouser@scratch ~]$ head /hadoop-fuse/user/demouser/sample_color.csv
f_name,l_name,color,color_date,color_value
Jim,Smith,Blue,2016-03-10 12:01:01,2.0
Ben,Johnson,Brown,2016-03-10 12:01:01,3.0
Roy,Wallace,Black,2016-03-10 12:01:01,4.0
Gary,Kingston,White,2016-03-10 12:01:01,5.0
Shawn,Marty,Green,2016-03-10 12:01:01,6.0
Larry,Stonewall,Yellow,2016-03-11 12:01:01,7.0
Freddy,VanLeap,Purple,2016-03-11 12:01:01,8.0
Scott,Benson,Red,2016-03-11 12:01:01,2.0
Martha,Tory,Orange,2016-03-11 12:01:01,3.0
And finally, we will verify the contents of the file uploaded via the command line by previewing it in Ambari. We can also test creating a Hive table from our data set loaded from the Linux command line (test.csv was loaded in via the command line using just the copy command); a sketch follows.
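A minimal sketch of such a table definition, assuming the file shares the schema of sample_color.csv shown above (the columns of test.csv are not shown in this article); the table name, the directory location, and the header-skip property are illustrative assumptions:
hive> CREATE EXTERNAL TABLE color_sample (
        f_name STRING,
        l_name STRING,
        color STRING,
        color_date TIMESTAMP,
        color_value DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/demouser/color_data'
      TBLPROPERTIES ('skip.header.line.count'='1');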
And now we can inspect (sample) the data in our table, for example with a simple query.
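A hypothetical query against the color_sample table sketched above:
hive> SELECT f_name, color, color_value FROM color_sample LIMIT 5;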
Addendum
Once you are satisfied with the location and permissions, you can have this mounted at boot, or run it as part of a secondary startup script (e.g. rc.local, if it is enabled on CentOS/RHEL 7+) to mount on reboot. It is best to wait until the NameNode is up and running before you proceed with this automation. The /etc/fstab entry looks like this:
hadoop-fuse-dfs#dfs://<name_node_hostname>:<namenode_port> <mount_point> fuse allow_other,usetrash,rw 2 0
For more information on how this works, see the Apache Hadoop page for Mountable HDFS: https://wiki.apache.org/hadoop/MountableHDFS
My next article on this will cover how this can work with NameNode HA and a secured cluster with Kerberos (authentication) and LDAP (Ranger authorization).
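Returning to the boot-time automation mentioned above, one illustrative approach (not from the original article) is to have the startup script wait for the NameNode RPC port to answer before issuing the mount; the host/port placeholders and the /hadoop-fuse mount point are assumptions that should match your fstab entry:
#!/bin/bash
# Hypothetical snippet for /etc/rc.d/rc.local (CentOS/RHEL 7+, with rc-local enabled)
NN_HOST=<name_node_hostname>
NN_PORT=<namenode_port>
# Loop until the NameNode RPC port accepts a TCP connection
until (exec 3<>/dev/tcp/${NN_HOST}/${NN_PORT}) 2>/dev/null; do
  sleep 10
done
# Mount the fstab entry shown above
mount /hadoop-fuse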
08-10-2016
06:10 PM
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Configuration has a list of configuration options. For example, hive.compactor.delta.num.threshold lets you control how often minor compaction runs relative to the number of SQL operations. Each SQL insert/update/delete generates one delta file, and each minor compaction combines whatever delta files it finds into a single new delta that includes all the information. There is no specific guidance available, since it depends on your specific setup and requirements and has to be tested for. The other options listed there can also help; a hedged example follows.
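As a purely illustrative sketch (the value of 5 is an assumption, not a recommendation), the threshold could be lowered in hive-site.xml so that minor compaction is triggered after fewer delta directories accumulate:
<property>
  <name>hive.compactor.delta.num.threshold</name>
  <value>5</value>
</property>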
05-13-2016
06:26 PM
@devers If you mean a way to decrypt a file that has been encrypted with HDFS encryption, then no. Encryption and decryption with HDFS at-rest encryption is more complex: the EEK is stored with the file, and you have to talk to the KMS to get the decrypted key, etc. You can use HDFS encryption with Hive and Spark and let it take care of this for you. If you want to generate a key pair and use it for both Hive and Spark to encrypt/decrypt data, that can be done, but it would be part of loading and working with the data. You would need to define a UDF for Hive to use for decryption so you could reference it in a SELECT statement, and you would need to use libraries in Scala or Python for Spark to decrypt the data. Both would have to have access to the keys for decryption, though, and that may be difficult to architect in a secure fashion.
05-13-2016
06:44 PM
Page 20 of the ODBC user guide for Hive (PDF) explains how to further enable logging.