First, the SparkR Backend Handler shouldn't be so fragile that it fails simply because I got pulled into a one-hour meeting in the middle of constructing a script. SparkR seems to be the red-headed stepchild of the Spark API family, and it definitely shows... so someone from Hortonworks should either kill it forever and force me to switch to Python, or put your money where your software is and make this puppy more robust. But even if Hortonworks ignores R and SparkR, I still have to kill the hanging YARN application. Thank the lord I have 'sudo' privileges... which used to work! Until 2.6.5! Now when I run the command to kill the application ID, I get the following error:

Killing application application_1536084816396_0023
Exception in thread "main" org.apache.hadoop.yarn.exceptions.YarnException: java.security.AccessControlException: User hdfs cannot perform operation MODIFY_APP on application_1536084816396_0023

Can anyone tell me why my sudo call is now being interpreted as the 'hdfs' user?
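To be concrete, this is roughly what I'm running and what I'm considering as a workaround. It's only a sketch: the 'yarn' service-account name and the location of the config file are assumptions based on typical HDP setups, not things I've verified on this cluster.

# The kill command that now fails with the MODIFY_APP error:
sudo yarn application -kill application_1536084816396_0023

# Possible workaround (a sketch): run the kill as the YARN service account,
# assuming the ResourceManager runs as the local 'yarn' user on HDP:
sudo -u yarn yarn application -kill application_1536084816396_0023

# Check whether an admin ACL restricts who can modify applications
# (yarn.admin.acl; /etc/hadoop/conf is the usual HDP config location):
grep -A 2 'yarn.admin.acl' /etc/hadoop/conf/yarn-site.xml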
Please be considerate in your answers... we're very new to Hortonworks. We have 3 x 6 VERY LARGE files that need to be joined and stored so that we can query them as an EXTERNAL table from Hive. The cluster is 6 compute nodes with 8 cores and 128 GB RAM each, plus a 5-node Isilon feeding the HDFS filesystem. Each compute node has 3 partitions of 1 TB dedicated to /, /data, and /var. We're running Ambari, and I'm running this through an enabled Pig View, though not on Tez because I don't want to struggle with the configuration to get Pig enabled on Tez at the moment.

We have 3 directories, each with 6 partitions of fixed-width files. Each of the 3 directories has its own fixed-width format. Total size of the 3 directories: 120 GB, 210 GB, and 700 GB. Yes, this is Big Data, and I thought that's what Hortonworks is supposed to be able to manage. The layout of each directory generally looks like this:

-rwxrwxrwx 3 root 1000000 19332000000 2016-07-14 17:31 file01.txt
-rwxrwxrwx 3 root 1000000 19332000000 2016-07-14 17:42 file02.txt
...

I'm following the suggestions on this site: http://deathbytape.com/articles/2015/08/21/pig-hive-query-data.html

The Pig script is a bit of a monster in terms of length, but it's relatively simple. Pseudocode:
1) Define 3 schemas using LOAD and org.apache.pig.piggybank.storage.FixedWidthLoader().
2) Join the 3 datasets on their respective keys.
3) DESCRIBE the final table and DUMP 5 lines to ensure everything looks alright.
4) STORE the final table [USING OrcStorage(), as suggested by the hyperlink for speed and efficiency] in a new directory so that we can eventually build a Hive EXTERNAL table over the schema we've defined (a rough sketch of that DDL follows the script below).

Again, the actual script:
data1 = LOAD 'hdfs://fileshare:8020/directory1'
    USING org.apache.pig.piggybank.storage.FixedWidthLoader(
        '1-10 ,11-27 ,28-31 ,32-35 ,36-39 ,40-47 ,48-50 ,51-54 ,55-58 ... ',
        '',
        'col1a long, col2a int, ... etc');

data2 = LOAD 'hdfs://fileshare:8020/directory2'
    USING org.apache.pig.piggybank.storage.FixedWidthLoader(
        '1-10 ,11-19 , ... ',
        '',
        'col1b long, col2b int, ... etc');

data3 = LOAD 'hdfs://fileshare:8020/directory3'
    USING org.apache.pig.piggybank.storage.FixedWidthLoader(
        '1-10 ,11-27 , ... ',
        '',
        'col1c long, col2c int, ... etc');

ALLDATA = JOIN data1 BY encrypted_keya, data2 BY encrypted_keyb, data3 BY encrypted_keyc;

DESCRIBE ALLDATA;
alias_lim = LIMIT ALLDATA 5;
DUMP alias_lim;

STORE ALLDATA INTO '...' USING OrcStorage();
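And here's the kind of EXTERNAL table I'd eventually build over that ORC output. This is only a sketch: the table name, column names/types, and the output location 'hdfs://fileshare:8020/joined_output' are placeholders standing in for whatever the STORE above actually writes, not our real names.

# Rough sketch of the eventual Hive external table over the ORC output
# (all names below are placeholders):
hive -e "CREATE EXTERNAL TABLE alldata (col1a BIGINT, col2a INT)
         STORED AS ORC
         LOCATION 'hdfs://fileshare:8020/joined_output';"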
This BROKE two of the compute nodes... it filled them both to capacity. A total of 120 + 210 + 700 GB = 1.03 TB of input put 3+ TB on each of two compute nodes. That implies 6+ TB of junk created to run a join meant to produce a roughly 1 TB table. So now I have a lot of questions.

A) How is this robust? Hadoop is now almost a decade old, and in the little bit of learning I've done, it's supposed to be running things like YARN, which I've learned is a "Resource Management" workhorse... but the empirical results here demonstrate that it doesn't know how to manage resources.

B) Why would 6 TB be created to carry out the simple instructions required for an inner join of approximately 1 TB of data?

C) HOW DO I GET MY COMPUTE NODES BACK TO A FUNCTIONAL STATE? There are now 6+ terabytes of junk that I have to clean from the partitions, since the two VMs running these tasks have been crippled by the query. Let's say I reboot the VMs. Now what? Can I just "rm -rf" everything in a particular directory? Which one? Will this break the entire Hortonworks install? This seems to be so much more trouble than it's worth. (What I'm tempted to try is sketched below.)

D) Last question, from a bird's-eye view: in general, how fragile is Hadoop? In a normal ETL environment, testing needs to be done to determine the most efficient way of ingesting and transforming data. If I'm going to test multiple methods of ingesting a TB of data, I can't take a week to tiptoe around 19 levels of configuration before feeling "safe" enough to run some commands, because it might just blow up the cluster and force a reinstallation of the entire configuration, Hortonworks distro, permissions, etc. Who has time for this ridiculous amount of overhead??
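For question C specifically, this is the inspection/cleanup I'm considering on the affected nodes. It's only a sketch: the YARN local-directory paths below are assumptions based on typical HDP defaults, not paths I've verified, which is exactly why I'm asking before I delete anything.

# Find where the NodeManager actually keeps its local/intermediate data:
grep -A 2 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
grep -A 2 'yarn.nodemanager.log-dirs'   /etc/hadoop/conf/yarn-site.xml

# See what is actually eating the space on the node:
du -sh /data/* /var/* 2>/dev/null | sort -h | tail -20

# Per-application scratch space usually sits under the NodeManager's
# usercache/appcache (path assumed from HDP defaults):
du -sh /data/hadoop/yarn/local/usercache/*/appcache 2>/dev/null

# If it really is appcache left over from the dead job, I'm guessing the manual
# remedy (with the NodeManager stopped) is to remove only those application
# directories -- NOT a blanket rm -rf of /data. Is this safe?
# rm -rf /data/hadoop/yarn/local/usercache/<user>/appcache/application_<id>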