Member since: 05-02-2019 | Posts: 319 | Kudos Received: 145 | Solutions: 59
My Accepted Solutions
Views | Posted
---|---
7110 | 06-03-2019 09:31 PM
1724 | 05-22-2019 02:38 AM
2174 | 05-22-2019 02:21 AM
1358 | 05-04-2019 08:17 PM
1671 | 04-14-2019 12:06 AM
07-19-2016
10:48 PM
I know the tutorial at http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#section_6 does this slightly differently (it has the Spark code save a DataFrame as an ORC file in step 4.5.2 and then run a Hive LOAD command in step 4.5.3), but your INSERT-AS-SELECT model sounds like it should work. If you supply a simple working example of what you have, someone could test it out and try to help you resolve the issue.
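To make both approaches concrete, here is a minimal PySpark sketch against a Spark 1.x HiveContext; the paths, the my_orc_table table, and the staging_vw temp table are made up for illustration, and I have not run this against your data.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-ingest-sketch")
hc = HiveContext(sc)

# any DataFrame will do; this just reads some JSON for the example
df = hc.read.json("/tmp/sample_input")

# Option A (tutorial style): write ORC files, then LOAD them into the Hive table
df.write.format("orc").save("/tmp/orc_staging")
hc.sql("LOAD DATA INPATH '/tmp/orc_staging' INTO TABLE my_orc_table")

# Option B (your INSERT-AS-SELECT style): expose the DataFrame as a temp table
# and let Hive perform the write into the ORC-backed table
df.registerTempTable("staging_vw")
hc.sql("INSERT INTO TABLE my_orc_table SELECT * FROM staging_vw")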
07-19-2016
10:39 PM
Maybe a small sample dataset would help so someone could try to run it. The gotcha may not be with the STORE itself; its presence just forces all the transformations to run. If the problem is upstream of the STORE, then the error (whatever it is, since we don't have that output) should surface just the same if you use DUMP to display the data. If DUMP works and STORE does not, it is probably just a permissions issue with the location you are storing to; see the sketch below. Again, supply a small sample dataset and I'm sure somebody will chew on it for a few minutes to see if they can figure it out.
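For example, here is a minimal sketch of that check (the paths and the FILTER are made up); if the DUMP runs cleanly but the STORE fails, the output location's permissions are the prime suspect.

-- made-up paths and transformation, just to illustrate the DUMP-vs-STORE check
a = LOAD '/user/maria_dev/input' USING PigStorage(',');
b = FILTER a BY $0 IS NOT NULL;  -- stand-in for whatever transformations you have
-- DUMP b;                       -- step 1: uncomment to verify the transformations themselves run
STORE b INTO '/user/maria_dev/output' USING PigStorage(',');  -- step 2: if only this fails, check permissions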
07-19-2016
03:20 PM
I agree with the notes in the comments between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests that you can set pig.splitCombination to false (see the snippet at the end of this answer) to get over the hump of multiple files being combined into a single mapper. That said, I did a simple test on the 2.4 Sandbox with three files (named file1.txt, file2.txt and file3.txt), each with the following contents.

a,b,c
a,b,c
a,b,c

I then ran the following simple script (tried it with MR as well as Tez as the execution engine).

a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ','-tagFile');
DUMP a;

And I got the following (expected) output.

(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)

Good luck!
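If you do end up needing to disable split combining as that Stack Overflow thread suggests, the property can be set right at the top of the script. I did not need it for my plain-text test above, so treat this as an untested suggestion for your zip-file case.

SET pig.splitCombination false;
a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ','-tagFile');
DUMP a;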
07-19-2016
01:47 PM
2 Kudos
A quick way to determine the specific versions of core Hadoop (and all the components making up HDP) is to visit the particular HDP version's release notes under http://docs.hortonworks.com. When you are on a box itself, you can follow the "cookie crumbs" shown below, which confirm that HDP 2.4.2 uses Hadoop 2.7.1 as you identified above. Hint: look at the long jar name, which includes the Apache version number followed by the HDP version number.

[root@ip-172-30-0-91 hdp]# pwd
/usr/hdp
[root@ip-172-30-0-91 hdp]# ls
2.4.2.0-258 current
[root@ip-172-30-0-91 hdp]# cd current/hadoop-hdfs-client
[root@ip-172-30-0-91 hadoop-hdfs-client]# ls hadoop-hdfs-2*
hadoop-hdfs-2.7.1.2.4.2.0-258.jar
hadoop-hdfs-2.7.1.2.4.2.0-258-tests.jars
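Another quick check from any node is the hadoop version command, whose first line reports the Hadoop build version (on HDP it carries the same Apache-plus-HDP string you see in the jar name above); exact output formatting varies a bit between releases.

[root@ip-172-30-0-91 hadoop-hdfs-client]# hadoop version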
As for a Hadoop 2.8 release date, I'm sure not the person who can comment on that, but you can go to https://issues.apache.org/jira/browse/HADOOP/fixforversion/12329058/ to see all the JIRAs that are currently slated to be part of it. Good luck!
07-19-2016
01:29 PM
THEORETICALLY.... you could move the underlying block files from a particular DataNode and put them on another DataNode, but... you'd have to have that DataNode's processes stopped while doing it. When the other DataNode starts up with the files you moved, it will send a block report that contains the blocks you copied. If that was done in pretty tight synchronization with taking down the original DataNode it ~might~, again THEORETICALLY, work, but... DON'T DO THAT! Seriously, that is a bunny trail you would not really want to explore outside of a learning exercise in a non-production environment to help you understand how the NN and DN processes interoperate. Maintenance mode is not your answer here either, as its whole premise is to keep Ambari monitoring from sending false alarms about services you intentionally want unavailable.
Generally speaking, decommissioning a DataNode is the way to go, as it gives the NameNode time to stop placing new blocks on the DataNode being decommissioned while it redistributes the existing ones, so you are never under-replicated (a sketch of the manual steps is below). If you just delete the node, then you'll be under-replicated until the NameNode can resolve that for you.
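If you are doing the decommission by hand rather than through Ambari's Decommission action, the rough sketch below shows the usual steps; the excludes-file path and hostname are made up, and your dfs.hosts.exclude property may point somewhere else.

# 1. hdfs-site.xml must point at an excludes file, for example:
#      <property>
#        <name>dfs.hosts.exclude</name>
#        <value>/etc/hadoop/conf/dfs.exclude</value>
#      </property>
# 2. Add the DataNode's hostname to that file
echo "datanode-to-retire.example.com" >> /etc/hadoop/conf/dfs.exclude
# 3. Tell the NameNode to re-read it; the node shows as "Decommission In Progress"
#    until its blocks have been re-replicated elsewhere, then "Decommissioned"
su - hdfs -c "hdfs dfsadmin -refreshNodes"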
06-17-2016
03:05 PM
1 Kudo
What I think you are looking for is a list of the top N categories by total sales and, within each of those, the top N items sub-sorted by their own sales. For that, I created the following test data featuring 6 categories, each with 3 products. As with your data (and to keep the example simple), I left in the total sales per category, but if the data did not have this we could calculate it easily enough. NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.

[root@sandbox hcc]# cat raw_sales.txt
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140

Given that data, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC, as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ. Here's my code. I basically grouped and ordered the items by category total so I could throw away all but the top N categories first. After that, it is basically what you had already done.

[root@sandbox hcc]# cat salesHCC.pig
rawSales = LOAD 'raw_sales.txt' USING PigStorage(',')
AS (category: chararray, product: chararray,
sales: long, total_sales_category: long);
-- group them by the total sales / category combos
grpByCatTotals = GROUP rawSales BY
(total_sales_category, category);
-- put these groups in order from highest to lowest
sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;
-- just keep the top N
topSalesCats = LIMIT sortGrpByCatTotals 2;
-- do your original logic to get the top sales within the categories
topProdsByTopCats = FOREACH topSalesCats {
sorted = ORDER rawSales BY sales DESC;
top = LIMIT sorted 2;
GENERATE group, FLATTEN(top);
}
DUMP topProdsByTopCats;

The output is as initially expected.

[root@sandbox hcc]# pig -x tez salesHCC.pig
((140,CatC),CatC,Prod09-cC,80,140)
((140,CatC),CatC,Prod73-cC,50,140)
((60,CatZ),CatZ,Prod22-cZ,30,60)
((60,CatZ),CatZ,Prod83-cZ,20,60)

I hope this was what you were looking for. Either way, good luck!
06-10-2016
02:41 PM
2 Kudos
I'm not aware of anything like Spark's Accumulators being exposed as "first-class" objects in Pig, and have always advised that you would need to build a UDF for such activities if you couldn't simply get away with filtering the things you want to count (such as "good" records and "rejects") into separate aliases and then counting them up (a quick sketch of that approach is below). Here is a blog post going down the UDF path: https://dzone.com/articles/counters-apache-pig. Good luck & I'd love to hear if there is something directly in Pig that I've been missing all along.
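To show what I mean by the filter-and-count approach, here is a minimal sketch (the field names and the "reject" rule are made up for illustration).

-- split the records into the buckets you care about, then count each bucket
recs    = LOAD 'input.txt' USING PigStorage(',') AS (id: chararray, amount: long);
good    = FILTER recs BY amount IS NOT NULL;
rejects = FILTER recs BY amount IS NULL;
-- COUNT_STAR counts every tuple in the bag, including ones with null fields
goodCnt = FOREACH (GROUP good ALL) GENERATE COUNT_STAR(good);
rejCnt  = FOREACH (GROUP rejects ALL) GENERATE COUNT_STAR(rejects);
DUMP goodCnt;
DUMP rejCnt;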
06-06-2016
09:10 PM
5 Kudos
For a batch model, the "classic" pattern works for many: Sqoop the incremental data you need to ingest into a working directory on HDFS, then run a Pig script that loads the data (have a Hive table defined against that working directory so you can inherit the schema through HCatLoader), performs any transformations needed (possibly only a single FOREACH to project it in the correct order), and uses HCatStorer to store the data into a pre-existing ORC-backed Hive table; a rough sketch of that Pig step is below. You can stitch it all together with an Oozie workflow. I know of a place that uses a simple-and-novel Pig script like this to ingest 500 billion records per day into Hive.
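Here is a rough sketch of that Pig step (run with pig -useHCatalog); the database, table, and column names are made up, with the staging table assumed to be defined over the Sqoop working directory.

-- inherit the schema from the staging table via HCatLoader
staged = LOAD 'staging_db.orders_incoming' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- project the columns into the order the target ORC table expects
shaped = FOREACH staged GENERATE order_id, customer_id, amount, order_date;
-- write into the pre-existing ORC-backed Hive table
STORE shaped INTO 'warehouse_db.orders_orc' USING org.apache.hive.hcatalog.pig.HCatStorer();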
05-27-2016
10:13 PM
I've got a weird/wild one for sure and am wondering if anyone has any insight. Heck, I'm giving out "BONUS POINTS" for this one. I'm dabbling with sc.textFile()'s optional minPartitions parameter to give my Hadoop file more RDD partitions than it has HDFS blocks. When testing with a single-block HDFS file, all works fine up to 8 partitions, but from 9 onward it seems to add one extra partition, as shown below.

>>> rdd1 = sc.textFile("statePopulations.csv",8)
>>> rdd1.getNumPartitions()
8
>>> rdd1 = sc.textFile("statePopulations.csv",9)
>>> rdd1.getNumPartitions()
10
>>> rdd1 = sc.textFile("statePopulations.csv",10)
>>> rdd1.getNumPartitions()
11

I was wondering if there was some magical implementation activity happening at 9 partitions (or 9x the number of blocks), but I didn't see a similar behavior on a 5-block file I have.

>>> rdd2 = sc.textFile("/proto/2000.csv")
>>> rdd2.getNumPartitions()
5
>>> rdd2 = sc.textFile("/proto/2000.csv",9)
>>> rdd2.getNumPartitions()
9
>>> rdd2 = sc.textFile("/proto/2000.csv",45)
>>> rdd2.getNumPartitions()
45

Really not a pressing concern, but it sure has made me ask WTH? (What The Hadoop?) Anyone know what's going on?
Labels: Apache Spark
05-26-2016
02:28 PM
Hey Vijay, yep, this might be too big of a set of questions for HCC. My suggestion is to search for particular topics to see if they are already being addressed and then, ultimately, to post these as separate, discrete questions. For example, see https://community.hortonworks.com/questions/35539/snapshots-backup-and-dr.html as a pointed set of questions around snapshots; ok... that one had a bunch of Q's in one, too. 😉 Another alternative is to get hold of a solutions engineer from a company like (well, like Hortonworks!) to help you work through all of these what-if questions. Additionally, a consultant can help you build an operational "run book" that addresses all of these concerns in a version customized for your org. Good luck!