Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 7121 | 06-03-2019 09:31 PM |
 | 1725 | 05-22-2019 02:38 AM |
 | 2179 | 05-22-2019 02:21 AM |
 | 1360 | 05-04-2019 08:17 PM |
 | 1675 | 04-14-2019 12:06 AM |
06-03-2017
08:49 PM
Ahhh.... for my sins it looks like I ran into the same problem as you did, @Joan Viladrosa (shown below). Did you get anywhere with this? Do you have a workaround while I try to see if anyone knows what's up?
06-03-2017
02:57 PM
There are older versions of the Sandbox that might run within an 8GB machine, but probably not any that will have Spark, or at least not a relatively modern version of Spark. The Sandbox does have a cloud-hosted option. All of this is detailed at https://hortonworks.com/downloads/#sandbox. Good luck and happy Hadooping (and Sparking)!
06-03-2017
02:55 PM
2 Kudos
See @Artem Ervits' answer to this question at https://community.hortonworks.com/questions/14080/hadoop-nodes-with-different-characteristics.html. Ambari has a "config groups" feature that will help you out here. Good luck and happy Hadooping!
06-02-2017
04:32 PM
1 Kudo
You're not going to like it, but it is as simple as upper-casing "count" to be "COUNT". 😉 This brings up the larger issue of case-sensitivity in Pig. Generally speaking, case only really matters for alias names and things that end up being Java class names. Functions fall into that bucket, so just upper-case it and it'll work. Additionally, you could simplify the code a bit to just do COUNT(october) instead of COUNT(october.s_station). Good luck and happy Hadooping!
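For illustration only, here is a minimal sketch of what the corrected statement might look like. The october alias and s_station field come from your question; the grouping step and the grouped/station_counts alias names are my assumptions about the surrounding script.
-- assumed: october was already LOADed earlier in the script
grouped = GROUP october BY s_station;
-- COUNT must be upper-cased; COUNT(october) counts the records in each group
station_counts = FOREACH grouped GENERATE group, COUNT(october);
DUMP station_counts;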
05-30-2017
02:37 PM
1 Kudo
@Anirban Das Deb, yep, the notes in https://community.hortonworks.com/questions/65370/hdp23-pig-hive-rev6-vm-for-self-paced-learning.html probably helped you get that tiny pseudo-cluster running on Docker rebuilt again. As for your specific problem of running gedit to access and save files on that Docker image, the lab guide walks you through how to get going. I documented the essential steps in https://community.hortonworks.com/questions/66151/devph-folder-in-self-paced-learning-vm.html (scroll down towards the bottom as my answer isn't marked as "Best"). That said, if this helps you out, maybe you can "Accept" it on this one. 😉
05-24-2017
06:25 PM
Correct, File B will be loaded into memory and used in that context for each block of File A, with each block processed independently of the others.
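As a minimal sketch of that pattern in Pig (the file paths, schemas, and alias names here are hypothetical, not from your job), the smaller relation is listed last and flagged with USING 'replicated' so it gets cached in memory by every map task:
-- large relation; each block/split is handled by its own map task
big   = LOAD 'fileA' USING PigStorage(',') AS (key:chararray, val:int);
-- small relation; assumed small enough to fit in memory
small = LOAD 'fileB' USING PigStorage(',') AS (key:chararray, descr:chararray);
-- replicated (map-side) join: 'small' is shipped to and held in memory
-- by each map task that processes a block of 'big'
joined = JOIN big BY key, small BY key USING 'replicated';
DUMP joined;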
05-24-2017
04:28 PM
Agreed that the example use case could be solved more simply (the real world demands the KISS principle, but teaching examples are sometimes overkill by design). The point was to make sure you understood how it can be used. For a slightly meatier example, check out the one at https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
05-24-2017
03:18 PM
Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins details, you have the gist of what's happening here. The (obvious) goal is to do a map-side join instead of a more classical reduce-side join.
05-23-2017
01:11 PM
Yep, I know this "classic" script pretty well. You can find it on @Alan Gates's GitHub account at https://github.com/alanfgates/programmingpig/blob/master/examples/ch6/distinct_symbols.pig. The absolute best way to understand things (and very often answer questions) is to simply run something and observe the behavior. To help with this, I loaded up some simple data that equates to just three distinct trading symbols: SYM1, SYM2 and SYM3.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NYSE_daily.csv
NYSE,SYM1,100,100
NYSE,SYM1,200,200
NYSE,SYM2,100,100
NYSE,SYM2,200,200
NYSE,SYM3,100,100
NYSE,SYM3,200,200
[maria_dev@sandbox 104217]$
I got the script ready to run.
[maria_dev@sandbox 104217]$ cat distinct_symbols.pig
daily = load '/user/maria_dev/hcc/104217/NYSE_daily.csv'
USING PigStorage(',')
as (exchange, symbol); -- skip other fields
grpd = group daily by exchange;
describe grpd; -- to show where "daily" below comes from
dump grpd;
uniqcnt = foreach grpd {
sym = daily.symbol;
uniq_sym = distinct sym;
generate group, COUNT(uniq_sym);
};
describe uniqcnt;
dump uniqcnt;
[maria_dev@sandbox 104217]$
The first thing you seem to have trouble with is where "daily" comes from. As this output from the describe and dump of grpd shows, it is made up of two attributes: group and daily (where daily is the contents of all records from the NYSE, which is all records since that's all we have in this file).
grpd: { group: bytearray, daily: {(exchange: bytearray,symbol: bytearray)} }
( NYSE, {(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)} )
So, there is only one tuple in the grpd alias (there is only one distinct exchange) that we get to loop through. While inside the loop (for the single row it has), sym ends up being all six rows from (grpd current row).daily.symbol, and then uniq_sym ends up being the three distinct rows, which we use to generate the second (unnamed) attribute in uniqcnt. From there, we can describe and dump uniqcnt.
uniqcnt: {group: bytearray,long}
(NYSE,3)
To help illustrate it more, add the following file to the same input directory in HDFS.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NASDAQ_daily.csv
NASDAQ,HDP,10,1000
NASDAQ,ABC,1,1
NASDAQ,XYZ,1,1
NASDAQ,HDP,10,2000
[maria_dev@sandbox 104217]$
Then change the Pig script to just read the directory (which now has two files) and you'll get this updated output.
grpd: {group: bytearray,daily: {(exchange: bytearray,symbol: bytearray)}}
(NYSE,{(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)})
(NASDAQ,{(NASDAQ,HDP),(NASDAQ,XYZ),(NASDAQ,ABC),(NASDAQ,HDP)})
uniqcnt: {group: bytearray,long}
(NYSE,3)
(NASDAQ,3)
Hope this helps. Good luck and happy Hadooping!
05-22-2017
08:20 PM
Is that tuple definition of key and timestamp part of the declareOutputFields() method of your spout? My topology code (snippet below) did NOT have a chance to declare output fields from my Kafka spout (or maybe I just didn't wire it up right).
TopologyBuilder builder = new TopologyBuilder();
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
SpoutConfig sc = new SpoutConfig(hosts,
"s20-logs", "/s20-logs",
UUID.randomUUID().toString());
sc.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(sc);
builder.setSpout("log-spout", spout, 1);
builder.setBolt("message-tokenizer",
new MessageTokenizerBolt(), 1)
.shuffleGrouping("log-spout");
My Kafka messages were just a long tab-separated string of values, so my MessageTokenizerBolt (shown below) broke this apart and declared fields, such as ip-address, that could then later be used in a fieldsGrouping() further in the topology.
public class MessageTokenizerBolt extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
String[] logElements = StringUtils.split(tuple.getString(0), '\t');
String ipAddress = logElements[2];
String messageType = logElements[3];
String messageDetails = logElements[4];
basicOutputCollector.emit(new Values(ipAddress, messageType, messageDetails));
}
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("ip-address", "message-type", "message-details"));
}
}
I'm guessing this isn't your problem, as I was thinking you'd get a runtime exception if the field name you were trying to group on wasn't declared as a field name in the stream you are listening to. Maybe you can provide a bit more of your code?