Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 7114 | 06-03-2019 09:31 PM
 | 1725 | 05-22-2019 02:38 AM
 | 2175 | 05-22-2019 02:21 AM
 | 1358 | 05-04-2019 08:17 PM
 | 1672 | 04-14-2019 12:06 AM
06-03-2017
08:49 PM
Ahhh.... for my sins it looks like I ran into the same problem as you did, @Joan Viladrosa (shown below). Did you get anywhere with this? Do you have a workaround while I try to see if anyone knows what's up?
06-03-2017
02:57 PM
There are older versions of the Sandbox that might run within an 8GB machine, but probably not any that include Spark, or at least a relatively modern version of Spark. The Sandbox also has a cloud-hosted option. All of this is detailed at https://hortonworks.com/downloads/#sandbox. Good luck and happy Hadooping (and Sparking)!
06-03-2017
02:55 PM
2 Kudos
See @Artem Ervits' answer to this question at https://community.hortonworks.com/questions/14080/hadoop-nodes-with-different-characteristics.html. Ambari has a "config groups" feature that will help you out here. Good luck and happy Hadooping!
06-02-2017
04:32 PM
1 Kudo
You're not going to like it, but it is as simple as upper-casing "count" to be "COUNT". 😉 This brings up the larger issue of case-sensitivity in Pig. Generally speaking, case only really matters for alias names and things that end up being Java class names. Functions fall into that bucket, so just upper-case it and it'll work. Additionally, you could simplify the code a bit to just do COUNT(october) instead of COUNT(october.s_station). Good luck and happy Hadooping!
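P.S. If it helps, here is a minimal sketch of the fixed pattern; the load path, schema, and grouping key are stand-ins I made up for illustration, and only the alias october and field s_station come from your script:
-- hypothetical input; adjust the path and schema to match your data
october = load '/tmp/october.csv' USING PigStorage(',')
          as (s_station:chararray, s_reading:int);
grpd = group october by s_station;
-- function names are case-sensitive, so it must be COUNT, not count;
-- per the note above, COUNT(october) works in place of COUNT(october.s_station)
cnts = foreach grpd generate group, COUNT(october);
dump cnts;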
05-30-2017
02:37 PM
1 Kudo
@Anirban Das Deb, yep, the notes in https://community.hortonworks.com/questions/65370/hdp23-pig-hive-rev6-vm-for-self-paced-learning.html probably helped you get the tiny pseudo-cluster that runs on Docker rebuilt again. As for your specific problem of running gedit to access and save files on that Docker image, the lab guide walks you through how to get going. I documented the essential steps in https://community.hortonworks.com/questions/66151/devph-folder-in-self-paced-learning-vm.html (scroll down towards the bottom, as my answer isn't marked as "Best"). That said, if this helps you out, maybe you can "Accept" it on this one. 😉
05-24-2017
06:25 PM
Correct: File B will be loaded into memory and used in that context for each block of File A, with each block processed independently of the others.
05-24-2017
04:28 PM
Agreed that the example use case could be solved more simply (the real world demands the KISS principle, but sometimes teaching examples are overkill). The point was to make sure you understood how it can be used. For a slightly meatier example, check out the one at https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
05-24-2017
03:18 PM
Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins details, you have the gist of what's happening here. The (obvious) goal is to do a map-side join instead of a more classical reduce-side join.
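For anyone else landing on this thread, a minimal sketch of the syntax (the file paths and schemas here are made up for illustration; the important part is that the smaller relation is listed last so Pig can load it into every map task's memory):
daily = load '/tmp/NYSE_daily.csv' USING PigStorage(',')
        as (exchange:chararray, symbol:chararray, volume:long, price:double);
divs  = load '/tmp/NYSE_dividends.csv' USING PigStorage(',')
        as (exchange:chararray, symbol:chararray, dividend:double);
-- 'replicated' asks Pig for a fragment-replicate (map-side) join;
-- the last relation (divs) must be small enough to fit in memory
jnd = join daily by symbol, divs by symbol using 'replicated';
dump jnd;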
05-23-2017
01:11 PM
Yep, I know this "classic" script pretty well. You can find it on @Alan Gates's GitHub account at https://github.com/alanfgates/programmingpig/blob/master/examples/ch6/distinct_symbols.pig. The absolute best way to understand things (and very often answer questions) is to simply run something and observe the behavior. To help with this, I loaded up some simple data that equates to just three distinct trading symbols: SYM1, SYM2 and SYM3.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NYSE_daily.csv
NYSE,SYM1,100,100
NYSE,SYM1,200,200
NYSE,SYM2,100,100
NYSE,SYM2,200,200
NYSE,SYM3,100,100
NYSE,SYM3,200,200
[maria_dev@sandbox 104217]$
I got the script ready to run.
[maria_dev@sandbox 104217]$ cat distinct_symbols.pig
daily = load '/user/maria_dev/hcc/104217/NYSE_daily.csv'
        USING PigStorage(',')
        as (exchange, symbol); -- skip other fields
grpd = group daily by exchange;
describe grpd; -- to show where "daily" below comes from
dump grpd;
uniqcnt = foreach grpd {
    sym = daily.symbol;
    uniq_sym = distinct sym;
    generate group, COUNT(uniq_sym);
};
describe uniqcnt;
dump uniqcnt;
[maria_dev@sandbox 104217]$
The first thing you seem to have trouble with is where "daily" comes from. As this output from the describe and dump of grpd shows, grpd is made up of two attributes: group and daily (where daily is the contents of all records from the NYSE, which is all records since that's all we have in this file).
grpd: { group: bytearray, daily: {(exchange: bytearray,symbol: bytearray)} }
( NYSE, {(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)} )
So, there is only one tuple in the grpd alias (there is only one distinct exchange) that we get to loop through. While inside the loop (for the single row it has), sym ends up being all six rows from (grpd current row).daily.symbol, and then uniq_sym ends up being the three distinct rows, which we use to generate the second (unnamed) attribute in uniqcnt. From there, we can describe and dump uniqcnt.
uniqcnt: {group: bytearray,long}
(NYSE,3)
To help illustrate it more, add the following file to the same input directory in HDFS.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NASDAQ_daily.csv
NASDAQ,HDP,10,1000
NASDAQ,ABC,1,1
NASDAQ,XYZ,1,1
NASDAQ,HDP,10,2000
[maria_dev@sandbox 104217]$
Then change the pig script to just read the directory that has two files now and you'll get this updated output.
grpd: {group: bytearray,daily: {(exchange: bytearray,symbol: bytearray)}}
(NYSE,{(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)})
(NASDAQ,{(NASDAQ,HDP),(NASDAQ,XYZ),(NASDAQ,ABC),(NASDAQ,HDP)})
uniqcnt: {group: bytearray,long}
(NYSE,3)
(NASDAQ,3)
Hope this helps. Good luck and happy Hadooping!
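P.S. In case it is not obvious which line changes for that second run, the only edit is the load path; pointing it at the directory picks up both CSV files, and everything else in the script stays the same:
daily = load '/user/maria_dev/hcc/104217'
        USING PigStorage(',')
        as (exchange, symbol); -- skip other fields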
05-22-2017
08:20 PM
Is that tuple definition of key and timestamp part of the declareOutputFields() method of your spout? My topology code (snippet below) did NOT have a chance to declare output fields from my Kafka spout (or maybe I just didn't wire it up right).
TopologyBuilder builder = new TopologyBuilder();
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
SpoutConfig sc = new SpoutConfig(hosts,
        "s20-logs", "/s20-logs",
        UUID.randomUUID().toString());
sc.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(sc);
builder.setSpout("log-spout", spout, 1);
builder.setBolt("message-tokenizer",
        new MessageTokenizerBolt(), 1)
        .shuffleGrouping("log-spout");
My Kafka messages were just a long tab-separated string of values, so my MessageTokenizerBolt (shown below) broke this apart and declared fields, such as ip-address, that could then later be used in a fieldsGrouping() further in the topology.
public class MessageTokenizerBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        // the incoming tuple is one long tab-separated string; split it into its individual values
        String[] logElements = StringUtils.split(tuple.getString(0), '\t');
        String ipAddress = logElements[2];
        String messageType = logElements[3];
        String messageDetails = logElements[4];
        basicOutputCollector.emit(new Values(ipAddress, messageType, messageDetails));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // these are the field names a downstream fieldsGrouping() can reference
        outputFieldsDeclarer.declare(new Fields("ip-address", "message-type", "message-details"));
    }
}
I'm guessing this isn't your problem, as I was thinking you'd get a runtime exception if the field name you were trying to group on wasn't declared as a field name in the stream you are listening to. Maybe you can provide a bit more of your code?
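P.S. For completeness, wiring a later bolt with a fieldsGrouping() on one of those declared fields looks roughly like this; a sketch only, since the "ip-counter" id and IpCounterBolt class are made up for illustration:
// hypothetical downstream bolt; only the "message-tokenizer" id and "ip-address" field come from the code above
builder.setBolt("ip-counter", new IpCounterBolt(), 2)
       .fieldsGrouping("message-tokenizer", new Fields("ip-address"));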