Member since: 05-02-2019
Posts: 319
Kudos Received: 144
Solutions: 58
My Accepted Solutions
Views | Posted
---|---
3720 | 06-03-2019 09:31 PM
763 | 05-22-2019 02:38 AM
1086 | 05-22-2019 02:21 AM
618 | 05-04-2019 08:17 PM
802 | 04-14-2019 12:06 AM
08-09-2017
06:19 PM
The Essentials course is also offered in a self-paced "online" format for free; info at http://public.hortonworksuniversity.com/hdp-overview-apache-hadoop-essentials-self-paced-training.
07-31-2017
12:34 PM
Yep, this could work, but for a big cluster I could imagine it being time-consuming. The initial recursive listing (especially since it goes all the way down to the file level) could be quite large for a file system of any size, and the more time-consuming effort would be running the "hdfs dfs -count" command over and over and over. But, like you said, this should work. Ideally, I'd want the NameNode to just offer a "show me all quota details" command, or at least "show me all directories with quotas". Since this function is not present, maybe there is a performance cost for the NameNode to determine this quickly that I'm not considering, as it seems lightweight to me. Thanks for your suggestion.
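For what it's worth, here is a rough shell sketch of that iterate-and-check idea (untested; it assumes paths contain no spaces, and the starting path of / should be narrowed as needed):
# Walk the namespace for directories, run the quota check on each, and keep only
# the lines where a name quota or space quota is actually set (not "none").
hdfs dfs -ls -R / | awk '$1 ~ /^d/ {print $NF}' | while read -r dir; do
  hdfs dfs -count -q "$dir"
done | awk '$1 != "none" || $3 != "none"'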
07-31-2017
09:20 AM
1 Kudo
The HDFS Quota Guide (http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html) shows how to list the details of a quota on a specific directory, but is there a way to see all quotas with one command (or at least a way to list all directories that have quotas, something like the way you can list all snapshottable dirs, which I could then programmatically iterate through to check individual quotas)? My hunch was that I could just check the / directory and see a roll-up of the two specific quotas shown first, but as expected it only shows the details of that directory's own quota (if it exists).
[hdfs@node1 ~]$ hdfs dfs -count -v -q /user/testeng
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
400 399 none inf 1 0 0 /user/testeng
[hdfs@node1 ~]$ hdfs dfs -count -v -q /user/testmar
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
none inf 134352500 134352500 1 0 0 /user/testmar
[hdfs@node1 ~]$
[hdfs@node1 ~]$
[hdfs@node1 ~]$ hdfs dfs -count -v -q /
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
9223372036854775807 9223372036854775735 none inf 49 23 457221101 /
[hdfs@node1 ~]$
Tags:
- Hadoop Core
- HDFS
Labels:
- Apache Hadoop
07-13-2017
02:08 PM
I'm using Ambari 2.4.2.0 (and Capacity Scheduler Ambari View 1.0.0), which DOES have "Save and Refresh Queues". That's not the problem. What is concerning is that over on the YARN service page, Ambari wants to restart the RMs, as shown in the attached screenshot. That probably doesn't need to be done, BUT it causes ongoing grief for operators who don't want to see all of these warning messages to restart things. Thoughts?
07-12-2017
02:35 PM
Before the nice new Ambari View, we could get away with "Refreshing Capacity Queues" (or some such service-level command), but with the new (and very nice!) Ambari View, even the simplest changes to the queue definitions get represented in Ambari as a need to restart the ResourceManager. Is this a bug? Was this behavior present before (i.e., if we refreshed the queues after editing the simple text box, did Ambari still want to restart the RMs)?
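For reference, a minimal sketch of the pre-View, "refresh only" path done by hand (the config file location is an assumption for an HDP-style layout; on an Ambari-managed cluster Ambari normally owns this file):
# Edit the queue definitions directly, then ask the running ResourceManager
# to reload them without a restart (run as the YARN admin user).
vi /etc/hadoop/conf/capacity-scheduler.xml
yarn rmadmin -refreshQueues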
Labels:
- Apache Ambari
- Apache YARN
07-10-2017
07:05 PM
Gotcha; you're using the Ambari View. It still seems that it is not getting invoked properly. Can you provide a screenshot of the Ambari View, especially the section with the -useHCatalog argument? Did you try it with, and without, the "use Tez" checkbox selected? While this code looks good, it is often a good idea to try the code out from the CLI just to remove one variable (again, the code looks simple and direct enough that I don't think this would provide much value other than showing you it can run).
07-10-2017
01:26 AM
The "yarn jar" warning is nothing to worry about and the output you received suggests that you were unable to launch the script. My guess is your command-line interaction was incorrect. It should have been something like the following. pig -useHCatalog yourscript.pig You can see some examples of this at https://martin.atlassian.net/wiki/x/AgCfB (including running via Tez). If you are doing this, please show the exact command your ran. If running from the Ambari View, be sure to add the -useHCatalog argument as shown in Step 5.4 of https://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/.
07-06-2017
01:33 PM
I think there is always an interest in your approach of doing real-time inserts/updates/deletes into HBase and then fronting that with a Hive table, but I don't believe you will get the kind of performance you are expecting when you start joining that table with first-class Hive tables, not to mention doing any kind of analytical query (OK, any query that doesn't just read based on the rowKey). Not saying that isn't a valid approach, but you'd sure want to do some testing, and even then you might find yourself doing the updates against HBase and periodically dumping that data into something you could use in a more first-class manner with Hive (and then you've lost your real-time updates).

I do agree with the others who have commented on this question about looking at the Hive INSERT/UPDATE/DELETE options as well as the newly supported MERGE command. Plenty of testing will be needed to make sure this is your solution, but it is clearly the most developer-friendly model to chase; significant effort has gone into getting this working thus far, and I expect continued efforts to broaden the scope and reduce the prerequisites.

Regarding Approach #2 and the incremental update blog post from 2014, I invite you to take a look at my materials from my 2015 Summit talk on this topic, https://martin.atlassian.net/wiki/x/GYBzAg, as I think there are a few options to consider if you go down this "classical" data update path (mostly based on the size of data across the table, the percentage of data being changed, the skew of those updates, and how frequently you need to sync up with your source table). Good luck and happy Hadooping!
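For anyone curious what that MERGE option looks like, here is a minimal sketch run through beeline (the JDBC URL, table names, and columns are all hypothetical, and the target must be an ACID/transactional Hive table):
# Hypothetical example: fold a staging table of changes into an ACID target table.
beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
MERGE INTO customer AS t
USING customer_staging AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email);
"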
07-06-2017
01:18 PM
Great question, and unfortunately I don't think there is a well-agreed-upon formula/calculator out there, as "it depends" is so often the rule. Some considerations: the datanode doesn't really know about the directory structure; it just stores (and copies, deletes, etc.) blocks as directed by the namenode (often indirectly, since clients write the actual blocks). Additionally, the block-level checksums are actually stored on disk alongside the files for the data contained in a given block. It looks like there's some good info in the following HCC questions that might be of help to you.
https://community.hortonworks.com/questions/64677/datanode-heapsize-computation.html
https://community.hortonworks.com/questions/45381/do-i-need-to-tune-java-heap-size.html
https://community.hortonworks.com/questions/78981/data-node-heap-size-warning.html
Good luck and happy Hadooping!
06-22-2017
06:37 PM
Instead of "year as (year:int)", try "(int) year as castedYear:int".
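A minimal sketch of that cast in context (the input path, alias, and field names here are just placeholders):
cat > cast_example.pig <<'EOF'
-- Load everything as chararray, then cast explicitly in the FOREACH.
raw   = LOAD '/tmp/rows.csv' USING PigStorage(',') AS (year:chararray, amount:chararray);
typed = FOREACH raw GENERATE (int) year AS castedYear:int, (double) amount AS castedAmount:double;
DESCRIBE typed;
EOF
pig cast_example.pig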
06-14-2017
04:13 PM
Hi Calvin. Before I type anything else, please realize that I do not know YOUR specific use cases, but I doubt anyone will argue with me that there are going to be very few of them that would really make sense to run on a single-node (aka pseudo-distributed) cluster. If all of your data can fit on one machine and run within the constraints of 8 GB of memory, then quite possibly you just don't need Hadoop for that scenario. Additionally, even HDFS cannot do what it is supposed to in a single-node configuration, since it has no additional nodes for replication to occur on.

All that said, the HDP Sandbox is a way to jumpstart your initial hands-on efforts with Hadoop and to provide a playground for our publicly available tutorials and similarly sized and scoped investigative activities you may undertake. A full HDP stack takes many more resources than are typical in a single server with the characteristics of a simple laptop or desktop. The Sandbox team makes MANY configuration adjustments to try to shoehorn the whole stack into a single image. In fact, you'll notice that not all services are running at any given time, which is a pattern I'd recommend (start only what you need for an experiment and stop everything else).

Please do realize we are still talking about "commodity hardware", but we are not talking about "tiny hardware". Most on-prem servers are quite big, and https://community.hortonworks.com/questions/37565/cluster-sizing-calculator.html points you to https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_cluster-planning/content/ch_hardware-recommendations_chapter.html, which makes some suggestions on pilot clusters and full production ones. You'll also read some additional thoughts on the whole "commodity hardware" versus "enterprise data center server" question in that documentation. Good luck and happy Hadooping!
06-07-2017
08:06 PM
Excellent. Truthfully, the case sensitivity is a bit weird in Pig -- kind of like the rules of the English language. Hehe!
06-06-2017
03:25 PM
Regarding the on-demand offerings we have, we do have an HDP Essentials course, but currently it is only available via the larger, bundled Self-Paced Learning Library described at https://hortonworks.com/self-paced-learning-library/. We are working towards offering individual on-demand courses, but we're not there yet. You could also register for it individually via our live (remote, in most cases) delivery options shown at https://hortonworks.com/services/training/class/hadoop-essentials/.
06-04-2017
08:52 PM
I'd raise a separate HCC question for help with that. That way we'll get the targeted audience and your Q's won't be buried within this one that most will read as a cert question. That's a fancy way to say I haven't set that particular version up myself and wouldn't be much help until after I got my hands dirty with it. 😉
06-04-2017
05:49 PM
Could you add some small sample files (or links to where to grab them) for the timesheet and driver CSV files, too?
06-04-2017
04:38 AM
It did the trick for me. I sure hope it helps out @Joan Viladrosa, too! Thanks, Sriharsha!
06-03-2017
08:49 PM
Ahhh.... for my sins it looks like I ran into the same problem as you did, @Joan Viladrosa (shown below). Did you get anywhere with this? Do you have a workaround while I try to see if anyone knows what's up?
06-03-2017
02:57 PM
There are older versions of the Sandbox that might run within an 8 GB machine, but probably not any that will have Spark, or at least a relatively modern version of Spark. The Sandbox does have a cloud-hosted option. All of this is detailed at https://hortonworks.com/downloads/#sandbox. Good luck and happy Hadooping (and Sparking)!
06-03-2017
02:55 PM
2 Kudos
See @Artem Ervits' answer in https://community.hortonworks.com/questions/14080/hadoop-nodes-with-different-characteristics.html to this question. Ambari has a "config groups" feature that will help you out here. Good luck and happy Hadooping!
06-02-2017
04:32 PM
1 Kudo
You're not going to like it, but it is as simple as upper-casing "count" to "COUNT". 😉 This brings up the larger issue of case-sensitivity in Pig. Generally speaking, case only really matters for alias names and things that end up being Java class names. Functions fall into that bucket, so just upper-case it and it'll work. Additionally, you could simplify the code a bit to just do COUNT(october) instead of COUNT(october.s_station). Good luck and happy Hadooping!
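Here is a minimal runnable sketch of the corrected, upper-cased call (the input path and field names are placeholders patterned after the aliases mentioned above):
cat > count_stations.pig <<'EOF'
october = LOAD '/tmp/october.csv' USING PigStorage(',') AS (s_station:chararray, s_reading:int);
grpd    = GROUP october ALL;
-- COUNT must be upper-cased; counting the whole bag works as well as a single field.
totals  = FOREACH grpd GENERATE COUNT(october);
DUMP totals;
EOF
pig count_stations.pig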
05-30-2017
02:37 PM
1 Kudo
@Anirban Das Deb, yep, the notes in https://community.hortonworks.com/questions/65370/hdp23-pig-hive-rev6-vm-for-self-paced-learning.html probably helped you get that tiny pseudo cluster running on Docker rebuilt again. As for your specific problem of running gedit to access and save files on that Docker image, the lab guide walks you through how to get going. I documented the essential steps in https://community.hortonworks.com/questions/66151/devph-folder-in-self-paced-learning-vm.html (scroll down towards the bottom, as my answer isn't marked as "Best"). That said, if this helps you out, maybe you can "Accept" it on this one. 😉
05-30-2017
12:49 PM
2 Kudos
You're right that there are (plenty of) times when you need to do some cleansing/transforming/enhancing of your data, and you're also right that you have multiple tools and approaches for this. I talked about this (at a high level) in my recent www.devnexus.com preso, which you can find at https://www.slideshare.net/lestermartin/transformation-processing-smackdown-spark-vs-hive-vs-pig. The good (and bad) news is that you get to make some choices here, which I believe are usually decided based on your, and your team's, experiences and preferences as much as anything. If you have some specific scenarios you want help on, it might be best to open a specific HCC question for each of them; you'll probably get a more targeted response, as this question is rather high-level and the answers could quickly become subjective (again, based on individuals' experiences and preferences). Good luck and happy Hadooping!
05-24-2017
06:25 PM
Correct; File B will be loaded into memory and used in that context for each block of File A, with each block processed independently of the others.
05-24-2017
04:28 PM
Agreed that the example use case could be solved more simply (the real world demands the KISS principle, but sometimes simple examples are overkill). The point was to make sure you understood how it can be used. For a slightly meatier example, check out the one at https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/.
05-24-2017
03:22 PM
I'm not aware of it myself. Might be useful to chime in on the Zeppelin Scheduler discussion at https://community.hortonworks.com/questions/98101/scheduler-in-zeppelin.html with this question.
05-24-2017
03:18 PM
Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins details, you have the gist of what's happening here. The (obvious) goal is to do a map-side join instead of a more classical reduce-side join.
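A minimal sketch of what that looks like in a script (the paths and field names are placeholders; the smaller relation is listed last and must fit in memory):
cat > repl_join.pig <<'EOF'
big   = LOAD '/tmp/transactions.csv' USING PigStorage(',') AS (cust_id:int, amount:double);
small = LOAD '/tmp/customers.csv'    USING PigStorage(',') AS (cust_id:int, name:chararray);
-- 'replicated' ships the last (small) relation to every map task for a map-side join.
joined = JOIN big BY cust_id, small BY cust_id USING 'replicated';
DUMP joined;
EOF
pig repl_join.pig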
05-23-2017
01:11 PM
Yep, I know this "classic" script pretty well. You can find it on @Alan Gates's GitHub account at https://github.com/alanfgates/programmingpig/blob/master/examples/ch6/distinct_symbols.pig. The absolute best way to understand things (and very often answer questions) is to simply run something and observe the behavior. To help with this, I loaded up some simple data that equates to just three distinct trading symbols: SYM1, SYM2 and SYM3.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NYSE_daily.csv
NYSE,SYM1,100,100
NYSE,SYM1,200,200
NYSE,SYM2,100,100
NYSE,SYM2,200,200
NYSE,SYM3,100,100
NYSE,SYM3,200,200
[maria_dev@sandbox 104217]$
I got the script ready to run.
[maria_dev@sandbox 104217]$ cat distinct_symbols.pig
daily = load '/user/maria_dev/hcc/104217/NYSE_daily.csv'
USING PigStorage(',')
as (exchange, symbol); -- skip other fields
grpd = group daily by exchange;
describe grpd; -- to show where "daily" below comes from
dump grpd;
uniqcnt = foreach grpd {
sym = daily.symbol;
uniq_sym = distinct sym;
generate group, COUNT(uniq_sym);
};
describe uniqcnt;
dump uniqcnt;
[maria_dev@sandbox 104217]$
The first thing you seem to have trouble with is where "daily" comes from. As this output from the describe and dump of grpd shows, grpd is made up of two attributes: group and daily (where daily is the contents of all records from the NYSE, which is all records, since that's all we have in this file).
grpd: { group: bytearray, daily: {(exchange: bytearray,symbol: bytearray)} }
( NYSE, {(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)} )
So, there is only one tuple in the grpd alias (there is only one distinct exchange) that we get to loop through. While inside the loop (for the single row it has), sym ends up being all six rows from (grpd's current row).daily.symbol, and then uniq_sym ends up being the three distinct rows, which we use to generate the second (unnamed) attribute in uniqcnt. From there, we can describe and dump uniqcnt.
uniqcnt: {group: bytearray,long}
(NYSE,3)
To help illustrate it more, add the following file to the same input directory in HDFS.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NASDAQ_daily.csv
NASDAQ,HDP,10,1000
NASDAQ,ABC,1,1
NASDAQ,XYZ,1,1
NASDAQ,HDP,10,2000
[maria_dev@sandbox 104217]$
Then change the Pig script to just read the directory (which now has two files) and you'll get this updated output.
grpd: {group: bytearray,daily: {(exchange: bytearray,symbol: bytearray)}}
(NYSE,{(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)})
(NASDAQ,{(NASDAQ,HDP),(NASDAQ,XYZ),(NASDAQ,ABC),(NASDAQ,HDP)})
uniqcnt: {group: bytearray,long}
(NYSE,3)
(NASDAQ,3)
Hope this helps. Good luck and happy Hadooping!
05-22-2017
08:20 PM
Is that tuple definition of key and timestamp part of the declareOutputFields() method of your spout? My topology code (snippet below) did NOT have a chance to declare output fields from my Kafka spout (or maybe I just didn't wire it up right).
TopologyBuilder builder = new TopologyBuilder();
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
SpoutConfig sc = new SpoutConfig(hosts,
"s20-logs", "/s20-logs",
UUID.randomUUID().toString());
sc.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(sc);
builder.setSpout("log-spout", spout, 1);
builder.setBolt("message-tokenizer",
new MessageTokenizerBolt(), 1)
.shuffleGrouping("log-spout");
My Kafka messages were just a long tab-separated string of values, so my MessageTokenizerBolt (shown below) broke this apart and declared fields, such as ip-address, that could then later be used in a fieldsGrouping() further in the topology.
public class MessageTokenizerBolt extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
String[] logElements = StringUtils.split(tuple.getString(0), '\t');
String ipAddress = logElements[2];
String messageType = logElements[3];
String messageDetails = logElements[4];
basicOutputCollector.emit(new Values(ipAddress, messageType, messageDetails));
}
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("ip-address", "message-type", "message-details"));
}
}
I'm guessing this isn't your problem, as I was thinking you'd get a runtime exception if the field name you were trying to group on wasn't declared as a field name in the stream you are listening to. Maybe you can provide a bit more of your code?
05-21-2017
10:33 PM
I surely don't have an answer for this one, but you could ~play~ with it by hand-jamming what you think is the appropriate jar onto the worker nodes, as I did with another jar issue described in https://martin.atlassian.net/wiki/x/JbXqBQ. Then you'd have an easy case to submit that the support team could easily reproduce (and get fixed!). Good luck and happy Hadooping/Storming.
05-21-2017
10:28 PM
Here's a snippet from a working scenario similar to yours that works just fine for me.
TopologyBuilder builder = new TopologyBuilder();
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
SpoutConfig sc = new SpoutConfig(hosts,
"s20-logs", "/s20-logs",
UUID.randomUUID().toString());
sc.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(sc);
builder.setSpout("log-spout", spout, 1); If you notice, you'll see that my BrokerHosts are really the list of ZooKeeper instances. I'm running this on a HDP 2.5 cluster which is Storm 1.0.1 and I constructed this ZkHosts class after reviewing the notes in http://storm.apache.org/releases/1.0.1/storm-kafka.html. Might be worth a try for you. Either way, good luck and Happy Hadooping/Storming!