Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 58
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4319 | 06-03-2019 09:31 PM |
| | 900 | 05-22-2019 02:38 AM |
| | 1220 | 05-22-2019 02:21 AM |
| | 695 | 05-04-2019 08:17 PM |
| | 916 | 04-14-2019 12:06 AM |
06-04-2019
01:06 PM
You could try to do a single INSERT INTO statement per partition and run as many of these simultaneously as your cluster has resources for.
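Something along these lines, one statement per target partition (table and column names here are just made up for illustration):

```sql
-- hypothetical names; fire one of these per partition, as many in parallel as the cluster can take
INSERT INTO TABLE sales_by_day PARTITION (sale_date = '2019-06-01')
SELECT order_id, customer_id, amount
FROM   sales_staging
WHERE  sale_date = '2019-06-01';
```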
06-03-2019
09:31 PM
Yep, create a new table defined with the partitioning you want, then insert into it using dynamic partitioning and you'll be good to go. Good luck and happy Hadooping.
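A minimal sketch of that move, with made-up table names and an ORC target:

```sql
-- enable dynamic partitioning for the session
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- new table laid out with the partitioning you actually want
CREATE TABLE events_repart (id BIGINT, payload STRING)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- let Hive route each row to the right partition on the way in
INSERT INTO TABLE events_repart PARTITION (event_date)
SELECT id, payload, event_date
FROM   events_old;
```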
05-22-2019
02:38 AM
I would strongly suggest you look at HBase's snapshotting model as detailed at https://hbase.apache.org/book.html#ops.snapshots. The snapshot create process is very fast as it does NOT create a copy of the underlying HFiles on HDFS (it just keeps "pointers" to them). Then you can use the ExportSnapshot process to copy the needed underlying HFiles over to the second HBase cluster. This model won't use any extra space on the source cluster (well, as long as you delete the snapshot once you are done!), and the space on the target cluster isn't really extra either, since you'd have to get all those HFiles created over there anyway, which is exactly what this process does. Good luck and happy HBasing!
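Roughly, the moving pieces look like this (the table name, snapshot name, and target NameNode address are all hypothetical; point -copy-to at the target cluster's hbase.rootdir):

```bash
# in the hbase shell on the source cluster -- the snapshot itself is cheap and fast:
#   hbase> snapshot 'web_events', 'web_events_snap_20190522'

# copy the snapshot's referenced HFiles over to the target cluster
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot web_events_snap_20190522 \
  -copy-to hdfs://target-nn:8020/apps/hbase/data \
  -mappers 16

# then, in the hbase shell on the target cluster, materialize a table from it:
#   hbase> clone_snapshot 'web_events_snap_20190522', 'web_events'
```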
05-22-2019
02:26 AM
Are you trying to replace the functionality of creating the host/date paths and files, and/or do you want NiFi to recursively walk the growing directories to get at the underlying syslog.log files?
05-22-2019
02:21 AM
1 Kudo
As @gdeleon suggested... "that dog won't hunt". Basically, you'll need at least two YARN containers for each concurrent Hive user/query: one to house the ApplicationMaster and another to start doing some actual work (the first one is what gets the application into the "Running" state). The "Accepted" state means those users were able to get a container for their ApplicationMasters, but then there wasn't enough room for YARN to grant enough actual containers to do much else. Again, it just isn't designed for this. A better solution would be to let each student have their own HDP Sandbox (and they won't need to allocate 32GB VMs). Good luck and happy Hadooping!
05-06-2019
08:53 PM
Hey @Matt Clarke, if there is a better way to do this w/o RPG as you suggested in your answer over in https://community.hortonworks.com/questions/245373/nifi-cluster-listensmtp.html, would you have time to update this article to account for that? I point folks to this link all the time. Thanks!
05-04-2019
08:38 PM
Missing the Ambari server name before the ":8080"?
05-04-2019
08:17 PM
Probably because of HCatalog, which can be extremely useful for Pig programmers even if they don't want to use Hive and just want to leverage it for schema management instead of defining AS clauses in their LOAD commands? Just as likely, this is something hard-coded into Ambari? If you really don't want Hive, I bet you can just delete it after installation. For giggles, I stood up an HDFS-only HDP 3.1.0 cluster for https://community.hortonworks.com/questions/245432/is-it-possible-to-install-only-hdfs-on-linux-machi.html?childToView=245544#answer-245544 and just added Pig (it required YARN, MR, Tez & ZK, but that makes sense!) and it did NOT require Hive to be added, as seen below. Good luck and happy Hadooping!
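For context on the HCatalog angle, here's the kind of difference it makes in a Pig script (the table name, path, and schema below are made up for illustration):

```pig
-- with HCatalog (run pig with -useHCatalog): the schema comes from the metastore
clicks = LOAD 'web_clicks' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- without HCatalog: the schema has to be spelled out by hand in every script
clicks_raw = LOAD '/data/web_clicks' USING PigStorage(',')
    AS (user_id:chararray, url:chararray, click_ts:long);
```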
05-04-2019
07:50 PM
Sounds like this should be running as an Isolated Processor, configured to run on the Primary Node only instead of All Nodes. Then, to take full advantage of both of the NiFi nodes you have, you'll want to create a Remote Process Group back to yourself, much as explained in https://community.hortonworks.com/articles/97773/how-to-retrieve-files-from-a-sftp-server-using-nif.html. Good luck and happy Flowfiling!
05-04-2019
07:36 PM
Sure... why not!?!? 🙂 I just installed HDP 3.1.0 via Ambari as barebones as I could on a small 5-node cluster (1 master and 4 workers). It did make me install Ambari Metrics, SmartSense and ZK, but I was able to delete those after everything was installed, as shown in my screenshot. That said, I'd leave those in (and ZK will be required if you want HDFS HA), but I wanted to make the point that you could have JUST HDFS. Good luck and happy Hadooping!
05-04-2019
06:52 PM
I didn't see such a property when I looked at http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ListFTP/index.html, but some quick solutions to jumpstart this could be to simply replace this processor with a new one (which will have its own state management), or, if you are using "timestamps" for the "Listing Strategy" property, you could always run a Linux "touch" command on the files on the FTP server, which should trick the processor into grabbing them again. Good luck and happy Flowfiling!
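The "touch" trick would just look something like this on the FTP server (path and file pattern are hypothetical):

```bash
# bump the modification times so a timestamp-based listing treats the files as new
touch /srv/ftp/outbound/*.csv
```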
05-04-2019
06:46 PM
The best answer to "will it work?" is "what did your testing show?". Give it a try and let us all know. I'm guessing you just have some old legacy code you can't change? I'm also guessing it will work! So... to be more precise, the 2.6.5 release notes component page, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_release-notes/content/comp_versions.html, says Hadoop 2.7.3 is being used. The Apache Hadoop page for MRv1 compatibility with MRv2, https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html, suggests you are probably in good shape running previously compiled MRv1 apps on MRv2, but doesn't quite guarantee you'll be so lucky if you need to recompile them against MRv2. Good luck and happy Hadooping!
05-04-2019
06:39 PM
I believe your biggest problem is that you are trying to use the HDP Sandbox for something of any decent size. That environment wasn't built for running 100s of GB of data (which itself is surely not all that "big" as Big Data goes). The Sandbox also has a bunch of configuration settings focused on running a pseudo-cluster (all on one box), which is NOT ideal for any job of real size. You did go down the right path of changing the max amount of memory that YARN can use, but at the end of the day, your box only has two CPUs and you really can't run that many containers anyway. You'd probably also need to change the size of the Tez containers for Hive/Tez to ask for more than whatever the tiny Sandbox configuration is granting you. I don't know the costing model, but I'm betting 4 boxes with 16GB each would be cheaper than this 64GB one you are using now, and that would allow you to spread the workload across multiple machines (and yes, you'd have to install HDP via Ambari, but the http://docs.hortonworks.com site can help a LOT). Good luck and happy Hadooping!
05-04-2019
06:29 PM
As blueprints focus heavily on the host_groups concept, which usually means some highly specialized master setups and then a more generic worker model, I feel that using blueprints beyond the initial cluster layout really works best when you are adding more workers. My INITIAL recommendation (I'm arm-chair QBing this w/o all the details) would be to simply go to Ambari's UI, add the new host via the wizard process, and then assign whatever master and worker processes you need.
05-04-2019
06:21 PM
Could you provide some additional details? Screenshots of the Ambari dashboard, HDFS service page, host list, etc.?
05-01-2019
03:50 PM
It isn't clear what triggers this task to run. Do you have any additional info on that, or know if there is any way to configure it more precisely?
04-16-2019
02:52 PM
I need to play with the S3 processors a bit to be more helpful, but I'm wondering if there is any issue with getting these files in a NiFi cluster, and whether you should be marking the pull processor as an "isolated processor" that runs only on the "primary node", as Brian describes in his answer to https://community.hortonworks.com/questions/99328/nifi-isolated-processors-in-a-clustered.html. Worth giving that a try first to see if it's part of the problem.
04-14-2019
12:06 AM
Not sure how you got into this shape, but the balancer can fix it. https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
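Running it is basically a one-liner (the threshold is how far, in percentage points, any DataNode's utilization may drift from the cluster average; 10 is a common starting point):

```bash
# rebalance block placement until every DataNode is within 10 percentage
# points of the cluster's average utilization
hdfs balancer -threshold 10
```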
04-12-2019
02:26 PM
Yes, the collect() ACTION does require everything to come back to the driver, but there is a better ACTION for you. Try using foreach(), which is like an RDD's map() function in that it works on each partition of the underlying dataset independently of the other partitions (so you can run it wide!!). It returns nothing, which is probably what you want it to do anyway. Good luck and happy Sparking!
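A minimal spark-shell sketch of the difference (the RDD here is just made-up sample data):

```scala
val rdd = spark.sparkContext.parallelize(1 to 1000000)

// collect() drags every element back to the driver before doing anything with it:
// rdd.collect().foreach(println)

// foreach() runs the work out on the executors, partition by partition,
// and returns nothing to the driver
rdd.foreach(x => println(x))
```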
03-18-2019
02:55 PM
You just need to point the LOCATION clause of your EXTERNAL TABLE's DDL at your /FLIGHT folder. Hive will crawl all the subfolders. You might also consider using PARTITIONED BY with a single date-like partition column instead of having separate folders for year, month, and day. That lets you do things like WHERE my_partition_col > '19991115' AND my_partition_col < '20010215', which would be much tougher if you partition by specific year, month, and day values.
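If you did lay the folders out as flight_date=YYYYMMDD partitions, the DDL could look something like this (column names are invented for illustration):

```sql
-- external table wrapped around the existing /FLIGHT data
CREATE EXTERNAL TABLE flights (
  carrier    STRING,
  flight_num STRING,
  dep_delay  INT
)
PARTITIONED BY (flight_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/FLIGHT';

-- register any partition folders already sitting under /FLIGHT
MSCK REPAIR TABLE flights;

-- range predicates on the single partition column are then simple
SELECT carrier, AVG(dep_delay)
FROM   flights
WHERE  flight_date > '19991115' AND flight_date < '20010215'
GROUP BY carrier;
```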
03-06-2019
11:51 PM
Pro: cheap. Con: not scalable. I know... you already knew that! 😉 Also, I'm not trying to be a smart @$$, but I do mean what I say at https://twitter.com/LesterMartinATL/status/504004795557236736 about these tiny clusters (yes, 3 worker nodes is a tiny cluster). A 3-node cluster might make sense for some problems, but you'll never be able to do anything at scale on it, and it surely won't perform well for anything at the edge of what it can process.

Regarding your 3TB capacity: does that mean each node has 3TB dedicated to itself (raw 9TB, but effectively 3TB with a replication factor of 3), or that each box has 1TB? I'm asking because for things like Terasort we also have to consider the job's intermediate data (i.e. the info moving from the mappers to the reducers, which in this case will be 500GB itself) as well as the final output back on HDFS; yes, another 500GB. The intermediate data isn't stored on HDFS, but if the input and output are, then that's 1TB (in + out), and really 3TB all by themselves if both are set with a replication factor of 3. Even with a replication factor of 1, this all smells problematic to me on this small cluster. If you only have 1TB of disk on each node, then this will surely never run, as just mentioned.

Even if space weren't an issue, you'd need to run something like 3900 mappers just to process that (if my math of dividing 500GB by a 128MB block size is right), plus a shed-load of reducers, and that would take forever on three nodes. It has been many years since I was regularly running Terasort, but a very old heuristic of mine was a max of 30 minutes on a 10-worker-node cluster comprised of boxes with 128GB of RAM and 10-12 disks.

Clearly there are many variables at hand, and I'm not sure looking at your specific output would immediately shed light on the exact problem, but what I would recommend is to start small and scale up. Run a 500MB gen and sort. Then double that to 1GB, then 2GB, 4GB, and so on, making sure each run's results make sense compared with the last, and I believe you'll see a good pattern. Eventually this will all run out of horsepower (aka nodes), but it will give you a better benchmark. That is, the last good-sized run will give you a number that should be cut roughly in half when you double the number of worker nodes! Good luck and happy Hadooping!
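The "start small and double it" loop could look roughly like this (the examples jar path below is the usual HDP location, so adjust for your install; teragen's first argument is the number of 100-byte rows, so 5,000,000 rows is about 500MB):

```bash
EXAMPLES_JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar

# ~500MB run
hadoop jar $EXAMPLES_JAR teragen  5000000 /benchmarks/tera/in-500mb
hadoop jar $EXAMPLES_JAR terasort /benchmarks/tera/in-500mb /benchmarks/tera/out-500mb

# then double the row count (~1GB, ~2GB, ~4GB, ...) and compare elapsed times run over run
hadoop jar $EXAMPLES_JAR teragen 10000000 /benchmarks/tera/in-1gb
hadoop jar $EXAMPLES_JAR terasort /benchmarks/tera/in-1gb /benchmarks/tera/out-1gb
```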
03-06-2019
11:26 PM
While I'm doubtful these three directories are the very best answer to this problem, the old "three directories for the NN metadata" guidance came about long before a solid HA solution was available and, as https://twitter.com/LesterMartinATL/status/527340416002453504 points out, it was (and actually still is) all about disaster recovery. The old adage was to configure the NN to write to three different disks (via the directories) -- two local and one off the box, such as a remote mount point. Why? Well... as you know, that darn metadata holds the keys to the whole file system, and if it ever gets lost then ALL of your data is non-recoverable!! I personally think this is still valuable even with HA, as the JournalNodes are focused on the edits files and do a great job of keeping that information on multiple machines, but the checkpoint image files only exist on the two NN nodes in an HA configuration and, well... I just like to sleep better at night. Good luck and happy Hadooping!
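In hdfs-site.xml terms, that old adage looks something like this (the paths are hypothetical; two local disks plus one off-box mount):

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/disk1/hdfs/namenode,/data/disk2/hdfs/namenode,/mnt/remote-nfs/hdfs/namenode</value>
</property>
```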
03-06-2019
11:15 PM
Welcome to Phoenix... where the cardinal rule is: if you are going to use Phoenix for a table, then don't look at it or use it directly from the HBase API. What you are seeing is pretty normal. I don't see your DDL, but I'll give you an example to compare against. Check out the DDL at https://github.com/apache/phoenix/blob/master/examples/WEB_STAT.sql and focus on the CORE column, which is a BIGINT, and the ACTIVE_VISITOR column, which is an INTEGER. Here's the data that gets loaded into it: https://github.com/apache/phoenix/blob/master/examples/WEB_STAT.csv. Here's what it looks like via Phoenix... Here's what it looks like through the HBase shell (using the API)... Notice the CORE and ACTIVE_VISITOR values looking a lot like your example? Yep, welcome to Phoenix. Remember, access Phoenix tables only through Phoenix and you'll be all right. 🙂 Good luck and happy Hadooping/HBasing!
03-06-2019
11:01 PM
If the compressed file contained just a single file, the Pig approach shown in https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop might have been useful. No matter what you do, you'll have to handle this in a single mapper from whatever data access framework you use, so it won't be a parallelized job, but I understand your desire to save the time and network of pulling from HDFS and then putting the data back once extracted. The Java MapReduce example at http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/ also assumes the compressed file is a single file, but maybe it could be a start for some custom work you might be able to do. Good luck and happy Hadooping!
02-20-2019
08:25 PM
1 Kudo
There are a TON of variables at play here. First up, the "big" dataset isn't really all that big for Hive or Spark, and that will always play into the variables. My *hunch* (just a hunch) is that your Hive query from beeline is able to use an existing session and gets access to as many containers as it would like. Conversely, Zeppelin may have a SparkContext with a smaller number of executors than your Hive query can get access to. Of course, the "flaw in my slaw" is that these datasets are relatively small anyway. Spark's "100x improvement" line is always about iterative (aka ML/AI) processing, but for traditional querying and data pipelining, Spark runs faster when there are a bunch of tasks (mappers and reducers) to run and it can transition between them in milliseconds within the pre-allocated executor containers, instead of the seconds Hive has to burn talking to YARN's RM to get the needed containers. I realize this isn't as much an answer as an opinion piece, now that I review it before hitting "post answer". 🙂 Either way, good luck and happy Hadooping/Sparking!
02-20-2019
08:10 PM
With Hive 3 pushing hard toward fully managed tables in native file formats as transactional tables (see https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/managing-hive/content/hive_acid_operations.html for more info), this "direct from Spark to Hive" approach will get much harder due to the underlying "delta files" that get created when data is added/modified/removed in a Hive table. The Spark LLAP connector will aid in this integration. That said, historically, the better answer is often to simply save your DF from Spark to HDFS, wrap it with an external Hive table, and then do an INSERT INTO your existing Hive table with a SELECT * FROM your new external table. This lets Hive do all the heavy lifting and file conversions as needed, and it takes care of any partitioning and/or bucketing that you have in place. Good luck and happy Hadooping!
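A rough sketch of that pattern, with the table names, columns, and staging path all made up. Assume the DataFrame was first written from Spark to HDFS with something like df.write.format("orc").save("/tmp/staging/new_events"):

```sql
-- wrap the Spark output with an external table
CREATE EXTERNAL TABLE new_events_staging (id BIGINT, payload STRING, event_date STRING)
STORED AS ORC
LOCATION '/tmp/staging/new_events';

-- may be needed if the target table is dynamically partitioned
SET hive.exec.dynamic.partition.mode = nonstrict;

-- let Hive handle the ACID bookkeeping, file conversion, and partitioning
INSERT INTO TABLE events PARTITION (event_date)
SELECT id, payload, event_date
FROM   new_events_staging;
```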
02-20-2019
07:58 PM
1 Kudo
That pseudo-cluster itself is a scalability bottleneck. 😉 Storm likes to scale by having many Supervisor (worker) processes. As for the specific stats on your component, you can drill into your topology in the Storm UI, then drill into your bolt to see how it is doing and gauge for yourself whether it is working well enough. You'll get rolling statistics of how long it takes to process things, such as shown below. You could also scale up the number of instances the bolt has, but again, the single-server pseudo-cluster is likely going to be your first bottleneck.
02-20-2019
07:44 PM
On the legacy Hortonworks Professional Services team, we would call this an Ambari Takeover. I'm not sure if there is a formal documented procedure available on the web, but our support and/or consulting teams could help with this. Here's an article I found with a few seconds of googling about this concept in general: http://www.adaltas.com/en/2018/11/15/hadoop-cluster-takeover-with-apache-ambari/. Good luck & happy Hadooping!
02-20-2019
07:35 PM
1 Kudo
I have never used it to do both at the same time and https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_data-access/content/using_sqoop_to_move_data_into_hive.html says "HDFS or Hive". Good luck and happy Hadooping!
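For reference, the two targets look like this in a sqoop import (the connection details, table, and paths below are all hypothetical):

```bash
# land the table as files in HDFS only
sqoop import \
  --connect jdbc:mysql://db-host/sales --username etl -P \
  --table orders \
  --target-dir /data/raw/orders

# or import it straight into a Hive table instead
sqoop import \
  --connect jdbc:mysql://db-host/sales --username etl -P \
  --table orders \
  --hive-import --hive-table staging.orders
```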
02-20-2019
07:24 PM
Were the install instructions at https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.3.1/installing-upgrading-hdf.html not useful? Also, just a callout that NiFi itself is a masterless cluster configuration. I'm assuming that maybe you want to use the "master" for something like Ambari only?