Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4713 | 06-03-2019 09:31 PM |
| | 992 | 05-22-2019 02:38 AM |
| | 1341 | 05-22-2019 02:21 AM |
| | 783 | 05-04-2019 08:17 PM |
| | 1012 | 04-14-2019 12:06 AM |
06-04-2019 01:06 PM
You could try issuing a single INSERT INTO statement per partition and running as many of them simultaneously as your cluster has resources for.
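For illustration, here's a minimal sketch of that approach, assuming a source table `src` and a target table `tgt` partitioned by a `dt` column (all hypothetical names), with each partition's INSERT launched as its own beeline session:

```shell
# Hypothetical tables: tgt is partitioned by dt, src holds the raw rows.
# Backgrounding each beeline call (&) runs the per-partition INSERTs in
# parallel; YARN will queue whatever it can't schedule right away.
for dt in 2019-06-01 2019-06-02 2019-06-03; do
  beeline -u "jdbc:hive2://hiveserver:10000" -e \
    "INSERT INTO tgt PARTITION (dt='${dt}')
     SELECT col1, col2 FROM src WHERE dt='${dt}';" &
done
wait   # block until every background INSERT has finished
```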
06-03-2019 09:31 PM
Yep, create a new table defined with the partitioning you want, then insert into it using dynamic partitioning, and you'll be good to go. Good luck and happy Hadooping.
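A minimal sketch of that, assuming the old table is `events` and the repartitioned one is `events_by_day` (both hypothetical names):

```shell
beeline -u "jdbc:hive2://hiveserver:10000" -e "
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  CREATE TABLE events_by_day (id BIGINT, payload STRING)
    PARTITIONED BY (dt STRING);
  -- with dynamic partitioning, the partition column goes last in the SELECT
  INSERT INTO TABLE events_by_day PARTITION (dt)
    SELECT id, payload, dt FROM events;"
```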
05-22-2019 02:38 AM
I would strongly suggest you look at HBase's snapshot model as detailed at https://hbase.apache.org/book.html#ops.snapshots. Snapshot creation is very fast because it does NOT copy the underlying HFiles on HDFS (it just keeps HDFS "pointers" to them). You can then use the ExportSnapshot process to copy the needed HFiles over to the second HBase cluster. This model won't use any extra space on the source cluster (well, delete the snapshot once you are done!), and the space consumed on the target cluster is unavoidable: you'd have to create all of those HFiles there anyway, which is exactly what this process does. Good luck and happy HBasing!
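A minimal end-to-end sketch (the table, snapshot, and cluster names are all hypothetical):

```shell
# 1. Take the (near-instant) snapshot on the source cluster.
echo "snapshot 'my_table', 'my_table_snap'" | hbase shell

# 2. Copy the snapshot's HFiles over to the target cluster's HBase root.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot my_table_snap \
  -copy-to hdfs://target-nn:8020/hbase \
  -mappers 16

# 3. On the target cluster, materialize the snapshot as a table.
echo "clone_snapshot 'my_table_snap', 'my_table'" | hbase shell

# 4. Back on the source, drop the snapshot so it stops pinning old HFiles.
echo "delete_snapshot 'my_table_snap'" | hbase shell
```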
05-22-2019 02:26 AM
Are you trying to replace the functionality of creating the host/date paths and files, or are you trying to get NiFi to recursively walk the growing directories to get at the underlying syslog.log files?
05-22-2019 02:21 AM
1 Kudo
As @gdeleon suggested... "that dog won't hunt". Basically, each concurrent Hive user/query needs at least two YARN containers: one to house the ApplicationMaster and at least one more to start doing some actual work (getting that ApplicationMaster container is what moves an application into the "Running" state). The "Accepted" state means YARN has taken the application but doesn't yet have enough room to grant it the containers it needs. Again, it just isn't designed for this. A better solution would be to give each student their own HDP Sandbox (and then you won't need to allocate 32GB VMs). Good luck and happy Hadooping!
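If you want to confirm this from the command line, a quick sanity check with the standard yarn CLI:

```shell
# Applications that were accepted but are still waiting on containers:
yarn application -list -appStates ACCEPTED

# Per-node view of how many containers are already running:
yarn node -list
```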
05-06-2019 08:53 PM
Hey @Matt Clarke, if there is a better way to do this w/o RPG as you suggested in your answer over in https://community.hortonworks.com/questions/245373/nifi-cluster-listensmtp.html, would you have time to update this article to account for that? I point folks to this link all the time. Thanks!
05-04-2019 08:17 PM
Probably for using HCatalog, which can be extremely useful for Pig programmers even if they don't want to use Hive itself and just leverage it for schema management instead of defining AS clauses in their LOAD commands? Just as likely, it's something hard-coded into Ambari. If you really don't want Hive, I bet you can just delete it after installation. For giggles, I stood up an HDFS-only HDP 3.1.0 cluster for https://community.hortonworks.com/questions/245432/is-it-possible-to-install-only-hdfs-on-linux-machi.html?childToView=245544#answer-245544 and then added Pig; it required YARN, MR, Tez & ZK (which makes sense!) but did NOT require Hive to be added, as seen below. Good luck and happy Hadooping!
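To make the HCatalog upside concrete, a minimal sketch (the `web_logs` table and paths are hypothetical):

```shell
# Without HCatalog, the schema is spelled out by hand in an AS clause:
pig -e "logs = LOAD '/data/web_logs' USING PigStorage('\t')
          AS (host:chararray, ts:long, url:chararray);
        DUMP logs;"

# With HCatalog, the schema comes straight from the metastore:
pig -useHCatalog -e "logs = LOAD 'web_logs'
          USING org.apache.hive.hcatalog.pig.HCatLoader();
        DUMP logs;"
```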
05-04-2019 07:50 PM
Sounds like this should run as an isolated processor, configured to execute on the Primary Node only instead of All Nodes. Then, to take full advantage of both of your NiFi nodes, you'll want to create a Remote Process Group that points back at your own cluster, much as explained in https://community.hortonworks.com/articles/97773/how-to-retrieve-files-from-a-sftp-server-using-nif.html. Good luck and happy Flowfiling!
05-04-2019 06:52 PM
I didn't see such a property when I looked at http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ListFTP/index.html, but a couple of quick workarounds could jumpstart this: simply replace the processor with a new one (which will have its own state management), or, if you are using "timestamps" for the "Listing Strategy" property, run a Linux "touch" command on the files on the FTP server, which should trick the processor into grabbing them again. Good luck and happy Flowfiling!
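A minimal sketch of the timestamp trick, run on the FTP server itself (the path is hypothetical):

```shell
# Bump the modification time on every file so a "timestamps"-based
# ListFTP listing sees them as new and emits them again.
touch /srv/ftp/landing/*
```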
05-04-2019 06:39 PM
I believe your biggest problem is that you are trying to use the HDP Sandbox for something of any decent size. That environment wasn't necessarily built to run datasets of hundreds of GB (which itself is surely not all that "big" as Big Data goes). The Sandbox also has a bunch of configuration settings tuned for a pseudo-cluster (everything on one box), which is NOT ideal for any job of real size. You did go down the right path by raising the maximum amount of memory YARN can use, but at the end of the day your box only has two CPUs, so you can't run that many containers anyway. You'd probably also need to raise the size of the Tez containers for Hive/Tez to ask for more than whatever tiny allocation the Sandbox configuration grants you. I don't know the costing model, but I'm betting four boxes with 16GB each would be cheaper than the 64GB one you are using now, and that would let you spread the workload across multiple machines (yes, you'd have to install HDP via Ambari, but the http://docs.hortonworks.com site can help a LOT). Good luck and happy Hadooping!
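As a starting point for the Tez sizing, a minimal sketch (the values and table name are illustrative, not recommendations for this particular box):

```shell
beeline -u "jdbc:hive2://sandbox:10000" -e "
  SET hive.tez.container.size=4096;    -- MB per Tez task container
  SET tez.am.resource.memory.mb=2048;  -- MB for the Tez ApplicationMaster
  SELECT COUNT(*) FROM my_big_table;   -- hypothetical query to test with"
```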