Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 7086 | 06-03-2019 09:31 PM |
 | 1715 | 05-22-2019 02:38 AM |
 | 2165 | 05-22-2019 02:21 AM |
 | 1347 | 05-04-2019 08:17 PM |
 | 1660 | 04-14-2019 12:06 AM |
03-18-2019 02:55 PM
You just need to align the LOCATION clause of your EXTERNAL TABLE's DDL to point to your /FLIGHT folder; Hive will crawl all the subfolders. You might also consider using PARTITIONED BY with a single date-style column instead of having separate folders for year, month, and day. That lets you do things like WHERE my_partition_col > '19991115' AND my_partition_col < '20010215', which would be much tougher if you partition by specific year, month, and day values.
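As a rough sketch of both ideas (the table name, columns, delimiter, and the flight_date=YYYYMMDD folder convention are all placeholder assumptions, not taken from your cluster):

```sql
-- Hypothetical external table layered over the existing /FLIGHT data, partitioned
-- by a single yyyymmdd-style string column rather than year/month/day folders.
CREATE EXTERNAL TABLE IF NOT EXISTS flights (
  carrier   STRING,
  origin    STRING,
  dest      STRING,
  dep_delay INT
)
PARTITIONED BY (flight_date STRING)              -- e.g. '19991115'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/FLIGHT';

-- If the subfolders follow the flight_date=YYYYMMDD naming convention, this
-- registers them all as partitions in one shot.
MSCK REPAIR TABLE flights;

-- Range predicates on the partition column then prune partitions directly.
SELECT carrier, AVG(dep_delay) AS avg_delay
FROM flights
WHERE flight_date > '19991115' AND flight_date < '20010215'
GROUP BY carrier;
```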
03-06-2019 11:26 PM
While I'm doubtful these three directories are the very best answer to this problem, the old "three directories for the NN metadata" advice came about long before a solid HA solution was available and, as https://twitter.com/LesterMartinATL/status/527340416002453504 points out, it was (and actually still is) all about disaster recovery. The old adage was to configure the NN to write to three different disks (via the directories) -- two local and one off the box, such as a remote mount point. Why? Well... as you know, that darn metadata is the key to the whole file system, and if it ever gets lost then ALL of your data is unrecoverable!! I personally think this is still valuable even with HA, as the JournalNodes are focused on the edits files and do a great job of keeping that information on multiple machines, but the checkpoint image files only exist on the two NN nodes in an HA configuration and, well... I just like to sleep better at night. Good luck and happy Hadooping!
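For reference, a minimal hdfs-site.xml sketch of that layout; the mount-point paths below are placeholders, two on local disks and one on a remote mount:

```xml
<!-- dfs.namenode.name.dir takes a comma-separated list; the NameNode writes a full
     copy of its metadata to every directory listed. Paths are examples only. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode,/mnt/remote/hdfs/namenode</value>
</property>
```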
03-06-2019 11:15 PM
Welcome to Phoenix... where the cardinal rule is: if you are going to use Phoenix, then for that table, don't look at it or use it directly from the HBase API. What you are seeing is pretty normal. I don't see your DDL, but I'll give you an example to compare against. Check out the DDL at https://github.com/apache/phoenix/blob/master/examples/WEB_STAT.sql and focus on the CORE column, which is a BIGINT, and the ACTIVE_VISITOR column, which is an INTEGER. Here's the data that gets loaded into it: https://github.com/apache/phoenix/blob/master/examples/WEB_STAT.csv. Query it through Phoenix and then scan the same table through the HBase shell (i.e., the API) and compare. Notice the CORE and ACTIVE_VISITOR values looking a lot like your example? Yep, welcome to Phoenix. Remember, use Phoenix only for Phoenix tables and you'll be all right. 🙂 Good luck and happy Hadooping/HBasing!
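A quick way to reproduce that comparison against the WEB_STAT example (the byte values described in the comments are illustrative, not copied from a real scan):

```sql
-- Through Phoenix (e.g. sqlline.py) the numeric columns come back as plain integers:
SELECT CORE, ACTIVE_VISITOR FROM WEB_STAT LIMIT 3;

-- Through the HBase shell, the very same cells show Phoenix's serialized encoding --
-- a BIGINT is stored as 8 sign-flipped big-endian bytes, so a small value renders as
-- something like \x80\x00\x00\x00\x00\x00\x00#, which is why it looks like gibberish
-- to the raw HBase API:
--   hbase> scan 'WEB_STAT', {LIMIT => 1}
```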
03-06-2019 11:01 PM
If the compressed file contained just one file, the Pig approach shown in https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop might have been useful. No matter what you do, you'll have to do this in a single mapper in whatever data access framework you use, so it won't be a parallelized job, but I understand your desire to save the time and network of pulling from HDFS and then putting back once extracted. The Java Map/Reduce example at http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/ also assumes the compressed file is a single file, but maybe it could be a start for some custom work you might be able to do. Good luck and happy Hadooping!
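If it really is a single gzipped file, one low-tech option is to stream it through the client without ever landing it on local disk (the paths are placeholders; the bytes still travel through the machine running the command, it just isn't a MapReduce job):

```bash
# Stream the .gz out of HDFS, decompress on the fly, and write the result straight
# back into HDFS; `hdfs dfs -put -` reads from stdin. Paths are examples only.
hdfs dfs -cat /data/incoming/archive.gz | gunzip | hdfs dfs -put - /data/extracted/archive.txt
```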
02-20-2019 08:25 PM
1 Kudo
There are a TON of variables at play here. First up, the "big" dataset isn't really all that big for Hive or Spark, and that will always play into the variables. My *hunch* (just a hunch) is that your Hive query from beeline is able to use an existing session and is able to get access to as many containers as it would like. Conversely, Zeppelin may have a SparkContext that has a smaller number of executors than your Hive query can get access to. Of course, the "flaw in my slaw" is that these datasets are relatively small anyway. Spark's "100x improvement" line is always about iterative (aka ML/AI) processing; for traditional querying and data pipelining, Spark runs faster when there are a bunch of tasks (mappers and reducers) that need to run, because it can transition between them in milliseconds within its pre-allocated executor containers instead of the seconds Hive has to burn talking to YARN's RM to get the needed containers. I realize this isn't so much the answer you were looking for as an opinion piece, now that I review it before hitting "post answer". 🙂 Either way, good luck and happy Hadooping/Sparking!
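If you want to test the executor hunch, you can ask the Zeppelin SparkContext directly (a quick check, assuming a Spark 2.x interpreter where `sc` is already bound in a %spark paragraph):

```scala
// How many executors were requested (if explicitly configured) vs. how many are
// registered with this context right now.
println(sc.getConf.getOption("spark.executor.instances"))  // requested via config, if set
println(sc.statusTracker.getExecutorInfos.length)          // currently registered
```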
02-20-2019 07:04 PM
Looks like @Bryan Bende already answered this over on https://stackoverflow.com/questions/54791414/how-i-can-use-hbase-2-0-4-version-in-nifi
01-28-2019 09:10 PM
You can find them at https://github.com/HortonworksUniversity/DevPH_Labs
12-01-2018 07:31 PM
1 Kudo
If their times are in sync now, I'm not sure of any inherent problems that would prevent you from starting them back up. Good luck & happy Hadooping!
12-01-2018 07:05 PM
See similar question at https://community.hortonworks.com/questions/47798/hbase-graphical-client.html for some ideas. Good luck & happy Hadooping!
04-27-2018 11:12 AM
Surely NOT the same issue, but along this line of buggy behavior in the HDP Sandbox (2.6.0.3), when using Hive and getting messages mentioning the hostnames sandbox and sandbox.hortonworks.com, I got this message a few times:

FAILED: SemanticException Unable to determine if hdfs://sandbox.hortonworks.com:8020/user/root/salarydata is encrypted: java.lang.IllegalArgumentException: Wrong FS: hdfs://sandbox.hortonworks.com:8020/user/root/salarydata, expected: hdfs://sandbox:8020

It seems to go away if I just exit the SSH connection and establish it again.
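That "Wrong FS ... expected" pair suggests the client session and the cluster disagree on fs.defaultFS (short hostname vs. fully qualified name). A quick way to see which default filesystem the current shell's client configuration resolves to (the values in the comment are just the expected shape):

```bash
# Prints the fs.defaultFS the current client configuration resolves to,
# e.g. hdfs://sandbox:8020 vs. hdfs://sandbox.hortonworks.com:8020.
hdfs getconf -confKey fs.defaultFS
```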