Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions
Views | Posted
---|---
6998 | 06-03-2019 09:31 PM
1672 | 05-22-2019 02:38 AM
2123 | 05-22-2019 02:21 AM
1321 | 05-04-2019 08:17 PM
1628 | 04-14-2019 12:06 AM
03-18-2019
02:55 PM
You just need to align the LOCATION clause of your EXTERNAL TABLE's DDL to point at your /FLIGHT folder; Hive will crawl all the subfolders. You might also consider using PARTITIONED BY instead of having separate folders for year, month, and day. That lets you do things like WHERE my_partition_col > '19991115' AND my_partition_col < '20010215', which would be much tougher if you partition by specific year, month, and day values.
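Here's a minimal sketch of that idea; the table name, columns, and paths are made-up placeholders, so adjust them to your own layout:

```sql
-- Hypothetical external table rooted at the top-level /FLIGHT folder,
-- partitioned by a single yyyymmdd-style string column.
CREATE EXTERNAL TABLE flights (
  carrier    STRING,
  flight_num STRING,
  dep_delay  INT
)
PARTITIONED BY (my_partition_col STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/FLIGHT';

-- Map an existing dated subfolder to one partition value.
ALTER TABLE flights ADD PARTITION (my_partition_col = '19991115')
  LOCATION '/FLIGHT/1999/11/15';

-- Range predicates across year boundaries stay simple.
SELECT COUNT(*)
FROM flights
WHERE my_partition_col > '19991115'
  AND my_partition_col < '20010215';
```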
03-06-2019
11:26 PM
While I'm doubtful these three directories are the very best answer to this problem, the old "three directories for the NN metadata" guidance came about long before a solid HA solution was available, and as https://twitter.com/LesterMartinATL/status/527340416002453504 points out, it was (and actually still is) all about disaster recovery. The old adage was to configure the NN to write to three different disks (via the directories) -- two local and one off the box, such as a remote mount point. Why? Well... as you know, that darn metadata is the key to the whole file system, and if it ever gets lost then ALL of your data is unrecoverable!! I personally think this is still valuable even with HA, as the JournalNodes are focused on the edits files and do a great job of keeping that information on multiple machines, but the checkpoint image files only exist on the two NN nodes in an HA configuration and, well... I just like to sleep better at night. Good luck and happy Hadooping!
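For reference, that "three different disks" setup is just a comma-separated list of directories in hdfs-site.xml; the paths below are made-up examples, with the third sitting on a remote mount:

```xml
<!-- hdfs-site.xml: the NameNode writes a full copy of its metadata to
     every directory listed here (example paths only). -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/disk1/hdfs/namenode,/data/disk2/hdfs/namenode,/mnt/remote/hdfs/namenode</value>
</property>
```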
03-06-2019
11:15 PM
Welcome to Phoenix... where the cardinal rule is that if you are going to use Phoenix for a table, then don't look at or use that table directly from the HBase API. What you are seeing is pretty normal. I don't see your DDL, but I'll give you an example to compare against. Check out the DDL at https://github.com/apache/phoenix/blob/master/examples/WEB_STAT.sql and focus on the CORE column, which is a BIGINT, and the ACTIVE_VISITOR column, which is an INTEGER. Here's the data that gets loaded into it: https://github.com/apache/phoenix/blob/master/examples/WEB_STAT.csv. Queried via Phoenix, CORE and ACTIVE_VISITOR come back as plain numbers; scanned through the HBase shell (using the API), the same cells show up as Phoenix-encoded binary values. Notice those values looking a lot like your example? Yep, welcome to Phoenix. Remember, use Phoenix only for Phoenix tables and you'll be all right. 🙂 Good luck and happy Hadooping/HBasing!
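As a quick illustration (the column names come from the linked WEB_STAT example; the query itself is just a sketch), the Phoenix side is an ordinary SQL query:

```sql
-- Phoenix decodes its own serialization, so numeric columns such as
-- CORE (BIGINT) and ACTIVE_VISITOR (INTEGER) read back as normal numbers.
-- A raw HBase shell scan of the same table shows those cells as
-- Phoenix-encoded byte arrays instead of readable values.
SELECT HOST, DOMAIN, CORE, ACTIVE_VISITOR
FROM WEB_STAT
LIMIT 5;
```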
03-06-2019
11:01 PM
If the compressed file contained just one file, the Pig approach shown in https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop might have been useful. No matter what you do, you'll have to handle this in a single mapper from whatever data access framework you use, so it won't be a parallelized job, but I understand your desire to save the time and network traffic of pulling the file out of HDFS and putting it back once extracted. The Java Map/Reduce example at http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/ also assumes the compressed file is a single file, but maybe it could be a start for some custom work of your own. Good luck and happy Hadooping!
02-20-2019
08:25 PM
1 Kudo
There are a TON of variables at play here. First up, the "big" dataset isn't really all that big for Hive or Spark, and that will always play into the variables. My *hunch* (just a hunch) is that your Hive query from beeline is able to use an existing session and gets access to as many containers as it would like. Conversely, Zeppelin may have a SparkContext with a smaller number of executors than your Hive query can get access to. Of course, the "flaw in my slaw" is that these datasets are relatively small anyway. Spark's "100x improvement" line is always about iterative (aka ML/AI) processing; for traditional querying and data pipelining, Spark runs faster when there are a bunch of tasks (mappers and reducers) that need to run and it can transition between them in milliseconds within its pre-allocated executor containers, instead of the seconds Hive has to burn talking to YARN's RM to get the needed containers. I realize this isn't so much the answer you were looking for as it is an opinion piece, now that I review it before hitting "post answer". 🙂 Either way, good luck and happy Hadooping/Sparking!
02-20-2019
07:04 PM
Looks like @Bryan Bende already answered this over on https://stackoverflow.com/questions/54791414/how-i-can-use-hbase-2-0-4-version-in-nifi
01-28-2019
09:10 PM
You can find them at https://github.com/HortonworksUniversity/DevPH_Labs
12-01-2018
07:31 PM
1 Kudo
If their times are in sync now, I'm not sure of any inherent problems that would prevent you from starting them back up. Good luck & happy Hadooping!
12-01-2018
07:05 PM
See similar question at https://community.hortonworks.com/questions/47798/hbase-graphical-client.html for some ideas. Good luck & happy Hadooping!
04-27-2018
11:12 AM
Surely NOT the same issue, but along this line of buggy behavior in the HDP Sandbox (2.6.0.3), where using Hive produces messages that mention both the hostnames sandbox and sandbox.hortonworks.com, I got this message a few times:

FAILED: SemanticException Unable to determine if hdfs://sandbox.hortonworks.com:8020/user/root/salarydata is encrypted: java.lang.IllegalArgumentException: Wrong FS: hdfs://sandbox.hortonworks.com:8020/user/root/salarydata, expected: hdfs://sandbox:8020

It seems to go away if I just exit the SSH connection and establish it again.