Member since: 05-02-2019
Posts: 319
Kudos Received: 144
Solutions: 58
My Accepted Solutions
Views | Posted
---|---
3720 | 06-03-2019 09:31 PM
763 | 05-22-2019 02:38 AM
1086 | 05-22-2019 02:21 AM
618 | 05-04-2019 08:17 PM
802 | 04-14-2019 12:06 AM
08-09-2017
06:19 PM
The Essentials course is also offered in a self-paced "online" format for free; info at http://public.hortonworksuniversity.com/hdp-overview-apache-hadoop-essentials-self-paced-training.
07-31-2017
12:34 PM
Yep, this could work, but for a big cluster I could imagine it being time-consuming. The initial recursive listing (especially since it goes all the way down to the file level) could be quite large for a file system of any size, and the more time-consuming effort would be running the "hdfs dfs -count" command over and over and over. But, like you said, this should work. Ideally, I'd want the NameNode to just offer a "show me all quota details" command, or at least "show me all directories with quotas". Since this function is not present, maybe there is a performance cost for the NameNode to determine this quickly that I'm not considering, as it seems lightweight to me. Thanks for your suggestion.
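For what it's worth, here is a rough shell sketch of that iterate-and-check idea (untested; it assumes paths contain no spaces, and the starting path of / should be narrowed as needed):
# Walk the namespace for directories, run the quota check on each, and keep only
# the lines where a name quota or space quota is actually set (not "none").
hdfs dfs -ls -R / | awk '$1 ~ /^d/ {print $NF}' | while read -r dir; do
  hdfs dfs -count -q "$dir"
done | awk '$1 != "none" || $3 != "none"'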
07-31-2017
09:20 AM
1 Kudo
The HDFS Quota Guide (http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html) shows how to list the details of a quota on a specific directory, but is there a way to see all quotas with one command (or at least a way to list all directories that have quotas, something like the way you can list all snapshottable dirs, which I could then programmatically iterate through to check individual quotas)? My hunch was that I could just check the / directory and see a roll-up of the two specific quotas shown first, but as expected it only shows the details of that directory's own quota (if it exists).
[hdfs@node1 ~]$ hdfs dfs -count -v -q /user/testeng
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
400 399 none inf 1 0 0 /user/testeng
[hdfs@node1 ~]$ hdfs dfs -count -v -q /user/testmar
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
none inf 134352500 134352500 1 0 0 /user/testmar
[hdfs@node1 ~]$
[hdfs@node1 ~]$
[hdfs@node1 ~]$ hdfs dfs -count -v -q /
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
9223372036854775807 9223372036854775735 none inf 49 23 457221101 /
[hdfs@node1 ~]$
Tags:
- Hadoop Core
- HDFS
Labels:
- Apache Hadoop
07-13-2017
02:08 PM
I'm using Ambari 2.4.2.0 (and Capacity Scheduler Ambari View 1.0.0), which DOES have "Save and Refresh Queues". That's not the problem. What is concerning is that over on the YARN service page, Ambari wants to restart the RMs, as shown in the attached screenshot. That probably doesn't need to be done, BUT it causes ongoing grief for operators who don't want to see all of these warning messages to restart things. Thoughts?
07-12-2017
02:35 PM
Before the nice new Ambari View, we could get away with "Refreshing Capacity Queues" (or some such service-level command), but with the new (and very nice!) Ambari View, even the simplest changes to the queue definitions get represented in Ambari as a need to restart the ResourceManager. Is this a bug? Was this behavior present before (i.e., if we refreshed the queues after editing the simple text box, did Ambari still want to restart the RMs)?
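For reference, a minimal sketch of the pre-View, "refresh only" path done by hand (the config file location is an assumption for an HDP-style layout; on an Ambari-managed cluster Ambari normally owns this file):
# Edit the queue definitions directly, then ask the running ResourceManager
# to reload them without a restart (run as the YARN admin user).
vi /etc/hadoop/conf/capacity-scheduler.xml
yarn rmadmin -refreshQueues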
Labels:
- Apache Ambari
- Apache YARN
07-10-2017
07:05 PM
Gotcha; you're using the Ambari View. It still seems that it is not getting invoked properly. Can you provide a screenshot of the Ambari View, especially the section with the -useHCatalog argument? Did you try it with, and without, the "use Tez" checkbox selected? While this code looks good, it is often a good idea to try the code out from the CLI just to remove one variable (again, the code looks simple and direct enough that I don't think this would provide much value other than showing you it can run).
07-10-2017
01:26 AM
The "yarn jar" warning is nothing to worry about and the output you received suggests that you were unable to launch the script. My guess is your command-line interaction was incorrect. It should have been something like the following. pig -useHCatalog yourscript.pig You can see some examples of this at https://martin.atlassian.net/wiki/x/AgCfB (including running via Tez). If you are doing this, please show the exact command your ran. If running from the Ambari View, be sure to add the -useHCatalog argument as shown in Step 5.4 of https://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/.
07-06-2017
01:33 PM
I think there is always an interest in your approach of doing real-time inserts/updates/deletes into HBase and then fronting that with a Hive table, but I don't believe you will get the kind of performance you are expecting when you start joining that table with first-class Hive tables, not to mention doing any kind of analytical query (OK, any query that doesn't just read based on the rowKey). Not saying that isn't a valid approach, but you'd sure want to do some testing, and even then you might find yourself doing the updates against HBase and periodically dumping that data into something you could use in a more first-class manner with Hive (and then you've lost your real-time updates).

I do agree with the others who have commented on this question about looking at the Hive INSERT/UPDATE/DELETE options as well as the newly supported MERGE command. Plenty of testing will be needed to make sure this is your solution, but it is clearly the most developer-friendly model to chase; significant effort has gone into getting this working thus far, and I expect continued efforts to broaden the scope and reduce the prerequisites.

Regarding Approach #2 and the incremental update blog post from 2014, I invite you to take a look at my materials from my 2015 Summit talk on this topic, https://martin.atlassian.net/wiki/x/GYBzAg, as I think there are a few options to consider if you go down this "classical" data update path (mostly based on the size of data across the table, the percentage of data being changed, the skew of those updates, and how frequently you need to sync up with your source table). Good luck and happy Hadooping!
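For anyone curious what that MERGE option looks like, here is a minimal sketch run through beeline (the JDBC URL, table names, and columns are all hypothetical, and the target must be an ACID/transactional Hive table):
# Hypothetical example: fold a staging table of changes into an ACID target table.
beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
MERGE INTO customer AS t
USING customer_staging AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email);
"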
07-06-2017
01:18 PM
Great question, and unfortunately I don't think there is a well-agreed-upon formula/calculator out there, as "it depends" is so often the rule. Some considerations: the datanode doesn't really know about the directory structure; it just stores (and copies, deletes, etc.) blocks as directed by the namenode (often indirectly, since clients write the actual blocks). Additionally, the block-level checksums are actually stored on disk alongside the files for the data contained in a given block. It looks like there's some good info in the following HCC questions that might be of help to you.
https://community.hortonworks.com/questions/64677/datanode-heapsize-computation.html
https://community.hortonworks.com/questions/45381/do-i-need-to-tune-java-heap-size.html
https://community.hortonworks.com/questions/78981/data-node-heap-size-warning.html
Good luck and happy Hadooping!
06-22-2017
06:37 PM
Instead of "year as (year:int)", try "(int) year as castedYear:int".
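A minimal sketch of that cast in context (the input path, alias, and field names here are just placeholders):
cat > cast_example.pig <<'EOF'
-- Load everything as chararray, then cast explicitly in the FOREACH.
raw   = LOAD '/tmp/rows.csv' USING PigStorage(',') AS (year:chararray, amount:chararray);
typed = FOREACH raw GENERATE (int) year AS castedYear:int, (double) amount AS castedAmount:double;
DESCRIBE typed;
EOF
pig cast_example.pig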
06-14-2017
04:13 PM
Hi Calvin. Before I type anything else, please realize that I do not know YOUR specific use cases, but I doubt anyone will argue with me that there are going to be very few of them that would really make sense to run on a single-node (aka pseudo-distributed) cluster. If all of your data can fit on one machine and run within the constraints of 8 GB of memory, then quite possibly you just don't need Hadoop for that scenario. Additionally, even HDFS cannot do what it is supposed to in a single-node configuration, since it has no additional nodes for replication to occur on.

All that said, the HDP Sandbox is a way to jumpstart your initial hands-on efforts with Hadoop and to provide a playground for our publicly available tutorials and similarly sized and scoped investigative activities you may undertake. A full HDP stack takes many more resources than are typical in a single server with the characteristics of a simple laptop or desktop. The Sandbox team makes MANY configuration adjustments to try to shoehorn the whole stack into a single image. In fact, you'll notice that not all services are running at any given time, which is a pattern I'd recommend (start only what you need for an experiment and stop everything else).

Please do realize we are still talking about "commodity hardware", but we are not talking about "tiny hardware". Most on-prem servers are quite big, and https://community.hortonworks.com/questions/37565/cluster-sizing-calculator.html points you to https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_cluster-planning/content/ch_hardware-recommendations_chapter.html, which makes some suggestions on pilot clusters and full production ones. You'll also read some additional thoughts on the whole "commodity hardware" versus "enterprise data center server" question in that documentation. Good luck and happy Hadooping!
06-07-2017
08:06 PM
Excellent. Truthfully, the case sensitivity is a bit weird in Pig -- kind of like the rules of the English language. Hehe!
06-06-2017
03:25 PM
Regarding the on-demand offerings we have, we do have an HDP Essentials course, but currently it is only available via the larger, bundled Self-Paced Learning Library described at https://hortonworks.com/self-paced-learning-library/. We are working towards offering individual on-demand courses, but we're not there yet. You could also register for it individually via our live (remote, in most cases) delivery options shown at https://hortonworks.com/services/training/class/hadoop-essentials/.
06-04-2017
08:52 PM
I'd raise a separate HCC question for help with that. That way we'll get the targeted audience and your Q's won't be buried within this one that most will read as a cert question. That's a fancy way to say I haven't set that particular version up myself and wouldn't be much help until after I got my hands dirty with it. 😉
06-04-2017
05:49 PM
Could you add some small sample files (or links to where to grab them) for the timesheet and driver CSV files, too?
06-04-2017
04:38 AM
It did the trick for me. I sure hope it helps out @Joan Viladrosa, too! Thanks, Sriharsha!
06-03-2017
08:49 PM
Ahhh.... for my sins it looks like I ran into the same problem as you did, @Joan Viladrosa (shown below). Did you get anywhere with this? Do you have a workaround while I try to see if anyone knows what's up?
06-03-2017
02:57 PM
There are older versions of the Sandbox that might run within an 8 GB machine, but probably not any that will have Spark, or at least a relatively modern version of Spark. The Sandbox does have a cloud-hosted option. All of this is detailed at https://hortonworks.com/downloads/#sandbox. Good luck and happy Hadooping (and Sparking)!
06-03-2017
02:55 PM
2 Kudos
See @Artem Ervits' answer in https://community.hortonworks.com/questions/14080/hadoop-nodes-with-different-characteristics.html to this question. Ambari has a "config groups" feature that will help you out here. Good luck and happy Hadooping!
06-02-2017
04:32 PM
1 Kudo
You're not going to like it, but it is as simple as upper-casing "count" to "COUNT". 😉 This brings up the larger issue of case-sensitivity in Pig. Generally speaking, case only really matters for alias names and things that end up being Java class names. Functions fall into that bucket, so just upper-case it and it'll work. Additionally, you could simplify the code a bit to just do COUNT(october) instead of COUNT(october.s_station). Good luck and happy Hadooping!
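Here is a minimal runnable sketch of the corrected, upper-cased call (the input path and field names are placeholders patterned after the aliases mentioned above):
cat > count_stations.pig <<'EOF'
october = LOAD '/tmp/october.csv' USING PigStorage(',') AS (s_station:chararray, s_reading:int);
grpd    = GROUP october ALL;
-- COUNT must be upper-cased; counting the whole bag works as well as a single field.
totals  = FOREACH grpd GENERATE COUNT(october);
DUMP totals;
EOF
pig count_stations.pig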
05-30-2017
02:37 PM
1 Kudo
@Anirban Das Deb, yep, the notes in https://community.hortonworks.com/questions/65370/hdp23-pig-hive-rev6-vm-for-self-paced-learning.html probably helped you get that tiny pseudo cluster running on Docker rebuilt again. As for your specific problem of running gedit to access and save files on that Docker image, the lab guide walks you through how to get going. I documented the essential steps in https://community.hortonworks.com/questions/66151/devph-folder-in-self-paced-learning-vm.html (scroll down towards the bottom, as my answer isn't marked as "Best"). That said, if this helps you out, maybe you can "Accept" it on this one. 😉
05-30-2017
12:49 PM
2 Kudos
You're right that there are (plenty of) times when you need to do some cleansing/transforming/enhancing of your data, and you're also right that you have multiple tools and approaches for this. I talked about this (at a high level) in my recent www.devnexus.com preso, which you can find at https://www.slideshare.net/lestermartin/transformation-processing-smackdown-spark-vs-hive-vs-pig. The good (and bad) news is that you get to make some choices here, which I believe are usually decided based on your, and your team's, experiences and preferences as much as anything. If you have some specific scenarios you want help on, it might be best to open a specific HCC question for each of them; you'll probably get a more targeted response, as this question is rather high-level and the answers could quickly become subjective (again, based on individuals' experiences and preferences). Good luck and happy Hadooping!
05-24-2017
06:25 PM
Correct; File B will be loaded into memory and used in that context for each block of File A, with each block processed independently of the others.
05-24-2017
04:28 PM
Agreed that the example use case could be solved more simply (the real world demands the KISS principle, but sometimes simple examples are overkill). The point was to make sure you understood how it can be used. For a slightly meatier example, check out the one at https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/.
05-24-2017
03:22 PM
I'm not aware of it myself. Might be useful to chime in on the Zeppelin Scheduler discussion at https://community.hortonworks.com/questions/98101/scheduler-in-zeppelin.html with this question.
05-24-2017
03:18 PM
Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins details, you have the gist of what's happening here. The (obvious) goal is to do a map-side join instead of a more classical reduce-side join.
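A minimal sketch of what that looks like in a script (the paths and field names are placeholders; the smaller relation is listed last and must fit in memory):
cat > repl_join.pig <<'EOF'
big   = LOAD '/tmp/transactions.csv' USING PigStorage(',') AS (cust_id:int, amount:double);
small = LOAD '/tmp/customers.csv'    USING PigStorage(',') AS (cust_id:int, name:chararray);
-- 'replicated' ships the last (small) relation to every map task for a map-side join.
joined = JOIN big BY cust_id, small BY cust_id USING 'replicated';
DUMP joined;
EOF
pig repl_join.pig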
05-23-2017
01:11 PM
Yep, I know this "classic" script pretty well. You can find it on @Alan Gates's GitHub account at https://github.com/alanfgates/programmingpig/blob/master/examples/ch6/distinct_symbols.pig. The absolute best way to understand things (and very often answer questions) is to simply run something and observe the behavior. To help with this, I loaded up some simple data that equates to just three distinct trading symbols: SYM1, SYM2 and SYM3.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NYSE_daily.csv
NYSE,SYM1,100,100
NYSE,SYM1,200,200
NYSE,SYM2,100,100
NYSE,SYM2,200,200
NYSE,SYM3,100,100
NYSE,SYM3,200,200
[maria_dev@sandbox 104217]$
I got the script ready to run.
[maria_dev@sandbox 104217]$ cat distinct_symbols.pig
daily = load '/user/maria_dev/hcc/104217/NYSE_daily.csv'
USING PigStorage(',')
as (exchange, symbol); -- skip other fields
grpd = group daily by exchange;
describe grpd; -- to show where "daily" below comes from
dump grpd;
uniqcnt = foreach grpd {
sym = daily.symbol;
uniq_sym = distinct sym;
generate group, COUNT(uniq_sym);
};
describe uniqcnt;
dump uniqcnt;
[maria_dev@sandbox 104217]$
The first thing you seem to have trouble with is where "daily" comes from. As this output from the describe and dump of grpd shows, grpd is made up of two attributes: group and daily (where daily is the contents of all records from the NYSE, which is all records, since that's all we have in this file).
grpd: { group: bytearray, daily: {(exchange: bytearray,symbol: bytearray)} }
( NYSE, {(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)} )
So, there is only one tuple in the grpd alias (there is only one distinct exchange) that we get to loop through. While inside the loop (for the single row it has), sym ends up being all six rows from (grpd's current row).daily.symbol, and then uniq_sym ends up being the three distinct rows, which we use to generate the second (unnamed) attribute in uniqcnt. From there, we can describe and dump uniqcnt.
uniqcnt: {group: bytearray,long}
(NYSE,3)
To help illustrate it more, add the following file to the same input directory in HDFS.
[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NASDAQ_daily.csv
NASDAQ,HDP,10,1000
NASDAQ,ABC,1,1
NASDAQ,XYZ,1,1
NASDAQ,HDP,10,2000
[maria_dev@sandbox 104217]$
Then change the Pig script to just read the directory (which now has two files) and you'll get this updated output.
grpd: {group: bytearray,daily: {(exchange: bytearray,symbol: bytearray)}}
(NYSE,{(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)})
(NASDAQ,{(NASDAQ,HDP),(NASDAQ,XYZ),(NASDAQ,ABC),(NASDAQ,HDP)})
uniqcnt: {group: bytearray,long}
(NYSE,3)
(NASDAQ,3)
Hope this helps. Good luck and happy Hadooping!
05-22-2017
08:20 PM
Is that tuple definition of key and timestamp part of the declareOutputFields() method of your spout? My topology code (snippet below) did NOT have a chance to declare output fields from my Kafka spout (or maybe I just didn't wire it up right).
TopologyBuilder builder = new TopologyBuilder();
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
SpoutConfig sc = new SpoutConfig(hosts,
"s20-logs", "/s20-logs",
UUID.randomUUID().toString());
sc.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(sc);
builder.setSpout("log-spout", spout, 1);
builder.setBolt("message-tokenizer",
new MessageTokenizerBolt(), 1)
.shuffleGrouping("log-spout");
My Kafka messages were just a long tab-separated string of values, so my MessageTokenizerBolt (shown below) broke this apart and declared fields, such as ip-address, that could then later be used in a fieldsGrouping() further in the topology.
public class MessageTokenizerBolt extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
String[] logElements = StringUtils.split(tuple.getString(0), '\t');
String ipAddress = logElements[2];
String messageType = logElements[3];
String messageDetails = logElements[4];
basicOutputCollector.emit(new Values(ipAddress, messageType, messageDetails));
}
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("ip-address", "message-type", "message-details"));
}
}
I'm guessing this isn't your problem, as I was thinking you'd get a runtime exception if the field name you were trying to group on wasn't declared as a field name in the stream you are listening to. Maybe you can provide a bit more of your code?
05-21-2017
10:33 PM
I surely don't have an answer for this one, but you could ~play~ with it by hand-jamming what you think is the appropriate jar onto the worker nodes, as I did with another jar issue described in https://martin.atlassian.net/wiki/x/JbXqBQ. Then you'd have an easy case to submit that the support team could easily reproduce (and get fixed!). Good luck and happy Hadooping/Storming.
05-21-2017
10:28 PM
Here's a snippet from a working scenario similar to yours that works just fine for me.
TopologyBuilder builder = new TopologyBuilder();
BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
SpoutConfig sc = new SpoutConfig(hosts,
"s20-logs", "/s20-logs",
UUID.randomUUID().toString());
sc.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(sc);
builder.setSpout("log-spout", spout, 1); If you notice, you'll see that my BrokerHosts are really the list of ZooKeeper instances. I'm running this on a HDP 2.5 cluster which is Storm 1.0.1 and I constructed this ZkHosts class after reviewing the notes in http://storm.apache.org/releases/1.0.1/storm-kafka.html. Might be worth a try for you. Either way, good luck and Happy Hadooping/Storming!