Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 7170 | 06-03-2019 09:31 PM
 | 1744 | 05-22-2019 02:38 AM
 | 2194 | 05-22-2019 02:21 AM
 | 1382 | 05-04-2019 08:17 PM
 | 1684 | 04-14-2019 12:06 AM
11-08-2016
10:36 PM
That "no such id: sandbox" is the concerning error to me. I hate to ask, but could you please download the zipped VM again and start all over. I'd like to attach the latest setup guide as well, but HCC has a file size limit that is preventing me from doing that. If it doesn't work this next time, please send an email to training-support@hortonworks.com (you can reference this HCC post, too) which will create an more easy to track internal support case for our Training DevOps team to further help you. We could also attach the latest setup guide that way if needed.
11-08-2016
02:34 PM
2 Kudos
For this Training VM, there is a hidden folder named .sys in the directory you are in above; it contains a recreate_sandbox.sh script that can be used to recreate the needed Docker instance and get everything operational again. You can run it as shown below.
root@ubuntu:~# cd .sys
root@ubuntu:~/.sys# pwd
/root/.sys
root@ubuntu:~/.sys# ./recreate_sandbox.sh
creating sandbox....
docker stop/waiting
docker start/running, process 5857
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
sandbox started at 172.17.0.2
root@ubuntu:~/.sys# date
Tue Sep 20 22:15:10 EDT 2016
root@ubuntu:~/.sys# date
Tue Sep 20 22:19:51 EDT 2016
root@ubuntu:~/.sys# ssh sandbox
The authenticity of host 'sandbox (172.17.0.2)' can't be established.
RSA key fingerprint is 2e:0c:53:b1:d4:06:7d:ab:bd:79:f9:17:08:f2:8a:4b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'sandbox,172.17.0.2' (RSA) to the list of known hosts.
Last login: Sun Dec 20 14:06:45 2015 from ip-172-17-0-1.ec2.internal
[root@sandbox ~]# hdfs dfs -ls /
Found 7 items
drwxrwxrwx - yarn hadoop 0 2015-10-15 09:45 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-10-15 09:45 /apps
drwxr-xr-x - hdfs hdfs 0 2015-10-15 09:44 /hdp
drwxr-xr-x - mapred hdfs 0 2015-10-15 09:44 /mapred
drwxrwxrwx - mapred hadoop 0 2015-10-15 09:44 /mr-history
drwxrwxrwx - hdfs hdfs 0 2015-10-15 09:46 /tmp
drwxr-xr-x - hdfs hdfs 0 2015-10-20 14:31 /user
[root@sandbox ~]#
NOTE: If it fails again in the future, try the restart_sandbox.sh script instead of recreating everything.
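For reference, the core of a script like this is usually little more than removing the broken container and starting a fresh one from the saved image. The sketch below is only an approximation of that idea (the image name, ports, and hostname are placeholders), not the actual contents of recreate_sandbox.sh:
#!/bin/bash
# Rough sketch only -- the real recreate_sandbox.sh shipped with the VM may differ.
docker rm -f sandbox 2>/dev/null                               # drop any stale container named "sandbox"
docker run -d --name sandbox --hostname sandbox \
  -p 8080:8080 -p 2222:22 \
  sandbox:latest                                               # placeholder image name/tag
docker inspect -f '{{ .NetworkSettings.IPAddress }}' sandbox   # print the new container's IP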
10-24-2016
07:47 PM
As I don't know the answer to "what do you want to do", I invite you to take a peek at the responses to https://community.hortonworks.com/questions/12787/how-to-integrate-kafka-to-pull-data-from-rdbms.html as it is along the same line of thinking (I believe). Technically, Kafka does have a Connector API, http://kafka.apache.org/documentation.html#connect, which could theoretically do what you are asking, but I do not know anyone who has done exactly that with Kafka (most folks are writing more traditional pub/sub clients). As for "in practice", I did a quick Google search for "kafka connect sql server" and found two non-open-source solutions that work with Kafka Connect to do what you described, but it doesn't look like there is a completely open-source solution available at the moment. On the Flume front, I think there is only a JDBC Channel, not a source or sink (at least not in 1.5.2, which ships with HDP 2.5). I'm thinking NiFi (aka HDF) and/or Sqoop might be better tools for retrieving data from an RDBMS like SQL Server.
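If you do end up going the Sqoop route, a minimal import from SQL Server looks roughly like the sketch below (the server, database, table, and user are placeholders, and the Microsoft JDBC driver would need to be on Sqoop's classpath):
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username someuser -P \
  --table orders \
  --target-dir /user/someuser/orders \
  --num-mappers 4     # parallel map tasks, each pulling a slice of the table into HDFS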
10-21-2016
05:26 PM
More details can be viewed from the "source" at http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
10-21-2016
01:32 PM
Just to make sure we are in step on nomenclature: "Sources" and "Sinks" are http://flume.apache.org terminology, whereas http://kafka.apache.org is all about Publishers and Subscribers that interact through Topics (essentially persisted message queues) in a Kafka cluster. If that makes sense and you just want to understand the interactions between Kafka publishers & subscribers, then check out http://kafka.apache.org/intro for some introductory material. On the Flume front, it seems Kafka Source & Sink options became available in 1.6.0, as seen in the current (1.7.0) user guide at https://flume.apache.org/FlumeUserGuide.html. As a point of reference, HDP 2.5 includes Flume 1.5.2 as detailed at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_release-notes/content/ch_relnotes_v250.html, so that is not yet available via HDP.
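If you just want to see the publish/subscribe flow for yourself, the console clients that ship with Kafka are the quickest way to poke at it (the hosts and topic below are placeholders; note that newer Kafka releases replace the --zookeeper/--broker-list flags with --bootstrap-server):
bin/kafka-topics.sh --create --zookeeper zkhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-producer.sh --broker-list brokerhost:9092 --topic test              # type messages, one per line
bin/kafka-console-consumer.sh --zookeeper zkhost:2181 --topic test --from-beginning   # prints everything published to the topic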
10-21-2016
01:21 PM
1 Kudo
They ALL are, especially when talking about so many files that were under-replicated. Ultimately, the NN is the one who determines whether a file is under-replicated. It is then the NN's job to tell one of the DNs that holds a good copy of a block's replica to copy it to another DN. The NN isn't going to do any of the actual movement of bits -- it just coordinates the whole effort. Hope this helps!
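If you want to watch the NN do this, fsck is the easiest window into it, and setrep lets you trigger it on purpose (the path below is just an example):
hdfs fsck / -files -blocks         # the summary at the end shows how many blocks the NN currently considers under-replicated
hdfs dfs -setrep -w 3 /some/path   # change a path's replication factor and wait while the NN schedules the extra copies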
10-19-2016
10:31 PM
I just peeked in the Training support system (the one triggered by emailing certification@hortonworks.com) and it looks like we've been able to discuss these items with you. I'll send you a quick update on the ones I'm aware of and see if we can fully run your concerns to ground. If I missed anything, then please reply to the particular automated email and we'll work to make sure you have answers to your questions. Thanks!
10-18-2016
08:09 PM
1 Kudo
You are correct that we do not post any sample questions for the HCA certification, but we will evaluate if that is a logical step for this entry-level certification. You correctly found the objectives at http://hortonworks.com/wp-content/uploads/2016/08/ExamObjectives-HCAssociate.pdf and while these are "vast" as you described, please note that the HCA "provides for individuals an entry point and validates the fundamental skills required to progress to the higher levels of the Hortonworks certification program". If you are comfortable with the materials discussed in the Hadoop Essentials course, http://hortonworks.com/training/class/hadoop-essentials/, then you are an ideal candidate for this examination. Good luck!
10-17-2016
11:54 AM
The whole goal of having partitions is to allow Hive to limit the files it has to look at in order to fulfill the SQL request you send to it. On the other hand, you also clearly understand that having too many small files to look at is a performance/scalability drag. With so few records for each day, I'd suggest partitioning at the month level (as a single string, such as @Joseph Niemiec and @bpreachuk suggest in their answers to https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html). This will allow you to keep your "original" dates as a column and let the partition month be a new virtual column. Of course, you'll need to explain to your query writers the benefit of using this partition column in their queries, but you will then get the value of partitioning while having 1/30th of the files, each of them roughly 30x bigger. Good luck!
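As a rough sketch of what that looks like (the table and column names here are made up), the month becomes the partition column while the day-level date stays an ordinary column:
hive -e "
CREATE TABLE events (
  event_date STRING,      -- original per-day date kept as a normal column
  payload    STRING
)
PARTITIONED BY (event_month STRING);   -- e.g. '2016-10'

-- filtering on the partition column means Hive only reads that month's files
SELECT count(*) FROM events WHERE event_month = '2016-10' AND event_date = '2016-10-17';
"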
10-06-2016
04:52 PM
@Artem Ervits is right; this would defeat the purpose of the test.