Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7170 | 06-03-2019 09:31 PM |
| | 1744 | 05-22-2019 02:38 AM |
| | 2195 | 05-22-2019 02:21 AM |
| | 1383 | 05-04-2019 08:17 PM |
| | 1684 | 04-14-2019 12:06 AM |
12-05-2016 07:32 PM
Is anyone aware of any plans for integrations between Knox and Accumulo?
Labels:
- Apache Accumulo
- Apache Knox
11-24-2016 12:17 AM
+1 on suggestion #1
11-22-2016 04:52 PM
1 Kudo
Yes, there was some messaging early this year (which it seems the Spring folks replicated on their site) indicating a change in the certification program, but it was determined that the planned changes were being rolled out too quickly and did not give folks already preparing for the existing certification exams enough time to complete them. The certification program will likely continue to evolve, but our intention is to provide adequate time to adjust whenever any future changes are introduced. Good luck on the HDPCD exam!
11-18-2016 04:49 PM
1 Kudo
Nope. There are no certification prerequisites for HDPCD. Good luck on the exam!!
11-18-2016 06:53 AM
5 Kudos
Most of the answers you are looking for are explained in http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism, but here are my 1-2-3 answers to your questions.

1. Absolutely, Sqoop is building a SQL query (actually one for each mapper) against the source table it is ingesting into HDFS from. The mappers (default is four, but you can override it) leverage the split-by column, and Sqoop basically tries to build an intelligent set of WHERE clauses so that each mapper gets a logical "slice" of the table. As an example, if we used three mappers and a split-by column that is an integer ranging from 0 to 1,000,000 in the actual data (i.e. Sqoop can do a pretty easy min and max call to the DB on the split-by column), then Sqoop's first mapper would try to get values 0-333333, the second mapper would pull 333334-666666, and the last would grab 666667-1000000.

2. Nope, Sqoop runs a map-only job, with each mapper (3 in my example above) running a query with a specific range to prevent any kind of overlap. Each mapper then just drops its data into the target-dir HDFS directory in a file named part-m-00000 (well, the 2nd one ends with 00001 and the 3rd one ends with 00002).

3. The composite export is represented by the target-dir HDFS directory (it basically follows the MapReduce naming scheme for files).

I'm hoping the parallelism piece makes sense now and that this helps out some. As with everything, some simple testing on your own will help it all make sense. As for an architectural diagram, check out the image (and additional details) at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/using_sqoop_to_move_data_into_hive.html which might aid in your understanding. Happy Hadooping!!
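To make the parallelism piece concrete, here is a minimal sketch of an import matching the scenario above (three mappers splitting on an integer column). The connection string, table name, column name, and target directory are made up for illustration, and the generated queries are only an approximation of what Sqoop actually issues.

```
# Hypothetical import: 3 mappers, splitting on an integer column named "id"
sqoop import \
  --connect jdbc:mysql://dbhost/somedb \
  --username someuser \
  --table some_table \
  --split-by id \
  -m 3 \
  --target-dir /user/student/some_table

# Sqoop first asks the database for the split column's bounds, roughly:
#   SELECT MIN(id), MAX(id) FROM some_table
# and then each mapper runs its own bounded query, roughly:
#   mapper 0: ... WHERE id >= 0      AND id <  333334   -> part-m-00000
#   mapper 1: ... WHERE id >= 333334 AND id <  666667   -> part-m-00001
#   mapper 2: ... WHERE id >= 666667 AND id <= 1000000  -> part-m-00002
```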
11-15-2016 02:36 PM
For the HCA exam, there are no sample tests. Please review the objectives link I provided above. The Essentials course is also offered as a one-day public training fairly regularly, as shown at https://ilearning.seertechsolutions.com/lmt/clmsCatalogSummary.prMain?site=hw&in_region=hw, and we are considering making it available as a free on-demand offering as well.
11-11-2016 01:25 PM
1 Kudo
As with all "interesting" questions like this, the best answer is to try it and see for yourself. My hypothesis was that Sqoop would report that these directives are incompatible with each other, and I was glad to see that is exactly what happened when I gave it a try myself:

[root@sandbox Lab3.1]# sqoop import --connect jdbc:mysql://sandbox/test?user=root --table salaries --columns gender,age --query "select * from salaries s where s.salary > 90000.00 and \$CONDITIONS" --split-by gender -m 2 --target-dir willItWork
16/11/11 08:22:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.3.2.0-2950
Cannot specify --query and --table together.
Try --help for usage instructions.
[root@sandbox Lab3.1]#
11-11-2016 01:03 PM
1 Kudo
I do not believe you can use Ambari to configure this, but the manual instructions at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_command-line-installation/content/install_kafka_rpms.html call out some notes on how to approach this "not recommended" strategy. I guess a member of the Support team would have to be solicited for a more definitive answer, but... it sounds like it is supported.
11-10-2016 01:17 PM
2 Kudos
User Defined Functions (UDFs) come to the rescue. Search for "Filter Functions" in http://pig.apache.org/docs/r0.15.0/udf.html and you'll see a rough example of how to do this. Now, your "isEmpty" (or whatever you call the function) will be implemented differently: in yours, you would need to walk each field of the row (called "input" in that example UDF) and check for null. If all of the row's fields are null, you ultimately return a boolean value that can be used in your script (after you build and register the UDF). If this is your first Pig UDF, there are plenty of examples on the internet, including mine at https://martin.atlassian.net/wiki/x/C4BRAQ. Good luck!
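If it helps to see the shape of such a filter function, here is a minimal sketch in Java, assuming the rule is "treat the row as empty only when every field is null." The package and class names are made up, and this is not the implementation from the linked example, just an illustration of the FilterFunc pattern.

```
// Hypothetical Pig filter UDF: returns true only when every field is null.
package com.example.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class AllFieldsNull extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return true;  // treat a missing/empty row as "empty"
        }
        // Walk each field; any non-null value means the row is not empty.
        for (int i = 0; i < input.size(); i++) {
            if (input.get(i) != null) {
                return false;
            }
        }
        return true;
    }
}
```

After packaging it into a jar, you would REGISTER the jar in your Pig script and filter with something along the lines of `clean = FILTER raw BY NOT com.example.pig.AllFieldsNull(*);` (passing `*` hands the whole row to the UDF).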
11-09-2016 01:47 PM
The self-paced library's setup guide is current; I was using that as a ruse to help me move this concern over to our internal tracking system. 😉 Yes, if "hdfs dfs -ls /" is responding, then by all means march forward. If at some point it stops working (these VMs don't really like to be stopped and started), then please try the restart_sandbox.sh script mentioned earlier, with recreate_sandbox.sh as a "nuclear option". Good luck!
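In case it helps, here is a rough sketch of that decision flow as shell commands. The script names come from earlier in this thread; their locations on the VM are assumptions on my part.

```
# Quick health check: if this lists the HDFS root, keep going with the labs
hdfs dfs -ls /

# If HDFS stops responding (e.g. after the VM was stopped and started),
# try the restart script first...
./restart_sandbox.sh

# ...and only rebuild from scratch as the "nuclear option"
# ./recreate_sandbox.sh
```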