Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
| 5408 | 08-12-2016 01:02 PM
| 2202 | 08-08-2016 10:00 AM
| 2612 | 08-03-2016 04:44 PM
| 5499 | 08-03-2016 02:53 PM
| 1424 | 08-01-2016 02:38 PM
05-07-2016
07:34 PM
1 Kudo
1) You essentially have two options. Use sqoop import-all-tables with the exclude option, as you mention. In that case you have a single Sqoop action in Oozie and no parallelism at the Oozie level (though Sqoop itself might provide some), and you have some limitations (only straight imports of all columns, ...). Alternatively, you make an Oozie flow that uses a fork and then one single-table Sqoop action per table. In that case you have fine-grained control over how much you want to run in parallel. You could, for example, load 4 at a time: Start -> Fork -> 4 Sqoop actions -> Join -> Fork -> 4 Sqoop actions -> Join -> End.

2) If you want incremental loads, I don't think sqoop import-all-tables is possible, so it's one Sqoop action per table. Essentially you can either use Sqoop's incremental import functionality (using a property file) or use WHERE conditions and pass the date parameter through from the coordinator; you can use coord:dateformat to transform your execution date. (A sketch follows below.)

3) Run one coordinator for each table, OR have a Decision action in the Oozie workflow that skips some Sqoop actions, like: Start -> Sqoop1 where date = mydate -> Decision: if mydate % 3 = 0 then Sqoop2, else End.

4) Incremental imports load the new data into a folder in HDFS. If you rerun the import, that folder needs to be deleted first. If you use append, the old data in HDFS is not deleted. Now you may ask why you would ever not want append; the reason is that you usually do something with the data afterwards, like importing the new data into a partitioned Hive table, and with append the same data would be loaded over and over.
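To illustrate option 2), here is a minimal sketch of one per-table incremental import driven by a date parameter from the coordinator (the connection string, table, column, and paths are hypothetical placeholders):

# coord_date would be supplied by the coordinator, e.g. via coord:dateformat
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username etl_user \
  --password-file /user/etl/.sqoop.pw \
  --table ORDERS \
  --where "updated_at >= '${coord_date}'" \
  --target-dir /staging/orders/${coord_date} \
  --num-mappers 4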
05-06-2016
04:57 PM
No, you don't have to install a local KDC; you have to configure SSSD to connect to AD for Linux user authentication. As said, AD normally provides Kerberos tickets automatically. To create a new service user in AD, you are best off talking to your AD team. Once you have created a hue service user (in the same group as the hdfs etc. users), you should be able to export the keytab; see the sketch below. The guide you found would be for a standard KDC, which is also an option; however, if you want a standard KDC, you need to add a one-way trust from the AD to your local KDC.
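For reference, a hedged sketch of exporting that keytab on the AD domain controller and verifying it on a Linux host (the account, principal, and realm names are placeholders; your AD team may use a different procedure):

# on the AD domain controller (ktpass prompts for the password with /pass *):
ktpass /princ hue/gateway.example.com@EXAMPLE.COM /mapuser hue-service /pass * /ptype KRB5_NT_PRINCIPAL /crypto All /out hue.service.keytab

# then verify on the Linux side:
kinit -kt hue.service.keytab hue/gateway.example.com@EXAMPLE.COM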
04-29-2016
10:38 AM
Hi @Pedro Rodgers, there's quite a lot of good information available on the Hortonworks website, including project examples and customer success stories. The intro page for much of this is: http://hortonworks.com/partner/sas/
A few examples:
Rogers Media - https://youtu.be/wTnkg16jHwg
Webinar on Predictive Analytics and Machine Learning using SAS - https://youtu.be/D6YzqFgiRnI
05-08-2016
02:01 AM
@Benjamin Leonhardi With the release of YARN.Next, the containers will receive their own IP address and get registered in DNS. The FQDN will be available via a REST call to YARN. If the current YARN container dies, the Docker container will start in a different YARN container somewhere in the cluster. As long as all clients are pointing at the FQDN of the application, the outage will be nearly transparent. In the meantime, there are several options using only Slider, but they require some scripting or registration in ZooKeeper. If you run:

slider lookup --id application_1462448051179_0002
2016-05-08 01:55:51,676 [main] INFO impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2016-05-08 01:55:53,847 [main] WARN shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2016-05-08 01:55:53,868 [main] INFO client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
{
"applicationId" : "application_1462448051179_0002",
"applicationAttemptId" : "appattempt_1462448051179_0002_000001",
"name" : "biologicsmanufacturingui",
"applicationType" : "org-apache-slider",
"user" : "root",
"queue" : "default",
"host" : "sandbox.hortonworks.com",
"rpcPort" : 1024,
"state" : "RUNNING",
"diagnostics" : "",
"url" : "http://sandbox.hortonworks.com:8088/proxy/application_1462448051179_0002/",
"startTime" : 1462454411514,
"finishTime" : 0,
"finalStatus" : "UNDEFINED",
"origTrackingUrl" : "http://sandbox.hortonworks.com:1025",
"progress" : 1.0
}
2016-05-08 01:55:54,542 [main] INFO util.ExitUtil - Exiting with status 0
You do get the host the container is currently bound to. Since the instructions bind the Docker container to the host IP, this allows URL discovery, but as I said, not out of the box. This article is merely the harbinger of YARN.Next, as that will integrate the PaaS capabilities into YARN itself, including application registration and discovery.
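As a hypothetical example of scripting that discovery (jq is an assumption here, not part of Slider):

# strip the log lines from the lookup output, keep the JSON block, pull out the host
slider lookup --id application_1462448051179_0002 \
  | sed -n '/^{/,/^}/p' \
  | jq -r '.host'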
04-28-2016
04:57 PM
3 Kudos
You would have to make sure that mapreduce.framework.name is set correctly (yarn, I suppose) and that the mapred files are there, but first please verify that your nameNode parameter is set correctly. HDFS is very exact about it and requires the hdfs:// scheme in front, so hdfs://namenode:8020 instead of namenode:8020.
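For illustration, the relevant lines of a hypothetical job.properties (the host names are placeholders):

# note the hdfs:// scheme on nameNode
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8050
mapreduce.framework.name=yarn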
04-28-2016
06:26 AM
@Robert Levas Thanks, Robert. Regarding the question "How does Kerberos work?", could you please answer it here?
12-08-2017
03:45 AM
@Joseph Niemiec How can I run a query like "select * from table where date <= '2017-12-08'" in nested-partition form, in the case where the table is partitioned by day, month, and year?
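For context, here is a sketch of the kind of expanded predicate I mean, assuming integer partition columns named year, month, and day (the names are illustrative):

-- hypothetical expansion of date <= '2017-12-08' over year/month/day partitions
SELECT * FROM mytable
WHERE year < 2017
   OR (year = 2017 AND month < 12)
   OR (year = 2017 AND month = 12 AND day <= 8);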
04-22-2016
04:24 PM
Thanks Benjamin, great information. About option a): in this case a new (and queryable) meta-column called daily_date in the nice format would be created in the final table, wouldn't it? [Edit: just done it; yes, it is.] To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think); see the sketch below. Then:

INSERT OVERWRITE TABLE final
PARTITION (daily_date='${nice_date}')
SELECT facts, otherFact   -- list every column except daily_date here
FROM staging;

Should work! Thanks again.
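For the automation piece, a minimal sketch of the wrapper (the variable name and script path are illustrative):

# compute today's partition value, then hand it to Hive as a substitution variable
nice_date=$(date +%Y-%m-%d)
hive --hivevar nice_date="${nice_date}" -f load_final.hql
# inside load_final.hql, reference it as '${nice_date}' or '${hivevar:nice_date}'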
04-20-2016
06:17 PM
@Matt Foley Thanks for the additional information. This is very helpful.
04-26-2016
06:02 PM
@Kevin Sievers Hi Kevin, your commands look good to me, but somehow Hive does not take the number of reduce tasks. You are right that Hadoop should be MUCH faster; the one reduce task, and even weirder the one mapper, seem to be the problem. And I assure you this runs with a lot of mappers and 40 reducers, loading and transforming around 300 GB of data in 20 minutes on a 7-datanode cluster. So basically I have NO idea why it uses only one mapper, I have no idea why it has the second reducer AT ALL, and I have no idea why it ignores the mapred.reduce.tasks parameter. I think a support ticket might be in order.

set hive.tez.java.opts=-Xmx3600m;
set hive.tez.container.size=4096;
set mapred.reduce.tasks=120;
CREATE EXTERNAL TABLE STAGING ...
...
insert into TABLE TARGET partition (day=20150811) SELECT * FROM STAGING distribute by DT;
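One thing that might be worth trying (an assumption on my part, not verified on your cluster): Hive on Tez can auto-tune reducer counts, which may override a manual setting, and it also reads the newer property name:

-- disable Tez auto reducer parallelism, then set the count via the newer property name
set hive.tez.auto.reducer.parallelism=false;
set mapreduce.job.reduces=120;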