Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
| 5408 | 08-12-2016 01:02 PM
| 2202 | 08-08-2016 10:00 AM
| 2612 | 08-03-2016 04:44 PM
| 5499 | 08-03-2016 02:53 PM
| 1424 | 08-01-2016 02:38 PM
05-07-2016
07:34 PM
1 Kudo
1) You essentially have two options. Use sqoop import-all-tables with the exclude option, as you mention. In that case you have a single Sqoop action in Oozie and no parallelism at the Oozie level (though Sqoop itself might provide some), and you have some limitations (only straight imports of all columns, ...). Alternatively, you make an Oozie flow that uses a fork and then one single-table Sqoop action per table. In that case you have fine-grained control over how much you want to run in parallel. You could, for example, load 4 at a time: Start -> Fork -> 4 Sqoop actions -> Join -> Fork -> 4 Sqoop actions -> Join -> End.

2) If you want incremental loads, I don't think sqoop import-all-tables is possible, so it's one Sqoop action per table. Essentially you can either use Sqoop's incremental import functionality (using a property file) or use WHERE conditions and pass the date parameter through from the coordinator; you can use coord:dateformat to transform your execution date. (A sketch follows below.)

3) Run one coordinator for each table, OR have a Decision action in the Oozie workflow that skips some Sqoop actions, like: Start -> Sqoop1 where date = mydate -> Decision: if mydate % 3 = 0 then Sqoop2, else End.

4) Incremental imports load the new data into a folder in HDFS. If you rerun the import, that folder needs to be deleted first. If you use append, the old data in HDFS is not deleted. Now you may ask why you would ever not want append; the reason is that you usually do something with the data afterwards, like importing the new data into a partitioned Hive table, and with append the same data would be loaded over and over.
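To illustrate option 2), here is a minimal sketch of one per-table incremental import driven by a date parameter from the coordinator (the connection string, table, column, and paths are hypothetical placeholders):

# coord_date would be supplied by the coordinator, e.g. via coord:dateformat
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username etl_user \
  --password-file /user/etl/.sqoop.pw \
  --table ORDERS \
  --where "updated_at >= '${coord_date}'" \
  --target-dir /staging/orders/${coord_date} \
  --num-mappers 4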
05-06-2016
04:57 PM
No, you don't have to install a local KDC; you have to configure SSSD to connect to AD for Linux user authentication. As said, AD normally provides Kerberos tickets automatically. To create a new service user in AD, you are best off talking to your AD team. Once you have created a hue service user (in the same group as the hdfs etc. users), you should be able to export the keytab; see the sketch below. The guide you found would be for a standard KDC, which is also an option; however, if you want a standard KDC, you need to add a one-way trust from the AD to your local KDC.
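For reference, a hedged sketch of exporting that keytab on the AD domain controller and verifying it on a Linux host (the account, principal, and realm names are placeholders; your AD team may use a different procedure):

# on the AD domain controller (ktpass prompts for the password with /pass *):
ktpass /princ hue/gateway.example.com@EXAMPLE.COM /mapuser hue-service /pass * /ptype KRB5_NT_PRINCIPAL /crypto All /out hue.service.keytab

# then verify on the Linux side:
kinit -kt hue.service.keytab hue/gateway.example.com@EXAMPLE.COM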
04-29-2016
10:38 AM
Hi @Pedro Rodgers, there's quite a lot of good information available on the Hortonworks website, including project examples and customer success stories. The intro page for much of this is: http://hortonworks.com/partner/sas/
A few examples:
Rogers Media - https://youtu.be/wTnkg16jHwg
Webinar on Predictive Analytics and Machine Learning using SAS - https://youtu.be/D6YzqFgiRnI
05-08-2016
02:01 AM
@Benjamin Leonhardi With the release of YARN.Next, the containers will receive their own IP address and get registered in DNS. The FQDN will be available via a REST call to YARN. If the current YARN container dies, the Docker container will start in a different YARN container somewhere in the cluster. As long as all clients are pointing at the FQDN of the application, the outage will be nearly transparent. In the meantime, there are several options using only Slider, but they require some scripting or registration in ZooKeeper. If you run:

slider lookup --id application_1462448051179_0002
2016-05-08 01:55:51,676 [main] INFO impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2016-05-08 01:55:53,847 [main] WARN shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2016-05-08 01:55:53,868 [main] INFO client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
{
"applicationId" : "application_1462448051179_0002",
"applicationAttemptId" : "appattempt_1462448051179_0002_000001",
"name" : "biologicsmanufacturingui",
"applicationType" : "org-apache-slider",
"user" : "root",
"queue" : "default",
"host" : "sandbox.hortonworks.com",
"rpcPort" : 1024,
"state" : "RUNNING",
"diagnostics" : "",
"url" : "http://sandbox.hortonworks.com:8088/proxy/application_1462448051179_0002/",
"startTime" : 1462454411514,
"finishTime" : 0,
"finalStatus" : "UNDEFINED",
"origTrackingUrl" : "http://sandbox.hortonworks.com:1025",
"progress" : 1.0
}
2016-05-08 01:55:54,542 [main] INFO util.ExitUtil - Exiting with status 0
You do get the host the container is currently bound to. Since the instructions bind the Docker container to the host IP, this allows URL discovery, but as I said, not out of the box. This article is merely the harbinger of YARN.Next, as that will integrate the PaaS capabilities into YARN itself, including application registration and discovery.
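As a hypothetical example of scripting that discovery (jq is an assumption here, not part of Slider):

# strip the log lines from the lookup output, keep the JSON block, pull out the host
slider lookup --id application_1462448051179_0002 \
  | sed -n '/^{/,/^}/p' \
  | jq -r '.host'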
04-28-2016
04:57 PM
3 Kudos
You would have to make sure that mapreduce.framework.name is set correctly (yarn, I suppose) and that the mapred files are there, but first please verify that your nameNode parameter is set correctly. HDFS is very exact about it and requires the hdfs:// scheme in front, so hdfs://namenode:8020 instead of namenode:8020.
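For illustration, the relevant lines of a hypothetical job.properties (the host names are placeholders):

# note the hdfs:// scheme on nameNode
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8050
mapreduce.framework.name=yarn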
04-28-2016
06:26 AM
@Robert Levas Thanks, Robert. Regarding the question "How does Kerberos work?", could you please answer it here?
12-08-2017
03:45 AM
@Joseph Niemiec How can I run a query like "select * from table where date <= '2017-12-08'" in nested-partition form, in the case where the table is partitioned by day, month, and year?
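For context, here is a sketch of the kind of expanded predicate I mean, assuming integer partition columns named year, month, and day (the names are illustrative):

-- hypothetical expansion of date <= '2017-12-08' over year/month/day partitions
SELECT * FROM mytable
WHERE year < 2017
   OR (year = 2017 AND month < 12)
   OR (year = 2017 AND month = 12 AND day <= 8);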
04-22-2016
04:24 PM
Thanks Benjamin, great information. About option a): in this case a new (and queryable) meta-column called daily_date in the nice format would be created in the final table, wouldn't it? [Edit: just done it; yes, it is.] To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think); see the sketch below. Then:

INSERT OVERWRITE TABLE final
PARTITION (daily_date='${nice_date}')
SELECT facts, otherFact   -- list every column except daily_date here
FROM staging;

Should work! Thanks again.
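For the automation piece, a minimal sketch of the wrapper (the variable name and script path are illustrative):

# compute today's partition value, then hand it to Hive as a substitution variable
nice_date=$(date +%Y-%m-%d)
hive --hivevar nice_date="${nice_date}" -f load_final.hql
# inside load_final.hql, reference it as '${nice_date}' or '${hivevar:nice_date}'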
04-20-2016
06:17 PM
@Matt Foley Thanks for the additional information. This is very helpful.
04-26-2016
06:02 PM
@Kevin Sievers Hi Kevin, your commands look good to me, but somehow Hive does not take the number of reduce tasks. You are right that Hadoop should be MUCH faster; the one reduce task, and even weirder the one mapper, seem to be the problem. And I assure you this runs with a lot of mappers and 40 reducers, loading and transforming around 300 GB of data in 20 minutes on a 7-datanode cluster. So basically I have NO idea why it uses only one mapper, I have no idea why it has the second reducer AT ALL, and I have no idea why it ignores the mapred.reduce.tasks parameter. I think a support ticket might be in order.

set hive.tez.java.opts=-Xmx3600m;
set hive.tez.container.size=4096;
set mapred.reduce.tasks=120;
CREATE EXTERNAL TABLE STAGING ...
...
insert into TABLE TARGET partition (day=20150811) SELECT * FROM STAGING distribute by DT;
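One thing that might be worth trying (an assumption on my part, not verified on your cluster): Hive on Tez can auto-tune reducer counts, which may override a manual setting, and it also reads the newer property name:

-- disable Tez auto reducer parallelism, then set the count via the newer property name
set hive.tez.auto.reducer.parallelism=false;
set mapreduce.job.reduces=120;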