Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5424 | 08-12-2016 01:02 PM |
 | 2204 | 08-08-2016 10:00 AM |
 | 2613 | 08-03-2016 04:44 PM |
 | 5506 | 08-03-2016 02:53 PM |
 | 1426 | 08-01-2016 02:38 PM |
02-04-2016
02:34 PM
1 Kudo
That is weird. I am using Eclipse Datatools with 2.3.0 and 2.3.2, and in both cases I can run multiple commands delimited by ";", just like in any other database. However, I am using a "generic JDBC" connection in Eclipse; I am not sure what Aqua Studio does there. (The screenshot below uses only DDL statements, but I have examples with SELECTs as well.)
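For reference, here is a minimal sketch of sending several ";"-separated statements over a plain Hive JDBC connection, which is essentially what a generic JDBC tool does. The URL, credentials, and statements are placeholders (not from the original post), and the Hive JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MultiStatementExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust host, port, database and credentials.
        String url = "jdbc:hive2://hiveserver:10000/default";
        // A small script with several statements separated by ";",
        // the way a SQL editor would send them.
        String script = "DROP TABLE IF EXISTS demo; "
                + "CREATE TABLE demo (id INT); "
                + "SELECT COUNT(*) FROM demo";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // Split on ";" client side and send one statement at a time.
            for (String sql : script.split(";")) {
                if (!sql.trim().isEmpty()) {
                    stmt.execute(sql.trim());
                }
            }
        }
    }
}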
02-04-2016
01:35 PM
2 Kudos
I wanted to know if people around here have experience with Oozie and YARN preemption. I seem to remember that the two do not work well with each other. Let's assume we have the launcher Application Master, the launcher map task, the task (perhaps Pig) Application Master, and the Pig tasks. So there are four possibilities:

A) A Pig container is killed. Should be fine; Pig will reschedule it through its Application Master.
B) The Pig Application Master is killed. Should be rare, since preemption kills Application Masters only as a last resort. I assume the Oozie launcher would fail, but is there a retry parameter in Oozie?
C) The Oozie launcher map is killed. Suddenly the Pig task is orphaned. Will the Oozie Application Master restart the map? Will the map reconnect to the Pig task, or will it start a second one?
D) The Oozie launcher AM is killed. Similar to C), but will the Oozie server restart the task, or will it be shown as killed?

I also remember an engagement where they had orphaned tasks because of Oozie and preemption. Has anybody seen something like that? Thanks a lot.
Labels:
- Apache Oozie
- Apache YARN
02-04-2016
09:20 AM
2 Kudos
https://developer.yahoo.com/hadoop/tutorial/module4.html

Map -> Combiner -> Partitioner -> Sort -> Shuffle -> Sort -> Reduce

https://farm3.static.flickr.com/2374/3529959828_0b689d1d5c_o.png
https://farm3.static.flickr.com/2275/3529146683_c8247ff6db_o.png
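To make the pipeline above concrete, here is a minimal driver sketch showing where the combiner and partitioner plug into a MapReduce job. It uses the stock TokenCounterMapper, IntSumReducer, and HashPartitioner classes purely as stand-ins, so it is an illustrative word-count setup rather than anything from the linked tutorial.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class PipelineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pipeline-example");
        job.setJarByClass(PipelineDriver.class);
        // Map phase: emit (token, 1) pairs.
        job.setMapperClass(TokenCounterMapper.class);
        // Combiner: pre-aggregates map output before it is spilled and shuffled.
        job.setCombinerClass(IntSumReducer.class);
        // Partitioner: decides which reducer each key is shuffled to.
        job.setPartitionerClass(HashPartitioner.class);
        // Reduce phase: the sort and shuffle steps happen automatically in between.
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}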
02-03-2016
10:17 PM
2 Kudos
Unfortunately, there is no way to enable HDFS HA without restarting the NameNode. So unless you can change the process to use a buffer in between (Kafka would be a very popular tool, combining MQ-like use with almost unlimited scalability and easy buffering of dozens to hundreds of terabytes of data), I am not sure what you could do. So if you really, absolutely cannot lose a tuple, or you want a safer architecture anyway:

A) Develop a process that reads the events and puts them into Kafka. You would also need a process that reads them from Kafka again and puts them into HBase.
B) Switch the process over from HBase to Kafka.
C) Upgrade your cluster.
D) Switch the Kafka->HBase process back on. That would not be time critical, since even a 3-node Kafka cluster can easily store 10-20 TB of data in a replicated fashion.
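As a rough illustration of step A), here is a minimal Kafka producer sketch that buffers events durably until the Kafka->HBase process is switched back on. The broker list, topic name, and payload are placeholders, and the consumer side that writes into HBase is left out.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventBufferProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; point this at your Kafka brokers.
        props.put("bootstrap.servers", "kafka1:6667,kafka2:6667,kafka3:6667");
        // Durability settings so no tuple is lost while HBase is unavailable.
        props.put("acks", "all");
        props.put("retries", Integer.MAX_VALUE);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Placeholder event; in practice this loop would read from the real source.
            producer.send(new ProducerRecord<>("events", "rowkey-1", "payload"));
        }
    }
}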
02-03-2016
01:44 PM
2 Kudos
Different options; it depends how you want to do it. Often I end up with a bit of Python glue code on the edge node. There is not really a "best" way to do it. I have used:

- Flume (good if you want to merge all files into a log stream and perhaps filter events)
- WebHDFS (good if you want to upload files as-is but cannot access an edge node)
- Mounted a folder on the edge node and ran a shell script from cron. This is perhaps the easiest; for secure mounts there is sshfs, and you can just run the hadoop fs -put commands in the shell script.
- Used rsync to sync a folder to an edge node and ran a Python program there to pick up the files; the file logic was more involved, so Python was better than shell.
- Used rsync to copy a log folder and used a Python script to load files incrementally. Since the log files were supposed to load incrementally, the Python script kept an offset with tell() for each file and uploaded only new results.

My tip: if you can, mount the log folders on the edge node and use the Hadoop client API for full file loads (a small sketch follows below). If you want incremental loads and pre-processing before HDFS, look at NiFi or Flume.
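Here is a minimal sketch of the "full file load via the Hadoop client API" option, run from an edge node where the log folder is mounted. The local mount point and HDFS target directory are made-up paths for illustration.

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogUploader {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath on the edge node.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            File localDir = new File("/mnt/applogs");      // placeholder mounted log folder
            Path target = new Path("/data/raw/applogs");   // placeholder HDFS directory
            fs.mkdirs(target);
            File[] files = localDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    if (f.isFile()) {
                        // Equivalent of "hadoop fs -put <file> <target>" for each file.
                        fs.copyFromLocalFile(new Path(f.getAbsolutePath()), target);
                    }
                }
            }
        }
    }
}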
02-03-2016
01:02 PM
1 Kudo
Which version of Ambari are you running? 2.2, like the documentation? It was removed in one version but reintroduced pretty much immediately, and I definitely have it in 2.1.2, so it would be weird if it was gone in 2.2. https://issues.apache.org/jira/browse/AMBARI-12707
02-03-2016
10:11 AM
In that case you cannot use SPNEGO. But you might be able to stop users from administering queues (i.e. killing applications). I have seen these settings not work on a non-kerberized cluster, but it might have been a bug.
02-02-2016
03:51 PM
3 Kudos
Sqoop might work, or if you want to be closer to realtime, IBM CDC. Change Data Capture from IBM also has a Hadoop connector, and I would hope the two can talk to each other. (Mainframe versions are sometimes very different, but I would assume you can forward the changes to a normal CDC instance, which then should have the BigData connector.) http://www-03.ibm.com/software/products/en/ibminfochandatacaptforzos Depending on how you want the data in HDFS, there is for example a WebHDFS connector in IBM CDC (now InfoSphere Data Replication?). https://www.ibm.com/developerworks/community/files/app?lang=en#/file/c04518fb-8ff3-4b2a-9fb9-38733478bb9b
02-02-2016
02:55 PM
2 Kudos
Yes, you can use Oozie. Let's concentrate on this, because to run queries in parallel efficiently you will most likely need an Oozie workflow anyway, regardless of whether you use Falcon or a classic Oozie coordinator to schedule it. (Falcon can kick off Oozie workflows.) What Oozie workflows provide is the ability to create an execution graph where each action can continue on to any other node (cycles are forbidden). To allow parallel action execution you have forks and joins: a fork starts two or more actions in parallel, and a join waits for all the actions it depends on to finish. So you can pretty much create any structure you want. The example below is very simple, but you could also have a fork within a fork, and so on. There are surely other ways as well, but Oozie will most likely be the canonical way of doing it. For example:
<start to="parallel-load"/>
<fork name="parallel-load">
<path start="load1"/>
<path start="load2"/>
</fork>
<action name="load1">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://hiveserver:10000/default</jdbc-url>
<password>${hivepassword}</password>
<script>/data/sql/load1.sql</script>
</hive2>
<ok to="join-node"/>
<error to="kill"/>
</action>
<action name="load2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://hiveserver:10000/default</jdbc-url>
<password>${hivepassword}</password>
<script>/data/sql/load2.sql</script>
</hive2>
<ok to="join-node"/>
<error to="kill"/>
</action>
<join name="join-node" to="end"/>
<kill name="kill">
<message>Load failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
02-02-2016
02:35 PM
Most likely it was this. Sorry for not accepting for so long, but when I changed the <falconfolder>/staging/falcon folder manually it all worked, and I forgot about it. Thanks a lot. https://issues.apache.org/jira/browse/FALCON-1647