Member since 09-23-2015 | 800 Posts | 898 Kudos Received | 185 Solutions
02-04-2016
02:34 PM
1 Kudo
That is weird. I am using Eclipse Datatools with 2.3.0 and 2.3.2, and in both cases I can run multiple commands delimited by ";", like in any other database. However, I am using a "generic JDBC" connection in Eclipse. Not sure what Aqua Studio does there. (The screenshot below uses only DDL statements, but I have examples with SELECTs as well.)
02-04-2016
01:35 PM
2 Kudos
I wanted to know whether people around here have experience with Oozie and YARN preemption. I think I remember that the two do not work well with each other. Let's assume we have the launcher Application Master, the launcher map task, the Application Master of the actual job (Pig, perhaps), and the Pig tasks. So there are 4 possibilities:
A) A Pig container is killed. Should be fine; Pig will reschedule it through the Application Master.
B) The Pig Application Master is killed. Should be rare, since preemption kills Application Masters only as a last resort. I assume the Oozie launcher would fail, but perhaps there is a retry parameter in Oozie?
C) The Oozie launcher map task is killed. Suddenly the Pig job is orphaned. Will the Oozie launcher Application Master restart the map task? Will the map task reconnect to the Pig job, or will it start a second one?
D) The Oozie launcher AM is killed. Similar to C), but will the Oozie server restart the task, or will it be shown as killed?
I also remember an engagement where they had orphaned tasks because of Oozie and preemption. Has anybody seen something like that? Thanks a lot.
Labels: Apache Oozie, Apache YARN
02-04-2016
09:20 AM
2 Kudos
https://developer.yahoo.com/hadoop/tutorial/module4.html
Map -> Combiner -> Partitioner -> Sort -> Shuffle -> Sort -> Reduce
https://farm3.static.flickr.com/2374/3529959828_0b689d1d5c_o.png
https://farm3.static.flickr.com/2275/3529146683_c8247ff6db_o.png
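To make the stage order above concrete, here is a small in-memory word count in Python that walks through the same steps. It is only a toy sketch: all function names are made up for illustration, the partitioner is a simple stand-in for a hash partitioner, and real Hadoop interleaves partitioning, sorting, and combining while it spills map output rather than running them as strictly separate passes.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate on the map side to shrink the shuffle volume."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())


def partition(key, num_reducers):
    """Partitioner: pick the reducer for a key (illustrative stand-in for a hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    """Run the toy pipeline; each element of splits plays the role of one map task."""
    map_outputs = []
    for split in splits:
        pairs = [kv for line in split for kv in map_phase(line)]  # Map
        combined = combine(pairs)                                 # Combiner
        buckets = defaultdict(list)
        for key, value in combined:                               # Partitioner
            buckets[partition(key, num_reducers)].append((key, value))
        # Map-side sort: order each partition by key.
        map_outputs.append({p: sorted(kvs) for p, kvs in buckets.items()})

    result = {}
    for reducer in range(num_reducers):
        # Shuffle: the reducer fetches its partition from every map task.
        fetched = [kv for out in map_outputs for kv in out.get(reducer, [])]
        fetched.sort()                                            # Reduce-side sort
        for key, value in fetched:                                # Reduce
            result[key] = result.get(key, 0) + value
    return result
```

Calling `run_job([["a b a"], ["b c"]])` treats the two inner lists as two input splits and returns the word counts as a dictionary.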
02-03-2016
10:17 PM
2 Kudos
Unfortunately there is no way to enable HDFS HA without restarting the Namenode. So unless you can change the process to use a buffer in between, I am not sure what you could do. (Kafka would be a very popular tool for this, combining MQ-like use with almost unlimited scalability and easy buffering of dozens to hundreds of terabytes of data.)
So if you really, really, absolutely cannot lose a tuple, or you want a safer architecture anyway:
A) Develop a process that reads the events and puts them into Kafka. You would also need a process that reads them from Kafka again and puts them into HBase.
B) Switch the process over from HBase to Kafka.
C) Upgrade your cluster.
D) Switch on the Kafka->HBase process. That would not be time critical, since even a 3-node Kafka cluster can easily store 10-20 TB of data in a replicated fashion.
02-03-2016
01:44 PM
2 Kudos
Different options. It depends on how you want to do it. Oftentimes I end up with a bit of Python glue code on the edge node. There is not really a "best" way to do it. I have:
- used flume (good if you want to merge all files into a log stream and perhaps filter events)
- used webhdfs (good if you want to upload files as is but cannot access an edge node)
- mounted a folder on the edge node and used a shell script running in cron. This is perhaps the easiest. For a secure mount there is sshfs, and you can just run the hadoop fs -put commands in the shell script.
- used rsync to sync a folder to an edge node and run a Python program there to pick up the files. The file logic was more difficult, so Python was better than shell.
- used rsync to copy a log folder and used a Python script to load files incrementally. Since the log files were supposed to load incrementally, the Python script kept an offset with tell() for each file and uploaded only the new data.
My tip: if you can, mount the log folders on the edge node and use the hadoop client api for full file loads. If you want incremental loads and pre-processing before HDFS, look at nifi or flume.
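A rough sketch of what the incremental variant with tell() can look like. This is not the original script: the function names, the JSON state file, the `*.log` pattern, and the `upload` callback (whatever pushes the bytes on to HDFS) are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path


def load_offsets(state_file):
    """Return the saved {file path: byte offset} map, or an empty dict."""
    if os.path.exists(state_file):
        with open(state_file, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def save_offsets(state_file, offsets):
    """Persist the offsets so the next cron run continues where this one stopped."""
    with open(state_file, "w", encoding="utf-8") as fh:
        json.dump(offsets, fh)


def read_new_data(log_path, offsets):
    """Return the bytes appended since the last call and record the new offset."""
    key = str(log_path)
    start = offsets.get(key, 0)
    # If the file shrank (for example after log rotation), start from the beginning.
    if os.path.getsize(log_path) < start:
        start = 0
    with open(log_path, "rb") as fh:
        fh.seek(start)
        data = fh.read()
        offsets[key] = fh.tell()  # remember how far we got
    return data


def collect_increments(log_dir, state_file, upload):
    """Hand the new part of every *.log file in log_dir to upload(name, data)."""
    offsets = load_offsets(state_file)
    for log_path in sorted(Path(log_dir).glob("*.log")):
        data = read_new_data(log_path, offsets)
        if data:
            upload(log_path.name, data)
    save_offsets(state_file, offsets)
```

Run from cron, each call uploads only what was appended since the previous run. Note that the offsets are saved only after all uploads, so a crash in between would re-send data on the next run; whether that is acceptable depends on the target.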
02-03-2016
01:02 PM
1 Kudo
Which version of Ambari are you running? 2.2, like the documentation? It was removed in one version but reintroduced pretty much immediately, and I definitely have it in 2.1.2. So it would be weird if it was gone in 2.2. https://issues.apache.org/jira/browse/AMBARI-12707
02-03-2016
10:11 AM
In that case you cannot use SPNEGO. But you might be able to stop users from administering queues (killing applications). I have seen these settings not working on a non-kerberized cluster, but it might have been a bug.
02-02-2016
03:51 PM
3 Kudos
Sqoop might work or, if you want to be closer to realtime, IBM CDC. Change Data Capture from IBM also has a Hadoop connector, so I would hope they can talk to each other. (Mainframe versions are sometimes very different, but I would assume you can forward the changes to a normal CDC instance, which then should have the BigData connector.) http://www-03.ibm.com/software/products/en/ibminfochandatacaptforzos
Depending on how you want the data in HDFS, there is for example a webhdfs connector in IBM CDC (now InfoSphere Data Replication?): https://www.ibm.com/developerworks/community/files/app?lang=en#/file/c04518fb-8ff3-4b2a-9fb9-38733478bb9b
02-02-2016
02:55 PM
2 Kudos
Yes, you can use Oozie. Let's concentrate on it, because to run queries in parallel efficiently you will most likely need an Oozie workflow anyway, regardless of whether you use Falcon or a classic Oozie coordinator to schedule it. (Falcon can kick off Oozie workflows.)
What Oozie workflows provide is the ability to create an execution graph where each action can continue on to any other node (cycles are forbidden). To allow parallel action execution you have forks and joins: a fork starts two or more actions in parallel, and a join waits for all of the forked paths that lead into it to finish. So you can pretty much create any structure you want. The example below is very simple, but you could also have a fork inside a fork and so on. There are surely other ways as well, but Oozie will most likely be the canonical way of doing it.
For example:
<!-- excerpt: the check-files action, the kill node and the end node are not shown -->
<start to="check-files"/>
<fork name="parallel-load">
    <path start="load1"/>
    <path start="load2"/>
</fork>
<action name="load1">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <jdbc-url>jdbc:hive2://hiveserver:10000/default</jdbc-url>
        <password>${hivepassword}</password>
        <script>/data/sql/load1.sql</script>
    </hive2>
    <ok to="join-node"/>
    <error to="kill"/>
</action>
<action name="load2">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <jdbc-url>jdbc:hive2://hiveserver:10000/default</jdbc-url>
        <password>${hivepassword}</password>
        <script>/data/sql/load2.sql</script>
    </hive2>
    <ok to="join-node"/>
    <error to="kill"/>
</action>
<join name="join-node" to="end"/>
02-02-2016
02:35 PM
This was most likely it. Sorry for not accepting for so long, but when I changed the <falconfolder>/staging/falcon folder manually it all worked and I forgot about it. Thanks a lot. https://issues.apache.org/jira/browse/FALCON-1647