Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| | 2089 | 08-12-2016 01:02 PM |
| | 1233 | 08-08-2016 10:00 AM |
| | 1214 | 08-03-2016 04:44 PM |
| | 2543 | 08-03-2016 02:53 PM |
| | 708 | 08-01-2016 02:38 PM |
01-27-2016
09:55 AM
1 Kudo
OK, the exec tag executes a shell script in the local working directory of the Oozie launcher, for example /hadoop/yarn/.../oozietmp/myscript.sh. You have no idea in advance which directory that is or on which server it is located; it is some YARN temp dir. The file tag is there to put something into this temp dir, and you can rename the file as well using the # syntax. So if your shell script is in HDFS at hdfs://tmp/myfolder/myNewScript.sh but you do not want to change the exec tag for some reason, you can do <file>/tmp/myfolder/myNewScript.sh#myscript.sh</file> and Oozie will take the file from HDFS, put it into the temp folder before execution, and rename it. You can use the file tag to upload any kind of file ( like jars or other dependencies ). As far as I can see, the ${EXEC} is just a variable they set somewhere with no specific meaning. Oh, last but not least: if you want to avoid the file tag, you can also simply put these files into a lib folder inside the workflow folder. Oozie will upload all of those files by default.
01-27-2016
09:48 AM
4 Kudos
Kafka producers are Java programs. Here is a very simple example of a KafkaProducer that uses Syslog4j, which I did a while back: https://github.com/benleon/SysLogProducer In your case you need a program that can pull the data from the webservice and then push it into the producer. You need to serialize your payload into a byte array for Kafka, but apart from that the main work will be connecting to the WSDL.
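For what it's worth, here is a minimal sketch of what such a producer can look like with the newer Java producer API; the broker address, topic name, and the fetchFromWebservice() helper are placeholders, not part of the linked project:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WebserviceProducer {
    // hypothetical helper: call the webservice and serialize the result to bytes
    private static byte[] fetchFromWebservice() {
        return new byte[0];
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667"); // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<String, byte[]>(props);
        byte[] payload = fetchFromWebservice(); // Kafka needs the message as a byte array
        producer.send(new ProducerRecord<String, byte[]>("mytopic", payload)); // placeholder topic
        producer.close();
    }
}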
01-26-2016
01:03 PM
Just be wary of potential load issues. We reached the connection limit of our consolidated PostgreSQL database because all services were pointing to the same DB. This essentially stopped Oozie and Hive randomly. The biggest culprit seems to have been Ranger: if auditing to DB is switched on, it puts quite a load on the database.
01-26-2016
11:35 AM
1 Kudo
I tried to set a Falcon retention period on a feed, expecting it to delete old folders after the specified time period ( in this case 7 days ). However, this does not happen. Does anybody know how to debug that? Where would Falcon write any log information about this? In the Falcon action of the Oozie workflow surrounding the process, or in the Falcon server logs? Or is there anything else I need to do? The process runs every 15 minutes, so Falcon shouldn't need extra scheduled cleanup tasks.
<clusters>
<cluster name="xxx" type="source">
<validity start="2016-01-14T12:45Z" end="2033-01-13T20:00Z"/>
<retention limit="days(7)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/xxx/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
</locations>
Labels:
- Apache Falcon
01-25-2016
11:36 AM
2 Kudos
I wrote this as an answer because of the character limit: yes, first go into Ambari, or perhaps better the OS, and search for the tez.lib.uris property in the properties file:
less /etc/tez/conf/tez-site.xml
You should find something like this: <value>/hdp/apps/${hdp.version}/tez/tez.tar.gz</value> If this is not available, you may have a different problem ( Tez client not installed, some configuration issue ). You can then check if these files exist in HDFS with hadoop fs -ls /hdp/apps/ and find the version number, for example 2.3.2.0-2950:
[root@sandbox ~]# hadoop fs -ls /hdp/apps/2.3.2.0-2950/tez
Found 1 items
-r--r--r-- 3 hdfs hadoop 56926645 2015-10-27 14:40 /hdp/apps/2.3.2.0-2950/tez/tez.tar.gz
You can check whether this file is somehow corrupted with hadoop fs -get /hdp/apps/2.3.2.0-2950/tez/tez.tar.gz and then try to untar it to see if that works. If the file doesn't exist in HDFS, you can find it in the installation directory of HDP ( /usr/hdp/2.3.2.0-2950/tez/lib/tez.tar.gz on the local filesystem ) and put it into HDFS.
01-25-2016
10:20 AM
1 Kudo
There are different possibilities. Normally this means the Tez libraries are not present in HDFS. Are you using the sandbox? You should check whether the Tez client is installed on your Pig client machine, whether tez-site.xml contains the tez.lib.uris property, and whether the Tez libraries are actually in HDFS and valid ( download them and untar to check ): /hdp/apps/<hdp_version>/tez/tez.tar.gz https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html
01-25-2016
10:00 AM
Hmmm, weird, the order shouldn't really make a difference. I assume Hive added a reducer by doing that; that is the only explanation I have. Adding a distribute by would most likely also have helped, but sort is good for predicate pushdown, so as long as everything is good ... 🙂
01-14-2016
06:23 PM
1 Kudo
Apart from apreduce.reduce.java.opts=-Xmx4096m missing an m ( which I don't think is the problem ): how many days are you loading? You are essentially doing dynamic partitioning, so the task needs to keep memory for every day you load into. If you have a lot of days, this might be the reason. Possible solutions:
a) Try to load one day and see if that makes it better.
b) Use dynamic sorted partitioning ( slide 16 ); this theoretically should fix the problem if this is the reason.
c) Use manual distribution ( slide 19 ).
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
01-13-2016
12:25 PM
1 Kudo
That is very curious. I have seen lots of small stripes being created because of memory problems, but normally the writer only gets down to 5000 rows and then runs out of memory. Which version of Hive are you using? What are your memory settings for the Hive tasks? And if the file is small, is it possible that the table is partitioned and the task is writing into a large number of partitions at the same time? Can you share the LOAD command and the table layout?
01-11-2016
05:32 PM
Ah, nice undercover magic. I will try it and see what happens if I switch the active one off.
01-11-2016
05:29 PM
I have seen the question for HA NameNodes; however, HA Resource Managers still confuse me. In Hue, for example, you are told to add a second resource manager entry with the same logical name, i.e. Hue supports adding two resource manager URLs and it will try both. How does that work in Falcon? How can I enter an HA Resource Manager into the interfaces of the cluster entity document? For NameNode HA I would use the logical name, and the program would then read hdfs-site.xml. I have seen other similar questions for Oozie, but I am not sure it was answered, or I didn't really understand it. https://community.hortonworks.com/questions/2740/what-value-should-i-use-for-jobtracker-for-resourc.html So assuming my active resource manager is mycluster1.com:8050 and the standby is mycluster2.com:8050
01-07-2016
02:05 PM
2 Kudos
You could use a shell action, add the keytab to the Oozie files ( file tag ), and do the kinit yourself before running the java command. Obviously not that elegant, and you have a keytab somewhere in HDFS, but it should work. I did something similar with a shell action running a Scala program and doing a kinit before ( not against Hive, but running kinit and then connecting to HDFS ). Ceterum censeo, I would always suggest using a Hive server with LDAP/PAM authentication. beeline and the hive2 action have a password file option now, and it makes life so much easier. As a database guy, Kerberos for a JDBC connection just always causes problems. Here is the Oozie shell action, by the way:
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>runJavaCommand.sh</exec>
<file>${nameNode}/scripts/runJavaCommand.sh#runJavaCommand.sh</file>
<file>${nameNode}/securelocation/user.keytab#user.keytab</file>
</shell>
Then just add a kinit to the script before running Java:
kinit -kt user.keytab user@EXAMPLE.COM
java org.apache.myprogram
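As a rough alternative sketch ( my addition, assuming you control the Java program yourself ): instead of shelling out to kinit, you can log in from the keytab inside the JVM with Hadoop's UserGroupInformation. The principal and keytab name below are placeholders matching the file tag above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        UserGroupInformation.setConfiguration(conf);
        // log in with the keytab shipped via the <file> tag (placeholder principal and path)
        UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "user.keytab");
        FileSystem fs = FileSystem.get(conf);                // now authenticated against the cluster
        System.out.println(fs.exists(new Path("/tmp")));
    }
}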
01-04-2016
01:59 PM
It looks like a very useful command for debugging. Never used it before. Shame it seems to be broken.
01-04-2016
12:23 PM
Are you sure that dump behaves the same? If I do ( using your data ):
a = load '/tmp/test' using PigStorage(',') as (year,month,day);
dump a;
I get (2015,,08)(2015,,09)... And if I do
b = foreach a generate month;
dump b;
I get ()()(). So it looks to me like PigStorage works perfectly fine with dump. If I use illustrate, though, everything goes wrong. After using illustrate, even the dump command fails with a NullPointerException. So not only does it not work correctly, it breaks the grunt shell until I restart it. I think the problem is the illustrate command, which is not too surprising given the warning on top of it in the Pig docs: "Illustrate: (Note! This feature is NOT maintained at the moment. We are looking for someone to adopt it.)"
01-04-2016
12:12 PM
I assume you are using HDP? In that case PIG_HOME is set when executing the pig command. If you cat /usr/bin/pig you can find the line export PIG_HOME=${PIG_HOME:-/usr/hdp/2.3.2.0-2950/pig}. So you could run this manually.
01-04-2016
11:40 AM
2 Kudos
If you have 4 nodes, HDFS will not be able to place 8 copies. It looks like some tmp files from Accumulo. While I do not know Accumulo too well, some small files like jars normally have a high replication level so they are locally available on most nodes. You can check the filesystem with:
hadoop fsck / -files -blocks -locations
Normally programs honour a parameter called max replication, which in your case should be 4, but it seems Accumulo doesn't always do that: https://issues.apache.org/jira/browse/ACCUMULO-683 Is this causing any problems, or are you just worried about the errors in the logs?
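If the warnings themselves bother you, here is a rough sketch ( my assumption, not from this thread ) of capping the replication factor of such files with the HDFS Java API; the path below is a placeholder:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CapReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // placeholder path: point this at the over-replicated tmp file reported by fsck
        Path file = new Path("/accumulo/tmp/some-file.jar");
        fs.setReplication(file, (short) 4);   // cap at the number of datanodes in the cluster
        fs.close();
    }
}
The same can also be done from the command line with hadoop fs -setrep.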
01-04-2016
11:31 AM
1 Kudo
There are several possible reasons why a Hive action in Oozie might fail: missing jars ( Oozie needs to start it with the sharelib, and that needs to be configured correctly ), security ( is Kerberos configured? ), or just bad SQL. You would normally find more information in the logs, either in the YARN logs of the Oozie launcher task or the Hive task, or in the Oozie logs. Hue lets you click through all of them pretty conveniently.
12-17-2015
09:27 AM
I implemented something similar to that: I wanted to run a data load every hour but load a dimension table from a database only every 12 hours. I couldn't use two coordinators, since the load would fail if the dimension table were loaded at the same time, so doing it in the same workflow was better. Instead of having a coordinator that starts two workflows, I have a parameter in the coordinator that is passed to the workflow and contains the date and hour, like this: 2015121503 ( 2015-12-15, hour 03 ) <property>
<name>hour</name>
<value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddHH')}</value>
</property>
I then use a decision node in the workflow to run the Sqoop action only every 12 hours ( in this case ) and do the load alone in all other cases; the Sqoop action then continues with the load action. ( hour % 100 extracts the HH part of yyyyMMddHH, so the modulo check below matches hours 00 and 12. ) <start to="decision"/>
<decision name="decision">
<switch>
<case to="sqoop">
${( hour % 100) % 12 == 0}
</case>
<default to="load"/>
</switch>
</decision>
12-16-2015
11:12 AM
3 Kudos
And as a workaround you should be able to remove it from the hive.exec.pre/post/failure.hooks parameters in Ambari/Hive/Config/AdvancedConfig, if this is really what causes the error. Perhaps a bug?
12-16-2015
10:09 AM
3 Kudos
I would think this is not enough information to give you an answer. Tez will normally be faster even on big amounts of data, but your setup is pretty unusual: huge amounts of memory ( Hive is normally CPU- or IO-bound, so more nodes are in general a better idea ) and very big tasks, which could lead to garbage collection issues ( Tez reuses tasks, which does not happen in MapReduce ). Also, you don't change the sort memory, for example, so that could still be small. The biggest question I have is about cluster and CPU utilization when you run both jobs, i.e. is the cluster fully utilized ( run top on a node ) when you run the MapReduce job but not with Tez? Or is it waiting on a specific node? Tez dynamically adjusts the number of reducers, so it is possible that it decided on fewer tasks. Running set hive.tez.exec.print.summary=true; can help you figure out which part of your query took longest. The second question would be query complexity: Tez allows the reuse of tasks during execution, so complex queries work better. It is always possible that there is a specific query that will run better on MapReduce.
12-16-2015
09:38 AM
3 Kudos
Just a question for clarification: can you do an hdfs dfs -ls /user/oozie? If the test1 folder is not owned by user admin ( admin only has rwx but is not the owner ), then admin cannot change the ownership either. That is the same as in Linux. I suppose this is not the case here, but I just wanted to clarify.
12-08-2015
10:14 AM
3 Kudos
I have a question about monitoring an application that reads files from one datasource ( Oozie ) and streams logs through a streaming pipeline: Logs -> Flume -> Kafka -> Storm -> HDFS. I would like to monitor this application, and apart from email actions in the Oozie workflow and Ambari monitoring errors in the setup, I was wondering if anybody has done something like that before.
1) Storm topology / broker / Flume agent failures: is there any way to add Ambari alerts or an Ambari view that shows all of this in one place?
2) Data problems, for example if data stops flowing from the source.
Something I have tried: pushing the number of received and inserted tuples from Storm into Ambari Metrics and showing it in Ambari. Has anybody done something like that? Are custom charts in Ambari supported now? Any simpler solutions?
Labels:
- Apache Flume
- Apache Kafka
- Apache Storm
11-12-2015
01:04 PM
How do you connect via SFTP? It's not supported by DistCp, and I didn't want to always load the data into a temp folder on the edge node. So in the end I used sshfs at customer_xyz. It worked pretty well.
11-11-2015
07:06 PM
1 Kudo
I suppose you can use HAProxy, for example. However, if you have Kerberos and SPNEGO you would need to add the proxy tickets, similar to the Oozie HA setup described here in the Cloudera doc ( I would use ours if we actually described that ): http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_sg_oozie_ha_kerberos.html
11-11-2015
04:20 PM
7 Kudos
Using Hive in Oozie can be challenging. There are two available actions, the Hive action and the Hive2 action. The Hive action uses the Hive client and needs to set up a lot of libraries and connections; I ran into a lot of issues with it, especially related to security. Also, the logs are not available on the Hive server and Hive server settings are not honoured. The Hive2 action is a solution for this: it runs a beeline command and connects through JDBC to the Hive server. The example below assumes that you use LDAP or PAM security for your Hive server. <action name="myhiveaction">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://server:10000/default</jdbc-url>
<password>${hivepassword}</password>
<script>/data/sql/dayly.sql</script>
<param>database=${database}</param>
<param>day=${day}</param>
</hive2>
<ok to="end"/>
<error to="kill"/>
</action>
The problem with this is that everybody who has access to the Oozie logs can see the password in the hivepassword parameter, which can be less than desirable. Luckily, beeline provides a new option to use a password file, i.e. a file containing the Hive password:
beeline -u jdbc:hive2://sandbox:10000/default -n user -w passfile
Here passfile is a file containing your password without any newline at the end, just the password. To use that in the action you can pass it as an argument. However, you still need to upload the passfile to the Oozie execution folder. This can be done in two ways: create a lib folder under your workflow directory and put it there, or use the file tag. <action name="myhiveaction">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://server:10000/default</jdbc-url>
<script>/data/sql/dayly.sql</script>
<param>database=${database}</param>
<param>day=${day}</param>
<argument>-wpassfile</argument>
<file>/user/myuser/passfile#passfile</file>
</hive2>
<ok to="end"/>
<error to="kill"/>
</action>
This will copy the password file to the temp directory, and beeline will use it for authentication. Only the owner of the Oozie workflow needs access to that file, but other people can still see the logs ( just not the password ). Note: the Hive2 action seems to be picky about parameters. It is important to use -wpassfile, not -w passfile; the space will cause it to fail because the space is added to the filename. This is different from command-line beeline.
Tags:
- Data Processing
11-09-2015
01:51 PM
2 Kudos
I think the answer depends much more on the number of queries per second than on RAM. 1 GB is not enough, but the moment you have 8-12 GB you should be fine outside of very specific use cases. The problem is more that the Hive server reaches its limits when you run 10-15 queries per second. It is better in 2.3, which has parallel planning, but it will not be able to do much more than 10-20 q/s in any case. Adding more RAM will not help you; increasing the number of parallel server threads and, obviously, adding additional Hive servers will. That said, in most situations the Hive server will not be the bottleneck when you run into these kinds of query numbers.
10-26-2015
04:52 AM
Thanks for the links, but I think we followed the instructions from the wiki ( adding the URL to the Firefox settings ). Do you think it's possible that the issue is having multiple Kerberos tickets on the Windows machine? In other words, does SPNEGO send all of them, or only the primary one ( which would be the wrong one, the one the user gets directly from AD )?
10-23-2015
03:43 AM
We have a user at xxxx who wants to access the web UI but gets a 401 on his Windows machine. We have a valid ticket for the realm of the cluster, but also a ticket for a different realm ( the primary realm of the machine ). We have done the steps for preparing Firefox as specified in the Storm UI question, but it does not work. Any idea how to specify a principal? Also, a small add-on: we sometimes see a 302 in curl instead of a 200. We can also see this in the Ambari alerts, but Ambari seems to think it is OK ( as in, the timeline server is 302 and Oozie 200, but I got a 302 from the Oozie curl ). What does this mean exactly?
10-02-2015
04:34 AM
1 Kudo
That was actually the jar I was using. The hive-jdbc.jar is a link to the standalone jar, but I still had to add the two other jars; otherwise I got ClassNotFound exceptions.
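For reference, a minimal sketch of the kind of JDBC connection in question once the jars are on the classpath; the host, database, and credentials are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcTest {
    public static void main(String[] args) throws Exception {
        // register the HiveServer2 JDBC driver (from the standalone jar)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // placeholder URL and credentials
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://server:10000/default", "user", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("show tables");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        conn.close();
    }
}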