Created on 01-21-2015 08:40 AM - edited 09-16-2022 02:19 AM
Hello,
I'm looking for a good tutorial on how to schedule Impala jobs in Oozie.
The only threads that I found about this subject are:
o https://issues.apache.org/jira/browse/OOZIE-1591
o https://groups.google.com/a/cloudera.org/forum/#!topic/impala-user/8vM7fKR7F3A
Can you please help? (give at least one example)
Created 01-21-2015 10:55 AM
Hey,
Currently there is no Impala action, so you must use a shell action that calls impala-shell. The shell script that calls impala-shell must also set the Python eggs cache location (PYTHON_EGG_CACHE). Here is an example shell script:
#!/bin/bash
export PYTHON_EGG_CACHE=./myeggs
/usr/bin/kinit -kt cconner.keytab -V cconner
impala-shell -q "invalidate metadata"
NOTICE the PYTHON_EGG_CACHE: you must set this location or the job will fail. The script also does a kinit for the case of a Kerberized cluster. Here is the workflow that goes with that script:
<workflow-app name="shell-impala-invalidate-wf" xmlns="uri:oozie:workflow:0.4">
<start to="shell-impala-invalidate"/>
<action name="shell-impala-invalidate">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>shell-impala-invalidate.sh</exec>
<file>shell-impala-invalidate.sh#shell-impala-invalidate.sh</file>
<file>cconner.keytab#cconner.keytab</file>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
You must include the <file> tag for the shell script; the keytab <file> entry is only needed if you are using Kerberos.
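In case it is useful, here is a minimal sketch of the job.properties and launch command for a workflow like this (host names, ports and the HDFS application path are placeholders, not from the original post):
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=default
oozie.wf.application.path=${nameNode}/user/cconner/apps/shell-impala-invalidate
Then submit and start it with the Oozie CLI:
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run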
Hope this helps.
Thanks
Chris
Created 02-11-2015 10:32 PM
Hi cconner
It worked with your example, thanks!
But when I want to execute impala-shell -f *.sql in Oozie, there is an error saying it couldn't find the file.
I had uploaded the SQL file into the Oozie workspace and added a path for it.
I also tried passing a shell argument, like ./impala-shell -f $2, but it didn't work.
Could you please tell me how to do this the right way?
This is my xml ( I replaced sensitive info by ***):
<workflow-app name="impala-shelltest" xmlns="uri:oozie:workflow:0.4">
<credentials>
<credential name="hcat" type="hcat">
<property>
<name>hcat.metastore.uri</name>
<value>thrift://***:9083</value>
</property>
<property>
<name>***</name>
<value>***</value>
</property>
</credential>
<credential name="hive2" type="hive2">
<property>
<name>hive2.jdbc.url</name>
<value>jdbc:hive2://***:10000/default</value>
</property>
<property>
<name>hive2.server.principal</name>
<value>hive/***</value>
</property>
</credential>
</credentials>
<start to="impalashell"/>
<action name="impalashell" cred="hcat,hive2">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>runstats.sh</exec>
<argument>***.keytab</argument>
<argument>***.sql</argument>
<file>runstats.sh#runstats.sh</file>
<file>***.keytab#***.keytab</file>
<file>***.sql#***.sql</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Created 02-12-2015 06:47 AM
The method you used was correct and should have worked. Did you try a command like:
impala-shell -f ./file.sql
Created 02-12-2015 10:55 PM
Hi
No, it didn't work....
I assume impala-shell -f only supports a local path, but the hard part is that we can't know which node will run the script, and we can't sync the file to every node for security reasons.
Do you have more suggestion ?
TK
Created 02-12-2015 11:49 PM
Hello,
I also had this problem (impala-shell -f only supports a local path).
In my case, I wrote a shell script that downloads the file from HDFS to /tmp/somedirectory.
Afterwards, I call the impala-shell -f command.
(I did something like this:)
FILE_TO_LAUNCH_LOCAL='/tmp'
FILE_TO_LAUNCH_LOCAL_WITH_TIMESTAMP=${FILE_TO_LAUNCH_LOCAL}/Oozie${TIMESTAMP}
export PYTHON_EGG_CACHE=./myeggs
mkdir ${FILE_TO_LAUNCH_LOCAL_WITH_TIMESTAMP}
hdfs dfs -copyToLocal ${FILE_TO_LAUNCH} ${FILE_TO_LAUNCH_LOCAL_WITH_TIMESTAMP}/
impala-shell -f ${FILE_TO_LAUNCH_LOCAL_WITH_TIMESTAMP}/myfile
rm -r ${FILE_TO_LAUNCH_LOCAL_WITH_TIMESTAMP}
Alina
Created 02-13-2015 05:59 AM
This right here works:
<workflow-app name="shell-impala-invalidate-wf" xmlns="uri:oozie:workflow:0.4">
<start to="shell-impala-invalidate"/>
<action name="shell-impala-invalidate">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>shell-impala-invalidate.sh</exec>
<file>shell-impala-invalidate.sh#shell-impala-invalidate.sh</file>
<file>shell-impala-invalidate.sql#shell-impala-invalidate.sql</file>
<file>cconner.keytab#cconner.keytab</file>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Script:
#!/bin/bash
LOG=/tmp/shell-impala-invalidate-$USER.log
ls -alrt > $LOG
export PYTHON_EGG_CACHE=./myeggs
/usr/bin/kinit -kt cconner.keytab -V cconner
/usr/bin/klist -e >> $LOG
impala-shell -f shell-impala-invalidate.sql
NOTE: the <file> tag puts that file on the local file system where the impala-shell is going to run, so the file is indeed local for the -f flag and it goes in "PWD/<whatever is after #>". For example:
<file>test.sql#/test1/test.sql</file>
Then test.sql will be found in:
PWD/test1/test.sql
And:
<file>test.sql#test.sql</file>
Then test.sql will be found in:
PWD/test.sql
And the shell script and the keytab are also in "PWD" because of the file tags. I would do the following in your shell script to get some more insight:
#!/bin/bash
LOG=/tmp/shell-impala-invalidate-$USER.log
ls -alrtR > $LOG #This will show you all the files in the directory and their relative paths
export PYTHON_EGG_CACHE=./myeggs
/usr/bin/kinit -kt cconner.keytab -V cconner
/usr/bin/klist -e >> $LOG
hadoop fs -put $LOG /tmp #put the log file in HDFS to find it easily
impala-shell -f shell-impala-invalidate.sql
NOTICE the "ls -lartR" and the "hadoop fs" command, this way you can easily grab the log file from HDFS and see what files are actually there.
Created 08-07-2015 08:19 AM
In general, 3 files are needed:
1) Linux shell script to invoke
2) SQL file with Impala queries/commands
3) Keytab
Those 3 files should be uploaded to the workflow deployment folder of the Oozie workflow and referenced as follows:
<workflow-app name="logdirs" xmlns="uri:oozie:workflow:0.4">
<start to="invalidate_oozie"/>
<action name="invalidate_oozie" cred="">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>${SHELL_SCRIPT}</exec>
<env-var>PYTHON_EGG_CACHE=./python-eggs</env-var>
<file>${SHELL_SCRIPT}#${SHELL_SCRIPT}</file>
<file>${KEYTAB_FILE}#${KEYTAB_FILE}</file>
<file>${SQL_FILE}#${SQL_FILE}</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Then replace the variables when invoking the workflow.
Those 3 files will then be available/reachable to the shell script under the $PWD path.
So kinit -kt <keytab_file> -V <username> can be done from the linux shell-script as
kinit -kt $PWD/<keytab_file> -V <username>
and impala invoked as:
impala-shell -f $PWD/<sql_file>
In reality, because the script is invoked from the PWD path, the $PWD variable can be omitted. The important thing is to include the 3 files in the shell action so that they are available at runtime.
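As a sketch only (values below are placeholders, not from the original post), the variables can be defined in the job.properties used to launch the workflow:
SHELL_SCRIPT=invalidate.sh
KEYTAB_FILE=myuser.keytab
SQL_FILE=invalidate.sql
or overridden on the command line with -D when submitting:
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -DSQL_FILE=other_queries.sql -run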
Created 10-01-2015 02:32 AM
I hope I didn't necro this one. I just want to ask whether I need the Python eggs setting if I just want to schedule a job for Impala. Also, since I am using the Oozie web REST API, I wanted to know if there is any XML sample I could refer to, especially when I need the SQL line to be dynamic, e.g. the first HTTP request would run "select * from table1" while the next one would run "select * from table2".
Created 10-01-2015 06:34 AM
Hello,
For me it didn't work without the Python eggs setting.
In order to be dynamic, I replaced some well-defined token sequences in the shell script.
For example, I had my Impala script look something like this:
select * from ${table1};
select * from ${table2};
Then in the shell script I did something like:
sed "s/\${table1}/my_real_table_one/g;s/\${table2}/my_real_table_two/g;" $LOCAL_FILE_PATH > $LOCAL_FILE_WITH_VARIABLE_REPLACED
I hope this and my other post with the copyToLocal command for copying the file locally will help you.
Alina GHERMAN
Created 02-22-2016 01:58 PM
Thanks to all who participated in this thread. Check out the Community Knowledge article we created based on it. 🙂
Created on 09-01-2016 03:40 AM - edited 09-01-2016 05:25 AM
I tried to follow the tutorial, but for some reason I get the following error:
2016-09-01 12:32:17,252 WARN org.apache.oozie.action.hadoop.ShellActionExecutor: SERVER[******] USER[****] GROUP[-] TOKEN[] APP[PV3] JOB[0000050-160512133914543-oozie-oozi-W] ACTION[0000050-160512133914543-oozie-oozi-W@shell-7170] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.ShellMain], main() threw exception, Cannot run program "shell-impala-invalidate.sh" (in directory "/data/4/yarn/nm/usercache/****/appcache/application_1463053085953_30120/container_e49_1463053085953_30120_01_000002"): error=2, No such file or directory
2016-09-01 12:32:17,252 WARN org.apache.oozie.action.hadoop.ShellActionExecutor: SERVER[******] USER[****] GROUP[-] TOKEN[] APP[PV3] JOB[0000050-160512133914543-oozie-oozi-W] ACTION[0000050-160512133914543-oozie-oozi-W@shell-7170] Launcher exception: Cannot run program "shell-impala-invalidate.sh" (in directory "/data/4/yarn/nm/usercache/****/appcache/application_1463053085953_30120/container_e49_1463053085953_30120_01_000002"): error=2, No such file or directory
java.io.IOException: Cannot run program "shell-impala-invalidate.sh" (in directory "/data/4/yarn/nm/usercache/*******/appcache/application_1463053085953_30120/container_e49_1463053085953_30120_01_000002"): error=2, No such file or directory
My workflow.xml
<workflow-app name="shell-impala-invalidate-wf" xmlns="uri:oozie:workflow:0.4">
<start to="shell-impala-invalidate"/>
<action name="shell-impala-invalidate">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>shell-impala-invalidate.sh</exec>
<file>shell-impala-invalidate.sh#shell-impala-invalidate.sh</file>
<file>shell-impala-invalidate.sql#shell-impala-invalidate.sql</file>
<file>****.keytab#****.keytab</file>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Shell:
#!/bin/bash
LOG=/tmp/shell-impala-invalidate-$USER.log
ls -alrtR > $LOG #This will show you all the files in the directory and their relative paths
export PYTHON_EGG_CACHE=./myeggs
/usr/bin/kinit -kt ***.keytab -V ***
/usr/bin/klist -e >> $LOG
hadoop fs -put $LOG /tmp #put the log file in HDFS to find it easily
impala-shell -f shell-impala-invalidate.sql
I feel a bit lost now. Could any of you please give me a push in the right direction? What is wrong?
Thanks
Created 09-02-2016 02:51 AM
Hello,
Your problem is not linked to the Impala scheduling itself, but to the shell action: Oozie cannot find your shell file.
1. What are the permissions on the file? Does Oozie have access?
shell-impala-invalidate.sh
2. What are the permissions on the folder?
/data/4/yarn/nm/usercache/*******/appcache/application_1463053085953_30120/container_e49_1463053085953_30120_01_000002
(this folder is on one of your workers)
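For example (paths are placeholders), the permissions on the deployed workflow files in HDFS can be checked and adjusted with:
hdfs dfs -ls -R /user/<username>/apps/shell-impala-invalidate-wf
hdfs dfs -chmod -R 755 /user/<username>/apps/shell-impala-invalidate-wf
The container directory above is a local directory on a worker node, so it can only be inspected with ls -ld directly on that node.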
Alina
Created 09-02-2016 04:30 AM
Hi Alina,
First attempt:
I placed everything into my own user directory (/user/<username>) and gave the same permissions as in the attached picture.
Second attempt:
All the files were placed into the Oozie workspace (/user/hue/oozie/workspaces/...) with the above permissions, but the log message is still the same.
2:
Can I check the permissions on this folder through the web UI?
Created 09-15-2016 05:35 AM
Thanks for the fruitful discussion, it saved me time. I have just made it work by making the changes below.
My workflow app xml :
<workflow-app name="shell-impala-invalidate-wf" xmlns="uri:oozie:workflow:0.4">
<start to="shell-impala-invalidate"/>
<action name="shell-impala-invalidate">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${EXEC}</exec>
<file>${EXEC}#${EXEC}</file>
<file>${EXECQSCRIPT}#${EXECQSCRIPT}</file>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
where EXEC and EXECQSCRIPT are the paths to the shell script and my SQL file respectively.
In my shell script, I just needed to add the following lines (I am not using a secured environment here, so no Kerberos):
export PYTHON_EGG_CACHE=/home/python-egg-catch
impala-shell -f shellq.sql
where shellq.sql is my query file and python-egg-catch is a folder I created with the proper permissions.
Thanks
Created on 11-03-2016 06:03 AM - edited 11-03-2016 06:06 AM
Hi - I see this is quite an old thread, but it was very useful in getting me going. I've ended up writing a wrapper script that makes the shell action behave like a HiveServer2 action from a parameter perspective. Here it is:
Usage Notes:
There are 4 mandatory parameters which must be in the correct ordinal position:
Parameters 5 onwards can be used for tokens in your reference SQL file that you want to be substituted. These take the form <key>=<value> and work in exactly the same way as HiveServer2 action parameters: a token in the SQL file in the format "${<name>}" will be substituted with "<value>". In our example we have 2 additional parameters:
So the ${db_name} and ${table_name} tokens will be substituted with the supplied values:
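A minimal, hypothetical sketch of the idea described above (purely illustrative: the four fixed parameters are assumed here to be keytab, principal, impalad host and SQL file, which the post does not spell out) could look like this:
#!/bin/bash
# Hypothetical sketch, not the author's actual wrapper script.
KEYTAB=$1       # assumed fixed parameter 1
PRINCIPAL=$2    # assumed fixed parameter 2
IMPALAD=$3      # assumed fixed parameter 3 (impalad host:port)
SQL_FILE=$4     # assumed fixed parameter 4 (reference SQL file)
shift 4

export PYTHON_EGG_CACHE=./myeggs
kinit -kt "$KEYTAB" -V "$PRINCIPAL"

# Parameters 5 onwards are key=value pairs; substitute ${key} tokens
# in the SQL file, HiveServer2-action style.
RENDERED=./rendered_$$.sql
cp "$SQL_FILE" "$RENDERED"
for kv in "$@"; do
  key="${kv%%=*}"
  value="${kv#*=}"
  sed -i "s|\\\${$key}|$value|g" "$RENDERED"
done

impala-shell -i "$IMPALAD" -k -f "$RENDERED"
rm -f "$RENDERED"
In the workflow, the <argument> elements would then pass the four fixed values first, followed by pairs such as db_name=mydb and table_name=mytable.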
Created 12-10-2018 02:07 AM