Member since: 01-28-2016
Posts: 38
Kudos Received: 14
Solutions: 1

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 874 | 03-11-2016 04:16 PM |
05-10-2019
01:21 PM
Hi all, As mentioned in the title, I'm trying to run a shell action that kicks off a Spark job, but unfortunately I'm consistently getting the following error:

19/05/10 14:03:39 ERROR AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: Could not set up IO Streams to <hbaseregionserver>
Fri May 10 14:03:39 BST 2019, RpcRetryingCaller{globalStartTime=1557493419339, pause=100, retries=2}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: <hbaseregionserver>

I've been playing around trying to get the script to pick up the Kerberos ticket but am having no luck. As far as I can tell, the Oozie job is not able to pass the Kerberos ticket on. Any ideas why it isn't being picked up? I'm at a loss. The related code is below.

Oozie workflow action:

<action name="sparkJ" cred="hive2Cred">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${oozieQueueName}</value>
</property>
</configuration>
<exec>run.sh</exec>
<file>/thePathToTheScript/run.sh#run.sh</file>
<file>/thePathToTheProperties/myp.properties#myp.properties</file>
<capture-output />
</shell>
<ok to="end" />
<error to="fail" />
</action>

Shell script:
#!/bin/sh
export job_name=SPARK_JOB
export configuration=myp.properties
export num_executors=10
export executor_memory=1G
export queue=YARNQ
export max_executors=50
kinit -kt KEYTAB KPRINCIPAL
echo "[[[[[[[[[[[[[ Starting Job - name:${job_name}, configuration:${configuration} ]]]]]]]]]]]]]]"
/usr/hdp/current/spark2-client/bin/spark-submit \
--name ${job_name} \
--driver-java-options "-Dlog4j.configuration=file:./log4j.properties" \
--num-executors ${num_executors} \
--executor-memory ${executor_memory} \
--master yarn \
--keytab KEYTAB \
--principal KPRINCIPAL \
--supervise \
--deploy-mode cluster \
--queue ${queue} \
--files "./${configuration},./hbase-site.xml,./log4j.properties" \
--conf spark.driver.extraClassPath="/usr/hdp/current/hive-client/lib/datanucleus-*.jar:/usr/hdp/current/tez-client/*.jar" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=./jaas.conf -Dlog4j.configuration=file:./log4j.properties" \
--conf spark.executor.extraClassPath="/usr/hdp/current/hive-client/lib/datanucleus-*.jar:/usr/hdp/current/tez-client/*.jar" \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=${max_executors} \
--conf spark.streaming.concurrentJobs=2 \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.yarn.security.tokens.hive.enabled=true \
--conf spark.yarn.security.tokens.hbase.enabled=true \
--conf spark.streaming.kafka.maxRatePerPartition=5000 \
--conf spark.streaming.backpressure.pid.maxRate=3000 \
--conf spark.streaming.backpressure.pid.minRate=200 \
--conf spark.streaming.backpressure.initialRate=5000 \
--jars /usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
--class myclass myjar.jar ./${configuration}

Many thanks for any help you can provide.
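(An aside, not from the original post: a minimal diagnostic sketch, assuming the keytab really is shipped alongside the script, is to confirm inside run.sh that the keytab was localised and that the kinit actually produced a ticket before spark-submit runs.)

# hypothetical additions to run.sh, placed around the existing kinit line
ls -l ./*.keytab || echo "WARN: no keytab appears to have been localised into this container"
kinit -kt KEYTAB KPRINCIPAL || { echo "ERROR: kinit failed"; exit 1; }
klist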
Labels:
- Apache Oozie
- Apache Spark
07-30-2018
03:59 PM
Hi Nikhil, Can you perform any commands on that table? Perhaps try dropping the partition. It seems that the data was removed from HDFS at some point, but the Hive table's metadata still thinks those partitions exist.

ALTER TABLE tableName DROP IF EXISTS PARTITION (date_part="2018022313");
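(A hedged follow-up sketch rather than part of the reply above; tableName, date_part and JDBC_URL are placeholders. It can help to first list the partitions the metastore still records before dropping the stale one.)

# list what the metastore thinks exists, then drop the stale partition
beeline -u "$JDBC_URL" -e "SHOW PARTITIONS tableName;"
beeline -u "$JDBC_URL" -e "ALTER TABLE tableName DROP IF EXISTS PARTITION (date_part='2018022313');"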
04-11-2018
03:26 PM
Hi all, A bit of background: I have been struggling with an issue in which a table gets locked, but the lock gets stuck in a "WAITING" state and so does not allow the query to progress; after about two hours the job complains that the table is locked and that it cannot access it. I'm trying to identify why the table is getting locked, which brings me to my question: if a user runs a query but then disconnects the session, can this cause a table to get locked in this state?

| lockid | database | table | partition | lock_state | blocked_by | lock_type | transaction_id | last_heartbeat | acquired_at | user | hostname | agent_info |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78011111.2 | db | tbl | NULL | WAITING | 78043210.2 | EXCLUSIVE | NULL | 1523459564452 | NULL | user | host | hive_20180411161225_2b33e811-e44c-59ds-afb3-b4111fcb019a |
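(Another aside, not part of the original question: a minimal sketch of how the lock above can be inspected from Hive, assuming placeholder names db.tbl and JDBC_URL.)

# show current locks on the table, including which lock id is blocking which
beeline -u "$JDBC_URL" -e "SHOW LOCKS db.tbl EXTENDED;"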
Labels:
- Apache Hive
10-13-2017
11:13 AM
Thought as much, thanks for confirming
10-13-2017
10:29 AM
Hi all, Sorry for the basic question; I've had little success searching online and just need clarification on whether something is possible. Can I run a beeline command against a file that is in HDFS? I know we use -f in beeline to specify a file when it's on the local file system, but can this also be done against a file on HDFS? My use case is that I'd like to run a beeline command through a shell action in Oozie. I'm hitting some issues using Hive2 actions, so I wanted to try a shell action instead. Any help is much appreciated, Thanks
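(A hedged sketch of two common workarounds, since -f normally expects a local path; the HDFS path and JDBC_URL are placeholders.)

# option 1: copy the HQL file down from HDFS, then run it with -f
hdfs dfs -get /user/me/scripts/query.hql /tmp/query.hql
beeline -u "$JDBC_URL" -f /tmp/query.hql
# option 2: stream the file straight from HDFS into beeline on stdin
hdfs dfs -cat /user/me/scripts/query.hql | beeline -u "$JDBC_URL"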
Labels:
- Apache Hadoop
- Apache Hive
09-15-2017
01:47 PM
Hi all, I was hoping someone could confirm whether something I'm trying to do is possible, because I'm currently hitting multiple issues. I would like to add a record into a Hive table using an insert statement; within this insert statement I have one column which should add a count value to the table based on the result of a query. My Hive SQL is below:

use ${database};
set hivevar:deltaCount = select count(*) from ${database}.${hive_table};
DROP TABLE IF EXISTS ${database}.process_status_stg_${hive_table};
create table ${database}.process_status_stg_${hive_table} (
taskName varchar(50) COMMENT 'Name of the task being run to populate data',
starttime varchar(50) COMMENT 'time of record addition',
status varchar(50) COMMENT 'status of the task',
workflowID varchar(50) COMMENT 'workflow ID that is running the task',
oozieErrorCode varchar(50) COMMENT 'Error code returned by Oozie',
recordsLoadedCount varchar(50) COMMENT 'records pulled in previous load') ;
insert into table ${database}.process_status_stg_${hive_table} values ('${hive_table}','${current_time}','${taskStatus}','${workflowID}','${errorCode}', (CASE ${taskStatus} WHEN 'COMPLETED' THEN '${hiveconf:deltaCount}' ELSE 'N/A' end as recordsLoadedCount));
Any help is much appreciated, Thanks
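(Not the original SQL, just a hedged sketch with hypothetical concrete names: mydb.my_table stands in for ${database}.${hive_table}, and the quoted literals stand in for the Oozie-substituted parameters. One alternative is to compute the count inline in the insert's SELECT rather than capturing it into a variable first.)

cat > /tmp/process_status.hql <<'SQL'
-- constants and an aggregate are allowed in the same SELECT, so the count can
-- be taken in the statement that writes the status row; 'COMPLETED' here is
-- whatever value gets substituted for the task status parameter
insert into table mydb.process_status_stg_my_table
select 'my_table', '2017-09-15 13:47:00', 'COMPLETED', 'wf-id', 'OK',
       case when 'COMPLETED' = 'COMPLETED' then cast(count(*) as string) else 'N/A' end
from mydb.my_table;
SQL
beeline -u "$JDBC_URL" -f /tmp/process_status.hql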
Labels:
- Apache Hive
09-14-2017
01:40 PM
Like it says, your permissions haven't been set up quite right. It could be down to some permissions not being applied after your update. Check Ranger (or whatever access control manager you're using) and confirm that the relevant user/group has the expected write permission on your HDFS location(s).
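(A hedged sketch of quick checks to go with this; the path is a placeholder.)

# confirm the owner, group and mode HDFS reports for the target directory,
# and whether any ACLs are layered on top of the basic permissions
hdfs dfs -ls -d /path/to/target/dir
hdfs dfs -getfacl /path/to/target/dir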
08-07-2017
04:28 PM
Does this happen every time you run the Oozie workflow, or is it just a one-time event? If it's a one-off, I'd just kill it.
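(A sketch of the kill itself; the Oozie URL and job id are placeholders.)

oozie job -oozie http://oozie-host:11000/oozie -kill 0000123-180807000000000-oozie-oozi-W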
06-13-2016
02:15 PM
Thanks for the response, Ben. The changes you suggested have worked to a degree, and I know that, due to the use of partitions, there should be no further degradation in the speed of the query, but this query can still take up to a minute to complete. I shall continue to look into other solutions to this issue and post them if I find any.
06-08-2016
10:34 AM
Hi all, I am currently pulling the max value of a timestamp column from my tables in Hive and using it to pull data after that date with Sqoop; I am using Oozie to perform these steps. This is currently done by running a query against the Hive table to put the value into HDFS, where it is then picked up by another Oozie action before being passed to the Sqoop action. This all runs perfectly fine; however, retrieving the max timestamp value and putting it into HDFS is currently very slow, and I can only see this getting slower as more data is inserted into the table. The Hive SQL I am using to pull this value is below:

INSERT OVERWRITE DIRECTORY '${lastModifiedDateSaveLocation}' select max(${timestamp_column}) from ${hive_table};

Can anyone suggest a more optimized solution to retrieve this max timestamp? Thanks for your help, Dan
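(Not part of the original question, just a hedged sketch of one alternative: grab the value with a single beeline call from a shell action and write the tiny result file to HDFS yourself, so no separate pick-up action is needed. JDBC_URL, ts_col, my_table and the output path are placeholders for the parameters above.)

# undecorated single-value output, kept in a shell variable
last_modified=$(beeline -u "$JDBC_URL" --silent=true --outputformat=tsv2 \
  -e "select max(ts_col) from my_table;" | tail -1)
# write the value straight to HDFS for the Sqoop step to read
echo "${last_modified}" | hdfs dfs -put -f - /user/me/last_modified/value.txt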
06-03-2016
04:08 PM
I am trying to obtain the date for which the Falcon process is running in my Oozie workflow. Any idea how this can be passed to the workflow or obtained directly? Any help is much appreciated.
Labels:
- Apache Falcon
- Apache Oozie
04-15-2016
03:00 PM
I ended up going with your approach, Ben, as it suited what I was trying to do a bit better, and after much fiddling around I managed to get it working. However, I am getting the value back from my query like this:

lastModified=+------------------------+--+ | 2016-03-31 21:59:57.0 | +------------------------+--+

whereas all I really want is the date value, not the extra decoration. Is this something I can use a regex to get rid of? Thanks
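(A hedged sketch of the regex route; $rawOutput stands in for the captured value shown above.)

# keep only the timestamp, dropping the +---+ table decoration
# (passing --silent=true --outputformat=tsv2 to beeline avoids the decoration entirely)
lastModified=$(echo "$rawOutput" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:]{8}(\.[0-9]+)?')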
04-12-2016
10:59 AM
2 Kudos
Hi all, I was hoping someone might be able to detail whether what I am attempting to do is currently possible in Oozie and, if so, how it could be done. I have seen many sources on getting an output from a shell action and feeding it into a Hive action; I have not, however, seen much on whether this can be done the other way around. My issue is that I would like to run a Hive action which captures the most recent field in a table based on the max timestamp. I would then like to pass this timestamp value over to a shell action, which will take the value and put it in the WHERE statement for a Sqoop extract. How would I go about passing this value from the Hive action to the shell action? Is this possible? Please let me know if you need any additional information; thanks in advance.
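(A hedged sketch rather than a confirmed Oozie recipe: one commonly used pattern swaps the Hive action for a shell action with <capture-output/>, running the query through beeline and echoing a key=value pair that later actions can read via wf:actionData. JDBC_URL, ts_col, my_table and the action name are placeholders.)

# inside the shell action's script
max_ts=$(beeline -u "$JDBC_URL" --silent=true --outputformat=tsv2 \
  -e "select max(ts_col) from my_table;" | tail -1)
echo "maxTimestamp=${max_ts}"
# a downstream action can then reference ${wf:actionData('shellAction')['maxTimestamp']}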
Labels:
- Apache Hive
- Apache Oozie
- Apache Sqoop
03-11-2016
04:16 PM
2 Kudos
I've resolved this issue; it turned out the problem came from a driver option that was being used:

--driver oracle.jdbc.OracleDriver

I removed the driver option and the problem was solved!
03-11-2016
03:47 PM
1 Kudo
I should also note that the desired output appears if I use --table instead of --query; however, --query is necessary in order to tidy up the data.
03-11-2016
02:43 PM
3 Kudos
I'm having an issue I was hoping someone could help me with; I'll try to explain as best I can. I'm currently trying to run an incremental Sqoop job which pulls new data down based on when the job was last run. To do this I have used the incremental append option. This Sqoop job also contains a free-form query which is used to clean up columns containing additional whitespace etc. The problem I am having is that I am doing the incremental import based on a timestamp column, and the lower-bound and upper-bound values being generated by the Sqoop job are not of a timestamp type, so I am receiving type errors whenever I try to run the job. My question is: is there a way to insert a TO_TIMESTAMP function around the lower-bound and upper-bound values so that they are in the correct format when they are compared to the timestamp column? Some example code below:

Current output

WHERE ts > '2014-08-10 10:09:36.094' AND ts <= '2016-03-05 10:09:36.094'

Desired output

WHERE ( ts > TO_TIMESTAMP('2014-08-10 10:09:36.094', 'YYYY-MM-DD HH24:MI:SS.FF') AND ts <= TO_TIMESTAMP('2016-03-05 10:09:36.094', 'YYYY-MM-DD HH24:MI:SS.FF') )
Labels:
- Apache Hadoop
- Apache Sqoop
02-04-2016
10:10 AM
I will look into it, thanks for your help
02-03-2016
05:20 PM
My issue is that the files being copied across will have the date and time in the filename; they will also be updated daily, so it will be next to impossible to know in advance what the names of the files to be copied will be.
02-03-2016
05:07 PM
2 Kudos
I've been trying to find the solution to this problem for a while. I have found that in a normal file system, using shell, you can use this command to move all files under a location but leave the directories alone:

find . -maxdepth 1 -type f -exec mv {} destination_path \;

I was wondering if there is also a command to do the same in HDFS. So if I have a folder in HDFS called "folder1" which contains the files "copyThis.txt", "copyThisAsWell.txt" and "theFinalCopy.txt" and also contains a folder "doNotCopy", and I want to copy the files into a new folder called "folder2" but leave the folder "doNotCopy" behind, how can this be done in HDFS? Thanks for any help you can provide.
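(A hedged sketch, since there is no single built-in flag for this: filter the ls output down to plain files and copy each one; the paths are placeholders.)

# file entries start with '-' in hdfs dfs -ls output, directories with 'd';
# keep only the files and copy them one by one into folder2
hdfs dfs -ls /path/folder1 | grep '^-' | awk '{print $NF}' | while read -r f; do
  hdfs dfs -cp "$f" /path/folder2/
done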
Labels:
- Apache Hadoop
- Apache Oozie
01-29-2016
11:38 AM
Thanks, this is very helpful in understanding the partitioning. I've made the changes you suggested, but I'm still not getting any data into my table despite the staging table being populated with data. Do you know why this might be the case?
01-28-2016
05:29 PM
1 Kudo
Hi all, I'm having an issue I was hoping someone could help me with; I believe it's due to how my tables are being partitioned, but I'm struggling to come up with a solution. I have created a table such as the example one below:

CREATE TABLE Demo
(time timestamp COMMENT 'timestamp in format yyyymmddTss:mm:hh',
exampleId varchar(6) COMMENT 'example field',
example2 varchar(10) COMMENT 'example field',
example3 varchar(50) COMMENT 'example field',
example4 varchar(50) COMMENT 'example field'
)
COMMENT 'A table to demonstrate my problem'
PARTITIONED BY (TRAN_DATE DATE COMMENT 'Transaction Date')
CLUSTERED BY (exampleId)
SORTED BY (exampleId) INTO 24 BUCKETS
stored as orc;
And I am then trying to copy data from a CSV file into that table using an external staging table such as the one below:

DROP TABLE Demo_staging;
CREATE TABLE Demo_staging
(time timestamp COMMENT 'timestamp in format yyyymmddTss:mm:hh',
exampleId varchar(6) COMMENT 'example field',
example2 varchar(10) COMMENT 'example field',
example3 varchar(50) COMMENT 'example field',
example4 varchar(50) COMMENT 'example field'
)
COMMENT 'The staging table to demonstrate my problem'
row format delimited fields terminated by ',' null defined as '\001'
STORED AS TEXTFILE
LOCATION '${appPath}/raw'
tblproperties ("skip.header.line.count"="1", "skip.footer.line.count"="2");
insert overwrite table Demo partition (TRAN_DATE = ${day}) SELECT * FROM Demo_staging;
The value in TRAN_DATE should be a date in YYYYMMDD format, derived from the field time, in which all the values are set to 2015-06-20T00:00:00, but I'm not sure how TRAN_DATE is supposed to get this value. The value of ${day} is 20150620. I've tried using the following as a test to see the data appearing, but have had no luck:

insert overwrite table Demo partition (to_char(time,YYYY-MM-DD) = ${day}) SELECT * FROM Demo_staging;

I can see the data has appeared in my staging table, but it does not make it into the actual table, and I can only think the partitioning is the reason for this. Any help is greatly appreciated. Thanks
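(A hedged sketch rather than a confirmed fix: one way to let TRAN_DATE come from the time column itself is a dynamic-partition insert, assuming dynamic partitioning is acceptable in your environment. JDBC_URL is a placeholder.)

cat > /tmp/load_demo.hql <<'SQL'
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- needed on older Hive versions when inserting into the bucketed target table
set hive.enforce.bucketing=true;
-- derive the partition value from the time column; the partition column goes
-- last in the SELECT list
insert overwrite table Demo partition (TRAN_DATE)
select time, exampleId, example2, example3, example4, to_date(time) as TRAN_DATE
from Demo_staging;
SQL
beeline -u "$JDBC_URL" -f /tmp/load_demo.hql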
Labels:
- Apache Hive