Member since
01-30-2017
36
Posts
1
Kudos Received
0
Solutions
06-29-2017
11:42 AM
1 Kudo
@Harsh J Is there any documentation available on connecting to HBase from PySpark? I would like to know how we can create DataFrames and read from and write to HBase from PySpark. Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.
06-29-2017
11:39 AM
@Harsh J I have the same problem. First I created an HBase table with two column families, like below:

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'
hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds
hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'
hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'
hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds
hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds
hbase(main):017:0> scan 'pyspark'
ROW COLUMN+CELL
1 column=cf1:a, timestamp=1498758639265, value=spark
1 column=cf2:b, timestamp=1498758656282, value=pyspark
2 column=cf1:a, timestamp=1498758678501, value=df
2 column=cf2:b, timestamp=1498758690263, value=python

Then in the pyspark shell I did the following:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark').option('hbase.table','pyspark').option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b').option('hbase.use.hbase.context', False).option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml').load()

Then pyspark.show() gave me this result:

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:
1) Why am I getting null values in column A of the DataFrame?
2) Should we manually pass the column family and column names in the hbase.columns.mapping option of the statement that creates the DataFrame?
3) Or is there a generic way of doing this?
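For comparison, here is the same read written so that each HBase column maps to a distinct DataFrame column name. This is only a sketch: df, A1, B1, A2 and B2 are made-up names, it assumes the pyspark shell where sqlContext is already defined, and I cannot confirm whether it resolves the null values; only the column mapping differs from the statement above.

# Sketch: same connector and options as above, but with unique DataFrame column names.
df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, '
            'A1 STRING cf1:a, B1 STRING cf1:b, '
            'A2 STRING cf2:a, B2 STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()
df.show()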
04-25-2017
11:43 AM
@mbigelow Did you get a chance to have a look at my scripts and my problem? It looks like I am stuck on this; I cannot figure out any solution, and the only option left is to use cron jobs, which I don't want to do.
04-19-2017
07:05 AM
@mbigelow Hi, did you get a chance to look at my problem?
04-18-2017
09:18 AM
@HillBilly I have checked the blog you posted. It says fork and join must be used together. But in my script I will be creating new tables from existing tables; I don't have anything to join. So it looks like I should not use fork and join in my workflow.
04-17-2017
03:27 PM
@mbigelow I have another script, bigshell.sh, that takes shell.sh and executes it 10 times in parallel. When I run this script in Oozie, it fails.

Contents of bigshell.sh:

nl -n rz test | xargs -n 2 --max-procs 10 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'

When I change --max-procs to 1 instead of 10, the script is successful:

nl -n rz test | xargs -n 2 --max-procs 1 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'
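For reference, the same invocation with comments describing what each part does (a sketch only; it assumes the file named test holds one table name per line):

#!/bin/bash
# Sketch of bigshell.sh as posted above.
# nl -n rz prefixes every line of "test" with a zero-padded line number, so
# xargs -n 2 hands sh two arguments per invocation: the number becomes $0 and
# the table name becomes $1. --max-procs 10 keeps up to 10 copies of shell.sh
# running at the same time.
nl -n rz test | xargs -n 2 --max-procs 10 \
    sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'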
04-17-2017
02:46 PM
@mbigelow Here is my shell script:

#!/bin/bash
LOG_LOCATION=/home/$USER/logs
exec 2>&1
[ $# -ne 1 ] && { echo "Usage : $0 table ";exit 1; }
table=$1
TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${TIMESTAMP}.success_log
touch /home/$USER/logs/${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log
#Function to get the status of the job creation
function log_status
{
status=$1
message=$2
if [ "$status" -ne 0 ]; then
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
#echo "Please find the attached log file for more details"
exit 1
else
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
fi
}
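# Run the CTAS statement in Hive; its exit status is checked below via $?.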
hive -e "create table testing.${table} stored as parquet as select * from fishing.${table}"
g_STATUS=$?
log_status $g_STATUS "Hive create ${table}"
echo "***********************************************************************************************************************************************************************"
04-17-2017
01:38 PM
@mbigelow The args file contains all the tables. I want to pass 10 arguments from that same file, and for the workflow I believe I have to pass only one args file. How can I pass 10 arguments from a single file to the same workflow so that the script runs in parallel?
04-17-2017
10:01 AM
What I am looking for: in a Linux shell we can run the same script for 10 different arguments at the same time. Can I do the same in Oozie? When I call a new shell script that executes my script for 10 arguments in parallel on Linux, it works fine, but when I try to do the same in Oozie the job fails. Why is this happening? What changes should I make to my workflow?
04-16-2017
10:03 AM
I have a shell script in HDFS and have scheduled it in Oozie with the following workflow.

Workflow:

<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-8f63"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-8f63">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file}</argument>
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
            <file>/user/xxxx/args/${input_file}#${input_file}</file>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

job.properties:

nameNode=xxxxxxxxxxxxxxxxxxxx
jobTracker=xxxxxxxxxxxxxxxxxxxxxxxx
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/xxxxxxx/xxxxxx

args file:

tableA
tableB
tablec
tableD

Right now the shell script runs for a single table from the args file, and this workflow executes successfully without any errors. How can I schedule this shell script to run in parallel? I want the script to run for 10 tables at the same time. What are the steps needed to do so, and what changes should I make to the workflow? Should I create 10 workflows to run 10 parallel jobs, or what is the best way to deal with this?
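For context, the usual Oozie construct for running one action several times in parallel is a fork/join pair of control nodes. Below is a minimal sketch with just two parallel shell actions; the action names and the input_file1/input_file2 properties are made up for illustration, and the same pattern would have to be repeated (or generated) for all 10 tables:

<workflow-app name="Shell_parallel" xmlns="uri:oozie:workflow:0.5">
    <start to="fork-tables"/>
    <!-- fork launches every listed path at the same time -->
    <fork name="fork-tables">
        <path start="shell-table1"/>
        <path start="shell-table2"/>
    </fork>
    <action name="shell-table1">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file1}</argument>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        </shell>
        <ok to="join-tables"/>
        <error to="Kill"/>
    </action>
    <action name="shell-table2">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file2}</argument>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        </shell>
        <ok to="join-tables"/>
        <error to="Kill"/>
    </action>
    <!-- join waits until every forked path has completed -->
    <join name="join-tables" to="End"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>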
Labels:
- Apache Oozie