Member since
01-30-2017
36
Posts
1
Kudos Received
0
Solutions
06-29-2017
11:42 AM
1 Kudo
@Harsh J Is there any documentation available on connecting to HBase from PySpark? I would like to know how we can create DataFrames and read from and write to HBase from PySpark. Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.
06-29-2017
11:39 AM
@Harsh J I have the same problem. First I created an HBase table with two column families, like below:

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'
hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds
hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'
hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'
hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds
hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds
hbase(main):017:0> scan 'pyspark'
ROW COLUMN+CELL
1 column=cf1:a, timestamp=1498758639265, value=spark
1 column=cf2:b, timestamp=1498758656282, value=pyspark
2 column=cf1:a, timestamp=1498758678501, value=df
2 column=cf2:b, timestamp=1498758690263, value=python

Then in the pyspark shell I did the following:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark').option('hbase.table','pyspark').option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b').option('hbase.use.hbase.context', False).option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml').load()

Then pyspark.show() gave me this result:

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:
1) Why am I getting null values in column A of the DataFrame?
2) Should we manually pass the column family and column names in the hbase.columns.mapping option of the statement that creates the DataFrame?
3) Or is there a generic way of doing this?
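For comparison, here is the same read written so that each HBase column maps to a distinct DataFrame column name. This is only a sketch: df, A1, B1, A2 and B2 are made-up names, it assumes the pyspark shell where sqlContext is already defined, and I cannot confirm whether it resolves the null values; only the column mapping differs from the statement above.

# Sketch: same connector and options as above, but with unique DataFrame column names.
df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, '
            'A1 STRING cf1:a, B1 STRING cf1:b, '
            'A2 STRING cf2:a, B2 STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()
df.show()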
04-25-2017
11:43 AM
@mbigelow Did you get a chance to have a look at my scripts and my problem? It looks like I am stuck on this; I cannot figure out any solution, and the only option left is to use cron jobs, which I don't want to do.
04-19-2017
07:05 AM
@mbigelow Hi, did you get a chance to look at my problem?
04-18-2017
09:18 AM
@HillBilly I have checked the blog you posted. It says fork and join must be used together. But in my script I will be creating new tables from existing tables; I don't have anything to join. So it looks like I should not use fork and join in my workflow.
04-17-2017
03:27 PM
@mbigelow I have another script, bigshell.sh, that takes shell.sh and executes it 10 times in parallel. When I run this script in Oozie, it fails.

Contents of bigshell.sh:

nl -n rz test | xargs -n 2 --max-procs 10 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'

When I change --max-procs to 1 instead of 10, the script is successful:

nl -n rz test | xargs -n 2 --max-procs 1 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'
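For reference, the same invocation with comments describing what each part does (a sketch only; it assumes the file named test holds one table name per line):

#!/bin/bash
# Sketch of bigshell.sh as posted above.
# nl -n rz prefixes every line of "test" with a zero-padded line number, so
# xargs -n 2 hands sh two arguments per invocation: the number becomes $0 and
# the table name becomes $1. --max-procs 10 keeps up to 10 copies of shell.sh
# running at the same time.
nl -n rz test | xargs -n 2 --max-procs 10 \
    sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'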
04-17-2017
02:46 PM
@mbigelow Here is my shell script:

#!/bin/bash
LOG_LOCATION=/home/$USER/logs
exec 2>&1
[ $# -ne 1 ] && { echo "Usage : $0 table ";exit 1; }
table=$1
TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${TIMESTAMP}.success_log
touch /home/$USER/logs/${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log
#Function to get the status of the job creation
function log_status
{
status=$1
message=$2
if [ "$status" -ne 0 ]; then
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
#echo "Please find the attached log file for more details"
exit 1
else
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
fi
}
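# Run the CTAS statement in Hive; its exit status is checked below via $?.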
hive -e "create table testing.${table} stored as parquet as select * from fishing.${table}"
g_STATUS=$?
log_status $g_STATUS "Hive create ${table}"
echo "***********************************************************************************************************************************************************************"
04-17-2017
01:38 PM
@mbigelow The args file contains all the tables. I want to pass 10 arguments from that same file, and for the workflow I believe I have to pass only one args file. How can I pass 10 arguments from a single file to the same workflow so that the script runs in parallel?
04-17-2017
10:01 AM
What I am looking for: in a Linux shell we can run the same script for 10 different arguments at the same time. Can I do the same in Oozie? When I call a new shell script that executes my script for 10 arguments in parallel on Linux, it works fine, but when I try to do the same in Oozie the job fails. Why is this happening? What changes should I make to my workflow?
04-16-2017
10:03 AM
I have a shell script in HDFS and have scheduled it in Oozie with the following workflow.

Workflow:

<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-8f63"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-8f63">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file}</argument>
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
            <file>/user/xxxx/args/${input_file}#${input_file}</file>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

job.properties:

nameNode=xxxxxxxxxxxxxxxxxxxx
jobTracker=xxxxxxxxxxxxxxxxxxxxxxxx
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/xxxxxxx/xxxxxx

args file:

tableA
tableB
tablec
tableD

Right now the shell script runs for a single table from the args file, and this workflow executes successfully without any errors. How can I schedule this shell script to run in parallel? I want the script to run for 10 tables at the same time. What are the steps needed to do so, and what changes should I make to the workflow? Should I create 10 workflows to run 10 parallel jobs, or what is the best way to deal with this?
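For context, the usual Oozie construct for running one action several times in parallel is a fork/join pair of control nodes. Below is a minimal sketch with just two parallel shell actions; the action names and the input_file1/input_file2 properties are made up for illustration, and the same pattern would have to be repeated (or generated) for all 10 tables:

<workflow-app name="Shell_parallel" xmlns="uri:oozie:workflow:0.5">
    <start to="fork-tables"/>
    <!-- fork launches every listed path at the same time -->
    <fork name="fork-tables">
        <path start="shell-table1"/>
        <path start="shell-table2"/>
    </fork>
    <action name="shell-table1">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file1}</argument>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        </shell>
        <ok to="join-tables"/>
        <error to="Kill"/>
    </action>
    <action name="shell-table2">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file2}</argument>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        </shell>
        <ok to="join-tables"/>
        <error to="Kill"/>
    </action>
    <!-- join waits until every forked path has completed -->
    <join name="join-tables" to="End"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>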
Labels:
- Apache Oozie