How to run hive, sqoop, pig, sparksql scripts for batch process?

Explorer

Hi, I need to create a batch process that reads from a set of mainframe files stored in a Hadoop cluster (selecting only a few columns and a few records), converts them into ASCII files, and stores them in HDFS. There will be 100 to 300 such files, possibly more. Is there a way to do this using Hive, Pig, Spark, or Java?

Also, I want an automated (scheduled) way to read these files and pass parameters at runtime (such as the file name, path, column names, and a filter condition to filter records). I came across the following: https://community.hortonworks.com/questions/80649/how-can-i-read-mainframe-file-which-is-in-ebcdic-f.html

Can you please suggest some ideas?

2 Replies

Re: How to run hive, sqoop, pig, sparksql scripts for batch process?

Sqoop can import from mainframes into HDFS. EBCDIC-encoded fixed-length data will be stored as ASCII-encoded variable-length text on HDFS.

https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_mainframe_literal

http://blog.syncsort.com/2014/06/big-data/big-iron-big-data-mainframe-hadoop-apache-sqoop/

You can easily create a shell script that calls Sqoop and passes in the argument values at runtime; a rough example is sketched below.
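For illustration, here is a minimal, untested sketch of such a wrapper script. The script name, username, host, dataset, and target directory are placeholder assumptions, not values from this thread; the script simply forwards its arguments to sqoop import-mainframe, which converts the EBCDIC fixed-length records to ASCII text on HDFS.

#!/bin/bash
# import_mainframe.sh -- hypothetical wrapper around sqoop import-mainframe.
# Usage: ./import_mainframe.sh <mainframe-host> <dataset-name> <hdfs-target-dir>

MF_HOST="$1"      # mainframe host name
DATASET="$2"      # partitioned dataset (PDS) to import
TARGET_DIR="$3"   # HDFS directory for the ASCII text output

sqoop import-mainframe \
  --connect "$MF_HOST" \
  --dataset "$DATASET" \
  --username myuser \
  -P \
  --as-textfile \
  --target-dir "$TARGET_DIR"

A script like this can then be scheduled (for example with cron or an Oozie shell action), and the column selection and record filtering can be done after the import, e.g. in Hive or Spark.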

Re: How to run hive, sqoop, pig, sparksql scripts for batch process?

Expert Contributor

Here's exactly how I batch process files in a directory in Spark. You can edit it as needed, build a JAR file, and run it. Works perfectly. The args(0) and args(1) in the code below are where you pass the input and output parameters; a sample spark-submit invocation is sketched after the code.

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.control.Breaks._



object BatchProcessSpark {

  // Spark configuration: Kryo serialization and speculative execution enabled
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.speculation", "true")
    .setAppName("mySparkApp")

  val sc = new SparkContext(conf)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  import sqlContext.implicits._

  // Simple schema for a pipe-delimited record with two fields
  case class test(
    f1: String,
    f2: String
  )

  // Process a single input file: load it, split on "|" and convert to a DataFrame
  def doSomething(file1: String) = {

    val x = sc.textFile(file1)

    val ref_id = x.map(_.split("\\|"))
      .map(x => test(
        x(0).toString,
        x(1).toString
      )).toDF

    // Now do all operations as needed (select columns, filter records, write out, ...)
  }

  // main method
  def main(args: Array[String]) {

    // args(0) = input directory, args(1) = output path
    val files1 = FileSystem.get(sc.hadoopConfiguration).listStatus(new Path(args(0)))
    val OUTPUT = args(1)

    // Call the processing function on each file in the input directory
    files1.foreach(filename => {
      val file = filename.getPath.toString()
      doSomething(file)
    })
  }
}
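To run the job above on a schedule, you would package it as a JAR and launch it with spark-submit, passing the input directory and output path so they arrive as args(0) and args(1). A minimal sketch follows; the script name, JAR file name, and paths are placeholder assumptions, only the class name BatchProcessSpark comes from the code above.

#!/bin/bash
# run_batch.sh -- hypothetical launcher for the Spark job above.
# $1 = HDFS input directory (becomes args(0)), $2 = output path (becomes args(1))
INPUT_DIR="$1"
OUTPUT_PATH="$2"

spark-submit \
  --class BatchProcessSpark \
  --master yarn \
  batch-process-spark.jar \
  "$INPUT_DIR" "$OUTPUT_PATH"

A launcher like this can then be triggered from cron or an Oozie shell action to get the scheduled, parameterized runs asked about in the question.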
