Member since: 06-28-2016
Posts: 34
Kudos Received: 1
Solutions: 0
02-02-2018
06:41 PM
Hi All, in my project I am trying to read log files and process them in Spark. I am using NiFi to read the files from the Tomcat log folder and copy them to the edge node of my Hadoop cluster. The problem is that my application (whose log files I am processing) runs in a clustered environment, and the log file names are the same on all 4 Tomcat nodes of the cluster.

So what I want to do is this: GetFTP fetches the log file from the app server location, the data then flows into an UpdateAttribute processor, which appends a server and cluster identifier (something like server1Cluster1 or server2Cluster1) to the file name, and then PutFile stores the log file in the local file system under the new name, which I will then process in my Spark job.

Can anyone help me with the UpdateAttribute configuration for this case? Is there anything in UpdateAttribute by which I can identify which server the file is coming from, so that I can change the file name passed to PutFile accordingly? Any help will be highly appreciated. Thanks in advance.
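A minimal sketch of the kind of UpdateAttribute setting this describes, assuming one GetFTP -> UpdateAttribute -> PutFile branch per source server (the property value below is just an illustrative placeholder, not a confirmed configuration):

On the UpdateAttribute processor of the branch fed by server 1 / cluster 1, add a dynamic property named filename with the value:

    server1Cluster1_${filename}

Each branch hard-codes its own prefix, and since PutFile writes flowfiles under their filename attribute, the file lands in the local file system under the new, server-specific name.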
Labels:
- Apache NiFi
01-15-2018
03:30 AM
@Bala, sorry for the very late response. Actually my purpose is to read some data files (server logs), transform them into a proper format, and prepare a data warehouse (in my case, Hive) for analysis later on. So in my project I have 3 main activities:
1) read and transform data from the txt/log files (for which I am using Spark -- frequency: daily job)
2) prepare a data warehouse with that daily data (for which I am inserting the Spark DataFrames into a Hive table -- frequency: daily job)
3) show the results (for this I am again using Spark SQL together with Hive, as that is faster than using only Hive queries, and I will use Zeppelin or Tableau for data visualization -- frequency: weekly job or as required)
From my reading and understanding I guess Spark SQL alone plus caching would be much faster than Spark plus Hive, but I think I do not have any other option, as I have to do the analysis on repository data. Do you suggest any other approach for this use case?
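A minimal Scala sketch of steps 1) and 2) above, i.e. the daily read-transform-load into Hive; the path, delimiter, column names and the logs_daily table are hypothetical placeholders rather than the actual project code:

import org.apache.spark.sql.{SaveMode, SparkSession}

object DailyLogLoad {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session so the transformed data can be written straight into the warehouse
    val spark = SparkSession.builder()
      .appName("Daily log load")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // 1) read the raw log lines and transform them into a proper tabular shape
    val parsed = spark.read.textFile("/data/logs/current")   // hypothetical input path
      .map(_.split("\\|"))                                    // hypothetical field delimiter
      .filter(_.length >= 3)
      .map(f => (f(0), f(1), f(2)))
      .toDF("ts", "level", "message")

    // 2) append the daily batch into the Hive warehouse table
    parsed.write.mode(SaveMode.Append).saveAsTable("logs_daily")

    spark.stop()
  }
}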
10-25-2017
10:03 AM
@kgautam Actually my requirement is something like this: 1) read the data from the file, 2) do some filter operations on that data, 3) store it back in Hive for other applications, 4) view that data in Zeppelin from Hive.
10-25-2017
09:32 AM
Hi, I am trying to read a Tomcat log file (size is around 5 GB) and store that data in Hive from Spark. After reading the log file my DataFrame has around 100K rows. But when I try to insert them into Hive I get a "java.lang.OutOfMemoryError: Java heap space" error in the driver. The code is something like this:

spark.sql("insert into table com.pointsData select * from temptable")

where "temptable" is my DataFrame registered as a temp view in Spark. Can anyone help me out with a workaround? Something like splitting the DataFrame and running the insert in small chunks? Please note that I am already using the maximum of my driver's memory, I cannot increase it any more, and I am using Kryo. Thanks in advance.
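A minimal sketch of the "split the DataFrame and insert in small chunks" workaround asked about here; logsDF stands in for the DataFrame behind temptable, and the 10-way split is an arbitrary choice rather than a tested fix for this particular OOM:

// Split the DataFrame into ~10 roughly equal slices and insert them one at a time,
// so each insert statement only has to handle a fraction of the data.
val parts = logsDF.randomSplit(Array.fill(10)(1.0))
parts.zipWithIndex.foreach { case (part, i) =>
  part.createOrReplaceTempView(s"temptable_$i")
  spark.sql(s"insert into table com.pointsData select * from temptable_$i")
}

If the SQL round-trip is not required, part.write.insertInto("com.pointsData") expresses the same append for each slice without going through a temp view.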
Labels:
- Apache Hive
- Apache Spark
09-25-2017
08:59 AM
Thanks a lot, it worked exactly as I wanted. Thanks again. One more thing: is there any link or resource where I can get this kind of information and setup details?
09-23-2017
05:17 PM
Thanks a lot for your help, you saved my day. Thanks again.
09-23-2017
05:15 PM
Hi All, my rolling log file pattern is something like this:
/my/path/directory/my-app-2017-09-06.log
/my/path/directory/my-app-2017-09-07.log
/my/path/directory/my-app-2017-09-08.log
Can anyone suggest what I can set for the properties of a TailFile processor in NiFi to read these? Please note that I also have old files and some other files in the same location, but I want to read only files with this specific file name pattern, and only from today onward, not the old files. I read the documentation available on the NiFi website, but it is not clear to me. Can anyone please help me configure TailFile with this file pattern? Any help will be highly appreciated; I have actually been stuck on this issue for the last 5 days.
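One possible TailFile configuration for this pattern, based on the processor's documented properties; the property names and exact behaviour are worth double-checking against the NiFi version in use:

Tailing mode             : Multiple files
Base directory           : /my/path/directory
File(s) to Tail          : my-app-\d{4}-\d{2}-\d{2}\.log
Rolling Filename Pattern : my-app-*.log
Initial Start Position   : Current Time

In "Multiple files" mode, File(s) to Tail is treated as a regular expression relative to the base directory, so only files matching this specific name pattern are picked up, and starting from "Current Time" is meant to skip content that already exists when the processor starts (i.e. the old files).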
Labels:
- Apache NiFi
09-22-2017
07:23 AM
Hi, in my project I am using NiFi to read log files from Tomcat and process that data in a Spark application, after which the processed data is inserted into a DB. But my problem is that at the app server level I have 4 Tomcat cluster instances (4 different log files) on 2 different boxes, and I have to mark which data comes from which cluster at the Spark level. In my present setup I have 2 TailFile processors per box, all pointing to a single output port, but I am not able to identify which data comes from which cluster at the Spark level. Is there any option in the TailFile processor to add some suffix or prefix or file name (or any attribute) to each record, so that I can identify which cluster each record is coming from and persist it in the DB that way? Any help will be highly appreciated. Thanks in advance.
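TailFile itself does not tag individual records, but one hedged sketch of the per-cluster marking asked about here is an UpdateAttribute processor between each TailFile and the shared output port (the cluster.id attribute name is hypothetical):

Branch 1: TailFile (cluster 1 log) -> UpdateAttribute with property cluster.id = cluster1 -> output port
Branch 2: TailFile (cluster 2 log) -> UpdateAttribute with property cluster.id = cluster2 -> output port

Each flowfile leaving the port then carries a cluster.id attribute alongside its content, which the consuming side can use to tell the clusters apart.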
Labels:
- Apache NiFi
- Apache Spark
08-24-2017
08:17 AM
Hi All, I have a sample table (students1) in Hive which I want to connect to from Spark using JDBC (as Hive is not in the same cluster). I was trying with the following code:

def main(args: Array[String]): Unit = {
  //Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://34.223.237.55:10000")
    .option("dbtable", "students1")
    .option("user", "hduser")
    .option("password", "hadoop")
    //.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
    .load()
  println("able to connect------------------")
  jdbcDF.show
  jdbcDF.printSchema()
  jdbcDF.createOrReplaceTempView("std")
  val sqlDF = spark.sql("select * from std")
  println("Start println-----")
  spark.sqlContext.sql("select * from std").collect().foreach(println)
  println("end println-----")
  sqlDF.show(false)
}

I tried in multiple ways, but every time it shows only the table structure with the column names, like:

+--------------+-------------+-------------+
|students1.name|students1.age|students1.gpa|
+--------------+-------------+-------------+
+--------------+-------------+-------------+

but no data. However, I am able to get data when I query with DBeaver from my local machine using SQL. From Spark, jdbcDF.printSchema() also shows the proper schema, so I guess there is no issue with the connection. I am using Spark 2.1.1 with Hive 1.2.1. My build.sbt file is like this:

libraryDependencies ++= Seq(
  "log4j" % "log4j" % "1.2.17",
  "org.apache.spark" % "spark-core_2.11" % "2.1.1",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.2",
  "org.apache.spark" % "spark-hivecontext-compatibility_2.10" % "2.0.0-preview",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
  "org.apache.spark" % "spark-hive_2.10" % "2.1.1",
  "org.apache.hive" % "hive-jdbc" % "1.2.1"
)

Can anyone suggest why I am not getting any output from show()? Thanks in advance.
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark