Member since: 06-28-2016
Posts: 34
Kudos Received: 1
Solutions: 0
02-02-2018
06:41 PM
Hi All, in my project I am trying to read log files and process them in Spark. I am using NiFi to read the files from the Tomcat log folder and copy them to the edge node of my Hadoop cluster. The problem is that my application (whose log files I am processing) runs in a clustered environment, and the log file names are the same on all 4 Tomcat nodes of the cluster.

So what I want to do is this: GetFTP fetches the log file from the app server location, the data then flows into an UpdateAttribute processor, which appends a server and cluster identifier (something like server1Cluster1 or server2Cluster1) to the file name, and then PutFile stores the log file in the local file system under the new name, which I will then process in my Spark job.

Can anyone help me with the UpdateAttribute configuration for this case? Is there anything in UpdateAttribute by which I can identify which server the file is coming from, so that I can change the file name passed to PutFile accordingly? Any help will be highly appreciated. Thanks in advance.
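A minimal sketch of the kind of UpdateAttribute setting this describes, assuming one GetFTP -> UpdateAttribute -> PutFile branch per source server (the property value below is just an illustrative placeholder, not a confirmed configuration):

On the UpdateAttribute processor of the branch fed by server 1 / cluster 1, add a dynamic property named filename with the value:

    server1Cluster1_${filename}

Each branch hard-codes its own prefix, and since PutFile writes flowfiles under their filename attribute, the file lands in the local file system under the new, server-specific name.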
Labels:
- Apache NiFi
01-15-2018
03:30 AM
@Bala, sorry for the very late response. Actually my purpose is to read some data files (server logs), transform them into a proper format, and prepare a data warehouse (in my case, Hive) for analysis later on. So in my project I have 3 main activities:
1) read and transform data from the txt/log files (for which I am using Spark -- frequency: daily job)
2) prepare a data warehouse with that daily data (for which I am inserting the Spark DataFrames into a Hive table -- frequency: daily job)
3) show the results (for this I am again using Spark SQL together with Hive, as that is faster than using only Hive queries, and I will use Zeppelin or Tableau for data visualization -- frequency: weekly job or as required)
From my reading and understanding I guess Spark SQL alone plus caching would be much faster than Spark plus Hive, but I think I do not have any other option, as I have to do the analysis on repository data. Do you suggest any other approach for this use case?
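A minimal Scala sketch of steps 1) and 2) above, i.e. the daily read-transform-load into Hive; the path, delimiter, column names and the logs_daily table are hypothetical placeholders rather than the actual project code:

import org.apache.spark.sql.{SaveMode, SparkSession}

object DailyLogLoad {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session so the transformed data can be written straight into the warehouse
    val spark = SparkSession.builder()
      .appName("Daily log load")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // 1) read the raw log lines and transform them into a proper tabular shape
    val parsed = spark.read.textFile("/data/logs/current")   // hypothetical input path
      .map(_.split("\\|"))                                    // hypothetical field delimiter
      .filter(_.length >= 3)
      .map(f => (f(0), f(1), f(2)))
      .toDF("ts", "level", "message")

    // 2) append the daily batch into the Hive warehouse table
    parsed.write.mode(SaveMode.Append).saveAsTable("logs_daily")

    spark.stop()
  }
}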
10-25-2017
10:03 AM
@kgautam Actually my requirement is something like this: 1) read the data from the file, 2) do some filter operations on that data, 3) store it back in Hive for other applications, 4) view that data in Zeppelin from Hive.
10-25-2017
09:32 AM
Hi, I am trying to read a Tomcat log file (size is around 5 GB) and store that data in Hive from Spark. After reading the log file my DataFrame has around 100K rows. But when I try to insert them into Hive I get a "java.lang.OutOfMemoryError: Java heap space" error in the driver. The code is something like this:

spark.sql("insert into table com.pointsData select * from temptable")

where "temptable" is my DataFrame registered as a temp view in Spark. Can anyone help me out with a workaround? Something like splitting the DataFrame and running the insert in small chunks? Please note that I am already using the maximum of my driver's memory, I cannot increase it any more, and I am using Kryo. Thanks in advance.
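A minimal sketch of the "split the DataFrame and insert in small chunks" workaround asked about here; logsDF stands in for the DataFrame behind temptable, and the 10-way split is an arbitrary choice rather than a tested fix for this particular OOM:

// Split the DataFrame into ~10 roughly equal slices and insert them one at a time,
// so each insert statement only has to handle a fraction of the data.
val parts = logsDF.randomSplit(Array.fill(10)(1.0))
parts.zipWithIndex.foreach { case (part, i) =>
  part.createOrReplaceTempView(s"temptable_$i")
  spark.sql(s"insert into table com.pointsData select * from temptable_$i")
}

If the SQL round-trip is not required, part.write.insertInto("com.pointsData") expresses the same append for each slice without going through a temp view.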
Labels:
- Apache Hive
- Apache Spark
09-25-2017
08:59 AM
Thanks a lot, it worked exactly as I wanted. Thanks again. One more thing: is there any link or resource where I can get this kind of information and setup details?
09-23-2017
05:17 PM
Thanks a lot for your help, you saved my day. Thanks again.
09-23-2017
05:15 PM
Hi All, my rolling log file pattern is something like this:
/my/path/directory/my-app-2017-09-06.log
/my/path/directory/my-app-2017-09-07.log
/my/path/directory/my-app-2017-09-08.log
Can anyone suggest what I can set for the properties of a TailFile processor in NiFi to read these? Please note that I also have old files and some other files in the same location, but I want to read only files with this specific file name pattern, and only from today onward, not the old files. I read the documentation available on the NiFi website, but it is not clear to me. Can anyone please help me configure TailFile with this file pattern? Any help will be highly appreciated; I have actually been stuck on this issue for the last 5 days.
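One possible TailFile configuration for this pattern, based on the processor's documented properties; the property names and exact behaviour are worth double-checking against the NiFi version in use:

Tailing mode             : Multiple files
Base directory           : /my/path/directory
File(s) to Tail          : my-app-\d{4}-\d{2}-\d{2}\.log
Rolling Filename Pattern : my-app-*.log
Initial Start Position   : Current Time

In "Multiple files" mode, File(s) to Tail is treated as a regular expression relative to the base directory, so only files matching this specific name pattern are picked up, and starting from "Current Time" is meant to skip content that already exists when the processor starts (i.e. the old files).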
Labels:
- Apache NiFi
09-22-2017
07:23 AM
Hi, in my project I am using NiFi to read log files from Tomcat and process that data in a Spark application, after which the processed data is inserted into a DB. But my problem is that at the app server level I have 4 Tomcat cluster instances (4 different log files) on 2 different boxes, and I have to mark which data comes from which cluster at the Spark level. In my present setup I have 2 TailFile processors per box, all pointing to a single output port, but I am not able to identify which data comes from which cluster at the Spark level. Is there any option in the TailFile processor to add some suffix or prefix or file name (or any attribute) to each record, so that I can identify which cluster each record is coming from and persist it in the DB that way? Any help will be highly appreciated. Thanks in advance.
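TailFile itself does not tag individual records, but one hedged sketch of the per-cluster marking asked about here is an UpdateAttribute processor between each TailFile and the shared output port (the cluster.id attribute name is hypothetical):

Branch 1: TailFile (cluster 1 log) -> UpdateAttribute with property cluster.id = cluster1 -> output port
Branch 2: TailFile (cluster 2 log) -> UpdateAttribute with property cluster.id = cluster2 -> output port

Each flowfile leaving the port then carries a cluster.id attribute alongside its content, which the consuming side can use to tell the clusters apart.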
Labels:
- Apache NiFi
- Apache Spark
08-24-2017
08:17 AM
Hi All, I have a sample table (students1) in Hive which I want to connect to from Spark using JDBC (as Hive is not in the same cluster). I was trying with the following code:

def main(args: Array[String]): Unit = {
  //Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://34.223.237.55:10000")
    .option("dbtable", "students1")
    .option("user", "hduser")
    .option("password", "hadoop")
    //.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
    .load()
  println("able to connect------------------")
  jdbcDF.show
  jdbcDF.printSchema()
  jdbcDF.createOrReplaceTempView("std")
  val sqlDF = spark.sql("select * from std")
  println("Start println-----")
  spark.sqlContext.sql("select * from std").collect().foreach(println)
  println("end println-----")
  sqlDF.show(false)
}

I tried in multiple ways, but every time it shows only the table structure with the column names, like:

+--------------+-------------+-------------+
|students1.name|students1.age|students1.gpa|
+--------------+-------------+-------------+
+--------------+-------------+-------------+

but no data. However, I am able to get data when I query with DBeaver from my local machine using SQL. From Spark, jdbcDF.printSchema() also shows the proper schema, so I guess there is no issue with the connection. I am using Spark 2.1.1 with Hive 1.2.1. My build.sbt file is like this:

libraryDependencies ++= Seq(
  "log4j" % "log4j" % "1.2.17",
  "org.apache.spark" % "spark-core_2.11" % "2.1.1",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.2",
  "org.apache.spark" % "spark-hivecontext-compatibility_2.10" % "2.0.0-preview",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
  "org.apache.spark" % "spark-hive_2.10" % "2.1.1",
  "org.apache.hive" % "hive-jdbc" % "1.2.1"
)

Can anyone suggest why I am not getting any output from show()? Thanks in advance.
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark