Member since: 06-28-2016
Posts: 34
Kudos Received: 1
Solutions: 0
02-02-2018
06:41 PM
Hi All, In my project I am reading log files and processing them in Spark. I use NiFi to read the files from the Tomcat log folder and copy them to the edge node of my Hadoop cluster. The problem is that my application (whose log files I am processing) runs in a clustered environment, and the log file names are identical across all 4 Tomcat cluster members. What I want to do: GetFTP fetches the log file from the app server, the data flows into an UpdateAttribute processor that appends a server and cluster identifier (something like server1Cluster1 or server2Cluster1) to the file name, and then PutFile stores the log file in the local file system under the new name, which I will then process in my Spark job. Can anyone help me with the UpdateAttribute configuration for this case? Is there anything in UpdateAttribute that lets me identify which server a file came from, and change the file name passed to PutFile accordingly? Any help will be highly appreciated. Thanks in advance.
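A minimal sketch of the UpdateAttribute side, assuming one GetFTP -> UpdateAttribute -> PutFile flow per source server (the wiring and the server1Cluster1 tag come from the description above; the property value itself is illustrative):

On the UpdateAttribute processor behind the GetFTP for server 1 / cluster 1, add a single user-defined property that rewrites the standard "filename" attribute:

filename = server1Cluster1_${filename}

The matching UpdateAttribute on the server 2 flow would use server2Cluster1_${filename}, and so on. PutFile names the file on disk from the filename attribute, so no extra renaming configuration is needed there.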
... View more
Labels:
- Apache NiFi
01-15-2018
03:30 AM
@Bala, sorry for the very late response. Actually my purpose is to read some data files (server logs), transform them into a proper format, and prepare a data warehouse (in my case, Hive) for later analysis. So my project has three main activities:
1) Read and transform data from the txt/log files (for which I am using Spark -- frequency: daily job).
2) Prepare a data warehouse with that daily data (for which I insert the Spark DataFrames into a Hive table -- frequency: daily job).
3) Show the results (for this I again use Spark SQL along with Hive, as that is faster than using Hive queries alone, and I will use Zeppelin or Tableau for data visualization -- frequency: weekly job or as required).
From my reading and understanding, I guess Spark SQL alone plus caching would be much faster than Spark plus Hive, but I think I do not have any other option, as I have to do the analysis on repository data. Do you suggest any other approach for this use case?
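For step 2, a minimal sketch of what the daily Spark-to-Hive load could look like, assuming a plain-text source file and a Hive-enabled SparkSession; the input path, the parsing, and the table name daily_logs are illustrative assumptions, not taken from the actual project:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

object DailyLogLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DailyLogLoad")            // illustrative app name
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Read the raw log, keep only request lines, and pull out a couple of fields.
    // The input path, the filter, and the target table "daily_logs" are assumptions.
    val parsed = spark.read.textFile("/data/incoming/app.log")
      .filter(_.contains("GET"))
      .map { line => val parts = line.split(" "); (parts(0), line) }
      .toDF("client_ip", "raw_line")
      .withColumn("load_date", current_date())

    // Daily append into the warehouse table; Zeppelin/Tableau query this table later.
    parsed.write.mode("append").saveAsTable("daily_logs")
  }
}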
... View more
10-25-2017
10:03 AM
@kgautam Actually my requirement is something like this: 1) read the data from the file, 2) do some filter operations on that data, 3) store it back in Hive for other applications, 4) view that data in Zeppelin from Hive.
... View more
10-25-2017
09:32 AM
Hi, I am trying to read a Tomcat log file (around 5 GB) and store that data in Hive from Spark. After reading the log file my DataFrame has around 100K rows, but when I try to insert them into Hive I get a "java.lang.OutOfMemoryError: Java heap space" error in the driver. The code is something like this:

spark.sql("insert into table com.pointsData select * from temptable")

where "temptable" is the temp view of my DataFrame in Spark. Can anyone help me with a workaround? Something like splitting the DF and running the insert in small chunks? Please note I am already using the maximum driver memory my system allows, I cannot increase it any more, and I am using Kryo. Thanks in advance...
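A minimal sketch of the "insert in chunks" idea, assuming the DataFrame is already registered as the temp view "temptable" and the target table is the one in the query above; the ten-way split is arbitrary:

// `spark` is the existing Hive-enabled SparkSession from the application.
// Split the DataFrame and append it to the Hive table slice by slice,
// instead of one single INSERT ... SELECT.
val df = spark.table("temptable")
val slices = df.randomSplit(Array.fill(10)(1.0))           // ten roughly equal slices
slices.foreach { slice =>
  slice.write.mode("append").insertInto("com.pointsData")  // table name taken from the post
}

Whether this actually avoids the driver OOM depends on where the memory is going, so treat it as something to try rather than a guaranteed fix.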
... View more
Labels:
- Apache Hive
- Apache Spark
10-21-2017
05:04 PM
Creating a HiveContext is not recommended any more, I guess, as per the newer Spark versions. In fact, in Spark 2 it has been deprecated in favour of SparkSession with enableHiveSupport().
... View more
10-21-2017
11:36 AM
Hi All, I am trying to access Hive from a Spark application written in Scala. My code is as follows:

val hiveLocation = "hdfs://master:9000/user/hive/warehouse"
val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]").set("spark.sql.warehouse.dir", hiveLocation)
val sc = new SparkContext(conf)
val spark = SparkSession
  .builder()
  .appName("SparkHiveExample")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", hiveLocation)
  .config("spark.driver.allowMultipleContexts", "true")
  .enableHiveSupport()
  .getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("select * from test").show()
println("End of SQL session-------------------")

But it ends with the error message "Table or view not found". However, when I run "show tables;" in the Hive console I can see that table, and I can run "select * from test". Everything is in the "user/hive/warehouse" location. Just for testing I also tried creating a table from Spark, just to find out the table location:

val spark = SparkSession
  .builder()
  .appName("SparkHiveExample")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", hiveLocation)
  .config("spark.driver.allowMultipleContexts", "true")
  .enableHiveSupport()
  .getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("CREATE TABLE IF NOT EXISTS test11(name String)")
println("End of SQL session-------------------")

This code also executed properly (with a success note), but the strange thing is that I cannot find this table from the Hive console. Even when I run "select * from TBLS;" in MySQL (in my setup MySQL is configured as the Hive metastore), I do not see the tables that were created from Spark. Is the Spark warehouse location different from the Hive console's? (As per my knowledge both must be the same location.) What do I have to do to access an existing Hive table from Spark? Please suggest... Thanks in advance...
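A minimal sketch of the usual way to point Spark at an existing remote Hive metastore, assuming the metastore thrift service runs on its default port 9083 on the host "master" (both the host and the port are assumptions, not taken from the post):

// Tell Spark which Hive metastore to talk to, instead of letting it create a
// local Derby metastore next to the application. Host and port are assumptions.
val spark = SparkSession
  .builder()
  .appName("SparkHiveExample")
  .master("local[*]")
  .config("hive.metastore.uris", "thrift://master:9083")
  .config("spark.sql.warehouse.dir", "hdfs://master:9000/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show tables").show()   // should now list the same tables as the Hive console

Equivalently, copying the cluster's hive-site.xml into Spark's conf directory exposes the same setting without hard-coding it.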
... View more
Labels:
- Apache Hive
- Apache Spark
09-25-2017
08:59 AM
Thanks a lot, it worked exactly as I wanted. Thanks again. One more thing: is there any link or resource where I can get this kind of information or setup details?
... View more
09-23-2017
05:17 PM
Thanks a lot for your help, you saved my day. Thanks again.
... View more
09-23-2017
05:15 PM
Hi All, My rolling log file pattern is something like this:

/my/path/directory/my-app-2017-09-06.log
/my/path/directory/my-app-2017-09-07.log
/my/path/directory/my-app-2017-09-08.log

Can anyone suggest what to set for the properties of a NiFi TailFile processor to read these? Please note that the directory also contains old files and some other files, but I want to read only files with this specific name pattern, and only from today onward, not the old ones. I read the documentation available on the NiFi website, but it was not clear to me. Can anyone please help me configure TailFile with this file pattern? Any help will be highly appreciated; I have actually been stuck on this issue for the last 5 days.
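A minimal sketch of one way to configure TailFile for this layout, assuming a NiFi 1.x TailFile that supports a "Multiple files" tailing mode; the exact property names and values below are illustrative and should be checked against the processor documentation for your NiFi version:

Tailing mode           : Multiple files
Base directory         : /my/path/directory
File(s) to Tail        : my-app-.*\.log
Initial Start Position : Current Time

With "Multiple files" mode the regular expression in File(s) to Tail picks up each day's new file automatically, and an Initial Start Position of "Current Time" is one way to skip content that already existed before the processor was started.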
... View more
Labels:
- Apache NiFi
09-22-2017
07:23 AM
Hi, In my project I am using NiFi to read log files from Tomcat and process that data in a Spark application, after which the processed data is inserted into a DB. My problem is that at the app-server level I have 4 Tomcat cluster members (4 different log files) across 2 different boxes, and I have to mark which data came from which cluster at the Spark level. In my present setup I have 2 TailFile processors per box in NiFi, pointing to a single output port, but I am not able to identify which data is from which cluster at the Spark level. Is there any option in the TailFile processor to add some suffix, prefix, file name, or other attribute to each record, so that I can identify which cluster each record came from and persist it in the DB that way? Any help will be highly appreciated. Thanks in advance.
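TailFile itself only reads the lines; a hedged sketch of one way to tag them is to put an UpdateAttribute and a ReplaceText between each TailFile and the shared output port (the attribute name "cluster" and the tag values are illustrative, and the ReplaceText property names should be verified against your NiFi version):

TailFile (box 1, cluster 1) -> UpdateAttribute (cluster = server1Cluster1) -> ReplaceText -> output port

ReplaceText settings used on every branch:
Replacement Strategy : Prepend
Evaluation Mode      : Line-by-Line
Replacement Value    : ${cluster}|

Each record that reaches Spark then starts with its cluster tag followed by a "|" separator, which the Spark job can split off and persist along with the row.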
... View more
Labels:
- Apache NiFi
- Apache Spark
09-19-2017
04:44 PM
Hi All, Right now I am working on a project where we are trying to read the Tomcat access log using Flume, process that data in Spark, and dump it into a DB in a proper format. The problem is that the Tomcat access log is a daily rolling file and the file name changes every day, something like:

localhost_access_log.2017-09-19.txt
localhost_access_log.2017-09-18.txt
localhost_access_log.2017-09-17.txt

and the source section of my Flume conf file is something like:

# Describe/configure the source
flumePullAgent.sources.nc1.type = exec
flumePullAgent.sources.nc1.command = tail -F /tomcatLog/localhost_access_log.2017-09-17.txt
#flumePullAgent.sources.nc1.selector.type = replicating

which runs the tail command on a fixed file name (I used a fixed name for testing only). How can I pass the file name as a parameter in the Flume conf file? In fact, even if I somehow manage to pass the file name as a parameter, it will not be an actual solution: say I start Flume today with some file name (for example "localhost_access_log.2017-09-19.txt"); tomorrow, when the file name changes (localhost_access_log.2017-09-19.txt to localhost_access_log.2017-09-20.txt), someone has to stop Flume and restart it with the new file name. In that case it will not be a continuous process; I would have to stop/start Flume using a cron job or something like that. Another problem is that I would lose some data every day during that switchover (the server we are working with is a high-throughput server, almost 700-800 TPS), i.e. the time it takes to generate the new file name, plus the time to stop Flume, plus the time to start Flume. Does anyone have an idea how to run Flume with rolling file names in a production environment? Any help will be highly appreciated...
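A hedged sketch of one alternative that avoids restarting on rollover: Flume 1.7+ ships a TAILDIR source that tails every file matching a regular expression and keeps a position file, so the daily file-name change does not need a restart. The agent and source names follow the conf above; the property values are illustrative and should be checked against the Flume user guide for your version:

# TAILDIR source instead of exec/tail -F (requires Flume 1.7 or later)
flumePullAgent.sources.nc1.type = TAILDIR
flumePullAgent.sources.nc1.filegroups = f1
flumePullAgent.sources.nc1.filegroups.f1 = /tomcatLog/localhost_access_log.*
flumePullAgent.sources.nc1.positionFile = /var/flume/taildir_position.json
# bind the source to your existing channel as before

Because the position file records how far each file has been read, a rollover or an agent restart picks up where it left off instead of losing the in-between data.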
... View more
Labels:
- Apache Flume
- Apache Spark
09-15-2017
02:22 AM
Currently I am working on a use case which includes Flume + Spark + MySQL. My job is to read the Tomcat access log file using Flume, process the data in Spark Streaming, and insert that data, in a proper format, into a MySQL table. Everything is working, but somehow I found that my access log and my MySQL table data are not in sync.

My access log file data:

174.37.196.221 - - [15/Sep/2017:00:06:00 +0000] "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1" 200 987
174.37.196.222 - - [15/Sep/2017:00:10:00 +0000] "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1" 200 987
174.37.196.223 - - [15/Sep/2017:00:11:00 +0000] "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1" 200 987
174.37.196.224 - - [15/Sep/2017:00:12:00 +0000] "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1" 200 987

Whereas my MySQL table is as follows:

#  id_pk  ip              requestdateTime               request                                                               status
1  150    174.37.196.221  [15/Sep/2017:00:06:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200
2  151    174.37.196.221  [15/Sep/2017:00:06:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200
3  148    174.37.196.222  [15/Sep/2017:00:10:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200
4  152    174.37.196.222  [15/Sep/2017:00:10:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200
5  149    174.37.196.223  [15/Sep/2017:00:11:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200
6  153    174.37.196.223  [15/Sep/2017:00:11:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200
7  154    174.37.196.224  [15/Sep/2017:00:12:00 +0000]  "GET /cs/v1/points/bal?memberId=2164699082&accountType=10 HTTP/1.1"  200

Please note the difference between the log file and the table data: in the table, the IP addresses 174.37.196.221, 174.37.196.222, and 174.37.196.223 each have a double entry, whereas in the log file each appears once.

My Flume conf file is as follows:

# Describe/configure the source
flumePullAgent.sources.nc1.type = exec
flumePullAgent.sources.nc1.command = tail -F /home/hduser/test.txt

# Describe the sink
flumePullAgent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
flumePullAgent.sinks.spark.hostname = 192.168.56.102
flumePullAgent.sinks.spark.port = 33333
flumePullAgent.sinks.spark.channel = m1
flumePullAgent.sinks.log.type=logger

# Use a channel which buffers events in memory
flumePullAgent.channels.m1.type = memory
flumePullAgent.channels.m1.capacity = 1000
flumePullAgent.channels.m1.transactionCapacity = 100

And my Spark code is something like this:

val stream = FlumeUtils.createPollingStream(ssc, host, port)
stream.count().map(cnt => "Received " + cnt + " flume events ---->>>>>>>>.").print()
stream.foreachRDD { rdd =>
  rdd.foreachPartition { it =>
    // one JDBC connection and prepared statement per partition
    val conn = DriverManager.getConnection(MYSQL_CONNECTION_URL, MYSQL_USERNAME, MYSQL_PWD)
    val del = conn.prepareStatement("INSERT INTO requestlog (ip, requestdatetime, request, status) VALUES(?,?,?,?)")
    for (tuple <- it) {
      val strVal = new String(tuple.event.getBody.array())
      //val matcher:Matcher = patternU.matcher(strVal.getBody.array())
      println("Printing for each RDD: " + strVal)
      val matcher: Matcher = patternU.matcher(strVal)
      if (matcher.matches()) {
        println("Match Found")
        val logdataObj = new logData(matcher.group(1), matcher.group(3), matcher.group(4), matcher.group(5))
        del.setString(1, logdataObj.ip)
        del.setString(2, logdataObj.requestdateTime)
        del.setString(3, logdataObj.request)
        del.setString(4, logdataObj.status)
        del.executeUpdate
      } else {
        println("No Match Found")
      }
    }
    conn.close()
  }
}

Can anyone help me find out where I made a mistake? Why are my log data and table data not in sync? Is this due to the "tail" command? My expectation was that the table would have the same entries with the same frequency as the log file; only then will we be able to do proper analysis of the access data for our API server. Thanks in advance...
... View more
Labels:
- Apache Flume
- Apache Spark
09-04-2017
09:56 AM
Hi All, I have a Spark application to process log files. The data flow is something like this:

Log file --> Flume --> Kafka --> Spark --(after processing)--> Hive

Everything is working properly except for Flume (I guess so). I am not getting continuous updates from my log file in the Kafka consumer. To start the Kafka consumer I use the command:

kafka-console-consumer.sh --zookeeper hadoopmaster:2181 --bootstrap-server hadoopmaster:9092 --topic memoryChannel --from-beginning

The strange thing is that when I start the Flume agent the first time, I get all the messages from the log file in the Kafka consumer, but if I then open the log file, add some more lines, and save it, those updated lines do not get propagated to the Kafka consumer or to Spark. I start the Flume agent with:

/home/hduser/flume/flume/bin/flume-ng agent -c /home/hduser/flume/flume/conf -f /home/hduser/flume/flume_kafka_sync.conf -n agent

There is no error in the Flume log file. My Flume conf file is something like this:

agent.sources = source1
agent.channels = channel1
agent.sinks = sink1

agent.sources.source1.type = exec
agent.sources.source1.command = tail -f /home/hduser/test.txt
agent.sources.source1.channels = channel1
agent.sources.source1.interceptors = itime
agent.sources.source1.interceptors.itime.type = timestamp

agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 10000
agent.channels.channel1.transactionCapacity = 1000

agent.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.sink1.channel = channel1
agent.sinks.sink1.kafka.bootstrap.servers=hadoopmaster:9092,hadoopslave1:9092
agent.sinks.sink1.kafka.topic = memoryChannel
agent.sinks.sink1.batchsize = 200
agent.sinks.sink1.producer.type = async
agent.sinks.sink1.serializer.class = kafka.serializer.StringEncoder

Is there anything I am missing in the configuration to get the continuous updates? My testing process is that I just open test.txt, add some lines, and save it. Any help will be appreciated...
... View more
Labels:
- Apache Flume
- Apache Kafka
08-24-2017
08:17 AM
Hi All, I have a sample table (students1) in Hive which I want to connect to from Spark using JDBC (as Hive is not in the same cluster). I was just trying with the following code:

def main(args: Array[String]): Unit = {
  //Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://34.223.237.55:10000")
    .option("dbtable", "students1")
    .option("user", "hduser")
    .option("password", "hadoop")
    //.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
    .load()
  println("able to connect------------------")
  jdbcDF.show
  jdbcDF.printSchema()
  jdbcDF.createOrReplaceTempView("std")
  val sqlDF = spark.sql("select * from std")
  println("Start println-----")
  spark.sqlContext.sql("select * from std").collect().foreach(println)
  println("end println-----")
  sqlDF.show(false)
}

I tried in multiple ways, but every time it shows only the table structure with the column names, like:

+--------------+-------------+-------------+
|students1.name|students1.age|students1.gpa|
+--------------+-------------+-------------+
+--------------+-------------+-------------+

but no data. I am, however, able to get data when I try with DBeaver from my local machine using a SQL query. From Spark, jdbcDF.printSchema() also shows the proper schema, so I guess there is no issue with the connection. I am using Spark 2.1.1 with Hive 1.2.1. My build.sbt file is like this:

libraryDependencies ++= Seq(
  "log4j" % "log4j" % "1.2.17",
  "org.apache.spark" % "spark-core_2.11" % "2.1.1",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.2",
  "org.apache.spark" % "spark-hivecontext-compatibility_2.10" % "2.0.0-preview",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
  "org.apache.spark" % "spark-hive_2.10" % "2.1.1",
  "org.apache.hive" % "hive-jdbc" % "1.2.1"
)

Can anyone suggest why I am not getting any output from show()? Thanks in advance...
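A hedged sketch of one workaround that is often suggested for this symptom, on the assumption that the empty result comes from Spark quoting column names with double quotes (which Hive treats as string literals): register a custom JdbcDialect that quotes identifiers with backticks instead.

// Hedged sketch for Spark 2.x: a Hive-aware JDBC dialect. This is an assumption
// about the cause, not a confirmed fix for the exact setup in the post.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register it once, before calling spark.read.format("jdbc")...
JdbcDialects.registerDialect(HiveDialect)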
... View more
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark
07-24-2017
11:19 AM
Hi All, I have a requirement from my client to process the application (Tomcat) server log files of a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters. My initial plan is: get the data from the app server logs, push it to Spark Streaming using Kafka and process it there, store the processed data in Hive, and use Zeppelin to read back that processed and centralized log data and generate reports as per the client's requirements. But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own; in that case we would have to write a scheduled job that reads the log from time to time and sends it to the Kafka broker, which I would prefer not to do, as it would not be real time and there could be synchronization issues to worry about, since we have 4 instances of the application server. Another option, I think, is Apache Flume. Can anyone suggest which would be the better approach in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages we might face in future in both cases? I guess another option is Flume + Kafka together, but I cannot speculate much about what would happen, as I have almost no knowledge of Flume. Any help will be highly appreciated 🙂 Thanks a lot.
... View more
Labels:
- Apache Flume
- Apache Kafka
- Apache Spark
06-14-2017
10:51 AM
Hi all, I have a table in Hive with insertDTM and id. My job is to find those members whose last insertDTM is more than 30 days older than the present date. I am using the datediff UDF for that. My query is something like:

select * from finaldata where datediff(current_date, select max(insertdtm) from finaldata group by memberid) > 30;

But it is giving an error. It looks like datediff does not accept a SQL query as a parameter. One more thing: can anyone explain why this does not work? When I call datediff with a SQL query, how does Hive handle it? My assumption was that it would first execute the inner SQL (in my case "select max(insertdtm) from finaldata group by memberid") and then call datediff with current_date and the output of that SQL, but it looks like it does not work that way. Thanks a lot...
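A hedged sketch of one way to express the same logic without passing a subquery to datediff: compute each member's latest insertDTM first, then filter on it (table and column names follow the query above; the alias names are illustrative):

SELECT t.memberid
FROM (
  SELECT memberid, MAX(insertdtm) AS last_insertdtm
  FROM finaldata
  GROUP BY memberid
) t
WHERE datediff(current_date, t.last_insertdtm) > 30;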
... View more
Labels:
- Apache Hive
06-09-2017
11:49 AM
Hi All, I have a Hadoop cluster with 10 datanodes. All are working properly. But if I need to reboot any of my datanodes, after the reboot that system does not rejoin the Hadoop cluster (the master node is not able to identify the datanode). Is there any way to include that datanode (previously configured, or a new one) in the Hadoop cluster without restarting the master node? Thanks a lot...
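A hedged sketch of what usually only needs to happen on the rebooted machine itself, assuming a Hadoop 2.x cluster where the daemons are not configured to start at boot (script paths depend on the distribution):

# On the rebooted datanode, restart the worker daemons; the namenode does not need a restart.
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

# Then confirm from any node that the datanode re-registered with the namenode.
hdfs dfsadmin -report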
... View more
Labels:
- Apache Hadoop
08-11-2016
07:42 PM
Yippee... Artem, thanks a lot 🙂 You really saved my day. Thanks again.
... View more
08-11-2016
01:47 PM
Hi, Thanks for your reply. Oh, is it? I did not know that. Is there any other unit-testing tool you know of for M/R jobs? I went through the URL as well; it is helpful, but I also experienced the same issue as reported on Stack Overflow: a "Null Pointer" exception when the mapper tries to get the path/URI from the configuration.
... View more
08-11-2016
11:19 AM
1 Kudo
Hi All, I have a simple mapper which reads some data from a log file, does a join operation with data from another file, and sends the combined output to a reducer for further processing. In the mapper I use the DistributedCache, as the second file is a small one. It works properly. Now I have to write some MRUnit test cases for that mapper. Can anyone help me with a code example of how to write an MRUnit test with DistributedCache support? I am using Hadoop 2 and my MRUnit dependency is as follows:

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>1.1.0</version>
  <classifier>hadoop2</classifier>
</dependency>

In the driver class I added the cache file like this (this is just to explain how I added the cache in the MR job):

Job job = Job.getInstance(conf);
job.setJarByClass(ReportDriver.class);
job.setJobName("Report");
job.addCacheFile(new Path("zone.txt").toUri());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(ReportMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setReducerClass(ReportReducer.class);
job.setNumReduceTasks(3);
//job.setCombinerClass(ReportReducer.class);
logger.info("map job started ---------------");
System.exit(job.waitForCompletion(true) ? 0 : 1);

In the mapper class I fetch the cache file like this:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    URI[] localPaths = context.getCacheFiles();
}

Please help me out if anyone has used DistributedCache with MRUnit, with some code example. Thanks a lot...
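A hedged sketch of what such a test could look like, assuming MRUnit 1.1.0's cache-file support on the test drivers and a ReportMapper with LongWritable/Text input and Text/Text output; the key/value types, the sample record, and the expected output are assumptions for illustration only:

// Hedged sketch: MRUnit MapDriver with a distributed-cache file.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class ReportMapperTest {

    private MapDriver<LongWritable, Text, Text, Text> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new ReportMapper());
        // Make "zone.txt" visible to context.getCacheFiles() inside the mapper's setup().
        // The path here is an assumption, resolved relative to the test working directory.
        mapDriver.withCacheFile("src/test/resources/zone.txt");
    }

    @Test
    public void testMapperJoinsWithZoneFile() throws Exception {
        mapDriver
            .withInput(new LongWritable(1L), new Text("sample log line"))   // illustrative record
            .withOutput(new Text("expectedKey"), new Text("expectedValue")) // illustrative output
            .runTest();
    }
}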
... View more
Labels:
- Apache Hadoop
07-11-2016
08:25 PM
Hi All, I have a Hadoop cluster setup with one master and two slaves. I want to install HBase on the cluster, but when running HBase I am getting an error in the ZooKeeper log file:

2016-07-11 22:49:18,199 WARN [QuorumPeer[myid=0]/0.0.0.0:2181] quorum.QuorumCnxManager: Cannot open channel to 2 at election address /10.0.1.105:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2016-07-11 22:49:18,201 WARN [QuorumPeer[myid=0]/0.0.0.0:2181] quorum.QuorumCnxManager: Cannot open channel to 1 at election address /10.0.1.103:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2016-07-11 22:49:18,201 INFO [QuorumPeer[myid=0]/0.0.0.0:2181] quorum.FastLeaderElection: Notification time out: 25600
hduser@hadoopmaster:/home/$
But there is no error on the slave. My hbase-site.xml configuration is as follows.

In Master:

<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase:rootdir</name>
<value>hdfs://hadoopmaster:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>10.0.1.104,10.0.1.103,10.0.1.105</value>
</property>
<property>
<name>hbase.master</name>
<value>hadoopmaster:60000</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://hadoopmaster:9000/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.leaderport</name>
<value>3888</value>
</property>
</configuration>

In Slave:

<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase:rootdir</name>
<value>hdfs://hadoopmaster:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.leaderport</name>
<value>3888</value>
</property>
</configuration>
IP address details:
10.0.1.104 --- hadoopmaster
10.0.1.103 --- hadoopslave2
10.0.1.105 --- hadoopslave1

Please note, I do not have a firewall enabled in this setup. I am not using a separate ZooKeeper installation; rather, I am using the embedded ZooKeeper that comes with HBase. Has anyone faced this issue? Please suggest a way to resolve it. Any help will be highly appreciated. Thanks in advance...
... View more
Labels:
- Apache Hadoop
- Apache HBase
07-11-2016
11:13 AM
Hi Mukesh, Thanks for your help. I have configured everything as per the document; now I am facing two issues: 1) <masterNodeIP>:60100 is not working, and 2) when I try to run "list" in the HBase shell it does not work and I get an error (see servererror.txt).

Output of jps on the master and slave:

Master:
hduser@hadoopmaster:/var/hadoop$ jps
3734 ResourceManager
5266 HRegionServer
3377 NameNode
5599 Jps
3591 SecondaryNameNode
5043 HQuorumPeer

Slave:
hduser@hadoopslave1:/home/$ jps
3357 HRegionServer
3252 HQuorumPeer
3465 Jps
2711 DataNode
2842 NodeManager

I guess the jps output is OK. The log files are also attached: hbase-hduser-master-hadoopmaster.txt, hbase-hduser-regionserver-hadoopmaster.txt, hbase-hduser-zookeeper-hadoopmaster.txt. From the log files it looks like ZooKeeper is not able to connect properly. Please note I do not have a separate ZooKeeper installation. On Google (http://stackoverflow.com/questions/30940981/zookeeper-error-cannot-open-channel-to-x-at-election-address) I found a fix for this error, but as I do not have a separate ZooKeeper installation, how can I change the setup for the zoo.cfg file? Thanks in advance. Any help will be highly appreciated.
... View more
07-08-2016
10:00 AM
Hi Ankit, thanks again. The problem is that I cannot set this up in Ambari, as we already have an existing setup and we have to submit a POC for one client requirement. Thanks for your help.
... View more
07-08-2016
09:28 AM
Thanks for your help. No, I do not have ZooKeeper installed in the cluster. In that case, do I need to install and run ZooKeeper on all nodes separately? Sorry, I do not have much knowledge of HBase.
... View more
07-08-2016
08:39 AM
Hi All, Can anyone help me with HBase installation and configuration in a cluster environment? (I have a 10-node cluster with YARN, working properly.) I have spent a lot of time on this but nothing has come of it. I tried many solutions that I got from Google, but unfortunately nothing works. The present situation: all services are running on the master and slaves, but I am not able to "list" or create a table in HBase. I guess I am doing something wrong somewhere. Please provide any URL or document with a step-by-step installation and setup process for HBase in a cluster. One more point: do I need to install/run ZooKeeper on the master and slave nodes separately for HBase? Any help will be highly appreciated. Thanks a lot.
... View more
Labels:
- Apache Hadoop
- Apache HBase
07-05-2016
02:08 PM
Thanks a lot, you saved my day 🙂 I missed the --target-dir.
... View more
07-05-2016
11:09 AM
Hi All, I have a table in Oracle with only 4 columns:

Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date

I want to import that data into a Hive table using Sqoop. I created the corresponding Hive table with:

create EXTERNAL TABLE memberimport(memberid BIGINT, uuid varchar(36), insertdate timestamp, updatedate timestamp) LOCATION '/user/import/memberimport';

and the Sqoop command:

sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1

It works properly and is able to import data into the Hive table. Now I want to update this table incrementally on updatedate (last value = today's date), so that I get the day-to-day updates of that OLTP table into my Hive table using Sqoop. For the incremental import I am using the following Sqoop command:

sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1

But I am getting the exception "Append mode for hive imports is not yet supported. Please remove the parameter --append-mode", and as per the Sqoop documentation "--hive-import" is not supported with "--incremental append". When I remove "--hive-import" it runs properly and the file gets created in the Sqoop location in HDFS, but I do not find the new updates from the OLTP table in my Hive table. When I run "select * from <hive-table>" it shows only the old data, no incremental data. Am I doing anything wrong? Please suggest how I can run an incremental update from Oracle to Hive using Sqoop. Any help will be appreciated. Thanks in advance...
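A hedged sketch of one commonly used pattern, assuming the external table above stays pointed at its HDFS directory and Sqoop writes straight into that directory instead of going through --hive-import (the --last-value shown is illustrative; Hive sees the new files automatically because the table is EXTERNAL on that path):

# Incremental append directly into the external table's HDFS location.
sqoop import \
  --connect jdbc:oracle:thin:@dbURL:1521/dbName \
  --username *** --password *** \
  --table MEMBER \
  --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' \
  --check-column UPDATEDATE \
  --incremental append \
  --last-value '2016-07-04 00:00:00' \
  --target-dir /user/import/memberimport \
  -m 1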
... View more
Labels:
- Apache Hive
- Apache Sqoop
07-04-2016
10:35 AM
Hi All, I have a typical requirement from one of our clients (a beverage company in the US). The requirement is: on the website they have an option to show the customer different events organized in the US, depending on the region and the member's interests. The table details in OLTP are as per the attached erd.jpg. Now the requirement is: when a member logs into the site, we have to find the event names, based on event date, member interest, and region, to show as preferred events for that member. In the regular OLTP system this takes a huge amount of time due to the heavy load (around 3,000,000 registered users and a transaction rate of around 700 TPS). We also have some other Hadoop-based implementations for this application (like daily sales reports in different bands, customer update/login frequency, etc.), so we have a ready Hadoop setup (MR, Hive, Pig) for this client. Can anyone suggest how we can handle this typical requirement using Hadoop? Only the steps or process is required; I mean the solution-design part or the process only. Any help/suggestions will be appreciated. Please let me know if anyone needs more detail on the requirement or the setup of the existing process. Thanks...
... View more
Labels:
- Apache Hadoop
- Apache Hive
- Apache Pig