Member since: 09-11-2018
Posts: 39
Kudos Received: 1
Solutions: 0
01-06-2019
01:05 PM
Friends, any update on these two questions? Sadly, after so many days there is still no reply. Regards
01-06-2019
05:07 AM
Friends, any update on these two questions? Sadly, after many days there is still no reply. Regards
12-21-2018
07:57 PM
Hello Friends, We have an upcoming project, and to prepare for it I am learning Spark Streaming (with a focus on PySpark). So far I have completed a few simple case studies found online, but I am stuck on two scenarios, described below. I would be highly obliged if you could share your thoughts or point me to any web page that helps with a solution.
1. Writing Streaming Aggregation to File
# spark-submit --master local[*] /home/training/santanu/ssfs_2.py
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
spark = SparkSession.builder.appName("FileStream_Sink_Group").getOrCreate()
source = "file:///home/training/santanu/logs"
tgt = "file:///home/training/santanu/hdfs"
chk = "file:///home/training/santanu/checkpoint"
schema1 = StructType([StructField('agent',StringType()),StructField('code',IntegerType())])
df1 = spark.readStream.csv(source,schema=schema1,sep=",")
df2 = df1.filter("code > 300").select("agent").groupBy("agent").count()
df3 = df2.select("agent","count").withColumnRenamed("count","group_count")
query = df3.writeStream.format("csv").option("path",tgt).option("checkpointLocation",chk).start() # Error
query.awaitTermination()
spark.stop()
Error message:
# AnalysisException: Append output mode not supported when there are streaming aggregations on DataFrames without watermark;
2. Reading from Kafka (Consumer) using Streaming
# spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 /home/training/santanu/sskr.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Kafka_Consumer").getOrCreate()
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","MyTopic_1").load()
print(type(df)) # <class 'pyspark.sql.dataframe.DataFrame'>
df.printSchema() # printing schema hierarchy
query = df.selectExpr("CAST(value AS STRING)").writeStream.format("console").start() # Error
query.awaitTermination()
spark.stop()
Error message:
# NoSuchMethodError: org.apache.kafka.clients.consumer.KafkaConsumer.subscribe(Ljava/util/Collection;)
Please help. Thanking you, Santanu Ghosh
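For reference, here is a minimal sketch of one way past the first error. The file sink only supports append output mode, and append mode with a streaming aggregation requires a watermark on an event-time column. The sketch below is not the original job: it assumes the log rows also carry a timestamp column (called "ts" here purely for illustration) and uses a windowed count so that append mode becomes legal.
# Illustrative sketch only -- assumes an event-time column named "ts" that the original schema does not have; paths are reused from the question.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,TimestampType
spark = SparkSession.builder.appName("FileStream_Sink_Group_Watermark").getOrCreate()
schema = StructType([StructField("agent",StringType()),
                     StructField("code",IntegerType()),
                     StructField("ts",TimestampType())])  # assumed event-time column
df = spark.readStream.csv("file:///home/training/santanu/logs", schema=schema, sep=",")
# A watermark plus a window over the same event-time column lets Spark finalize each
# window, which is what append mode (the only mode the file sink accepts) needs.
agg = (df.filter("code > 300")
         .withWatermark("ts", "10 minutes")
         .groupBy(window("ts", "5 minutes"), "agent")
         .count()
         .withColumnRenamed("count", "group_count")
         .select(col("window.start").alias("window_start"),
                 col("window.end").alias("window_end"),
                 "agent", "group_count"))  # flatten the window struct for the CSV sink
query = (agg.writeStream.format("csv")
            .outputMode("append")
            .option("path", "file:///home/training/santanu/hdfs")
            .option("checkpointLocation", "file:///home/training/santanu/checkpoint")
            .start())
query.awaitTermination()
Without a real event-time field in the data, the alternative is a sink that supports the complete output mode (such as the console or memory sink), since the file sink accepts append only.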
Labels:
- Apache Kafka
- Apache Spark
12-21-2018
07:28 AM
Hello Friends, We have an upcoming project, and to prepare for it I am learning Spark Streaming (with a focus on Structured Streaming). So far I have completed a few simple case studies found online, but I am stuck on two scenarios, described below. I would be highly obliged if you could share your thoughts or point me to any web page that helps with a solution.
1. Writing Streaming Aggregation to File
# spark-submit --master local[*] /home/training/santanu/ssfs_2.py
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
spark = SparkSession.builder.appName("FileStream_Sink_Group").getOrCreate()
source = "file:///home/training/santanu/logs"
tgt = "file:///home/training/santanu/hdfs"
chk = "file:///home/training/santanu/checkpoint"
schema1 = StructType([StructField('agent',StringType()),StructField('code',IntegerType())])
df1 = spark.readStream.csv(source,schema=schema1,sep=",")
df2 = df1.filter("code > 300").select("agent").groupBy("agent").count()
df3 = df2.select("agent","count").withColumnRenamed("count","group_count")
query = df3.writeStream.format("csv").option("path",tgt).option("checkpointLocation",chk).start() # Error
query.awaitTermination()
spark.stop()
Error I am getting:
# Append output mode not supported when there are streaming aggregations on DataFrames without watermark;
2. Reading from Kafka (Consumer) using Streaming
# spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 /home/training/santanu/sskr.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Kafka_Consumer").getOrCreate()
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","MyTopic_1").load()
print(type(df)) # <class 'pyspark.sql.dataframe.DataFrame'>
df.printSchema() # printing schema hierarchy
query = df.selectExpr("CAST(value AS STRING)").writeStream.format("console").start() # Error
query.awaitTermination()
spark.stop()
Error I am getting:
# NoSuchMethodError: org.apache.kafka.clients.consumer.KafkaConsumer.subscribe(Ljava/util/Collection;)
Please help. Thanking you, Santanu Ghosh
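On the second error, a note for reference: a NoSuchMethodError on KafkaConsumer.subscribe(Collection) usually means an older kafka-clients jar (0.8/0.9 era) on the classpath is shadowing the 0.10 client that spark-sql-kafka-0-10 expects, or that the --packages coordinate does not match the installed Spark and Scala build, so it is worth confirming the versions before changing any code. Once the jars line up, a sketch like the one below (illustrative only; it assumes MyTopic_1 carries the same "agent,code" CSV lines as the file example) parses the Kafka value into typed columns instead of only casting it to a string.
# Illustrative sketch only -- assumes the kafka-clients version conflict is already resolved
# and that the topic carries "agent,code" CSV lines; the parsing is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
spark = SparkSession.builder.appName("Kafka_Consumer_Parsed").getOrCreate()
print(spark.version)  # the --packages coordinate should match this Spark (and its Scala) version
raw = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "MyTopic_1")
            .load())
# Kafka rows expose key and value as binary; cast the value and split the CSV payload.
parsed = (raw.selectExpr("CAST(value AS STRING) AS line")
             .select(split(col("line"), ",").alias("parts"))
             .select(col("parts").getItem(0).alias("agent"),
                     col("parts").getItem(1).cast("int").alias("code")))
query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()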
Labels:
- Apache Spark
10-04-2018
04:48 PM
Hello Friends, I have just read the news that Cloudera and Hortonworks have merged into one company. I was both shocked and surprised. Now I am eager to know what the future of Hadoop will be, since it is still open-source, free software under Apache. Also, will the Hortonworks certifications such as HDPCD and others still hold any value at the industry level? Please do let me know your thoughts and ideas. Thanking you, Santanu
Labels:
- Apache Hadoop
09-30-2018
10:01 AM
@Rahul Soni, I am creating an Avro file using a Flume regex interceptor and multiplexing. But that file contains values like the snippet below, and when I try to generate a schema with the avro-tools getschema option it returns only "headers" and "body" as the two fields. Please advise how to resolve this.
Objavro.codenullavro.schema▒{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}▒LA▒▒;ڍ▒(▒▒▒=▒YBigDatJava▒Y{"created_at":"Thu Sep 27 11:40:44 +0000 2018","id":1045277052822269952,"id_str":"1045277052822269952","text":"RT @SebasthSeppel: #Jugh \ud83d\udce3 heute ist wieder JUGH !\nHeute haben wir @gschmutz bei uns mit dem spannenden Thema: Streaming Data Ingestion in\u2026","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":716199874157547520,"id_str":"716199874157547520","name":"Alexander Gomes","screen_name":"nEinsKnull","location":"Kassel, Hessen","url":null,"description":"CTO of family, work at @Micromata, loves #tec #rc #mountainbikes and #paragliding","trans.........
Thanking you, Santanu
09-28-2018
04:21 AM
Hi Friends, I need help with Avro file processing using Flume and Kafka. In short, I am reading a JSON file, using an interceptor and a selector to split specific values into an Avro sink, and then reading that Avro source to write to HDFS as an Avro file. The Flume configurations I am using are given below. The problem I am facing is that the Avro file written to HDFS contains only a header and a body, so the "java -jar avro-tools-1.8.2.jar getschema" option does not give me the desired schema of the Avro file; every time it shows only "headers" and "body" as the two fields. Please suggest how to resolve this problem. It is very urgent.
1. flume-ng agent --name ta1 --conf conf --conf-file /home/cloudera/santanu/flume_interceptor_multiplexing.conf -Dflume.root.logger=DEBUG,console
## Flume Agent with Json Source , Kafka and Memory Channels , Avro Sink
ta1.sources = twitter
ta1.sinks = avrofile
ta1.channels = memchannel kafkachannel
## Properties
## Sources
ta1.sources.twitter.type = exec
ta1.sources.twitter.command = tail -F /home/cloudera/workspace/Cloudera_Share/Bigdata.json
## Sinks
ta1.sinks.avrofile.type = avro
ta1.sinks.avrofile.hostname = 192.XXX.x.x
ta1.sinks.avrofile.port = 4141
## Channel 1
ta1.channels.memchannel.type = memory
ta1.channels.memchannel.capacity = 5000
ta1.channels.memchannel.transactionCapacity = 500
## Channel 2
ta1.channels.kafkachannel.type = org.apache.flume.channel.kafka.KafkaChannel
ta1.channels.kafkachannel.kafka.bootstrap.servers = 192.XXX.x.x:9092
ta1.channels.kafkachannel.kafka.topic = MyTopic_1
ta1.channels.kafkachannel.kafka.consumer.group.id = my_group
## Interceptor
ta1.sources.twitter.interceptors = i1
ta1.sources.twitter.interceptors.i1.type = regex_extractor
ta1.sources.twitter.interceptors.i1.regex = (?i)(Python|Java|Scala|Perl|Sqoop|Flume|Kafka|Hive|Spark|Jethro|NoSQL)
ta1.sources.twitter.interceptors.i1.serializers = s1
ta1.sources.twitter.interceptors.i1.serializers.s1.name = BigData
## Source Selector
ta1.sources.twitter.selector.type = multiplexing
ta1.sources.twitter.selector.header = BigData
ta1.sources.twitter.selector.mapping.Python = kafkachannel
ta1.sources.twitter.selector.mapping.Java = kafkachannel
ta1.sources.twitter.selector.mapping.Scala = kafkachannel
ta1.sources.twitter.selector.mapping.Perl = kafkachannel
ta1.sources.twitter.selector.mapping.Sqoop = memchannel
ta1.sources.twitter.selector.mapping.Flume = memchannel
ta1.sources.twitter.selector.mapping.Kafka = memchannel
ta1.sources.twitter.selector.mapping.Hive = memchannel
ta1.sources.twitter.selector.mapping.Spark = memchannel
ta1.sources.twitter.selector.mapping.Jethro = memchannel
ta1.sources.twitter.selector.mapping.NoSQL = memchannel
## Mapping
ta1.sources.twitter.channels = kafkachannel memchannel
ta1.sinks.avrofile.channel = memchannel
2. flume-ng agent --name ta2 --conf conf --conf-file /home/cloudera/santanu/flume_avro_hdfs.conf -Dflume.root.logger=DEBUG,console
## Flume Agent with Avro Source and HDFS Sink
ta2.sources = avrofile
ta2.sinks = hdfsfile
ta2.channels = memchannel
## Properties
## Source
ta2.sources.avrofile.type = avro
ta2.sources.avrofile.bind = 192.XXX.x.x
ta2.sources.avrofile.port = 4141
## Sink
ta2.sinks.hdfsfile.type = hdfs
ta2.sinks.hdfsfile.hdfs.path = /user/cloudera/flume_avro
ta2.sinks.hdfsfile.hdfs.filePrefix = Hadoop
ta2.sinks.hdfsfile.hdfs.fileSuffix = .avro
ta2.sinks.hdfsfile.hdfs.fileType = DataStream
ta2.sinks.hdfsfile.hdfs.writeFormat = Text
ta2.sinks.hdfsfile.hdfs.rollInterval = 5
ta2.sinks.hdfsfile.serializer = avro_event
ta2.sinks.hdfsfile.compressionCodec = snappy
## Channel
ta2.channels.memchannel.type = memory
ta2.channels.memchannel.capacity = 5000
ta2.channels.memchannel.transactionCapacity = 500
## Interceptor
ta2.sources.avrofile.interceptors = i2
ta2.sources.avrofile.interceptors.i2.type = remove_header
ta2.sources.avrofile.interceptors.i2.withName = BigData
## Mapping
ta2.sources.avrofile.channels = memchannel
ta2.sinks.hdfsfile.channel = memchannel
3. Avro file in HDFS with Header and Body as the only 2 columns
Thanking you, Santanu
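A note on why getschema keeps showing only two fields: the HDFS sink's avro_event serializer wraps every event in Flume's fixed container schema (a headers map plus the raw body bytes), which is exactly why avro-tools reports only "headers" and "body"; the tweet JSON ends up inside body rather than as named Avro columns. Below is a small sketch of how to confirm that, assuming the fastavro package is available; the local file name is hypothetical.
# Illustrative sketch only -- inspects one .avro file written by the ta2 agent.
# Assumes fastavro is installed; the file name below is hypothetical.
import json
from fastavro import reader
path = "Hadoop.1538291234567.avro"  # e.g. a file copied locally from /user/cloudera/flume_avro
with open(path, "rb") as fh:
    for record in reader(fh):
        # avro_event stores each Flume event as headers (map<string,string>, here carrying
        # the "BigData" interceptor header) plus body (bytes holding the original tweet JSON).
        print(record["headers"])
        tweet = json.loads(record["body"].decode("utf-8"))
        print(tweet.get("created_at"), tweet.get("text"))
Writing the tweets out with their own named Avro fields would need a serializer that carries a real record schema rather than this generic event wrapper.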
Labels:
- Apache Flume
- Apache Kafka
- HDFS
09-20-2018
01:43 PM
Hello Friends, I have recently started working on Hadoop, and only from 2.x onwards.
I do understand the concept of NameNode High Availability in Hadoop 2.x. But recently someone told me that NameNode HA was possible even on Hadoop 1.x, because Hadoop 1.x had ZK and QJN. If so, why do most articles on the web say the NameNode was a single point of failure (SPoF) in Hadoop 1.x? Please do let me know the answer. Thanking you, Santanu
Labels:
- Apache Hadoop
- Apache Zookeeper
07-02-2018
07:42 PM
Thanks Vini. Based on your suggestion I am able to connect. Thanking you Santanu
07-01-2018
07:36 PM
Hello Friends, Please help me with this problem. I am trying to install Jethro Data on the HDP sandbox and connect to HDFS and Hive from it.
Installation and setup of Jethro 3.x on the Hortonworks (VirtualBox) HDP 2.4 sandbox was successful. Access to HDFS from the Jethro client, by creating a table and inserting data into it, is also working fine. But I am not able to connect to Hive from the Jethro client to create an external table over Hive JDBC.
It gives the error: "jdbc driver load has failed or not found, setup JD_HIVE_JDBC_CLASSPATH env in /opt/jethro/current/conf/jd-hadoop-env.sh".
I have added the relevant Hive JDBC jars, such as "/usr/hdp/current/hive-client/lib/hive-jdbc*.jar", to the JD_HIVE_JDBC_CLASSPATH env variable. Still it is not working.
Please suggest what else I need in order to connect to Hive from the Jethro client. Thanking you, Santanu Ghosh
Labels:
- Apache Hive
04-23-2018
03:51 PM
Hi @tbroas, I have replied to your email. Please look into the matter. Thanking you Santanu
04-21-2018
05:23 PM
Thanks @Geoffrey Shelton Okot, I am also hoping for that. Let's see what HW replies. Thanking you, Santanu
04-21-2018
02:45 PM
Also, is there anyone else from Hortonworks exam support in this community who can help me out? Thanking you, Santanu
04-21-2018
02:12 PM
Hi @tbroas, @asrivastava, I am reaching out to you in need of help. I was taking the HDPCD exam one more time; my confirmation code was 34E-D22 and my candidate ID 8732865098. Within 15 minutes of the exam, the exam window stopped responding. I refreshed the window, but it did not restart. At that time there was also no online agent from PSI exam support whom I could reach, so I lost that attempt. I tried calling the numbers +1-888-504-9178 and +1-702-904-7342 for additional support and waited for more than 30 minutes, but no one picked up. Hence I sent email to PSI support and Hortonworks Zendesk support; my Ticket# is 13053. Please look into this matter. I do not want to lose my money without completing my exam. Please arrange something so that I can take the exam again. Thanking you, Santanu
04-19-2018
05:05 PM
Thanks @Shu. So if the input file is tab-delimited and the requirement says to create a Hive table with the default format, then just: "create table mydb.user (uid int,name string) row format delimited fields terminated by '\t' ;" should be sufficient. This will create the table with a tab delimiter, and it will take the file storage from the default format, which is TextFile (hive.default.fileformat). Thanking you, Santanu
04-19-2018
08:59 AM
Hello Friends, I have a couple of questions related to Hive, and I am confused about the correct answers. Q1: What is the correct way to define a default Hive table? Q2: What is the default delimiter for a Hive table? In other words, the requirement says to create a Hive table with the default format. Checking "set hive.default.fileformat;" in the Hive CLI shows "TextFile". So I am creating the Hive table like this: create table mydb.user (uid int,name string) ;
But it creates the table with Row Format "LazySimpleSerDe" and without any explicit delimiter. Is this the correct way to define a default Hive table? Or shall I define it as: create table mydb.user (uid int,name string) row format delimited fields terminated by '\t' ; because in that case it shows Row Format "DELIMITED" and Fields Terminated By '\t'. Thanking you, Santanu
Labels:
- Apache Hive
04-09-2018
02:03 PM
@tbroas, Thanks for your response. I really appreciate your assistance. Thanking you Santanu
04-08-2018
02:31 PM
Hi @Geoffrey Shelton Okot, I agree with your points. Whatever the outcome of the exam may be, they should communicate the result within the stipulated time. Anyway, I am hoping I will hear from them soon. Meanwhile, would you please help me with how to reach HW people other than the certification email ID? Who is this William Gonzalez you have mentioned, and how can I get in touch with him? Thanking you, Santanu
04-08-2018
07:02 AM
Can anyone from Hortonworks, or @Artem Ervits, @tbroas, @Dave Russell, @Rahul Soni, please reply to my question? I would really appreciate that. I am not getting any response from certification@hortonworks.com. Thanking you, Santanu
04-07-2018
03:00 PM
Hi Friends, On 31-Mar-2018 I took the HDPCD exam.
At the end I was informed by the proctor that the result would be communicated within 5 business days. It has been almost 7 working days now, and I am still waiting for the result.
I have already written to certification@hortonworks.com, and my request number is #12624, but I have not heard from the HW support team since.
Please let me know whether they have any separate email ID, or any customer-care contact number I can call. Thanking you, Santanu
Labels:
- Certification
03-24-2018
07:56 AM
Hello Friends, I found something while running the final task (Task 10) of the HDPCD Practice Exam.
It says, "Put local files from /home/hortonworks/datasets/... into HDFS /user/hortonworks/..." But the "/home" directory does not have a "hortonworks" sub-directory; it has a "horton" sub-directory.
Also, the user horton does not have permission to create any sub-directory under "/home". Similarly, on HDFS there is a "/user/horton" directory, not "/user/hortonworks". I am not sure whether the questions are incorrect or I am doing something wrong. Also, please let me know how to access MySQL on the HDPCD Practice Exam instance.
I tried "mysql -u root -p" with "hadoop" as the password, but it did not work. Thanking you, Santanu
Labels:
- Apache Hadoop
03-23-2018
12:20 PM
Thanks, Geoffrey, for your response. Yes, I did that, and now the issue is resolved. It seems only an EC2 instance in the N. Virginia region works fine for the HDPCD practice exam; I am not sure about the reason. But when I changed the region from the drop-down and created a new instance, I was able to connect from VNC Viewer. I hope the problem will not appear again; if it does, I will reach out to you guys. Thanking you, Santanu
03-23-2018
10:30 AM
Hello Friends, Please help me with this. I have set up an AWS EC2 instance for the HDPCD practice exam, and I have downloaded VNC Viewer. As per the document, I am able to start the EC2 instance; the region I selected is Asia Pacific (Mumbai). I am using the public DNS name with port 5901, and I have disabled my Windows firewall. But when I try to connect from VNC Viewer, it shows an "unable to connect" error message. Please suggest how to get through this. Unfortunately, I tried to get help from a friend who did this on his own machine, and even he was not able to help me out. Thanking you, Santanu
03-21-2018
09:36 AM
Thanks Rahul
03-18-2018
06:48 AM
Hello Friends, Please guide me on this matter. I have a 64-bit computer with Windows 7 and 8 GB of RAM (only 7 GB usable). So, for the HDPCD AWS practice exam, which EC2 instance type would be good?
Currently there is no option for m3.2xlarge, so I chose m4.2xlarge, but it is performing very poorly. Also, would c4.large, c5.large, or t2.medium be any good for practice? Please suggest. Thanking you, Santanu
03-18-2018
05:27 AM
Thanks @Aditya Sirna for your response. It's working. I used this command. relation_1 = ORDER relation_0 BY <col_2> DESC PARALLEL 3; Thanking you Santanu
03-17-2018
03:21 PM
1 Kudo
Hi Friends, I was practicing AWS tasks for the HDPCD exam. There is one question for which I need help; I am describing it briefly. "From a Pig script, store the output as 3 comma-separated files in an HDFS directory." I used the command below for that: STORE output INTO '<hdfs directory>' USING PigStorage(',') PARALLEL 3; It ran 3 reducers, but eventually stored only a single part-r-00000 file, with all the rows, in the output HDFS path. So, what is the simplest way to store 3 comma-separated output files from Pig (without using any additional jar file)? Thanking you, Santanu
Labels:
- Apache Hadoop
- Apache Pig
03-10-2018
06:20 AM
Thanks @rtrivedi for your help and response. It is now resolved. I also made a small modification to my Sqoop code: I used --input-null-string '\\N' instead of --input-null-string 'Unknown' (used previously). The entire Sqoop command is below.
sqoop export --connect jdbc:mysql://localhost:3306/my_db --username root --password-file /user/cloudera/textdata/sq_pwd_1.txt \
--export-dir /user/cloudera/hivedata/employee --table employee --staging-table employee_stg --clear-staging-table \
--input-null-string '\\N' --input-null-non-string '\\N' --input-fields-terminated-by "," --input-lines-terminated-by "\n" -m 1
Thanking you, Santanu