Member since: 03-27-2016
Posts: 47
Kudos Received: 1
Solutions: 0
10-08-2019
12:44 AM
I have a requirement to process and store huge volumes of streaming data. Streams of 3 TB will arrive every hour, and I have to store the data for 15-20 days for historical analysis (i.e., about 1 PB of data for analysis). I am looking for a suitable NoSQL database that can handle this. The database should also support multiple indexes.
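A quick back-of-the-envelope check of those numbers (a sketch with assumed values; the 3x replication factor is an assumption, not something stated in the question):

tb_per_hour = 3
days_retained = 15                       # the question says 15-20 days; using the low end
raw_tb = tb_per_hour * 24 * days_retained
print(raw_tb)                            # 1080 TB, i.e. roughly 1 PB before replication
print(raw_tb * 3)                        # ~3.2 PB of raw disk with HDFS-style 3x replication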
01-19-2019
04:21 PM
I am trying to figure out whether it is possible to execute Spark Scala code from IntelliJ against the HDP 3 sandbox.
Labels:
- Apache Spark
07-24-2017
06:16 PM
Zaratsian, I followed your tutorial but I am getting the error "Wrong FS". Can you help me solve this issue? I have posted the question at the link below: https://community.hortonworks.com/questions/114572/getting-error-while-reading-hbase-snapshot-through.html
07-23-2017
06:45 PM
I am getting an error while reading an HBase snapshot through Spark (Scala):

java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/hbase/d9d1b0a7-439d-46a8-984e-886705b5f9b7/data/default/test/fa6fe29976fca3f9be223db554ab22b4, expected: file:///

What parameter should be passed in path()? I passed /user/hbase after creating it in HDFS:

val path = new Path("hdfs://localhost:9000/user/hbase")
Labels:
- Apache HBase
- Apache Spark
06-11-2016
02:36 PM
I have created a Kafka producer:

from kafka import KafkaProducer
import json, time

userdata = {
    "ipaddress": "172.16.0.57",
    "logtype": "",
    "mid": "",
    "newsession": "4917279149950184029a78e4a-e694-438f-b994-39897e346953",
    "previousurl": "/",
    "searchtext": "",
    "sessionid": "29a78e4a-e694-438f-b994-39897e346953",
    "source": "desktop",
    "uid": "Chrome4929a78e4a-e694-438f-b994-39897e346953",
    "url": "http://172.16.0.57/",
    "useragent": "Mozilla/5.0%20(Windows%20NT%2010.0",
    "utmsocial": "null",
    "utmsource": "null",
    "createdtime": "2016-05-03 12:27:38",
    "latency": 13260.0,
    "serviceurl": "http://localhost:8080/Business-Web/services/product/getBestDealNew",
    "domainlayeripaddress": "localhost",
    "name": "TJ"
}

# Serialize each message as UTF-8 encoded JSON
producer = KafkaProducer(bootstrap_servers=['172.16.10.13:6667', '172.16.10.14:6667'],
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Send the same event ten times, one every three seconds
for i in range(10):
    print("adding", i)
    producer.send('event', userdata)
    #if i < 10:
    #    producer.send('event', '\n')
    time.sleep(3)

producer.flush()  # make sure buffered messages are actually delivered before exit
And here is the Python code to consume the JSON data from Kafka. I run it like this:

spark-submit --jars /usr/hdp/2.3.4.7-4/spark/lib/spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar,/usr/hdp/2.3.4.7-4/spark/lib/spark-streaming-kafka-assembly_2.10-1.6.1.jar /home/hadoop/tajinder/clickstream_streaming.py

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

sc = SparkContext(appName="Clickstream_kafka")
stream = StreamingContext(sc, 2)  # 2-second micro-batches

# Receiver-based stream: (ZooKeeper quorum, consumer group, {topic: partitions})
kafka_stream = KafkaUtils.createStream(stream, "172.16.10.13:2181", "raw-event-streaming-consumer", {"event": 1})

# Each element is a (key, value) tuple; the JSON payload is the value
parsed = kafka_stream.map(lambda kv: json.loads(kv[1]))
parsed.pprint()

stream.start()
stream.awaitTermination()
I am able to receive the JSON data in Spark from Kafka, but how do I convert it to an RDD or a table (SchemaRDD) in PySpark so that RDD operations can be applied to it?
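In case it helps anyone who lands here: a minimal sketch of one common Spark 1.x pattern, converting each micro-batch into a DataFrame and registering it as a temp table (the table name "events" is illustrative; the selected columns come from the producer payload above):

from pyspark.sql import SQLContext  # already imported in the script above

def process(time, rdd):
    # Each micro-batch arrives as an RDD of dicts (the parsed JSON)
    if rdd.isEmpty():
        return
    sql_context = SQLContext(rdd.context)   # reuse the batch's SparkContext
    df = sql_context.createDataFrame(rdd)   # infer the schema from the dicts
    df.registerTempTable("events")          # now queryable like a normal table
    sql_context.sql("SELECT sessionid, url FROM events").show()

parsed.foreachRDD(process)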
06-10-2016
10:18 AM
I am trying to fetch JSON-format data from Kafka through Spark Streaming, and I want to create a temp table in Spark so I can query the JSON data like a normal table. I tried several tutorials available on the internet but didn't succeed. I am able to read a text file from HDFS and process it through Spark, but I am stuck consuming the JSON data from Kafka. Can somebody guide me on this?
Labels:
- Apache Spark
06-09-2016
04:23 PM
The Spark job runs fine now. I used: spark-submit --jars spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar,spark-streaming-kafka-assembly_2.10-1.6.1.jar <file.py>
06-09-2016
02:39 PM
I am only running it like this: spark-submit <file_name.py>
06-09-2016
11:33 AM
Getting this error while submitting a Spark job from the command line:

Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
   spark-submit command as
   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka:1.5.2 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
   Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-assembly, Version = 1.5.2.
   Then, include the jar in the spark-submit command as
   $ bin/spark-submit --jars <spark-streaming-kafka-assembly.jar> ...

The Python code I am running is:

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

sc = SparkContext(appName="Clickstream_kafka")
stream = StreamingContext(sc, 2)
kafka_stream = KafkaUtils.createStream(stream, "172.16.10.13:2181", "raw-event-streaming-consumer", {"event": 1})
parsed = kafka_stream.map(lambda kv: json.loads(kv[1]))  # (key, value) tuples; parse the JSON value
parsed.pprint()  # note: a DStream has no collect(); pprint() prints each micro-batch
stream.start()
stream.awaitTermination()
Labels:
- Apache Spark
06-01-2016
01:13 PM
I am running a query which launches 52 map tasks simultaneously. Because of this, my ResourceManager queue fills up completely and is 100% consumed. The query gets stuck at that point and gives no result. I want to reduce the number of map tasks that run in parallel.
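If the query runs on MapReduce, a hedged starting point (these are stock Hadoop/Hive properties, not something confirmed in this thread, and the values are illustrative) is to either cap concurrent map tasks or enlarge the input splits so fewer mappers are created:

-- Cap how many map tasks of the job run at the same time (Hadoop 2.7+)
set mapreduce.job.running.map.limit=10;

-- Or raise the minimum split size (here ~512 MB) so the job creates fewer mappers overall
set mapreduce.input.fileinputformat.split.minsize=536870912;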
Labels:
- Apache Hadoop
- Apache Hive
05-30-2016
08:55 AM
Thanks Kuldeep, I am able to run Hive queries by putting them in a file now.
05-28-2016
05:53 PM
1 Kudo
I am able to run a Hive query through the shell, but not by putting it into a file; that gives me "permission denied". I tried to run it as the hdfs user but still get the same error.
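For anyone hitting the same thing, a hedged checklist (the script path is hypothetical): "permission denied" here usually concerns either the script file itself or the Hive scratch directory in HDFS, so check both:

ls -l /home/hadoop/query.hql     # the script must be readable by the user running hive
hive -f /home/hadoop/query.hql   # run the query file once permissions allow it
hdfs dfs -ls /tmp/hive           # the scratch dir must be writable by that user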
Labels:
- Apache Hadoop
- Apache Hive
05-28-2016
01:15 PM
Can you tell me the recommended settings for my cluster? I have 3 nodes, each dual-core: 1 node with 12 GB RAM and the other two with 6 GB RAM.
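A hedged starting point for nodes that small (illustrative values only, normally set via Ambari; the right split depends on what else runs on each node):

yarn.nodemanager.resource.memory-mb=4096      # on the 6 GB nodes (e.g. 8192 on the 12 GB node)
yarn.scheduler.minimum-allocation-mb=1024
yarn.scheduler.maximum-allocation-mb=4096
mapreduce.map.memory.mb=1024
mapreduce.reduce.memory.mb=2048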
05-28-2016
11:16 AM
I have run into an issue. I get the Hive prompt and can run basic Hive queries that don't execute an MR job in the backend, but when I run a query that does execute an MR job, it hangs with no further progress (no mapper/reducer progress). I have checked the ResourceManager queue and it looks OK, as the container is allocated to the query only. I have also checked that MapReduce2 is up and running. Can anybody suggest what needs to be done in this case?
Labels:
- Apache Hadoop
- Apache Hive
05-28-2016
09:45 AM
Hey guys, I have run into another issue. I now get the Hive prompt and can run basic Hive queries that don't execute an MR job in the backend, but when I run a query that does execute an MR job, it hangs with no further progress (no mapper/reducer progress). I have checked the ResourceManager queue and it looks OK, as the container is allocated to the query only. I have also checked that MapReduce2 is up and running. Can anybody suggest what needs to be done in this case?
05-26-2016
05:22 PM
How can I check which jobs are running in my default ResourceManager queue, and how can I flush them? I need to do this through the command line.
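For anyone looking for the same thing, the standard YARN CLI covers this (the application id below is hypothetical):

yarn application -list -appStates RUNNING               # the queue of each app shows in the output
yarn application -kill application_1464172543210_0001   # kill one application by its id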
05-26-2016
11:26 AM
Getting this when I start hive:

WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.3.4.7-4/0/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.4.7-4/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.4.7-4/hive/lib/avro-tools-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

After that, a long wait and no further progress, errors, or warnings. No idea what needs to be done.
Labels:
- Apache Hive