Member since
09-25-2015
112
Posts
37
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 1104 | 12-21-2016 09:31 AM |
07-23-2019
01:23 AM
Do we have analytical window functions in SAM by any chance? I tried to use analytical window functions in Structured Streaming but had no luck, so I'm wondering if I can solve this using SAM. I looked into it and we can do a stream-stream join, which is what I am looking for. However, I'm not sure if SAM has windowed RANK/LAG functionality? Thank you
... View more
05-14-2019
12:29 PM
Can you please try removing .master("local[*]") from the Spark code and passing the master as a parameter to spark-submit instead: --master yarn --deploy-mode cluster. It should work.
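For reference, a minimal sketch of what I mean (the app name, class, and jar path below are placeholders, not from your code):

```scala
import org.apache.spark.sql.SparkSession

// Build the session without hard-coding a master; let spark-submit supply it, e.g.
//   spark-submit --master yarn --deploy-mode cluster --class com.example.MyJob my-job.jar
val spark = SparkSession.builder()
  .appName("MyJob") // placeholder app name
  .getOrCreate()
```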
... View more
04-06-2018
04:20 AM
@bkosaraju thanks for your reply. So you are suggesting either Hive with LLAP or a Postgres DB (any RDBMS) in the cluster? If you suggest Hive with LLAP: we are using HDP 2.5, so how can I configure Hive with LLAP? Yes, in the batch processing I will be required to insert and update records. We are using Hive 1.2 and hope it supports ACID and updates? Do you think it would help to create indexes on these Hive tables? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
... View more
04-05-2018
12:02 PM
Hi All, Currently we are doing batch processing using Spark (1.6), Hive (1.2), and HDP 2.5. While processing batches we need to store information about the batches, e.g. the batch id, start time of the batch, end time of the batch, etc., i.e. a control table. Will it work if I store this data in a Hive table for Spark to read before every batch run, or should I use HBase for it, since it is quick to look up records? Please suggest the best practice.
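For context, a rough sketch of the Hive-table option I have in mind, using the Spark 1.6 HiveContext API (the control table name and columns are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("BatchControl"))
val hiveContext = new HiveContext(sc)

// Hypothetical control table: batch_control(batch_id, start_time, end_time, status)
val lastBatch = hiveContext
  .sql("SELECT batch_id, start_time, end_time FROM batch_control ORDER BY batch_id DESC LIMIT 1")
  .collect()
  .headOption

// lastBatch would then drive where the next batch picks up from.
```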
... View more
Labels:
- Labels:
-
Apache HBase
-
Apache Hive
-
Apache Spark
02-21-2018
11:00 AM
@Josh Elser @Harald Berghoff If the JIRA patch has been pushed, then I believe we can use this Cloudera connector and do not need to rewrite the HBase salting and Spark partitioner from the post below, is that right? We can just use the connector API instead, can't we? http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/ Also, we have another connector from Hortonworks; I'm not sure if it supports bulk loading. The previous one is from Cloudera and is RDD-based, whereas the one below supports the DataFrame API: https://github.com/hortonworks-spark/shc It would be good if this connector supported DataFrame bulk loading. It supports the standard write, but I'm not sure about bulk load: https://issues.apache.org/jira/browse/HBASE-15336
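For context, the standard DataFrame write with shc looks roughly like this as far as I understand from the project README (the catalog, table name, and columns below are made up, and this is the normal write path, not an HFile bulk load):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("ShcWriteSketch").getOrCreate()
import spark.implicits._

// Made-up mapping of a two-column DataFrame onto HBase table "mytable", column family "cf".
val catalog =
  s"""{
     |  "table":{"namespace":"default", "name":"mytable"},
     |  "rowkey":"key",
     |  "columns":{
     |    "key":{"cf":"rowkey", "col":"key", "type":"string"},
     |    "value":{"cf":"cf", "col":"value", "type":"string"}
     |  }
     |}""".stripMargin

val df = Seq(("row1", "v1"), ("row2", "v2")).toDF("key", "value")

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```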
... View more
02-21-2018
09:56 AM
@Pavan Kumar Konda It depends on a lot of constraints, such as compression, serialization, and whether the storage format is splittable. I think ORC is just for Hive.

Avro
- Avro is a language-neutral data serialization format. Writables have the drawback that they do not provide language portability; Avro-formatted data can be described through a language-independent schema, so it can be shared across applications written in different languages.
- Avro stores the schema in the file header, so the data is self-describing.
- Avro files are splittable and compressible, which makes Avro a good candidate for data storage in the Hadoop ecosystem.
- Schema evolution: the schema used to read an Avro file need not be the same as the schema that was used to write it, which makes it possible to add new fields.

Parquet
- Parquet is a columnar format.
- Columnar formats work well when only a few columns are needed in a query/analysis: only the required columns are fetched/read, which reduces disk I/O.
- Parquet is well suited for data-warehouse-style workloads where aggregations are required on certain columns over a huge set of data.
- Parquet provides very good compression, up to 75%, when used with compression formats like Snappy.
- Parquet can be read and written using the Avro API and an Avro schema.
- It also provides predicate pushdown, further reducing disk I/O.
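For illustration, a minimal Spark sketch of the Parquet points above (paths and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetSketch").getOrCreate()

// Example input; the path is a placeholder.
val df = spark.read.json("/data/events")

// Write Parquet with Snappy compression (Snappy is also Spark's default Parquet codec).
df.write.option("compression", "snappy").parquet("/data/events_parquet")

// Read back only the columns we need: Parquet fetches just those column chunks,
// and the simple filter can be pushed down to skip row groups.
spark.read.parquet("/data/events_parquet")
  .select("eventId", "eventTime")
  .filter("eventTime > '2018-01-01'")
  .show()
```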
... View more
02-21-2018
09:44 AM
@Harald Berghoff & @Josh Elser thanks for your reply. OK, I will try to use HFiles according to the link, which I already knew about. However, that link says the Spark-HBase connector will support bulk loading in the future. Now that the JIRA says it is resolved, does that mean we can use the Spark-HBase connector? https://issues.apache.org/jira/browse/HBASE-14150
... View more
02-19-2018
02:22 PM
I am looking to read really large, petabyte-scale Hive tables into Spark. Which is the better option in terms of performance: reading directly from the Hive table using a SELECT statement, or giving Spark the HDFS path to the ORC files backing the Hive table? Looking for best practice and better performance as well. I am also looking to save petabyte-scale data to HBase tables. Hortonworks has a Spark-HBase connector, but I'm not sure how it handles petabyte volumes of data and what the performance is like. Shall I consider using HFiles or the Spark-HBase connector? Any suggestions will be really appreciated... thank you
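To make the two read options concrete, a rough sketch of both (the database, table, and warehouse path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadHiveOrOrc").enableHiveSupport().getOrCreate()

// Option 1: go through the Hive metastore with a SELECT.
val viaHive = spark.sql("SELECT * FROM mydb.big_table WHERE ds = '2018-02-19'")

// Option 2: point Spark directly at the ORC files backing the table's partition.
val viaOrc = spark.read.orc("/apps/hive/warehouse/mydb.db/big_table/ds=2018-02-19")
```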
... View more
Labels:
- Labels:
-
Apache HBase
-
Apache Hive
-
Apache Spark
02-15-2018
05:01 PM
At a high level I know the difference between Kafka Streams and Structured Streaming. However, which is currently better to use in production?
... View more
Labels:
- Labels:
-
Apache Kafka
02-23-2017
05:14 AM
@mqureshi yes, I tried making it transient as well, no luck. Inside the .map, map(path => sc.textFile(path)) is not reading the contents of the file; I think it's just returning the string, which is strange, because if I call sc.textFile outside the map function it returns the data. Above is the whole code, nothing else to it. Looks simple. Probably, rather than calling sc.textFile inside map, I need to fetch the data using the S3 API as shown in this link: http://michaelryanbell.com/processing-whole-files-spark-s3.html but I'm not sure; I might still get the task serialization error. Any ideas to better implement this requirement?
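For reference, the simpler workaround I'm considering: since SparkContext is not serializable and nested RDDs aren't allowed, sc.textFile can't be called inside map, but it does accept a comma-separated list of paths and decompresses .gz files automatically (the paths below are placeholders):

```scala
// sc is the existing SparkContext
val splitFilesPathList = List("s3n://bucket/file1.gz", "s3n://bucket/file2.gz") // placeholders
val lineRDD = sc.textFile(splitFilesPathList.mkString(","))
lineRDD.take(10).foreach(println)
```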
... View more
02-22-2017
06:24 PM
I am trying to read the gzip files in a dir in parallel. I followed the steps advised by Matei in the following link http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td15644.html but I get the exception below. It looks like other people got the same exception as well. I just wanted to know if it's possible to achieve this in Spark 2.1.0. I am running on a local VM at the moment using a simple two-line piece of code: val splitFilesPathList = List("s3n://pathtos3file1", "file2", "file3") etc... val lineRDD = sc.parallelize(splitFilesPathList, 4).map(path => sc.textFile(path)).take(10).toList.foreach(println) The above two lines of code don't work. Any help is really appreciated, please. I have checked for closures and moved the code into a new Scala class which extends Serializable, but I still get the "task not serializable" exception. I tried almost all possible ways. I checked this as well: https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html
... View more
Labels:
- Labels:
-
Apache Spark
02-09-2017
09:03 PM
Hi Tibor, thanks for your reply. I have looked at the above link. I have a dataset (a .gz file which I am reading in Spark) with records like:
abcdefghij; abc=1234 xyz=987 abn=567 ubg=345
After the pivot I want:
abcdefghij abn ubg abc xyz
abcdefghij 567 987 1234 987
and so on. All the above columns are string columns and the abn values are duplicated. Since they are strings and I am just looking to split the data, I don't need aggregation, I just need the pivot. Any ideas how to achieve this in Spark Scala? Thank you
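For what it's worth, the workaround I've seen suggested is to pivot with first() as a pass-through aggregate, since each (row, key) pair here has a single value anyway; a small sketch using the sample data above (column names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("PivotSketch").getOrCreate()
import spark.implicits._

// Long format: one row per (id, key, value) triple, as parsed from the .gz file.
val longDf = Seq(
  ("abcdefghij", "abc", "1234"),
  ("abcdefghij", "xyz", "987"),
  ("abcdefghij", "abn", "567"),
  ("abcdefghij", "ubg", "345")
).toDF("id", "key", "value")

// pivot still requires an aggregate; first() just passes the single value through.
val wideDf = longDf.groupBy("id").pivot("key").agg(first("value"))
wideDf.show()
```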
... View more
02-09-2017
05:06 PM
I am looking to perform a Spark pivot without aggregation. Is it really possible to use the Spark 2.1.0 API and generate a pivot without aggregation? I tried, but since pivot without an aggregation just returns a grouped dataset, the API is not working for me. Any ideas how to convert it to a DF or Dataset without performing an aggregation and show the DF, please? Thank you
... View more
Labels:
- Labels:
-
Apache Spark
02-02-2017
05:59 PM
1 Kudo
Just wondering if Spark supports reading *.gz files from an S3 bucket or dir as a DataFrame or Dataset. I think we can read them as an RDD, but it's still not working for me. Any help would be appreciated. Thank you. I am using s3n://... but Spark throws an invalid input path exception. val df = spark.sparkContext.textFile("s3n://..../*.gz") doesn't work for me 😞 I'd prefer to read the S3 dir of .gz files as a DF or Dataset if possible, else at least as an RDD please. Thank you
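Roughly what I'm trying (the bucket path is a placeholder): in Spark 2.x, spark.read.text should return a DataFrame and handle the .gz decompression, assuming the s3n filesystem jars and credentials are configured:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadGz").getOrCreate()

// Each line of every matched .gz file becomes a row in a single "value" column.
val df = spark.read.text("s3n://my-bucket/some/prefix/*.gz")
df.show(5, truncate = false)
```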
... View more
Labels:
- Labels:
-
Apache Spark
12-21-2016
09:31 AM
@Mridul M Thanks for the reply. Actually, I was using an old version of spark-testing-base and didn't know there was a 2.0.2 version. Holden Karau pointed me to Maven Central, where I found the latest version of spark-testing-base. The GitHub README of spark-testing-base mentioned 1.6, so I assumed the latest version was 1.6. But now it's sorted, and Spark 2.0.2 DataFrame testing works for me. In case people need this: it also requires the Hive dependency. This is still new and I couldn't find forums on Spark 2.0 testing, so I am posting this; it might save time for other developers 🙂
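In case it helps others, the sbt dependencies that worked for me looked roughly like this (please check Maven Central for the current spark-testing-base version; the one below is only an example):

```scala
// build.sbt (excerpt)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % "2.0.2"       % "provided",
  "org.apache.spark" %% "spark-hive"         % "2.0.2"       % "test",  // needed for DataFrame tests
  "com.holdenkarau"  %% "spark-testing-base" % "2.0.2_0.4.7" % "test"   // example version
)
```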
... View more
12-20-2016
11:43 AM
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/spark/Logging
java.lang.NoClassDefFoundError: org/apache/spark/Logging https://community.hortonworks.com/questions/58286/noclassdeffounderror-orgapachesparklogging-using-s.html I looked at the above question; I think it applies to Spark versions prior to 2.0, but I am on 2.0.2 🙂 I tried adding dependencies in sbt for Spark Streaming, Twitter, and many more, but I still get this error for Spark 2.0.2. Any ideas how to get this resolved? Thank you
... View more
Labels:
- Labels:
-
Apache Spark
12-08-2016
10:34 AM
@Bernhard Walter thanks for the reply.
... View more
12-06-2016
05:33 PM
1 Kudo
Just wondering: if we write an inner join query in plain SQL, or if we use the DataFrame API to perform the join, do we get the same performance? I can see from the diagram that both the query and the DataFrame are pushed to the Catalyst optimizer, but I just wanted to confirm: if I write plain SQL queries, can I use them for big data production use cases, or shall I use DataFrames for better performance? Thank you
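One way to convince yourself is to compare the physical plans, since both paths go through Catalyst (the tables and columns below are made up and assumed to already be registered):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PlanCompare").enableHiveSupport().getOrCreate()

// Both should produce essentially the same optimized physical plan.
val viaSql = spark.sql("SELECT a.id, b.name FROM a JOIN b ON a.id = b.id")
val viaDf  = spark.table("a").join(spark.table("b"), "id").select("id", "name")

viaSql.explain(true)
viaDf.explain(true)
```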
... View more
Labels:
- Labels:
-
Apache Spark
08-31-2016
09:37 AM
1 Kudo
Is there any way to monitor the throughput of NiFi, i.e. how many messages it has processed per second? Apart from the NiFi UI and Ambari, is there a way to monitor NiFi's throughput, for example via logs? I can't find any details regarding throughput in the NiFi logs or in any of the NiFi repositories (content, flowfile, db) either.
... View more
Labels:
- Labels:
-
Apache Ambari
-
Apache NiFi
08-23-2016
11:04 AM
I am new to alerting & monitoring. If we want to set up alerts for NiFi using Splunk, can we use the PutSplunk NiFi processor, or should we send the log files directly to Splunk? Currently our applications that use Splunk send their log files directly to Splunk for alerting. Which is the more effective way to achieve monitoring and alerting for NiFi using Splunk? Thank you
... View more
Labels:
- Labels:
-
Apache NiFi
08-16-2016
10:22 AM
1 Kudo
What is the best practice to productionize NiFi and schedule it to run every hour through a script? How do we upgrade NiFi to a new version? How do we deploy new workflows and modified workflows into production in EC2? How do we deploy changes to NiFi workflows? Thank you
... View more
Labels:
- Labels:
-
Apache NiFi
08-16-2016
10:13 AM
I think Concurrent Tasks set to 12 gives 12 threads for this GetKafka processor, i.e. 12 consumers in one consumer group? Please correct me if I am wrong. So if I have 12 partitions in my Kafka topic, then I believe 12 consumers are consuming from the 12 partitions? I have a 3-node NiFi cluster and set concurrent tasks to 4 so that I can split the load between all the NiFi nodes in the cluster. I think each node will be consuming from 4 partitions. What happens to the 4 partitions of a node if it dies or crashes? Normally there would be a rebalance and Kafka would reassign the partitions to consumers, so that the 2 remaining nodes consume from 6 partitions each? Is this how it works in NiFi? What will happen to the messages or offsets already in the queue of a node that has died? Is there any way to store offsets in NiFi to achieve fault tolerance?
... View more
Labels:
- Labels:
-
Apache Kafka
08-11-2016
09:18 AM
1 Kudo
I mean, in which logs are the NiFi warnings and errors stored?
... View more
Labels:
- Labels:
-
Apache NiFi
08-10-2016
08:03 AM
@Bryan Bende Thanks for the answer, it did work for me. There's just a small config I am looking for. Currently, when I merge my JSON events and export them to S3, I get the concatenated JSON events delimited by a space on a single line. How can I get the JSON events delimited by a newline (\n) instead? Thank you.
... View more
08-09-2016
04:03 PM
The current workflow is exporting each event individually. We are looking to merge all JSON events based on service/event name, concatenate them over time, and export them to S3. Our requirement is to merge them using Expression Language at runtime.
... View more
07-26-2016
12:54 PM
@Simon Elliston Ball thanks for the answer 🙂 If we install multiple ZK instances, does that mean we need to embed them with NiFi on the same nodes? If so, how do we sync ZK state across the different slave nodes where the multiple ZK instances are installed? Thank you.
... View more
07-26-2016
11:15 AM
@Simon Elliston Ball Does NiFi store offsets in memory as well? Just wondering how it will make sure not to read the same offsets or duplicate messages from Kafka? Thanks
... View more
07-26-2016
11:13 AM
Currently we are looking to use NiFi to get JSON events from Kafka and store them in S3. We are looking to set up a NiFi cluster, and I have the following questions, please:
1. What happens to NiFi when I do a rolling restart of Kafka? I think NiFi listens to ZooKeeper (:2181) or the Kafka nodes; how do we configure this in the GetKafka processor?
2. What does NiFi do if the peered connection becomes unavailable?
3. How does it handle a Kafka node going offline?
4. Does it always connect to the same Kafka node, or what determines which node to connect to?
5. Does it connect by IP or DNS (Kafka IPs can change)? Can we make NiFi connect to the Kafka DNS rather than the IP?
6. What will it do if a leadership election is invoked while it is consuming?
7. What monitoring and alerting will be in place in production?
8. Are there any issues if ZooKeeper is embedded in the slave nodes, and how does ZooKeeper maintain state integrity if we have more than one ZooKeeper instance in the NiFi cluster? Or is it better to have a separate single ZK instance in the cluster?
Thanks
... View more
Labels:
- Labels:
-
Apache Kafka
-
Apache NiFi
07-25-2016
03:08 PM
@Shishir Saxena Thanks for the answer. How is error handling built into NiFi flows? What are the content & flowfile repositories, i.e. which data does NiFi store in these repositories? Thank you
... View more