Member since: 07-07-2016
Posts: 53
Kudos Received: 6
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 299 | 06-27-2016 10:00 PM
02-27-2018
08:30 PM
Hi, we are running decision tree models with sparklyr from RStudio on Spark, and the model works up to a certain number of trees. When we increase the number of trees, Spark does not scale and fails with various issues. At one stage it asks for 164 tasks, the Spark context never gets those 164 tasks, and the job just gets stuck there. We are running Spark in standalone mode on Docker containers, with 14 workers and 1 master. Below are the Spark properties we are using; the standalone cluster has memory available. Please suggest any tuning properties that would help. Thanks

spark.cores.max=82
spark.driver.memory=10g
spark.driver.maxResultSize=10g
spark.executor.memory=20g
spark.executor.cores=2
spark.network.timeout=800s
spark.rpc.askTimeout=800s
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.minExecutors=10
spark.dynamicAllocation.maxExecutors=500
spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.reducer.maxSizeInFlight=1024m
spark.shuffle.file.buffer=256k
spark.locality.wait=15s
spark.files.maxPartitionBytes=268435456
spark.sql.shuffle.partitions=1000
spark.default.parallelism=1000
spark.broadcast.compress=true
spark.io.compression.codec=lz4
spark.rdd.compress=true
spark.shuffle.compress=true
spark.shuffle.spill.compress=true
spark.memory.fraction=0.8
spark.memory.storageFraction=0.4
spark.scheduler.minRegisteredResourcesRatio=0.0
spark.scheduler.maxRegisteredResourcesWaitingTime=30s
spark.task.maxFailures=5
spark.shuffle.io.maxRetries=3
spark.shuffle.io.preferDirectBufs=true
spark.shuffle.io.retryWait=5s
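For reference, a minimal sparklyr sketch of how these properties can be supplied from RStudio; the master URL is a placeholder and only a few of the properties above are repeated, with the rest set the same way:

library(sparklyr)

# Build a config list and copy the tuning properties onto it
conf <- spark_config()
conf$spark.cores.max       <- 82
conf$spark.driver.memory   <- "10g"
conf$spark.executor.memory <- "20g"
conf$spark.executor.cores  <- 2
conf$spark.sql.shuffle.partitions <- 1000
# ... remaining spark.* properties from the list above

# Connect to the standalone master with that config
sc <- spark_connect(master = "spark://spark-master:7077", config = conf)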
01-11-2018
08:37 PM
Agreed, thanks for the suggestion. For now I have a workaround: changing the run schedule from 0 seconds to 1 second, and I no longer see the lease-holder exception. There is a little more latency when writing to HDFS than with 0 seconds, but the error is gone. I will work on your suggestion for production. Thanks for the help! Srikaran
01-11-2018
06:32 PM
@Bryan Bende I like the MergeContent option you suggested, but please clarify one thing. In production, surveys arrive in real time, and as soon as a customer writes a survey we want to see it in HDFS. So my use case is: over a 24-hour period (one day) I want to see only one file in HDFS, and as soon as surveys are posted I should see them in HDFS. If I use the MergeContent processor, will that still be considered real time? I am guessing it will wait until the data reaches a certain threshold, at which point the merge happens and the result is written to HDFS? During a day there will be times with no surveys at all, times with a bunch of surveys arriving at once, and times with one survey per second. Thanks, Srikaran.
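For illustration, the kind of MergeContent settings I have in mind (purely assumed values, not anything suggested above); my understanding is that Max Bin Age forces a merge after that much time even if only a few surveys have arrived, which bounds the latency:

MergeContent (illustrative values only)
  Merge Strategy            = Bin-Packing Algorithm
  Minimum Number of Entries = 1
  Maximum Number of Entries = 10000
  Max Bin Age               = 5 min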
01-11-2018
05:46 PM
@Bryan Bende Hi Bryan. We are testing this in DEV, which has only 1 NiFi node; the PutHDFS target cluster has 4 datanodes. In prod we will have 2 NiFi nodes and 5 datanodes. Thanks
01-11-2018
05:29 PM
Hi, we have a NiFi flow where we source social media surveys from an API and write them to HDFS via the PutHDFS processor with the conflict resolution strategy set to "append". This flow works if surveys arrive one by one with a delay of a second or two. We want to test with some 20,000 surveys all arriving at once, and the PutHDFS processor fails in that scenario. The errors are below:

WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed to create file XXXXXXXXXXXX for DFSClient_NONMAPREDUCE_XXXXXXXXX because current leaseholder is trying to recreate file.
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user@XXXXXXXXX (auth:KERBEROS) cause:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file XXXXXXXXXXX for DFSClient_NONMAPREDUCE_XXXXXXXX for client XXXXXXXX because current leaseholder is trying to recreate file.
INFO org.apache.hadoop.ipc.Server: IPC Server handler 14 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from XXXXXXXX Call#XXXXX Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file XXXXXXXXX for DFSClient_NONMAPREDUCE_XXXXXXXX because current leaseholder is trying to recreate file.

With these exceptions, all the records get blocked in the NiFi queue ahead of PutHDFS and never make it into HDFS. Is there a way to configure the NiFi PutHDFS processor to accommodate this use case? Right now it is scheduled as "Timer Driven" with Concurrent Tasks set to 1, a Run Schedule of 0 seconds, and a Yield Duration of 1 second. Please suggest. Thanks, Srikaran
01-04-2018
06:37 PM
@Karl Fredrickson Hi Karl, the same issue occurs after a stop and restart. I tried 1 hour and 4 hours for the Kerberos relogin period, since I use the same relogin period for FetchHDFS/ListHDFS. This is happening only for GetHDFS. I assume the GetHDFS processor is trying to delete, move, or write, which might need additional permissions. The HDFS files are owned by hive:hive with 771 permissions, and with those same permissions FetchHDFS and ListHDFS work. Thanks
01-04-2018
05:38 PM
Hi, I am using the FetchHDFS NiFi processor, which runs fine when fetching an exact HDFS file. Now I want to get all the HDFS files under a directory, so I am using GetHDFS with the keep-source-file option set to "True". But I am getting a Kerberos error:

ERROR [Timer-Driven Process Thread-1] o.apache.nifi.processors.hadoop.GetHDFS GetHDFS[id=XXXXXXXXXX] Error retrieving file hdfs://XXXXXXXXXXXXXXXXXXXX.0. from HDFS due to java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt): {}
java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:332)
at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:205)
Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at org.apache.hadoop.security.authentication.client.KerberosAuthenticator$1.run(KerberosAuthenticator.java:311)
at org.apache.hadoop.security.authentication.client.KerberosAuthenticator$1.run(KerberosAuthenticator.java:287)
at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:287)

I am wondering why the same Kerberos credentials work for FetchHDFS/ListHDFS but not for GetHDFS. Does GetHDFS need additional setup? Please suggest. Thanks, Srikaran
12-14-2017
04:56 PM
@Qi Wang I am running this flight delays use case on Spark 1.6.0 and I am getting the issue below. Can you please let me know what I am missing?

scala> val lrModel = lrPipeline.fit(trainingData)
<console>:64: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[String]
 required: org.apache.spark.sql.DataFrame
       val lrModel = lrPipeline.fit(trainingData)
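In case it helps frame the question, here is a rough sketch (not the tutorial's actual code) of what I understand is needed in Spark 1.6: parsing the RDD[String] into a case class and converting it to a DataFrame before calling fit. The field names and split logic are made up for illustration:

// assumes the spark-shell's sqlContext is in scope
import sqlContext.implicits._

// hypothetical record layout; the real tutorial parses the lines differently
case class Flight(carrier: String, depDelay: Double, arrDelay: Double)

val trainingDF = trainingData               // RDD[String], one CSV line per record
  .map(_.split(","))
  .map(f => Flight(f(0), f(1).toDouble, f(2).toDouble))
  .toDF()                                   // now a DataFrame, as the pipeline expects

val lrModel = lrPipeline.fit(trainingDF)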
12-06-2017
06:26 PM
1 Kudo
@Timothy Spann Thanks a lot. These are very helpful. Let me test the flow and I will update accordingly. Thanks
12-06-2017
06:24 PM
1 Kudo
@anarasimham It looks like GetHDFS will replace the HDFS file, so I am planning to use FetchHDFS and then the InvokeHTTP processor. For now I am converting the Avro file to JSON on the Hadoop end, then fetching and posting the JSON. I will test Avro and other formats directly and will update. Thanks!
12-04-2017
07:17 PM
1 Kudo
Hello. I have an HDFS file whose data needs to be posted to an external URL (HTTPS). I have the username and password for the URL, and I can post a sample JSON via Postman from my browser using those credentials. Now I have to use NiFi for this flow. Please let me know exactly which NiFi processors I should use to get the data from HDFS and post it to the URL. Also, kindly let me know what format the HDFS data should be in for this kind of use case. Thanks, Srikaran
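For context, the flow I am imagining is something like the sketch below; the URL and credential values are placeholders and I may be missing steps:

ListHDFS -> FetchHDFS -> InvokeHTTP

InvokeHTTP properties (illustrative):
  HTTP Method                   = POST
  Remote URL                    = https://example.com/endpoint
  Basic Authentication Username = <username>
  Basic Authentication Password = <password>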
09-14-2016
03:06 PM
@Predrag Minovic Great options. It looks like, out of the options above, the second ZK quorum would have to be installed manually outside Ambari and Kafka configured accordingly? If that's the case, when I do upgrades on this cluster in the future, I would have to treat the upgrade of the second, manually installed ZK quorum as a separate effort, right? I also like the two-cluster solution, but what if some business logic on cluster 1 depends on Kafka on cluster 2? In that case I guess the two-cluster solution would not work, right? Please confirm. Thanks, Sri.
09-13-2016
07:10 PM
Hi,
I am planning to build an HDP 2.4.2 Kerberized cluster via Ambari blueprints, and I am going to change the blueprint to have 6 ZooKeepers. The reason for 6 ZKs is that I want two ZK quorums of 3 ZKs each: one quorum for HDFS NameNode HA, HBase, and the other services, and a second quorum dedicated to Kafka alone. I am assuming that when I build the cluster with 6 ZKs it will initially create only one ZK quorum containing all 6. Can I change it to two ZK quorums after cluster installation from zkCli, or is there an option in the Ambari blueprint itself to create two ZK quorums with 3 ZK servers in each? Please advise. Thanks
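To illustrate what I mean by a dedicated quorum: the second ensemble would essentially be three nodes with a zoo.cfg like the sketch below, and Kafka pointed only at it (host names are placeholders):

# zoo.cfg on each of the three Kafka-dedicated ZooKeeper nodes
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk4.example.com:2888:3888
server.2=zk5.example.com:2888:3888
server.3=zk6.example.com:2888:3888

# Kafka server.properties pointing at the dedicated quorum
zookeeper.connect=zk4.example.com:2181,zk5.example.com:2181,zk6.example.com:2181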
07-25-2016
04:04 PM
@Artem Ervits Thanks. So even if both clusters are Kerberized, with cross-realm trust plus the hftp protocol we should be good, right? Also, can you please clarify what you mean by the source cluster being "read only"? Do you mean that while the hftp distcp is happening, the source cluster shouldn't have any write operations going on?
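For reference, this is the kind of command I have in mind, run from the destination (HDP) side since hftp is read-only on the source; host names, ports, and paths are placeholders:

hadoop distcp hftp://cdh-namenode.example.com:50070/data/source hdfs://hdp-namenode.example.com:8020/data/target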
07-25-2016
02:04 AM
@Robert Levas I am looking at a Cloudera (CDH) to Hortonworks (HDP) cluster migration. Since the two use different packages for the cluster build and might have different Hadoop/Hive/HBase versions, and Hadoop is not backward compatible (??), will distcp work in this case? Please let me know!
07-21-2016
02:42 PM
Hi, can we copy data from a Cloudera Kerberized cluster to a Hortonworks Kerberized cluster via a cross-realm setup? Assume both clusters have different Hadoop, Hive, and HBase versions, and that both clusters have complex network topologies and so on. Please let me know.
07-11-2016
02:12 AM
@Ali Bajwa Thanks!
07-11-2016
02:12 AM
@Peter Kim yes
06-30-2016
08:38 PM
@Rahul Pathak I created a note.json for screenshots and it worked!
06-29-2016
06:28 PM
@Rahul Pathak Yes, it worked, but an Ambari restart of Zeppelin still fails with the errors below:

WARN [ambari-heartbeat-processor-0] HeartbeatProcessor:545 - Operation failed - may be retried. Service component host: ZEPPELIN_MASTER, host: xxxxxxxxxxxx 359-0 and Task id 3311
29 Jun 2016 10:20:24,531 ERROR [ambari-heartbeat-processor-0] ServiceComponentHostImpl:1030 - Can't handle ServiceComponentHostEvent event at current state, serviceComponentName=ZEPPELIN_MASTER, hostName=xxxxxxxxxx, currentState=STARTED, eventType=HOST_SVCCOMP_OP_FAILED, event=EventType: HOST_SVCCOMP_OP_FAILED
29 Jun 2016 10:20:24,532 WARN [ambari-heartbeat-processor-0] HeartbeatProcessor:563 - State machine exception. Invalid event: HOST_SVCCOMP_OP_FAILED at STARTED
06-29-2016
03:21 PM
@Rahul Pathak
06-29-2016
02:35 PM
I copied note.json to the screenshots folder and the error is gone. However, the Zeppelin restart still fails. I see the following in the Ambari logs:

Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/ZEPPELIN/package/scripts/master.py", line 235, in <module>
Master().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
method(env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 535, in restart
self.start(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/ZEPPELIN/package/scripts/master.py", line 179, in start
self.update_zeppelin_interpreter()
File "/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/ZEPPELIN/package/scripts/master.py", line 196, in update_zeppelin_interpreter
data = json.load(urllib2.urlopen(zeppelin_int_url))
File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
06-29-2016
02:18 PM
@Rahul Pathak This seems to clear most of the issues except this one:

Can't read note file:///usr/hdp/current/zeppelin-server/lib/notebook/screenshots
java.io.IOException: file:///usr/hdp/current/zeppelin-server/lib/notebook/screenshots/note.json not found

For some reason I don't see a note.json in the screenshots folder. Even if I remove the entire screenshots folder, the Ambari Zeppelin restart still looks for screenshots by default, but there is no corresponding note.json. Can I create an empty note.json and give it a try? I don't think that will work. If you have a sample note.json that I can use for this screenshots folder, I think that would work. Please let me know.
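In case a skeleton is useful, this is the kind of minimal note.json I was thinking of trying; the fields are only what a freshly created note appears to contain and the id value is made up, so treat this as an assumption rather than a known-good file:

{
  "paragraphs": [],
  "name": "screenshots",
  "id": "2SCREENSHOT1",
  "angularObjects": {},
  "config": {},
  "info": {}
}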
06-28-2016
08:26 PM
@Jitendra Yadav I checked, and it's not related to the pid file issue. What surprises me is that when I revert to the original Zeppelin configs that came with the installation, it always starts fine. If I then make any change to the Zeppelin config via Ambari and do a restart, this issue comes up.
06-28-2016
07:19 PM
@Aravindan Vijayan Well, distributed mode means it should use cluster resources and not local resources, right? So the customer is asking: what's the use of distributed mode if processes still run locally?
06-28-2016
07:14 PM
@Jitendra Yadav Yes, I took care of this. There were two issues: one is this, and the other is the missing note.json in the screenshots folder.
06-28-2016
07:12 PM
@Rahul Pathak
06-28-2016
06:52 PM
ERROR [2016-06-28 13:45:20,075] ({main} VFSNotebookRepo.java[list]:140) - Can't read note file:///usr/hdp/current/zeppelin-server/lib/notebook/screenshots
java.io.IOException: file:///usr/hdp/current/zeppelin-server/lib/notebook/screenshots/note.json not found