Member since: 11-30-2016
Posts: 33
Kudos Received: 5
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 15994 | 02-28-2017 04:58 PM |
11-02-2017
06:54 AM
Thanks for the quick response. One such issue I can point to is https://issues.apache.org/jira/browse/SPARK-20922, which affected Spark 2.1.1, and I found a few more through https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=spark, which in many places recommends using Spark version 2.2.0. Thanks,
11-02-2017
05:29 AM
Dear All, We are planning to upgrade our HDP cluster from 2.4.2 to 2.6.2 because our project needs Spark 2.2.0. Our security team has asked us to use Spark 2.2.0 since it is the latest vulnerability-free version. So the question is: is the Spark 2.1.1 that ships with the HDP 2.6.2 package free of known vulnerabilities? From what we have read and understood, the latest vulnerability-free version of Apache Spark is 2.2.0. Thanks in advance, Param.
08-21-2017
03:02 PM
Hi,
We have multiple models to load at run time to make predictions based on the trained models, and ours is a multithreaded application. If I load a model using org.apache.spark.ml.PipelineModel.load("path of the model"), will this work properly? Below is why I am asking:
1. I have not seen it documented anywhere that org.apache.spark.ml.PipelineModel is thread safe.
2. Even if it is not, can I still use it? Two or more threads could use the same org.apache.spark.ml.PipelineModel object at the same time; will that cause any problems?
Thanks,
Param.
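For what it's worth, a minimal sketch (class name and model paths are hypothetical) that loads each model once and shares the loaded PipelineModel across threads. transform() does not mutate the model, but since thread safety is not documented, treat this as an assumption rather than a guarantee:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ModelRegistry {
    // Load each model once and cache it; computeIfAbsent avoids duplicate loads.
    private final Map<String, PipelineModel> models = new ConcurrentHashMap<>();

    public PipelineModel get(String path) {
        return models.computeIfAbsent(path, PipelineModel::load);
    }

    // Each thread calls transform() on the shared model; transform() only reads the model.
    public Dataset<Row> predict(String path, Dataset<Row> input) {
        return get(path).transform(input);
    }
}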
05-09-2017
07:34 AM
@yvora Thanks for the response.
1. "Can you try setting spark.yarn.stagingDir to hdfs:///user/tmp/?" - This is not working.
2. "Can you please share which spark config are you trying to set which require RM address?" - I am running the Spark application from a Java program, so when the master is yarn it connects to the resource manager at 0.0.0.0:8032 by default. To override this I set the same in the Spark configuration, i.e.
sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname", resourcemanagerHostname);
sparkConf.set("spark.hadoop.yarn.resourcemanager.address", resourcemanagerAddress);
But the problem is that the resource manager is HA enabled, so how do I connect to it? One idea I have found for my question is that the Hadoop site files can be added to the Spark context, as below:
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "core-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "hdfs-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "mapred-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "yarn-site.xml"));
But the resource manager and staging directory configuration are needed even before the context is created, so that does not solve the problem. What I am looking for is something like the above for the SparkConf class/object.
Thanks,
Param.
05-08-2017
10:24 AM
Hi All, Running a Spark application in YARN mode against HDFS works fine when I provide the properties below:
sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname", resourcemanagerHostname);
sparkConf.set("spark.hadoop.yarn.resourcemanager.address", resourcemanagerAddress);
sparkConf.set("spark.yarn.stagingDir", stagingDirectory);
But there are two problems:
1. Since my HDFS is NameNode HA enabled, it does not work when I set spark.yarn.stagingDir to the common nameservice URL, for example hdfs://hdcluster/user/tmp/; it fails with "unknown host hdcluster". It works fine when I give the URL as hdfs://<ActiveNameNode>/user/tmp/, but we do not know in advance which NameNode will be active, so how do I resolve this? I have also noticed that SparkContext accepts a Hadoop configuration, but the SparkConf class has no method that accepts one.
2. How do I provide the resource manager address when the Resource Managers are running in HA?
Thanks in advance, Param.
05-02-2017
05:01 AM
Thank you for the response, I agree with you completely. But my question is that even the partial result is not getting stored correctly. Thanks, Param.
04-27-2017
02:07 PM
Dear All, I am using Spark to train a model and save it to the file system using org.apache.spark.ml.PipelineModel.save(). This works fine when I run Spark in local mode, but when I run Spark in standalone mode with 2 workers, it trains the model correctly yet stores only a partial result when saving to the file system. The file system I am using is the local one, i.e. file://. Could someone please point out what the issue is here? And is it a problem to store the trained PipelineModel on the local file system? FYI, the same code works when I use HDFS. The Spark version I am using is 2.0.0. Thanks in advance, Param.
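For illustration, a small sketch (the input path, columns, and the single Tokenizer stage are all placeholders) that saves the fitted PipelineModel to a shared file system; with a file:// path each executor writes its part files onto its own local disk, which is consistent with seeing only partial output on any one machine:
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("save-model-example").getOrCreate();
Dataset<Row> training = spark.read().json("hdfs://namenode:8020/data/train.json");  // placeholder input

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{
        new Tokenizer().setInputCol("text").setOutputCol("words")   // stand-in stage
});
PipelineModel model = pipeline.fit(training);

// Saving to a file system every node shares keeps all part files in one place
model.write().overwrite().save("hdfs://namenode:8020/models/example");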
04-10-2017
06:00 PM
Hi All, As per my understanding, PipelineModel.save saves the model to the file system based on the URI; before storing, it collects the individual results from the workers and writes them to the file system provided. When I run Spark in standalone mode with 2 workers and the save path is on the local Linux FS, say /tmp/examplemodel/, the model data is stored on the worker nodes as well. My question is: the data stored on the workers' file systems is intermediate data, while the complete model data is on the driver, so why does Spark not clean up the data on the workers once it is of no further use? And when I use HDFS, where is the workers' intermediate data stored, given that the final result ends up in one place? TIA, Param.
04-05-2017
06:09 AM
@yvora Actually I am submitting my Spark application from a program. I have done the same thing programmatically by adding sparkConfig.setJars(new String[]{"myjar.jar"}); but it is still not working.
04-04-2017
05:36 PM
Hi All, When I run Spark in local mode my code works fine, but when I run it in standalone mode with deploy mode client, I get a class not found exception, "Caused by: java.lang.ClassNotFoundException:", for my custom class. I am distributing the jar that contains the class to the workers by setting the property in the Spark configuration, sparkConfig.setJars(new String[]{"myjar.jar"});, and I can see the jar being distributed to the worker nodes, yet I still get the exception. Why is the executor not able to find the class in the jar on the worker node? Could someone please tell me what mistake I am making here? Thanks in advance, Param.
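A minimal sketch, assuming a hypothetical jar location and master URL; SparkConf.setJars takes paths (or URIs) that the driver can read, so an absolute path is usually safer than a bare file name:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConfig = new SparkConf()
        .setAppName("standalone-classpath-example")
        .setMaster("spark://master-host:7077")              // placeholder master URL
        .setJars(new String[]{"/opt/app/lib/myjar.jar"});    // absolute path readable by the driver
JavaSparkContext jsc = new JavaSparkContext(sparkConfig);    // the jar is shipped to executors when the context starts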
03-31-2017
05:06 AM
@amankumbare Thanks for responding. Sumit and I work in the same team. It is happening for all topics, and the JDK we are using is 7. The main issue here is that the broker's znode content has no host and port information: ["PLAINTEXTSASL://xxxx.domain.com:9092"],"host":null,"version":2,"port":-1}
- Is this the expected behavior?
- It happens when the listener value is PLAINTEXTSASL://xxxx.domain.com:9092, and everything is fine when it is just PLAINTEXT.
Thanks in advance, Param.
03-17-2017
02:42 PM
Hi All, How do I close a question in the Hortonworks community by accepting an answer? Thanks, Param.
03-15-2017
03:24 PM
@Sandeep Nemuri Thanks, it worked. I had tried this already but forgot to create the file on each node; now it is fine. And I have one more question: if I run the Spark app in YARN mode I can set the memory parameter through the Spark configuration using the spark.yarn.driver.memoryOverhead property. Is something similar available for standalone and local mode? Thanks in advance, Param.
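A minimal sketch (master URL is a placeholder); standalone and local mode have no direct equivalent of spark.yarn.driver.memoryOverhead, but the general memory properties can be set on SparkConf, keeping in mind that spark.driver.memory only takes effect if it is set before the driver JVM starts:
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .setAppName("memory-example")
        .setMaster("spark://master-host:7077")    // or "local[*]"
        .set("spark.executor.memory", "4g")        // per-executor heap in standalone mode
        .set("spark.driver.memory", "2g");         // only effective if set before the driver JVM starts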
03-15-2017
11:31 AM
Hi All, When executing a Spark application on a YARN cluster, can I access the local file system (the underlying OS file system), even though YARN points to HDFS? Thanks, Param.
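A minimal sketch, assuming a hypothetical path that exists at the same location on every node: an explicit file:// URI reads from the local OS file system even when the cluster's default file system is HDFS:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("local-fs-example");   // master/deploy mode supplied at submit time
JavaSparkContext jsc = new JavaSparkContext(conf);
// The file:// scheme bypasses the default HDFS file system; the path must exist
// on every node that runs a task (hypothetical path shown).
JavaRDD<String> lines = jsc.textFile("file:///opt/data/input.txt");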
03-15-2017
11:25 AM
Sorry for the delayed reply, I got busy with some work. @Artem Ervits thanks a lot for all the responses. I was able to achieve this by setting the Spark configuration as below:
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXXXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXXX:8032");
sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXXX:8020,hdfs://XXXX:8020");
sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXXX:8020/user/hduser/");
sparkConfig.set("--deploy-mode", deployMode);
Thanks,
Param.
03-08-2017
05:11 PM
Hi All, I am trying to describe a Kafka consumer group with the command below:
bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server broker1:9092,broker2:9092 --describe --group console-consumer-44261 --command-config conf/security_user_created
But I am getting the error: Consumer group `console-consumer-44261` does not exist or is rebalancing. I have checked ZooKeeper and the consumer group exists, yet I always get this error. Could someone point out what mistake I am making here? The Kafka version I am using is 0.9. Thanks in advance, Param.
02-28-2017
05:17 PM
Hi All, When I check the lag on a Kerberized Kafka cluster I get the error below.
Command I am using:
bin/kafka-consumer-groups.sh --describe --zookeeper XXXXX:2181 --group test-consumer-group --security-protocol SASL_PLAINTEXT
Output I am getting:
GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
Error while executing consumer group command null
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:122)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:114)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$sendRequest(SimpleConsumer.scala:99)
at kafka.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:165)
at kafka.admin.ConsumerGroupCommand$ZkConsumerGroupService$anonfun$getLogEndOffset$1.apply(ConsumerGroupCommand.scala:205)
at kafka.admin.ConsumerGroupCommand$ZkConsumerGroupService$anonfun$getLogEndOffset$1.apply(ConsumerGroupCommand.scala:202)
at scala.Option.map(Option.scala:145)
at kafka.admin.ConsumerGroupCommand$ZkConsumerGroupService.getLogEndOffset(ConsumerGroupCommand.scala:202)
at kafka.admin.ConsumerGroupCommand$ConsumerGroupService$class.kafka$admin$ConsumerGroupCommand$ConsumerGroupService$describePartition(ConsumerGroupCommand.scala:133)
Could someone point out what the issue is here? Thanks, Param.
02-28-2017
05:08 PM
Hi All, When checking the offsets for Kerberized Kafka with the command:
bin/kafka-consumer-offset-checker.sh --zookeeper XXXX:2181 --group test-consumer --security-protocol SASL_PLAINTEXT
I get null output, i.e.:
Group Topic Pid Offset logSize Lag Owner
Exiting due to: null.
Could someone tell me whether there is an issue with the command I am using? The Kafka version I am using is 0.9.0.2.4. Thanks in advance, Param.
02-28-2017
04:58 PM
@Artem Ervits, Thanks a lot for your time and help. I was able to achieve my objective by setting the Hadoop and YARN properties in the Spark configuration:
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXX:8032");
sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020");
sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/");
Regards, Param.
02-27-2017
05:54 PM
@Artem Ervits, Thanks again! And sorry if I am asking too many questions here. What I am actually looking for is this: per the project requirement I should not use the spark-submit script, so I am passing the cluster configuration through the Spark config as given below.
SparkConf sparkConfig = new SparkConf().setAppName("Example App of Spark on Yarn");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXXX:8032");
It is able to identify the Resource Manager, but it is failing because it does not identify the file system, even though I am setting the HDFS file system configuration as well:
sparkConfig.set("fs.defaultFS", "hdfs://xxxhacluster");
sparkConfig.set("ha.zookeeper.quorum", "xxx:2181,xxxx:2181,xxxx:2181");
It still assumes the local file system, and the error I am getting in the Resource Manager is: exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
Thanks and regards, Param.
02-27-2017
06:44 AM
@Artem Ervits Thank you very much for the response. I am able to submit the job to YARN through the spark-submit command, but what I am actually looking for here is to do the same thing through a program. It would be great if you could share a template for that, preferably in Java. -Param.
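Not from this thread, but one common option is the SparkLauncher API that ships with Spark; a sketch assuming a hypothetical application jar and main class, and that SPARK_HOME and HADOOP_CONF_DIR point at a valid client configuration:
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class YarnSubmitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical jar path and main class; SparkLauncher invokes spark-submit under the hood
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/opt/app/lib/myapp.jar")
                .setMainClass("com.example.MySparkJob")
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .startApplication();

        // Poll until the YARN application reaches a terminal state
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}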
02-27-2017
06:38 AM
@Sriharsha Chintalapani Could you please answer this question?
02-27-2017
06:36 AM
Thanks for the response, I very much agree with your answer.
02-26-2017
12:01 PM
1 Kudo
Hi All, I am new to Spark. I am trying to submit a Spark application from a Java program, and I am able to do so for a Spark standalone cluster. What I actually want to achieve is submitting the job to a YARN cluster, and I am able to connect to the YARN cluster by explicitly adding the Resource Manager property in the Spark config as below:
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXX:8032");
But the application fails with:
exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
I got this from the Resource Manager log; what I found is that Spark assumes the file system is local and does not upload the required libraries.
Source and destination file systems are the same. Not copying file:/tmp/spark-1ed67f05-d496-4000-86c1-07fcf8526181/__spark_libs__1740543841989079602.zip
I got this from the Spark application side where I am running my program. The issue I suspect is that it assumes the file system is local and not HDFS; correct me if I am wrong. My questions are:
1. What is the actual reason for the job to fail, given the data and log info above?
2. Could you please tell me how to add resource files to the Spark configuration, like addResource in the Hadoop configuration?
Thanks in advance, Param.
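A sketch of one way to get addResource-like behavior (the site-file base path is a placeholder): load the cluster's XML files into a Hadoop Configuration and copy every entry into SparkConf with the spark.hadoop. prefix, which Spark forwards to the Hadoop configuration it builds:
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;

// Load only the cluster site files (no defaults), then forward every entry to SparkConf
Configuration hadoopConf = new Configuration(false);
String base = "/etc/hadoop/conf/";                     // placeholder base path
hadoopConf.addResource(new Path(base + "core-site.xml"));
hadoopConf.addResource(new Path(base + "hdfs-site.xml"));
hadoopConf.addResource(new Path(base + "yarn-site.xml"));

SparkConf sparkConfig = new SparkConf().setAppName("yarn-from-java").setMaster("yarn");
for (Map.Entry<String, String> entry : hadoopConf) {
    // Keys prefixed with spark.hadoop. are copied into the Hadoop configuration Spark uses
    sparkConfig.set("spark.hadoop." + entry.getKey(), entry.getValue());
}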
02-26-2017
08:43 AM
1 Kudo
Hi All, When I try to get the lag from Kerberized Kafka I get an "Exiting due to: null" error. Command I am using:
bin/kafka-consumer-offset-checker.sh --zookeeper XXXXXXX:2181 --group XXXX-consumer --security-protocol PLAINTEXTSASL
Output:
Could not fetch offset for [metrics,0] due to kafka.common.NotCoordinatorForConsumerException.
Group Topic Pid Offset logSize Lag Owner
Exiting due to: null.
Could someone please point out where I am making a mistake here? Thanks in advance, Param.
02-26-2017
08:36 AM
1 Kudo
Hi All, When Kerberizing Kafka using Ambari, it sets the Kafka security protocol to PLAINTEXTSASL instead of SASL_PLAINTEXT, but everywhere in the documentation it is stated that it must be SASL_PLAINTEXT. I have a few questions about this:
1. Why is Ambari setting the security protocol to PLAINTEXTSASL; is it a bug?
2. We are able to produce and consume messages from a program written in Java, but the producer sets the security protocol to PLAINTEXTSASL while the consumer uses SASL_PLAINTEXT, and it works fine. How can that work when the actual protocol is just PLAINTEXTSASL?
Thanks in advance, Param.
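For reference, a sketch of the client-side settings the Java producer/consumer API expects (broker host is a placeholder, and a JAAS login configuration is assumed to be supplied separately via java.security.auth.login.config):
import java.util.Properties;

// Standard client-side security settings for a Kerberized cluster; the Java
// clients expect SASL_PLAINTEXT, while PLAINTEXTSASL appears to be the
// Ambari/HDP spelling used on the broker side.
Properties props = new Properties();
props.put("bootstrap.servers", "broker1.example.com:9092");
props.put("security.protocol", "SASL_PLAINTEXT");
props.put("sasl.kerberos.service.name", "kafka");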
12-28-2016
10:00 AM
Thank you for the response, the concept is clear to me. Actually my question is: if the maximum renewable lifetime of the ticket is 7 days, then the ticket can be renewed and used only on or before those 7 days. So when I make calls to HBase, the Hadoop client makes RPC calls and before that it executes the code below:
if (UserGroupInformation.isLoginKeytabBased()) {
    UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
    UserGroupInformation.getLoginUser().reloginFromTicketCache();
}
That matches what you said, but my question is also this: should I not get a ticket expiration exception once the user principal's maximum renewal time is reached? Why am I not getting that? And is UserGroupInformation.getLoginUser().reloginFromKeytab() actually changing the ticket times every time it re-logs in, i.e.:
Valid starting       Expires              Service principal
12/28/16 12:27:20    12/28/16 12:27:50    krbtgt/XYZ@XYZ.COM
        renew until 12/28/16 12:28:11
Finally, what is the difference between renewal (kinit -R on the command line) and re-login (UserGroupInformation.getLoginUser().reloginFromKeytab() in a program)?
Thanks, Param.
12-27-2016
05:16 PM
1 Kudo
Dear All, I am running a program that fetches records from a secured (Kerberized) HBase cluster. The user principal I am using in my program has a maximum ticket life of 30 seconds and a maximum renewable life of 1 minute. I am doing an experiment in a test program to understand how auto renewal works in Hadoop. When I make the thread sleep for one minute before fetching the records, it is still able to fetch them. My question: even though auto renewal of the ticket is working fine, the maximum renewable lifetime is 1 minute, so when the thread sleeps for a minute and then fetches the records, how is it still able to fetch them? This seems to violate the basic definition of the maximum renewable lifetime of a ticket. Is it because whenever it performs reloginFromKeytab before making an RPC call, the lifetime of the ticket is refreshed and advanced into the future, i.e. the current renewal time + maximum life time? And what is the difference between renewing a ticket and reloginFromKeytab? Thanks in advance, Param.
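A minimal sketch, assuming a hypothetical principal and keytab path: reloginFromKeytab (and checkTGTAndReloginFromKeytab) performs a fresh login and obtains a brand-new TGT from the KDC, whereas kinit -R only extends an existing ticket and therefore stops working once the maximum renewable lifetime is reached:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);

// Keytab-based login: a relogin acquires a fresh TGT rather than renewing the old one
UserGroupInformation.loginUserFromKeytab("hbaseclient@EXAMPLE.COM",
        "/etc/security/keytabs/hbaseclient.keytab");

// Before each batch of RPC calls, re-login if the current TGT is close to expiry
UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();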
12-14-2016
05:54 PM
1 Kudo
Hi Everyone, We are using an endpoint coprocessor to fetch records from our HBase cluster. We have a 3-node cluster and 180 regions in total. Calls to the endpoint coprocessor are taking more time than usual; after analysis, the property I suspect is hbase.regionserver.handler.count, which is 30 by default. My client code makes batch calls to the coprocessor, and there can be 10 such batch calls running simultaneously; each batch call creates 180 separate threads, so the total number of threads on the client side can sometimes be 1800. I changed hbase.regionserver.handler.count from 30 to 100 but am still not seeing much performance improvement. My questions:
1. What is a reasonable value for the property hbase.regionserver.handler.count?
2. How can I tell whether that property is impacting performance or not?
3. If I increase this property, what other settings should I modify for proper functioning?
Thanks in advance, Param.
12-07-2016
09:22 AM
Yes, I am creating the MultiRowRangeFilter, adding it to the filter list, and then setting the list on the Scan object.
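For reference, a minimal sketch of that setup with hypothetical row-key ranges, using the MultiRowRangeFilter and FilterList APIs:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;
import org.apache.hadoop.hbase.util.Bytes;

static Scan buildRangeScan() throws IOException {
    // Hypothetical row-key ranges; start is inclusive, stop is exclusive here
    List<RowRange> ranges = new ArrayList<>();
    ranges.add(new RowRange(Bytes.toBytes("row0010"), true, Bytes.toBytes("row0020"), false));
    ranges.add(new RowRange(Bytes.toBytes("row0500"), true, Bytes.toBytes("row0600"), false));

    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    filters.addFilter(new MultiRowRangeFilter(ranges));

    Scan scan = new Scan();
    scan.setFilter(filters);   // only rows inside the given ranges are returned
    return scan;
}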