Member since
06-23-2016
136
Posts
8
Kudos Received
8
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1019 | 11-24-2017 08:17 PM |
| | 1108 | 07-28-2017 06:40 AM |
| | 293 | 07-05-2017 04:32 PM |
| | 327 | 05-11-2017 03:07 PM |
| | 2190 | 02-08-2017 02:49 PM |
05-07-2018
08:14 PM
Duh! Of course it's a different machine! I'll check it in the morning. Thanks!!
05-07-2018
02:29 PM
I am trying to follow the procedure here, but /usr/hdp/current/kafka-broker is a broken symlink and kafka-topics.sh is nowhere to be found. HDP shows the Kafka service running OK. TIA!
02-11-2018
02:14 PM
Does anyone have a basic guide to moving my HDP 2.6 cluster to AWS? I need to move the data, which seems to be a fairly straightforward copy, right? But then how do I use Amazon processing that matches the Hive, Spark, etc. installation from my cluster? TIA!
12-06-2017
10:29 AM
@Jay Kumar SenSharma Thanks! Sorry, I forgot to say I am trying to run Spark 2.2 as an independent service that uses HDP 2.6. I assume this won't work for it.
12-06-2017
10:06 AM
Thanks! Unfortunately it already has that line.
12-06-2017
07:52 AM
EDIT: I forgot to say I am trying to run Spark 2.2 as an independent service that uses HDP 2.6. Please help, I am running out of time! I run:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue thequeue examples/jars/spark-examples*.jar 10 --executor-cores 4 --num-executors 11 --driver-java-options="-Dhdp.version=2.6.0.3-8" --conf "spark.executor.extraJavaOptions=-Dhdp.version=2.6.0.3-8"
In YARN cluster mode I get this error: "Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster". I have tried all the fixes I can find except: 1. Classpath issues. Where do I set this and to what? 2. This question suggests it may be due to missing jars. Which jars do I need and what do I do with them? TIA!
12-05-2017
05:19 PM
I've also tried the -Dhdp.version= fixes from here. I've not put the new Spark on my other machines; could that be the problem? If so, where do I put it? I created a new folder on master, but if I use the same folder on the nodes, how does master know about it?
12-05-2017
07:54 AM
I am trying to run Spark 2.2 with HDP 2.6. I stop Spark2 from Ambari, then I run:
/home/ed/spark2.2/spark-2.2.0-bin-hadoop2.7/bin/spark-shell --jars /home/ed/.ivy2/jars/stanford-corenlp-3.6.0-models.jar,/home/ed/.ivy2/jars/jersey-bundle-1.19.1.jar --packages databricks:spark-corenlp:0.2.0-s_2.11,edu.stanford.nlp:stanford-corenlp:3.6.0 --master yarn --deploy-mode client --driver-memory 4g --executor-memory 4g --executor-cores 2 --num-executors 11 --conf spark.hadoop.yarn.timeline-service.enabled=false
It used to run fine, then it started giving me:
Error initializing SparkContext. org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
Now it just hangs after:
17/12/05 07:41:17 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
I can run it OK without --master yarn --deploy-mode client, but then I only get the driver as executor. I have tried spark.hadoop.yarn.timeline-service.enabled = true. yarn.nodemanager.vmem-check-enabled and pmem-check-enabled are set to false. Can anyone help or point me to where to look for errors? TIA!
PS spark-defaults.conf:
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18081
spark.yarn.historyServer.address master.royble.co.uk:18081
spark.driver.extraJavaOptions -Dhdp.version=2.6.0.3-8
spark.yarn.am.extraJavaOptions -Dhdp.version=2.6.0.3-8
# spark.eventLog.dir hdfs:///spark-history
# spark.eventLog.enabled true
# spark.history.fs.logDirectory hdfs:///spark-history
# spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
# spark.history.ui.port 18080
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18081
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0
spark.io.compression.codec lzf
spark.yarn.queue default
spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005
11-29-2017
09:34 PM
I am getting desperate here! My Spark2 jobs take hours then get stuck! I have a 4 node cluster each with 16GB RAM and 8 cores. I run HDP 2.6, Spark 2.1 and Zeppelin 0.7. I have:
spark.executor.instances 11
spark.executor.cores 2
spark.executor.memory 4G
yarn.nodemanager.resource.memory-mb = 14336
yarn.nodemanager.resource.cpu-vcores = 7
Via Zeppelin (same notebook) I do an INSERT into a Hive table:
dfPredictions.write.mode(SaveMode.Append).insertInto("default.predictions")
for a 50-column table with about 12 million records. This gets split into 3 stages of 75, 75 and 200 tasks. The two 75-task stages get stuck at tasks 73 and 74, and the garbage collection lasts for hours. Any idea what I can try? EDIT: I have not looked at tweaking partitions; can anyone give me pointers on how to do that, please?
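Since the question above asks for pointers on tweaking partitions, here is a minimal sketch of one way to do it, assuming the spark session and dfPredictions DataFrame from the post; the target partition count is only an illustrative guess (roughly a small multiple of the total executor cores), not a recommendation.
// Sketch: repartition before writing so the insert is spread across more, smaller tasks.
// Names (spark, dfPredictions, default.predictions) are taken from the post above;
// targetPartitions is a hypothetical tuning value, not a known-good setting.
import org.apache.spark.sql.SaveMode

val targetPartitions = 44 // e.g. ~2x (11 executors * 2 cores)
dfPredictions
  .repartition(targetPartitions)
  .write.mode(SaveMode.Append)
  .insertInto("default.predictions")

// The shuffle parallelism that produces the 200-task stage can also be adjusted:
spark.conf.set("spark.sql.shuffle.partitions", "100")
Whether more or fewer partitions helps depends on where the time actually goes (per the post, it looks like GC), so this is a starting point to measure against rather than a fix.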
11-29-2017
10:25 AM
Wow thanks. I'll try these tomorrow when my latest slow job finishes.
11-29-2017
09:45 AM
My Hive query fails with:
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
I cannot see any logs in the Tez View. It looks like the parse stage is hiding off to the right, but I cannot access it. How do I look at it, and where are the logs? TIA!! For some reason I cannot upload a jpg, so there is a png here.
11-28-2017
09:04 PM
I have a 4-node cluster, each node with 16GB RAM and 8 cores. I run HDP 2.6, Spark 2.1 and Zeppelin 0.7. I have:
spark.executor.instances 11
spark.executor.cores 2
spark.executor.memory 4G
yarn.nodemanager.resource.memory-mb = 14336
yarn.nodemanager.resource.cpu-vcores = 7
In an earlier Zeppelin paragraph I did spark.sql("set hive.execution.engine=tez;"), although I do not know if Tez is actually doing the job. How do I tell? (Tez does not work if I run a job via Hive View 2.0.) Via Zeppelin (same notebook) I do an INSERT into a Hive table:
dfPredictions.write.mode(SaveMode.Append).insertInto("default.predictions")
for a 50-column table with about 12 million records. This gets split into 3 stages of 75, 75 and 200 tasks. The first stage is running and has already taken 3.2 hours to do 45 out of the 75 tasks. Does this seem right with this size cluster? UPDATE: nearly finished! Some tasks have hours of garbage collection.
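As a quick aside on where those 75/75/200 task counts come from: here is a small sketch, assuming the spark session and dfPredictions DataFrame from the post, for inspecting the input partitioning and the shuffle setting (a 200-task stage typically matches spark.sql.shuffle.partitions, whose default is 200).
// Sketch: check the DataFrame's partition count and the shuffle-partition setting.
// Names (spark, dfPredictions) come from the post above.
val inputPartitions = dfPredictions.rdd.getNumPartitions
println(s"input partitions: $inputPartitions")
println("shuffle partitions: " + spark.conf.get("spark.sql.shuffle.partitions"))
Per-task GC time for the slow stage is easiest to read off the stage detail page in the Spark UI.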
11-24-2017
08:17 PM
The answer is: because I am an idiot. Only s3 had a DataNode and NodeManager installed. Hopefully this helps someone.
11-24-2017
11:59 AM
Hi. I am running Spark2 from Zeppelin (0.7 in HDP 2.6) and I am doing an IDF transformation which crashes after many hours. It is run on a cluster with a master and 3 datanodes: s1, s2 and s3. All nodes have a Spark2 client, and each has 8 cores and 16GB RAM. I just noticed it is only running on one node, s3, with 5 executors. In zeppelin-env.sh I have set zeppelin.executor.instances to 32 and zeppelin.executor.mem to 12g, and it has the line:
export MASTER=yarn-client
I have set yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler. I also set spark.executor.instances to 32 in the Spark2 interpreter. Anyone have any ideas what else I can try to get the other nodes doing their share?
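One way to confirm which hosts actually received executors, rather than trusting the configured values, is to ask the status tracker from a Spark paragraph; a minimal sketch, assuming the SparkContext sc that the Zeppelin Spark2 interpreter provides.
// Sketch: list the executors YARN actually granted and the host each one runs on.
// Assumes an active SparkContext named sc (as in the Zeppelin Spark2 interpreter).
val infos = sc.statusTracker.getExecutorInfos
infos.foreach(e => println(s"executor at ${e.host}:${e.port}"))
println(s"total entries (driver included): ${infos.length}")

// Compare against what was actually applied from the configuration:
println("spark.executor.instances = " + sc.getConf.getOption("spark.executor.instances"))
If only s3 shows up here, the request for more executors is not reaching YARN (or YARN cannot place them), which narrows down where to look next.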
10-18-2017
10:33 AM
I am using HDP-2.6.0.3 but I need Zeppelin 0.8, so I have installed it as an independent service. When I run:
%sql
show tables
I get nothing back, and I get 'table not found' when I run Spark2 SQL commands. These tables exist in the 0.7 Zeppelin. Can anyone tell me what I am missing? The steps I performed to create the Zeppelin 0.8 build are as follows:
mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Copied zeppelin-site.xml and shiro.ini from /usr/hdp/2.6.0.3-8/zeppelin/conf to /home/ed/zeppelin/conf. Created /home/ed/zeppelin/conf/zeppelin-env.sh, in which I put the following:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0.3-8"
Copied /etc/hive/conf/hive-site.xml to /home/ed/zeppelin/conf
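A small sketch that may help narrow this down from a paragraph in the 0.8 build, assuming the SparkSession named spark injected by the interpreter: if the new interpreter did not pick up hive-site.xml, it will be using a fresh local metastore and warehouse instead of the cluster one, and the checks below make that visible (the property names are standard Spark 2.x ones, not anything specific to this setup).
// Sketch: check whether this interpreter session is wired to the cluster Hive metastore.
// Assumes the Zeppelin-provided SparkSession named spark.
println("catalog implementation: " + spark.conf.get("spark.sql.catalogImplementation", "unknown"))
println("warehouse dir: " + spark.conf.get("spark.sql.warehouse.dir", "unknown"))
spark.catalog.listTables().show(false)
If the warehouse directory points at a local path rather than the cluster location, the interpreter is not reading the copied hive-site.xml.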
08-29-2017
08:20 AM
FYI it seems the files disappear first, after a while the folder goes too. Before they disappear they look like: ed@master:~$ HADOOP_USER_NAME=hdfs hadoop fs -ls /data
Found 2 items
-rw-r--r-- 3 admin hdfs 4273495 2017-08-29 09:19 /data/abbo0.txt
-rw-r--r-- 3 admin hdfs 4211602 2017-08-29 09:19 /data/zip0.txt
08-29-2017
08:16 AM
Hi, I am running: HDP-2.6.0.3, Spark2 2.1.0, Hive 1.2.1000, HDFS 2.7.3. This is driving me mad, so any help is appreciated. I am trying to load a Hive table. I have tried in Hive View 2.0 and also in Spark. I get no errors and it works if I run it quickly, but my HDFS data keeps disappearing!
hc.sql("SET hive.support.sql11.reserved.keywords=false;")
hc.sql("add jar /usr/hdp/2.6.0.3-8/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar;")
hc.sql("DROP TABLE tweets11")
hc.sql("create table tweets15 ( racist boolean, contributors string, coordinates string, created_at string, entities struct < hashtags: array <string>, symbols: array <string>, urls: array <struct <display_url: string, expanded_url: string, indices: array <int>, url: string>>, user_mentions: array <string>>, favorite_count int, favorited boolean, filter_level string, geo string, id bigint, id_str string, in_reply_to_screen_name string, in_reply_to_status_id string, in_reply_to_status_id_str string, in_reply_to_user_id string, in_reply_to_user_id_str string, is_quote_status boolean, lang string, place string, possibly_sensitive boolean, retweet_count int, retweeted boolean, source string, text string, timestamp_ms string, truncated boolean, `user` struct < contributors_enabled: boolean, created_at: string, default_profile: boolean, default_profile_image: boolean, description: string, favourites_count: int, follow_request_sent: string, followers_count: int, `following`: string, friends_count: int, geo_enabled: boolean, id: bigint, id_str: string, is_translator: boolean, lang: string, listed_count: int, location: string, name: string, notifications: string, profile_background_color: string, profile_background_image_url: string, profile_background_image_url_https: string, profile_background_tile: boolean, profile_image_url: string, profile_image_url_https: string, profile_link_color: string, profile_sidebar_border_color: string, profile_sidebar_fill_color: string, profile_text_color: string, profile_use_background_image: boolean, protected: boolean, screen_name: string, statuses_count: int, time_zone: string, url: string, utc_offset: string, verified: boolean>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE")
hc.sql("LOAD DATA INPATH '/data' OVERWRITE INTO TABLE tweets15") /data is on hdfs and contains a couple of text files. But when I run this I get: res59: org.apache.spark.sql.DataFrame = [key: string, value: string]
res60: org.apache.spark.sql.DataFrame = [result: int]
res61: org.apache.spark.sql.DataFrame = []
res62: org.apache.spark.sql.DataFrame = []
res63: org.apache.spark.sql.DataFrame = []
and /data is gone or disappears after a while! The same thing happens in Hive:
add jar /usr/hdp/2.6.0.3-8/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar;
DROP TABLE tweets0;
create table tweets0
( racist boolean, contributors string, coordinates string, created_at string, entities struct < hashtags: array <string>, symbols: array <string>, urls: array <struct <display_url: string, expanded_url: string, indices: array <int>, url: string>>, user_mentions: array <string>>, favorite_count int, favorited boolean, filter_level string, geo string, id bigint, id_str string, in_reply_to_screen_name string, in_reply_to_status_id string, in_reply_to_status_id_str string, in_reply_to_user_id string, in_reply_to_user_id_str string, is_quote_status boolean, lang string, place string, possibly_sensitive boolean, retweet_count int, retweeted boolean, source string, text string, timestamp_ms string, truncated boolean, `user` struct < contributors_enabled: boolean, created_at: string, default_profile: boolean, default_profile_image: boolean, description: string, favourites_count: int, follow_request_sent: string, followers_count: int, `following`: string, friends_count: int, geo_enabled: boolean, id: bigint, id_str: string, is_translator: boolean, lang: string, listed_count: int, location: string, name: string, notifications: string, profile_background_color: string, profile_background_image_url: string, profile_background_image_url_https: string, profile_background_tile: boolean, profile_image_url: string, profile_image_url_https: string, profile_link_color: string, profile_sidebar_border_color: string, profile_sidebar_fill_color: string, profile_text_color: string, profile_use_background_image: boolean, protected: boolean, screen_name: string, statuses_count: int, time_zone: string, url: string, utc_offset: string, verified: boolean>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
LOAD DATA INPATH '/data' OVERWRITE INTO TABLE tweets0;
It is not necessarily immediately after running this code (code may have nothing to do with it) but nothing else is happening on the cluster. TIA!!
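One Hive behaviour that may be relevant here: a non-LOCAL LOAD DATA INPATH moves the source files into the table's warehouse directory rather than copying them, so /data emptying out after the load is at least consistent with the load itself (and dropping a managed table afterwards deletes that data as well). If the goal is to keep the files under /data, one hedged alternative is an external table over that directory; a sketch reusing the hc handle and SerDe from the post, with the column list shortened for brevity (the real schema is the long one above) and tweets_ext as a hypothetical table name:
// Sketch: declare an EXTERNAL table over /data instead of LOADing (which moves files).
// hc and the SerDe jar path are from the post; tweets_ext and the trimmed columns are illustrative.
hc.sql("add jar /usr/hdp/2.6.0.3-8/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar")
hc.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS tweets_ext (
    id BIGINT,
    text STRING,
    lang STRING
  )
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/data'
""")
// Dropping an external table later removes only the metadata, not the files in /data.
Whether this explains the disappearance in this particular case is a guess; the timing described in the post does not line up perfectly with the load.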
07-28-2017
06:40 AM
It was a setting in tez.lib.uris. I changed it to: /hdp/apps/${hdp.version}/tez/tez.tar.gz,hdfs://master.royble.co.uk:8020/jars/json-serde-1.3.7-jar-with-dependencies.jar (Note: no space after the comma, and the jar is given as a full hdfs:// path.)
07-28-2017
05:38 AM
Thanks Deepesh. It is: HIVE_AUX_JARS_PATH=/usr/hdp/2.6.0.3-8/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar
if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then
if [ -f "${HIVE_AUX_JARS_PATH}" ]; then
export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH}
elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
fi
elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
fi
07-26-2017
04:13 PM
I am trying to run Hive from the CLI:
HADOOP_USER_NAME=hdfs hive -hiveconf hive.cli.print.header=true -hiveconf hive.support.sql11.reserved.keywords=false -hiveconf hive.aux.jars.path=/usr/hdp/2.6.0.3-8/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar -hiveconf hive.root.logger=DEBUG,console
but I get this error:
java.lang.RuntimeException: java.io.FileNotFoundException: File does not exist: hdfs://master.royble.co.uk:8020/user/hdfs/ /home/ed/Downloads/serde/json-serde-1.3.7-jar-with-dependencies.jar
I have had so many problems with that jar, which I originally used to create a Hive table. Normally I would do an 'add jar', but I cannot start Hive to do that. I have tried adding the jar to hive-env, to /usr/hdp/<version>/hive/auxlib (on the Hive machine), and to hive.aux.jars.path, but nothing works. Any idea why Hive is looking for that odd path, or in fact why it is looking for it at all? FYI: master is not the machine with Hive on it, but it is where I run Ambari. The path /home/ed/Downloads/serde is one I have used in the past but can't remember when. Using HDP-2.6.0.3. Any help is much appreciated as this is driving me mad!
07-25-2017
12:07 PM
Ironically, I am unable to access a question I asked today:
Tags: hcc
07-25-2017
10:33 AM
1 Kudo
In RStudio I do:
library(sparklyr)
library(dplyr)
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client") # got from ambari spark2 configs
config <- spark_config()
sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')
which gives:
Failed during initialize_connection: org.apache.hadoop.security.AccessControlException: Permission denied: user=ed, access=WRITE, inode="/user/ed/.sparkStaging/application_1500959138473_0003":admin:hadoop:drwxr-xr-x
Normally I fix this sort of problem with HADOOP_USER_NAME=hdfs hadoop fs -put, but I do not know how to do this in R. I thought maybe change /user/ed's owner and group to hdfs:
ed@master:~$ hdfs dfs -ls /user
Found 11 items
drwx------ - accumulo hdfs 0 2017-05-14 15:38 /user/accumulo
drwxr-xr-x - admin hadoop 0 2017-06-27 06:52 /user/admin
drwxrwx--- - ambari-qa hdfs 0 2017-06-02 10:46 /user/ambari-qa
drwxr-xr-x - admin hadoop 0 2017-06-02 11:00 /user/ed
drwxr-xr-x - hbase hdfs 0 2017-05-14 15:35 /user/hbase
drwxr-xr-x - hcat hdfs 0 2017-05-14 15:44 /user/hcat
drwxr-xr-x - hdfs hdfs 0 2017-06-20 12:43 /user/hdfs
drwxr-xr-x - hive hdfs 0 2017-05-14 15:44 /user/hive
drwxrwxr-x - oozie hdfs 0 2017-05-14 15:46 /user/oozie
drwxrwxr-x - spark hdfs 0 2017-05-14 15:40 /user/spark
drwxr-xr-x - zeppelin hdfs 0 2017-07-24 09:29 /user/zeppelin
but I am worried, as /user/ed is currently owned by admin/hadoop and admin is how I log into Ambari, so I do not want to mess up other stuff. Any help is much appreciated!
07-21-2017
08:50 AM
I am using Spark's MultilayerPerceptronClassifier. This generates a column 'prediction' in the 'predictions' DataFrame. When I try to show it I get the error:
SparkException: Failed to execute user defined function($anonfun$1: (vector) => double) ...
Caused by: java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
Other columns, for example 'vector', display OK. Part of the predictions schema:
|-- vector: vector (nullable = true)
|-- prediction: double (nullable = true)
My code is:
// racist is boolean, needs to be string:
val train2 = train.withColumn("racist", 'racist.cast("String"))
val test2 = test.withColumn("racist", 'racist.cast("String"))
val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")
val word2Vec = new Word2Vec().setInputCol("lemma").setOutputCol("vector") //.setVectorSize(3).setMinCount(0)
val layers = Array[Int](4,5, 2)
val mpc = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100).setFeaturesCol("vector").setLabelCol("indexracist")
val pipeline = new Pipeline().setStages(Array(indexer, word2Vec, mpc))
val model = pipeline.fit(train2)
val predictions = model.transform(test2)
predictions.select("prediction").show() Any pointers are much appreciated!
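One possible cause worth checking, offered as a guess rather than a confirmed diagnosis: MultilayerPerceptronClassifier requires the first entry of the layers array to equal the length of the feature vector, and Word2Vec produces 100-dimensional vectors by default unless setVectorSize is called, so layers = Array(4, 5, 2) would not match and could produce exactly this kind of dimension-mismatch error at prediction time. A minimal sketch of the aligned version, reusing the column names from the post (the value 4 simply mirrors the existing layers(0); any agreed size works):
// Sketch: keep the Word2Vec vector size and the network's input layer size in step.
// Column names (lemma, vector, indexracist) are from the post; vectorSize = 4 is illustrative.
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val vectorSize = 4
val word2Vec = new Word2Vec().setInputCol("lemma").setOutputCol("vector").setVectorSize(vectorSize)
val layers = Array[Int](vectorSize, 5, 2) // input layer = feature length, output = 2 classes
val mpc = new MultilayerPerceptronClassifier()
  .setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
  .setFeaturesCol("vector").setLabelCol("indexracist")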
07-05-2017
04:32 PM
Here is how you do it. I got its 'name' from here. Spark 2.1 needs the Scala 2.11 version, so the name is databricks:spark-corenlp:0.2.0-s_2.11. Edit the Spark2 interpreter and add the name, then save it and allow it to restart. In Zeppelin:
%spark.dep
z.reset()
z.load("databricks:spark-corenlp:0.2.0-s_2.11")
07-04-2017
01:27 PM
Can someone explain what I need to do to get the Stanford CoreNLP wrapper for Apache Spark to work in Zeppelin/Spark, please? I have done this:
%spark.dep
z.reset() // clean up previously added artifact and repository
// add artifact recursively
z.load("databricks:spark-corenlp:0.2.0-s_2.10")
and this:
import com.databricks.spark.corenlp.functions._
val dfLemmas= filteredDF.withColumn("lemmas", lemmas('noURL)).select("racist", "filtered","noURL", "lemmas")
dfLemmas.show(20, false)
but I get this:
<console>:42: error: not found: value lemmas
val dfLemmas= filteredDF.withColumn("lemmas", lemmas('noURL)).select("racist", "filtered","noURL", "lemmas")
Do I have to download the files and build them or something? If so, how do I do that? Or is there an easier way? TIA!!!!