Member since: 06-09-2016
Posts: 529
Kudos Received: 129
Solutions: 104

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1788 | 09-11-2019 10:19 AM |
| | 9426 | 11-26-2018 07:04 PM |
| | 2560 | 11-14-2018 12:10 PM |
| | 5562 | 11-14-2018 12:09 PM |
| | 3244 | 11-12-2018 01:19 PM |
07-19-2018
01:31 PM
@Bin Ye I recently presented image recognition using Spark at a meetup in Santiago, Chile. I've made the code and presentation, along with everything necessary to run it, public on GitHub. Feel free to review it. I used NiFi to pull messages with images from Twitter and send them to a Kafka topic. From there I used Spark Streaming to pull the messages from the topic and performed the image analysis. Finally, I stored the results in an HBase table. https://github.com/felixalbani/future-of-data-santiago-e1-spark-nifi HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
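For a rough idea of the Spark Streaming side of that pipeline, here is a minimal sketch using the Spark 2.x DStream Kafka API. The broker address and topic name ("tweets") are assumptions for illustration; the linked repo has the real code.

```python
# Sketch only: consume tweet messages from a Kafka topic with Spark Streaming.
# Broker and topic are placeholders, not the repo's actual values.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="image-recognition-sketch")
ssc = StreamingContext(sc, batchDuration=10)

stream = KafkaUtils.createDirectStream(
    ssc, ["tweets"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; the value carries the message payload,
# which the real code runs through the image analysis and writes to HBase.
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()
```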
07-19-2018
01:16 PM
@David Pocivalnik You can search for artifacts and versions on http://repo.hortonworks.com/ For example, I was able to find the hbase-client Maven dependency:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.0.0.3.0.0.0-1634</version>
</dependency>

HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
07-19-2018
12:59 PM
@Muhammad Umar What Python version are you using? One of the imports seems to point to Python 3. If that is the case, you will need to export a few environment variables for this to run correctly. Check: https://community.hortonworks.com/questions/138351/how-to-specify-python-version-to-use-with-pyspark.html When running with master yarn in client deploy mode, the executors will run on any of the cluster worker nodes. This means you need to make sure that all the Python libraries you are using, along with the desired Python version, are installed on all cluster worker nodes in advance. Finally, it would be good to have both the driver log (which is printed to the stdout of spark-submit) and the complete output of yarn logs -applicationId <appId> for further diagnosis. HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
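As a sketch, these are the environment variables usually exported before spark-submit to pin the Python version. The interpreter path is an assumption; point it at the Python 3 install that actually exists on every node.

```shell
# Make PySpark use Python 3 on both the driver and the executors.
# /usr/bin/python3 is an assumed path; adjust for your cluster nodes.
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
```

Both variables must point at an interpreter present on every worker node, or executors will fail at startup.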
07-19-2018
12:49 PM
@Melchicédec NDUWAYO You will probably need to escape the quotes in the strings (or try using single quotes instead) so that they won't break the JavaScript. If you agree, let's discuss this in a different thread, as it seems to be specific to running code with strings now, and the initial question has been addressed.
07-18-2018
01:19 PM
@forest lin spark.driver.extraClassPath is not the same as the one I shared for cluster mode. Could you confirm the code is running in client mode? And then try the exact settings I provided for cluster mode? Please let me know how it goes!
07-18-2018
01:17 PM
@Deb This looks to be related to Spark encoding Parquet differently than Hive does. Have you tried reading a different, non-Parquet table? Try adding the following configuration for the Parquet table: .config("spark.sql.parquet.writeLegacyFormat", "true") If that does not work, please open a new thread on this issue and we can follow up there. Thanks!
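In context, that setting goes on the SparkSession builder before the table is written. A minimal sketch (the app name is a placeholder; this is a config fragment, not the asker's code):

```python
from pyspark.sql import SparkSession

# Sketch: write Parquet in the legacy format so Hive's Parquet reader
# understands files produced by Spark.
spark = SparkSession \
    .builder \
    .appName("parquet_compat") \
    .config("spark.sql.parquet.writeLegacyFormat", "true") \
    .getOrCreate()
```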
07-17-2018
10:46 PM
@n c Please review this HCC link: https://community.hortonworks.com/questions/57866/how-to-move-hive-and-associated-components-from-on.html Definitely the most important piece is to take a good database backup while the Hive Metastore is down. Then, as outlined above, move the MySQL database (or whichever database is used) first. Then you can move the other components using Ambari. And yes, WebHCat is part of Hive! HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
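As a sketch of the backup step, assuming a MySQL-backed Metastore: the database name, user, and output file below are placeholders, and in practice credentials should come from a defaults file rather than the command line.

```shell
# Dump the Metastore database while the Hive Metastore service is stopped.
# "hive" (user and database) and the file name are placeholder values.
BACKUP_FILE=hive_metastore_backup.sql
mysqldump -u hive hive > "$BACKUP_FILE" 2>/dev/null \
  || echo "dump skipped (no mysqldump or no local DB on this host)"
```

Verify the dump is non-empty and restorable before moving the database to the new host.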
07-17-2018
01:09 PM
@forest lin The above suggestion was for --deploy-mode client, and I see you used --deploy-mode cluster instead. If you want to run in cluster mode, you need to make these changes:

cp /etc/hbase/conf/hbase-site.xml /etc/spark/conf
cp /etc/hbase/conf/hbase-site.xml /etc/spark2/conf
export SPARK_CLASSPATH="/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.2.0-205-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/lib/hbase-client.jar:/usr/hdp/current/phoenix-client/lib/phoenix-spark2-4.7.0.2.6.2.0-205.jar:/usr/hdp/current/phoenix-client/lib/hbase-common.jar:/usr/hdp/current/phoenix-client/lib/hbase-protocol.jar:/usr/hdp/current/phoenix-client/lib/phoenix-core-4.7.0.2.6.2.0-205.jar"
spark-submit \
--class com.test.SmokeTest \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 4 \
--num-executors 2 \
--conf "spark.executor.extraClassPath=phoenix-4.7.0.2.6.2.0-205-spark2.jar:phoenix-client.jar:hbase-client.jar:phoenix-spark2-4.7.0.2.6.2.0-205.jar:hbase-common.jar:hbase-protocol.jar:phoenix-core-4.7.0.2.6.2.0-205.jar" \
--conf "spark.driver.extraClassPath=phoenix-4.7.0.2.6.2.0-205-spark2.jar:phoenix-client.jar:hbase-client.jar:phoenix-spark2-4.7.0.2.6.2.0-205.jar:hbase-common.jar:hbase-protocol.jar:phoenix-core-4.7.0.2.6.2.0-205.jar" \
--jars /usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.2.0-205-spark2.jar,/usr/hdp/current/phoenix-client/phoenix-client.jar,/usr/hdp/current/phoenix-client/lib/hbase-client.jar,/usr/hdp/current/phoenix-client/lib/phoenix-spark2-4.7.0.2.6.2.0-205.jar,/usr/hdp/current/phoenix-client/lib/hbase-common.jar,/usr/hdp/current/phoenix-client/lib/hbase-protocol.jar,/usr/hdp/current/phoenix-client/lib/phoenix-core-4.7.0.2.6.2.0-205.jar \
--files /etc/hbase/conf/hbase-site.xml \
--verbose \
/tmp/test-1.0-SNAPSHOT.jar

HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
07-17-2018
12:51 PM
@Debananda Sahoo In Spark 2 you should leverage SparkSession instead of SparkContext. To read a JDBC datasource, just use the following code:

from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession \
    .builder \
    .appName("data_import") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .enableHiveSupport() \
    .getOrCreate()

jdbcDF2 = spark.read \
    .jdbc("jdbc:sqlserver://10.24.40.29;database=CORE;username=user1;password=Passw0rd", "test")

More information and examples at this link: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#jdbc-to-other-databases Please let me know if that works for you. HTH *** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.