Member since: 09-15-2016
Posts: 19
Kudos Received: 4
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5663 | 03-08-2017 06:03 PM
03-09-2017
01:40 AM
Hi @X Long, how about up-voting some answers? The guys tried to help you and could not have known the issue was something trivial with your IDE. Give and take. Thanks!
11-22-2016
07:24 PM
1 Kudo
Sorry for the late response. I'm not sure of a way to broadcast the data without collecting to the driver first. Because we assume the broadcasted table is small, the collect and broadcast to and from the driver should be fairly quick. You would have about the same network traffic even if you could somehow skip the collect, since we need a full copy of the smaller table on each worker anyway.

The join(broadcast(right), ...) call gives Spark a hint to do a broadcast join. This overrides spark.sql.autoBroadcastJoinThreshold, which is 10 MB by default. Don't try to broadcast anything larger than 2 GB: that is the limit for a single block in Spark, and you will get an OOM or overflow exception, because the data structure backing a block is capped at 2 GB.

The autoBroadcastJoinThreshold currently applies only to Hive tables that have had statistics run on them, so the broadcast hint is what you use for DataFrames that aren't in Hive, or ones where statistics haven't been run. The general Spark Core broadcast function will still work too; in fact, under the hood, the DataFrame API calls the same collect and broadcast that you would with the general API. The concept of partitions is still there, so after you do a broadcast join, you're free to run mapPartitions on it.
11-08-2016
10:07 PM
It is very clear, thanks
09-29-2016
04:36 PM
@X Long please accept the answer if it solved your problem. Thank you
09-13-2017
06:22 PM
@X Long I was facing the same kind of issue and resolved it with the following steps:

1) In Ambari -> Hive -> Configs -> Advanced -> Custom hive-site -> Add Property..., add the following properties based on your HBase configuration (you can look the values up under Ambari -> HBase -> Configs):

hbase.zookeeper.quorum=xyz (find this property value in HBase)
zookeeper.znode.parent=/hbase-unsecure (find this property value in HBase)
phoenix.schema.mapSystemTablesToNamespace=true
phoenix.schema.isNamespaceMappingEnabled=true

2) Copy these jars to /usr/hdp/current/hive-server2/auxlib:

/usr/hdp/2.5.6.0-40/phoenix/phoenix-4.7.0.2.5.6.0-40-hive.jar
/usr/hdp/2.5.6.0-40/phoenix/phoenix-hive-4.7.0.2.5.6.0-40-sources.jar

If those jars don't work for you, try getting phoenix-hive-4.7.0.2.5.3.0-37.jar instead and copy it to /usr/hdp/current/hive-server2/auxlib.

3) Add this property to custom hive-env:

HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-server2/auxlib/

4) Add the following properties to custom hbase-site.xml:

phoenix.schema.mapSystemTablesToNamespace=true
phoenix.schema.isNamespaceMappingEnabled=true

5) Also run the following commands:

jar uf /usr/hdp/current/hive-server2/auxlib/phoenix-4.7.0.2.5.6.0-40-client.jar /etc/hive/conf/hive-site.xml
jar uf /usr/hdp/current/hive-server2/auxlib/phoenix-4.7.0.2.5.6.0-40-client.jar /etc/hbase/conf/hbase-site.xml

I hope this solution works for you 🙂
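For reference, the properties from step 1 would look like this if written directly as hive-site.xml entries (a sketch only; xyz stays a placeholder for your own ZooKeeper quorum hosts):

```xml
<!-- Custom hive-site additions; xyz is a placeholder for your ZooKeeper quorum -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>xyz</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
<property>
  <name>phoenix.schema.mapSystemTablesToNamespace</name>
  <value>true</value>
</property>
<property>
  <name>phoenix.schema.isNamespaceMappingEnabled</name>
  <value>true</value>
</property>
```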