Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 25016 | 07-05-2016 01:19 PM |
| | 8406 | 04-01-2016 02:16 PM |
| | 2102 | 02-17-2016 11:54 AM |
| | 5639 | 02-17-2016 11:50 AM |
| | 12627 | 02-16-2016 02:08 AM |
12-16-2015
03:29 PM
1 Kudo
@Virendra Agarwal Have you considered using the XML SerDe? It handles line breaks natively, so it will probably make parsing your XML data easier. See this article from @Neeraj Sabharwal: https://community.hortonworks.com/articles/972/hive-and-xml-pasring.html
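A minimal sketch of what such a table definition looks like with the hivexmlserde library; the table name, columns, and XPath expressions are hypothetical placeholders for your own XML layout:

```sql
-- Hypothetical example: each <record>...</record> element becomes one row,
-- even when the element spans multiple lines in the source file.
CREATE TABLE xml_records (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.id"   = "/record/id/text()",
  "column.xpath.name" = "/record/name/text()"
)
STORED AS
  INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start" = "<record",
  "xmlinput.end"   = "</record>"
);
```

The `xmlinput.start`/`xmlinput.end` markers are what let the input format split records across line breaks.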
12-16-2015
02:48 AM
@sandeep agarwal See this post about Spark vs Tez: https://community.hortonworks.com/questions/5408/spark-vs-tez.html#comment-6248
12-16-2015
01:27 AM
1 Kudo
@wiljan van ravensteijn If you are running Hive with doAs=false, the metastore service user must have write permission on /app/hive/warehouse and on any new directories/tables you create. If you are using Sandbox 2.3.2, the easiest way is to define a Ranger policy granting access to the hive user; another option is to execute pyspark as the hive user.
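If Ranger is not an option on your sandbox, a rough command-line workaround is to open up the warehouse directory itself; the path below is the one from this thread, and the group/mode are assumptions you should adapt to your cluster:

```shell
# Run as the HDFS superuser; adjust the path to your warehouse location.
sudo -u hdfs hdfs dfs -chown -R hive:hadoop /app/hive/warehouse
sudo -u hdfs hdfs dfs -chmod -R 775 /app/hive/warehouse
```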
12-15-2015
02:56 AM
1 Kudo
Great article Wes!
12-15-2015
02:45 AM
1 Kudo
@Cui Lin If you have to perform huge scans and joins between big tables, you can also consider creating Hive tables with the HBase storage handler and running your SQL queries through Hive. See the examples below: https://community.hortonworks.com/questions/1558/bestoptimized-way-to-move-data-from-phoenix-to-hiv.html https://community.hortonworks.com/questions/1652/how-can-i-query-hbase-from-hive.html
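As a sketch of the storage-handler approach: the table, column family, and qualifier names below are hypothetical placeholders for an existing HBase table:

```sql
-- Hypothetical Hive table mapped onto an existing HBase table "events".
CREATE EXTERNAL TABLE hbase_events (
  rowkey  STRING,
  user_id STRING,
  amount  DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:user_id,cf:amount"
)
TBLPROPERTIES ("hbase.table.name" = "events");

-- Plain HiveQL (scans, joins, aggregations) then works against the HBase data:
SELECT user_id, SUM(amount) FROM hbase_events GROUP BY user_id;
```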
12-15-2015
02:06 AM
@Vitor Batista How difficult would it be for you to upgrade Spark? When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, like the steps described here: http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/ This is one of the advantages of running Spark on YARN instead of Spark standalone mode. Have you considered this option as well?
12-12-2015
01:33 AM
@Kit Menke If you want to access your table from Hive, you have two options: 1) create the table ahead of time and use df.write.format("orc"); 2) use Brandon's suggestion here: register the df as a temp table and do a CREATE TABLE AS SELECT from the temp table. See code examples here: https://community.hortonworks.com/questions/6023/orgapachesparksparkexception-task-failed-while-wri.html#answer-6048 If you use the saveAsTable function, it will create a table in the Hive metastore, but Hive won't be able to query it; only Spark can read a table written this way.
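A rough sketch of both options against the Spark 1.5-era API (HiveContext, registerTempTable); the table names and warehouse path are hypothetical placeholders:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-example")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Option 1: the table already exists in Hive; write ORC files into its
# location (placeholder path).
df.write.format("orc").save("/apps/hive/warehouse/my_table")

# Option 2: register a temp table and let Hive create the table itself,
# so both Hive and Spark can query it afterwards.
df.registerTempTable("temp_table")
sqlContext.sql("CREATE TABLE my_table2 STORED AS ORC AS SELECT * FROM temp_table")
```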
12-11-2015
11:45 PM
Awesome!
12-11-2015
10:36 PM
Additionally, if you want to change the number of partitions (and thus the parallelism) of an existing RDD, you can use rdd.repartition(8). See the comments and tests here:
https://community.hortonworks.com/questions/5825/best-way-to-select-distinct-values-from-multiple-c.html
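A quick sketch, assuming an existing SparkContext `sc` as in the pyspark shell:

```python
rdd = sc.parallelize(range(100))   # partition count comes from the default parallelism
rdd8 = rdd.repartition(8)          # full shuffle into exactly 8 partitions
print(rdd8.getNumPartitions())     # 8

# To *reduce* the partition count, coalesce avoids a full shuffle:
rdd2 = rdd8.coalesce(2)
```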
12-11-2015
06:24 PM
1 Kudo
@Nir Kumar See the examples in the two questions below, tested with the latest HDP 2.3.2 (HBase 1.1.1.2.3, Hadoop 2.7.1.2.3, Hive 1.2.1.2.3): https://community.hortonworks.com/questions/1558/bestoptimized-way-to-move-data-from-phoenix-to-hiv.html https://community.hortonworks.com/questions/1652/how-can-i-query-hbase-from-hive.html