Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 25016 | 07-05-2016 01:19 PM |
| | 8406 | 04-01-2016 02:16 PM |
| | 2102 | 02-17-2016 11:54 AM |
| | 5639 | 02-17-2016 11:50 AM |
| | 12627 | 02-16-2016 02:08 AM |
12-16-2015
03:29 PM
1 Kudo
@Virendra Agarwal Have you considered using the XML SerDe? It handles line breaks natively, so it will probably make parsing your XML data easier. See this article from @Neeraj Sabharwal: https://community.hortonworks.com/articles/972/hive-and-xml-pasring.html
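A minimal sketch of what such a table definition looks like with the hivexmlserde library; the table name, columns, and XPath expressions are hypothetical placeholders for your own XML layout:

```sql
-- Hypothetical example: each <record>...</record> element becomes one row,
-- even when the element spans multiple lines in the source file.
CREATE TABLE xml_records (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.id"   = "/record/id/text()",
  "column.xpath.name" = "/record/name/text()"
)
STORED AS
  INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start" = "<record",
  "xmlinput.end"   = "</record>"
);
```

The `xmlinput.start`/`xmlinput.end` markers are what let the input format split records across line breaks.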
12-16-2015
02:48 AM
@sandeep agarwal See this post about Spark vs Tez: https://community.hortonworks.com/questions/5408/spark-vs-tez.html#comment-6248
12-16-2015
01:27 AM
1 Kudo
@wiljan van ravensteijn If you are running Hive with doAs=false, the metastore service user must have write permission on /app/hive/warehouse and on any new directories/tables you create. If you are using Sandbox 2.3.2, the easiest way is to define a Ranger policy granting access to the hive user; another option is to execute pyspark as the hive user.
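If Ranger is not an option on your sandbox, a rough command-line workaround is to open up the warehouse directory itself; the path below is the one from this thread, and the group/mode are assumptions you should adapt to your cluster:

```shell
# Run as the HDFS superuser; adjust the path to your warehouse location.
sudo -u hdfs hdfs dfs -chown -R hive:hadoop /app/hive/warehouse
sudo -u hdfs hdfs dfs -chmod -R 775 /app/hive/warehouse
```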
12-15-2015
02:56 AM
1 Kudo
Great article Wes!
12-15-2015
02:45 AM
1 Kudo
@Cui Lin If you have to perform huge scans and joins between big tables, you can also consider creating Hive tables with the HBase storage handler and running your SQL queries through Hive. See the examples below: https://community.hortonworks.com/questions/1558/bestoptimized-way-to-move-data-from-phoenix-to-hiv.html https://community.hortonworks.com/questions/1652/how-can-i-query-hbase-from-hive.html
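As a sketch of the storage-handler approach: the table, column family, and qualifier names below are hypothetical placeholders for an existing HBase table:

```sql
-- Hypothetical Hive table mapped onto an existing HBase table "events".
CREATE EXTERNAL TABLE hbase_events (
  rowkey  STRING,
  user_id STRING,
  amount  DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:user_id,cf:amount"
)
TBLPROPERTIES ("hbase.table.name" = "events");

-- Plain HiveQL (scans, joins, aggregations) then works against the HBase data:
SELECT user_id, SUM(amount) FROM hbase_events GROUP BY user_id;
```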
12-15-2015
02:06 AM
@Vitor Batista How difficult would it be for you to upgrade Spark? When you run Spark on YARN (with Hortonworks), the upgrade process is really simple, like the steps described here: http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/ This is one of the advantages of running Spark on YARN instead of Spark standalone mode. Have you considered this option as well?
12-12-2015
01:33 AM
@Kit Menke If you want to access your table from Hive, you have two options: 1) create the table ahead of time and use df.write.format("orc"); 2) use Brandon's suggestion here: register the df as a temp table and do a CREATE TABLE AS SELECT from the temp table. See code examples here: https://community.hortonworks.com/questions/6023/orgapachesparksparkexception-task-failed-while-wri.html#answer-6048 If you use the saveAsTable function, it will create a table in the Hive metastore, but Hive won't be able to query it; only Spark can read a table written this way.
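A rough sketch of both options against the Spark 1.5-era API (HiveContext, registerTempTable); the table names and warehouse path are hypothetical placeholders:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-example")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Option 1: the table already exists in Hive; write ORC files into its
# location (placeholder path).
df.write.format("orc").save("/apps/hive/warehouse/my_table")

# Option 2: register a temp table and let Hive create the table itself,
# so both Hive and Spark can query it afterwards.
df.registerTempTable("temp_table")
sqlContext.sql("CREATE TABLE my_table2 STORED AS ORC AS SELECT * FROM temp_table")
```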
12-11-2015
11:45 PM
Awesome!
12-11-2015
10:36 PM
Additionally, if you want to change the number of partitions (and thus the parallelism) of an existing RDD, you can use rdd.repartition(8). See the comments and tests here:
https://community.hortonworks.com/questions/5825/best-way-to-select-distinct-values-from-multiple-c.html
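A quick sketch, assuming an existing SparkContext `sc` as in the pyspark shell:

```python
rdd = sc.parallelize(range(100))   # partition count comes from the default parallelism
rdd8 = rdd.repartition(8)          # full shuffle into exactly 8 partitions
print(rdd8.getNumPartitions())     # 8

# To *reduce* the partition count, coalesce avoids a full shuffle:
rdd2 = rdd8.coalesce(2)
```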
12-11-2015
06:24 PM
1 Kudo
@Nir Kumar See the examples in the two questions below, tested with the latest HDP 2.3.2 (HBase 1.1.1.2.3, Hadoop 2.7.1.2.3, Hive 1.2.1.2.3): https://community.hortonworks.com/questions/1558/bestoptimized-way-to-move-data-from-phoenix-to-hiv.html https://community.hortonworks.com/questions/1652/how-can-i-query-hbase-from-hive.html