Member since
09-19-2016
23
Posts
12
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3097 | 12-12-2017 11:23 AM
 | 5272 | 11-07-2016 07:41 PM
01-05-2020
05:39 AM
Hi, if too many files accumulate in the Spark History Server's log directory, the history server may stop working as expected. In that case, you may need to write a script that deletes files older than 7 days (or whatever retention fits your requirement) from the Spark History Server location. Thanks, Arun
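A minimal sketch of such a cleanup script, assuming the event logs live in a local directory (for an HDFS-backed log directory you would use `hdfs dfs` commands instead). The demo below runs against a temporary directory so it is safe to try as-is:

```shell
# Sketch: purge Spark event logs older than 7 days.
# HISTORY_DIR is a stand-in for your spark.history.fs.logDirectory.
HISTORY_DIR="$(mktemp -d)"
touch "$HISTORY_DIR/app-recent"                 # fresh log: should survive
touch -d '10 days ago' "$HISTORY_DIR/app-old"   # stale log: should be removed

# The actual cleanup line: delete regular files older than 7 days.
find "$HISTORY_DIR" -maxdepth 1 -type f -mtime +7 -delete

ls "$HISTORY_DIR"   # prints only: app-recent
```

Adjust `-mtime +7` to your retention requirement and schedule the script with cron. Note that `touch -d` is GNU coreutils syntax, used here only to fabricate an old file for the demo.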
... View more
02-15-2019
02:45 PM
Hi @Raj Zalavadia
As described in SPARK-16996 and SPARK-15348, Spark currently doesn't support Hive ACID tables (v1 on Hive 1.x or v2 on Hive 3.x).
To work around that, you can use the Hive Warehouse Connector.
It creates the necessary link between the two components by getting Spark to connect through HiveServer2. I'm not sure whether it's bundled directly into HDI (it should be). In any case, it's available publicly at:
https://github.com/hortonworks/hive-warehouse-connector-release/tree/HDP-3.0.1.10-7-tag You'll find the documentation here :
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
Here's another HCC article that gives a concrete usage example: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html To get you started, here's a quick walkthrough:
1. The Hive Warehouse Connector must be given as a dependency to Spark:
spark.jars=[path to the Hive warehouse connector]
usually : /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-[Build version].jar
2. It also requires a few more configuration items, basically describing where the Hive Metastore and HiveServer2 instances reside:
spark.datasource.hive.warehouse.metastoreUri=thrift://[YOUR METASTORE URI]:9083
spark.hadoop.hive.llap.daemon.service.hosts=@llap0
spark.hadoop.hive.zookeeper.quorum=[YOUR HIVE ZOOKEEPER QUORUM]:2181
spark.sql.hive.hiveserver2.jdbc.url=[YOUR HIVE LLAP JDBC URL]
These can be passed as Spark conf items:
--conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0
or as extra configuration parameters for Spark notebooks (e.g. Zeppelin).
3. Create a HiveWarehouse session

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
// set a database
hive.setDatabase("airline_ontime")
// show tables
hive.showTables().show(100)

4. Query data

val flights_df = hive.executeQuery("SELECT * FROM flights WHERE year = 1989")
flights_df.createOrReplaceTempView("flights_1989")
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+
|month|dayofmonth|dayofweek|deptime|crsdeptime|arrtime|crsarrtime|uniquecarrier|flightnum|tailnum|actualelapsedtime|crselapsedtime|airtime|arrdelay|depdelay|origin|dest|distance|taxiin|taxiout|cancelled|cancellationcode|diverted|carrierdelay|weatherdelay|nasdelay|securitydelay|lateaircraftdelay|year|
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+
| 12| 25| 1| 1415| 1415| 1547| 1552| US| 1478| NA| 92| 97| null| -5| 0| TPA| CLT| 508| null| null| 0| NA| 0| null| null| null| null| null|1989|
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+

5. Write data back to Hive (in ACID format)

hive.table("flights").filter("month = 01")
  .write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "flight_2019_01")
  .save()
... View more
11-07-2018
07:07 PM
Correct @Matthieu Lamairesse. Druid is case sensitive while Hive is not, so to make this work you need to make sure that all the column names are lowercase.
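For illustration, a sketch of what that looks like when creating a Druid-backed table from Hive (the table and column names below are hypothetical, not from this thread): alias any mixed-case source columns to lowercase names in the CTAS.

```sql
-- Hypothetical example: every column the Druid-backed table ends up with
-- is lowercase, so Hive's case-insensitive names match Druid's case-sensitive ones.
CREATE TABLE druid_pageviews
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY")
AS
SELECT
  CAST(`eventTime` AS timestamp) AS `__time`,  -- Druid's required time column
  `pageName`  AS page_name,                    -- camelCase source aliased to lowercase
  `viewCount` AS view_count
FROM pageviews_src;
```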
... View more
09-26-2018
08:14 AM
2 Kudos
@Srikanth t
The easiest approach is to use lateral views.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
It allows you to split an array into multiple lines.
1. Let's create an array from the items in your column "items":

select key, split(items, ',') as valArray
from test

Result:
+------+---------------------------------------+--+
| key | _c1 |
+------+---------------------------------------+--+
| 22 | ["1001 abc"," 1002 pqr"," 1003 tuv"] |
| 33 | ["1004 def"," 1005 xyz"] |
+------+---------------------------------------+--+
2. Now let's use a lateral view to split these items into lines (using "trim" to clean up the spaces):

select key, trim(uniqueVal)
from (
  select key, split(items, ',') as valArray
  from test
) a lateral view explode(a.valArray) exploded as uniqueVal ;
+------+-----------+--+
| key | _c1 |
+------+-----------+--+
| 22 | 1001 abc |
| 22 | 1002 pqr |
| 22 | 1003 tuv |
| 33 | 1004 def |
| 33 | 1005 xyz |
+------+-----------+--+
3. Finally, let's use split again to get the separate values:

select key, split(trim(uniqueVal), ' ')[0], split(trim(uniqueVal), ' ')[1]
from (
  select key, split(items, ',') as valArray
  from test
) a lateral view explode(a.valArray) exploded as uniqueVal ;
+------+-------+------+--+
| key | _c1 | _c2 |
+------+-------+------+--+
| 22 | 1001 | abc |
| 22 | 1002 | pqr |
| 22 | 1003 | tuv |
| 33 | 1004 | def |
| 33 | 1005 | xyz |
+------+-------+------+--+

Note: I used the following to create the table (the second column is named "items" to match the queries above):

create table test (
  key string,
  items string )
STORED AS ORC ;

INSERT INTO test (key, items)
VALUES (22, '1001 abc, 1002 pqr, 1003 tuv'),
       (33, '1004 def, 1005 xyz');
... View more
12-12-2017
11:23 AM
2 Kudos
Hi @Fernando Lopez Bello
Sharing of interpreter processes is easily adjustable. Go to the interpreter settings page and scroll down to the Spark interpreter.
By default, interpreters are shared globally, i.e. all notes/users share the same interpreter instance (and hence the same Spark context). Change the setting to either "per note" or "per user" depending on your use case:
- Per Note: each note instantiates a separate interpreter process.
- Per User: each user instantiates a separate interpreter process (which is shared amongst the notes he/she owns).
Below is an article written by one of the original developers of Zeppelin describing the interpreter modes: https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
Zeppelin documentation: https://zeppelin.apache.org/docs/latest/manual/interpreters.html#interpreter-binding-mode
... View more
10-03-2017
04:11 AM
Thanks @Matthieu Lamairesse
... View more
09-07-2017
09:58 AM
1 Kudo
It's better to declare every field as VARCHAR and then use functions to convert them to numbers [1] for mathematical operations. [1] https://phoenix.apache.org/language/functions.html#to_number
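For illustration, a minimal sketch of that pattern in Phoenix (the table and column names here are hypothetical): declare the fields as VARCHAR and apply TO_NUMBER only where the math happens.

```sql
-- Hypothetical table with numeric data stored as VARCHAR
CREATE TABLE metrics (
  id VARCHAR NOT NULL PRIMARY KEY,
  amount VARCHAR,
  quantity VARCHAR
);

-- Convert to numbers at query time for mathematical operations
SELECT id, TO_NUMBER(amount) * TO_NUMBER(quantity) AS total
FROM metrics;
```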
... View more
08-14-2017
10:55 AM
Correction for the syntax using beeline. Formatted data for reference:
{
"user": {
"location": "",
"id": 1514008171,
"name": "Auzzie Jet",
"screenname": "metalheadgrunge",
"geoenabled": false
},
"tweetmessage": "Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm",
"createddate": "2013-06-20T12:08:44",
"geolocation": null
}
From here on I refer to:
`user`, `tweetmessage`, `createddate`, `geolocation` => Level 1 fields
`location`, `id`, `name`, `screenname`, `geoenabled` => Level 2 fields

Corrected select query:

SELECT t2.name, t1.tweetmessage
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1 as `user`, `tweetmessage`
LATERAL VIEW json_tuple(t1.`user`, 'name', 'location') t2 as `name`, `location`
where t2.`name`="Auzzie Jet"
;

Other examples. Select level 1 fields (`user`, `tweetmessage`, `createddate`, `geolocation`):

SELECT t1.`user`, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1
as `user`, `tweetmessage`, `createddate`, `geolocation`
;

Select level 1 and 2 => flatten everything:

SELECT t2.location, t2.id, t2.name, t2.screenname, t2.geoenabled, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1
as `user`, `tweetmessage`, `createddate`, `geolocation`
LATERAL VIEW json_tuple(t1.`user`, 'location', 'id', 'name', 'screenname', 'geoenabled' ) t2
as `location`, `id`, `name`, `screenname`, `geoenabled`
;
... View more