Member since: 09-19-2016
Posts: 23
Kudos Received: 12
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3082 | 12-12-2017 11:23 AM
 | 5207 | 11-07-2016 07:41 PM
01-05-2020
05:39 AM
Hi, if there are a large number of files in the Spark History Server log directory, the FS history provider will not work as expected. In that case, you may need to write a script that deletes files older than 7 days (or whatever retention you need) from the Spark History Server location. Thanks, Arun
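A minimal sketch of such a cleanup, for example run from spark-shell using the Hadoop FileSystem API; it assumes the event logs live under hdfs:///spark2-history (check spark.history.fs.logDirectory on your cluster) and a 7-day retention, so adjust both for your environment. Recent Spark versions can also handle this automatically via spark.history.fs.cleaner.enabled and spark.history.fs.cleaner.maxAge.

import java.util.concurrent.TimeUnit
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumption: event logs are written to hdfs:///spark2-history
// (the value of spark.history.fs.logDirectory on this cluster).
val historyDir = new Path("hdfs:///spark2-history")
val cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)

val fs: FileSystem = historyDir.getFileSystem(new Configuration())

// Delete every event log whose last modification time is older than the cutoff.
fs.listStatus(historyDir)
  .filter(_.getModificationTime < cutoff)
  .foreach { status =>
    println(s"Deleting ${status.getPath}")
    fs.delete(status.getPath, true) // recursive = true also handles .inprogress directories
  }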
02-15-2019
02:45 PM
Hi @Raj Zalavadia
As described in SPARK-16996 and SPARK-15348, Spark currently doesn't support Hive ACID tables (v1 on Hive 1.x or v2 on Hive 3.x).
To work around that, you can use the Hive Warehouse Connector.
It creates the necessary link between the two components by having Spark connect through HiveServer2. I'm not sure if it's directly bundled into HDI (it should be). In any case, it's available publicly at:
https://github.com/hortonworks/hive-warehouse-connector-release/tree/HDP-3.0.1.10-7-tag You'll find the documentation here :
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
Here's another HCC article that gives you a concrete example of how to use it: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html
To get you started, here's a quick example of how to use it:
1. The Hive Warehouse Connector must be given to Spark as a dependency:
spark.jars=[path to the Hive warehouse connector]
usually : /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-[Build version].jar
2. It also requires a few more configuration properties, basically describing where the Hive Metastore and HiveServer2 instances reside:
spark.datasource.hive.warehouse.metastoreUri=thrift://[YOUR METASTORE URI]:9083
spark.hadoop.hive.llap.daemon.service.hosts=@llap0
spark.hadoop.hive.zookeeper.quorum=[YOUR HIVE ZOOKEEPER QUORUM]:2181
spark.sql.hive.hiveserver2.jdbc.url=[YOUR HIVE LLAP JDBC URL]
These can be passed as Spark conf items:
--conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0
or as extra configuration parameters for Spark notebooks (e.g. Zeppelin).
3. Create a HiveWarehouse session
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
//set a database
hive.setDatabase("airline_ontime")
//show table
hive.showTables().show(100)
4. Query data
val flights_df = hive.executeQuery("SELECT * FROM flights WHERE year = 1989")
flights_df.createOrReplaceTempView("flights_1998")
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+
|month|dayofmonth|dayofweek|deptime|crsdeptime|arrtime|crsarrtime|uniquecarrier|flightnum|tailnum|actualelapsedtime|crselapsedtime|airtime|arrdelay|depdelay|origin|dest|distance|taxiin|taxiout|cancelled|cancellationcode|diverted|carrierdelay|weatherdelay|nasdelay|securitydelay|lateaircraftdelay|year|
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+
| 12| 25| 1| 1415| 1415| 1547| 1552| US| 1478| NA| 92| 97| null| -5| 0| TPA| CLT| 508| null| null| 0| NA| 0| null| null| null| null| null|1989|
5. Write data back to Hive (in ACID format)
hive.table("flights").filter("month = 01")
.write
.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
.option("table", "flight_2019_01")
.save()
11-07-2018
07:07 PM
Correct @Matthieu Lamairesse. Druid is case sensitive while Hive is not, so to make it work you need to make sure that all the column names are in lowercase.
09-26-2018
08:14 AM
2 Kudos
@Srikanth t
The easiest approach is to use lateral views.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
It allows you to split an array into multiple lines.
1. Let's create an array from the items in your column "items":
select key, split(items, ',') as valArray
from test
Result:
+------+---------------------------------------+--+
| key | _c1 |
+------+---------------------------------------+--+
| 22 | ["1001 abc"," 1002 pqr"," 1003 tuv"] |
| 33 | ["1004 def"," 1005 xyz"] |
+------+---------------------------------------+--+
2. Now let's use a lateral view to split these items into lines (using "trim" to clean up the spaces):
select key, trim(uniqueVal)
from (
select key, split(items, ',') as valArray
from test
) a lateral view explode(a.valArray) exploded as uniqueVal ;
+------+-----------+--+
| key | _c1 |
+------+-----------+--+
| 22 | 1001 abc |
| 22 | 1002 pqr |
| 22 | 1003 tuv |
| 33 | 1004 def |
| 33 | 1005 xyz |
+------+-----------+--+
3. Finally, let's use split again to get the separate values.
select key, split(trim(uniqueVal), ' ')[0], split(trim(uniqueVal), ' ')[1]
from(
select key, split(items, ',') as valArray
from test
) a lateral view explode(a.valArray) exploded as uniqueVal ;
+------+-------+------+--+
| key | _c1 | _c2 |
+------+-------+------+--+
| 22 | 1001 | abc |
| 22 | 1002 | pqr |
| 22 | 1003 | tuv |
| 33 | 1004 | def |
| 33 | 1005 | xyz |
+------+-------+------+--+ Note : I used the following to create the table create table test (
key string,
items string )
STORED AS ORC ;
INSERT INTO test (key, items )
VALUES (22, '1001 abc, 1002 pqr, 1003 tuv'),
(33, '1004 def, 1005 xyz');
12-12-2017
11:23 AM
2 Kudos
Hi @Fernando Lopez Bello
Sharing of interpreter processes is easily adjustable. Go to the interpreter settings page and scroll down to the Spark interpreter. By default, interpreters are shared globally, i.e. all notes/users share the same interpreter instance (and hence the same Spark context). Change the setting to either "per note" or "per user" depending on your use case:
- Per Note: each note instantiates a separate interpreter process.
- Per User: each user instantiates a separate interpreter process (which is shared amongst the notes he/she owns).
Below is an article written by one of the original developers of Zeppelin describing the interpreter modes: https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
Zeppelin documentation: https://zeppelin.apache.org/docs/latest/manual/interpreters.html#interpreter-binding-mode
10-03-2017
04:11 AM
Thanks @Matthieu Lamairesse
09-07-2017
09:58 AM
1 Kudo
It's better to declare every field as VARCHAR and then use functions to convert them to numbers [1] for mathematical operations. [1] https://phoenix.apache.org/language/functions.html#to_number
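As a minimal sketch of the idea, here is how it could look from Scala over the Phoenix JDBC driver. The "orders" table and its VARCHAR columns price and quantity are hypothetical, and the ZooKeeper quorum in the JDBC URL must be adjusted to your cluster.

import java.sql.DriverManager

// Hypothetical example: the "orders" table and its VARCHAR columns price and
// quantity are placeholders; adjust the ZooKeeper quorum in the JDBC URL.
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181:/hbase")
val stmt = conn.createStatement()

// TO_NUMBER converts the VARCHAR values so the multiplication works in SQL.
val rs = stmt.executeQuery(
  "SELECT id, TO_NUMBER(price) * TO_NUMBER(quantity) AS total FROM orders")
while (rs.next()) {
  println(s"${rs.getString(1)} -> ${rs.getBigDecimal(2)}")
}
rs.close(); stmt.close(); conn.close()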
08-14-2017
10:55 AM
Correction for the syntax using beeline. Formatted data for reference:
{
"user": {
"location": "",
"id": 1514008171,
"name": "Auzzie Jet",
"screenname": "metalheadgrunge",
"geoenabled": false
},
"tweetmessage": "Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm",
"createddate": "2013-06-20T12:08:44",
"geolocation": null
}
From here on I refer to:
`user`, `tweetmessage`, `createddate`, `geolocation` => Level 1 fields
`location`, `id`, `name`, `screenname`, `geoenabled` => Level 2 fields
Corrected select query:
SELECT t2.name, t1.tweetmessage
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1 as `user`, `tweetmessage`
LATERAL VIEW json_tuple(t1.`user`, 'name', 'location') t2 as `name`, `location`
where t2.`name`="Auzzie Jet"
;
Other examples:
Select level 1: `user`, `tweetmessage`, `createddate`, `geolocation`
SELECT t1.`user`, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1
as `user`, `tweetmessage`, `createddate`, `geolocation`
;
Select level 1 and 2 => Flatten everything
SELECT t2.location, t2.id, t2.name, t2.screenname, t2.geoenabled, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1
as `user`, `tweetmessage`, `createddate`, `geolocation`
LATERAL VIEW json_tuple(t1.`user`, 'location', 'id', 'name', 'screenname', 'geoenabled' ) t2
as `location`, `id`, `name`, `screenname`, `geoenabled`
;