Member since 
    
	
		
		
		09-19-2016
	
	
	
	
	
	
	
	
	
	
	
	
	
	
			
      
                23
            
            
                Posts
            
        
                12
            
            
                Kudos Received
            
        
                2
            
            
                Solutions
            
        My Accepted Solutions
| Title | Views | Posted | 
|---|---|---|
| 3556 | 12-12-2017 11:23 AM | |
| 6371 | 11-07-2016 07:41 PM | 
			
    
	
		
		
		01-05-2020
	
		
		05:39 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							Hi, if there are more no of files are present in spark history Server, then FS would not be working as expected. In that case, We may need to write a script to delete the old files that are more then 7 days ( or as per your requirement) from the Spark history server location . Thanks Arun
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		02-15-2019
	
		
		02:45 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 
 Hi @Raj Zalavadia  
 As described in SPARK-16996 and SPARK-15348, Spark currently doesn't support Hive ACID ( v1 (Hive 1.XX) or v2 (3.XX) )   
 To circumvent that you can use the Hive Warewhouse connector.  
 It will create the necessary link between the 2 components, by getting Spark to connect via Hive Server2.    I'm not sure if it's directly bundled into HDI (should be).   In any case, it's available publicly at :  
 https://github.com/hortonworks/hive-warehouse-connector-release/tree/HDP-3.0.1.10-7-tag   You'll find the documentation here :   
 https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html  
 Here's another HCC article that gives you a concrete example on how to use it.  :   https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html   To get you started, here's a quick example of how to use it :   
1. The Hive Warehouse connector must be given as a dependency to spark   
 spark.jars=[path to the Hive warehouse connector]
usually : /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-[Build version].jar 
  
 2. It also requiers a few more configuration - basically describing where the Hive meta store and Hive Server2 instances reside   
 spark.datasource.hive.warehouse.metastoreUri=thrift://[YOUR METASTORE URI]:9083
spark.hadoop.hive.llap.daemon.service.hosts=@llap0 
spark.hadoop.hive.zookeeper.quorum=[YOUR HIVE ZOOKEEPER QUORUM]:2181
spark.sql.hive.hiveserver2t.jdbc.url=[YOUR HIVE LLAP JDBC URL] 
  
 These can be passed as spark conf items  
 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0
  or as extra configuration parameters for spark notebooks (ex : zeppelin)       
 3. Create a hiveWarhouse context   import com.hortonworks.hwc.HiveWarehouseSession 
import com.hortonworks.hwc.HiveWarehouseSession._ 
val hive = HiveWarehouseSession.session(spark).build() 
//set a database 
hive.setDatabase("airline_ontime") 
//show table 
hive.showTables().show(100)  4. Query Data   val flights_df = hive.executeQuery("SELECT * FROM flights WHERE year = 1989") 
flights_df.createOrReplaceTempView("flights_1998")
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+
|month|dayofmonth|dayofweek|deptime|crsdeptime|arrtime|crsarrtime|uniquecarrier|flightnum|tailnum|actualelapsedtime|crselapsedtime|airtime|arrdelay|depdelay|origin|dest|distance|taxiin|taxiout|cancelled|cancellationcode|diverted|carrierdelay|weatherdelay|nasdelay|securitydelay|lateaircraftdelay|year|
+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+----+
|   12|        25|        1|   1415|      1415|   1547|      1552|           US|     1478|     NA|               92|            97|   null|      -5|       0|   TPA| CLT|     508|  null|   null|        0|              NA|       0|        null|        null|    null|         null|             null|1989|  5. Write data Back to Hive ( in ACID Format )   hive.table("flights").filter("month = 01")
  .write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "flight_2019_01")
  .save() 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		11-07-2018
	
		
		07:07 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 Correct @Matthieu Lamairesse Druid is case sensitive while Hive is not, thus, to make it work you need to make sure that all the columns are in lowercase format. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		09-26-2018
	
		
		08:14 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		2 Kudos
		
	
				
		
	
		
					
							 @Srikanth t  
	The easiest approach is to use lateral views.   
	https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView  
	It allows you to split an array into multiple line.  
	1. Let's create an array from the items in your column "items"  select key, split(items, ',') as valArray
	from test
result 
+------+---------------------------------------+--+
| key  |                  _c1                  |
+------+---------------------------------------+--+
| 22   | ["1001 abc"," 1002 pqr"," 1003 tuv"]  |
| 33   | ["1004 def"," 1005 xyz"]              |
+------+---------------------------------------+--+
  2. Now let's use lateral view to split these items into lines (using "trim" to clean up the space)  select key, trim(uniqueVal)
from(
 select key, split(items, ',') as valArray
 from test ) a lateral view explode(a.valArray) exploded as uniqueVal ;
+------+-----------+--+
| key  |    _c1    |
+------+-----------+--+
| 22   | 1001 abc  |
| 22   | 1002 pqr  |
| 22   | 1003 tuv  |
| 33   | 1004 def  |
| 33   | 1005 xyz  |
+------+-----------+--+
  3.  Finally let's use split again to get separate values.   select key, split(trim(uniqueVal), ' ')[0], split(trim(uniqueVal), ' ')[1]
from(
	select key, split(items, ',') as valArray
	from test 
	) a lateral view explode(a.valArray) exploded as uniqueVal ;
+------+-------+------+--+
| key  |  _c1  | _c2  |
+------+-------+------+--+
| 22   | 1001  | abc  |
| 22   | 1002  | pqr  |
| 22   | 1003  | tuv  |
| 33   | 1004  | def  |
| 33   | 1005  | xyz  |
+------+-------+------+--+   Note :   I used the following to create the table   create table test (
key string, 
value string )
STORED AS ORC ;
INSERT INTO test (key, value )
VALUES (22, '1001 abc, 1002 pqr, 1003 tuv'),
(33, '1004 def, 1005 xyz');
 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		12-12-2017
	
		
		11:23 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		2 Kudos
		
	
				
		
	
		
					
							 Hi @Fernando Lopez Bello  Sharing of interpreter processes is easily adjustable  Go to the interpreter setting page :      And scroll down to the spark interpreter :       By default interpreters are shared globaly : ie - all notes/users share the same interpreter instance (hence the same spark context)  Change the setting to either "per note" or "per user" depending on your use case :   - Per Note : ie - each note will instantiate a separate interpreter process   - Per User : ie - each user instantiates a seperate interpreter process (which is shared amongst the notes for which he/she has ownership)   Below an article written by one of the original developpers of zeppelin describing interpreter modes :   https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555  Zeppelin documentation:   https://zeppelin.apache.org/docs/latest/manual/interpreters.html#interpreter-binding-mode   
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		10-03-2017
	
		
		04:11 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 Thanks @Matthieu Lamairesse  
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		09-07-2017
	
		
		09:58 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		1 Kudo
		
	
				
		
	
		
					
							 It's better you declare every field as VARCHAR and then use functions to convert them to numbers[1] for mathematical operations.  [1] https://phoenix.apache.org/language/functions.html#to_number 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		08-14-2017
	
		
		10:55 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 Correction for syntax using beeline :   Formated data for referecence   {
  "user": {
    "location": "",
    "id": 1514008171,
    "name": "Auzzie Jet",
    "screenname": "metalheadgrunge",
    "geoenabled": false
  },
  "tweetmessage": "Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm",
  "createddate": "2013-06-20T12:08:44",
  "geolocation": null
}
<br>  From here on I refer to   
 `user`, `tweetmessage`, `createddate`, `geolocation`   => Level 1 fields   `location`, `id`, `name`, `screenname`, `geoenabled`   => Level 2 fields   Corrected select query :  SELECT t2.name, t1.tweetmessage 
FROM tweets t 
	LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1 as `user`, `tweetmessage` 
	LATERAL VIEW json_tuple(t1.`user`, 'name', 'location') t2 as `name`, `location` 
where t2.`name`="Auzzie Jet"
;  Other examples :   Select level 1 : `user`, `tweetmessage`, `createddate`, `geolocation`  SELECT t1.`user`, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t 
	LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1 
	as `user`, `tweetmessage`, `createddate`, `geolocation`
;  Select level 1 and 2 => Flatten everything  SELECT t2.location, t2.id, t2.name, t2.screenname, t2.geoenabled, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t 
	LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1 
	as `user`, `tweetmessage`, `createddate`, `geolocation`
	LATERAL VIEW json_tuple(t1.`user`, 'location', 'id', 'name', 'screenname', 'geoenabled' ) t2 
	as `location`, `id`, `name`, `screenname`, `geoenabled` 
;
 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		 
        













