Member since 02-25-2016
72 Posts
34 Kudos Received
5 Solutions
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 4108 | 07-28-2017 10:51 AM |
|  | 3431 | 05-08-2017 03:11 PM |
|  | 1521 | 04-03-2017 07:38 PM |
|  | 3662 | 03-21-2017 06:56 PM |
|  | 1615 | 02-09-2017 08:28 PM |
08-03-2020 07:46 AM

@ManuN As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question.
11-03-2017 04:34 AM

@Viswa Converting a regular unix timestamp field to a human-readable value without the "T" in it is a lot simpler, as you can use the conversion below:

pyspark
>>> hiveContext.sql("select from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS')").show(truncate=False)
+-----------------------+
|_c0                    |
+-----------------------+
|2017-11-02 21:35:16.000|
+-----------------------+

pyspark
>>> hiveContext.sql("select from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SSS')")

But you are expecting the format yyyy-MM-ddTHH:mm:ss. For this case you need to concat the date and the time with the letter "T":

pyspark
>>> hiveContext.sql("""select concat(concat(substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12))""").show(truncate=False)
+-----------------------+
|_c0                    |
+-----------------------+
|2017-11-02T21:35:16.00 |
+-----------------------+

Your query (replace <unix-timestamp-column-name> with your column name):

pyspark
>>> hiveContext.sql("""select concat(concat(substr(cast(from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),
substr(cast(from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12))""").show(truncate=False)

In case you want to test it in Hive, use the query below:

hive# select concat(concat(substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS') as string),1,10),'T'),
substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS') as string),12));
+--------------------------+--+
|           _c0            |
+--------------------------+--+
| 2017-11-02T21:35:16.00   |
+--------------------------+--+

Hope this will help to resolve your issue!
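A possibly simpler variant worth noting: from_unixtime takes a SimpleDateFormat-style pattern (in Hive and in Spark's HiveContext), and such patterns allow literal characters to be quoted, so the T can be embedded directly instead of concatenating substrings. A minimal sketch, assuming a Spark 1.6-style HiveContext:

pyspark
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='timestamp_format_sketch')
hiveContext = HiveContext(sc)

# Quote the literal T inside the pattern; no concat/substr needed.
hiveContext.sql(
    """select from_unixtime(cast(1509672916 as bigint), "yyyy-MM-dd'T'HH:mm:ss.SSS")"""
).show(truncate=False)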
10-25-2017 04:19 PM
1 Kudo

Let's think of the basics. The RDD being saved is distributed across machines, so if all of its tasks started writing to the same file in HDFS, they could only append, and the writes would contend for a huge number of locks because multiple clients would be writing at the same time. It's a classic case of distributed concurrent clients trying to write to a single file (imagine multiple threads writing to the same log file). That's the reason a directory is created and each individual task writes its own file. Collectively, all the files present in your output directory are the output of your job.

Solutions (a sketch of option 1 follows below):
1. rdd.coalesce(1).saveAsTextFile('/path/outputdir'), and then in your driver use hdfs mv to move part-00000 to filename.txt.
2. Assuming the data is small (as you want to write to a single file), perform rdd.collect() and write to HDFS from the driver by getting an HDFS handle.
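A minimal sketch of option 1, assuming a pyspark driver on a host where the hdfs CLI is on the PATH; the paths and file name are placeholders:

pyspark
from pyspark import SparkContext
import subprocess

sc = SparkContext(appName='single_file_output')
rdd = sc.parallelize(['line1', 'line2', 'line3'])  # stand-in for your real RDD

# Coalesce to one partition so Spark writes a single part file into the directory.
rdd.coalesce(1).saveAsTextFile('/path/outputdir')

# Then, from the driver, rename the lone part file to the desired name.
subprocess.check_call(['hdfs', 'dfs', '-mv',
                       '/path/outputdir/part-00000',
                       '/path/outputdir/filename.txt'])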
10-20-2017 06:34 AM

Try this code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf1 = SparkConf().setAppName('sort_desc')
sc1 = SparkContext(conf=conf1)
sql_context = SQLContext(sc1)

csv_file_path = 'emp.csv'

# Split each CSV line into a [dept, ctc] record.
employee_rdd = sc1.textFile(csv_file_path).map(lambda line: line.split(','))
print(type(employee_rdd))

# sortByKey orders the records by their first element (dept), descending.
employee_rdd_sorted = employee_rdd.sortByKey(ascending=False)

employee_df = employee_rdd.toDF(['dept', 'ctc'])
employee_df_sorted = employee_rdd_sorted.toDF(['dept', 'ctc'])
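If the goal is to order by the ctc column rather than by dept, a DataFrame-level sort is an alternative. A sketch, assuming the employee_df built above and that ctc holds numeric values stored as strings:

pyspark
from pyspark.sql.functions import col

# Cast ctc to a numeric type so the ordering is numeric rather than lexicographic.
employee_df_sorted_by_ctc = employee_df.orderBy(col('ctc').cast('double').desc())
employee_df_sorted_by_ctc.show()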
10-13-2017 07:22 PM

@Dinesh Chitlangia Thank you for the explanation. In that case I would rather use reduceByKey() to get the number of occurrences. Thanks for the info on countByValue().
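For reference, a minimal sketch of counting occurrences with reduceByKey(); the words RDD below is illustrative:

pyspark
from pyspark import SparkContext

sc = SparkContext(appName='count_occurrences')
words = sc.parallelize(['a', 'b', 'a', 'c', 'b', 'a'])

# Pair each element with 1 and sum per key; the result stays distributed,
# unlike countByValue(), which returns a dict to the driver.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
print(counts.collect())  # e.g. [('a', 3), ('b', 2), ('c', 1)]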
10-10-2017 05:47 PM
3 Kudos

Spark 1.6.3 does not support this. https://spark.apache.org/docs/1.6.3/sql-programming-guide.html#creating-dataframes
08-30-2017 07:07 PM
5 Kudos

@Viswa Here are the two major aspects on which they differ:

1. Connection:
- The Hive CLI connects directly to HDFS and the Hive Metastore, and can be used only on a host with access to those services.
- Beeline connects to HiveServer2 and requires access to only one .jar file: hive-jdbc-<version>-standalone.jar.

2. Authorization:
- The Hive CLI uses only storage-based authorization.
- Beeline can use SQL standard-based authorization or Ranger-based authorization, which gives greater security.

For the above reasons it is better to use Beeline than the Hive CLI (I believe the Hive CLI will soon be deprecated). Read here for a deeper understanding of Beeline: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/beeline-vs-hive-cli.html
07-28-2017 12:55 PM

One point: if you specify a delimiter that is not the true delimiter in the file, no error will be thrown. Rather, the full record (including its true delimiters) will be treated as a single field. In this case, the true delimiters are just characters within a string.
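As an illustration of this behavior in pyspark (the sample data and the wrong delimiter below are made up):

pyspark
from pyspark import SparkContext

sc = SparkContext(appName='wrong_delimiter_demo')
rdd = sc.parallelize(['john,doe,42', 'jane,roe,37'])  # comma-delimited records

# Split on '|', which is not the true delimiter.
wrong_split = rdd.map(lambda line: line.split('|'))
print(wrong_split.collect())
# [['john,doe,42'], ['jane,roe,37']] -- each record becomes a single field; no error is thrown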