Member since 02-25-2016
72 Posts
34 Kudos Received
5 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3684 | 07-28-2017 10:51 AM |
| | 2868 | 05-08-2017 03:11 PM |
| | 1204 | 04-03-2017 07:38 PM |
| | 2937 | 03-21-2017 06:56 PM |
| | 1207 | 02-09-2017 08:28 PM |
08-03-2020
07:46 AM
@ManuN
As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question.
11-03-2017
04:34 AM
@Viswa Converting a regular Unix timestamp field to a human-readable format without the T in it is much simpler; you can use the conversion below.
pyspark
>>> hiveContext.sql("select from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS')").show(truncate=False)
+-----------------------+
|_c0 |
+-----------------------+
|2017-11-02 21:35:16.000|
+-----------------------+
pyspark
>>> hiveContext.sql("select from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SSS')")
But you are expecting the format yyyy-MM-ddThh:mm:ss. For this case you need to concatenate the date and the time with the letter T between them.
pyspark
>>> hiveContext.sql("""select concat(concat(substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12))""").show(truncate=False)
+-----------------------+
|_c0 |
+-----------------------+
|2017-11-02T21:35:16.00|
+-----------------------+
Your query:
pyspark
>>> hiveContext.sql("""select concat(concat(substr(cast(from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),
substr(cast(from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12))""").show(truncate=False)
Replace <unix-timestamp-column-name> with your column name, and add a from clause for your table.
If you want to test in Hive, use the query below.
hive# select concat(concat(substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),
substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12));
+--------------------------+--+
| _c0 |
+--------------------------+--+
| 2017-11-02T21:35:16.00 |
+--------------------------+--+
Hope this helps to resolve your issue!
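A possibly shorter alternative to the concat/substr approach is to put the T directly into the format pattern as a quoted literal. The sketch below is an assumption, not something tested on the poster's cluster: it relies on from_unixtime accepting Java-style date patterns, where text in single quotes is emitted literally. The helper names iso_timestamp_expr and to_iso_utc are hypothetical. The second helper only cross-checks the expected shape of the result in plain Python, in UTC, so the hour differs from the outputs above, which were produced in the cluster's local time zone.

```python
from datetime import datetime, timezone

# Java-style date pattern; the single quotes make T a literal character.
ISO_PATTERN = "yyyy-MM-dd'T'HH:mm:ss"


def iso_timestamp_expr(column_name):
    """Build a from_unixtime expression for a column or literal (hypothetical helper)."""
    return 'from_unixtime(cast({0} as bigint), "{1}")'.format(column_name, ISO_PATTERN)


def to_iso_utc(epoch_seconds):
    """Format epoch seconds as yyyy-MM-ddTHH:mm:ss in UTC, for cross-checking only."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


print("select " + iso_timestamp_expr("1509672916"))
print(to_iso_utc(1509672916))
```

The generated string could then be passed to hiveContext.sql(...) in place of the nested concat query.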
10-25-2017
04:19 PM
1 Kudo
Let's think about the basics. The RDD being saved is distributed across machines. If all the tasks wrote to the same file in HDFS, they could only append, and the writes would contend for locks because multiple clients would be writing at the same time. It is a classic case of distributed concurrent clients trying to write to one file (imagine multiple threads writing to the same log file). That is why a directory is created and each task writes its own file. Collectively, all the files in your output directory are the output of your job.
Solutions:
1. Call rdd.coalesce(1).saveAsTextFile('/path/outputdir'), and then in your driver use hdfs mv to move part-00000 to filename.txt.
2. Assuming the data is small (as you want to write to a single file), perform an rdd.collect() and write to HDFS in the driver by getting an HDFS handle.
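The rename step in solution 1 can be sketched as follows. This is a minimal illustration that assumes the output directory is on a local filesystem (for example, when Spark runs in local mode); on HDFS the equivalent move would be done with hdfs dfs -mv or an HDFS client instead. The helper name promote_single_part_file is hypothetical.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, target_path):
    """Move the only part-* file in output_dir to target_path (hypothetical helper)."""
    # After coalesce(1) there should be exactly one part file next to _SUCCESS.
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(part_files) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(part_files))
    shutil.move(part_files[0], target_path)
    return target_path
```

The check on the number of part files guards against calling the helper on output that was not coalesced to a single partition.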
10-20-2017
06:34 AM
Try this code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf1 = SparkConf().setAppName('sort_desc')
sc1 = SparkContext(conf=conf1)
# Creating the SQLContext enables rdd.toDF()
sql_context = SQLContext(sc1)
csv_file_path = 'emp.csv'
# Split each line into a (dept, ctc) pair; sortByKey() expects key-value pairs
employee_rdd = sc1.textFile(csv_file_path).map(lambda line: tuple(line.split(',')))
print(type(employee_rdd))
# Sort by the key (dept) in descending order
employee_rdd_sorted = employee_rdd.sortByKey(ascending=False)
employee_df = employee_rdd.toDF(['dept', 'ctc'])
employee_df_sorted = employee_rdd_sorted.toDF(['dept', 'ctc'])
10-13-2017
07:22 PM
@Dinesh Chitlangia Thank you for the explanation. In that case I would rather use reduceByKey() to get the number of occurrences. Thanks for the info on countByValue().
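For readers following along, here is a plain-Python sketch of what counting occurrences with a reduceByKey-style merge computes. It is not Spark code: reduce_by_key is a hypothetical helper and the sample words are made up for illustration.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like a single-machine reduceByKey."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later values are folded in with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


words = ["a", "b", "a", "c", "a"]
counts = reduce_by_key(((word, 1) for word in words), add)
print(counts)
```

In Spark the corresponding call would be rdd.map(lambda w: (w, 1)).reduceByKey(add), whose result stays distributed as an RDD, whereas countByValue() returns the counts to the driver as a dictionary.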
10-10-2017
05:47 PM
3 Kudos
Spark 1.6.3 does not support this. https://spark.apache.org/docs/1.6.3/sql-programming-guide.html#creating-dataframes
08-30-2017
07:07 PM
5 Kudos
@Viswa
Here are the 2 major aspects on which they differ:
1. Connection:
The Hive CLI connects directly to HDFS and the Hive Metastore, and can be used only on a host with access to those services.
Beeline connects to HiveServer2 and requires access to only one .jar file: hive-jdbc-<version>-standalone.jar.
2. Authorization:
The Hive CLI uses only storage-based authorization.
Beeline uses SQL standard-based authorization or Ranger-based authorization, and thus offers greater security.
For the above reasons it is better to use Beeline than the Hive CLI (I believe the Hive CLI will soon be deprecated). Read here for a better understanding of Beeline: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/beeline-vs-hive-cli.html
07-28-2017
12:55 PM
One point: if you specify a delimiter that is not the true delimiter in the file, no error will be thrown. Rather, the full record (including its true delimiters) will be treated as a single field. In this case, the true delimiters will just be characters in a string.
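The same behavior can be seen with a plain-Python analogy. The record below is made up for illustration, and str.split merely stands in for whatever tool is parsing the file.

```python
# A made-up comma-delimited record.
record = "10,Sales,50000"

# Splitting on the true delimiter yields three fields.
fields_with_true_delimiter = record.split(",")

# Splitting on a delimiter that never occurs raises no error;
# the whole record comes back as one field, commas included.
fields_with_wrong_delimiter = record.split("|")

print(fields_with_true_delimiter)
print(fields_with_wrong_delimiter)
```

A quick sanity check after loading, such as counting the fields per record, would catch this kind of silent mismatch.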