Member since: 02-25-2016
Posts: 72
Kudos Received: 34
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3608 | 07-28-2017 10:51 AM
 | 2752 | 05-08-2017 03:11 PM
 | 1164 | 04-03-2017 07:38 PM
 | 2834 | 03-21-2017 06:56 PM
 | 1139 | 02-09-2017 08:28 PM
08-03-2020
07:46 AM
@ManuN
As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. A new thread also lets you provide details specific to your environment, which will help others give you a more accurate answer to your question.
11-03-2017
04:34 AM
@Viswa Converting a regular unix timestamp field to a human-readable value without the T in it is a lot simpler, as you can use the conversion below for that.
pyspark
>>> hiveContext.sql("select from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS')").show(truncate=False)
+-----------------------+
|_c0                    |
+-----------------------+
|2017-11-02 21:35:16.000|
+-----------------------+
pyspark
>>> hiveContext.sql("select from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SSS')")

But you are expecting the format yyyy-MM-ddThh:mm:ss. For this case you need to concatenate the date and the time with the letter T.
pyspark
>>> hiveContext.sql("""select concat(concat(substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12))""").show(truncate=False)
+-----------------------+
|_c0                    |
+-----------------------+
|2017-11-02T21:35:16.00 |
+-----------------------+
Your query:
pyspark
>>> hiveContext.sql("""select concat(concat(substr(cast(from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),1,10),'T'),
substr(cast(from_unixtime(cast(<unix-timestamp-column-name> as bigint),'yyyy-MM-dd HH:mm:ss.SS') as string),12))""").show(truncate=False)

Replace <unix-timestamp-column-name> with your column name. In case you want to test in Hive, use the query below:
hive# select concat(concat(substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS') as string),1,10),'T'),
substr(cast(from_unixtime(cast(1509672916 as bigint),'yyyy-MM-dd HH:mm:ss.SSS') as string),12));
+--------------------------+--+
| _c0                      |
+--------------------------+--+
| 2017-11-02T21:35:16.00   |
+--------------------------+--+
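As a side note (not from the original answer): from_unixtime() takes a Java SimpleDateFormat pattern, so the literal T can usually be quoted inside the pattern itself, which avoids the concat/substr juggling. A minimal sketch, assuming the same pyspark shell as above and that your Hive/Spark version honors quoted pattern literals:
pyspark
>>> hiveContext.sql("""select from_unixtime(cast(1509672916 as bigint), "yyyy-MM-dd'T'HH:mm:ss.SSS")""").show(truncate=False)
This should print 2017-11-02T21:35:16.000 in one step.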
Hope this will help to resolve your issue...!!!
10-25-2017
04:19 PM
1 Kudo
Let's think of the basics. The RDD being saved is distributed across machines, so if all of its tasks started writing to the same file in HDFS, they could only append and the write would run into a huge number of locks, because multiple clients would be writing at the same time. It's the classic case of distributed, concurrent clients trying to write to one file (imagine multiple threads writing to the same log file). That's the reason a directory is created and each individual task writes its own file; collectively, all the files present in your output directory are the output of your job. Solutions:
1. rdd.coalesce(1).saveAsTextFile('/path/outputdir'), then in your driver use hdfs dfs -mv to move part-00000 to filename.txt.
2. Assuming the data is small (as you want to write to a single file), perform an rdd.collect() and write to HDFS in the driver by getting an HDFS handle.
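A minimal sketch of option 1 (rdd and the paths are illustrative):
pyspark
>>> rdd = sc.parallelize(['line one', 'line two'])       # illustrative data
>>> rdd.coalesce(1).saveAsTextFile('/path/outputdir')    # produces a single part-00000 file
Then rename the single part file from the driver or a shell step:
hdfs dfs -mv /path/outputdir/part-00000 /path/filename.txt
And a sketch of option 2, only if the data comfortably fits in the driver (writing to a local file here for simplicity; against HDFS you would use an HDFS client handle as described above):
pyspark
>>> with open('/tmp/filename.txt', 'w') as f:
...     for line in rdd.collect():
...         f.write(line + '\n')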
10-20-2017
06:34 AM
Try this code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Set up the contexts; SQLContext is needed for toDF().
conf1 = SparkConf().setAppName('sort_desc')
sc1 = SparkContext(conf=conf1)
sql_context = SQLContext(sc1)

# Read the CSV, split each line into [dept, ctc], and sort by dept descending.
csv_file_path = 'emp.csv'
employee_rdd = sc1.textFile(csv_file_path).map(lambda line: line.split(','))
print(type(employee_rdd))
employee_rdd_sorted = employee_rdd.sortByKey(ascending=False)

# Convert both RDDs to DataFrames and display the sorted one.
employee_df = employee_rdd.toDF(['dept','ctc'])
employee_df_sorted = employee_rdd_sorted.toDF(['dept','ctc'])
employee_df_sorted.show()
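If you prefer to do the sort after converting to a DataFrame, a small alternative sketch (same column names as above; desc comes from pyspark.sql.functions):

from pyspark.sql.functions import desc

# Alternative: sort the DataFrame itself instead of the underlying RDD.
employee_df_sorted_alt = employee_df.orderBy(desc('dept'))
employee_df_sorted_alt.show()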
10-13-2017
07:22 PM
@Dinesh Chitlangia Thank you for the explanation. In that case I would rather use reduceByKey() to get the number of occurrences. Thanks for the info on countByValue().
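For reference, a minimal sketch of that reduceByKey() approach (illustrative data, assuming a pyspark shell with a SparkContext named sc):
pyspark
>>> words = sc.parallelize(['a', 'b', 'a', 'c', 'a'])
>>> # Pair each element with 1, then sum the 1s per key to count occurrences.
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
>>> print(counts.collect())   # e.g. [('a', 3), ('b', 1), ('c', 1)] (order may vary)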
10-10-2017
05:47 PM
3 Kudos
Spark 1.6.3 does not support this. https://spark.apache.org/docs/1.6.3/sql-programming-guide.html#creating-dataframes
08-30-2017
07:07 PM
5 Kudos
@Viswa
Here are the 2 major aspects on which they differ:
1. Connection:
The Hive CLI connects directly to HDFS and the Hive Metastore, and can be used only on a host with access to those services.
Beeline connects to HiveServer2 and requires access to only one .jar file: hive-jdbc-<version>-standalone.jar.
2. Authorization:
The Hive CLI supports only storage-based authorization.
Beeline supports SQL standards-based authorization or Ranger-based authorization, and thus offers greater security.
For the above reasons it is better to use Beeline than the Hive CLI (I believe the CLI will soon be deprecated). Read here for a better understanding of Beeline: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/beeline-vs-hive-cli.html
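For illustration, a typical Beeline connection looks like the following (host, port, and username are placeholders for your environment), whereas the Hive CLI is simply invoked as hive on a node with access to HDFS and the metastore:
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n <username>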
07-28-2017
12:55 PM
One point: if you specify a delimiter that is not the true delimiter in the file ... no error will be thrown. Rather, it will treat the full record (including its true delimiters) as a single field. In this case, the true delimiters will just be characters in a string.
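A tiny illustration of that behaviour in plain Python (the sample record is hypothetical):
line = '101,engineering,50000'
print(line.split(','))   # ['101', 'engineering', '50000']  -- true delimiter, three fields
print(line.split('|'))   # ['101,engineering,50000']        -- wrong delimiter, one field, no error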