Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 11538 | 02-20-2018 12:33 PM
 | 1155 | 02-19-2018 05:12 AM
 | 1441 | 12-28-2017 06:13 AM
 | 6321 | 09-28-2017 09:25 AM
 | 10941 | 09-25-2017 11:19 AM
11-22-2019
10:56 PM
Hi @mqureshi , you have explained this beautifully. But how does the replication of blocks impact this calculation? Please explain. Regards.
09-25-2018
05:08 PM
I am interested too.
02-02-2018
03:58 PM
Let's approach your problems from the basics.
1. Spark depends on Hadoop's InputFormat, so all input formats that are valid in Hadoop are valid in Spark too.
2. Spark is a compute engine, so the rest of the ideas around compression and shuffle remain the same as in Hadoop.
3. Spark mostly works with the Parquet or ORC file formats, which are block-level compressed (generally gz-compressed in blocks), making the files splittable.
4. If a file is compressed, then depending on the codec (whether it supports splitting or not) Spark will spawn that many tasks. The logic is the same as in Hadoop.
5. Spark handles compression in the same way MR does.
6. Compressed data cannot be processed directly, so data is always decompressed for processing; for shuffling, data is compressed again to optimize network bandwidth usage.
Spark and MR are both compute engines. Compression is about packing data bytes closely so that data can be stored and transferred in an optimized way.
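Point 4 above can be checked quickly in spark-shell: a gzip-compressed text file is not splittable, so it yields a single partition (and a single task), while the uncompressed version splits on HDFS block boundaries. A minimal sketch, assuming a running spark-shell with sc available; the HDFS paths are hypothetical:

```scala
// Hypothetical paths, for illustration only.
// gzip is a stream codec with no split points, so Hadoop/Spark must hand
// the whole file to one task; plain text splits per HDFS block.
val plain   = sc.textFile("hdfs:///data/logs/events.txt")
val gzipped = sc.textFile("hdfs:///data/logs/events.txt.gz")

println(plain.getNumPartitions)   // roughly fileSize / blockSize partitions
println(gzipped.getNumPartitions) // 1, regardless of file size
```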
12-19-2017
11:12 PM
You should be able to use "show table extended ... partition" to see if you can get info on a partition, and skip any partition that is zero bytes. Like this:

scala> var sqlCmd="show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')"
sqlCmd: String = show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')

scala> var partitionsList=sqlContext.sql(sqlCmd).collectAsList
partitionsList: java.util.List[org.apache.spark.sql.Row] = [[mydb,mytable,false,
Partition Values: [date_time_date=2017-01-01]
Location: hdfs://mycluster/apps/hive/warehouse/mydb.db/mytable/date_time_date=2017-01-01
Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties: [serialization.format=1]
Partition Parameters: {rawDataSize=441433136, numFiles=1, transient_lastDdlTime=1513597358, totalSize=4897483, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numRows=37825} ]]

Let me know if that works and you can avoid the 0-byte partitions that way, or if you still get a null pointer. James
11-13-2017
08:36 PM
@Bala Vignesh N V So you've renamed the partition in Hive and can see the new name there, but when you look on HDFS it still has the original partition name, correct? In my example in the previous post I originally had two partitions (part=a and part=b) in Hive, and I renamed part=a to part=z. On HDFS, part=a never changed, but the PART_NAME column in the metastore database was updated to part=z. In Hive, I can only see part=z and part=b, and if I do a SELECT for the data in part=z, it will look up the LOCATION column from the metastore database for part=z, which still points to the part=a directory on HDFS, and read the data for part=z from there. So for external tables, you can rename the partitions in Hive to whatever you like without affecting the underlying data on HDFS.
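For reference, the rename in that example is a one-line DDL statement; the table and partition names below are hypothetical, and this assumes a spark-shell with Hive support (it works the same from the Hive shell):

```scala
// Hypothetical external table. For external tables this only updates the
// metastore bookkeeping (PART_NAME); the part=a directory on HDFS, which
// LOCATION still points to, is left untouched.
sqlContext.sql("ALTER TABLE mydb.mytable PARTITION (part='a') RENAME TO PARTITION (part='z')")
sqlContext.sql("SHOW PARTITIONS mydb.mytable").show()
```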
11-30-2018
02:17 PM
Hello folks, please help me with the following query in Hive. There are two tables, T1 and T2. Find the sum of the prices: if a customer buys all the products, how much does he have to pay after discount?

Table: T1
ProductID | ProductName | Price
---|---|---
1 | p1 | 1000
2 | p2 | 2000
3 | p3 | 3000
4 | p4 | 4000
5 | p5 | 5000

Table: T2
ProductID | Discount %
---|---
1 | 10
2 | 15
3 | 10
4 | 15
5 | 20
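One way to sketch this (in spark-shell against Hive, assuming the discount column is named Discount) is to join the two tables on ProductID and sum the discounted prices. With the sample data above that is 1000*0.90 + 2000*0.85 + 3000*0.90 + 4000*0.85 + 5000*0.80 = 12700.

```scala
// Sketch only; assumes Hive tables T1(ProductID, ProductName, Price) and
// T2(ProductID, Discount) as in the question, in a spark-shell with Hive
// support. The LEFT JOIN plus COALESCE treats a product with no discount
// row as having a 0% discount.
val total = sqlContext.sql("""
  SELECT SUM(t1.Price * (1 - COALESCE(t2.Discount, 0) / 100.0)) AS total_payable
  FROM T1 t1
  LEFT JOIN T2 t2 ON t1.ProductID = t2.ProductID
""")
total.show()  // with the sample data above: 12700.0
```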
10-30-2017
12:13 PM
@Saurabh This sometimes happens because of a limit on the number of lines displayed in the CLI. Try redirecting the output to a file: hive -e "show create table sample_db.i0001_ivo_hdr;" > ddl.txt
09-17-2017
09:02 AM
Summary

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. The Data Lake, a next-generation data storage and management solution, was developed to meet the ever-evolving needs of increasingly savvy users. This article explores existing challenges with the enterprise data warehouse and other existing data management and analytic solutions. It describes the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model. It also covers the characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.

Current EDW Challenges

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. With the evolution of users' needs coupled with advances in data storage technologies, the inadequacies of current enterprise data warehousing solutions have become more apparent. The following challenges with today's data warehouses can impede usage and prevent users from maximizing their analytic capabilities:

Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process. When users need immediate access to data, even short processing delays can be frustrating and cause users to bypass the proper processes in favor of getting the data quickly themselves. Users also may waste valuable time and resources to pull the data from operational systems, store and manage it themselves, and then analyze it.

Flexibility. Users not only lack on-demand access to any data they may need at any time, but also the ability to use the tools of their choice to analyze the data and derive critical insights. Additionally, current data warehousing solutions often store one type of data, while today's users need to be able to analyze and aggregate data across many different formats.

Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on are unclear, users may not trust the data. Also, if users worry that the data in the data warehouse is missing or inaccurate, they may circumvent the warehouse in favor of getting the data themselves directly from other internal or external sources, potentially leading to multiple, conflicting instances of the same data.

Findability. With many current data warehousing solutions, users do not have a function to rapidly and easily search for and find the data they need when they need it. Inability to find data also limits the users' ability to leverage and build on existing data analyses.
Advanced analytics users require a data storage solution based on an IT "push" model (not driven by specific analytics projects). Unlike existing solutions, which are specific to one or a small family of use cases, what is needed is a storage solution that enables multiple, varied use cases across the enterprise.

This new solution needs to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without extensive modeling, and to scale to large datasets while delivering performance. It should support advanced analytics, like machine learning and text analytics, and allow users to cleanse and process data iteratively and to track the lineage of data for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place.

The solution that fits all of these criteria is the Data Lake.
The Data Lake Architecture

The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from webserver logs, databases, social media, and third-party sources is ingested into the Data Lake. Curation takes place through capturing metadata and lineage and making it available in the data catalog (Datapedia). Security policies, including entitlements, are also applied.

Data can flow into the Data Lake by either batch processing or real-time processing of streaming data. Additionally, the data itself is no longer restrained by initial schema decisions and can be exploited more freely by the enterprise.

Rising above this repository is a set of capabilities that allow IT to provide Data and Analytics as a Service (DAaaS) in a supply-demand model. IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are the consumers.

The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake's data catalog (a Datapedia) to find and select the available data and fill a metaphorical "shopping cart" (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.

Although provisioning an analytic sandbox is a primary use, the Data Lake also has other applications. For example, the Data Lake can also be used to ingest raw data, curate the data, and apply ETL. This data can then be loaded into an Enterprise Data Warehouse. To take advantage of the flexibility provided by the Data Lake, organizations need to customize and configure the Data Lake to their specific requirements and domains.

Characteristics of a Successful Data Lake Implementation

A Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate the data, and then to run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:

Use of multiple tools and products. Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. The cross-engine integration necessary for a successful Data Lake requires multiple technology stacks that natively support structured, semi-structured, and unstructured data types.

Domain specification. The Data Lake must be tailored to the specific industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data. This search capability needs to provide an intuitive means for navigation, including keyword, faceted, and graphical search. Under the covers, such a capability requires sophisticated business ontologies, within which business terminology can be mapped to the physical data. The tools used should enable independence from IT so that business users can obtain the data they need when they need it and can analyze it as necessary, without IT intervention.

Automated metadata management. The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake. Attributes like data lineage, data quality, and usage history are vital to usability. Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility. Without a high degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.

Configurable ingestion workflows. In a thriving Data Lake, new sources of external information will be continually discovered by business users. These new sources need to be rapidly on-boarded to avoid frustration and to realize immediate opportunities. A configuration-driven ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources.

Integration with the existing environment. The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods. It needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.

Keeping all of these elements in mind is critical for the design of a successful Data Lake.

Designing the Data Lake

Designing a successful Data Lake is an intensive endeavor, requiring a comprehensive understanding of the technical requirements and the business acumen to fully customize and integrate the architecture for the organization's specific needs. Knowledgent's Big Data Scientists and Engineers provide the expertise necessary to evolve the Data Lake into a successful Data and Analytics as a Service solution, including:

DAaaS Strategy Service Definition. Our Informationists define the catalog of services to be provided by the DAaaS platform, including data onboarding, data cleansing, data transformation, datapedias, analytic tool libraries, and others.

DAaaS Architecture. We help our clients achieve a target-state DAaaS architecture, including architecting the environment, selecting components, defining engineering processes, and designing user interfaces.

DAaaS PoC. We design and execute Proofs of Concept (PoCs) to demonstrate the viability of the DAaaS approach. Key capabilities of the DAaaS platform are built and demonstrated using leading-edge databases and other selected tools.

DAaaS Operating Model Design and Rollout. We customize our DAaaS operating models to meet the individual client's processes, organizational structure, rules, and governance. This includes establishing DAaaS chargeback models, consumption tracking, and reporting mechanisms.

DAaaS Platform Capability Build-Out. We provide the expertise to conduct an iterative build-out of all platform capabilities, including design, development and integration, testing, data loading, metadata and catalog population, and rollout.

Conclusion

The Data Lake can be an effective data management solution for advanced analytics experts and business users alike. A Data Lake allows users to analyze a large variety and volume of data when and how they want. Following a Data and Analytics as a Service (DAaaS) model provides users with on-demand, self-serve data. However, to be successful, a Data Lake needs to leverage a multitude of products while being tailored to the industry and providing users with extensive, scalable customization. Knowledgent's Informationists provide the blend of technical expertise and business acumen to help organizations design and implement their perfect Data Lake.