Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 13380 | 02-20-2018 12:33 PM
 | 1514 | 02-19-2018 05:12 AM
 | 1864 | 12-28-2017 06:13 AM
 | 7149 | 09-28-2017 09:25 AM
 | 12190 | 09-25-2017 11:19 AM
04-17-2020
02:24 PM
They are actually not the same. SORT BY sorts data within each partition, while ORDER BY is a global sort. SORT BY calls the sortWithinPartitions() function, while ORDER BY calls sort(). Both of these functions call sortInternal(), but with a different global flag:

def sortWithinPartitions ... sortInternal(global = false, sortExprs)
def sort ... sortInternal(global = true, sortExprs)
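A minimal sketch of the same distinction through the DataFrame API (the data and the column name "id" are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SortDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Four partitions so the within-partition vs. global difference is visible.
val df = spark.range(0, 100).toDF("id").repartition(4)

// SORT BY: each partition is sorted independently; no total order across partitions.
val sortedWithin = df.sortWithinPartitions($"id")

// ORDER BY: a full shuffle produces one global ordering.
val sortedGlobal = df.orderBy($"id")

sortWithinPartitions avoids the shuffle that a global sort requires, which is exactly why SORT BY is preferred when only per-partition ordering is needed.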
11-22-2019
10:56 PM
Hi @mqureshi, you have explained this beautifully. But how does the replication of blocks impact this calculation? Please explain. Regards.
09-25-2018
05:08 PM
I am interested too.
02-02-2018
03:58 PM
Let's approach your problems from the basics.

1. Spark depends on the InputFormat from Hadoop, hence all input formats that are valid in Hadoop are valid in Spark too.
2. Spark is a compute engine, so the rest of the ideas around compression and shuffle remain the same as in Hadoop.
3. Spark mostly works with the Parquet or ORC file formats, which are block-level compressed (generally gzip-compressed in blocks), hence making the files splittable.
4. If a file is compressed, then depending on the compression (whether it supports splitting or not), Spark will spawn that many tasks. The logic is the same as in Hadoop.
5. Spark handles compression in the same way as MR.
6. Compressed data cannot be processed directly, hence data is always decompressed for processing; for shuffling, data is compressed again to optimize network bandwidth usage.

Spark and MR are both compute engines. Compression has to do with packing data bytes closely so that data can be saved/transferred in an optimized way. A short sketch contrasting splittable and non-splittable output follows.
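A minimal sketch of points 3 and 4, assuming hypothetical /tmp output paths; Parquet compresses block by block and stays splittable, while a gzip-compressed text file can only be read by a single task per file:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CompressionDemo").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(0, 1000000).toDF("id")

// Block-level compression: each Parquet column chunk is gzip-compressed internally,
// so the file remains splittable and many tasks can read it in parallel.
df.write.option("compression", "gzip").parquet("/tmp/demo_parquet")

// Whole-file compression: a .gz text file is not splittable,
// so Spark must read each file with a single task regardless of its size.
df.select($"id".cast("string")).write.option("compression", "gzip").text("/tmp/demo_text_gz")

// Compare the read-side parallelism of the two layouts.
println(spark.read.parquet("/tmp/demo_parquet").rdd.getNumPartitions)
println(spark.read.text("/tmp/demo_text_gz").rdd.getNumPartitions)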
12-19-2017
11:12 PM
You should be able to use show table extended ... partition to see if you can get info on a partition, and then avoid opening any that are zero bytes. Like this:

scala> var sqlCmd = "show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')"
sqlCmd: String = show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')

scala> var partitionsList = sqlContext.sql(sqlCmd).collectAsList
partitionsList: java.util.List[org.apache.spark.sql.Row] = [[mydb,mytable,false,Partition Values: [date_time_date=2017-01-01]
Location: hdfs://mycluster/apps/hive/warehouse/mydb.db/mytable/date_time_date=2017-01-01
Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties: [serialization.format=1]
Partition Parameters: {rawDataSize=441433136, numFiles=1, transient_lastDdlTime=1513597358, totalSize=4897483, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numRows=37825}
]]

Let me know if that works and you can avoid the zero-byte partitions this way, or if you still get the null pointer. James
11-13-2017
08:36 PM
@Bala Vignesh N V So you've renamed the partition in Hive and can see the new name there, but when you look on HDFS it still has the original partition name, correct? In my example in the previous post I originally had two partitions (part=a and part=b) in Hive, and I renamed part=a to part=z. On HDFS, part=a never changed, but the PART_NAME column in the metastore database was updated to part=z. In Hive, I can only see part=z and part=b. If I do a SELECT for the data in part=z, Hive looks up the LOCATION column from the metastore database for part=z, which still points to the part=a directory on HDFS, and reads the data for part=z from there. So for external tables, you can rename the partitions in Hive to whatever you like without affecting the underlying data on HDFS.
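A minimal sketch of that rename-and-verify flow, assuming a hypothetical external table mydb.mytable partitioned by part; the statements are standard HiveQL, issued here through spark.sql:

// Rename the partition in the metastore; for an external table the
// part=a directory on HDFS is left untouched.
spark.sql("ALTER TABLE mydb.mytable PARTITION (part='a') RENAME TO PARTITION (part='z')")

// Verify: the Location of part=z still points at the original part=a directory.
spark.sql("DESCRIBE FORMATTED mydb.mytable PARTITION (part='z')").show(100, truncate = false)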
11-30-2018
02:17 PM
Hello folks, please help me with the following query in Hive. There are two tables, T1 and T2. If a customer buys all the products, find the sum of the prices, i.e., how much he has to pay after the discount.

Table: T1

ProductID | ProductName | Price
---|---|---
1 | p1 | 1000
2 | p2 | 2000
3 | p3 | 3000
4 | p4 | 4000
5 | p5 | 5000

Table: T2

ProductID | Discount %
---|---
1 | 10
2 | 15
3 | 10
4 | 15
5 | 20
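A minimal sketch of one way to write this, assuming the tables are registered as t1 and t2 with columns productid/productname/price and productid/discount; the inner query is plain HiveQL and runs as-is in Hive, shown here through spark.sql:

// Join each product to its discount and sum the discounted prices.
val total = spark.sql("""
  SELECT SUM(t1.price * (1 - t2.discount / 100.0)) AS total_after_discount
  FROM t1
  JOIN t2 ON t1.productid = t2.productid
""")

total.show()  // with the sample data: 900 + 1700 + 2700 + 3400 + 4000 = 12700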
10-30-2017
12:13 PM
@Saurabh It happens sometimes because of the limit on the number of lines displayed in the CLI. Try this:

hive -e "show create table sample_db.i0001_ivo_hdr;" > ddl.txt