Member since: 02-01-2018
Posts: 37
Kudos Received: 2
Solutions: 0
03-05-2018
06:09 AM
Would using DISTRIBUTE BY or SORT BY be helpful?
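For reference, this is the kind of usage I have in mind when writing out the data (table and column names here are hypothetical):

    INSERT OVERWRITE TABLE sales_orc
    SELECT *
    FROM sales_staging
    DISTRIBUTE BY customer_id  -- send rows with the same key to the same reducer
    SORT BY customer_id;       -- sort rows within each reducer's output file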
03-02-2018
05:42 PM
@Akshat Mathur That is handy, thank you for sharing! If you think my response was helpful, please accept the answer so it is easier for others to find.
02-26-2018
07:48 PM
Serialization is the process by which data is encoded for writing to disk or for transmission elsewhere. Different applications serialize data in different ways to optimize for a specific outcome, such as read or write performance. As the Hive language manual explains, integers and strings are encoded to disk and compressed in different ways, and it lists the rules it uses to do so. For example, variable-width encoding optimizes space usage because smaller values need fewer bytes: a small integer can fit in a single byte, while a larger one takes several. See the following Wikipedia article for more detail: https://en.wikipedia.org/wiki/Serialization
02-27-2018
08:03 AM
1 Kudo
Adding columns to the end of the table works from Hive 1.2 (via HDP 2.5.4). In Hive 2.1, you get additional abilities to change column types. In the eventual Hive 2.2, you'll get the ability to delete and reorder columns. Hive 0.13 is a little early for those features.
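A minimal sketch of what those operations look like, assuming a hypothetical ORC table named sales_orc:

    -- Hive 1.2+: append new columns at the end of the schema
    ALTER TABLE sales_orc ADD COLUMNS (discount_pct DOUBLE);

    -- Hive 2.1+: change a column's type (within the supported conversions)
    ALTER TABLE sales_orc CHANGE COLUMN quantity quantity BIGINT;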
02-23-2018
07:52 AM
Hey, I am pretty confused about which storage format is suited for which type of data. You said, "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain columns over a huge set of data," but I think that is true for ORC too. And since, as @owen said, ORC contains indexes at 3 levels (versus 2 levels in Parquet), shouldn't ORC be faster than Parquet for aggregations?
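A minimal sketch of the side-by-side test I have in mind to compare the two formats on the same data (table and column names are hypothetical):

    CREATE TABLE events_orc     STORED AS ORC     AS SELECT * FROM events_raw;
    CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_raw;

    SELECT category, SUM(amount) FROM events_orc     GROUP BY category;
    SELECT category, SUM(amount) FROM events_parquet GROUP BY category;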
02-20-2018
12:23 PM
Yes, that's correct. The only interesting point is how your client works. If it is a proper Hadoop client, it will run directly and in parallel on the nodes storing the file blocks. If you have a non-Hadoop client, it will really retrieve the full file from HDFS, process it, and write it back to HDFS. In a Spark application each of the steps a-d will be executed in parallel on different nodes, while the Hadoop framework takes care of bringing the execution to the data. And if the stripes are unluckily distributed across the blocks (and therefore across HDFS nodes), the data transfer between the nodes is much higher than if the stripes are well distributed. But this is exactly because the stripes are created independently of the blocks. It is also the key to optimizing the stripe size (together with your usage pattern).
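As an illustration of tuning the stripe size, here is a minimal sketch that sets the ORC stripe size to match a 128 MB HDFS block size (the table name and the 128 MB figure are assumptions, not values from this thread):

    CREATE TABLE logs_orc (id BIGINT, payload STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.stripe.size"="134217728");  -- 128 MB, in bytes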
03-02-2018
06:39 PM
@owen My numbers of mappers and reducers are almost down to half with ORC for a query, and the bytes read from HDFS are also reduced significantly. But the time taken by the ORC query is still almost the same as the sequence file query.
02-19-2018
11:42 AM
1 Kudo
Yes, it's possible, but only if there is no repetition in a column. In that case one ends up with the full data plus the metadata of the columnar file format.