Hive partitioning divides a table into a number of partitions, and these partitions can be further subdivided into more manageable parts known as buckets or clusters. Bucketing is based on a hash function, which depends on the type of the bucketing column. Records with the same value in the bucketing column are always saved in the same bucket.
The CLUSTERED BY clause is used to divide a table into buckets. In partitioning, each partition is created as a directory, whereas in bucketing, each bucket is created as a file. Bucketing can also be used on Hive tables that are not partitioned.
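The hash-based assignment described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Hive's actual code: the class name BucketSketch and the method bucketFor are hypothetical, the bucket index is assumed to be the non-negative hash taken modulo the number of buckets, and the hash of an integer column is assumed to be the value itself.

```java
// Illustrative sketch only: BucketSketch and bucketFor are hypothetical names,
// not part of Hive's API.
final class BucketSketch {

    // Map a column value's hash code to a bucket index in [0, numBuckets).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int bucketFor(int hashCode, int numBuckets) {
        return (hashCode & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;                 // assumed bucket count for the example
        int[] userIds = {7, 12, 7, 21};     // made-up values of the bucketing column
        for (int id : userIds) {
            // Assumption: for an int column the hash is the value itself.
            int bucket = bucketFor(Integer.hashCode(id), numBuckets);
            System.out.println("user_id " + id + " -> bucket " + bucket);
        }
    }
}
```

Both rows with user_id 7 map to the same bucket index, which is the property the text relies on: equal values of the bucketing column always end up in the same bucket file.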
Bucketed tables allow much more efficient sampling than non-bucketed tables, which makes it possible to run queries on a section of the data for testing and debugging when the original data sets are very large. The user fixes the number of buckets (and thereby their approximate size) according to need. Bucketing also provides the flexibility to keep the records in each bucket sorted by one or more columns. Because the data files are parts of roughly equal size, map-side joins tend to be faster on bucketed tables.
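To make the sampling benefit concrete, the sketch below shows which bucket files a sample of the form "bucket x out of y" would need to read. It is a simplified model, not Hive code: SamplingSketch and sampledBuckets are hypothetical names, and y is assumed to be no larger than the table's bucket count and to divide it evenly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: SamplingSketch and sampledBuckets are hypothetical names.
final class SamplingSketch {

    // Return the 1-based bucket numbers covered by "bucket x out of y"
    // on a table stored in numBuckets buckets (assumes y <= numBuckets).
    static List<Integer> sampledBuckets(int x, int y, int numBuckets) {
        List<Integer> picked = new ArrayList<>();
        for (int b = 0; b < numBuckets; b++) {
            // A stored bucket belongs to sample x when its 0-based index modulo y is x - 1.
            if (b % y == x - 1) {
                picked.add(b + 1);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        // A table with 32 buckets, sampling bucket 3 out of 16.
        System.out.println(sampledBuckets(3, 16, 32)); // prints [3, 19]
    }
}
```

Only 2 of the 32 files are touched in this example, whereas a non-bucketed table would have to be scanned in full to draw a comparable sample.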
A client can interact with Hive in the following three ways:
- Hive Thrift Client: The Hive server is exposed as a Thrift service, so it is possible to interact with Hive from any programming language that supports Thrift.
- JDBC Driver: Hive provides a pure-Java Type 4 JDBC driver for connecting to the server, defined in the org.apache.hadoop.hive.jdbc.HiveDriver class (for HiveServer2, the class is org.apache.hive.jdbc.HiveDriver). Java applications can use this driver to connect to a Hive server running on a separate host and port. The Beeline CLI uses the JDBC driver to connect to the Hive server.
- ODBC Driver: An ODBC driver allows an application that supports ODBC to connect to the Hive server. By default, Apache does not ship an ODBC driver, but one is freely available from many vendors.
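As a minimal sketch of the JDBC route, the Java program below connects to a Hive server and lists its tables. Several details are assumptions rather than facts from the text: a HiveServer2 instance listening on localhost:10000, the "default" database, the user name "hive" with an empty password, and the Hive JDBC driver jar (org.apache.hive.jdbc.HiveDriver) on the classpath. HiveJdbcSketch and jdbcUrl are hypothetical names.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative sketch only: HiveJdbcSketch and jdbcUrl are hypothetical names.
final class HiveJdbcSketch {

    // Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws SQLException {
        // Assumed connection details; adjust to the actual server.
        String url = jdbcUrl("localhost", 10000, "default");

        // try-with-resources closes the result set, statement and connection in reverse order.
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one table name per row
            }
        }
    }
}
```

With a JDBC 4 driver on the classpath, DriverManager locates the driver automatically, so no explicit Class.forName call should be needed. Beeline takes the same style of URL on its command line through the -u option.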