Created on 03-29-2018 06:53 PM
Hive tables are organized and stored as partitions. It helps improve the query performance. By default, Hive stores each partitions data in a separate folder under the table’s locations in the file system. Hive also provides support for individually setting the location of each partition. This allows users to store hive partitioned data in different locations.
Let us take an example of a hive table called student with the four columns - id, name, dept and year and partitioned on the year column. Let’s say we insert data into this table for the years 2009 and 2011. By default, the hive table data is stored under the hive metastore directory in HDFS specified by the parameter $hive.metastore.warehouse.dir in hive-default.xml. Fig.1 shows the two partition folders under the student directory in the hive metastore.
Fig.1. Partitions in default location in Warehouse.
Now, let us say we want to add a partition for the year 2018 but store the data in a different location in HDFS - /data/hive/student/currentYear/2018. This can be achieved by running the following command in hive.
hive> ALTER TABLE student ADD PARTITION (year=2018) LOCATION ‘/data/hive/student/currentYear/2018’;
The location of an already existing partition can be changed by running the following command. Executing this command will not move any existing data in this partition to the new location.
hive> ALTER TABLE student PARTITION (year=2018) SET LOCATION ‘/data/hive/student/currentYear/2018’;
After inserting data into this partition, the data will be stored in the new location in HDFS, as shown in Fig.2 below.
Fig.2. Partitions in configured location in Warehouse.
In a federated cluster, we can store partitions on different locations on different namespaces as well. This is particularly useful when we want to separate archived data from current data so as to not overload the file system used for processing current data.
Let us take an example where we have a federated cluster with two HA namespaces - NS1 (hdfs://ns1) and NS2 (hdfs://ns2). And we want to keep the old/ archived partitions in NS2 and keep the current hive partitions in NS1.
It is a little tricky to set the location of a partition to a different namespace. Let’s say the defaultFS points to NS1 (fs.defaultFs = hdfs://ns1) and the default location for student table’s partitions is in NS1 -> hdfs://ns1/user/hive/warehouse. If we want to set the location of a partition to a different location in NS1 itself, it can be done by directly altering the partition properties, same as for a non-federated cluster.
But if we want to set the location of a partition to a different namespace, we would first have to alter the default location of the table itself and then add/ alter the partitions. Even while inserting or loading data into these partitions, the default location of the table must be set to the required namespace.
hive> ALTER TABLE student SET LOCATION 'hdfs://ns2/user/hive/warehouse/student'; hive> ALTER TABLE student ADD PARTITION (year=1990); hive> INSERT INTO student PARTITION (year=1990) VALUES (8,'H','CSE');
The above commands would set the location of the partition year 1990 to hdfs://ns2/user/hive/warehouse/student/year=1990 and store the entry with id 8 in this location (as shown in Fig.3).
Fig.3. Partitions in a different Namespace.
Hive partition locations are very dynamic and can change with each new operation. Whenever data is inserted or loaded into a table, if the scheme and authority of the corresponding partition’s location URI does not match the scheme and authority of the table’s location URI, then the partition location is reset to default as per the table’s current location.
In the example above, if after inserting one record in partition year 1990, say we reset the table location to the original namespace NS1. Now when we insert another record, the new record would be stored in NS1 rather than NS2. The partition year 1990's default location would also be updated to point to NS1 (hdfs://ns1/user/hive/warehouse/student/year=1990).
hive> ALTER TABLE student SET LOCATION 'hdfs://ns1/user/hive/warehouse/student'; hive> INSERT INTO student PARTITION (year=1990) VALUES (16,'P','CSE');
Fig.4. Partition's location reset to default location under the table folder.
Let’s take another example where a partition’s URI path will be automatically reset to the default path of the table. Say we have an existing partition for year 2018 at location hdfs://ns1/data/hive/student/currentYear/2018. And we execute the following commands:
hive> ALTER TABLE student SET LOCATION 'hdfs://ns2/user/hive/warehouse/student' hive> INSERT INTO student PARTITION (year=2018) VALUES (24,'X','EEE');
The above commands would alter the location of the partition year 2018 to hdfs://ns2/user/hive/warehouse/student/year=2018. Note that along with the scheme and the authority, the path component of the URI is also reset to the default location.
For a federated cluster with ViewFs as the defaultFs, we need not worry about altering the table’s default location every time we do an operation on a partition in a different namespace. We can set a partition’s default location to a different namespace via ViewFs mount points. ViewFs takes care of resolving locations to their respective namespaces. So, partitions can be stored on different namespaces without altering the table’s default location.
Hive partitions can be stored in different namespaces in a federated cluster. Before performing operations such as inserting or loading data into the partition or altering the partition properties, the location of the table must be updated to point to the namespace corresponding to that partition. The scheme and authority of a partition’s URI are overridden by the scheme and authority of the table’s URI.