PARTITIONS IN HIVE
Hive tables are organized and stored as partitions. It helps improve the query performance. By default, Hive stores each partitions data in a separate folder under the table’s locations in the file system. Hive also provides support for individually setting the location of each partition. This allows users to store hive partitioned data in different locations.
Let us take an example of a hive table called
student with the four columns - id, name, dept and year and partitioned on the year column. Let’s say we insert data into this table for the years 2009 and 2011. By default, the hive table data is stored under the hive metastore directory in HDFS specified by the parameter $hive.metastore.warehouse.dir in hive-default.xml. Fig.1 shows the two partition folders under the student directory in the hive metastore.
Fig.1. Partitions in default location in Warehouse.
Now, let us say we want to add a partition for the year
2018 but store the data in a different location in HDFS - /data/hive/student/currentYear/2018. This can be achieved by running the following command in hive.
hive> ALTER TABLE student ADD PARTITION (year=2018) LOCATION ‘/data/hive/student/currentYear/2018’;
The location of an already existing partition can be changed by running the following command. Executing this command will not move any existing data in this partition to the new location.
hive> ALTER TABLE student PARTITION (year=2018) SET LOCATION ‘/data/hive/student/currentYear/2018’;
After inserting data into this partition, the data will be stored in the new location in HDFS, as shown in Fig.2 below.
Fig.2. Partitions in configured location in Warehouse. PARTITIONS IN FEDERATED CLUSTERS
federated cluster, we can store partitions on different locations on different namespaces as well. This is particularly useful when we want to separate archived data from current data so as to not overload the file system used for processing current data.
Let us take an example where we have a federated cluster with two HA namespaces - NS1 (hdfs://ns1) and NS2 (hdfs://ns2). And we want to keep the old/ archived partitions in NS2 and keep the current hive partitions in NS1.
It is a little tricky to set the location of a partition to a different namespace. Let’s say the defaultFS points to NS1 (fs.defaultFs = hdfs://ns1) and the default location for student table’s partitions is in NS1 -> hdfs://ns1/user/hive/warehouse. If we want to set the location of a partition to a different location in NS1 itself, it can be done by directly altering the partition properties, same as for a non-federated cluster.
But if we want to set the location of a partition to a different namespace, we would first have to alter the default location of the table itself and then add/ alter the partitions. Even while inserting or loading data into these partitions, the default location of the table must be set to the required namespace.
hive> ALTER TABLE student SET LOCATION 'hdfs://ns2/user/hive/warehouse/student';
hive> ALTER TABLE student ADD PARTITION (year=1990);
hive> INSERT INTO student PARTITION (year=1990) VALUES (8,'H','CSE');
The above commands would set the location of the partition year 1990 to hdfs://ns2/user/hive/warehouse/student/year=1990 and store the entry with id 8 in this location (as shown in Fig.3).
Fig.3. Partitions in a different Namespace.
Hive partition locations are very dynamic and can change with each new operation. Whenever data is inserted or loaded into a table, if the scheme and authority of the corresponding partition’s location URI does not match the scheme and authority of the table’s location URI, then the partition location is reset to default as per the table’s current location.
In the example above, if after inserting one record in partition year 1990, say we reset the table location to the original namespace NS1. Now when we insert another record, the new record would be stored in NS1 rather than NS2. The partition year 1990's default location would also be updated to point to NS1 (hdfs://ns1/user/hive/warehouse/student/year=1990).
hive> ALTER TABLE student SET LOCATION 'hdfs://ns1/user/hive/warehouse/student';
hive> INSERT INTO student PARTITION (year=1990) VALUES (16,'P','CSE');
Fig.4. Partition's location reset to default location under the table folder.
Let’s take another example where a partition’s URI path will be automatically reset to the default path of the table. Say we have an existing partition for year 2018 at location hdfs://ns1/data/hive/student/currentYear/2018. And we execute the following commands:
hive> ALTER TABLE student SET LOCATION 'hdfs://ns2/user/hive/warehouse/student'
hive> INSERT INTO student PARTITION (year=2018) VALUES (24,'X','EEE');
The above commands would alter the location of the partition year 2018 to hdfs://ns2/user/hive/warehouse/student/year=2018. Note that along with the scheme and the authority, the path component of the URI is also reset to the default location. PARTITIONS VIA VIEWFS
For a federated cluster with
ViewFs as the defaultFs, we need not worry about altering the table’s default location every time we do an operation on a partition in a different namespace. We can set a partition’s default location to a different namespace via ViewFs mount points. ViewFs takes care of resolving locations to their respective namespaces. So, partitions can be stored on different namespaces without altering the table’s default location. SUMMARY
Hive partitions can be stored in different namespaces in a federated cluster. Before performing operations such as inserting or loading data into the partition or altering the partition properties, the location of the table must be updated to point to the namespace corresponding to that partition. The scheme and authority of a partition’s URI are overridden by the scheme and authority of the table’s URI.
... View more
Device Behavior Analytics allows administrators to detect straggler nodes/disks which reduce the cluster performance. Finding and fixing these nodes can improve the overall cluster throughput. This feature will be available in HDP-2.6.1 and later. There are two parts to Device Behavior Analytics – Slow Datanode Detection and Slow Disk Detection. SLOW DATANODE (PEER) DETECTION
Datanode will collect latency statistics about their peers during the normal operation of the datanode write pipeline. These latency stats are used to detect outliers among the peers. Slow peer detection is not performed unless the datanode has statistics of at least 10 peers. Namenode maintains the list of slow peers and administrators can read it via JMX. Datanodes also expose the average write latency of their peers through Datanode JMX. SLOW DISK DETECTION
Each datanode will collect I/O statistics from all its disks. We can configure the percentage of file I/O events to be sampled to limit the performance impact of I/O profiling. Slow disk detection is not performed unless the datanode has at least 5 disks. Slow disk information is available via Datanode JMX. Namenode also exposes the slowest disks in the cluster via Namenode JMX. ENABLING DEVICE BEHAVIOR ANALYTICS
To enable Device Behavior Analytics, the following configurations must be set.
Set to true to enable Slow Datanode Detection.
Set to a value between 1 and 100 to enable Slow Disk detection. This setting controls the fraction of file IO events which will be sampled for profiling disk statistics. Setting this value to 100 will sample all disk IO. Sampling a large fraction of disk IO events might have a small performance impact.
This setting allows you to control how frequently datanodes will report their peer latencies to the Namenode via heartbeat and the frequency of disk outlier detection by the datanode. The default value for this setting is 30 minutes.
These settings should be added to hdfs-site.xml. If it is an Ambari installed cluster, then the settings can be added via custom hdfs-site.xml. SAMPLE JMX OUTPUTS
Sample Namenode JMX output reporting slow nodes
Sample Datanode JMX output showing average write latency of peers
Sample Datanode JMX output showing slow disks
The Datanode JMX output above reports disk3 and disk4 as outliers for the Datanode. The JMX also reports the latencies and number of operations per volume in another metric. The sample JMX output below for DataNodeVolume information for disk3 shows high average latencies for metadata and write IO operations.
Sample Namenode JMX output reporting slow disks and their latencies
Please follow the blog post link
here for a detailed explanation about Device Behavior Analytics.
... View more
I would like to add that you might have to edit the IntelliJ VM settings, if not done already. This is to avoid out of memory errors and also improve performance. This can be done by editing the idea.vmoptions file in one of the following ways: Use the main menu command Help | Edit Custom VM Options to create a copy of the idea.vmoptions file in the user home. Copy the existing file from the IntelliJ IDEA installation folder somewhere and save the path to this location in the environment variable IDEA_VM_OPTIONS. Edit (or create) idea.vmoptions file in /Users/<username>/Library/Preferences/IdeaIC2016.1 directory. Change the VM properties as follows. -Xms4096m
-XX:+UseCompressedOops Note that these are just sample values and you can have your own custom set of values for these parameters. Also note that there is a limit on what theses values can be set to. Setting the ReservedCodeCacheSize to 4096M in a 16GB RAM machine will prevent IntelliJ from booting up (It can at most be 2048M).
... View more