Member since
05-22-2019
70
Posts
24
Kudos Received
8
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1157 | 12-12-2018 09:05 PM | |
1178 | 10-30-2018 06:48 PM | |
1640 | 08-23-2018 11:17 PM | |
7359 | 10-07-2016 07:54 PM | |
1970 | 08-18-2016 05:55 PM |
12-31-2017
11:57 PM
Hive
is very powerful, but sometimes you need to add some procedural code for a
special circumstance such as complex parsing of a field. Hive provides the
ability to easily create User Defined Table Functions (UDTFs). These allow you
to transform your Hive results, pass them through the UDTF and return data as a
set of rows that can then be used like any other Hive result set. These can be
written in Java or Python, and we will use Python for this article. However,
the techniques here are applicable to both with some syntax changes. There
are a lot of great articles on building these such as https://community.hortonworks.com/articles/72414/how-to-create-a-custom-udf-for-hive-using-python.html. These
pretty much work as advertised, but don’t get into how to debug or troubleshoot
your code. However, when you get into any significant logic (and sometimes not
so significant!), you are likely to create a few bugs. These can be errors in your
parsing logic or, in the case of Python, even syntax errors in your code! So
we are going to look at two techniques that can be used to debug these UDTFs.

The Problem

When Hive encounters an error in the UDTF, it simply blows up with a pretty confusing
error. The underlying error is hidden, and you are left scratching your head.
The error will probably look something like:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.

Or from a Yarn perspective:

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

NOT VERY HELPFUL!!!

Our Test Scenario

We
need to parse json text stored in a column of a Hive table into columns. Hive
SerDes can handle most json formats, but there are still a few outlier
situations where you need a more flexible json parser.

Quick UDTF Review

First, two important things to remember about Hive UDTFs:

- Hive sends data in to your UDTF through stdin as strings of column values separated by tabs
- Your UDTF sends rows back to Hive through stdout as column values separated by tabs

This means that if you are getting an error in your UDTF, you can't just print debug statements to stdout. Hive is expecting its output there and won't print them; instead, they would cause a format error.

The Table

CREATE EXTERNAL TABLE `default.events`(`json_text` string)
STORED AS TEXTFILE
LOCATION '/tmp/events';

The Data

{ "deviceId": "13a46b21-9528-4eb1-93bd-303a3b3e6b6a", "events": [ { "Process_Started": { "timestamp": "2017-06-01T18:26:24.444Z" } }, { "Process_Stopped": { "timestamp": "2017-06-01T18:26:24.444Z", "errorReason": "-1", "errorMsg": "The operation couldn’t be completed." } } ] }
{ "deviceId": "9cd57d50-4d0e-457e-9fd3-05b9e56644e6", "events": [ { "Process_Started": { "timestamp": "2017-06-02T00:20:20.400Z" } }, { "Process_Completed": { "timestamp": "2017-06-02T02:20:29.020" } } ] }

The Query

We will save this in select_json.hql:

DELETE FILE /home/<your id>/parse_events.py;
ADD FILE /home/<your id>/parse_events.py;
SELECT TRANSFORM (json_text)
USING 'python parse_events.py'
AS deviceId, eventType, eventTime, errorReason, errorMsg
FROM default.events;

The UDTF

#!/usr/bin/python
##################################################################################################
# Hive UDTF to parse json data
##################################################################################################
import sys
import json
reload(sys)
sys.setdefaultencoding('utf8')
def parse_json(json_string):
    j = json.loads(json_string)
    deviceId = j["deviceId"]
    events = j["events"]
    # Force a stupid error!
    x = 1
    y = 0
    z = x / y
    # Flatten Events Array
    for evt in events:
        try:
            eventType = evt.keys()[0]
            e = evt[eventType]
            edata = []
            # deviceId first, to match the AS clause in select_json.hql
            edata.append(deviceId)
            edata.append(eventType)
            edata.append(e.get("timestamp", u''))
            edata.append(e.get("errorReason", u''))
            edata.append(e.get("errorMsg", u''))
            # Send a tab-separated string back to Hive
            print u'\t'.join(edata)
        except Exception as ex:
            sys.stderr.write('AN ERROR OCCURRED IN PYTHON UDTF\n %s\n' % ex.message)

def main(argv):
    # Parse each line sent from Hive (note we are only receiving 1 column, so no split needed)
    for line in sys.stdin:
        parse_json(line)

if __name__ == "__main__":
    main(sys.argv[1:])

Let's Run It!

Here's a hint: Python should throw an error "ZeroDivisionError: integer division or modulo by zero". Assuming you have saved the query in select_json.hql, this would go something like this:

hive -f select_json.hql
...
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
Task with the most failures(4):
-----
Task ID:
task_1514310228021_3433_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:560)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:192)
... 8 more

Nothing about divide by zero anywhere. Ugh!

TECHNIQUE 1 - Forget Hive!

You are writing a Hive UDTF, but you are also just writing a program that reads from stdin and writes to stdout. So, it is a great idea to develop your logic completely outside of Hive; once you have adequately tested it, you can plug it in and continue development. The easiest way to do this, which also allows you to test later with no changes, is to pull out the data that Hive would send your UDTF and feed it to stdin. Given our table, it could be done like this:

hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/events'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT json_text FROM default.events;"
-bash-4.1$ ls -l /tmp/events
total 4
-rw-r--r-- 1 screamingweasel screamingweasel 486 Dec 31 23:02 000000_0
cat /tmp/events/*
{ "deviceId": "13a46b21-9528-4eb1-93bd-303a3b3e6b6a", "events": [ { "Process_Started": { "timestamp": "2017-06-01T18:26:24.444Z" } }, { "Process_Stopped": { "timestamp": "2017-06-01T18:26:24.444Z", "errorReason": "-1", "errorMsg": "The operation couldn’t be completed." } } ] }
{ "deviceId": "9cd57d50-4d0e-457e-9fd3-05b9e56644e6", "events": [ { "Process_Started": { "timestamp": "2017-06-02T00:20:20.400Z" } }, { "Process_Completed": { "timestamp": "2017-06-02T02:20:29.020" } } ] }
# DO THE ACTUAL TEST (note there may be >1 file in the directory)
cat /tmp/events/* | python parse_events.py
Traceback (most recent call last):
File "parse_events.py", line 42, in <module>
main(sys.argv[1:])
File "parse_events.py", line 39, in main
parse_json(line)
File "parse_events.py", line 18, in parse_json
z=x/y
ZeroDivisionError: integer division or modulo by zero

Simple as that! Export the columns you will be passing to the UDTF to a tab-separated file and pipe it into your UDTF. This simulates Hive calling your UDTF, but doesn't bury any error messages. In addition, you can print whatever debug messages you like to stdout or stderr to help in debugging.

TECHNIQUE 2 - stderr is your friend!

As noted, Hive expects the results from the UDTF on stdout. The stderr stream, however, is fair game for writing debug statements. This is pretty old school debugging, but it's still effective. Print out values and locations in your code to help you determine where the error occurs or what values are in variables at certain times. For example, you might add the following to the UDTF script to help identify where the issue is happening:

sys.stderr.write("Before stupid error\n")
x=1
y=0
z=x/y
sys.stderr.write("After stupid error!\n")

The trick is to find these in the logs when running on a Yarn cluster. These scripts are set to use MapReduce, which makes it a little easier, but basically you find the Yarn job, drill down on one of the failed containers, and examine its stderr. Attached are some screen prints from the Yarn RM showing this process. Winner, winner, here are our debugging statements!

SUPPORTING FILES FOR THIS ARTICLE ARE AVAILABLE ON GITHUB AT https://github.com/screamingweasel/articles/tree/master/udtf_debugging
... View more
11-16-2017
08:24 PM
First of all, Nic Swart's comment is VERY IMPORTANT! Thanks much. Without the extra parentheses it just doesn't work (and it doesn't bother to tell you that the syntax of the LDAP query is wrong, grrr!). Second, to be clear, the ldapRealm.xxx settings are recommended for use with Active Directory starting with Zeppelin 0.7. This can be a little confusing.
... View more
10-25-2017
05:03 PM
One thing I wish I had known when starting with Python UDFs is that you can write to stderr to assist in debugging. Then look in the Yarn RM for the logs.

import sys

sys.stderr.write('>>>> Read a line \n' + line + '\n')
... View more
10-07-2017
05:52 PM
1 Kudo
Thanks! Very subtle difference, but obviously important to Spark! For everyone's reference, this tar command can be used to create a tar.gz with the jars in the root of the archive:

cd /usr/hdp/current/spark2-client/jars/
tar -zcvf /tmp/spark2-hdp-yarn-archive.tar.gz *
# List the files in the archive. Note that they are in the root!
tar -tvf /tmp/spark2-hdp-yarn-archive.tar.gz
-rw-r--r-- root/root 69409 2016-11-30 03:31 activation-1.1.1.jar
-rw-r--r-- root/root 445288 2016-11-30 03:31 antlr-2.7.7.jar
-rw-r--r-- root/root 302248 2016-11-30 03:31 antlr4-runtime-4.5.3.jar
-rw-r--r-- root/root 164368 2016-11-30 03:31 antlr-runtime-3.4.jar
...
# Then upload to hdfs, fix ownership and permissions if needed, and good to go!
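# For example (a sketch only, not verified on your cluster): the HDFS path below is the
# stock HDP 2.5.3 location from the question, and the ownership/permissions shown are
# just a common convention, so adjust for your environment.
hadoop fs -put -f /tmp/spark2-hdp-yarn-archive.tar.gz /hdp/apps/2.5.3.0-37/spark2/
hadoop fs -chown hdfs:hadoop /hdp/apps/2.5.3.0-37/spark2/spark2-hdp-yarn-archive.tar.gz
hadoop fs -chmod 444 /hdp/apps/2.5.3.0-37/spark2/spark2-hdp-yarn-archive.tar.gz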
... View more
10-04-2017
10:25 PM
I was getting a zero-length error on /usr/hdp/apps/spark2/spark2-hdp-yarn-archive.tar.gz, which is documented as an issue
after some upgrades. So I created and uploaded the file to hdfs using the following commands:

tar -zcvf spark2-hdp-yarn-archive.tar.gz /usr/hdp/current/spark2-client/jars/*
hadoop fs -put spark2-hdp-yarn-archive.tar.gz /hdp/apps/2.5.3.0-37/spark2/

Now when running any spark job in yarn (say the example pi app), I get the following error:

Error: 'Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster'

Other info:
This is HDP 2.5.3, running Spark 2.1
Upgraded from HDP 2.2.8 -> 2.4.3 -> 2.5.3
I believe the missing class is in spark/lib/spark-hdp-assembly.jar, but this does not exist.

HERE'S THE WEIRD PART - If I completely remove the spark2-hdp-yarn-archive.tar.gz from HDFS, then Spark jobs start to run again!

So, here are the questions:
Is this file (spark2-hdp-yarn-archive.tar.gz) needed?
If so, any direction on correcting this error?

Thanks in advance!
... View more
Labels:
Apache Spark
09-30-2017
05:28 AM
Bridging the Process Time – Event Time gap with Hive (Part 1)

Synopsis

Reconciling the difference between event time and
collection/processing time is critical to understand for any system that
analyses event data. This is important whether events are processed in batch or
near real-time streams. This post focuses on batch processing with Hive, and
demonstrates easily replicable mechanisms for bridging this gap. We will look at the issues surrounding this and present two repeatable
solution patterns using Hive and Hive ACID. This first post will look at the
issue and present the solution using Hive only, and the follow-up article will
introduce Hive ACID and a solution using that technology.

Overview

One of the most common big data
ingestion cases is event data, and as IoT becomes more important, so does this
use case. This is one of the most common
Hadoop use cases, but I have not found many detailed step by step patterns for
implementing it. In addition, I think it is important to understand some of the
thinking around events, and specifically, the gap between event time and
processing times. One of the key considerations in
event analysis is the difference between data collection time (process time)
and the time that the event occurred (event time). A more formal definition might be:

Event Time – The time that the event occurred
Processing Time – The time that the event was observed in the processing system

In an ideal world, these two times would be the same or very
close. However, in the real world there is always some time lag or "skew". This skew may be significant, and it exists whether you are processing events in batches or in near real-time. The skew can be caused by many different factors, including:

Resource Limitations – Bandwidth, CPU, etc. may not allow events to be immediately forwarded and processed.
Software Features/Limitations – Software may be intentionally programmed to queue events and send them at predetermined times. For example, cable TV boxes that report information once or twice a day, or fitness trackers that send some information, such as sleep data, only daily.
Network Discontinuity – Any mobile application needs to plan for disruptions in Internet connectivity. Whether because of dead-spots in wireless coverage, airplane-mode, or dead batteries, these interruptions can happen regularly. To mitigate these, any good mobile app will queue event messages for sending the next time that a connection is available, which may be minutes or months!

Time Windows

Much of the current interest is around near real-time
ingestion of event data. There are many advantages to this, but a lot of use
cases only require event data to be processed in larger windows of data. That
is the focus of the remainder of this article. I was surprised to find a lack of posts about the mechanics
of dealing with event skew and reporting by event time in batch systems. So, I
wanted to lay out some repeatable patterns that can be used for this. As you probably know, an event stream is an essentially unbounded
stream of logs. We often deal with this as a series of bounded datasets, each
representing some time period. Our main consideration here is a batched process
that deals with large windows (15 min to 1 hour), but applies down to any
level, since we almost always analyze event data by time in the target system.

The Problems

There are two main issues in dealing with this: Completeness and Restatement.

Completeness – When event data can come in for some time past the end of a time window, it is very difficult to assess the completeness of the data. Most of the data may arrive within a period (hour or day) of the time window. However, data may continue to trickle in for quite some time afterwards. This presents issues of:

Processing and combining data that arrives over time
Determining a cutoff when data is considered complete

As we can see in this figure, most event data is received in
the few windows after the event time. However, data continues to trickle in,
and in fact, 100% theoretical completeness may never be achieved! So, if we
were to report on the event data at day 3 and at day 7 the results would be
very different. Restatement—By
this we mean the restatement of data that has arrived and been grouped by
process time into our desired dimension of event time. This would not be an
issue if we could simply scan through all the data each time we want to analyze
it, but this becomes unworkable as the historical data grows. We need to find a
way to process just the newly arrived data and combine it with the existing
data.

Other Requirements

In addition to dealing with our two main issues, we want a solution that will:

Be Scalable – Any solution must be able to scale to large volumes
of data, particularly as event history grows over time. Any solution that
relies on completely reprocessing the entire set of data will quickly become
unworkable.
Provide the ability to reprocess data – Restating event data by Event
Time is pretty straightforward if everything goes right. However, if we
determine that source data was corrupt or needs to be reloaded for any reason,
things get messy. In that case, we potentially have data from multiple
processing periods co-mingled for the same event time partition. So, to
reprocess a process period, we need to separate out those rows for the process
period and replace them, while leaving the other rows in the partition intact.
Not always an easy task with HDFS! As
an aside, to reprocess data, you need to keep the source around for a while. Pretty
obvious, but just saying!

Sample Use Case and Data

For an example use case, we will use events arriving from a
mobile device representing streaming video viewing events. For this use case,
we will receive a set of files hourly and place them in a landing folder in
HDFS with an external Hive table laid on top. The processing (collection) time
is stamped into the filename using the format YYYYMMDDHH-nnnn.txt. This
external table will contain one period’s data at a time and serves as an
initial landing zone. We are also going to assume that we need to save this data
in detail, and that analysis will be done directly on the detailed data. Thus,
we need to restate the data by event time in the detail store.

Raw Input Source Format

Of particular interest is the event_time column, which is an ISO timestamp in the form YYYY-MM-DDTHH:MM:SS.sssZ.

CREATE EXTERNAL TABLE video_events_stg (
device_id string,
event_type string,
event_time string,
play_time_ms bigint,
buffer_time_ms bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/landing/video_events_stg';
https://raw.githubusercontent.com/screamingweasel/sample-data/master/schema/video_events_stg.hql

Detailed Table Format

CREATE TABLE video_events (
device_id string,
event_type string,
event_time string,
play_time_ms bigint,
buffer_time_ms bigint)
PARTITIONED BY (
event_year string,
event_month string,
event_day string,
event_hour string,
process_time string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/schema/video_events.hql

Sample Data

I have put together three files, each containing one hour of
processing data. You can pull them from GitHub and load the first hour into
hdfs.

mkdir -p /tmp/video
cd /tmp/video
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/video/2017011200-00001.txt
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/video/2017011201-00001.txt
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/video/2017011202-00001.txt
hadoop fs -rm -r /landing/video_events_stg
hadoop fs -mkdir -p /landing/video_events_stg
hadoop fs -put /tmp/video/2017011200-00001.txt /landing/video_events_stg/

Solutions

Let's look at two possible solutions that meet our criteria
above. The first utilizes Hive without the newer ACID features. The second post
in this series details how to solve this using Hive ACID. Per our requirements,
both will have to restate the data as it is ingested into the detailed Hive
table and both must support reprocessing of data.

Solution 1

This solution uses pure Hive and does not rely on the newer
ACID transaction feature. As noted one hour’s worth of raw input may contain
data from any number of event times. We want to reorganize this and store it in
the detailed table partitioned by event time for easy reporting. This can be
visualized as:

[diagram: Loading / Restatement]

We are going to achieve this through Hive Dynamic
Partitioning. Later versions of Hive (0.13+) support efficient dynamic partitioning
that can accomplish this. Dynamic partitioning is, unfortunately, a bit slower
than inserting to a static fixed partition. Our approach of incrementally
ingesting should mitigate this, but you would need to benchmark this with your
volume.

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
INSERT INTO TABLE default.video_events
PARTITION (event_year, event_month, event_day, event_hour, process_time)
SELECT device_id,
       event_type,
       CAST(regexp_replace(regexp_replace(event_time,'Z',''),'T',' ') AS timestamp) AS event_time,
       play_time_ms,
       buffer_time_ms,
       substr(event_time,1,4) AS event_year,
       substr(event_time,6,2) AS event_month,
       substr(event_time,9,2) AS event_day,
       substr(event_time,12,2) AS event_hour,
       substr(regexp_extract(input__file__name, '.*\/(.*)', 1),1,10) AS process_time
FROM default.video_events_stg;

You can see from a "show partitions" that three partitions
were created, one for each event time period.

show partitions default.video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
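Since the detail table is now partitioned by event time, reporting by event hour is a straightforward aggregation over the partition columns, no matter which processing run delivered the rows. This is a hedged sketch using only the table and columns defined above:

SELECT event_year, event_month, event_day, event_hour, count(*) AS events
FROM default.video_events
GROUP BY event_year, event_month, event_day, event_hour
ORDER BY event_year, event_month, event_day, event_hour;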
Now let’s process the rest of the data and see the results:
hadoop fs -rm -skipTrash /landing/video_events_stg/*
hadoop fs -put /tmp/video/2017011201-00001.txt /landing/video_events_stg/
hive -f video_events_insert.hql
hadoop fs -rm -skipTrash /landing/video_events_stg/*
hadoop fs -put /tmp/video/2017011202-00001.txt /landing/video_events_stg/
hive -f video_events_insert.hql

show partitions default.video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011201
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=01/process_time=2017011202

select count(*) from default.video_events;
3000

So, we can see that our new data is being nicely added by
event time. Note that now there are multiple partitions for the event hour,
each corresponding to a processing event. We will see how that is used in the
next section.

Reprocessing

In order to reprocess input data for a specific process
period, we need to be able to identify that data in the restated detail and
remove it before reprocessing. The approach we are going to take here is to
keep the process period as part of the partition scheme, so that those partitions
can be easily identified. In this case, the partitioning would be:
Event Year
Event Month
Event Day
Event Hour
Process Timestamp (concatenated)

Ex.
year=2017/month=01/day=10/hour=01/process_date=2017011202
year=2017/month=01/day=12/hour=01/process_date=2017011202
year=2017/month=01/day=12/hour=02/process_date=2017011202

This makes it fairly simple to reprocess a period of source
data.
1. List all the partitions of the table and identify ones from the specific processing hour to be reprocessed.
2. Manually drop those partitions.
3. Restore the input data and reprocess it as normal.

Let's assume that the data for hour 2017-01-12 01 was
incorrect and needs to be reprocessed. From the show partitions statement, we can see
that there are three partitions containing data from that processing time.
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011201
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011201

Let's drop 'em and see what we get:

ALTER TABLE default.video_events DROP PARTITION (event_year='2017',event_month='01',event_day='11',event_hour='22',process_time='2017011201');
ALTER TABLE default.video_events DROP PARTITION (event_year='2017',event_month='01',event_day='11',event_hour='23',process_time='2017011201');
ALTER TABLE default.video_events DROP PARTITION (event_year='2017',event_month='01',event_day='12',event_hour='00',process_time='2017011201');

show partitions video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=01/process_time=2017011202

select count(*) from default.video_events;
2000

Now, finally, let's put that data back and reprocess it.

hadoop fs -rm -skipTrash /landing/video_events_stg/*
hadoop fs -put /tmp/video/2017011201-00001.txt /landing/video_events_stg/
hive -f video_events_insert.hql

show partitions default.video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011201
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=01/process_time=2017011202

select count(*) from default.video_events;
3000

Comments on this Solution

One drawback of this solution is that you may end up with
small files as events trickle in for older event times. For example, if you only
get a handful of events that come in 4 weeks after the event time, you are
going to get some very small files, indeed! Our next solution will overcome
that issue by using Hive ACID.

Conclusion

When handling event data, we must always be aware of the
skew between event time and processing time in order to provide accurate
analytics. Our solution to restating the
data in terms of event time must be scalable, performant, and allow for
reprocessing of data. We looked at one solution using plain Hive and partitioning.
In the next of this series we will look at Hive ACID transactions to develop a
more advanced and simpler solution.

Accompanying files can be found at: https://github.com/screamingweasel/articles/tree/master/event_processing_part_1
... View more
02-23-2017
04:46 PM
Yes, if you want to be more restrictive you could use the user hdfs or @hadoop to indicate any user in the hadoop group.
... View more
02-17-2017
09:23 PM
1 Kudo
Centralized Cache Management in HDFS is a mechanism that
explicitly caches specific files or directories in memory for improved
performance. This is useful for relatively small files that are accessed
repeatedly. For example, reference/lookup tables or fact tables that are used
in many joins. Once enabled, HDFS will automatically cache selected files, and
periodically check for changes and recache the files. While HDFS and the underlying file system do some caching of
files when memory is available, explicit caching using Centralized Cache
Management prevents the data from being evicted from memory when processes
consume all of the physical memory. As a corollary of this, if you ARE working
on a lightly loaded system where there is free memory, you may not see any
performance improvement from this method, as the data was already in disk
cache. So, your performance testing needs to stress the system. Let's look at some key terms and concepts:

Cache Pools

A cache pool is an administrative entity
used to manage groups of cache directives. One of the key attributes of the
pool is the maximum number of bytes that can be cached for all directives in
this pool. Cache pools can be managed from the command line using the
hdfs cacheadmin utility. Some common commands include:

hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]
hdfs cacheadmin -removePool <name>
hdfs cacheadmin -listPools [-stats] [<name>]

Cache Directives

A cache directive defines a path that
should be cached. This can be either a specific file or a single directory.
Note that directives are not recursive—They apply to a single directory only,
not any sub-directories. So, they would usually be applied to the lowest level
directory that contains the actual data files. Cache directives can be managed from the command line using
the hdfs cacheadmin utility. Some common commands include:

hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
hdfs cacheadmin -removeDirective <id>
hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]

HDFS Configuration Settings

There is really only
one Hadoop configuration setting that is required to turn on Centralized Caching.
There are a few others to control the frequency that caching looks for new
files, which you can usually leave at default. The following, which is added to
the custom hdfs-site.xml, specifies the maximum number of bytes that can be
cached on each datanode:

dfs.datanode.max.locked.memory

Remember
that this value is in bytes, in contrast with the OS limits which are set in
KB.

OS Limits

Before you implement Centralized Caching, you need to ensure
that the locked memory setting on each of the datanodes is set to a value equal
to or greater than the memory specified in the HDFS dfs.datanode.max.locked.memory setting. On
each datanode run the following to determine the limit for locked memory. This
will return a value in KB or "unlimited".

ulimit -l

To set this, you need to add the following to /etc/security/limits.conf.
This is for Centos/Red Hat and may be different if you are using another Linux
distro. The setting will take effect when you log out of a terminal session and
log back in. * hard memlock 1048576
* soft memlock 1048576

How to Implement

Let's walk through an example.

1. Set memlock limits on each datanode. This will take effect after you log out and log in again.

# On each datanode (max cacheable memory in KB), example for 1.0 GB
echo "* hard memlock 1048576" >> /etc/security/limits.conf
echo "* soft memlock 1048576" >> /etc/security/limits.conf
2. Create a folder to be cached.

hadoop fs -mkdir -p /cache/usstates
hadoop fs -chmod -R 777 /cache
3. Create a Cache Pool.

hdfs cacheadmin -addPool testPool -mode 0777 -maxTtl never
4. Create one or more Cache Directives.

hdfs cacheadmin -addDirective -path /cache/usstates -pool testPool -replication 3 -ttl never
5. Change HDFS configurations to support caching. Add the following to the HDFS configs (Custom hdfs-site.xml in Ambari). This example is for 0.5 GB (in bytes):

dfs.datanode.max.locked.memory=536870912
6. Restart HDFS using Ambari.

7. Get some test data and load it into the cached directory.

wget http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv
# Strip the column headers and double quotes
tail -n +2 states.csv > usstates.csv
sed -i 's/\"//g' usstates.csv
hadoop fs -put -f usstates.csv /cache/usstates
8. Look for cached data. You should see a value in BYTES_CACHED and
FILES_CACHED.

hdfs cacheadmin -listDirectives -stats
Found 1 entry
ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED
1 testPool 3 never /cache/usstates 2550 2550 1 1
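You can also sanity-check caching at the datanode level. This is a hedged check; on recent Hadoop 2.x releases the dfsadmin report includes per-datanode cache capacity and usage fields:

hdfs dfsadmin -report | grep -i cache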
9. Query the data with Hive (optional).

CREATE EXTERNAL TABLE states (state_name string, state_cd string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/cache/usstates';
select * from states;

After that you can update the file (add some dummy states)
and re-upload it to Hadoop, verify that the changes are picked up, add
additional folders, files, etc. and generally experiment. You can performance
test on your particular system, remembering that you may not see much
difference unless the memory is used forcing the normal disk cache to be
evicted from memory.

References:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_hdfs-administration/content/ch03.html
... View more
12-16-2016
12:53 AM
1 Kudo
We have two Clusters, H6 and H7 that have been linked for HA
distcp as well as HBase sharing as described in http://henning.kropponline.de/2015/03/15/distcp-two-ha-cluster/. When running an hdfs rebalancer command on one of the
clusters, it attempts to run the command ON BOTH CLUSTERS.
Is this expected behavior? If this is expected, will the rebalancer
command be run separately, as if we were executing on two separate clusters? Specifically, we want to make sure that it is
not going to try to balance ACROSS the clusters and mix up data blocks. Is there any way to prevent this behavior? We
are thinking of a custom hdfs-site.xml with only one cluster's info in the path
when running the rebalancer. Thoughts?

Here is the configuration, assuming two clusters named h6 and h7.

Component | H6 | H7
Name Service | nn-h6 | nn-h7
HA Namenode Names | nn1, nn2 | nn1, nn2

On H6 the following have been added to the hdfs-site.xml file:

dfs.nameservices=nn-h6,nn-h7
dfs.ha.namenodes.nn-h7=nn1,nn2
dfs.client.failover.proxy.provider.nn-h7=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.namenode.http-address.nn-h7.nn1=hdpnn-h7-example.com:50070
dfs.namenode.http-address.nn-h7.nn2=hdpnn-h7-awsw02.example.com:50070
dfs.namenode.https-address.nn-h7.nn1=hdpnn-h7-example.com:50470
dfs.namenode.https-address.nn-h7.nn2=hdpnn-h7-awsw02.example.com:50470
dfs.namenode.rpc-address.nn-h7.nn1=hdpnn-h7-example.com:8020
dfs.namenode.rpc-address.nn-h7.nn2=hdpnn-h7-awsw02.example.com:8020

On H7 the following have been added to the hdfs-site.xml file:

dfs.nameservices=nn-h7,nn-h6
dfs.ha.namenodes.nn-h6=nn1,nn2
dfs.client.failover.proxy.provider.nn-h6=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.namenode.http-address.nn-h6.nn1=hdpnn-h6-example.com:50070
dfs.namenode.http-address.nn-h6.nn2=hdpnn-h6-awsw02.example.com:50070
dfs.namenode.https-address.nn-h6.nn1=hdpnn-h6-example.com:50470
dfs.namenode.https-address.nn-h6.nn2=hdpnn-h6-awsw02.example.com:50470
dfs.namenode.rpc-address.nn-h6.nn1=hdpnn-h6-example.com:8020
dfs.namenode.rpc-address.nn-h6.nn2=hdpnn-h6-awsw02.example.com:8020
... View more
Labels:
Apache Hadoop