Member since: 05-22-2019
Posts: 70
Kudos Received: 24
Solutions: 8
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 1147 | 12-12-2018 09:05 PM |
|  | 1166 | 10-30-2018 06:48 PM |
|  | 1602 | 08-23-2018 11:17 PM |
|  | 7314 | 10-07-2016 07:54 PM |
|  | 1931 | 08-18-2016 05:55 PM |
02-12-2020
11:06 PM
Hi @vignesh_radhakr, you can access Hive from Python simply by creating a connection like this:

conn = hive.Connection(host="masterIP", port=10000, username="cdh123")

Note: pass the master node's IP (or hostname) along with the HiveServer2 port, which is 10000 by default.
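For reference, here is a slightly fuller sketch using the PyHive client; the host, username, and database values are placeholders to adapt to your cluster, not values from this thread:

from pyhive import hive

# Connect to HiveServer2 (placeholder host/credentials; adjust for your cluster)
conn = hive.Connection(host="masterIP", port=10000,
                       username="cdh123", database="default")

cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for (table_name,) in cursor.fetchall():
    print(table_name)

cursor.close()
conn.close()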
Thanks,
HadoopHelp
08-23-2018
11:17 PM
After some research and help, I found that I had incorrectly set the nifi.remote.input.host property. It should be set as follows in the Advanced nifi properties:

nifi.remote.input.host={{nifi_node_host}}

Each node should have a different value for nifi.remote.input.host; it is the hostname the current node advertises for site-to-site (S2S) communications, so each node should advertise its own address. If you set the same value on every node, they all advertise the same hostname and all data goes to the same host. You still have to set the other multi-threading parameters, such as "Maximum Timer Driven Thread Count" in controller settings and "Concurrent Tasks" in the appropriate processors, but this is what gets multiple nodes listening for the RPG requests.
12-31-2017
11:57 PM
Hive is very powerful, but sometimes you need to add some procedural code for a special circumstance such as complex parsing of a field. Hive provides the ability to easily create User Defined Table Functions (UDTFs). These allow you to transform your Hive results, pass them through the UDTF, and return data as a set of rows that can then be used like any other Hive result set. They can be written in Java or Python; we will use Python for this article, but the techniques apply to both with some syntax changes.

There are a lot of great articles on building these, such as https://community.hortonworks.com/articles/72414/how-to-create-a-custom-udf-for-hive-using-python.html. They pretty much work as advertised, but they don't get into how to debug or troubleshoot your code. And when you get into any significant logic (and sometimes not so significant!), you are likely to create a few bugs. This can be a problem in your parsing logic or, in the case of Python, even a syntax error in your code! So we are going to look at two techniques that can be used to debug these UDTFs.

The problem

When Hive encounters an error in the UDTF it simply blows up with a pretty confusing error. The underlying error is hidden, and you are left scratching your head. The error will probably look something like:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.

Or from a Yarn perspective:

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

NOT VERY HELPFUL!!!

Our test scenario

We need to parse json text stored in a column of a Hive table into columns. Hive serdes can handle most json formats, but there are still a few outlier situations where you need a more flexible json parser.

Quick UDTF Review

First, two important things to remember about Hive UDTFs:

- Hive sends data in to your UDTF through stdin as strings of column values separated by tabs
- Your UDTF sends rows back to Hive through stdout as column values separated by tabs

This means that if you are getting an error in your UDTF, you can't just print debug statements to stdout. Hive is expecting its output there and won't print them; instead, they would cause a format error.
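To make that contract concrete, here is a bare-bones skeleton (a minimal sketch, not the parser we build below): a UDTF is just a loop that reads tab-separated columns from stdin and writes tab-separated rows to stdout.

import sys

# Minimal skeleton of the Hive TRANSFORM contract:
# read tab-separated input columns from stdin, write tab-separated output rows to stdout.
for line in sys.stdin:
    columns = line.rstrip('\n').split('\t')
    # ... transform the input columns into one or more output rows here ...
    sys.stdout.write('\t'.join(columns) + '\n')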
The Table

CREATE EXTERNAL TABLE `default.events`(`json_text` string)
STORED AS TEXTFILE
LOCATION '/tmp/events';

The Data

{ "deviceId": "13a46b21-9528-4eb1-93bd-303a3b3e6b6a", "events": [ { "Process_Started": { "timestamp": "2017-06-01T18:26:24.444Z" } }, { "Process_Stopped": { "timestamp": "2017-06-01T18:26:24.444Z", "errorReason": "-1", "errorMsg": "The operation couldn’t be completed." } } ] }
{ "deviceId": "9cd57d50-4d0e-457e-9fd3-05b9e56644e6", "events": [ { "Process_Started": { "timestamp": "2017-06-02T00:20:20.400Z" } }, { "Process_Completed": { "timestamp": "2017-06-02T02:20:29.020" } } ] }

The Query

We will save this in select_json.hql:

DELETE FILE /home/<your id>/parse_events.py;
ADD FILE /home/<your id>/parse_events.py;
SELECT TRANSFORM (json_text)
USING 'python parse_events.py'
AS deviceId, eventType, eventTime, errorReason, errorMsg
FROM default.events;

The UDTF

#!/usr/bin/python
##################################################################################################
# Hive UDTF to parse json data
##################################################################################################
import sys
import json

reload(sys)
sys.setdefaultencoding('utf8')

def parse_json(json_string):
    j = json.loads(json_string)
    deviceId=j["deviceId"]
    events=j["events"]
    # Force a stupid error!
    x=1
    y=0
    z=x/y
    # Flatten Events Array
    for evt in events:
        try:
            eventType = evt.keys()[0]
            e = evt[eventType]
            edata = []
            edata.append(deviceId)
            edata.append(eventType)
            edata.append(e.get("timestamp",u''))
            edata.append(e.get("errorReason",u''))
            edata.append(e.get("errorMsg",u''))
            # Send a tab-separated string back to Hive
            print u'\t'.join(edata)
        except Exception as ex:
            sys.stderr.write('AN ERROR OCCURRED IN PYTHON UDTF\n %s\n' % ex.message)

def main(argv):
    # Parse each line sent from Hive (note we are only receiving 1 column, so no split needed)
    for line in sys.stdin:
        parse_json(line)

if __name__ == "__main__":
    main(sys.argv[1:])

Let's Run It!

Here's a hint: Python should throw an error, "ZeroDivisionError: integer division or modulo by zero". Assuming you have saved the query in select_json.hql, this would go something like this:

hive -f select_json.hql
...
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
Task with the most failures(4):
-----
Task ID:
task_1514310228021_3433_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:560)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:192)
... 8 more

Nothing about divide by zero anywhere. Ugh!

TECHNIQUE 1 - Forget Hive!

You are writing a Hive UDTF, but you are also just writing a program that reads from stdin and writes to stdout. So, it is a great idea to develop your logic completely outside of Hive; once you have adequately tested it, you can plug it in and continue development. The easiest way to do this, which also lets you test later with no changes, is to pull out the data that Hive would send your UDTF and feed it to stdin. Given our table, it could be done like this:

hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/events'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT json_text FROM default.events;"
-bash-4.1$ ls -l /tmp/events
total 4
-rw-r--r-- 1 screamingweasel screamingweasel 486 Dec 31 23:02 000000_0
cat /tmp/events/*
{ "deviceId": "13a46b21-9528-4eb1-93bd-303a3b3e6b6a", "events": [ { "Process_Started": { "timestamp": "2017-06-01T18:26:24.444Z" } }, { "Process_Stopped": { "timestamp": "2017-06-01T18:26:24.444Z", "errorReason": "-1", "errorMsg": "The operation couldn’t be completed." } } ] }
{ "deviceId": "9cd57d50-4d0e-457e-9fd3-05b9e56644e6", "events": [ { "Process_Started": { "timestamp": "2017-06-02T00:20:20.400Z" } }, { "Process_Completed": { "timestamp": "2017-06-02T02:20:29.020" } } ] }
# DO THE ACTUAL TEST (note there may be >1 file in the directory)
cat /tmp/events/* | python parse_events.py
Traceback (most recent call last):
File "parse_events.py", line 42, in <module>
main(sys.argv[1:])
File "parse_events.py", line 39, in main
parse_json(line)
File "parse_events.py", line 18, in parse_json
z=x/y
ZeroDivisionError: integer division or modulo by zero

Simple as that! Export the columns you will be passing to the UDTF to a tab-separated file and pipe it into your UDTF. This simulates Hive calling your UDTF, but doesn't bury any error messages. In addition, you can print whatever debug messages you like to stdout or stderr to help in debugging.

TECHNIQUE 2 - stderr is your friend!

As noted, Hive expects the results from the UDTF on stdout, but stderr is fair game for writing debug statements. This is pretty old-school debugging, but it's still effective: print out values and locations in your code to help determine where the error occurs or what your variables contain at certain points. For example, you might add the following to the UDTF script to help identify where the issue is happening:

sys.stderr.write("Before stupid error\n")
x=1
y=0
z=x/y
sys.stderr.write("After stupid error!\n")

The trick is to find these in the logs when running on a Yarn cluster. These scripts are set to use MapReduce, which makes it a little easier: find the Yarn job, drill down into one of the failed containers, and examine its stderr. Attached are some screen prints from the Yarn RM showing this process. Winner, winner, here are our debugging statements!
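If you use this pattern a lot, it can be handy to wrap it in a small helper so the debug output can be switched on and off. This is a minimal sketch (not part of the original article) that assumes a hypothetical UDTF_DEBUG environment variable:

import os
import sys

# Hypothetical switch: only write to stderr when UDTF_DEBUG=1 is set in the environment.
DEBUG = os.environ.get("UDTF_DEBUG") == "1"

def debug(msg):
    if DEBUG:
        sys.stderr.write("DEBUG: %s\n" % msg)

debug("Before stupid error")
x = 1
y = 0
z = x / y
debug("After stupid error")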
SUPPORTING FILES FOR THIS ARTICLE ARE AVAILABLE ON GITHUB AT https://github.com/screamingweasel/articles/tree/master/udtf_debugging
10-07-2017
05:52 PM
1 Kudo
Thanks! Very subtle difference, but obviously important to Spark! For everyone's reference, this tar command can be used to create a tar.gz with the jars in the root of the archive:

cd /usr/hdp/current/spark2-client/jars/
tar -zcvf /tmp/spark2-hdp-yarn-archive.tar.gz *
# List the files in the archive. Note that they are in the root!
tar -tvf /tmp/spark2-hdp-yarn-archive.tar.gz
-rw-r--r-- root/root 69409 2016-11-30 03:31 activation-1.1.1.jar
-rw-r--r-- root/root 445288 2016-11-30 03:31 antlr-2.7.7.jar
-rw-r--r-- root/root 302248 2016-11-30 03:31 antlr4-runtime-4.5.3.jar
-rw-r--r-- root/root 164368 2016-11-30 03:31 antlr-runtime-3.4.jar
...
# Then upload to hdfs, fix ownership and permissions if needed, and good to go!
09-30-2017
05:28 AM
Bridging the Process Time – Event Time gap with Hive (Part 1)

Synopsis

Reconciling the difference between event time and collection/processing time is critical to understand for any system that analyzes event data. This is important whether events are processed in batch or in near real-time streams. This post focuses on batch processing with Hive and demonstrates easily replicable mechanisms for bridging this gap. We will look at the issues surrounding this and present two repeatable solution patterns using Hive and Hive ACID. This first post will look at the issue and present the solution using Hive only, and the follow-up article will introduce Hive ACID and a solution using that technology.

Overview

One of the most common big data ingestion cases is event data, and as IoT becomes more important, so does this use case. This is one of the most common Hadoop use cases, but I have not found many detailed step-by-step patterns for implementing it. In addition, I think it is important to understand some of the thinking around events, and specifically the gap between event time and processing time. One of the key considerations in event analysis is the difference between data collection time (process time) and the time that the event occurred (event time). A more formal definition might be:

- Event Time – The time that the event occurred
- Processing Time – The time that the event was observed in the processing system

In an ideal world, these two times would be the same or very close. However, in the real world there is always some time lag or “skew”. This skew may be significant, and it exists whether you are processing events in batches or in near real-time. The skew can be caused by many different factors, including:

- Resource Limitations – Bandwidth, CPU, etc. may not allow events to be immediately forwarded and processed.
- Software Features/Limitations – Software may be intentionally programmed to queue events and send them at predetermined times. For example, cable TV boxes that report information once or twice a day, or fitness trackers that send some information, such as sleep data, only daily.
- Network Discontinuity – Any mobile application needs to plan for disruptions in Internet connectivity. Whether because of dead spots in wireless coverage, airplane mode, or dead batteries, these interruptions can happen regularly. To mitigate them, any good mobile app will queue event messages for sending the next time a connection is available, which may be minutes or months!

Time Windows
Much of the current interest is around near real-time ingestion of event data. There are many advantages to this, but a lot of use cases only require event data to be processed in larger windows. That is the focus of the remainder of this article. I was surprised to find a lack of posts about the mechanics of dealing with event skew and reporting by event time in batch systems, so I wanted to lay out some repeatable patterns that can be used for this. As you probably know, event streams are essentially unbounded streams of logs. We often deal with this as a series of bounded datasets, each representing some time period. Our main consideration here is a batched process that deals with large windows (15 min to 1 hour), but it applies down to any level, since we almost always analyze event data by time in the target system.

The Problems

There are two main issues in dealing with this—Completeness and Restatement.

Completeness—When event data can come in for some time past the end of a time window, it is very difficult to assess the completeness of the data. Most of the data may arrive within a period (hour or day) of the time window. However, data may continue to trickle in for quite some time afterwards. This presents issues of:

- Processing and combining data that arrives over time
- Determining a cutoff when data is considered complete

As we can see in this figure, most event data is received in the few windows after the event time. However, data continues to trickle in, and in fact, 100% theoretical completeness may never be achieved! So, if we were to report on the event data at day 3 and at day 7, the results would be very different.

Restatement—By this we mean the restatement of data that has arrived and been grouped by process time into our desired dimension of event time. This would not be an issue if we could simply scan through all the data each time we want to analyze it, but that becomes unworkable as the historical data grows. We need to find a way to process just the newly arrived data and combine it with the existing data.

Other Requirements

In addition to dealing with our two main issues, we want a solution that will:

- Be Scalable – Any solution must be able to scale to large volumes of data, particularly as event history grows over time. Any solution that relies on completely reprocessing the entire set of data will quickly become unworkable.
- Provide the ability to reprocess data – Restating event data by event time is pretty straightforward if everything goes right. However, if we determine that source data was corrupt or needs to be reloaded for any reason, things get messy. In that case, we potentially have data from multiple processing periods co-mingled for the same event time partition. So, to reprocess a process period, we need to separate out the rows for that process period and replace them, while leaving the other rows in the partition intact. Not always an easy task with HDFS!

As an aside, to reprocess data, you need to keep the source around for a while. Pretty obvious, but just saying!

Sample Use Case and Data

For an example use case we will use events arriving from a mobile device representing streaming video viewing events. For this use case, we will receive a set of files hourly and place them in a landing folder in HDFS with an external Hive table laid on top. The processing (collection) time is stamped into the filename using the format YYYYMMDDHH-nnnn.txt. This external table will contain one period’s data at a time and serves as an initial landing zone. We are also going to assume that we need to save this data in detail, and that analysis will be done directly on the detailed data. Thus, we need to restate the data by event time in the detail store.

Raw Input Source Format

Of particular interest is the event_time column, which is an ISO timestamp in the form YYYY-MM-DDTHH:MM:SS.sssZ.

CREATE EXTERNAL TABLE video_events_stg (
device_id string,
event_type string,
event_time string,
play_time_ms bigint,
buffer_time_ms bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/landing/video_events_stg';
https://raw.githubusercontent.com/screamingweasel/sample-data/master/schema/video_events_stg.hql

Detailed Table Format

CREATE TABLE video_events (
device_id string,
event_type string,
event_time string,
play_time_ms bigint,
buffer_time_ms bigint)
PARTITIONED BY (
event_year string,
event_month string,
event_day string,
event_hour string,
process_time string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/schema/video_events.hql

Sample Data

I have put together three files, each containing one hour of processing data. You can pull them from GitHub and load the first hour into hdfs.

mkdir -p /tmp/video
cd /tmp/video
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/video/2017011200-00001.txt
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/video/2017011201-00001.txt
wget https://raw.githubusercontent.com/screamingweasel/sample-data/master/video/2017011202-00001.txt
hadoop fs -rm -r /landing/video_events_stg
hadoop fs -mkdir -p /landing/video_events_stg
hadoop fs -put /tmp/video/2017011200-00001.txt /landing/video_events_stg/

Solutions

Let’s look at two possible solutions that meet our criteria above. The first utilizes Hive without the newer ACID features. The second post in this series details how to solve this using Hive ACID. Per our requirements, both will have to restate the data as it is ingested into the detailed Hive table and both must support reprocessing of data.

Solution 1

This solution uses pure Hive and does not rely on the newer ACID transaction feature. As noted, one hour’s worth of raw input may contain data from any number of event times. We want to reorganize this and store it in the detailed table partitioned by event time for easy reporting. This can be visualized as:

Loading Restatement

We are going to achieve this through Hive dynamic partitioning. Later versions of Hive (0.13+) support efficient dynamic partitioning that can accomplish this. Dynamic partitioning is, unfortunately, a bit slower than inserting into a static fixed partition. Our approach of incrementally ingesting should mitigate this, but you would need to benchmark it with your volume.

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
INSERT INTO TABLE default.video_events
PARTITION (event_year, event_month, event_day, event_hour, process_time)
SELECT device_id,event_type,
CAST(regexp_replace(regexp_replace(event_time,'Z',''),'T',' ') as timestamp) as event_time,
play_time_ms,
buffer_time_ms,
substr(event_time,1,4) AS event_year,
substr(event_time,6,2) AS event_month,
substr(event_time,9,2) AS event_day,
substr(event_time,12,2) AS event_hour,
substr(regexp_extract(input__file__name, '.*\/(.*)', 1),1,10) AS process_time
FROM default.video_events_stg;
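To make the string handling in that SELECT concrete, here is a small Python sketch (not part of the original article) showing how the same partition values are derived from an event_time value and the staging file name (format YYYYMMDDHH-nnnn.txt). The function name and example values are illustrative only:

import os

def derive_partition(event_time, input_file_name):
    # event_time looks like "2017-01-11T21:15:30.000Z"
    # input_file_name looks like ".../landing/video_events_stg/2017011200-00001.txt"
    return {
        "event_year":   event_time[0:4],    # substr(event_time,1,4)
        "event_month":  event_time[5:7],    # substr(event_time,6,2)
        "event_day":    event_time[8:10],   # substr(event_time,9,2)
        "event_hour":   event_time[11:13],  # substr(event_time,12,2)
        # first 10 chars of the file name = processing hour (YYYYMMDDHH)
        "process_time": os.path.basename(input_file_name)[0:10],
    }

print(derive_partition("2017-01-11T21:15:30.000Z",
                       "/landing/video_events_stg/2017011200-00001.txt"))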
You can see from a “show partitions” that three partitions were created, one for each event time period.

Show partitions default.video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
Now let’s process the rest of the data and see the results:

hadoop fs -rm -skipTrash /landing/video_events_stg/*
hadoop fs -put /tmp/video/2017011201-00001.txt /landing/video_events_stg/
hive -f video_events_insert.hql
hadoop fs -rm -skipTrash /landing/video_events_stg/*
hadoop fs -put /tmp/video/2017011202-00001.txt /landing/video_events_stg/
hive -f video_events_insert.hql

show partitions default.video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011201
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=01/process_time=2017011202

select count(*) from default.video_events;
3000

So, we can see that our new data is being nicely added by event time. Note that now there are multiple partitions for the event hour, each corresponding to a processing event. We will see how that is used in the next section.

Reprocessing

In order to reprocess input data for a specific process period, we need to be able to identify that data in the restated detail and remove it before reprocessing. The approach we are going to take here is to keep the process period as part of the partition scheme, so that those partitions can be easily identified. In this case, the partitioning would be:
Event Year
Event Month
Event Day
Event Hour
Process Timestamp (concatenated)

Ex.
year=2017/month=01/day=10/hour=01/process_date=2017011202
year=2017/month=01/day=12/hour=01/process_date=2017011202
year=2017/month=01/day=12/hour=02/process_date=2017011202

This makes it fairly simple to reprocess a period of source data:
1. List all the partitions of the table and identify the ones from the specific processing hour to be reprocessed (see the sketch after this list).
2. Manually drop those partitions.
3. Restore the input data and reprocess it as normal.
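As a rough illustration of step 1 (not from the original article), a small Python helper could pull the partition list from Hive and generate the DROP PARTITION statements for a given process_time. The table and partition layout follow the example above; treat the helper itself as an assumption to adapt:

import subprocess

def partitions_to_drop(table, process_time):
    # Ask Hive for the partition list (same format as the "show partitions" output above).
    out = subprocess.check_output(["hive", "-e", "show partitions %s;" % table])
    stmts = []
    for line in out.decode("utf-8").splitlines():
        line = line.strip()
        if not line or not line.endswith("process_time=" + process_time):
            continue
        # Turn "event_year=2017/event_month=01/..." into "event_year='2017',event_month='01',..."
        spec = ",".join("%s='%s'" % tuple(kv.split("=", 1)) for kv in line.split("/"))
        stmts.append("ALTER TABLE %s DROP PARTITION (%s);" % (table, spec))
    return stmts

for stmt in partitions_to_drop("default.video_events", "2017011201"):
    print(stmt)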
Let’s assume that the data for hour 2017-01-12 01 was incorrect and needs to be reprocessed. From the show partitions statement, we can see that there are three partitions containing data from that processing time.

event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011201
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011201

Let’s drop ‘em and see what we get:

ALTER TABLE default.video_events DROP PARTITION (event_year='2017',event_month='01',event_day='11',event_hour='22',process_time='2017011201');
ALTER TABLE default.video_events DROP PARTITION (event_year='2017',event_month='01',event_day='11',event_hour='23',process_time='2017011201');
ALTER TABLE default.video_events DROP PARTITION (event_year='2017',event_month='01',event_day='12',event_hour='00',process_time='2017011201');

show partitions video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=01/process_time=2017011202

select count(*) from default.video_events;
2000

Now, finally, let’s put that data back and reprocess it.

hadoop fs -rm -skipTrash /landing/video_events_stg/*
hadoop fs -put /tmp/video/2017011201-00001.txt /landing/video_events_stg/
hive -f video_events_insert.hql

show partitions default.video_events;
event_year=2017/event_month=01/event_day=11/event_hour=21/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=22/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011200
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011201
event_year=2017/event_month=01/event_day=11/event_hour=23/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011201
event_year=2017/event_month=01/event_day=12/event_hour=00/process_time=2017011202
event_year=2017/event_month=01/event_day=12/event_hour=01/process_time=2017011202

select count(*) from default.video_events;
3000

Comments on this Solution

One drawback of this solution is that you may end up with small files as events trickle in for older event times. For example, if you only get a handful of events that come in 4 weeks after the event time, you are going to get some very small files, indeed! Our next solution will overcome that issue by using Hive ACID.

Conclusion

When handling event data, we must always be aware of the skew between event time and processing time in order to provide accurate analytics. Our solution for restating the data in terms of event time must be scalable, performant, and allow for reprocessing of data. We looked at one solution using plain Hive and partitioning. In the next part of this series we will look at Hive ACID transactions to develop a more advanced and simpler solution.

Accompanying files can be found at: https://github.com/screamingweasel/articles/tree/master/event_processing_part_1
02-23-2017
04:46 PM
Yes, if you want to be more restrictive you could use the user hdfs or @hadoop to indicate any user in the hadoop group.
12-18-2016
07:42 AM
1 Kudo
Hi @jbarnett,

In order to run the HDFS Balancer, the new conf dfs.internal.nameservices, which distinguishes internal and remote clusters, needs to be set so that Balancer will use it to locate the local file system. Alternatively, Balancer and distcp need not share the same conf, since distcp may be used for multiple remote clusters. When adding a new remote cluster, we need to add it to the distcp conf; however, it does not make sense to change the Balancer conf. If we are going to use a separate conf for Balancer, we may put only one file system (i.e. the local fs but not the remote fs) in dfs.nameservices.

In summary, there are two ways to fix the conf:
1. Set all the local and the remote file systems in dfs.nameservices and then set the local file system in dfs.internal.nameservices. This conf will work for both distcp and Balancer.
2. Set only the local file system in dfs.nameservices in the Balancer conf, and use a different conf for distcp.

Hope it helps.
09-06-2017
04:19 PM
"as per above my understanding is any user needs to have full permissions on the system tables while connecting to sqlline for the first time and then just granting read access on the system tables should help him re-establish the session." -- correct. "Also can you please point me to document that can provide information around restricting access via Ranger for Phoenix." -- I'd suggest you ask a new question for help on using Ranger. I am not familiar with the project.
05-16-2018
08:36 PM
So I identified that this is a bug which they will fix only in HDP 3. It looks like there is a workaround that worked on HDP 2.6.0, but it stopped working in 2.6.1. I upgraded my stack to 2.6.2 and it works fine now.
05-01-2018
07:09 AM
I'm having this same problem. I recently moved our cluster to Ubuntu; when using the previous CentOS install it was working fine. I have tried the case conversion options with no luck. I can, however, access everything if I add the user to Ranger instead of the group.