Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Views | Posted |
---|---|
5181 | 09-21-2018 09:54 PM |
6594 | 03-31-2018 03:59 AM |
2001 | 03-31-2018 03:55 AM |
2207 | 03-31-2018 03:31 AM |
4908 | 03-27-2018 03:46 PM |
09-20-2016
04:14 PM
Do you mean 50Mbps per mapper or for the cluster as a whole? (I assume you mean the former, as the latter would imply almost two days to read a TB of S3 data.) Assuming you do mean 50Mbps per mapper, what is the limit on S3 throughput to the whole cluster? That's the key information. Do you have a ballpark number for this?
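The "almost two days" figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes 1 TB = 10**12 bytes and 1 Mbps = 10**6 bits per second, and the mapper count of 20 is a made-up number for illustration.

```python
# Back-of-the-envelope transfer time at a fixed, sustained rate.
# Assumptions (not from the thread): 1 TB = 10**12 bytes, 1 Mbps = 10**6 bit/s.

def transfer_days(size_bytes: float, rate_mbps: float) -> float:
    """Days needed to move size_bytes at rate_mbps megabits per second."""
    seconds = (size_bytes * 8) / (rate_mbps * 1_000_000)
    return seconds / 86_400


ONE_TB = 10**12

# Case 1: 50 Mbps is the limit for the whole cluster.
cluster_days = transfer_days(ONE_TB, 50)

# Case 2: 50 Mbps per mapper, with a hypothetical 20 mappers reading in parallel
# and no shared bottleneck between them.
parallel_hours = transfer_days(ONE_TB, 50 * 20) * 24

print(f"50 Mbps for the whole cluster: {cluster_days:.2f} days")
print(f"20 mappers at 50 Mbps each:    {parallel_hours:.2f} hours")
```

The second case only holds if the aggregate S3 throughput to the cluster is not itself capped, which is why that cap is the number to ask for.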
12-06-2016
01:17 PM
Hi Diego, could you please explain exactly how you got your issue resolved? I am facing the same issue, and I am using my personal network (not my company's).
07-17-2018
08:31 AM
@mb I am facing the same issue. Could you please advise how to work around or troubleshoot this problem? Thanks, Nam
09-01-2016
06:32 PM
@deepak sharma Crazy enough, I just reached out to this customer and a simple restart of the Kafka service addressed the issue. Kerberos was enabled recently and this service was probably not restarted afterwards. Not much to learn. The symlink suggestion from you is an interesting approach which, while not applicable here, is worth remembering for other situations. Thank you for the suggestion.
08-26-2016
05:14 PM
That did the trick! Thanks @Constantin Stanca!
08-26-2016
07:03 PM
Not sure why, but when a user "x" was created in IPA, there was an entry for x under users and also under groups. This could have led to ambiguity when the search tried to locate the right user "x" (arun in my case). To resolve the ambiguity, I thought of referring to users by their uid rather than the default cn, which could conflict.
08-25-2016
12:24 AM
5 Kudos
@suresh krish The answer from Santhosh B Gowda could be helpful, but it is brute force, with a 50-50 chance of success. You need to understand the query execution plan: how much data is processed and how many tasks execute the job. Each task has a container allocated. You could increase the RAM allocated to the container, but if a single task performs the map and the data is larger than the container's allocated memory, you will still see "Out of memory". What you have to do is understand how much data is processed and how to chunk it for parallelism. Increasing the size of the container is not always needed. It is almost like saying that instead of tuning a bad SQL statement, let's throw more hardware at it. It is better to have reasonably sized containers and enough of them to process your query data.
For example, take a cross join of two small tables, 1,000,000 records each. The Cartesian product will be 1,000,000 x 1,000,000 = 1,000,000,000,000 records. That is a big input for a single mapper. You need to translate that into GB to understand how much memory is needed. For example, assuming that the memory requirement is 10 GB and tez.grouping.max-size is set to the default of 1 GB, 10 mappers will be needed. Those will use 10 containers. Now assume that each container is set to 6 GB. You will be allocating 60 GB for a 10 GB need. In that specific case, it would actually be better to have 1 GB containers. Now, if your data is 10 GB and you have only one 6 GB container, that will generate "Out of memory". If the execution plan of the query has one mapper, one container is allocated, and if that container is not big enough, you will get your out-of-memory error. However, if you reduce tez.grouping.max-size to a lower value, which forces the execution plan to use multiple mappers, you will have one container for each, and those tasks will work in parallel, reducing the run time and meeting the data requirements.
You can override the global tez.grouping.max-size for your specific query. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html describes the Tez parameters, and some of them could help; however, for your case you could give tez.grouping.max-size a shot.
Summary:
- Understand the data volume that needs to be processed.
- Run EXPLAIN on the SQL statement to understand the execution plan: tasks and containers.
- Use the ResourceManager UI to see how many containers and how much of the cluster's resources this query uses. Tez View can also give you a good understanding of the Mapper and Reducer tasks involved. The more of them there are, the more resources are used, but the response time is better. Balance that to use a reasonable amount of resources for a reasonable response time.
- Set tez.grouping.max-size to a value that makes sense for your query. By default it is set to 1 GB, and that is a global value.
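The container arithmetic from the example above can be sketched in a few lines of Python. This is a deliberately simplified model that follows the reasoning in this post, not Tez's actual split-grouping algorithm (which also takes other settings and cluster capacity into account). The function names are made up for illustration.

```python
import math

GB = 1024**3


def estimate_mappers(input_bytes: int, grouping_max_bytes: int) -> int:
    """Simplified estimate: one mapper per group of at most grouping_max_bytes.

    Assumption: real Tez grouping considers more than tez.grouping.max-size;
    this only mirrors the rough reasoning in the post.
    """
    return max(1, math.ceil(input_bytes / grouping_max_bytes))


def allocated_bytes(mappers: int, container_bytes: int) -> int:
    """Total memory reserved when every mapper gets its own container."""
    return mappers * container_bytes


data = 10 * GB                            # memory requirement from the example
mappers = estimate_mappers(data, 1 * GB)  # tez.grouping.max-size = 1 GB

for container_gb in (6, 1):
    total = allocated_bytes(mappers, container_gb * GB)
    print(
        f"{mappers} mappers x {container_gb} GB containers: "
        f"{total // GB} GB allocated for a {data // GB} GB need"
    )
```

In a Hive session, a per-query override is typically written as `SET tez.grouping.max-size=<bytes>;` before the statement, with the byte value chosen from an estimate like the one above.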
08-23-2016
10:28 PM
3 Kudos
@Kumar Veerappana
Assuming that you are only interested in who has access to Hadoop services, extract all OS users from all nodes by checking the content of the /etc/passwd file. Some of them are legitimate users needed by Hadoop tools, e.g. hive, hdfs, etc. Users of HDFS will have a /user/username folder in HDFS. You can see those with hadoop fs -ls /user, executed as a user who is a member of the hadoop group. If they have access to the Hive client, they are also able to perform DDL and DML actions in Hive. The above will allow you to understand the current state; however, this is your opportunity to improve security even without the bells and whistles of Kerberos/LDAP/Ranger. You can force users to access Hadoop ecosystem client services via a few client/edge nodes, where only client services are running, e.g. the Hive client. Users, other than power users, should not have accounts on the name node, admin node or data nodes. Any user who can access the nodes where client services are running can access those services, e.g. HDFS or Hive.
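As a rough illustration of the /etc/passwd step, the sketch below parses passwd-format text and lists accounts that have a login shell. The sample content, the set of "non-login" shells, and the function names are all assumptions for the example; on a real node you would read /etc/passwd itself and repeat this per host.

```python
# Sketch: list accounts with a login shell from passwd-format text.
# Assumption: accounts whose shell is one of these cannot log in interactively.
NON_LOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}

# Hypothetical sample; replace with open("/etc/passwd").read() on a real node.
SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
hdfs:x:1001:1001:Hadoop HDFS:/home/hdfs:/bin/bash
hive:x:1002:1002:Hive:/home/hive:/bin/bash
alice:x:1500:1500:Alice:/home/alice:/bin/bash
nobody:x:99:99:Nobody:/:/sbin/nologin
"""


def parse_passwd(text):
    """Return (name, uid, shell) tuples for each well-formed passwd line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:  # name:passwd:uid:gid:gecos:home:shell
            continue
        entries.append((fields[0], int(fields[2]), fields[6]))
    return entries


def login_users(entries):
    """Sorted names of accounts whose shell is not a known non-login shell."""
    return sorted(name for name, _uid, shell in entries
                  if shell not in NON_LOGIN_SHELLS)


print(login_users(parse_passwd(SAMPLE_PASSWD)))
```

The resulting names can then be compared against the folders listed under /user in HDFS to separate service accounts from people who actually use the cluster.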
12-02-2016
06:57 PM
What error(s) are you seeing? If it mentions Avro, then if your column names are in Chinese, it's likely that Avro does not accept them. This may be alleviated in NiFi 1.1.0 with NIFI-2262, but it would just replace non-Avro-compatible characters with underscores, so you may face a "duplicate field" exception. In that case you would need column aliases in your SELECT statement to use Avro-compatible names for the columns.
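To see why underscore replacement can end in a "duplicate field" error, here is a small sketch. Avro names must start with a letter or underscore and continue with letters, digits or underscores. The `normalize` function below is only an imitation of that kind of replacement, not NiFi's actual code, and the column names are made-up examples.

```python
import re
from collections import Counter

# Avro name rule: [A-Za-z_] followed by any number of [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_avro_name(name: str) -> bool:
    """True if name satisfies the Avro naming rule."""
    return AVRO_NAME.fullmatch(name) is not None


def normalize(name: str) -> str:
    """Imitation of underscore replacement (hypothetical, not NiFi's code)."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # This sketch's own choice: prefix an underscore if the name starts with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def duplicates(columns):
    """Normalized names that more than one source column maps to."""
    counts = Counter(normalize(c) for c in columns)
    return sorted(name for name, n in counts.items() if n > 1)


columns = ["id", "姓名", "年龄"]  # hypothetical column names
for col in columns:
    print(f"{col!r}: valid={is_avro_name(col)} normalized={normalize(col)!r}")
print("colliding names:", duplicates(columns))
```

Both two-character Chinese names collapse to the same string of underscores, which is the collision described above. Aliasing each column in the SELECT statement (for example, `姓名 AS name`) avoids it because the names are already Avro-compatible before any replacement happens.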