Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 8293 | 06-19-2018 08:52 AM
 | 3216 | 06-13-2018 07:54 AM
 | 3668 | 06-02-2018 06:27 PM
 | 3966 | 05-01-2018 12:28 PM
 | 5508 | 04-24-2018 11:38 AM
10-06-2016 10:26 PM
1 Kudo
@Houssam Manik @Chethana Krishnakumar Mapping between users and groups is not done at the Ranger level; it's done by Hadoop Group Mapping. As you can see in the link, it's a prerequisite for the Ranger installation. It's correct that users/groups get synchronized into Ranger and can be used to create policies. However, at request time, Hadoop Group Mapping is used to map the user to groups, not the mapping stored in Ranger. Have a look at this thread: https://community.hortonworks.com/questions/2108/ranger-group-policy-not-being-applied-to-the-users.html
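For reference, the group mapping provider is set in core-site.xml; a minimal sketch, assuming the default JNI-with-fallback provider (adjust the class to your environment, e.g. an LDAP provider):

```xml
<!-- core-site.xml: group mapping provider used by HDFS/YARN at request time -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback</value>
</property>
```

You can check what Hadoop actually resolves for a user with `hdfs groups <username>`; that result, not the user/group list synced into Ranger, is what the plugin evaluates.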
10-01-2016 10:34 AM
@Rajkumar Singh It looks like you are trying to submit the job to the default queue, which is unknown to the scheduler. Have you changed the YARN queue configuration?
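If it helps, here's a quick way to list the queues the scheduler knows about and to target one explicitly (the queue name, jar and class below are placeholders; the -D option works for jobs that use ToolRunner):

```sh
# List the queues currently defined in the scheduler
mapred queue -list

# Submit to an explicit queue instead of "default"
hadoop jar my-job.jar com.example.MyJob -Dmapreduce.job.queuename=myqueue /input /output
```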
10-01-2016 10:27 AM
1 Kudo
Hi @Gobi Subramani, Here are answers to your questions:

The role of the NameNode is to manage the HDFS file system. The role of the ResourceManager is to manage the cluster's resources (CPU, RAM, etc.) in collaboration with the NodeManagers. I won't write too much on these aspects as a lot of documentation is already available; you can read about the HDFS and YARN architectures in the official documentation. You can have high availability in YARN by running active and standby ResourceManagers (more information here).

If you have a distributed application (Spark, Tez, etc.) that needs data from HDFS, it will use both YARN and HDFS. YARN lets the application request containers (which hold the required resources: CPU, RAM, etc.) on different nodes. The application is deployed and runs inside these containers. The application is then responsible for getting its data from HDFS by exchanging with the NameNode and the DataNodes.

For the put command, only HDFS is involved. Without going into details: the client asks the NameNode to create a new file in the namespace. The NameNode does some checks (the file doesn't already exist, the user has the right to write in the directory, etc.) and allows the client to write data. At this point, the new file has no data blocks. The client then starts writing data in blocks. For each block, the HDFS API asks the NameNode for a list of DataNodes it can write to; the number of DataNodes depends on the replication factor and the list is ordered by distance from the client. Once the NameNode returns the list, the API writes the block to the first DataNode, which replicates it to the next one, and so on. Here's a picture from The Hadoop Definitive Guide that explains this process.
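To make the client side of that write path concrete, here is a minimal sketch using the HDFS FileSystem API (path and content are placeholders); the NameNode/DataNode exchanges described above all happen underneath these calls:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to add the file to the namespace;
        // at this point the file has no data blocks yet
        Path file = new Path("/tmp/example.txt");   // placeholder path
        try (FSDataOutputStream out = fs.create(file)) {
            // write() streams the data block by block: for each block the client
            // gets a DataNode pipeline from the NameNode, writes to the first
            // DataNode, and that node replicates to the next ones
            out.write("hello hdfs".getBytes("UTF-8"));
        }
        fs.close();
    }
}
```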
09-17-2016 11:08 AM
1 Kudo
Hi @Sunile Manjee Have you seen these?
- Hadoop Summit presentation: https://www.youtube.com/watch?v=NtEyW27NkgA
- http://gethue.com/how-to-use-the-livy-spark-rest-job-server-api-for-sharing-spark-rdds-and-contexts/
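For a quick feel of the Livy REST API those links cover, the basic calls look roughly like this (host, port, session id and the Scala snippet are placeholders; check the Livy docs of your version for the exact payloads):

```sh
# Start an interactive Spark session
curl -X POST -H "Content-Type: application/json" \
     -d '{"kind": "spark"}' \
     http://livy-host:8998/sessions

# Run a statement in session 0 once it reports state "idle"
curl -X POST -H "Content-Type: application/json" \
     -d '{"code": "sc.parallelize(1 to 100).count()"}' \
     http://livy-host:8998/sessions/0/statements
```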
09-17-2016 10:49 AM
2 Kudos
Hi @ARUN You have several options to do this:

- CopyTable (see the command sketch after this list)
- Export the table, copy the files to the new cluster and Import the table (see the documentation section right after CopyTable)
- In HDP 2.5, there's a new snapshotting feature. I am not sure if this feature is complete since I didn't try it; there's an open Jira and the backup/restore feature is listed as Tech Preview.

Note that the first two options can have an impact on the RegionServers while the third one has minimal impact.
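A rough sketch of the first two options (table names, the destination ZooKeeper quorum and NameNode are placeholders):

```sh
# Option 1 - CopyTable straight into the destination cluster
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=dest-zk1,dest-zk2,dest-zk3:2181:/hbase my_table

# Option 2 - Export to HDFS, copy the files over, then Import on the destination cluster
hbase org.apache.hadoop.hbase.mapreduce.Export my_table /backups/my_table
hadoop distcp /backups/my_table hdfs://dest-nn:8020/backups/my_table
# run this one on the destination cluster, into a pre-created table
hbase org.apache.hadoop.hbase.mapreduce.Import my_table /backups/my_table
```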
09-14-2016 05:25 AM
1 Kudo
Hi @Ryan Cicak Several processors that call an API (if not all) have a Connection Timeout property. You can set this property to wait for a fixed duration depending on your data source, network conditions and so on (look at GetHttp for instance). You can combine this property with a max-retry strategy: the processor waits until the timeout expires and tries again until it reaches a maximum number of retries. If the max retry count is reached, the flowfile goes to a processor that handles this special case (alert an admin, store the data in an error directory, etc.).
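Outside NiFi, the same timeout-plus-bounded-retry idea looks roughly like this in plain Java (URL, timeout values and retry count are placeholders; in NiFi itself you would express this with processor properties and retry routing rather than code):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchWithRetry {
    public static void main(String[] args) throws Exception {
        int maxRetries = 3;                       // bounded retry strategy
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL("http://example.com/api").openConnection();
                conn.setConnectTimeout(5_000);    // like the Connection Timeout property
                conn.setReadTimeout(10_000);
                System.out.println("Got HTTP " + conn.getResponseCode());
                return;                           // success: stop retrying
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        // max retries reached: handle the failure (alert an admin, park the data, etc.)
        System.err.println("Giving up after " + maxRetries + " attempts");
    }
}
```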
08-26-2016 09:53 AM
Hi @Andread B, Why do you want to run NiFi on the NameNode? If you are ingesting a lot of data I would recommend running NiFi on a dedicated host, or at least on an edge node. Also, if you will be ingesting a lot of data with a single NiFi instance, you can use GenerateTableFetch (coming in NiFi 1.0) to divide your import into several chunks and distribute them across several NiFi nodes. This processor generates several FlowFiles based on the Partition Size property, where each FlowFile carries a query that fetches one part of the data. You can try this by downloading the NiFi 1.0 beta: https://nifi.apache.org/download.html
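To give an idea of the partitioning, with a Partition Size of 10000 the generated FlowFiles each carry a paged query roughly of this shape (table and column names are placeholders; the exact SQL depends on the configured database adapter and Maximum-value Columns):

```sql
SELECT * FROM my_table ORDER BY id LIMIT 10000 OFFSET 0
SELECT * FROM my_table ORDER BY id LIMIT 10000 OFFSET 10000
SELECT * FROM my_table ORDER BY id LIMIT 10000 OFFSET 20000
```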
06-17-2016 03:31 PM
Hi @Ryan Cicak The best practice is to configure Ranger audits to go to both Solr and HDFS. HDFS is used for long-term audit storage, so you won't want to delete that data. Solr should be used for short-term storage: with Solr the data is indexed and you can query it quickly from the Ranger UI. I am not aware of any setting or property in Ranger to set a TTL and automatically delete data. You may leverage the Solr TTL feature to purge data (link) or schedule a job that issues a delete query periodically.
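As an illustration of the scheduled-delete approach, something like the following could run from cron (host, collection name, field and retention window are assumptions to adapt; Ambari-managed Ranger audits typically use a ranger_audits collection with an evtTime field):

```sh
# Delete audit documents older than 90 days from the Solr collection
curl "http://solr-host:8983/solr/ranger_audits/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><query>evtTime:[* TO NOW-90DAYS]</query></delete>"
```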
05-29-2016 05:20 PM
Hi @Lester Martin Look at this blog post, which describes the internal workings of textFile: http://www.bigsynapse.com/spark-input-output This PR discussion gives you the rationale for why the default values are what they are: https://github.com/mesos/spark/pull/718 Hope this helps
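If you want to experiment with the behaviour those links describe, the minPartitions hint can be passed straight to textFile; a small sketch (paths are placeholders, and the actual partition count also depends on the HDFS block/split size):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFilePartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("textFile-partitions");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Default minPartitions is min(defaultParallelism, 2)
        JavaRDD<String> byDefault = sc.textFile("hdfs:///data/input.txt");
        // Ask for at least 8 partitions
        JavaRDD<String> withHint = sc.textFile("hdfs:///data/input.txt", 8);

        System.out.println(byDefault.getNumPartitions());
        System.out.println(withHint.getNumPartitions());

        sc.stop();
    }
}
```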
05-29-2016 05:15 PM
Hi @suresh krish, Here are answers to your questions:

1) The POSIX permissions of your files are independent from Ranger. You should decide what permissions to use for your data directories; this will depend on the nature of your data, sharing needs, security policy, etc. For recommendations on the POSIX permissions to use with Ranger, see below.

2) I am not sure I fully understand the question. If the question is how to position the permissions between POSIX and Ranger, here's some information you need to know. The Ranger plugin for HDFS is special: it checks for Ranger policies and, if a policy exists, access is granted to the user. If a policy doesn't exist in Ranger, then the native permission model in HDFS is used (POSIX or HDFS ACLs). This point may create some confusion in the beginning; think about it as "I grant user/group X permission Y on file Z". The Ranger plugin for Hive works differently and can forbid access to a table/database. As a consequence, the recommendation is to have restrictive permissions in HDFS and to grant access to authorized users in Ranger (see the sketch after this list). This way, managing security will be easier and centralized in Ranger.

3) The user will have access. Ranger policies have priority (see point 2): Ranger checks and finds a policy that grants access, hence the ACL will be ignored.

4) No. To access data in Hive you need permission on both the table in Hive and the folder in HDFS. This is the case whether you use Ranger or the classical Hadoop permission tools.

5) See answer 2.
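To illustrate the recommendation in (2), the HDFS side could look like this, with the actual read/write grants expressed as Ranger policies on the same path (path, owner, group and mode are placeholders):

```sh
# Keep the HDFS side locked down; do the real authorization in Ranger policies
hdfs dfs -chown hive:hadoop /data/sales
hdfs dfs -chmod 700 /data/sales
```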