Member since: 02-09-2016
Posts: 40
Kudos Received: 13
Solutions: 0
10-31-2018
10:42 AM
@here community members, is there any update on my questions above, please? Thanks
10-29-2018
05:23 PM
Hi, We are using HDP 2.6.5. I have the following questions about configuring security for our HDP cluster, and would appreciate responses from experts in the community:
1. We want to restrict access to the Storm UI so that it is reachable only through the Ambari dashboard. Alternatively, we want the Storm UI to ask for a username and password controlled through AD. Is this achievable?
2. We don't have Knox in our setup; we only want to configure Ambari with AD integration. While Knox is recommended from the Hortonworks perspective, are there any potential security holes if we don't have Knox?
3. Can Ranger be integrated with AD so that authentication works through AD?
Thanks in advance.
05-29-2018
02:16 PM
Thanks @Aditya Sirna for the swift response.
05-29-2018
01:17 PM
Hi, We have an HDP 2.5.3 production deployment. However, we are planning a separate HDP-Search installation for an independent SolrCloud setup, with no relation to any of the Hadoop components. The intention is to set up the SolrCloud cluster using Ambari (I know it's easy to set up SolrCloud independently, but it's an org-wide practice to use Ambari where possible). As this is expected to be a pure Solr cluster, I'm wondering whether I can remove all the basic services like HDFS, MapReduce, YARN etc. while using Ambari to set up Solr. Is this possible at all, or should we be using blueprints to customise our Ambari / Solr setup altogether? Thanks
05-16-2018
09:14 AM
Hi, Seeking expert opinion on the best recommended way to connect to Hive from Spark. As I understand it, there are multiple approaches to connecting to Hive from Spark programs: 1. JDBC. 2. Put hive-site.xml on your classpath and set hive.metastore.uris to point to where your Hive metastore is hosted. 3. Any other approach? I am relatively new to Hive and Spark and am trying to understand the industry-recommended practice. FYI, we are using Spark 1.6.x. Thanks
08-29-2017
09:37 AM
Hi, Is there any step-by-step documented procedure for migrating from a Cloudera distribution to Hortonworks HDP? We need to migrate from a production Cloudera distribution to a brand new HDP installation, and hence are looking for guidance on the following:
1. Is DistCp the best choice to migrate the data between the clusters?
2. How do we ensure all the security configurations are migrated appropriately? For example, from Sentry to Ranger, from Cloudera KMS to Ranger KMS, etc.?
3. How do we ensure all the important configuration properties are set to Hortonworks-recommended values? Will this be a manual effort, or is there a script that can check CDH and change values accordingly?
4. What is the best mechanism to migrate the scripts (Hive/Pig/Spark etc.)?
Thanks in advance!
06-07-2017
03:57 PM
Hi, I am very new to NiFi and HDF, and hence am finding it tough to understand the USP of NiFi with respect to other data transport mechanisms, so any help would be appreciated. Is NiFi's primary interaction only through the UI? How different is NiFi from Kafka or any enterprise ESB, apart from the visual data flow aspect? Especially when comparing with Kafka, what is common between them and where do they differ? My understanding of NiFi's features with respect to Kafka:
- Visual command and control - not available in Kafka
- Data lineage - something that can be done with Apache Atlas for Kafka?
- Data prioritisation - I presume this can be controlled with a combination of topics, consumers and consumer groups in Kafka
- Back pressure - as Kafka can retain data, consumers can always replay the data and catch up
- Control of latency vs. throughput - similar to back pressure and prioritisation, this can be controlled with consumers and topics with data retention
- Security - Kafka also has a security implementation
- Scaling - build a Kafka cluster
11-02-2016
12:36 AM
Further to my earlier question (https://community.hortonworks.com/questions/64438/hive-beeline-e-in-shell-script.html#comment-64493), I'm wondering how to use the beeline -e and beeline -f commands in shell scripts (bash). When I tried running a beeline -e command directly from bash, it said the connection was not available. So I presume we need to run a beeline -u command, or a beeline plus !connect combination, first. But once we execute either of those, we are in the beeline shell rather than the bash shell, and hence a beeline -e command isn't needed anymore. So I'm wondering what the purpose of the beeline -e command is, and how to use it without invoking beeline -u first. I'm sure my understanding is wrong somewhere, so please correct me.
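For what it's worth, as I understand it the -u and -e flags can be passed together in a single invocation, so beeline connects, runs the statement, and exits straight back to bash. A minimal sketch; the host, database, and principal below are placeholder assumptions, and the beeline call itself is commented out since it needs a live HiveServer2:

```shell
# Placeholder connection details (hypothetical host, database, principal);
# substitute your own values.
HIVE_HOST=hive-host.example.com
HIVE_DB=default
JDBC_URL="jdbc:hive2://${HIVE_HOST}:10000/${HIVE_DB};principal=hive/_HOST@EXAMPLE.COM"
echo "$JDBC_URL"

# -u and -e together: beeline connects, runs the statement, prints the
# result, and returns to bash -- no interactive beeline shell involved.
# (Commented out here because it needs a live HiveServer2.)
# beeline -u "$JDBC_URL" -e "SHOW DATABASES"
```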
11-01-2016
11:37 PM
Thanks @Neeraj Sabharwal Just wondering if there is any standard approach to this. Without connecting first, how does one use the beeline -e command? i.e. the beeline -u and beeline -e commands look mutually exclusive to me.
11-01-2016
04:43 PM
2 Kudos
Hi, Our cluster is secured using Kerberos. Now I need to run Hive queries in a shell script which would be scheduled to run periodically. In my shell script, I was thinking of using the commands below in sequence:
beeline -u "jdbc:hive2://$hive_server2:10000/$hive_db;principal=$user_principal"
beeline -e "SHOW DATABASES"
But then I realised that once I run the beeline -u command, it takes me into the beeline shell instead of staying in the bash shell. So I'm wondering how to sort this out. I need to use the beeline -e command, but need to connect to the cluster first using a Kerberos principal. Any ideas on the best way to handle this? FYI, we are not using Oozie, but a shell script with crontab scheduling. Thanks
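For what it's worth, here is a hedged sketch of a cron-friendly version of such a script: obtain a Kerberos ticket non-interactively from a keytab, then pass -u and -e in the same beeline invocation so the script never drops into the interactive shell. Every host, path, and principal below is a placeholder assumption, and the cluster commands are commented out since they need a live, Kerberized cluster:

```shell
#!/bin/bash
# Cron-friendly sketch: all values below are placeholder assumptions,
# not values from this thread.
KEYTAB=/etc/security/keytabs/app-usr.keytab
PRINCIPAL=app-usr@EXAMPLE.COM
JDBC_URL="jdbc:hive2://hive-server2.example.com:10000/mydb;principal=hive/_HOST@EXAMPLE.COM"
echo "ticket for: $PRINCIPAL"

# 1. Get a Kerberos ticket non-interactively from a keytab (no password
#    prompt, so it works under cron).
# 2. Run the query with -u and -e in one beeline call, returning to bash.
# (Commented out: needs a live, Kerberized cluster.)
# kinit -kt "$KEYTAB" "$PRINCIPAL"
# beeline -u "$JDBC_URL" -e "SHOW DATABASES"
```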
10-31-2016
05:07 PM
Thanks @Kuldeep Kulkarni
10-31-2016
03:22 PM
@Kuldeep Kulkarni Thanks for your response. Yes, the cluster is integrated with AD and ranger-usersync is enabled. My question is whether the app-usr needs to be able to log in to the master nodes and edge nodes, versus just being visible from those nodes. For security reasons, we want to disallow application users from logging in to master nodes and data nodes.
10-31-2016
02:34 PM
Hi, I have a fundamental question about how permissions work in Hadoop. We are setting up a cluster with master nodes, data nodes and edge nodes. The edge nodes are the ones exposed to the outside world, and all Hadoop clients are installed on those machines. External applications stage their data on the edge nodes first and then load it into Hadoop. We are implementing security on our clusters and are thinking of defining data ownership and permissions through Ranger policies for the app-usr user, for both HDFS and Hive data. So if an application user app-usr is only given login access to the edge nodes (through Active Directory groups), will the user be able to own any data in Hadoop? For example, can I have an HDFS directory or Hive table owned by app-usr even though the user exists only on the edge nodes and not on the master or data nodes? Will this allow me to configure Ranger policies for that user? Or should the user be able to log in to all the nodes in the cluster? Looking for ideas on the best strategy around this. Thanks
10-25-2016
11:15 AM
Hi,
Wondering how to retrieve the job ID for a job submitted through crontab on a regular schedule. For example, if I run a distcp job in my script as below:
hadoop distcp hdfs://nn1:8020/src_path hdfs://nn2:8020/dst_path
How can I find the YARN job ID, so that I can query the job's status in my script and then take appropriate action?
PS: For various reasons, we are not using Oozie and hence need to do this in script and schedule using crontab.
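One way to do this (a sketch, not a definitive approach): redirect the job's client output to a log file, grep out the YARN application ID, then poll `yarn application -status`. The sample log line below imitates typical MapReduce client output, which is an assumption about its format; the cluster commands are commented out since they need a live cluster:

```shell
# Capture the client output of the job, extract the YARN application ID,
# then poll its status.
LOG=$(mktemp)
# hadoop distcp hdfs://nn1:8020/src_path hdfs://nn2:8020/dst_path 2>&1 | tee "$LOG"
# The line below stands in for typical MapReduce client output:
echo "INFO mapreduce.Job: The url to track the job: http://rm:8088/proxy/application_1469000000000_0042/" > "$LOG"

APP_ID=$(grep -o 'application_[0-9]*_[0-9]*' "$LOG" | head -n 1)
echo "$APP_ID"

# Poll for completion in the script, e.g.:
# yarn application -status "$APP_ID" | grep 'Final-State'
rm -f "$LOG"
```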
09-28-2016
09:16 PM
@Sowmya Ramesh thanks for your response. I'm not sure I understood it correctly. For example, in the case of feed replication, if the first replication job is submitted at time T and is still in progress when another replication job is submitted at T+1 hour, do you mean that both of them complete one after the other without any overlap, in FIFO fashion? All I am trying to understand is whether my feed replication / mirroring job would have any adverse impacts if its scheduling is not handled properly, i.e. scheduled so frequently that executions overlap.
09-28-2016
03:34 PM
Trying to understand what happens if a scheduled Falcon replication is still running when another one starts. For example, if we have an hourly replication schedule and the run for hour T is still in progress, what happens when the next one starts at T+1 hour?
09-23-2016
10:52 AM
1 Kudo
Hi, Just wondering what the cluster topology should look like for Kafka alongside Hadoop. I presume Kafka brokers shouldn't be co-located with data nodes; instead they should probably be installed on nodes outside the Hadoop cluster (probably gateway / edge nodes), as Kafka serves as the landing area and the data is eventually pushed to one of the Hadoop storage engines. Am I correct in thinking this way? Please validate my understanding.
08-22-2016
02:02 PM
Hi, I am trying to run a very simple command: hdfs dfs -ls -t / However, it fails, saying that -t is an illegal option, even though the documentation I found says -t is supported. FYI, I am using Hadoop 2.7.1. Any idea how to list the files / directories in HDFS sorted by time?
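If -t isn't recognised by a given Hadoop version, one common workaround (a sketch, not an official method) is to sort the listing on the date and time columns. The column positions assume the usual `hdfs dfs -ls` layout of permissions, replication, owner, group, size, date, time, path; the two sample lines below stand in for real cluster output:

```shell
# On a real cluster:  hdfs dfs -ls / | sort -k6,6r -k7,7r
# Sort by date (field 6) then time (field 7), newest first.
SORTED=$(printf '%s\n' \
  "-rw-r--r--   3 hdfs hdfs   10 2016-07-01 09:00 /old.txt" \
  "-rw-r--r--   3 hdfs hdfs   10 2016-08-20 14:30 /new.txt" \
  | sort -k6,6r -k7,7r)
echo "$SORTED"
```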
08-03-2016
01:23 PM
Hi,
I am performing a basic check to see whether a file exists in HDFS or not, using the hdfs dfs -test command. But it doesn't seem to work correctly: the documentation says it returns 0 if the file exists, but I am not getting any output when the command is run.
Let me know what needs to be done to get this working.
Please see the screenshot attached
Thanks
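As I understand it, `-test` prints nothing by design: the answer is in the exit status (0 = exists), which you check with `$?` or directly in an `if`. A sketch of the pattern; the HDFS line uses a placeholder path and is commented out, with the same exit-code idiom demonstrated on a local file:

```shell
# On a real cluster (placeholder path):
# if hdfs dfs -test -e /path/to/file; then echo "exists"; fi
# The same exit-code pattern, shown with a local file:
f=$(mktemp)
if test -e "$f"; then RESULT="exists"; else RESULT="missing"; fi
echo "$RESULT"
rm -f "$f"
```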
07-27-2016
02:59 PM
Is there a Kerberised version of the HDP Sandbox image available which can be used for proof-of-concept purposes on AWS? I am planning to have two secured sandboxes on AWS to play around with some functionality, and am trying to understand the best way to go about this. Thanks
07-19-2016
11:42 PM
Thanks @Arpit Agarwal for your response. Is there any specific reason two branches are still maintained? Are they significantly different from one another, and hence need to be tracked and maintained separately? I presume HDP and many commercial distributions follow the 2.7.x lineage, so I'm wondering who is using the 2.6.x series. Thanks in advance.
07-19-2016
10:59 AM
Thanks @rbiswas Any idea how it works for other services like hive, yarn, hbase etc.?
07-19-2016
10:38 AM
2 Kudos
Hi, Any idea how Apache Hadoop versioning works? The Hadoop homepage on the Apache site lists 2.7.2 as the latest stable release (I believe 2.7.1 is part of HDP 2.4.2), but that was released in Jan 2016, while 2.6.4 was released in Feb 2016. Which is the current branch to follow, and when should the 2.6 releases be used? Any idea what the 2.8.0 release date is? Thanks
07-18-2016
02:56 PM
Hi, Would it be possible to grant hdfs-level privileges to users defined on the cluster? For example, my-env-hdfs is a user we have on our cluster. Can I grant the hdfs user's privileges to this user? If so, how? Likewise, how about other service users like yarn, hive, ambari-qa etc.? Thanks
06-29-2016
02:24 PM
@Ian Roberts Thanks for the response. Can you please let me know why there is a need for "hadoop-client" when hdfs-client and yarn-client exist separately? What extra functionality does hadoop-client cover? hadoop-client seems to have jars related to AWS, Azure etc., so I'm not sure of its exact scope. For example, on an edge node in my cluster, would it be fine to have hdfs-client and yarn-client individually, without the hadoop-client libraries? TIA
06-29-2016
11:14 AM
Thanks @Benjamin Leonhardi Further to my question, what is the best strategy to remove old log files? Can I simply remove all the logs apart from the "current" ones without any issues? Is there any best practice around log management? Thanks
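Not an official recommendation, but one common pattern for ageing out old logs is a find-based cleanup that deletes rotated logs (names containing ".log.") older than some retention period while leaving the current ".log" files alone. The directory and retention below are placeholder assumptions, and the sketch runs against a temp dir so it is safe to try anywhere (GNU `touch -d` is assumed for backdating the sample file):

```shell
# Age out rotated logs older than 30 days; swap LOG_DIR for e.g.
# /var/log/hadoop in practice.
LOG_DIR=$(mktemp -d)
touch "$LOG_DIR/hadoop-hdfs-namenode.log"               # current log: kept
touch "$LOG_DIR/hadoop-hdfs-namenode.log.2016-05-01"    # rotated log
# Backdate the rotated log so it counts as "old" (GNU touch assumed):
touch -d '40 days ago' "$LOG_DIR/hadoop-hdfs-namenode.log.2016-05-01"

# Delete only rotated logs older than 30 days:
find "$LOG_DIR" -name '*.log.*' -mtime +30 -delete

REMAINING=$(ls "$LOG_DIR")
echo "$REMAINING"
rm -rf "$LOG_DIR"
```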
06-29-2016
11:11 AM
What is the difference between the hadoop-client folder vs. the hdfs-client and yarn-client folders on the sandbox? I understand the role of the hdfs and yarn clients, but don't understand how the hadoop client differs from these. Am I missing something?
06-29-2016
10:37 AM
Thanks @Sunile Manjee for your response, and apologies for the delayed reply. I am still confused about this topic, so let me take an example. When I create a Hive table using the CREATE TABLE command, the table's metadata (i.e. the table name, column names, their datatypes etc.) needs to be stored / persisted somewhere so that Hive can parse the underlying HDFS data using that metadata. Am I correct in saying that? If so, is the persistent storage for the metadata the relational database you were referring to above? If so, what is the role of HCatalog? Does it simply provide a mechanism for client applications to "read" the metadata already created in the underlying database? And what is the role of the Metastore service in all this? Apologies if I've misunderstood something very basic here. TIA