Member since: 02-09-2016
Posts: 40
Kudos Received: 14
Solutions: 0
05-29-2018
02:16 PM
Thanks @Aditya Sirna for the swift response.
05-29-2018
01:17 PM
Hi, We have an HDP 2.5.3 production deployment. We are now planning a separate HDP-Search installation for an independent SolrCloud setup, with no relation to any of the Hadoop components. The intention is to set up the SolrCloud cluster using Ambari (I know it's easy to set up SolrCloud independently, but it's an org-wide practice to use Ambari where possible). As this is expected to be a purely Solr cluster, I am wondering whether I can leave out all the basic services like HDFS, MapReduce and YARN while using Ambari to set up Solr. Is this possible at all, or should we be using blueprints to customise our Ambari / Solr setup altogether? Thanks
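For context, the kind of blueprint-based setup I have in mind is sketched below, registered through Ambari's REST API. This is only a sketch: the component names (ZOOKEEPER_SERVER, SOLR_SERVER) and the stack details are assumptions based on the HDP-Search management pack and would need to be verified against the stack definitions in your Ambari instance.
# Hypothetical Solr-only blueprint; component names are assumptions, verify against your stack
cat > solr-blueprint.json <<'EOF'
{
  "Blueprints": { "blueprint_name": "solr-only", "stack_name": "HDP", "stack_version": "2.5" },
  "host_groups": [
    {
      "name": "solr_nodes",
      "components": [ { "name": "ZOOKEEPER_SERVER" }, { "name": "SOLR_SERVER" } ],
      "cardinality": "3"
    }
  ]
}
EOF
# Register the blueprint with Ambari (host, port and credentials are placeholders)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @solr-blueprint.json http://ambari-host.example.com:8080/api/v1/blueprints/solr-only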
Labels:
- Apache Ambari
- Apache Solr
08-29-2017
09:37 AM
Hi, Is there any step-by-step documented procedure for migrating from a Cloudera distribution to Hortonworks HDP? We need to migrate from a production Cloudera distribution to a brand new HDP installation, and hence I am looking for guidance on the below:
1. Is DistCp the best choice to migrate the data between the clusters?
2. How do we ensure all the security configurations are migrated appropriately? For example, from Sentry to Ranger, from Cloudera KMS to Ranger KMS, etc.?
3. How do we ensure all the important configuration properties are set to Hortonworks-recommended values? Will this be a manual effort, or is there a script or similar that can inspect the CDH settings and change them accordingly?
4. What is the best mechanism to migrate the scripts (Hive/Pig/Spark etc.)?
Thanks in advance!
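On point 1, the kind of copy I have in mind is sketched below; hostnames, ports and paths are placeholders. As I understand it, reading over webhdfs:// from the source is a common way to avoid RPC version incompatibilities when the two clusters run different Hadoop versions:
# Run from the destination (HDP) cluster; -update copies only changed files, -pb preserves block size
hadoop distcp -update -pb webhdfs://cdh-nn.example.com:50070/data hdfs://hdp-nn.example.com:8020/data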
Labels:
- Hortonworks Data Platform (HDP)
06-07-2017
03:57 PM
Hi, I am very new to NiFi and HDF and hence finding it tough to understand the USP of NiFi with respect to other data transport mechanisms, so any help would be greatly appreciated. Is NiFi's primary interaction only through the UI? How different is NiFi from Kafka or any enterprise ESB, apart from the visual data flow aspect? Especially when comparing with Kafka, what is common between them and where do they differ? My understanding of the NiFi features with respect to Kafka:
- Visual command and control - not available in Kafka
- Data lineage - something that can be done with Apache Atlas for Kafka?
- Data prioritisation - I presume this can be controlled with a combination of topics, consumers and consumer groups in Kafka
- Back pressure - as Kafka can retain data, consumers can always replay the data and catch up
- Latency vs throughput control - similar to back pressure and prioritisation, this can be controlled with consumers and topics with data retention
- Security - Kafka also has a security implementation
- Scaling - build a Kafka cluster
11-02-2016
12:36 AM
Further to my earlier question (https://community.hortonworks.com/questions/64438/hive-beeline-e-in-shell-script.html#comment-64493), I am wondering how to use the commands beeline -e and beeline -f in shell scripts (bash). When I tried running a beeline -e command directly from bash, it said the connection was not available. So I presume we need to run a beeline -u command, or a combination of beeline and !connect commands, first. But once we execute either of those, we are in the beeline shell rather than the bash shell, and hence the beeline -e command is not needed anymore. So I am wondering what the purpose of the beeline -e command is and how to use it without invoking beeline -u first. I am sure my understanding is wrong somewhere, so please correct me.
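For reference, my current understanding is that -u and -f can be combined in a single non-interactive invocation, so the script never enters the beeline shell; the JDBC URL and file name below are placeholders, and please correct me if this is not the intended usage:
# queries.sql contains the HiveQL statements to run; URL is a placeholder
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -f queries.sql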
Labels:
- Apache Hive
11-01-2016
11:37 PM
Thanks @Neeraj Sabharwal. Just wondering if there is any standard approach to this. Without connecting first, how does someone use the beeline -e command? i.e. the beeline -u and beeline -e commands look mutually exclusive to me.
11-01-2016
04:43 PM
2 Kudos
Hi, Our cluster is secured using Kerberos. Now I need to run Hive queries in a shell script which will be scheduled to run periodically. In my shell script, I was thinking of using the below commands in sequence:
beeline -u "jdbc:hive2://$hive_server2:10000/$hive_db;principal=$user_principal"
beeline -e "SHOW DATABASES"
But then I realised that once I run the beeline -u command, it takes me into the beeline shell instead of staying in the bash shell. So I am wondering how to get this sorted out. I need to use the beeline -e command, but need to connect to the cluster first using the Kerberos principal. Any ideas on the best way to handle this? FYI, we are not using Oozie, but a shell script with crontab scheduling. Thanks
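For reference, the pattern I eventually want is sketched below, if kinit plus a single non-interactive invocation is indeed the right approach; the keytab path, principal and variables are placeholders:
# Obtain a Kerberos ticket from a headless keytab (paths and principals are placeholders)
kinit -kt /etc/security/keytabs/app-usr.keytab app-usr@EXAMPLE.COM
# Connect and run the query in one non-interactive invocation, staying in bash throughout
beeline -u "jdbc:hive2://${hive_server2}:10000/${hive_db};principal=${user_principal}" -e "SHOW DATABASES;"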
Labels:
- Apache Hive
10-31-2016
05:07 PM
Thanks @Kuldeep Kulkarni
10-31-2016
03:22 PM
@Kuldeep Kulkarni Thanks for your response. Yes, the cluster is integrated with AD and ranger-usersync is enabled. My question is around whether the app-usr needs to be able to log in to the master nodes and edge nodes, versus just being visible from those nodes. For security reasons, we want to disallow application users from logging in to the master nodes and data nodes.
10-31-2016
02:34 PM
Hi, I have a fundamental query on how permissions work in Hadoop. We are setting up a cluster with master nodes, data nodes and edge nodes. The edge nodes are the ones exposed to the outside world, and all Hadoop clients are installed on these machines. External applications stage their data on the edge nodes first and then load it into Hadoop.
We are implementing security on our clusters and are thinking of having data ownership and permissions defined through Ranger policies for the app-usr user, for both HDFS and Hive data. So if an application user app-usr is only given login access to the edge nodes (through Active Directory groups), will the user be able to own any data in Hadoop? For example, can I have an HDFS directory or Hive table that is owned by app-usr even though the user exists only on the edge nodes and not on the master nodes or data nodes? Will this allow me to configure Ranger policies for that user? Or should the user be able to log in to all the nodes in the cluster? Looking for ideas on the best strategy around this. Thanks
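To make the question concrete, the kind of commands involved are sketched below; user, group and path names are placeholders. My (possibly wrong) understanding is that the NameNode resolves users and groups on the NameNode host, or via the configured group mapping (e.g. LDAP), so what matters is whether the name resolves there rather than whether the user can log in:
# Check how the NameNode side resolves the user's groups (name is a placeholder)
hdfs groups app-usr
# An HDFS superuser can assign ownership regardless of where app-usr can log in
hdfs dfs -mkdir -p /data/app
hdfs dfs -chown app-usr:app-group /data/app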
Labels:
- Apache Hadoop
- Apache Ranger
10-25-2016
11:15 AM
Hi,
Wondering how to retrieve the job ID for a job that is submitted through a script scheduled via crontab to run at regular intervals. For example, if I run a DistCp job in my script as below:
hadoop distcp hdfs://nn1:8020/src_path hdfs://nn2:8020/dst_path
How do I find the YARN job ID, so that I can query the status of the job in my script for completion and then take appropriate action?
PS: For various reasons, we are not using Oozie and hence need to do this in script and schedule using crontab.
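A sketch of the direction I am considering; the log path and grep pattern are assumptions. DistCp prints the MapReduce job ID to its output, which can be captured and turned into a YARN application ID for status queries:
log=/tmp/distcp.$$.log
hadoop distcp hdfs://nn1:8020/src_path hdfs://nn2:8020/dst_path > "$log" 2>&1
# Extract the job ID printed by the MapReduce client, e.g. job_1477000000000_0042
job_id=$(grep -o 'job_[0-9]*_[0-9]*' "$log" | head -1)
# The corresponding YARN application ID just swaps the prefix
app_id=${job_id/job_/application_}
yarn application -status "$app_id"
Since DistCp here runs in the foreground, its exit status alone may be enough to decide success or failure; the application ID is mainly useful for richer diagnostics.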
Labels:
- Apache Hadoop
- Apache YARN
09-28-2016
09:16 PM
@Sowmya Ramesh thanks for your response. I am not sure I understood it correctly. For example, in the case of feed replication, if the first replication job is submitted at time T and is still in progress, and another replication job is submitted at T+1 hour, do you mean that both of them complete one after the other without any overlap, in a FIFO fashion? All I am trying to understand is whether my feed replication / mirroring jobs would have any adverse impact if their scheduling is not handled properly, i.e. scheduled too frequently, which would cause them to overlap during execution.
09-28-2016
03:34 PM
Trying to understand what happens if a scheduled Falcon replication is still running while another one starts. For example, if we have an hourly replication schedule and the run at hour T is still in progress, what happens when the next one starts at T+1 hour?
Labels:
- Apache Falcon
09-23-2016
10:52 AM
1 Kudo
Hi, Just wondering what the cluster topology should look like for Kafka alongside Hadoop. I presume Kafka brokers shouldn't be co-located with data nodes, and should instead probably be installed on nodes outside the Hadoop cluster (probably gateway / edge nodes), as Kafka serves as the landing area and the data is eventually pushed to one of the Hadoop storage engines. Am I correct in thinking this way? Please validate my understanding.
Labels:
- Apache Hadoop
- Apache Kafka
08-22-2016
02:02 PM
Hi, I am trying to run a very simple command: hdfs dfs -ls -t / However, it fails saying that -t is an illegal option, whereas the documentation I found says -t is supported. FYI, I am using Hadoop 2.7.1. Any idea how to list the files / directories in HDFS sorted by time?
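For reference, the workaround I am currently using is to sort the listing on its date and time columns; this assumes the default -ls output layout, where fields 6 and 7 are the modification date and time:
# Sort by modification date/time (columns 6-7 of the default listing), oldest first
hdfs dfs -ls / | sort -k6,7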
Labels:
- Apache Hadoop
08-03-2016
01:23 PM
Hi,
I am performing a basic check to see whether a file exists in HDFS or not, using the hdfs dfs -test command. But it doesn't seem to work correctly. The documentation says it returns 0 if the file exists, but I am not getting any output when the command is run.
Let me know what needs to be done to get this working.
Please see the screenshot attached
Thanks
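For context, my understanding of how the check is supposed to work is sketched below (the path is a placeholder); as far as I can tell, -test prints nothing by design and instead sets the shell exit status, which has to be read from $? or an if statement:
# -test -e succeeds (exit 0) when the path exists; it prints nothing either way
if hdfs dfs -test -e /data/myfile.txt; then
    echo "file exists"
else
    echo "file does not exist"
fi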
Labels:
- Apache Hadoop
07-27-2016
02:59 PM
Is there a kerberised version of the HDP Sandbox image available which can be used for proof-of-concept purposes on AWS? I am planning to have two secured sandboxes on AWS and then play around with some functionality, and hence am trying to understand what's the best way to go about this. Thanks
07-19-2016
11:42 PM
Thanks @Arpit Agarwal for your response. Any specific reason why two branches are still maintained? Are they significantly different from one another, and hence need to be tracked and maintained separately? I presume HDP and many commercial distributions follow the 2.7.x lineage, so I am wondering who is using the 2.6.x series? Thanks in advance!
07-19-2016
10:59 AM
Thanks @rbiswas. Any idea how this works for other services like Hive, YARN, HBase etc.?
07-19-2016
10:38 AM
2 Kudos
Hi, Any idea how Apache Hadoop versioning works? When I go to the Hadoop homepage on the Apache site, it lists 2.7.2 as the latest stable release (I believe 2.7.1 is part of HDP 2.4.2), but that was released in Jan 2016, while 2.6.4 was released in Feb 2016. Which is the current branch to follow, and when should the 2.6 releases be used? Any idea when the 2.8.0 release date is? Thanks
Labels:
- Apache Hadoop
07-18-2016
02:56 PM
Hi, Would it be possible to grant hdfs-level privileges to users defined on the cluster? For example, my-env-hdfs is a user we have on our cluster. Can I grant this user the same privileges as the hdfs user? If so, how? Likewise, how about other service users like yarn, hive, ambari-qa etc.? Thanks
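To make the question concrete: my (possibly wrong) understanding is that HDFS treats members of the group named by dfs.permissions.superusergroup as superusers, so something like the below might grant hdfs-equivalent HDFS privileges; the user and group names are placeholders:
# Find the configured superuser group (often "hdfs" on HDP, "supergroup" upstream)
hdfs getconf -confKey dfs.permissions.superusergroup
# Add the user to that group on the NameNode host (run as root)
usermod -a -G hdfs my-env-hdfs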
Labels:
- Apache Hadoop
- Apache YARN
06-29-2016
11:14 AM
Thanks @Benjamin Leonhardi. Further to my question, what is the best strategy for removing old log files? Can I simply remove all the logs apart from the "current" ones without any issues? Is there any best practice around log management? Thanks
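For concreteness, the kind of cleanup I have in mind is sketched below; the path, file pattern and age threshold are assumptions that would need adjusting per daemon:
# Remove rolled-over HDFS daemon logs older than 30 days, leaving the active *.log files alone
find /var/log/hadoop/hdfs -name "*.log.*" -type f -mtime +30 -delete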
06-29-2016
10:28 AM
Any idea what the various log files typically created under the /var/log/hadoop/* folders are? Is there a defined naming convention and mapping to the Hadoop daemons? The reason I ask is that I see many files listed under the /var/log/hadoop/hdfs folder, but I don't understand, and can't find documentation on, the purpose of each log file. Any help please.
06-29-2016
10:14 AM
1 Kudo
Hi, Being a novice, I am trying to understand the answers to the below questions:
1. What is the difference between having configuration defined in hadoop-env.sh versus defining it in hdfs-site.xml or yarn-site.xml?
2. My presumption is that the *-default.xml files hold the standard Apache-defined configuration values, and any custom values for the standard properties (either Hadoop-vendor specific, like Hortonworks / Cloudera, or implementation specific at a project level) are defined in the *-site.xml files. Am I correct in my understanding?
3. What is the difference between the /usr/hdp/current and /usr/hdp/2.4.0.0-169 folders on the Sandbox? What is the importance / significance of each of these folders? Are they both required even on production deployments?
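On question 3, a quick check of the relationship between the two folders is sketched below; it assumes the Sandbox-style layout where /usr/hdp/current contains per-component symlinks into the versioned directory:
# /usr/hdp/current holds per-component symlinks pointing at the active versioned install
ls -l /usr/hdp/current/hadoop-client
# Resolve the symlink to the concrete versioned path
readlink -f /usr/hdp/current/hadoop-client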
05-25-2016
01:43 PM
4 Kudos
I have some questions around HDFS snapshots, which can be used for backup and DR purposes.
1. How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes? I am especially trying to understand the cases of data stored directly on HDFS, Hive data, and HBase data.
2. Can a snapshottable directory be deleted using hdfs dfs -rmr -skipTrash /data/snapshot-dir? Or do all the snapshots have to be deleted first, and snapshotting disabled, before the directory can be deleted?
3. As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added / modified / deleted. If that's the case, I am wondering what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run. Will the file be moved to the trash? If so, will the snapshot maintain a reference to the entry in the trash? Will trash eviction have any impact in this case?
4. What happens when one of the sub-directories under the snapshot directory is deleted, for example if the command hdfs dfs -rmr -skipTrash /data/sub-dir is run? Can the data be recovered from snapshots?
5. Can snapshots be deleted / archived automatically based on policies, for example time-based ones? In the above example, how long will the sub-dir data be maintained in the snapshot?
6. How do snapshots work along with HDFS quotas? For example, assume a directory with a quota of 1 GB with snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed to be saved to the directory, or will the operation be stopped because the quota limits have been exceeded?
Apologies if some of the questions don't make sense. I am still trying to understand these concepts at a ground level.
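For context, the basic snapshot lifecycle as I currently understand it, with paths and snapshot names as placeholders:
# Enable snapshots on a directory (requires superuser)
hdfs dfsadmin -allowSnapshot /data/snapshot-dir
# Take snapshots before and after data changes
hdfs dfs -createSnapshot /data/snapshot-dir s1
hdfs dfs -createSnapshot /data/snapshot-dir s2
# List snapshottable directories and compare two snapshots
hdfs lsSnapshottableDir
hdfs snapshotDiff /data/snapshot-dir s1 s2
# Delete snapshots, then disable snapshotting (only possible once no snapshots remain)
hdfs dfs -deleteSnapshot /data/snapshot-dir s1
hdfs dfs -deleteSnapshot /data/snapshot-dir s2
hdfs dfsadmin -disallowSnapshot /data/snapshot-dir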
Labels:
- Apache Hadoop
04-27-2016
09:15 PM
@Abdelkrim Hadjidj great explanation, thanks! With the above understanding, I presume that, if need be, there can be a mix of both physical and virtual machines in the same cluster without any additional overhead / performance impact apart from the ones mentioned above?
04-27-2016
08:56 PM
2 Kudos
Hello, We are having an internal argument about whether it's a good idea to have the cluster mainly running on VMs, or whether it's better to have it on physical servers. What are the pros and cons of each hardware configuration? Also, is it a good idea to mix both physical and virtual machines in a single cluster, if need be?
Labels:
- Apache Hadoop