Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2913 | 01-04-2018 01:37 PM
 | 4974 | 08-01-2017 05:06 PM
 | 1593 | 07-26-2017 01:04 AM
 | 8969 | 07-21-2017 08:59 PM
 | 2630 | 07-20-2017 08:59 PM
12-16-2016
05:12 PM
I haven't tested it, but I don't believe it should be an issue since the decryption should happen transparently by the platform, before the data is passed to the processor.
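For context, if this refers to HDFS transparent data encryption (an assumption on my part, since the thread does not say), the transparency is easy to verify from the command line before involving a NiFi processor at all; the key, path and file names below are examples only:
# Assumes a Hadoop KMS is configured and you have the required privileges
> hadoop key create mykey
> hdfs dfs -mkdir -p /data/secure
> hdfs crypto -createZone -keyName mykey -path /data/secure
> hdfs dfs -put localfile.txt /data/secure/
# Any authorized client, including a NiFi HDFS processor, reads plaintext back
> hdfs dfs -cat /data/secure/localfile.txt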
07-08-2016
05:39 PM
2 Kudos
@Faisal Hussain I do not think there is one, but there is definitely a plan to develop one: https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support In the meantime, you can convert Avro to JSON (ConvertAvroToJSON) and then call a script in the ExecuteStreamCommand processor (there are lots of examples on the internet of scripts that convert JSON to CSV using Java/JavaScript/Perl/Bash/awk, etc.). This would work. Let us know if you need more details.
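For illustration, here is a minimal sketch of such a script. The name json_to_csv.sh is my own, and it assumes ConvertAvroToJSON is configured to emit a JSON array of flat records and that jq is installed on the NiFi host:
#!/usr/bin/env bash
# json_to_csv.sh (hypothetical): reads a JSON array of flat records on stdin,
# which is what ExecuteStreamCommand pipes in as the FlowFile content, and
# writes CSV (header row first, then one line per record) to stdout.
jq -r '(.[0] | keys_unsorted) as $cols
       | ($cols | @csv),
         (.[] | [.[$cols[]]] | @csv)'
In ExecuteStreamCommand you would point the Command Path property at this script; the processor's output stream relationship then carries the CSV, which a PutFile (or similar) processor can write out.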
07-06-2016
01:12 PM
5 Kudos
Driven first-hand experience:
While researching a Hive query metrics UI, I found a tool called Driven.
Step 1:
Register for a free trial at http://www.driven.io/hive/
Step 2:
Click the verification link sent over email. On the page it takes you to, click "Install the Driven Agent".
Step 3:
On clicking that, two things will happen:
- You will be taken to the configuration and install instructions.
- You will get an email with a username and password.
Step 4:
Go to https://trial.driven.io/index.html#/apps and you will land on the login screen. Enter the username and password you received in your email.
Step 5:
After logging in, you will be prompted to change the password. Change it and click Submit.
Step 6:
You will be logged into the user interface, but since the agent is not installed yet, you cannot view anything. You will also be provided with your API key.
Step 7:
In my cluster Hive is installed on a node named hiveserver, so I am going to install the Driven agent there. From a terminal/shell, execute the following commands:
# latest Hive agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-hive-bundle.txt
# latest MapReduce agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-mr-bundle.txt
After execution you will see the downloaded agent jars.
Step 8:
Now you need to export the agent configuration for the user you want Driven to run as.
For MapReduce 2.x:
export YARN_CLIENT_OPTS="-javaagent:/path/to/driven-agent-mr-<version>.jar=drivenHosts=<driven host>;drivenAPIkey=<driven api key>"
So my export becomes:
export YARN_CLIENT_OPTS="-javaagent:/sbin/driven-agent/driven-agent-mr-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
(Substitute the value of your own API key for <mykey>.)
For Hive on Tez/MR:
export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar=drivenHosts=<driven host>;drivenAPIkey=<driven api key>"
In my case:
export HADOOP_OPTS="-javaagent:/sbin/driven-agent/driven-agent-hive-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
(Again, substitute your own API key for <mykey>.)
I have a user called storm set up in my cluster, so I set the exports as that user:
su - storm
export YARN_CLIENT_OPTS="-javaagent:/sbin/driven-agent/driven-agent-mr-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
export HADOOP_OPTS="-javaagent:/sbin/driven-agent/driven-agent-hive-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
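These exports only last for the current shell session. If you want them to persist across logins (an optional step of my own, assuming bash is the storm user's login shell), you could append them to that user's ~/.bashrc:
# Optional: persist the agent settings for the storm user
> echo 'export YARN_CLIENT_OPTS="-javaagent:/sbin/driven-agent/driven-agent-mr-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"' >> ~/.bashrc
> echo 'export HADOOP_OPTS="-javaagent:/sbin/driven-agent/driven-agent-hive-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"' >> ~/.bashrc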
Step 9:
Run a Hive or MapReduce program so the agent has something to report (see the example below).
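For instance, either of the following would do. These are my own illustrations: my_table stands in for any real Hive table, and the jar path assumes a standard HDP layout.
# A simple Hive query that launches a job
> hive -e "SELECT COUNT(*) FROM my_table;"
# Or the stock MapReduce pi example
> yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 4 1000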
Step 10:
Refresh https://trial.driven.io/index.html#/apps and the user interface will start populating.
Now that we have successfully set up Driven, we will explore the Hive metrics user interface in part 2!
04-10-2018
07:42 PM
Will there be a part 3 of this? So far a good appetizer but no meat yet.
07-05-2016
07:26 PM
16 Kudos
A disaster recovery plan, or business process contingency plan, is a set of well-defined processes or procedures that need to be executed so that the effects of a disaster are minimized and the organization is able to either maintain or quickly resume mission-critical operations. Disaster usually comes in several forms, and recovery needs to be planned for accordingly:
- Catastrophic failure at the data center level, requiring failover to a backup location
- Needing to restore a previous copy of your data due to user error or accidental deletion
- The ability to restore a point-in-time copy of your data for auditing purposes
Disclaimer:
1. This article is solely my personal take on disaster recovery in a Hadoop cluster.
2. Disaster recovery is a specialized subject in itself. Do not implement anything based on this article in production until you have a good understanding of what you are implementing.
Key objectives:
- Minimal or no downtime for the production cluster
- Ensure high availability of HDP services
- Ensure backup and recovery of databases, configurations and binaries
- No data loss
- Recover from hardware failure
- Recover from user error or accidental deletes
- Business continuity
- Failover to a DR cluster in case of catastrophic failure or disaster
This is the time I introduce RTO/RPO.
RTO/RPO drill down:
RTO, or Recovery Time Objective, is the target time you set for the recovery of your IT and business activities after a disaster has struck. The goal here is to calculate how quickly you need to recover, which can then dictate the type of preparations you need to implement and the overall budget you should assign to business continuity.
RPO, or Recovery Point Objective, is focused on data and your company's loss tolerance in relation to your data. RPO is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
The major difference between these two metrics is their purpose. RTO is usually large scale and looks at your whole business and the systems involved. RPO focuses just on data and your company's overall resilience to the loss of it.
Q: What is your RTO/RPO?
A: For a complex and large production system this answer takes some time to figure out and will be defined progressively. Ideally there should also be multiple values for this answer. Why multiple values? Because a 1-hour/1-hour RTO/RPO is very different (cost- and architecture-wise) from a 2-week/1-day RTO/RPO. When you choose the RTO/RPO requirements you are also choosing the required cost and architecture. By having well-defined RTO/RPO requirements you will avoid an over-engineered solution (which may be far too expensive) and also an under-engineered solution (which may fail precisely when you need it most: during a disaster event).
So "band" your data assets into different categories for RTO/RPO purposes. Example: Band 1 = 1 hour RTO, Band 2 = 1 day RTO, Band 3 = 1 week RTO, Band 4 = 1 month RTO, Band 5 = not required in the event of a disaster.
You would be surprised how much data can wait in the event of a severe crash. For instance, a dataset that is only used to produce a report distributed once per month should never require a 1-hour RTO. Even if it does, that applies only to the last day of the month; the rest of it, which is 29/30, roughly 97%, should at most require a 1-day RTO even with maximum availability requirements. So the recommendation is to drill down into your datasets and categorize them against RTO/RPO objectives. You will eventually arrive at a solution/architecture that is more adaptive and more available without increasing your budget. This will be more of a journey than getting it 100% right the first time.
Q: Who will decide the RTO/RPO of the wildly varying sets of data in my data lake?
A: Ideally the data/business line owners will be the ones making the decision. For log/troubleshooting/configuration types of data, the admins and data engineers should make the decision, taking feedback from the data/business line owners.
At this point we have not introduced any tools or low-level strategy for disaster recovery and backup. More to come in series 2...
Link to Series 2: https://community.hortonworks.com/content/kbentry/43575/disaster-recovery-and-backup-best-practices-in-a-t-1.html
P.S. A very special note of thanks to @bpreachuk, who pretty much penned the RTO/RPO explanation. It was written so well that I almost copied it. :) I also want to thank @Ravi Mutyala, from whom I have learnt (and am still learning :)) a lot in this subject area.
07-05-2016
03:28 PM
3 Kudos
One way of doing this is to push templates created in a dev instance onto a production instance of NiFi, usually through scripted API calls. NiFi deliberately avoids including sensitive properties like passwords and connection strings in the template files; however, given that these are likely to change in a production environment anyway, this is more a benefit than a drawback. A good way of handling this is to use the API again to populate production properties in the template once it is deployed.
A good starting point would be to take a look at https://github.com/aperepel/nifi-api-deploy which provides a script, configured with a YAML file, to deploy templates and then update properties in a production instance. This will obviously be a lot cleaner once the community has completed the variable registry effort, but it provides a good solution for now. As Joe points out, it is also important to copy up any custom processors you have in NAR bundles as well, but that is just a file copy and restart (and they should be kept in a custom folder, as Joe suggests, to make upgrades easier).
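To make the API flow concrete, here is a minimal sketch using curl. This is my own illustration, not the nifi-api-deploy tool itself; it assumes a NiFi 1.x REST API, an unsecured instance at nifi-prod:8080, and placeholder IDs you would look up first. The "Password" property name in the last call is hypothetical.
# 1. Find the root process group id of the production instance
> curl -s http://nifi-prod:8080/nifi-api/flow/process-groups/root
# 2. Upload the template exported from dev (multipart form field "template")
> curl -s -F template=@my_flow_template.xml http://nifi-prod:8080/nifi-api/process-groups/<root-pg-id>/templates/upload
# 3. Instantiate the template onto the canvas
> curl -s -X POST -H 'Content-Type: application/json' -d '{"templateId":"<template-id>","originX":0.0,"originY":0.0}' http://nifi-prod:8080/nifi-api/process-groups/<root-pg-id>/template-instance
# 4. Re-populate a sensitive property on a processor after deployment
> curl -s -X PUT -H 'Content-Type: application/json' -d '{"revision":{"version":<current-version>},"component":{"id":"<processor-id>","config":{"properties":{"Password":"<prod-password>"}}}}' http://nifi-prod:8080/nifi-api/processors/<processor-id>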
04-18-2017
03:45 PM
@rbiswas What about adding a new host and assigning it to a specific rack at the same time? I'd like to avoid adding the host (data node) and then having to set the rack and restart again. Is there a way via the Ambari UI?
06-23-2016
02:59 PM
1 Kudo
@james.jones if you can put this as an answer, I will accept. Thanks