Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 2916 | 01-04-2018 01:37 PM |
| 4976 | 08-01-2017 05:06 PM |
| 1593 | 07-26-2017 01:04 AM |
| 8974 | 07-21-2017 08:59 PM |
| 2631 | 07-20-2017 08:59 PM |
07-12-2016
03:04 PM
Hi, how does HDF handle teeing of encrypted data from PROD to DR? Is there anything special that needs to be done in terms of key management / the decrypt process? Thanks
07-08-2016
05:39 PM
2 Kudos
@Faisal Hussain I do not think there is one, but there is definitely a plan to develop one: https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support In the meantime you can convert Avro to JSON (ConvertAvroToJSON) and then call a script (there are lots of examples on the internet of converting JSON to CSV using java/javascript/perl/bash/awk etc.) from the ExecuteStreamCommand processor. This would work. Let us know if you need more details.
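As a rough sketch of the scripting half (assuming jq is available on the NiFi node, that ConvertAvroToJSON emits one JSON object per record, and with placeholder field names), the script wired into ExecuteStreamCommand could be as small as:
#!/bin/bash
# json2csv.sh - reads the flowfile content (JSON records) on stdin and
# writes CSV rows to stdout; replace id/name/amount with your schema's fields.
jq -r '[.id, .name, .amount] | @csv'
# If the records arrive wrapped in a JSON array, use instead:
# jq -r '.[] | [.id, .name, .amount] | @csv'
In ExecuteStreamCommand you would point the Command Path at this script and route the output stream relationship onward.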
07-08-2016
04:44 PM
@Ravi Mutyala So replicating the configuration changes made to the different components in the primary cluster over time would involve some manual or scripted change being applied to the DR configuration? Or can this be solved by keeping the node names (DNS) in the DR and primary clusters the same?
07-08-2016
04:30 PM
2 Kudos
@Ramasamy Karuppannan You can use a shell script and replace the properties file before the second WF invocation. This shell script needs to be called from the first WF.
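As a minimal sketch, assuming Oozie-style workflows with a job.properties file (paths, property names and the Oozie URL below are placeholders), the glue script could look like:
#!/bin/bash
# Called from the first WF (e.g. via a shell action) to rewrite the properties
# used by the second WF before it is invoked.
PROPS=/home/oozie/jobs/second-wf/job.properties
RUN_DATE=$(date +%Y-%m-%d)
sed -i "s|^inputDir=.*|inputDir=/data/incoming/${RUN_DATE}|" "$PROPS"
sed -i "s|^runDate=.*|runDate=${RUN_DATE}|" "$PROPS"
# Optionally submit the second WF right away:
# oozie job -oozie http://oozie-host:11000/oozie -config "$PROPS" -run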
07-06-2016
10:59 PM
1 Kudo
How do you automate the setup of a DR cluster when you have an up-and-running production cluster? The long-term goal is to use the DR cluster for 6 months and PROD for the other 6 months, so the exact same configuration is needed. Also assume the hardware is the same. The eventual DR strategy will be almost real-time replication. I was thinking about using Blueprints, but do we have any other custom tools? Thanks
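Roughly what I have in mind with Blueprints, using the Ambari REST API (hostnames, credentials and cluster/blueprint names below are just placeholders):
# export the production cluster's configuration as a blueprint
curl -u admin:admin "http://prod-ambari.example.com:8080/api/v1/clusters/PRODCLUSTER?format=blueprint" -o prod_blueprint.json
# register the exported blueprint on the DR Ambari server
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @prod_blueprint.json "http://dr-ambari.example.com:8080/api/v1/blueprints/prod_blueprint"
# creating the DR cluster from the blueprint then needs a cluster creation
# template that maps the DR hosts to the blueprint's host groups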
Labels:
Apache Ambari
07-06-2016
01:12 PM
5 Kudos
Driven first-hand experience:
While researching a Hive query metrics UI, I found this tool called Driven.
Step1:
Register for a free trial at http://www.driven.io/hive/
Step2:
Upon clicking the verification link sent over email, you come to this page:
Click on “Install the Driven Agent”
Step3:
On clicking that, two things will happen:
You will be taken to the configuration and install instructions
You will get an email with a username and password
Step4:
Click on https://trial.driven.io/index.html#/apps from screenshot 2
You will be at the login screen. Enter the id and password you got in your email.
Step5:
After logging in you will be prompted to change the password. Change it and click submit.
Step6:
You will be logged into the user interface. But since the agent is not installed you cannot view anything yet. You will also be provided with your API key.
Step7:
Hive is installed in the cluster on a node named hiveserver. I am going to install the Driven agent there.
From a terminal/shell, execute the following commands.
# latest Hive agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-hive-bundle.txt
# latest MapReduce agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-mr-bundle.txt
After execution you will see the downloaded agent jars.
Step8:
Now you need to export the configuration for the user you want "driven" to run under.
For MapReduce 2.x:
export YARN_CLIENT_OPTS="-javaagent:/path/to/driven-agent-mr-<version>.jar=drivenHosts=<driven host>;drivenAPIkey=<driven api key>"
So my export becomes:
export YARN_CLIENT_OPTS="-javaagent:/sbin/driven-agent/driven-agent-mr-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
** Substitute <mykey> with the value of your API key
For Hive on Tez/MR:
export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar=drivenHosts=<driven host>;drivenAPIkey=<driven api key>"
So my export becomes:
export HADOOP_OPTS="-javaagent:/sbin/driven-agent/driven-agent-hive-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
** Substitute <mykey> with the value of your API key
I have a user called storm set up in my cluster.
su - storm
export YARN_CLIENT_OPTS="-javaagent:/sbin/driven-agent/driven-agent-mr-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
export HADOOP_OPTS="-javaagent:/sbin/driven-agent/driven-agent-hive-2.2.2.jar=drivenHosts=https://trial.driven.io;drivenAPIkey=<mykey>"
Step9:
Run a program
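For example (the examples jar location below is the usual HDP path but may differ in your install, and the table name is just a placeholder), any job submitted from this shell session will now be instrumented:
# a sample MapReduce job
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 2 10
# or a Hive query
hive -e "SELECT COUNT(*) FROM sample_table;"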
Step10:
Refresh https://trial.driven.io/index.html#/apps
and the user interface will start populating.
Now that we have successfully set up Driven, we will explore the Hive metrics user interface in part 2!
07-06-2016
01:11 PM
10 Kudos
Before we move on to the more popular DR/HA strategies for Ambari, Hive, HBase and to using Falcon/snapshots/distcp etc., let us take a quick tour of tiered storage in Hadoop.
Link to Series 1: https://community.hortonworks.com/content/kbentry/43525/disaster-recovery-and-backup-best-practices-in-a-t.html
The concept of tiered storage revolves around the idea of decoupling the storage of data from the computing/performing of operations on the data. Let us think about it in a more general way: you have a data lake and you are storing 5 years' worth of data amounting to a petabyte. You are observing the following in your cluster:
Your storage has become heterogeneous over time, meaning you have added new-generation racks/machines and some of them perform better (less dense, faster-spinning disks / newer processors) than the others.
You are also observing a trend that the last 6 months of data is accessed far more frequently (around 90% of the time), while the remaining 4.5 years of data accounts for only about 10% of accesses.
This is where tiered storage comes into play. What we are saying here is that as your data ages, it goes from hot to warm to cold (labelled in terms of frequency of access). There will be some exceptions to this, for example look-up tables.
A simple solution to increase the overall performance of your data lake would be to store all 3 replicas (if possible) of the recent data on the newer machines, while storing only 1 or 0 replicas of the older data on the older, lower-performing machines. Even these older machines can be utilized to back up your configuration/setup files.
You can tag each DataNode storage location as ARCHIVE, DISK, RAM_DISK etc. to indicate its storage tier. The following storage policies can be set up:
Hot - for both storage and compute. The data that is popular and still being used for processing will stay in this policy. When a block is hot, all replicas are stored in DISK.
Cold - only for storage, with limited compute. The data that is no longer being used, or data that needs to be archived, is moved from hot storage to cold storage. When a block is cold, all replicas are stored in ARCHIVE.
Warm - partially hot and partially cold. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in ARCHIVE.
All_SSD - for storing all replicas in SSD.
One_SSD - for storing one of the replicas in SSD. The remaining replicas are stored in DISK.
Lazy_Persist - for writing blocks with a single replica in memory. The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
You need to configure the following properties:
dfs.storage.policy.enabled - for enabling/disabling the storage policy feature. The default value is true.
dfs.datanode.data.dir - on each DataNode, the comma-separated storage locations should be tagged with their storage types. This allows storage policies to place the blocks on different storage types according to policy.
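As a quick sketch of putting one of these policies to use (the path is a placeholder; the tagging of the storage locations themselves is shown in the examples that follow):
# assign the Cold policy to an aged dataset (syntax used in this HDP release;
# newer Hadoop releases use: hdfs storagepolicies -setStoragePolicy -path <path> -policy COLD)
hdfs dfsadmin -setStoragePolicy /apps/hive/warehouse/sales_2011 COLD
# migrate its existing blocks to ARCHIVE storage with the Mover tool (described below)
hdfs mover -p /apps/hive/warehouse/sales_2011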
For example, the storage locations can be tagged as follows:
A DataNode storage location /grid/dn/disk0 on DISK should be configured with [DISK]file:///grid/dn/disk0
A DataNode storage location /grid/dn/ssd0 on SSD should be configured with [SSD]file:///grid/dn/ssd0
A DataNode storage location /grid/dn/archive0 on ARCHIVE should be configured with [ARCHIVE]file:///grid/dn/archive0
A DataNode storage location /grid/dn/ram0 on RAM_DISK should be configured with [RAM_DISK]file:///grid/dn/ram0
The default storage type of a DataNode storage location will be DISK if it does not have a storage type tagged explicitly.
Storage policies can be enforced during file creation, and at any point during the lifetime of the file. For storage policies that have changed during the lifetime of a file, HDFS introduces a new tool called Mover that can be run periodically to migrate all files across the cluster to the correct storage types based on their storage policies.
If you want to read more about it, please refer to these excellent articles:
http://hortonworks.com/blog/heterogeneous-storages-hdfs/
http://hortonworks.com/blog/heterogeneous-storage-policies-hdp-2-2/
http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
And the documentation:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/archival_storage_introduction.html
I want to end this article with some discussion on the in-memory storage tier. For applications that need to write data that is temporary or can be regenerated, memory (RAM) can be an alternate storage medium that provides low latency for reads and writes. Since memory is a volatile storage medium, data written to the memory tier is asynchronously persisted to disk. HDP introduces the ‘RAM_DISK’ Storage Type and ‘LAZY_PERSIST’ Storage Policy. To set up memory as storage, follow the steps below:
1. Shut Down the DataNode
2. Mount a Portion of DataNode Memory for HDFS
To use DataNode memory as storage, you must first mount a portion of the DataNode memory for use by HDFS. For example, you would use the following commands to allocate 2GB of memory for HDFS storage:
sudo mkdir -p /mnt/hdfsramdisk
sudo mount -t tmpfs -o size=2048m tmpfs /mnt/hdfsramdisk
sudo mkdir -p /usr/lib/hadoop-hdfs
3. Assign the RAM_DISK Storage Type and Enable Short-Circuit Reads
Edit the following properties in the /etc/hadoop/conf/hdfs-site.xml file to assign the RAM_DISK storage type to DataNodes and enable short-circuit reads.
The dfs.data.dir property determines where on the local filesystem a DataNode stores its blocks. To specify a DataNode volume as RAM_DISK storage, insert [RAM_DISK] at the beginning of the local file system mount path and add it to the dfs.data.dir property.
To enable short-circuit reads, set the value of dfs.client.read.shortcircuit to true.
For example:
<property>
<name>dfs.data.dir</name>
<value>file:///grid/3/aa/hdfs/data/,[RAM_DISK]file:///mnt/hdfsramdisk/</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<property>
<name>dfs.checksum.type</name>
<value>NULL</value>
</property>
4. Set the LAZY_PERSIST Storage Policy on Files or Directories
Set a storage policy on a file or a directory. Example:
hdfs dfsadmin -setStoragePolicy /memory1 LAZY_PERSIST
5. Start the DataNode
More to come in series 3.
07-05-2016
07:26 PM
16 Kudos
A disaster recovery plan, or business process contingency plan, is a set of well-defined processes or procedures that need to be executed so that the effects of a disaster are minimized and the organization is able to either maintain or quickly resume mission-critical operations.
Disaster usually comes in several forms, and recovery needs to be planned for accordingly:
Catastrophic failure at the data center level, requiring failover to a backup location
Needing to restore a previous copy of your data due to user error or accidental deletion
The ability to restore a point-in-time copy of your data for auditing purposes
Disclaimer:
1. This article is solely my personal take on disaster recovery in a Hadoop cluster
2. Disaster recovery is a specialized subject in itself. Do not implement something based on this article in production until you have a good understanding of what you are implementing.
Key objectives:
Minimal or no downtime for the production cluster
Ensure High Availability of HDP services
Ensure backup and recovery of databases, configurations and binaries
No data loss
Recover from hardware failure
Recover from user error or accidental deletes
Business continuity
Failover to the DR cluster in case of catastrophic failure or disaster
This is the time I introduce RTO/RPO.
RTO/RPO Drill Down
RTO, or Recovery Time Objective, is the target time you set for the recovery of your IT and business activities after a disaster has struck. The goal here is to calculate how quickly you need to recover, which can then dictate the type of preparations you need to implement and the overall budget you should assign to business continuity.
RPO, or Recovery Point Objective, is focused on data and your company's loss tolerance in relation to your data. RPO is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
The major difference between these two metrics is their purpose. The RTO is usually large scale, and looks at your whole business and the systems involved. RPO focuses just on data and your company's overall resilience to the loss of it.
Qs: What is your RTO/RPO?
Ans: For a complex and large production system this answer takes some time to figure out and will be progressively defined. Ideally there should also be multiple values for this answer. What am I talking about? A 1-hour/1-hour RTO/RPO is very different (cost- and architecture-wise) from a 2-week/1-day RTO/RPO. When you choose the RTO/RPO requirements you are also choosing the required cost and architecture. By having well-defined RTO/RPO requirements you will avoid having an over-engineered solution (which may be far too expensive) and will also avoid having an under-engineered solution (which may fail precisely when you need it most - during a disaster event).
So 'band' your data assets into different categories for RTO/RPO purposes. Example: Band 1 = 1 hour RTO, Band 2 = 1 day RTO, Band 3 = 1 week RTO, Band 4 = 1 month RTO, Band 5 = not required in the event of a disaster.
You would be surprised how much data can wait in the event of a severe crash. For instance, datasets that are used to produce a report distributed once per month should almost never require a 1-hour RTO; even where they do, that applies only to the last day of the month. The rest of the time, which is 29/30, roughly 97%, a 1-day RTO should suffice even with maximum availability requirements.
So the recommendation is to drill down into your datasets and categorize them against RTO/RPO objectives. You will eventually arrive at a solution/architecture that is more adaptive and more available without increasing your budget. This will be more of a journey rather than getting it 100% right the first time.
Qs: Who will decide the RTO/RPO of the wildly varying sets of data in my data lake?
Ans: Ideally, the data/business line owners will be the ones making the decision. For log/troubleshooting/configuration types of data, the admins and data engineers should make the decision, taking feedback from the data/business line owners.
At this point we have not introduced any tools or low-level strategy for disaster recovery and backup. More to come in series 2...
Link to Series 2: https://community.hortonworks.com/content/kbentry/43575/disaster-recovery-and-backup-best-practices-in-a-t-1.html
P.S. A very special note of thanks to @bpreachuk, who pretty much penned the RTO/RPO explanation. It was written so well that I almost copied it :) I also want to thank @Ravi Mutyala, from whom I have learnt (and am still learning :)) a lot in this subject area.
07-05-2016
06:19 PM
@Roberto Sancho Did you alter the table after creation and add any extra columns?
07-05-2016
03:20 PM
@jwitt Thank you for answering. I was thinking of exporting NiFi templates from dev and then passing them through a perl/shell/awk script which changes the properties before promoting them to production. But you are saying that it does not copy some of the sensitive properties, so it seems this approach will be unnecessarily complex. Is there any way to guarantee that the template will copy all the properties (even blank values will do)? What are your thoughts? Also, is there a list of properties which it omits?
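For context, the kind of substitution I had in mind for the non-sensitive properties looks roughly like this (values and file names are placeholders):
# rewrite environment-specific values in an exported dev template before
# importing it into production (does not help with the sensitive properties,
# since those are not exported at all)
sed -e 's|dev-kafka01:9092|prod-kafka01:9092|g' \
    -e 's|/data/dev/landing|/data/prod/landing|g' \
    dev_flow_template.xml > prod_flow_template.xml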