Member since: 06-23-2016
Posts: 5
Kudos Received: 2
Solutions: 0
04-11-2018
09:59 PM
The following bullets provide the vantage points from which applications on Hive should be analyzed for performance.

Application type
- Ingestion intensive
- Staging/storage intensive
- ETL intensive
- Consumption intensive

Data model used
- Level of normalization
- Star schema

Table design
- Storage format used, and compression
- Usage of collection data types (struct, array, map)
- Partitioning: is there a possibility of over-partitioning? Is dynamic partitioning enabled?
- Bucketing (review join conditions on the bucketed columns)

Functions
- Usage of UDFs and UDAFs

Query pattern
- Select with where (map only)
- Group by (map, shuffle, reduce)
- Order by (map, shuffle, single reducer)
- Analytical functions
- Sort by (map, shuffle, multiple reducers)
- Join: map join (mapper only, but needs heavier memory), sort merge join
- Partition column usage (especially for huge transaction tables)
- From source table: usage of multi-pass
- Table size: for huge tables, analyze everything from a scan perspective
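As a rough illustration of several of the table-design and query-pattern items above (storage format, compression, partitioning, dynamic partitioning, bucketing, partition pruning and map joins), here is a minimal HiveQL sketch. The table and column names (web_sales, customer_dim, sale_date, customer_id) are hypothetical, and the settings should be checked against the Hive version in use.

-- ORC storage with compression, partitioned by date, bucketed on the join key
CREATE TABLE web_sales (
  customer_id BIGINT,
  item_id     BIGINT,
  amount      DECIMAL(18,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Dynamic partitioning must be enabled before dynamic-partition inserts
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Map joins broadcast the small table to mappers; the where clause prunes partitions
SET hive.auto.convert.join=true;
SELECT c.state, SUM(s.amount)
FROM web_sales s
JOIN customer_dim c ON s.customer_id = c.customer_id
WHERE s.sale_date >= '2018-01-01'     -- partition pruning on a huge transaction table
GROUP BY c.state;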
12-26-2017
03:48 PM
1 Kudo
Introduction

EDW (Enterprise Data Warehouse) is a traditional methodology that has been developed and grown over the past three decades and is highly mature, with its own best practices. Hadoop emerged only a decade ago, yet it is one of the fastest growing ecosystems. Merging these two systems without paying attention to the details of both worlds easily leads to a situation where the strengths of neither are utilized, and eventually to frustration and failure. Most EDW migrations fail because the fundamentals of EDW and Hadoop are ignored.

EDW

An Enterprise Data Warehouse is a set of tools and techniques that enables a business to answer compelling questions about what happened. It lets businesses view the facts of various business processes from the perspective of different dimensions.

Example: What is the profit generated from customers in Texas during the Christmas season in each of the past 5 years? Here, profit is a fact, or "some form of transaction", while customers, the state Texas and the Christmas season are dimensions. A data warehouse thus enables businesses to take informed decisions based on facts.

A data warehouse is also defined as a subject oriented, time variant, non volatile and integrated data store. Traditionally, an EDW is built using one of the following approaches.
Bottom up

Build data marts with facts and dimensions and combine them together, which will eventually form an EDW.

A fact could take the following forms:
- Transaction: made up of individual transactions.
- Accumulating: captures a process flow (example: the cycle from order placed to product sold for a retail store).
- Periodic: data captured at periodic intervals (example: in a bank, the amount in every account at the beginning of every month).

A dimension could be:
- Date and time.
- Customer: classified by attributes such as geography, demography and age. It tracks changes, so this data set should store the history of a customer (example: customer John, and where he lived during different periods of time).
- Product.
- Employee.
- Cost center.
- Profit center.
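To make these bottom-up building blocks concrete, here is a minimal HiveQL sketch of a star schema with one transaction fact and one customer dimension, answering the Texas/Christmas profit question from the introduction. All table and column names (sales_fact, customer_dim, profit, state, sale_date) are hypothetical illustrations, not part of the original article.

-- Customer dimension with geography attributes
CREATE TABLE customer_dim (
  customer_key  BIGINT,
  customer_name STRING,
  state         STRING
)
STORED AS ORC;

-- Transaction fact, one row per individual transaction
CREATE TABLE sales_fact (
  customer_key BIGINT,
  product_key  BIGINT,
  profit       DECIMAL(18,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- "Profit from Texas customers during the Christmas season, per year"
SELECT year(f.sale_date) AS yr, SUM(f.profit) AS total_profit
FROM sales_fact f
JOIN customer_dim c ON f.customer_key = c.customer_key
WHERE c.state = 'TX'
  AND month(f.sale_date) = 12
GROUP BY year(f.sale_date);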
Top down

Build an EDW with a 3NF model which tries to represent the whole business. A typical 3NF model will include:
- People
- Asset holdings
- Contracts
- Product
- Event
- Campaign
- Medium
- Location
- Organization

ETL

In a data warehouse, ETL provides the skeleton of the framework and also consumes 80% of the effort; the remainder goes to the presentation layer. ETL includes data modeling, ETL architecture, metadata management, and ETL development and maintenance.

Thus the core of the data is different in Hadoop and in an EDW. Hadoop is ingested with raw data, and processing and quality are a later concern, whereas an EDW stores cleaned, integrated and tested, and therefore confidently consumable, data.

ETL Tools

ETL tools have been a significant part of building an ETL infrastructure. They help design, build, test and maintain ETL flows far more easily and at a lower, or at least feasible, budget in terms of developer hours; keep in mind that in a data warehouse the budget spent on ETL development and maintenance is significant. Before stable ETL tools evolved, ETL was carried out with SQL, PL/SQL and shell scripts.

ETL tools evolved because of the following advantages:
- Graphical user interface
- Short learning curve
- Ease of development
- Ease of maintenance

Thus complex ETL logic built on SQL, PL/SQL and shell scripts was migrated to ETL tools, which addressed two things:
- Complexity in data processing logic was reduced by replacing complex SQL.
- Complexity in workflow design was reduced by replacing complex shell scripts and scheduling logic.

Well, why should there be so much concern about ETL tools while discussing an EDW migration? Because most EDW migration efforts fail by undermining the importance of ETL tools, bypassing them with all their advantages lost, and going back to a scripting approach on Hadoop. The key point is that ETL tools were adopted to bring down the budget spent on man hours.

So should I keep my ETL tool untouched? Hadoop should be used at the points where an ETL tool struggles. It is tough for ETL tools to handle enterprise-size loads, so they offer two options:
- Running ETL logic on a grid of servers (example: Informatica grid).
- Leveraging the power of MPP (example: with Informatica's full push down optimization, ETL logic is built and maintained in Informatica, whereas the entire processing is carried out in Teradata).

A full push down approach of an ETL tool on top of Hadoop would be ideal, but the maturity of this approach has to be considered.
Scenarios to consider for an EDW-Hadoop merge

Let us classify a data warehouse or EDW into the following categories by data volume and computational needs.

Small (don't consider merging this with Hadoop now)

The data size occupied by the consumption layer can be handled by an RDBMS running on a single medium- or high-powered server, and users enjoy low latency on current reports. The rate of growth of data size and computational needs is well below the rate of improvement in single-server hardware in terms of number of cores, memory and storage. Examples:
- Data marts with the objective of serving a small and limited set of users.
- An EDW for a small business, or for a business which is not data oriented (example: a law firm).

Medium (strongly consider Hadoop)

At present the scenario is as described in the previous case, but the rate of growth (in data volume and computational needs) is high and management is considering a move from an RDBMS to an MPP system.

Large (must merge with Hadoop)

An enterprise-wide data warehouse running on an MPP system. This is a typical scenario in many enterprises across industries. This kind of data warehouse can grow from a few hundred terabytes to a few petabytes.
How to merge an EDW with Hadoop

An EDW is made up of a staging, a transformation and a consumption layer.

Consumption layer

This is the data repository which stores ready-to-consume data. This layer can be the first to move into Hadoop using Hive, as Hive supports most of the features of an MPP system:
- Parallel computation
- Support for SQL syntax
- Mature drivers to connect with leading BI tools
- Ability to load and unload huge data sets in a short time
- Latency similar to, or slightly higher than, the existing MPP

There is a misconception that latency in a data warehouse is in seconds while in Hadoop it is in minutes. This is true for smaller warehouses of a few hundred GB that can be hosted on a single server, but for enterprise warehouses with a few hundred TB hosted on an MPP system, the average latency easily reaches 5 to 8 minutes for ad-hoc reports. Also note that latency is a concern only for ad-hoc reports; in a typical data warehouse 40 to 70% of reports are canned reports, and for that category latency is not a concern at all.
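As an illustration of a consumption-layer table in Hive, here is a minimal HiveQL sketch, ORC-backed and partitioned so that BI tools can prune on the reporting date. The names (daily_profit_rpt, report_dt) are hypothetical.

-- Consumption-layer table: cleaned, integrated data, ready for BI tools
CREATE TABLE daily_profit_rpt (
  customer_key BIGINT,
  state        STRING,
  total_profit DECIMAL(18,2)
)
PARTITIONED BY (report_dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- A canned report simply reads the latest partition
SELECT state, SUM(total_profit)
FROM daily_profit_rpt
WHERE report_dt = '2017-12-25'
GROUP BY state;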
Staging layer

This is another data store, sitting before the consumption layer; data extracted from the different source systems lands here. This layer is used by ETL tools to clean and integrate data. It can be the next to move into Hadoop, and both HDFS and Hive can be used for it. Traditionally, data was staged on Linux servers, an RDBMS or a mainframe.
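One common way to combine HDFS and Hive for the staging layer is an external table over the landing directory, so ETL jobs can read the raw files through SQL without moving them. A minimal sketch, assuming a hypothetical landing path /data/staging/crm/customers and a delimited extract format:

-- Raw extract from a source system, landed on HDFS and exposed to Hive as-is
CREATE EXTERNAL TABLE stg_customers (
  customer_id   BIGINT,
  customer_name STRING,
  state         STRING,
  extract_ts    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/staging/crm/customers';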
Transformation layer

This layer processes the data in staging to clean it and integrate it with the data in the consumption layer. Being the core layer holding the business logic, it is one of the most effort-consuming layers. It should be the last to go into Hadoop, and only after careful consideration: move it when the economics of doing so are justified against the current ETL tool approach. The ideal is to keep the business logic in the ETL tools and leverage Hadoop's cheaper computation power through a push down approach.
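Where the push down approach is not available and a transformation does end up in Hadoop, it typically takes the form of an INSERT ... SELECT from a staging table into a consumption or dimension table, with the cleansing rules expressed in SQL. A minimal sketch, reusing the hypothetical stg_customers and customer_dim tables from the examples above:

-- Cleanse and integrate staged customer extracts into the customer dimension
INSERT OVERWRITE TABLE customer_dim
SELECT
  customer_id                AS customer_key,
  trim(upper(customer_name)) AS customer_name,
  upper(state)               AS state
FROM stg_customers
WHERE customer_id IS NOT NULL;     -- basic data quality rule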
Mapping the facts and dimensions onto Hive:

Fact
- Transaction (insert once, read many) - a plain Hive table can handle this.
- Accumulating (insert once, update a row multiple times, read many) - Hive with a workaround for the update logic can handle this (a sketch follows this list).
- Periodic (insert once, read many) - a plain Hive table can handle this.

Dimension
- Date and time - a plain Hive table.
- Customer (insert once, update a row occasionally) - a Hive table can handle this.
- Product (insert once, update a row occasionally) - a Hive table can handle this.
- Cost center (insert once, update a row occasionally) - a Hive table can handle this.
- Profit center (insert once, update a row occasionally) - a Hive table can handle this.
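The "workaround for the update logic" mentioned above is commonly implemented, on Hive versions without ACID merge, by rewriting the table with INSERT OVERWRITE: keep the incoming row where one exists, otherwise keep the current row. A minimal sketch, with hypothetical customer_dim and customer_updates tables; Hive stages the query results before replacing the table contents.

-- Rebuild the dimension: take the updated row when present, else keep the current row
INSERT OVERWRITE TABLE customer_dim
SELECT
  COALESCE(u.customer_key, d.customer_key)   AS customer_key,
  COALESCE(u.customer_name, d.customer_name) AS customer_name,
  COALESCE(u.state, d.state)                 AS state
FROM customer_dim d
FULL OUTER JOIN customer_updates u
  ON d.customer_key = u.customer_key;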
Why would an "EDW migration into Hadoop" fail?

- Seeing the data warehouse as merely a voluminous data store (a data warehouse is actually built from defined data processing rules and tested data pipelines, through the confluence of business processes and SMEs).
- Trying to migrate every possible piece and mechanism into Hadoop instead of moving only what is feasible.
- Aggressively saving money on the cost of data storage and computation power while neglecting the cost in man hours to clean, integrate and maintain business data.
- Migrating pieces from the EDW into Hadoop based on how it is possible instead of how it is feasible.

References
[1] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition.
[2] W. H. Inmon, Building the Data Warehouse.
09-17-2017
03:06 PM
1. This procedure was tested on Ambari 2.2.2.0 and HDP 2.4.3.22.

2. In the Ambari Background Operations dialog, stop all pending commands and jobs.

3. Stop all services.

4. Back up the Ambari database. (The default password for the ambari database user is bigdata, and for mapred it is mapred.)

[root@palm02 db_dumps]# pg_dump -U ambari ambari > ambari.sql
Password:
[root@palm02 db_dumps]# ls -lrt
total 10668
-rw-r--r--. 1 root root 10920611 Oct 25 21:23 ambari.sql
[root@palm02 db_dumps]# vim ambari.sql
[root@palm02 db_dumps]# pg_dump -U mapred ambarirca > ambarirca.sql
Password:
[root@palm02 db_dumps]# ls -lrt
total 10680
-rw-r--r--. 1 root root 10920611 Oct 25 21:23 ambari.sql
-rw-r--r--. 1 root root     9189 Oct 25 21:24 ambarirca.sql

5. Stop ambari-server and the ambari-agents on all hosts.

ambari-server stop
ambari-agent stop

6. Create a *.json file with the host name changes.

[root@palm02 ~]# cat cluster_host.json
{
  "palm" : {
    "palm02" : "palm02.hwx.com",
    "palm03" : "palm03.hwx.com",
    "palm20" : "palm20.hwx.com"
  }
}

where palm is the cluster name and "palm02" : "palm02.hwx.com" is a host name pair in the format "current_host_name" : "new_host_name".

7. Execute the following command on the ambari-server host:

[root@palm02 ~]# ambari-server update-host-names cluster_host.json
Using python  /usr/bin/python
Updating host names
Please, confirm Ambari services are stopped [y/n] (n)? y
Please, confirm there are no pending commands on cluster [y/n] (n)? y
Please, confirm you have made backup of the Ambari db [y/n] (n)? y
Ambari Server 'update-host-names' completed successfully.

8. After this action completes successfully, update the host names on all nodes according to the changes you made in the *.json file.

[root@palm02 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.0.5    palm02.hwx.com
10.0.0.6    palm03.hwx.com
10.0.0.7    palm20.hwx.com

9. If you changed the host name of the node on which the Ambari server resides, you must update that name for every ambari-agent: in /etc/ambari-agent/conf/ambari-agent.ini, set the "hostname" field to the new host name of the node on which the ambari-server resides.

10. Start ambari-server and the ambari-agents on all hosts.

ambari-server start
ambari-agent start

11. Start all services using Ambari Web. For each service, browse to Services > <service_name> > Service Actions and choose Start.

Note: If you have NameNode HA enabled, then after starting the ZooKeeper service you must:
a. Start all ZooKeeper components.
b. Execute the following command on both NameNode hosts:
hdfs zkfc -formatZK -force
06-09-2017
09:27 PM
STEP 1: Configure the cluster for Tez View

Enable ATS:
yarn.timeline-service.enabled=true

Enable the following YARN settings:
yarn.resourcemanager.system-metrics-publisher.enabled=true
yarn.timeline-service.webapp.address=<IP:PORT of the ATS>

Enable the following core-site settings:
hadoop.proxyuser.ambari-server.groups=*
hadoop.proxyuser.ambari-server.hosts=*

Kerberize the Ambari server:
1. Create a principal in your KDC for the Ambari Server. For example, using kadmin:
addprinc -randkey ambari-server@HADOOP.GCSKDC.CORP.APPLE.COM
2. Generate a keytab for that principal:
xst -k ambari.server.keytab ambari-server@HADOOP.GCSKDC.CORP.APPLE.COM
3. Place that keytab on the Ambari Server host, and set the file permissions so the user running the Ambari Server daemon can access it:
/etc/security/keytabs/ambari.server.keytab
4. Stop the Ambari server:
ambari-server stop
5. Run the setup-security command:
ambari-server setup-security
Select 3 for "Setup Ambari kerberos JAAS configuration", enter the Kerberos principal name for the Ambari Server you set up earlier, and enter the path to the keytab for the Ambari principal.
6. Restart the Ambari Server:
ambari-server restart

STEP 2: Set up Tez View

1. Kerberos setup for Tez Views (for example, if the Ambari principal name is ambari-server, the first property becomes yarn.timeline-service.http-authentication.proxyuser.ambari-server.hosts=*):
yarn.timeline-service.http-authentication.proxyuser.${ambari principal name}.hosts=*
yarn.timeline-service.http-authentication.proxyuser.${ambari principal name}.users=*
yarn.timeline-service.http-authentication.proxyuser.${ambari principal name}.groups=*
2. Replace the Tez view jar in /var/lib/ambari-server/resources/views/ with the latest one, using either of the following:
tez-view-2.4.3.0.30.jar - Ambari 2.4.3
tez-view-2.5.1.0.159.jar - Ambari 2.5.1

You should then see the Tez UI when using the Ambari 2.5.1 jar.
07-26-2016
04:38 AM
1 Kudo
Labels: Apache Hive