Member since: 06-23-2016
Posts: 5
Kudos Received: 2
Solutions: 0
04-11-2018
09:59 PM
The following bullets provide the vantage points from which applications on Hive should be analyzed for performance.

Application type
- Ingestion intensive
- Staging/storage intensive
- ETL intensive
- Consumption intensive

Data model used
- Level of normalization
- Star schema

Table design
- Storage format used, and compression
- Usage of collection data types (struct, array, map)
- Partitioning: is there a possibility of over-partitioning? Is dynamic partitioning enabled?
- Bucketing (review join conditions on the bucketed columns)

Functions
- Usage of UDFs and UDAFs

Query pattern
- Select with where (map only)
- Group by (map, shuffle, reduce)
- Order by (map, shuffle, single reducer)
- Analytical functions
- Sort by (map, shuffle, multiple reducers)
- Join: map join (mapper only, but needs heavier memory), sort merge join
- Partition column usage (especially for huge transaction tables)
- From source table: usage of multi-pass
- Table size: for huge tables, analyze everything from a scan perspective
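As a rough illustration of several of the table-design and query-pattern items above (storage format, compression, partitioning, dynamic partitioning, bucketing, partition pruning and map joins), here is a minimal HiveQL sketch. The table and column names (web_sales, customer_dim, sale_date, customer_id) are hypothetical, and the settings should be checked against the Hive version in use.

-- ORC storage with compression, partitioned by date, bucketed on the join key
CREATE TABLE web_sales (
  customer_id BIGINT,
  item_id     BIGINT,
  amount      DECIMAL(18,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Dynamic partitioning must be enabled before dynamic-partition inserts
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Map joins broadcast the small table to mappers; the where clause prunes partitions
SET hive.auto.convert.join=true;
SELECT c.state, SUM(s.amount)
FROM web_sales s
JOIN customer_dim c ON s.customer_id = c.customer_id
WHERE s.sale_date >= '2018-01-01'     -- partition pruning on a huge transaction table
GROUP BY c.state;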
12-26-2017
03:48 PM
1 Kudo
Introduction

EDW (Enterprise Data Warehouse) is a traditional methodology that has been developed and grown over the past three decades and is highly mature, with its own best practices. Hadoop emerged only a decade ago, yet it is one of the fastest growing ecosystems. Merging these two systems without paying attention to the details of both worlds easily leads to a situation where the strengths of neither are utilized, and eventually to frustration and failure. Most EDW migrations fail because the fundamentals of EDW and Hadoop are ignored.

EDW

An Enterprise Data Warehouse is a set of tools and techniques that enables a business to answer compelling questions about what happened. It lets businesses view the facts of various business processes from the perspective of different dimensions.

Example: What is the profit generated from customers in Texas during the Christmas season in each of the past 5 years? Here, profit is a fact, or "some form of transaction", while customers, the state Texas and the Christmas season are dimensions. A data warehouse thus enables businesses to take informed decisions based on facts.

A data warehouse is also defined as a subject oriented, time variant, non volatile and integrated data store. Traditionally, an EDW is built using one of the following approaches.
Bottom up

Build data marts with facts and dimensions and combine them together, which will eventually form an EDW.

A fact could take the following forms:
- Transaction: made up of individual transactions.
- Accumulating: captures a process flow (example: the cycle from order placed to product sold for a retail store).
- Periodic: data captured at periodic intervals (example: in a bank, the amount in every account at the beginning of every month).

A dimension could be:
- Date and time.
- Customer: classified by attributes such as geography, demography and age. It tracks changes, so this data set should store the history of a customer (example: customer John, and where he lived during different periods of time).
- Product.
- Employee.
- Cost center.
- Profit center.
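To make these bottom-up building blocks concrete, here is a minimal HiveQL sketch of a star schema with one transaction fact and one customer dimension, answering the Texas/Christmas profit question from the introduction. All table and column names (sales_fact, customer_dim, profit, state, sale_date) are hypothetical illustrations, not part of the original article.

-- Customer dimension with geography attributes
CREATE TABLE customer_dim (
  customer_key  BIGINT,
  customer_name STRING,
  state         STRING
)
STORED AS ORC;

-- Transaction fact, one row per individual transaction
CREATE TABLE sales_fact (
  customer_key BIGINT,
  product_key  BIGINT,
  profit       DECIMAL(18,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- "Profit from Texas customers during the Christmas season, per year"
SELECT year(f.sale_date) AS yr, SUM(f.profit) AS total_profit
FROM sales_fact f
JOIN customer_dim c ON f.customer_key = c.customer_key
WHERE c.state = 'TX'
  AND month(f.sale_date) = 12
GROUP BY year(f.sale_date);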
Top down

Build an EDW with a 3NF model which tries to represent the whole business. A typical 3NF model will include:
- People
- Asset holdings
- Contracts
- Product
- Event
- Campaign
- Medium
- Location
- Organization

ETL

In a data warehouse, ETL provides the skeleton of the framework and also consumes 80% of the effort; the remainder goes to the presentation layer. ETL includes data modeling, ETL architecture, metadata management, and ETL development and maintenance.

Thus the core of the data is different in Hadoop and in an EDW. Hadoop is ingested with raw data, and processing and quality are a later concern, whereas an EDW stores cleaned, integrated and tested, and therefore confidently consumable, data.

ETL Tools

ETL tools have been a significant part of building an ETL infrastructure. They help design, build, test and maintain ETL flows far more easily and at a lower, or at least feasible, budget in terms of developer hours; keep in mind that in a data warehouse the budget spent on ETL development and maintenance is significant. Before stable ETL tools evolved, ETL was carried out with SQL, PL/SQL and shell scripts.

ETL tools evolved because of the following advantages:
- Graphical user interface
- Short learning curve
- Ease of development
- Ease of maintenance

Thus complex ETL logic built on SQL, PL/SQL and shell scripts was migrated to ETL tools, which addressed two things:
- Complexity in data processing logic was reduced by replacing complex SQL.
- Complexity in workflow design was reduced by replacing complex shell scripts and scheduling logic.

Well, why should there be so much concern about ETL tools while discussing an EDW migration? Because most EDW migration efforts fail by undermining the importance of ETL tools, bypassing them with all their advantages lost, and going back to a scripting approach on Hadoop. The key point is that ETL tools were adopted to bring down the budget spent on man hours.

So should I keep my ETL tool untouched? Hadoop should be used at the points where an ETL tool struggles. It is tough for ETL tools to handle enterprise-size loads, so they offer two options:
- Running ETL logic on a grid of servers (example: Informatica grid).
- Leveraging the power of MPP (example: with Informatica's full push down optimization, ETL logic is built and maintained in Informatica, whereas the entire processing is carried out in Teradata).

A full push down approach of an ETL tool on top of Hadoop would be ideal, but the maturity of this approach has to be considered.
Scenarios to consider for an EDW-Hadoop merge

Let us classify a data warehouse or EDW into the following categories by data volume and computational needs.

Small (don't consider merging this with Hadoop now)

The data size occupied by the consumption layer can be handled by an RDBMS running on a single medium- or high-powered server, and users enjoy low latency on current reports. The rate of growth of data size and computational needs is well below the rate of improvement in single-server hardware in terms of number of cores, memory and storage. Examples:
- Data marts with the objective of serving a small and limited set of users.
- An EDW for a small business, or for a business which is not data oriented (example: a law firm).

Medium (strongly consider Hadoop)

At present the scenario is as described in the previous case, but the rate of growth (in data volume and computational needs) is high and management is considering a move from an RDBMS to an MPP system.

Large (must merge with Hadoop)

An enterprise-wide data warehouse running on an MPP system. This is a typical scenario in many enterprises across industries. This kind of data warehouse can grow from a few hundred terabytes to a few petabytes.
How to merge an EDW with Hadoop

An EDW is made up of a staging, a transformation and a consumption layer.

Consumption layer

This is the data repository which stores ready-to-consume data. This layer can be the first to move into Hadoop using Hive, as Hive supports most of the features of an MPP system:
- Parallel computation
- Support for SQL syntax
- Mature drivers to connect with leading BI tools
- Ability to load and unload huge data sets in a short time
- Latency similar to, or slightly higher than, the existing MPP

There is a misconception that latency in a data warehouse is in seconds while in Hadoop it is in minutes. This is true for smaller warehouses of a few hundred GB that can be hosted on a single server, but for enterprise warehouses with a few hundred TB hosted on an MPP system, the average latency easily reaches 5 to 8 minutes for ad-hoc reports. Also note that latency is a concern only for ad-hoc reports; in a typical data warehouse 40 to 70% of reports are canned reports, and for that category latency is not a concern at all.
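As an illustration of a consumption-layer table in Hive, here is a minimal HiveQL sketch, ORC-backed and partitioned so that BI tools can prune on the reporting date. The names (daily_profit_rpt, report_dt) are hypothetical.

-- Consumption-layer table: cleaned, integrated data, ready for BI tools
CREATE TABLE daily_profit_rpt (
  customer_key BIGINT,
  state        STRING,
  total_profit DECIMAL(18,2)
)
PARTITIONED BY (report_dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- A canned report simply reads the latest partition
SELECT state, SUM(total_profit)
FROM daily_profit_rpt
WHERE report_dt = '2017-12-25'
GROUP BY state;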
Staging layer

This is another data store, sitting before the consumption layer; data extracted from the different source systems lands here. This layer is used by ETL tools to clean and integrate data. It can be the next to move into Hadoop, and both HDFS and Hive can be used for it. Traditionally, data was staged on Linux servers, an RDBMS or a mainframe.
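One common way to combine HDFS and Hive for the staging layer is an external table over the landing directory, so ETL jobs can read the raw files through SQL without moving them. A minimal sketch, assuming a hypothetical landing path /data/staging/crm/customers and a delimited extract format:

-- Raw extract from a source system, landed on HDFS and exposed to Hive as-is
CREATE EXTERNAL TABLE stg_customers (
  customer_id   BIGINT,
  customer_name STRING,
  state         STRING,
  extract_ts    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/staging/crm/customers';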
Transformation layer

This layer processes the data in staging to clean it and integrate it with the data in the consumption layer. Being the core layer holding the business logic, it is one of the most effort-consuming layers. It should be the last to go into Hadoop, and only after careful consideration: move it when the economics of doing so are justified against the current ETL tool approach. The ideal is to keep the business logic in the ETL tools and leverage Hadoop's cheaper computation power through a push down approach.
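Where the push down approach is not available and a transformation does end up in Hadoop, it typically takes the form of an INSERT ... SELECT from a staging table into a consumption or dimension table, with the cleansing rules expressed in SQL. A minimal sketch, reusing the hypothetical stg_customers and customer_dim tables from the examples above:

-- Cleanse and integrate staged customer extracts into the customer dimension
INSERT OVERWRITE TABLE customer_dim
SELECT
  customer_id                AS customer_key,
  trim(upper(customer_name)) AS customer_name,
  upper(state)               AS state
FROM stg_customers
WHERE customer_id IS NOT NULL;     -- basic data quality rule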
Mapping the facts and dimensions onto Hive:

Fact
- Transaction (insert once, read many) - a plain Hive table can handle this.
- Accumulating (insert once, update a row multiple times, read many) - Hive with a workaround for the update logic can handle this (a sketch follows this list).
- Periodic (insert once, read many) - a plain Hive table can handle this.

Dimension
- Date and time - a plain Hive table.
- Customer (insert once, update a row occasionally) - a Hive table can handle this.
- Product (insert once, update a row occasionally) - a Hive table can handle this.
- Cost center (insert once, update a row occasionally) - a Hive table can handle this.
- Profit center (insert once, update a row occasionally) - a Hive table can handle this.
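The "workaround for the update logic" mentioned above is commonly implemented, on Hive versions without ACID merge, by rewriting the table with INSERT OVERWRITE: keep the incoming row where one exists, otherwise keep the current row. A minimal sketch, with hypothetical customer_dim and customer_updates tables; Hive stages the query results before replacing the table contents.

-- Rebuild the dimension: take the updated row when present, else keep the current row
INSERT OVERWRITE TABLE customer_dim
SELECT
  COALESCE(u.customer_key, d.customer_key)   AS customer_key,
  COALESCE(u.customer_name, d.customer_name) AS customer_name,
  COALESCE(u.state, d.state)                 AS state
FROM customer_dim d
FULL OUTER JOIN customer_updates u
  ON d.customer_key = u.customer_key;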
Why would an "EDW migration into Hadoop" fail?

- Seeing the data warehouse as merely a voluminous data store (a data warehouse is actually built from defined data processing rules and tested data pipelines, through the confluence of business processes and SMEs).
- Trying to migrate every possible piece and mechanism into Hadoop instead of moving only what is feasible.
- Aggressively saving money on the cost of data storage and computation power while neglecting the cost in man hours to clean, integrate and maintain business data.
- Migrating pieces from the EDW into Hadoop based on how it is possible instead of how it is feasible.

References
[1] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition.
[2] W. H. Inmon, Building the Data Warehouse.
09-17-2017
03:06 PM
1. This procedure was tested on Ambari 2.2.2.0 and HDP 2.4.3.22.

2. In the Ambari Background Operations dialog, stop all pending commands and jobs.

3. Stop all services.

4. Back up the Ambari database. (The default password for the ambari database user is bigdata, and for mapred it is mapred.)

[root@palm02 db_dumps]# pg_dump -U ambari ambari > ambari.sql
Password:
[root@palm02 db_dumps]# ls -lrt
total 10668
-rw-r--r--. 1 root root 10920611 Oct 25 21:23 ambari.sql
[root@palm02 db_dumps]# vim ambari.sql
[root@palm02 db_dumps]# pg_dump -U mapred ambarirca > ambarirca.sql
Password:
[root@palm02 db_dumps]# ls -lrt
total 10680
-rw-r--r--. 1 root root 10920611 Oct 25 21:23 ambari.sql
-rw-r--r--. 1 root root     9189 Oct 25 21:24 ambarirca.sql

5. Stop ambari-server and the ambari-agents on all hosts.

ambari-server stop
ambari-agent stop

6. Create a *.json file with the host name changes.

[root@palm02 ~]# cat cluster_host.json
{
  "palm" : {
    "palm02" : "palm02.hwx.com",
    "palm03" : "palm03.hwx.com",
    "palm20" : "palm20.hwx.com"
  }
}

where palm is the cluster name and "palm02" : "palm02.hwx.com" is a host name pair in the format "current_host_name" : "new_host_name".

7. Execute the following command on the ambari-server host:

[root@palm02 ~]# ambari-server update-host-names cluster_host.json
Using python  /usr/bin/python
Updating host names
Please, confirm Ambari services are stopped [y/n] (n)? y
Please, confirm there are no pending commands on cluster [y/n] (n)? y
Please, confirm you have made backup of the Ambari db [y/n] (n)? y
Ambari Server 'update-host-names' completed successfully.

8. After this action completes successfully, update the host names on all nodes according to the changes you made in the *.json file.

[root@palm02 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.0.5    palm02.hwx.com
10.0.0.6    palm03.hwx.com
10.0.0.7    palm20.hwx.com

9. If you changed the host name of the node on which the Ambari server resides, you must update that name for every ambari-agent: in /etc/ambari-agent/conf/ambari-agent.ini, set the "hostname" field to the new host name of the node on which the ambari-server resides.

10. Start ambari-server and the ambari-agents on all hosts.

ambari-server start
ambari-agent start

11. Start all services using Ambari Web. For each service, browse to Services > <service_name> > Service Actions and choose Start.

Note: If you have NameNode HA enabled, then after starting the ZooKeeper service you must:
a. Start all ZooKeeper components.
b. Execute the following command on both NameNode hosts:
hdfs zkfc -formatZK -force
06-09-2017
09:27 PM
STEP 1: Configure the cluster for Tez View

Enable ATS:
yarn.timeline-service.enabled=true

Enable the following YARN settings:
yarn.resourcemanager.system-metrics-publisher.enabled=true
yarn.timeline-service.webapp.address=<IP:PORT of the ATS>

Enable the following core-site settings:
hadoop.proxyuser.ambari-server.groups=*
hadoop.proxyuser.ambari-server.hosts=*

Kerberize the Ambari server:
1. Create a principal in your KDC for the Ambari Server. For example, using kadmin:
addprinc -randkey ambari-server@HADOOP.GCSKDC.CORP.APPLE.COM
2. Generate a keytab for that principal:
xst -k ambari.server.keytab ambari-server@HADOOP.GCSKDC.CORP.APPLE.COM
3. Place that keytab on the Ambari Server host, and set the file permissions so the user running the Ambari Server daemon can access it:
/etc/security/keytabs/ambari.server.keytab
4. Stop the Ambari server:
ambari-server stop
5. Run the setup-security command:
ambari-server setup-security
Select 3 for "Setup Ambari kerberos JAAS configuration", enter the Kerberos principal name for the Ambari Server you set up earlier, and enter the path to the keytab for the Ambari principal.
6. Restart the Ambari Server:
ambari-server restart

STEP 2: Set up Tez View

1. Kerberos setup for Tez Views (for example, if the Ambari principal name is ambari-server, the first property becomes yarn.timeline-service.http-authentication.proxyuser.ambari-server.hosts=*):
yarn.timeline-service.http-authentication.proxyuser.${ambari principal name}.hosts=*
yarn.timeline-service.http-authentication.proxyuser.${ambari principal name}.users=*
yarn.timeline-service.http-authentication.proxyuser.${ambari principal name}.groups=*
2. Replace the Tez view jar in /var/lib/ambari-server/resources/views/ with the latest one, using either of the following:
tez-view-2.4.3.0.30.jar - Ambari 2.4.3
tez-view-2.5.1.0.159.jar - Ambari 2.5.1

You should then see the Tez UI when using the Ambari 2.5.1 jar.
07-26-2016
04:38 AM
1 Kudo
Labels: Apache Hive