Member since 05-30-2018 · 1322 Posts · 715 Kudos Received · 148 Solutions
08-02-2016 07:30 PM · 7 Kudos
I have been working on EDW for the last 10 years. Applying well-established relational concepts to Hadoop, I have seen many anti-patterns. How about some patterns that work? Let's get to work.

Slowly changing dimensions are a known and well-established design pattern. These patterns were established on relational theory. Why? Relational databases were the dominant database technology, used by virtually everyone. This article in no way expresses the only way to do SCD on Hadoop; I am sharing a few patterns which lead to victory. What is relational theory, you ask? "In physics and philosophy, a relational theory is a framework to understand reality or a physical system in such a way that the positions and other properties of objects are only meaningful relative to other objects." - wiki

So now we have a challenge. Hadoop and all the integrated animals were not based or founded on relational theory; Hadoop was built on software engineering principles. This is extremely important to understand and absorb. Do not expect 1:1 functionality. The platform paradigms are completely different. There is no lift-and-shift operation or turnkey solution, and if a vendor is selling you that, challenge them. Understand relational theory and how it differs from software engineering principles.

Now that that is out of the way, let's focus on slowly changing dimension type 1. What is SCD type 1? "This methodology overwrites old with new data, and therefore does not track historical data." - Wiki. This is, in my opinion, the easiest of the several SCD types: simply upsert based on the surrogate key. A minimal sketch of the overwrite semantics follows below.
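To make that concrete, here is a minimal sketch in Phoenix. The table and column names are hypothetical, and the sqlline path and ZooKeeper quorum assume an HDP-style client install; adjust for your cluster.

```bash
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181 <<'SQL'
CREATE TABLE IF NOT EXISTS DIM_CUSTOMER (
    CUSTOMER_SK BIGINT NOT NULL PRIMARY KEY,  -- surrogate key
    NATURAL_KEY VARCHAR,
    CITY        VARCHAR);
UPSERT INTO DIM_CUSTOMER VALUES (1001, 'CUST-42', 'DALLAS');
-- The same surrogate key arrives again with a new value: the row is
-- overwritten in place and no history is kept. That is SCD type 1.
UPSERT INTO DIM_CUSTOMER VALUES (1001, 'CUST-42', 'AUSTIN');
SQL
```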
Requirements:
- Data ingested needs simple processing
- Target tables (facts and dims) are of type 1: simply upsert based on surrogate or natural key
- There are known and unknown query patterns for consumption of the end product
- There are known query patterns during integration/ETL

Step 1
We will first build staging tables in Phoenix (HBase). Why Phoenix? Phoenix/HBase handles upserts very well and handles known query patterns like a champ, so I'm going with Phoenix. ETL will be performed on the staging tables, and the results will finally be loaded into our final output tables. The final output tables are the golden records; they will host all your post-ETL data and will be available for end consumption by downstream BI, analytics, ETL, etc. Using Apache NiFi, simply drag and drop your sources and your Phoenix staging tables onto the canvas and connect them. Do any simple transformations you wish here as well. Some requirements may state to land raw data and store transformed data in another table, essentially creating a system of record; that will work as well. We will mostly work with the post-system-of-record tables here.

Step 2
Next we want to assign a primary key to every record in the staging table. This primary key can be either a surrogate key or a natural key hash. Build a Pig script to join the stage and final dimension records on the natural key. For records which have a match, reuse the existing primary key and upsert the stage table for those records. For records that do not match, you will need to generate a primary key: again, either generate a surrogate key or use a natural key hash. Important: ExecuteProcess is a processor within Apache NiFi; I am just calling it out to be clear about what needs to be done during the workflow. A minimal sketch of the matching step follows below.
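Here is one way that matching step could look in Pig. The table and field names are hypothetical; for simplicity this sketch reads the tables at the HBase level with HBaseStorage (the same loader used in the export tutorial further down this page), though the Phoenix-Pig integration is another option.

```bash
cat > assign_keys.pig <<'PIG'
-- Hypothetical stage and final dimension data sets, keyed by natural key.
stage = LOAD 'hbase://STAGE_CUSTOMER'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:natural_key,info:city', '-loadKey true')
        AS (stage_rowkey:chararray, natural_key:chararray, city:chararray);
dim   = LOAD 'hbase://DIM_CUSTOMER'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:natural_key', '-loadKey true')
        AS (customer_sk:chararray, natural_key:chararray);
-- Left outer join on the natural key; a null customer_sk marks a brand-new record.
joined  = JOIN stage BY natural_key LEFT OUTER, dim BY natural_key;
matched = FILTER joined BY dim::customer_sk IS NOT NULL;  -- reuse the existing key
newrecs = FILTER joined BY dim::customer_sk IS NULL;      -- generate a key (options below)
-- From here: upsert 'matched' back into the stage table with the existing keys.
PIG
pig assign_keys.pig
```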
The part I purposely left out is the "how" of generating a surrogate key. There are many ways to skin a cat. Disgusting, I hate that phrase, but you get the idea. Here are some ways to generate a surrogate key:
- http://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
- https://github.com/manojkumarvohra/hive-hilo
- http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
- Using the Pig RANK function
- http://amintor.com/1/post/2014/07/implement-scd-type-2-in-hadoop-using-hive-transforms.html

Another option to point out: use an RDBMS. I know many cringe when they hear this. I don't care; it works. Do the SCD1 processing for that incremental data set on a free and open-source RDBMS, then use the RDBMS table to update the Phoenix stage table. Want to join both data sets? You can also use Spark to join the RDBMS tables and the HBase table; the connector information is here. Then you can do the Step 2 processing in Spark. I plan to write another article on this in the coming days/weeks, so stay tuned. This may end up being the dominant pattern.

Step 4
Referential integrity. What is referential integrity? "Referential integrity is a property of data which, when satisfied, requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute (column) in a different (or the same) relation (table)." For this topic I plan on creating a separate article. Basically, you either code up all your validation here or build a rules engine, and the rules engine will be leveraged to manage referential integrity. Bottom line: Hadoop does not adhere to relational theory, so applying relational-theory concepts does not come naturally. There is some thinking involved; I call it engineering. Don't be afraid to take this on. Again, I will post an article on this.

Step 5
Now we have a stage table with our beautiful surrogate keys. Time to update our final tables. But notice I do not only update the Phoenix tables: I have built the same tables and data set in Hive. Why? For known query patterns Phoenix kicks butt; for unknown query patterns (ad hoc BI queries) I would rather leverage Hive on Tez. Therefore, using Apache NiFi, pull from your stage tables and upsert the Phoenix and Hive final tables. A minimal sketch of the Hive side appears at the end of this article. Hive ACID is in technical preview; if you would rather not upsert in Hive, this will involve another processing setup, which is well documented here, so no reason for me to regurgitate it. I hope this helps with your SCD type 1 design on Hadoop. Leverage Apache NiFi and the other animals in Hadoop. Many ways to skin... ahh, I'm going there. In the next article I will post a design pattern for SCD type 2 on Hadoop.
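As referenced in Step 5, a minimal sketch of the Hive ACID side. The table and column names are hypothetical; the target must be a bucketed ORC table created with transactional=true, and transactions must be enabled on the cluster.

```bash
cat > scd1_final.sql <<'SQL'
-- Type 1 semantics on the final Hive dimension: update in place, no history.
UPDATE dim_customer SET city = 'AUSTIN' WHERE customer_sk = 1001;
-- Brand-new surrogate keys are simply inserted.
INSERT INTO TABLE dim_customer VALUES (1002, 'CUST-77', 'DENVER');
SQL
beeline -u jdbc:hive2://localhost:10000 -f scd1_final.sql
```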
07-20-2016 09:46 PM · 6 Kudos
HBase is just awesome. Yes, I will start with that. HBase tuning, like tuning any other service in the ecosystem, requires understanding the configurations and the impact (good or bad) they may have on the service. I will share some simple quick hitters you can check to increase performance on your Phoenix/HBase cluster.
What is your HFile locality? A simple check in Ambari will show you the percentage; the metric is Regionserver.Server.percentFilesLocal. What does this metric mean? More info here. Thanks to @Predrag Minovic: "That's the ratio of HFiles associated with regions served by an RS having one replica stored on the local Data Node in HDFS. RS can access local files directly from a local disk if short-circuit is enabled. And if you, for example, run a RS on a machine without a DN, its locality will be zero. You can find more details here. And on the HBase Web UI page you can find locality per table." So if your percentage is lower than 70, it is time to run a major compaction; this will bring your HFile locality back in order (a shell one-liner for this is sketched below).

How many regions are being hosted per RegionServer? Generally I don't go over 200 regions per RegionServer; I personally set this to 150-175. This allows for the failover scenario: when an RS dies, its regions need to be redistributed to the available RegionServers, and the existing RSs will take on the additional load until the failed RS is back online. To allow for this failover I don't like to go over 150-175 regions per RS. Others may tell you different, but from real-world experience I don't go over that limit. Do as you wish. A simple way to check how many regions you have hosted per RS is to go to the HBase Master UI through Ambari; on the front page you will find the number.
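Both checks above have simple shell equivalents; the table name below is hypothetical.

```bash
# Rewrite the table's HFiles so each region's data lands on its local DataNode:
echo "major_compact 'MY_TABLE'" | hbase shell
# Quick per-RegionServer view (including region counts) without the Master UI:
echo "status 'simple'" | hbase shell
```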
What are your region sizes? As a general practice I don't go over a 10 GB region size (the split threshold is controlled by hbase.hregion.max.filesize). Based on your use case it may make sense to increase or decrease this size, but as a general starting point 10 GB has worked for me. From there you can have at it.
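A quick way to eyeball region sizes from the shell; the path assumes the HDP default HBase root dir (/apps/hbase/data) and the default namespace, and the table name is hypothetical.

```bash
# Each subdirectory is one region; sizes well above ~10 GB suggest revisiting
# hbase.hregion.max.filesize or your split strategy.
hdfs dfs -du -h /apps/hbase/data/data/default/MY_TABLE
```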
Phoenix: are you using indexes? Look at your query. Does it suffer from terrible performance even though you have been told Phoenix/HBase queries are extremely fast? The first place to look is your WHERE clause: every field in your WHERE clause must be indexed using secondary indexes. More info here. Use global indexes for read-heavy use cases and local indexes for write-heavy use cases, and leverage a covered index when you are not fetching all the columns from the table (a minimal sketch appears at the end of this article). So you may ask: what if I have balanced reads and writes? What type of index should I use? Take a look at my post here.

Is your cluster sized properly? This goes with the theme of RegionServer count and region size. At the end of it all, you may need more nodes. Remember, this is not the database we are all familiar with, where you just beef up the box; with HBase, adding more cheap nodes DOES increase performance. It is all in the architecture.

What does your GC activity look like when you suffer from slow performance? This is one many don't have a clue about, and it is very important. Analyze the type of GC activity happening on your HBase RegionServers. Too much GC? Tune the JVM params, use G1, etc. How do you monitor GC? I wrote a post on it here.

I hope this helps with the awesomeness HBase/Phoenix provides. This is just the beginning; as I engage with more customers I will continue to add patterns to this article. Now go tune your cluster!
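As referenced above, a minimal sketch of the secondary-index options in Phoenix. The table, index, and column names are hypothetical, and the sqlline path assumes an HDP-style install.

```bash
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181 <<'SQL'
-- Global index: optimized for read-heavy use cases; INCLUDE makes it a covered
-- index, so queries filtering on CITY never have to touch the data table.
CREATE INDEX IDX_CUSTOMER_CITY ON DIM_CUSTOMER (CITY) INCLUDE (NATURAL_KEY);
-- Local index: optimized for write-heavy use cases.
CREATE LOCAL INDEX IDX_CUSTOMER_NK ON DIM_CUSTOMER (NATURAL_KEY);
SQL
```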
07-09-2016 01:30 AM · 6 Kudos
Short Description: TeraGen and TeraSort performance testing on AWS

This article should be used with extreme care. Do not use it as a benchmark. I performed this test simply to run a quick 1-terabyte TeraGen on AWS and determine what type of performance I could get from MapReduce with VERY LITTLE configuration tweaking/tuning. On my github page here you will find the following:
- the teragen script
- the hadoop, yarn, mapred, and capacity scheduler configurations used during testing

Hardware (master and datanodes): 1 master, 3 data nodes; d2.4xlarge (16 vCPU, 122 GB RAM, 12 x 2000 GB storage each)

Results:
- TeraGen: 1 hr, 6 min, 38 sec
- TeraSort: 1 hr, 34 min, 20 sec
- TeraValidate: 25 min, 27 sec
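The exact script and configs are in the github repo above; for orientation, a representative 1 TB run looks like this (the jar path assumes an HDP-style install; 10 billion rows x 100 bytes = 1 TB).

```bash
JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
hadoop jar $JAR teragen 10000000000 /tmp/teragen              # generate 1 TB of input
hadoop jar $JAR terasort /tmp/teragen /tmp/terasort           # sort it
hadoop jar $JAR teravalidate /tmp/terasort /tmp/teravalidate  # verify the output is sorted
```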
07-08-2016 01:28 AM
@slachterman Good catch. Fixed.
07-06-2016 10:28 PM · 5 Kudos
This tutorial will show how to export data out of an HBase table into CSV format. We will use airport data from the American Statistical Association, available here. Assuming you have a sandbox up and running, let's start.

First ssh into your sandbox and switch user to hdfs: sudo su - hdfs
Then grab the airport data by issuing a wget: wget http://stat-computing.org/dataexpo/2009/airports.csv
For my example the file is located at /home/hdfs/airports.csv. Push it to HDFS so the import job can read it: hdfs dfs -put /home/hdfs/airports.csv /tmp/
Now let's create an HBase table called "airports" with column family "info". Do this in the hbase shell: create 'airports', 'info'
Now that the table is created, let's load it. Get out of the hbase shell and, as user hdfs, run the following to load the table: hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,info:iata,info:airport,info:city,info:country,info:lat,info:long" airports hdfs://sandbox.hortonworks.com:/tmp/airports.csv
That will kick off a MapReduce job to load the airports table in HBase. Once that is done you can do a quick verify in the hbase shell by running count 'airports'. You should see 3368 records in the table. Now let's log into the pig shell. We will create a relation called airport_data into which we will load our HBase table by issuing:
airport_data = LOAD 'hbase://airports'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:iata,info:airport,info:city,info:country,info:lat,info:long', '-loadKey true')
AS (iata,airport,city,country,lat,long);
Now that we have our data in a relation, let's dump it to HDFS in CSV format by issuing: store airport_data into 'airportData/export' using PigStorage(',');
So we have dumped the export into the HDFS directory airportData/export. Let's go view it; a quick way to do that from the shell is sketched below. And there you go: we have loaded data into an HBase table and exported data from the table in CSV format using Pig. Happy pigging.
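As referenced above, a quick way to inspect the export (part-file names vary by job type, hence the wildcard).

```bash
hdfs dfs -ls airportData/export
hdfs dfs -cat 'airportData/export/part-*' | head -5
```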
07-06-2016 08:39 PM · 1 Kudo
Follow the instructions here on how to download and import the VM into VirtualBox. Once you have imported the VM, select it and click on Settings, then click on Network. To assign an IP, select "Bridged Adapter" in the "Attached to" drop-down list, then under the options set Promiscuous Mode to "Allow All". Now start your VM. Once the machine is up, verify you have an IP address; see the commands below. Now you have an IP for your VM. Have fun.
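To check the address, log into the VM console and run one of the following (assuming a Linux guest such as the Hortonworks sandbox); you should see a bridged address from your local network.

```bash
ip addr show   # newer images
ifconfig       # older images
```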
06-28-2016 10:19 PM · 5 Kudos
How to get a docker image up and running which encapsulates a PyCharm IDE integrated with Spark and PyBuilder. The IDE resides in the docker container and is displayed on your laptop/machine. The point is to isolate your development environment, with Spark already integrated. Why? I am a Spark developer and spend significant time trying to build an integrated environment. I am spending way too much time on integration before doing what I get paid to do: develop! Creating an isolated environment which is integrated with Spark and a CI tool, easily spun up and down, and repeatable is something which accelerates my efficiency.
Download the latest VirtualBox from here. To run docker containers or build images, a docker machine is required; download docker machine from here. Download xQuartz to display the IDE on your laptop. View my docker page for information on the docker image here. Clone my PyCharm github repo; this bootstraps the sample code I have built onto your docker container during launch. For example, I performed the git clone in /Users/smanjee/docktest
git clone https://github.com/sunileman/pycharm.git

To start this tutorial, start docker machine in a new terminal. For example, on my laptop the start script is: /Applications/Docker/Docker*app/Contents/Resources/Scripts/start.sh
Run docker-machine env to check the IP your machine is assigned (informational only).
Pull the image: docker pull sunileman/pycharm
Build the image: docker build -t sunileman/pycharm .
Open another terminal and start port forwarding: socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"
Get your IP address (not docker machine's).
Run the image: docker run -it -v /tmp/.X11-unix/:/tmp/.X11-unix/ -v ~/docktest/pycharm/PycharmProjects:/root/PycharmProjects -v ~/docktest/pycharm/.Pycharm40:/root/.PyCharm40 -e DISPLAY=XXX.XX.XX.X:0 --rm sunileman/pycharm
Replace XXX.XX.XX.X with your IP address, and replace ~/docktest/pycharm/PycharmProjects and ~/docktest/pycharm/.Pycharm40 with your paths to the PyCharm repo you cloned from my github.
Click on "I do not have previous versions", click OK, then click OPEN to open the project you mounted into the docker container and find the PyCharm project to open.
Now the project has been imported
So you have the project imported into your IDE, which is running within the docker container. To prove the IDE is connected to and integrated with Spark, simply run the python file and you will see the Spark modules have been imported. You can also verify from the host, as sketched below.
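A quick host-side check that the container really does have Spark on its Python path; the container ID is whatever docker ps reports, and the check assumes the image puts pyspark on PYTHONPATH.

```bash
docker ps                                   # grab the running container's ID
docker exec -it <container-id> \
    python -c "import pyspark; print(pyspark.__file__)"
```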
05-24-2016 03:53 PM
@Constantin Stanca What do you mean by "Additionally, you need to start the JVM with something like this in order to be able to truly access the JVM remotely"? JVMs start as they normally do to use this tool.
05-24-2016 03:51 PM · 1 Kudo
@Constantin Stanca This is standard for HBase development: if you can't see what your JVM is doing, you're driving blind. Tuning the flushes for the memstore and block cache is vital for performance, and testing G1 vs CMS GC on the NameNode or HBase is vital for performance. For production remote access, yes, you always need clearance. Monitoring the JVM during development is highly useful for the NameNode and HBase.