Member since 05-30-2018 · 1322 Posts · 715 Kudos Received · 148 Solutions
08-02-2016 07:30 PM · 7 Kudos
I have been working on EDW for the last 10 years. Applying well-established relational concepts to Hadoop, I have seen many anti-patterns. How about some patterns that work? Let's get to work.

Slowly changing dimensions are a known and well-established design pattern. These patterns were established on relational theory. Why? Relational databases were the dominant database technology, used by virtually everyone. This article in no way expresses the only way to do SCD on Hadoop; I am sharing a few patterns which lead to victory. What is relational theory, you ask? "In physics and philosophy, a relational theory is a framework to understand reality or a physical system in such a way that the positions and other properties of objects are only meaningful relative to other objects." - wiki

So now we have a challenge. Hadoop and all the integrated animals were not based or founded on relational theory; Hadoop was built on software engineering principles. This is extremely important to understand and absorb. Do not expect 1:1 functionality. The platform paradigms are completely different. There is no lift-and-shift operation or turnkey solution, and if a vendor is selling you that, challenge them. Understand relational theory and how it differs from software engineering principles.

Now that that is out of the way, let's focus on slowly changing dimension type 1. What is SCD type 1? "This methodology overwrites old with new data, and therefore does not track historical data." - Wiki. This is, in my opinion, the easiest of the several SCD types: simply upsert based on the surrogate key. A minimal sketch of the overwrite semantics follows below.
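To make that concrete, here is a minimal sketch in Phoenix. The table and column names are hypothetical, and the sqlline path and ZooKeeper quorum assume an HDP-style client install; adjust for your cluster.

```bash
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181 <<'SQL'
CREATE TABLE IF NOT EXISTS DIM_CUSTOMER (
    CUSTOMER_SK BIGINT NOT NULL PRIMARY KEY,  -- surrogate key
    NATURAL_KEY VARCHAR,
    CITY        VARCHAR);
UPSERT INTO DIM_CUSTOMER VALUES (1001, 'CUST-42', 'DALLAS');
-- The same surrogate key arrives again with a new value: the row is
-- overwritten in place and no history is kept. That is SCD type 1.
UPSERT INTO DIM_CUSTOMER VALUES (1001, 'CUST-42', 'AUSTIN');
SQL
```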
Requirements:
- Data ingested needs simple processing
- Target tables (facts and dims) are of type 1: simply upsert based on surrogate or natural key
- There are known and unknown query patterns for consumption of the end product
- There are known query patterns during integration/ETL

Step 1
We will first build staging tables in Phoenix (HBase). Why Phoenix? Phoenix/HBase handles upserts very well and handles known query patterns like a champ, so I'm going with Phoenix. ETL will be performed on the staging tables, and the results will finally be loaded into our final output tables. The final output tables are the golden records; they will host all your post-ETL data and will be available for end consumption by downstream BI, analytics, ETL, etc. Using Apache NiFi, simply drag and drop your sources and your Phoenix staging tables onto the canvas and connect them. Do any simple transformations you wish here as well. Some requirements may state to land raw data and store transformed data in another table, essentially creating a system of record; that will work as well. We will mostly work with the post-system-of-record tables here.

Step 2
Next we want to assign a primary key to every record in the staging table. This primary key can be either a surrogate key or a natural key hash. Build a Pig script to join the stage and final dimension records on the natural key. For records which have a match, reuse the existing primary key and upsert the stage table for those records. For records that do not match, you will need to generate a primary key: again, either generate a surrogate key or use a natural key hash. Important: ExecuteProcess is a processor within Apache NiFi; I am just calling it out to be clear about what needs to be done during the workflow. A minimal sketch of the matching step follows below.
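Here is one way that matching step could look in Pig. The table and field names are hypothetical; for simplicity this sketch reads the tables at the HBase level with HBaseStorage (the same loader used in the export tutorial further down this page), though the Phoenix-Pig integration is another option.

```bash
cat > assign_keys.pig <<'PIG'
-- Hypothetical stage and final dimension data sets, keyed by natural key.
stage = LOAD 'hbase://STAGE_CUSTOMER'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:natural_key,info:city', '-loadKey true')
        AS (stage_rowkey:chararray, natural_key:chararray, city:chararray);
dim   = LOAD 'hbase://DIM_CUSTOMER'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:natural_key', '-loadKey true')
        AS (customer_sk:chararray, natural_key:chararray);
-- Left outer join on the natural key; a null customer_sk marks a brand-new record.
joined  = JOIN stage BY natural_key LEFT OUTER, dim BY natural_key;
matched = FILTER joined BY dim::customer_sk IS NOT NULL;  -- reuse the existing key
newrecs = FILTER joined BY dim::customer_sk IS NULL;      -- generate a key (options below)
-- From here: upsert 'matched' back into the stage table with the existing keys.
PIG
pig assign_keys.pig
```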
The part I purposely left out is the "how" of generating a surrogate key. There are many ways to skin a cat. Disgusting, I hate that phrase, but you get the idea. Here are some ways to generate a surrogate key:
- http://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
- https://github.com/manojkumarvohra/hive-hilo
- http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
- Using the Pig RANK function
- http://amintor.com/1/post/2014/07/implement-scd-type-2-in-hadoop-using-hive-transforms.html

Another option to point out: use an RDBMS. I know many cringe when they hear this. I don't care; it works. Do the SCD1 processing for that incremental data set on a free and open-source RDBMS, then use the RDBMS table to update the Phoenix stage table. Want to join both data sets? You can also use Spark to join the RDBMS tables and the HBase table; the connector information is here. Then you can do the Step 2 processing in Spark. I plan to write another article on this in the coming days/weeks, so stay tuned. This may end up being the dominant pattern.

Step 4
Referential integrity. What is referential integrity? "Referential integrity is a property of data which, when satisfied, requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute (column) in a different (or the same) relation (table)." For this topic I plan on creating a separate article. Basically, you either code up all your validation here or build a rules engine, and the rules engine will be leveraged to manage referential integrity. Bottom line: Hadoop does not adhere to relational theory, so applying relational-theory concepts does not come naturally. There is some thinking involved; I call it engineering. Don't be afraid to take this on. Again, I will post an article on this.

Step 5
Now we have a stage table with our beautiful surrogate keys. Time to update our final tables. But notice I do not only update the Phoenix tables: I have built the same tables and data set in Hive. Why? For known query patterns Phoenix kicks butt; for unknown query patterns (ad hoc BI queries) I would rather leverage Hive on Tez. Therefore, using Apache NiFi, pull from your stage tables and upsert the Phoenix and Hive final tables. A minimal sketch of the Hive side appears at the end of this article. Hive ACID is in technical preview; if you would rather not upsert in Hive, this will involve another processing setup, which is well documented here, so no reason for me to regurgitate it. I hope this helps with your SCD type 1 design on Hadoop. Leverage Apache NiFi and the other animals in Hadoop. Many ways to skin... ahh, I'm going there. In the next article I will post a design pattern for SCD type 2 on Hadoop.
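As referenced in Step 5, a minimal sketch of the Hive ACID side. The table and column names are hypothetical; the target must be a bucketed ORC table created with transactional=true, and transactions must be enabled on the cluster.

```bash
cat > scd1_final.sql <<'SQL'
-- Type 1 semantics on the final Hive dimension: update in place, no history.
UPDATE dim_customer SET city = 'AUSTIN' WHERE customer_sk = 1001;
-- Brand-new surrogate keys are simply inserted.
INSERT INTO TABLE dim_customer VALUES (1002, 'CUST-77', 'DENVER');
SQL
beeline -u jdbc:hive2://localhost:10000 -f scd1_final.sql
```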
07-20-2016 09:46 PM · 6 Kudos
HBase is just awesome. Yes, I will start with that. HBase tuning, like tuning any other service in the ecosystem, requires understanding the configurations and the impact (good or bad) they may have on the service. I will share some simple quick hitters you can check to increase performance on your Phoenix/HBase cluster.
What is your HFile locality? A simple check in Ambari will show you the percentage; the metric is Regionserver.Server.percentFilesLocal. What does this metric mean? More info here. Thanks to @Predrag Minovic: "That's the ratio of HFiles associated with regions served by an RS having one replica stored on the local Data Node in HDFS. RS can access local files directly from a local disk if short-circuit is enabled. And if you, for example, run a RS on a machine without a DN, its locality will be zero. You can find more details here. And on the HBase Web UI page you can find locality per table." So if your percentage is lower than 70, it is time to run a major compaction; this will bring your HFile locality back in order (a shell one-liner for this is sketched below).

How many regions are being hosted per RegionServer? Generally I don't go over 200 regions per RegionServer; I personally set this to 150-175. This allows for the failover scenario: when an RS dies, its regions need to be redistributed to the available RegionServers, and the existing RSs will take on the additional load until the failed RS is back online. To allow for this failover I don't like to go over 150-175 regions per RS. Others may tell you different, but from real-world experience I don't go over that limit. Do as you wish. A simple way to check how many regions you have hosted per RS is to go to the HBase Master UI through Ambari; on the front page you will find the number.
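Both checks above have simple shell equivalents; the table name below is hypothetical.

```bash
# Rewrite the table's HFiles so each region's data lands on its local DataNode:
echo "major_compact 'MY_TABLE'" | hbase shell
# Quick per-RegionServer view (including region counts) without the Master UI:
echo "status 'simple'" | hbase shell
```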
What are your region sizes? As a general practice I don't go over a 10 GB region size (the split threshold is controlled by hbase.hregion.max.filesize). Based on your use case it may make sense to increase or decrease this size, but as a general starting point 10 GB has worked for me. From there you can have at it.
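A quick way to eyeball region sizes from the shell; the path assumes the HDP default HBase root dir (/apps/hbase/data) and the default namespace, and the table name is hypothetical.

```bash
# Each subdirectory is one region; sizes well above ~10 GB suggest revisiting
# hbase.hregion.max.filesize or your split strategy.
hdfs dfs -du -h /apps/hbase/data/data/default/MY_TABLE
```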
Phoenix: are you using indexes? Look at your query. Does it suffer from terrible performance even though you have been told Phoenix/HBase queries are extremely fast? The first place to look is your WHERE clause: every field in your WHERE clause must be indexed using secondary indexes. More info here. Use global indexes for read-heavy use cases and local indexes for write-heavy use cases, and leverage a covered index when you are not fetching all the columns from the table (a minimal sketch appears at the end of this article). So you may ask: what if I have balanced reads and writes? What type of index should I use? Take a look at my post here.

Is your cluster sized properly? This goes with the theme of RegionServer count and region size. At the end of it all, you may need more nodes. Remember, this is not the database we are all familiar with, where you just beef up the box; with HBase, adding more cheap nodes DOES increase performance. It is all in the architecture.

What does your GC activity look like when you suffer from slow performance? This is one many don't have a clue about, and it is very important. Analyze the type of GC activity happening on your HBase RegionServers. Too much GC? Tune the JVM params, use G1, etc. How do you monitor GC? I wrote a post on it here.

I hope this helps with the awesomeness HBase/Phoenix provides. This is just the beginning; as I engage with more customers I will continue to add patterns to this article. Now go tune your cluster!
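As referenced above, a minimal sketch of the secondary-index options in Phoenix. The table, index, and column names are hypothetical, and the sqlline path assumes an HDP-style install.

```bash
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181 <<'SQL'
-- Global index: optimized for read-heavy use cases; INCLUDE makes it a covered
-- index, so queries filtering on CITY never have to touch the data table.
CREATE INDEX IDX_CUSTOMER_CITY ON DIM_CUSTOMER (CITY) INCLUDE (NATURAL_KEY);
-- Local index: optimized for write-heavy use cases.
CREATE LOCAL INDEX IDX_CUSTOMER_NK ON DIM_CUSTOMER (NATURAL_KEY);
SQL
```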
07-09-2016 01:30 AM · 6 Kudos
Short Description: TeraGen and TeraSort performance testing on AWS

This article should be used with extreme care. Do not use it as a benchmark. I performed this test simply to run a quick 1-terabyte TeraGen on AWS and determine what type of performance I could get from MapReduce with VERY LITTLE configuration tweaking/tuning. On my github page here you will find the following:
- the teragen script
- the hadoop, yarn, mapred, and capacity scheduler configurations used during testing

Hardware (master and datanodes): 1 master, 3 data nodes; d2.4xlarge (16 vCPU, 122 GB RAM, 12 x 2000 GB storage each)

Results:
- TeraGen: 1 hr, 6 min, 38 sec
- TeraSort: 1 hr, 34 min, 20 sec
- TeraValidate: 25 min, 27 sec
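The exact script and configs are in the github repo above; for orientation, a representative 1 TB run looks like this (the jar path assumes an HDP-style install; 10 billion rows x 100 bytes = 1 TB).

```bash
JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
hadoop jar $JAR teragen 10000000000 /tmp/teragen              # generate 1 TB of input
hadoop jar $JAR terasort /tmp/teragen /tmp/terasort           # sort it
hadoop jar $JAR teravalidate /tmp/terasort /tmp/teravalidate  # verify the output is sorted
```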
07-08-2016 01:28 AM
@slachterman Good catch. Fixed.
07-06-2016 10:28 PM · 5 Kudos
This tutorial will show how to export data out of an HBase table into CSV format. We will use airport data from the American Statistical Association, available here. Assuming you have a sandbox up and running, let's start.

First ssh into your sandbox and switch user to hdfs: sudo su - hdfs
Then grab the airport data by issuing a wget: wget http://stat-computing.org/dataexpo/2009/airports.csv
For my example the file is located at /home/hdfs/airports.csv. Push it to HDFS so the import job can read it: hdfs dfs -put /home/hdfs/airports.csv /tmp/
Now let's create an HBase table called "airports" with column family "info". Do this in the hbase shell: create 'airports', 'info'
Now that the table is created, let's load it. Get out of the hbase shell and, as user hdfs, run the following to load the table: hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,info:iata,info:airport,info:city,info:country,info:lat,info:long" airports hdfs://sandbox.hortonworks.com:/tmp/airports.csv
That will kick off a MapReduce job to load the airports table in HBase. Once that is done you can do a quick verify in the hbase shell by running count 'airports'. You should see 3368 records in the table. Now let's log into the pig shell. We will create a relation called airport_data into which we will load our HBase table by issuing:
airport_data = LOAD 'hbase://airports'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:iata,info:airport,info:city,info:country,info:lat,info:long', '-loadKey true')
AS (iata,airport,city,country,lat,long);
Now that we have our data in a relation, let's dump it to HDFS in CSV format by issuing: store airport_data into 'airportData/export' using PigStorage(',');
So we have dumped the export into the HDFS directory airportData/export. Let's go view it; a quick way to do that from the shell is sketched below. And there you go: we have loaded data into an HBase table and exported data from the table in CSV format using Pig. Happy pigging.
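As referenced above, a quick way to inspect the export (part-file names vary by job type, hence the wildcard).

```bash
hdfs dfs -ls airportData/export
hdfs dfs -cat 'airportData/export/part-*' | head -5
```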
07-06-2016 08:39 PM · 1 Kudo
Follow the instructions here on how to download and import the VM into VirtualBox. Once you have imported the VM, select it and click on Settings, then click on Network. To assign an IP, select "Bridged Adapter" in the "Attached to" drop-down list, then under the options set Promiscuous Mode to "Allow All". Now start your VM. Once the machine is up, verify you have an IP address; see the commands below. Now you have an IP for your VM. Have fun.
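To check the address, log into the VM console and run one of the following (assuming a Linux guest such as the Hortonworks sandbox); you should see a bridged address from your local network.

```bash
ip addr show   # newer images
ifconfig       # older images
```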
06-28-2016 10:19 PM · 5 Kudos
How to get a docker image up and running which encapsulates a PyCharm IDE integrated with Spark and PyBuilder. The IDE resides in the docker container and is displayed on your laptop/machine. The point is to isolate your development environment, with Spark already integrated. Why? I am a Spark developer and spend significant time trying to build an integrated environment. I am spending way too much time on integration before doing what I get paid to do: develop! Creating an isolated environment which is integrated with Spark and a CI tool, easily spun up and down, and repeatable is something which accelerates my efficiency.
Download the latest VirtualBox from here. To run docker containers or build images, a docker machine is required; download docker machine from here. Download xQuartz to display the IDE on your laptop. View my docker page for information on the docker image here. Clone my PyCharm github repo; this bootstraps the sample code I have built onto your docker container during launch. For example, I performed the git clone in /Users/smanjee/docktest
git clone https://github.com/sunileman/pycharm.git

To start this tutorial, start docker machine in a new terminal. For example, on my laptop the start script is: /Applications/Docker/Docker*app/Contents/Resources/Scripts/start.sh
Run docker-machine env to check the IP your machine is assigned (informational only).
Pull the image: docker pull sunileman/pycharm
Build the image: docker build -t sunileman/pycharm .
Open another terminal and start port forwarding: socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"
Get your IP address (not docker machine's).
Run the image: docker run -it -v /tmp/.X11-unix/:/tmp/.X11-unix/ -v ~/docktest/pycharm/PycharmProjects:/root/PycharmProjects -v ~/docktest/pycharm/.Pycharm40:/root/.PyCharm40 -e DISPLAY=XXX.XX.XX.X:0 --rm sunileman/pycharm
Replace XXX.XX.XX.X with your IP address, and replace ~/docktest/pycharm/PycharmProjects and ~/docktest/pycharm/.Pycharm40 with your paths to the PyCharm repo you cloned from my github.
Click on "I do not have previous versions", click OK, then click OPEN to open the project you mounted into the docker container and find the PyCharm project to open.
Now the project has been imported
So you have the project imported into your IDE, which is running within the docker container. To prove the IDE is connected to and integrated with Spark, simply run the python file and you will see the Spark modules have been imported. You can also verify from the host, as sketched below.
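A quick host-side check that the container really does have Spark on its Python path; the container ID is whatever docker ps reports, and the check assumes the image puts pyspark on PYTHONPATH.

```bash
docker ps                                   # grab the running container's ID
docker exec -it <container-id> \
    python -c "import pyspark; print(pyspark.__file__)"
```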
05-24-2016 03:53 PM
@Constantin Stanca What do you mean by "Additionally, you need to start the JVM with something like this in order to be able to truly access the JVM remotely"? JVMs start as they normally do to use this tool.
05-24-2016 03:51 PM · 1 Kudo
@Constantin Stanca This is standard for HBase development: if you can't see what your JVM is doing, you're driving blind. Tuning the flushes for the memstore and block cache is vital for performance, and testing G1 vs CMS GC on the NameNode or HBase is vital for performance. For production remote access, yes, you always need clearance. Monitoring the JVM during development is highly useful for the NameNode and HBase.