Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
10-14-2016
11:47 PM
8 Kudos
Phoenix secondary indexes are often misunderstood. Those coming from the relational world mistakenly apply the same principles to Apache Phoenix.

A simple data model will be used for this article: an airline table with the attributes carrier ID, tail number, origin airport code, destination airport code, and flight date. The physical data model has been designed with a primary access path of carrierID and TailNum, which together form the rowkey. Note: it is important to understand the order in which the fields of the "primary access path" will be accessed. Here I have identified carrierID as the first key in the access path. If that does not match your real access pattern, the benefits of the underlying database capabilities of HBase may not be realized. Think of the primary access path not as a primary key but as the core identified access pattern for reads and writes. Secondary indexes enrich and extend this functionality. (A sketch of the airline table DDL appears after the list of options below.)

What are Apache Phoenix secondary indexes? "Secondary indexes are an orthogonal way to access data from its primary access path." Orthogonal is key here. Think of this as an intersection. Personally I would argue this is different from an RDBMS, because an RDBMS adheres to relational theory and HBase/Phoenix does not. So start training your mind to think of intersections when it comes to secondary indexes.

Use case example: for the airline table, origin airport code is starting to emerge as an alternate intersection pattern, meaning the core access path plus origin airport code are frequently used together for various processing and/or access. The options are either to create a new Phoenix table using this access pattern or to create a secondary index. Let's go with a secondary index. So what are my options?
Global Index
Single Value Local Index
Single Value Covered Index
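Before walking through each option, here is a minimal sketch of the airline table DDL assumed in the examples that follow. The column names and types are illustrative and may differ from your actual model:

CREATE TABLE IF NOT EXISTS airline (
    carrierid VARCHAR NOT NULL,
    tailnum VARCHAR NOT NULL,
    origin_airport_code VARCHAR,
    dest_airport_code VARCHAR,
    flight_date DATE
    CONSTRAINT pk PRIMARY KEY (carrierid, tailnum) );

With this table in place, the rowkey (primary access path) is carrierid + tailnum, and the secondary indexes below are built on origin_airport_code.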
Global index

Let's start with global. Global indexes are used for read-heavy use cases. Why? Global indexes are not co-located (on the same region server) with the primary table. With a global index you are therefore dispersing the read load by having the main table and the secondary index table on different region servers, serving different sets of access patterns. Think of it as load balancing. Simply create a secondary index on origin airport code:

CREATE INDEX indexname ON airline(origin_airport_code);

This new secondary index is orthogonal, meaning an intersection of the primary row key and the secondary key(s). Now the data model will support this query:

SELECT * FROM AIRLINE WHERE CARRIERID = 'A12' AND TAILNUM = '123' AND ORIGIN_AIRPORT_CODE = 'ORD'
This is a perfect point lookup. Here is another query:

SELECT * FROM AIRLINE WHERE CARRIERID = 'A12' AND TAILNUM = '123' AND ORIGIN_AIRPORT_CODE = 'ORD' AND DEST_AIRPORT_CODE = 'DFW'
This is a perfect point lookup with a server-side filter on DFW. Notice the secondary index is an INTERSECTION of the primary key. What if I ran this:

SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD'
This would run a full table scan. Why? Because it is not an intersection of the primary row key with the secondary row key. To solve this challenge you have options such as a covered index or a hint.

Hints

SELECT /*+ INDEX(AIRLINE indexname) */ * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD'

This will cause each data row to be retrieved when the index is traversed, in order to find the missing column values. Use this with care, as you may find performance is better with a covered index. You can always force the optimizer, via hints, to use the index of your choice.

Covered index

A covered index is a way to bundle data based on an alternative access path. If the index can "cover" all fields in your select statement, then only the index will be hit during the query. To continue from the previous example, I would create a covered index as follows:

CREATE INDEX indexname ON airline(origin_airport_code) INCLUDE (ALL THE FIELDS YOU WILL COVER IN YOUR SELECT STATEMENT)

Issuing SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD' will then only hit the index table.

Local index

Local indexes are used for write-heavy use cases. Why? Local indexes are co-located (on the same region server) with the primary table. "Unlike global indexes, local indexes will use an index even when all columns referenced in the query are not contained in the index. This is done by default for local indexes because we know that the table and index data co-reside on the same region server thus ensuring the lookup is local."

CREATE LOCAL INDEX indexname ON airline(origin_airport_code)

Running SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD' should take advantage of the secondary index.

That is a ton of info. It is important to understand that secondary indexes on NoSQL databases do not adhere to relational theory. There is no substitute for understanding the principles. Now go create some smart secondary indexes 🙂
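As a concrete illustration of a covered index, here is a hedged sketch using the columns from the airline model above. The index name is arbitrary and the included columns are illustrative; include whatever columns your queries actually select:

CREATE INDEX airline_origin_covered_idx ON airline(origin_airport_code) INCLUDE (dest_airport_code, flight_date);

Because Phoenix automatically carries the data table's primary key columns into a secondary index, this particular index covers every column of the sketch table, so SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD' could be served from the index table alone, without a hint.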
10-14-2016
02:51 PM
@ScipioTheYounger most definitely. Simply use the exact same process and connect to the YARN JVM. You will get all the attributes (metrics) available.
10-12-2016
04:43 PM
2 Kudos
To start pulling JMX metrics from Hadoop you first need to enable JMX via JVM parameters. Go to Ambari --> YARN --> Configs --> yarn-env and enable JMX by adding the following parameters to YARN_RESOURCEMANAGER_OPTS:

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8001

Set the port to whatever port is available on your cluster. Save the new config. This will require a restart of YARN.

Now you have JMX enabled, but you need a client to start pulling JMX metrics. Go to your data node and download the latest jmxterm-xxx-xxx-uber.jar (from http://wiki.cyclopsgroup.org/jmxterm/download.html). For this article I use jmxterm-1.0-alpha-4-uber.jar.

Once you have the jmxterm client downloaded, connect to JMX using:

java -jar jmxterm-1.0-alpha-4-uber.jar -l service:jmx:rmi:///jndi/rmi://localhost:<YOURPORT>/jmxrmi

In this example I set the port to 8012 (use whatever port you configured in YARN_RESOURCEMANAGER_OPTS above):

java -jar jmxterm-1.0-alpha-4-uber.jar -l service:jmx:rmi:///jndi/rmi://localhost:8012/jmxrmi

Now I am connected to JMX. Let's look at all the beans available by issuing the beans command. Now I can see all the beans available to pull metrics from. Say I want ResourceManager cluster metrics; that is the bean Hadoop:name=ClusterMetrics,service=ResourceManager. Let's find all the attributes available for that bean by issuing:

info -b Hadoop:name=ClusterMetrics,service=ResourceManager

All the attributes are shown, and notice there is a notification attribute. You can use this for notifications into your enterprise operational system. To pull metrics for a specific attribute within a bean, use get -b bean_name attribute. For this example I want to know the number of active NodeManagers, which is the attribute NumActiveNMs:

get -b Hadoop:name=ClusterMetrics,service=ResourceManager NumActiveNMs

So there are 4 active NodeManagers. I also want to know how many NodeManagers are down. That is the attribute NumLostNMs:

get -b Hadoop:name=ClusterMetrics,service=ResourceManager NumLostNMs

This returns 0, meaning all my NodeManagers are available. Hope this helps you get started on interacting with JMX.
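If you want to pull a single metric from a script (for example to feed a monitoring system), jmxterm can also take commands from standard input. A minimal sketch, assuming the same bean and attribute as above and assuming the -n (non-interactive) flag is available in your jmxterm build:

# assumption: jmxterm reads commands from stdin and supports -n (non-interactive) in this build
echo "get -b Hadoop:name=ClusterMetrics,service=ResourceManager NumActiveNMs" | java -jar jmxterm-1.0-alpha-4-uber.jar -l service:jmx:rmi:///jndi/rmi://localhost:8012/jmxrmi -n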
09-29-2016
09:54 PM
@Josh Elser extremely helpful article. nice work
09-28-2016
03:09 AM
@Ali Bajwa tons of great stuff in this article
09-13-2016
08:52 AM
4 Kudos
I often hear stories of wanting faster performance from Hadoop & Spark without knowing the basic statistics of one's own environment. One of the first questions I ask is whether the hardware can perform at the level being expected. The software is still bound to the physics of the hardware. If your disk IO speed is 10 MB per second, neither Hadoop/Spark nor any other software will magically make that disk faster. Again, we are bound to the physical limits of the hardware we choose. What makes Hadoop and other distributed processing engines amazing is the ability to add more "cheap" nodes to the cluster to increase performance. However, we should be aware of the maximum throughput per node. This will help level-set expectations before committing to any SLA bound to performance.

Typically I love to use the sysbench tool. SysBench is a modular, multi-threaded benchmark tool for evaluating OS parameters, i.e. CPU, RAM, IO, and mutex performance. I use sysbench prior to installing any software outside the kernel, and pre/post Hadoop/Spark upgrades. Pre/post upgrades should not have any impact on your OS benchmarks, but I play it safe. My neck is on the line when I commit to an SLA, so I would rather play it safe. I generally wrap the tests below in a shell script for ease of execution (a sketch of such a wrapper appears at the end of this article). For this article I call out each test for clarification.

RAM test

I start with testing RAM performance. This test can be used to benchmark sequential memory reads or writes; I test both. To test read performance I set the memory block size to the HDFS block size, the number of threads to the approximate concurrency expected on the cluster, and the total memory size to the average size of each workload:

sysbench --test=memory --memory-block-size=128M --memory-oper=read --num-threads=4 --memory-total-size=10G run

To test write performance I use the same settings with the operation switched to write:

sysbench --test=memory --memory-block-size=128M --memory-oper=write --num-threads=4 --memory-total-size=10G run

CPU test

Next I grab the CPU performance numbers. This test consists of calculating prime numbers up to the value specified by the --cpu-max-prime option.
I set the number of threads to the approximate concurrency expected on the cluster:

sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run

IO test

Lastly I fetch the IO performance numbers. When using fileio, you will need to create a set of test files to work on. It is recommended that the total size be larger than the available memory to ensure that file caching does not influence the workload too much - https://wiki.gentoo.org/wiki/Sysbench#Using_the_fileio_workload

Run this command to prepare test files larger than the available memory (RAM) on the box. In this example my box has 128 GB of RAM, so I set the total file size to 150G:

sysbench --test=fileio --file-total-size=150G prepare

Next I run the IO test using the files just prepared. file-test-mode is the type of workload to produce. Possible values:

seqwr - sequential write
seqrewr - sequential rewrite
seqrd - sequential read
rndrd - random read
rndwr - random write
rndrw - combined random read/write

init-rng - specifies whether the random number generator should be initialized from the timer before the test starts - http://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf
max-time - the limit for the total execution time in seconds. 0 means unlimited. Be careful: set a limit.
max-requests - the limit for the total number of requests. 0 means unlimited.

sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
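Since I mentioned wrapping these tests in a shell script, here is a minimal sketch of such a wrapper. The block size, thread counts, and file size are the illustrative values used above, so tune them to your own cluster; the fileio cleanup step removes the test files when done:

#!/bin/bash
# Hypothetical sysbench wrapper - runs the RAM, CPU, and IO tests described above.
set -e

echo "== RAM read =="
sysbench --test=memory --memory-block-size=128M --memory-oper=read --num-threads=4 --memory-total-size=10G run

echo "== RAM write =="
sysbench --test=memory --memory-block-size=128M --memory-oper=write --num-threads=4 --memory-total-size=10G run

echo "== CPU =="
sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run

echo "== IO (prepare, run, cleanup) =="
sysbench --test=fileio --file-total-size=150G prepare
sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench --test=fileio --file-total-size=150G cleanup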
08-25-2016
09:12 PM
4 Kudos
Often a timestamp is found within a Phoenix table to support various business taxonomy requirements. Generally I have seen a pattern of simply adding a timestamp datatype field. Many may not realize that part of the HBase data model rowkey design is the timestamp: when a cell is written in HBase, by default it is versioned using the timestamp at which the cell was created. This is out of the box. "Apache Phoenix provides a way of mapping HBase's native row timestamp to a Phoenix column. This leverages various optimizations which HBase provides for time ranges on the store files as well as various query optimization capabilities built within Phoenix." - https://phoenix.apache.org/rowtimestamp.html

Based on your use case, set this "column" to a value of your liking. Take advantage of the built-in data model design without adding yet another timestamp field. Commonly I find many end up creating a secondary index on these additional timestamp fields because time is almost always a variable at query time.

Let's take a look at a simple data model. We have a customer entity with various attributes. Typically you would see the following create table statement in Phoenix:

CREATE TABLE IF NOT EXISTS customer (
firstName VARCHAR,
lastName VARCHAR,
address VARCHAR,
ssn integer NOT NULL,
effective_date TIMESTAMP,
ACTIVE_IND CHAR(1)
CONSTRAINT pk PRIMARY KEY (ssn) ) KEEP_DELETED_CELLS=false;
Often, once the table is created and populated, SQL queries start pouring in. Soon a pattern is established where effective_date is most commonly used at query time, and thereafter the DBA would create a secondary index:

CREATE INDEX my_idx ON CUSTOMER(effective_date DESC);

Now determine whether this needs to be a global or local index. I won't go into details about that now. However, my point is there may be an easier way: leverage the rowkey timestamp! Instead of creating an additional column, this time I will map effective_date to the rowkey timestamp. The rowkey timestamp is baked into the HBase data model. This is how it is done:

CREATE TABLE IF NOT EXISTS customer (
firstName VARCHAR,
lastName VARCHAR,
address VARCHAR,
ssn integer NOT NULL,
effective_date TIMESTAMP NOT NULL,
ACTIVE_IND CHAR(1)
CONSTRAINT pk PRIMARY KEY (ssn, effective_date ROW_TIMESTAMP) ) KEEP_DELETED_CELLS=false;
Now you can start querying the customer table using effective_date within your queries, avoiding a secondary index. There may be use cases where a secondary index makes more sense than leveraging the core rowkey timestamp. Your use cases will drive that decision. The flexibility is there and you have choices.
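To illustrate, here is a hedged sketch of how this table might be written to and queried. The ssn and date values are hypothetical; the point is that a range predicate on the ROW_TIMESTAMP column lets Phoenix push the time range down to the HBase scan:

UPSERT INTO customer (ssn, effective_date, firstName, lastName) VALUES (123456789, TO_TIMESTAMP('2016-08-01 00:00:00'), 'Jane', 'Doe');

SELECT firstName, lastName FROM customer WHERE ssn = 123456789 AND effective_date >= TO_TIMESTAMP('2016-07-01 00:00:00') AND effective_date < TO_TIMESTAMP('2016-09-01 00:00:00');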
08-25-2016
02:08 PM
6 Kudos
In recent weeks I have tested Hadoop on various IaaS providers in the hope of finding additional performance insights. BigStep blew away my expectations in terms of Hadoop performance on IaaS. I wanted to take the testing a step further and quantify the performance gains from adding nodes to a cluster. Even for a small 1TB data set, would 5 nodes perform far better than 3? I have heard a few times that for small datasets adding more nodes may not have an impact. So this led me to test a 3-node cluster vs a 5-node cluster using a 1TB dataset. Do the extra 2 nodes increase processing and IO performance? Let's find out.

I started the testing with DFSIO, which is a distributed IO benchmark tool. Here are the results:

From 3 to 5 data nodes, IO read performance increased approx. 36%.
From 3 to 5 data nodes, IO write performance increased approx. 49%.

With 2 additional data nodes, a 49% gain in write IO throughput! Wish I had more boxes to play with. Can't imagine where this would take the measures.

Now let's compare running TeraGen on 3 and 5 data nodes. TeraGen is a map/reduce program that generates the data. From 3 to 5 data nodes, TeraGen performance increased approx. 65%.

Now let's compare running TeraSort on 3 and 5 data nodes. TeraSort samples the input data and uses map/reduce to sort the data into a total order. From 3 to 5 data nodes, TeraSort performance increased approx. 54%.

Now let's compare running TeraValidate on 3 and 5 data nodes. TeraValidate is a map/reduce program that validates that the output is sorted. From 3 to 5 data nodes, TeraValidate performance increased approx. 64%.

DFSIO write, TeraGen, TeraSort, and TeraValidate all experienced at least a 50% performance increase, and DFSIO read improved roughly 36%. So the theory that throwing more nodes at Hadoop increases performance seems to be justified, and yes, that is with a small dataset. You do have to consider your use case before making a blanket statement like that. However, the physics and software engineering principles of Hadoop support the idea of horizontal scalability, and therefore the results make complete sense to me. Hope this provided some insight into how the number of nodes maps to performance expectations. All my test results are here.
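For readers who want to reproduce this kind of comparison, here is a hedged sketch of the commands typically used for these benchmarks. The jar paths, file counts, and sizes are illustrative (exact jar names vary by HDP/Hadoop version); 10,000,000,000 rows of 100 bytes in TeraGen corresponds to roughly 1 TB:

# DFSIO write and read tests (illustrative file count/size; older Hadoop versions use -fileSize in MB instead of -size)
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -size 10GB
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -size 10GB

# TeraGen / TeraSort / TeraValidate on ~1 TB (paths are illustrative)
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen 10000000000 /benchmarks/terasort-input
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate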
08-16-2016
09:56 PM
4 Kudos
I am a junkie for faster & cheaper data processing, which is exactly why I love IaaS. My personal real-world experience with the typical IaaS providers has been that performance is generally slow. Not to say Hadoop/HBase/Spark/etc. jobs will not perform; however, you need to be familiar with what you're getting into and set realistic expectations. Recently I met the IaaS vendor BigStep. Their liquid metal offering provides all the greatness that comes with bare metal on-prem installations, but in the cloud. Options for bonded NICs & DAS had me at hello. I decided to run the same performance test I ran on AWS (article here) on BigStep. All the details of the scripts I ran are in that article.

Just a quick note: these performance articles do not advocate for or against any specific IaaS provider, nor do they reflect on the HDP software. I simply want to run a repeatable processing test with near-similar IaaS hardware profiles and gather performance statistics. Interpret the numbers as you wish.

1x Master Node Hardware Profile
CPU: 2x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (8 x 2.40 GHz)
RAM: 128 GB DDR3 ECC
Local storage disks: 1 NVMe
Disk size: 745 GB
Network bandwidth: 40 Gbps

3x Data Nodes Hardware Profile
CPU: 2x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (8 x 2.40 GHz)
RAM: 256 GB DDR3 ECC
Local storage disks: 12 HDD
Disk size: 1863 GB
Network bandwidth: 40 Gbps

TeraGen results: 11 mins 49 secs. I want to remain as objective as possible, but WOW. That is simply one of the fastest TeraGen results I have ever seen.

TeraSort results: 51 mins 12 secs. The fastest I have seen in the cloud so far. On-prem, with 1 additional node, I was able to get it down to 40 mins, so 51 mins on 1 fewer node is pretty good.

TeraValidate results: 4 mins 42 seconds. This again was the fastest performance I have seen on 1TB using TeraValidate.

I hope this gives some basic insight into the similar tests I have performed so far on various IaaS providers. In the coming weeks/months I plan on publishing performance test results using Azure and GCP.
It is extremely important to understand that zero performance tweaking has been done. This does not reflect how HDP runs on IaaS providers in general, nor does it say anything definitive about the IaaS provider. I simply want to run the TeraGen/TeraSort/TeraValidate tests with minimum tweaking, the same parameters, and similar hardware profiles, and document the results. That's it. Keep it simple.