Member since 05-30-2018 · 1322 Posts · 715 Kudos Received · 148 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4062 | 08-20-2018 08:26 PM |
|  | 1957 | 08-15-2018 01:59 PM |
|  | 2383 | 08-13-2018 02:20 PM |
|  | 4130 | 07-23-2018 04:37 PM |
|  | 5043 | 07-19-2018 12:52 PM |
10-14-2016
11:47 PM
8 Kudos
Phoenix secondary indexes are often misunderstood. Those coming from the relational world mistakenly apply the same principles to Apache Phoenix. A simple data model will be used for this article: an airline table with attributes carrier ID, tail number, origin airport code, destination airport code, and flight date. The physical data model has been designed with a primary access path of carrierID and TailNum, which is essentially the rowkey.

Note: it is important to understand the order of the "primary access path", i.e. the order in which fields will be accessed. Here I have identified carrierID as the first key in the access path. If that does not match your actual access pattern, the benefits of the underlying database, HBase, may not be realized. Think of the primary access path not as a primary key but as the core identified access pattern for reads and writes. Secondary indexes enrich and extend this functionality.

What are Apache Phoenix secondary indexes? "Secondary indexes are an orthogonal way to access data from its primary access path." Orthogonal is the key word here. Think of this as an intersection. Personally I would argue this is different from an RDBMS, because an RDBMS adheres to relational theory; HBase/Phoenix does not. So start training your mind to think of intersections when it comes to secondary indexes.

Use case example: for the airline table, origin airport code is starting to emerge as an alternate intersection pattern, meaning the core access path plus origin airport code are frequently used together for various processing and/or access. The options are either to create a new Phoenix table using this access path pattern, or to create a secondary index. Let's go with a secondary index.

So what are my options?
- Global Index
- Single Value Local Index
- Single Value Covered Index
Global index

Let's start with global. Global indexes are used for read-heavy use cases. Why? Global indexes are not co-located with the primary table on the same region server. Therefore, with global indexes you are dispersing the read load by having the main table and the secondary index table on different region servers, serving different sets of access patterns. Think of it as load balancing. Simply create a secondary index on origin airport code:

```sql
CREATE INDEX indexname ON airline(origin_airport_code);
```

This new secondary index is orthogonal, meaning an intersection of the primary row key and the secondary key(s). Now the data model will support this query:

```sql
SELECT * FROM AIRLINE
WHERE CARRIERID = 'A12' AND TAILNUM = '123' AND ORIGIN_AIRPORT_CODE = 'ORD';
```
A perfect point lookup. Likewise:

```sql
SELECT * FROM AIRLINE
WHERE CARRIERID = 'A12' AND TAILNUM = '123'
  AND ORIGIN_AIRPORT_CODE = 'ORD' AND DEST_AIRPORT_CODE = 'DFW';
```
A perfect point lookup with a server-side filter on DFW. Notice the secondary index is an INTERSECTION with the primary key. What if I ran this:

```sql
SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD';
```
This would run a full table scan. Why? This is not an intersection of the primary row key with the secondary row key. To solve this challenge you have options such as a covered index or a hint.

Hints

```sql
SELECT /*+ INDEX(AIRLINE indexname) */ * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD';
```

This will cause each data row to be retrieved, when the index is traversed, to find the missing column values. Use this with care, as you may find performance is better with a covered index. You can always force the optimizer to use the index of your choice with a hint.

Covered index

A covered index is a way to bundle data based on an alternative access path. If the index can "cover" all fields in your SELECT statement, then only the index table will be hit during the query. To continue from the previous example, I would create a covered index as follows:

```sql
CREATE INDEX indexname ON airline(origin_airport_code)
  INCLUDE (<all the fields your SELECT statement will reference>);
```

Issuing

```sql
SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD';
```

will then only hit the index table.

Local index

Local indexes are used for write-heavy use cases. Why? Local indexes are co-located with the primary table on the same region server. "Unlike global indexes, local indexes will use an index even when all columns referenced in the query are not contained in the index. This is done by default for local indexes because we know that the table and index data co-reside on the same region server thus ensuring the lookup is local."

```sql
CREATE LOCAL INDEX indexname ON airline(origin_airport_code);
```

Running

```sql
SELECT * FROM AIRLINE WHERE ORIGIN_AIRPORT_CODE = 'ORD';
```

should take advantage of the secondary index.

That is a ton of info. It is important to understand that secondary indexes on NoSQL databases do not adhere to relational theory. There is no substitute for understanding the principles. Now go create some smart secondary indexes 🙂
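To make the "intersection" idea concrete, here is a toy model in Python of how a global secondary index relates to the main table. This is my own illustration, not Phoenix code: the real index tables live in HBase and are maintained by Phoenix automatically. It shows why an uncovered index lookup needs a second hop back to the main table, while a covered index answers from the index alone.

```python
# Toy model of Phoenix-style global secondary indexes (illustration only).
# Main table: rowkey (carrier_id, tail_num) -> row, mirroring the article's
# primary access path.
main_table = {
    ("A12", "123"): {"origin": "ORD", "dest": "DFW", "flight_date": "2016-10-14"},
    ("A12", "456"): {"origin": "ORD", "dest": "JFK", "flight_date": "2016-10-15"},
    ("B77", "789"): {"origin": "SEA", "dest": "ORD", "flight_date": "2016-10-15"},
}

# Global index: indexed value -> primary rowkeys. This is the "intersection"
# back to the primary access path.
origin_index = {}
for rowkey, row in main_table.items():
    origin_index.setdefault(row["origin"], []).append(rowkey)

def lookup_uncovered(origin):
    """Traverse the index, then fetch each full row from the main table
    (the extra hop a hinted, non-covered query pays)."""
    return [main_table[rk] for rk in origin_index.get(origin, [])]

# Covered index: the index entries carry the INCLUDEd columns themselves,
# so the main table is never touched.
covered_index = {}
for rowkey, row in main_table.items():
    covered_index.setdefault(row["origin"], []).append({"dest": row["dest"]})

def lookup_covered(origin):
    return covered_index.get(origin, [])

print(lookup_uncovered("ORD"))  # two full rows, via main-table fetches
print(lookup_covered("ORD"))    # destinations only, index table alone
```

The trade-off the article describes falls out directly: the covered index duplicates data (write amplification) to avoid the main-table hop on reads.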
10-14-2016
02:51 PM
@ScipioTheYounger most definitely. Simply use the exact same process and connect to the YARN JVM; you will get all available attributes (metrics).
10-14-2016
04:03 AM
On HDP 2.5 with SmartSense 1.3, in the Zeppelin instance for HDFS dashboards I am unable to see any reports. I see errors for each report:

```
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
```

Any insights?
Labels: Hortonworks SmartSense
10-14-2016
03:51 AM
If you want some additional info on JMX, here you go: https://community.hortonworks.com/content/kbentry/61188/enable-jmx-metrics-on-hadoop-using-jmxterm.html
10-14-2016
03:22 AM
I found the issue. I was using PQS on a node where PQS was not installed. I created the secondary index pointing to the PQS instance installed by Ambari and all is good.
10-14-2016
01:51 AM
I have followed the directions here on enabling Phoenix secondary indexes and restarted all of HBase (master/region servers) and Phoenix. It continues to fail during secondary index creation:

```
Error: Error -1 (00000) : Error while executing SQL "CREATE INDEX orig_airlineix ON airline (origin) ASYNC": Remote driver error: RuntimeException: java.sql.SQLException: ERROR 1029 (42Y88): Mutable secondary indexes must have the hbase.regionserver.wal.codec property set to org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec in the hbase-sites.xml of every region server. tableName=ORIG_AIRLINEIX -> SQLException: ERROR 1029 (42Y88): Mutable secondary indexes must have the hbase.regionserver.wal.codec property set to org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec in the hbase-sites.xml of every region server. tableName=ORIG_AIRLINEIX (state=00000,code=-1)
```

I have verified in hbase-site.xml:

```xml
<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>
```
Any insights would be appreciated.
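Since the error names a specific property, one quick sanity check is to confirm what value each region server's hbase-site.xml actually carries. Here is a small sketch of my own (not from this thread) that parses a Hadoop-style configuration file and extracts the WAL codec property; the `sample` string is a hypothetical file for demonstration.

```python
# Sketch: check an hbase-site.xml for the WAL codec Phoenix mutable
# secondary indexes require. Illustration only; point it at the real
# hbase-site.xml on each region server.
import xml.etree.ElementTree as ET

def wal_codec(hbase_site_xml: str):
    """Return the value of hbase.regionserver.wal.codec, or None if unset."""
    root = ET.fromstring(hbase_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "hbase.regionserver.wal.codec":
            return prop.findtext("value")
    return None

sample = """<configuration>
  <property>
    <name>hbase.regionserver.wal.codec</name>
    <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
  </property>
</configuration>"""

print(wal_codec(sample))
```

Running this against the file on every region server (not just one node) would rule out a host that missed the config push.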
Labels: Apache HBase, Apache Phoenix
10-14-2016
01:29 AM
1 Kudo
Can you please tell us a little about what you are doing here? It seems from the log you are using WebHDFS as user admin? If that is correct, please confirm the user admin exists at the OS level, or use another user which does exist.
10-13-2016
08:14 PM
I have started the HBase REST API using `hbase rest start -p 20050 --infoport 20051`:

```
....
2016-10-13 19:53:43,979 INFO  [main] http.HttpServer: Jetty bound to port 20051
2016-10-13 19:53:43,979 INFO  [main] mortbay.log: jetty-6.1.26.hwx
2016-10-13 19:53:44,260 INFO  [main] mortbay.log: Started SelectChannelConnector@0.0.0.0:20051
2016-10-13 19:53:44,260 INFO  [main] mortbay.log: jetty-6.1.26.hwx
2016-10-13 19:53:44,261 INFO  [main] mortbay.log: Started SelectChannelConnector@0.0.0.0:20050
2016-10-13 19:55:38,346 INFO  [135221990@qtp-1263634860-3] logs: Aliases are enabled
```
I am able to access the infoport web UI on 20051. When I issue a curl:

```
HW12334:~ smanjee$ curl -v -X GET -H "Accept: text/xml" "http://xxxxx.xxxx.hortonworks.com:20550/"
*   Trying xxx.xx.xxx.220...
* connect to xxx.xx.xxx.220 port 20550 failed: Connection refused
* Failed to connect to xxxxx.xxxx.hortonworks.com port 20550: Connection refused
* Closing connection 0
curl: (7) Failed to connect to xxxx.xxxx.hortonworks.com port 20550: Connection refused
```

Any ideas? This is on a non-kerberized cluster, HDP 2.5.
Labels: Apache HBase
10-12-2016
04:43 PM
2 Kudos
To start pulling JMX metrics from Hadoop you first need to enable JMX via JVM parameters. Go to Ambari --> YARN --> Configs --> yarn-env and enable JMX by adding the following params to YARN_RESOURCEMANAGER_OPTS:

```
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8001
```

Set the port to whatever port is available on your cluster. Save the new config; this will require a restart of YARN.

Now you have JMX enabled, but you need a client to start pulling JMX metrics. Go to your data node and download the latest jmxterm-xxx-xxx-uber.jar (from http://wiki.cyclopsgroup.org/jmxterm/download.html). For this article I use jmxterm-1.0-alpha-4-uber.jar.

Once you have the jmxterm client downloaded, connect to JMX using:

```
java -jar jmxterm-1.0-alpha-4-uber.jar -l service:jmx:rmi:///jndi/rmi://localhost:<YOURPORT>/jmxrmi
```

In this example I set the port to 8012:

```
java -jar jmxterm-1.0-alpha-4-uber.jar -l service:jmx:rmi:///jndi/rmi://localhost:8012/jmxrmi
```

Now I am connected to JMX. Let's look at all the beans available by issuing the beans command. Now I see all the beans available to pull metrics from. Let's say I want to pull ResourceManager cluster metrics; that is the bean Hadoop:name=ClusterMetrics,service=ResourceManager. Let's find all the attributes available for that bean by issuing:

```
info -b Hadoop:name=ClusterMetrics,service=ResourceManager
```

All the attributes are shown, and notice there is a notification attribute. You can use this for notification into your enterprise operational system. To pull metrics for a specific attribute within a bean, use get -b bean_name attribute. For this example I want to know the number of active NodeManagers, which is the attribute NumActiveNMs:

```
get -b Hadoop:name=ClusterMetrics,service=ResourceManager NumActiveNMs
```

So there are 4 active NodeManagers. I also want to know how many NodeManagers are down; that is the attribute NumLostNMs:

```
get -b Hadoop:name=ClusterMetrics,service=ResourceManager NumLostNMs
```

This returns 0, meaning all my NodeManagers are available. Hope this helps you get started on interacting with JMX.
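Once the get commands above work interactively, the natural next step is scripting them. Here is a small helper of my own to parse the attribute lines jmxterm prints, e.g. `NumActiveNMs = 4;` (the exact output format is an assumption based on jmxterm 1.0; verify against your version before relying on it for alerting).

```python
# Sketch: turn a jmxterm "get -b <bean> <attr>" output line into a
# (name, value) pair for scripting/alerting. Handles integer-valued
# metrics like NumActiveNMs and NumLostNMs.
import re

def parse_jmxterm_get(line: str):
    """Parse e.g. 'NumActiveNMs = 4;' into ('NumActiveNMs', 4)."""
    m = re.match(r"\s*(\w+)\s*=\s*(-?\d+)\s*;?\s*$", line)
    if not m:
        raise ValueError(f"unrecognized jmxterm output: {line!r}")
    return m.group(1), int(m.group(2))

print(parse_jmxterm_get("NumActiveNMs = 4;"))
print(parse_jmxterm_get("NumLostNMs = 0;"))
```

A cron job could pipe `get` commands through jmxterm's non-interactive mode and feed each output line to this parser, raising an alert when NumLostNMs is nonzero.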