09-26-2016
08:19 AM
2 Kudos
In some application use cases, developers want to save a Spark DataFrame directly into Phoenix instead of writing into HBase as an intermediate step. In those cases, we can use the Apache Phoenix-Spark plugin. The related API is very simple:

df.save("org.apache.phoenix.spark", SaveMode.Overwrite,
  Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "****:2181:/****"))
However, we need to pay attention that in Apache Phoenix all column names are by default treated as uppercase unless you surround them with quotation marks "". Therefore, if you have specified lowercase column names in your Phoenix schema, you have to transform the column names in Spark first. The example code is as follows:

val oldNames = df.columns
val newNames = oldNames.map(name => col(name).as("\"" + name + "\""))
val df2 = df.select(newNames:_*)
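Putting the two snippets together, here is a minimal end-to-end sketch, assuming Spark 1.6 with the phoenix-spark plugin on the classpath; the table name OUTPUT_TABLE, the column names, and the zkUrl value are placeholders, and sc is an existing SparkContext:

import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.sql.functions.col

val sqlContext = new SQLContext(sc)

// Hypothetical DataFrame with lowercase column names.
val df = sqlContext.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("id", "name")

// Quote every column name so Phoenix does not upper-case it.
val quoted = df.columns.map(name => col(name).as("\"" + name + "\""))
val dfQuoted = df.select(quoted: _*)

// OUTPUT_TABLE is assumed to already exist in Phoenix with matching quoted, lowercase columns.
dfQuoted.save("org.apache.phoenix.spark", SaveMode.Overwrite,
  Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "your-zk-host:2181:/your-znode"))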
09-06-2016
11:57 PM
1 Kudo
It is a very common operation to do a prefix scan in HBase. For example, when reading an HBase table, we may use the following table scan API:

val prefixFilter = new PrefixFilter(prefix)
val scan: Scan = new Scan()
scan.setFilter(prefixFilter)
However, the code above may turn out to be very slow when scanning a large HBase table. The reason is that we need to set the start row before using PrefixFilter. Without setting the start row properly, the HBase scan may have to begin at the very first region and waste a lot of time getting to the right place. The recommended way is to use setRowPrefixFilter(byte[] rowPrefix); from its source code below, we can see that it sets up the start and stop rows for us before the table scan:

public Scan setRowPrefixFilter(byte[] rowPrefix) {
  if (rowPrefix == null) {
    setStartRow(HConstants.EMPTY_START_ROW);
    setStopRow(HConstants.EMPTY_END_ROW);
  } else {
    this.setStartRow(rowPrefix);
    this.setStopRow(calculateTheClosestNextRowKeyForPrefix(rowPrefix));
  }
  return this;
}

In addition, if you want to load an HBase table into Spark, you can also use the Spark-HBase connector, which supports Spark accessing HBase tables as an external data source. Its buildScan() method performs the HBase table scan and returns an RDD as the result. Its related source code is here. Thanks to Weiqing Yang and Ted Yu for the kind help.
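For reference, here is a minimal sketch of a prefix scan that relies on setRowPrefixFilter, assuming the HBase 1.x client API; the table name my_table and the prefix rowPrefix_ are hypothetical:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("my_table"))

// setRowPrefixFilter sets both the start and stop rows, so the scan
// jumps straight to the first matching region instead of region one.
val scan = new Scan().setRowPrefixFilter(Bytes.toBytes("rowPrefix_"))

val scanner = table.getScanner(scan)
try {
  scanner.asScala.foreach(result => println(Bytes.toString(result.getRow)))
} finally {
  scanner.close()
  table.close()
  connection.close()
}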
05-06-2016
11:41 PM
5 Kudos
When Storm developers want to share data across multiple bolts or cache a large amount of state information in a single bolt, one of the common choices is to use HBase. In an unsecured HDP cluster, the related code in the HBase bolt is very intuitive:

public class myHBaseBolt implements IRichBolt {
    ...
    private OutputCollector collector;
    private Connection connection;
    private Table myTable;
    ...
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            this.connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            this.myTable = connection.getTable(TableName.valueOf(MY_TABLE_NAME));
            ...
        } catch (Exception e) {
            ...
        }
    }
}

However, in a Kerberized HDP cluster, we need to configure quite a bit of additional information in our code to enable a secured connection. First, we need to configure the Storm keytab and principal for the HBase client in the Storm bolt. For example, in the myTopology.java code:

Map<String, Object> mapHbase = new HashMap<String, Object>();
mapHbase.put("storm.keytab.file", "/your/storm/keytab/path");
mapHbase.put("storm.kerberos.principal", "yourStormPrincipalName");
Config conf = new Config();
conf.put("hbase.config", mapHbase);
StormSubmitter.submitTopology("myTopology", conf, builder.createTopology());
Second, in the HBase bolt, we need to use the keytab information to set up the secured connection:

public class myHBaseBolt implements IRichBolt {
    ...
    private OutputCollector collector;
    private Connection connection;
    private Table myTable;
    // Assumed fields: the UserProvider returned by HBaseSecurityUtil.login(), and the
    // config key under which the topology stored the HBase settings ("hbase.config" above).
    private UserProvider provider;
    private String configKey = "hbase.config";

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            final Configuration hbConfig = HBaseConfiguration.create();
            Map<String, Object> conf = (Map<String, Object>) stormConf.get(this.configKey);
            for (String key : conf.keySet()) {
                hbConfig.set(key, String.valueOf(conf.get(key)));
            }
            // Log in to HBase with the Storm keytab and principal.
            this.provider = HBaseSecurityUtil.login(conf, hbConfig);
            // Create the connection and the table handle as the authenticated user.
            this.connection =
                provider.getCurrent().getUGI().doAs(new PrivilegedExceptionAction<Connection>() {
                    @Override
                    public Connection run() throws Exception {
                        return ConnectionFactory.createConnection(hbConfig);
                    }
                });
            this.myTable =
                provider.getCurrent().getUGI().doAs(new PrivilegedExceptionAction<Table>() {
                    @Override
                    public Table run() throws Exception {
                        return connection.getTable(TableName.valueOf(MY_TABLE_NAME));
                    }
                });
            ...
        } catch (Exception e) {
            ...
        }
    }
}
Basically, what we are doing here is logging in to HBase from Storm with HBaseSecurityUtil.login(), and then creating the connection and table handle as the authenticated service user.
01-05-2016
07:35 PM
8 Kudos
When I played with Kafka-Storm streaming analysis, I often hit the error java.lang.RuntimeException: Could not find leader nimbus from seed hosts, and I could not even kill the related topology through the Storm UI or the command line. I think this is because of stale topology data in Storm and ZooKeeper. My solution is to do the directory cleanup manually. The idea is borrowed from the bash files in our amazing "trucking demo".

Step 1: Clean up files in the Storm folder

# paths from an Ambari install may start with /mnt
if [ -d "/mnt/hadoop/storm" ]; then
rm -rf /mnt/hadoop/storm/supervisor/isupervisor/*
rm -rf /mnt/hadoop/storm/supervisor/localstate/*
rm -rf /mnt/hadoop/storm/supervisor/stormdist/*
rm -rf /mnt/hadoop/storm/supervisor/tmp/*
rm -rf /mnt/hadoop/storm/workers/*
rm -rf /mnt/hadoop/storm/workers-users/*
fi
#paths in sandbox may start with /hadoop
if [ -d "/hadoop/storm" ]; then
rm -rf /hadoop/storm/supervisor/isupervisor/*
rm -rf /hadoop/storm/supervisor/localstate/*
rm -rf /hadoop/storm/supervisor/stormdist/*
rm -rf /hadoop/storm/supervisor/tmp/*
rm -rf /hadoop/storm/workers/*
rm -rf /hadoop/storm/workers-users/*
rm -rf /hadoop/storm/nimbus/stormdist/*
fi
Step 2: Clear your ZooKeeper state

1) Stop Storm.
2) Run the following commands:

/usr/hdp/current/zookeeper-client/bin/zkCli.sh
rmr /storm
quit

Step 3: Restart Storm

This solution works for me for most nimbus failure problems. Thanks to Ali Bajwa for the kind help!
01-05-2016
07:18 PM
2 Kudos
In a Kerberized HDP cluster, I use FreeIPA as LDAP and run HDP 2.3.2 including Storm, Kafka and Ranger. After submitting a Storm topology as a normal user, I kept getting the following runtime error:

2015-12-30 18:31:53.274 o.a.c.ConnectionState [ERROR] Authentication failed
...
2015-12-30 18:31:53.286 o.a.z.ClientCnxn [WARN] SASL configuration failed: javax.security.auth.login.LoginException: No password provided Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
...
2015-12-30 18:31:53.328 b.s.util [ERROR] Async loop died! java.lang.RuntimeException: java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode ...

This problem comes from Kafka and Ranger. As a normal failure recovery mechanism, Kafka keeps trying to create the topic if it cannot find the topic named in the Kafka producer. In a Kerberized environment, this CREATE request is sent to Ranger for approval. However, in HDP 2.3.2, Ranger 0.5.0.2.3 cannot recognize the CREATE action from Kafka, so the request is blocked. This Ranger problem is fixed in the latest HDP 2.3.4. To work around it temporarily, you only need to restart Kafka. It also helps to have a more powerful cluster and more memory. This solution is still valid in HDP 2.5. Thanks to Sumit Mohanty, Madhan Neethiraj and Sriharsha Chintalapani for the kind help!
10-28-2015
09:44 PM
2 Kudos
Recently I met an HBase connection problem from Storm. The error message is:

java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory

It happens because when we enable Phoenix in HBase, HBase adds some additional Phoenix-related properties (e.g. org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory) to hbase-site.xml. Users are supposed to add the corresponding dependencies to their pom.xml, otherwise the HBase connection will fail even if the user does not use Phoenix at all. To avoid this "class not found" error, the easiest way is to make sure Phoenix is turned off when connecting to HBase from Storm. This solution is still valid in HDP 2.5.