Member since: 03-04-2015
Posts: 96
Kudos Received: 12
Solutions: 1

My Accepted Solutions

Title | Views | Posted |
---|---|---|
| 4542 | 01-04-2017 02:33 PM |
| 12352 | 07-17-2015 03:11 PM |
10-12-2020
10:08 PM
I imported our existing v5.12 workflows via command-line loaddata. They show up in the Hue 3 Oozie Editor, but not in Hue 4. We are using CDH 5.16. I find the new "everything is a document" paradigm confusing and misleading: Oozie workflows, Hive queries, Spark jobs, etc. are not physical documents in the Unix/HDFS sense that normal users would expect, with absolute paths that can be accessed and manipulated directly. The traditional-style Hue 3 UI lets one focus on working with the technology at hand, instead of imposing a Grand Unifying Design on the user.
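For reference, the import was roughly along these lines; this is only a hedged sketch, and the parcel path and fixture file name are placeholders for whatever your environment and Hue version use, not values from our setup:

# export on the source cluster (dumpdata/loaddata are standard Django management commands shipped with Hue)
/opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue dumpdata --indent 2 > hue_docs.json
# import on the target cluster
/opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue loaddata hue_docs.json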
07-01-2020
01:23 PM
The Phoenix-Hive storage handler as of v4.14.0 (CDH 5.12) seems buggy. I was able to get the Hive external wrapper table working for simple queries, after tweaking the column mapping around upper/lower-case gotchas. However, it fails when I try an "INSERT OVERWRITE DIRECTORY ... SELECT ..." command to export to a file:

org.apache.phoenix.schema.ColumnNotFoundException: ERROR 504 (42703): Undefined column. columnName=<table name>

This is a known problem that no one is apparently looking at: https://issues.apache.org/jira/browse/PHOENIX-4804
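To make the failure mode concrete, here is a minimal sketch of the setup; the table name, columns, ZooKeeper hosts, HiveServer2 URL, and output path are all hypothetical placeholders, not values from my cluster. The lower-case Hive columns are mapped to the upper-case Phoenix columns via phoenix.column.mapping. Plain SELECTs through the wrapper table work; the export step is the one that throws the ColumnNotFoundException:

cat > phoenix_export.hql <<'EOF'
-- external Hive table wrapping an existing Phoenix table
CREATE EXTERNAL TABLE jobs_hive (
  jobid  string,
  status string
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "NS.JOBS",
  "phoenix.zookeeper.quorum" = "zkhost1,zkhost2,zkhost3",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.column.mapping" = "jobid:JOBID,status:STATUS",
  "phoenix.rowkeys" = "jobid"
);

-- simple queries through the wrapper table work fine
SELECT count(*) FROM jobs_hive;

-- this export step fails with ERROR 504 (42703) Undefined column (PHOENIX-4804)
INSERT OVERWRITE DIRECTORY '/tmp/phoenix_export'
SELECT jobid, status FROM jobs_hive;
EOF

beeline -u "jdbc:hive2://hiveserver2:10000/default" -f phoenix_export.hql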
02-15-2019
06:40 PM
Try putting the extraClassPath settings into the global Spark config in Ambari instead, in the spark-defaults section (you may have to add them as custom properties). This works for us with Cloudera and Spark 1.6.
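For illustration, these are the kinds of entries we mean; the /opt/ext-libs path is just a placeholder for wherever your extra jars live. In Ambari they go under Spark > Configs, in the custom spark-defaults section:

# placeholder path - point this at your external library directory
spark.driver.extraClassPath=/opt/ext-libs/*
spark.executor.extraClassPath=/opt/ext-libs/*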
07-24-2018
01:29 PM
We were unable to access Spark app log files from either YARN or Spark History Server UIs, with error "Error getting logs at <worker_node>:8041". We can see the logs with "yarn logs" command. Turns out our yarn.nodemanager.remote-app-log-dir = /tmp/logs, but the directory was owned by "yarn:yarn". Following your instruction fixed the issue. Thanks a lot! Miles
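For anyone hitting the same thing, these were the quick checks (the application id below is just a placeholder): the aggregated logs are readable from the CLI, and the HDFS listing shows who owns the aggregation directory:

yarn logs -applicationId application_1234567890123_0001   # logs are readable from the CLI
hdfs dfs -ls -d /tmp/logs                                  # owner/group of yarn.nodemanager.remote-app-log-dir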
04-20-2018
10:04 PM
Ambari server was getting HTTP 502 trying to query timeline metrics - this fixed it for us behind corporate firewall. Thanks!
04-10-2018
08:57 AM
Hi Eric: Thanks for your explanation. Would you be able to point us to the formal licensing statements that say the same? Our corporate legal would require them in order to approve CDK (and CDS2, for that matter) for production use. Miles
04-09-2018
03:58 PM
1 Kudo
You can choose to either compile the package into your application jar, or manually install it on every Spark/YARN worker node and include that directory in your extraClassPath. Sample pom.xml on HDP 2.6.3:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.3.2.6.3.0-235</version>
  <scope>provided</scope>
</dependency>
...
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>

Use <scope>provided</scope> if you choose external installation; leave it out if you want to compile the package in. Compiling in is simpler, but if you have a large cluster or multiple Spark applications that will share such external libraries, the "provided" scope may be the better choice. In that case, you need to specify:

--conf "spark.driver.extraClassPath=...:<your ext lib path>/*"
--conf "spark.executor.extraClassPath=...:<your ext lib path>/*"

on your spark-submit command line.
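Putting it together, a hedged example of the external-installation case; the class name, application jar, and /opt/ext-libs path are placeholders:

spark-submit --master yarn-cluster --class com.example.MyApp \
  --conf "spark.driver.extraClassPath=/opt/ext-libs/*" \
  --conf "spark.executor.extraClassPath=/opt/ext-libs/*" \
  myapp.jar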
03-30-2018
09:18 AM
We are interested in using CDK instead of the base Apache version for its additional operational features. However, I cannot find any information about its pricing and licensing terms, which our corporate legal requires. There is nothing in the CDK documentation. The main Cloudera product and download pages seem to have been redesigned and no longer even provide links to the individual component distros like Spark 2 and Kafka. Pricing has also changed to be based on high-level "solutions" instead of individual software products. Since CDK is essentially CM + ZooKeeper + Kafka in a parcel, would it be licensed on the same basis as base CDH? I believe (the simpler) Cloudera Spark 2 is indeed free, but I cannot find official information on that, either. Can the Cloudera corporate folks help answer? Thanks, Miles Yao
Labels:
- Apache Kafka
- Apache Spark
03-20-2018
07:48 PM
1 Kudo
Note that phoenix-spark2.jar MUST precede phoenix-client.jar in extraClassPath, otherwise connection will fail with: java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
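As a sketch of the working ordering (using the standard HDP 2.6 paths from this thread; adjust for your layout):

SPARK_CP=/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf
spark-shell --master yarn-client \
  --conf "spark.driver.extraClassPath=$SPARK_CP" \
  --conf "spark.executor.extraClassPath=$SPARK_CP"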
02-01-2018
06:39 PM
Found the solution - place phoenix-spark2.jar before phoenix-client.jar, and everything worked. The Spark2/Scala 2.11 versions of org.apache.phoenix.spark classes need to overlay those included in the main phoenix-client.jar. Try it and let us know. 🙂
01-31-2018
06:07 PM
We tried this too on our HDP 2.6.3 cluster. Sure enough, we got the same issue:

/usr/hdp/current/spark2-client/bin/spark-shell --master yarn-client \
  --driver-memory 3g --executor-memory 3g --num-executors 2 --executor-cores 2 \
  --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/etc/hbase/conf" \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/etc/hbase/conf"

scala> val jobsDF = spark.read.format("org.apache.phoenix.spark").options(Map(
     |   "table" -> "ns.Jobs", "zkUrl" -> zkUrl)).load
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,file:/usr/hdp/2.6.3.0-235/phoenix/phoenix-4.7.0.2.6.3.0-235-client.jar!/ivysettings.xml will be used
2018-01-30 16:24:33,254 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x79bb14d8 connecting to ZooKeeper ensemble=zkhost1:2181,zkhost2:2181,zkhost3:2181
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1575)
...
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 83 more
Tweaking extraClassPath and --jars using phoenix-client.jar, phoenix-4.7.0.2.6.3.0-235-spark2.jar, and spark-sql_2.11-2.2.0.2.6.3.0-235.jar made no difference. I am inclined to agree with the other poster that Hortonworks' phoenix-client.jar is not actually Spark2-compatible, the release notes notwithstanding.
01-30-2018
06:48 PM
If you look outside the Hortonworks distribution 😉, Cloudera is pushing Kudu, which is supposed to be a middle ground between Hive and Phoenix. There is also Splice Machine, an MVCC SQL engine on top of HBase that is now open-sourced. Good luck!
09-14-2017
02:54 PM
When you run OfflineMetaRepair, you will most likely run it from your own userid or as root. You may then get opaque errors like "java.lang.AbstractMethodError: org.apache.hadoop.hbase.ipc.RpcScheduler.getWriteQueueLength()". If you check HDFS, you may see that the meta directory is no longer owned by hbase:

$ hdfs dfs -ls /hbase/data/hbase/
Found 2 items
drwxr-xr-x   - root  hbase          0 2017-09-12 13:58 /hbase/data/hbase/meta
drwxr-xr-x   - hbase hbase          0 2016-06-15 15:02 /hbase/data/hbase/namespace

Manually running chown -R on it and restarting HBase fixed it for me.
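Concretely, the repair was along these lines (a hedged sketch; run it as a user allowed to change HDFS ownership):

# give the meta directory back to the hbase user, then restart HBase
sudo -u hdfs hdfs dfs -chown -R hbase:hbase /hbase/data/hbase/meta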
08-22-2017
08:38 PM
Thanks for the write-up. Does the above imply that the newly split region will always stay on the same RS, or is it configurable? If it's always local, then won't the load on the "hot" region server just get heavier and heavier, until the global load balancer thread kicks in? Shouldn't HBase just create the new daughter regions on the least-loaded RS instead? There was a lot of discussion related to this in HBASE-3373, but it isn't clear what the resulting implementation was.
08-13-2017
09:20 PM
This feature behaves unexpectedly when the table has been migrated from another HBase cluster. In that case, the table creation time can be much later than the row timestamps of all of its data. A flashback query meant to select an earlier subset of the data returns the following failure instead:

scala> df.count
2017-08-11 20:12:40,550 INFO [main] mapreduce.PhoenixInputFormat: UseSelectColumns=true, selectColumnList.size()=3, selectColumnList=TIMESTR,DBID,OPTION
2017-08-11 20:12:40,550 INFO [main] mapreduce.PhoenixInputFormat: Select Statement: SELECT "TIMESTR","DBID","OPTION" FROM NS.USAGES
2017-08-11 20:12:40,558 ERROR [main] mapreduce.PhoenixInputFormat: Failed to get the query plan with error [ERROR 1012 (42M03): Table undefined. tableName=NS.USAGES]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#13L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#16L])
+- Project
+- Scan ExistingRDD[TIMESTR#10,DBID#11,OPTION#12]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
...

This apparently means that Phoenix considers the table nonexistent at that point in time. I tested the same approach in sqlline and, sure enough, the table is missing from "!tables". Any workaround?
08-03-2017
09:31 PM
Same question about accessing Phoenix tables from full HDP 2.6 Spark 2 SQL.
07-21-2017
01:29 PM
That's good news. But I think the requester would like to know when Cloudera plans to integrate Spark 2 into CDH, not as a separate install (like what Hortonworks does). Thanks, Miles
07-12-2017
06:35 PM
On HDP 2.6, appending $CLASSPATH seems to break the Spark2 interpreter with:

org.apache.zeppelin.interpreter.InterpreterException: Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

Is the included Phoenix-Spark driver (phoenix-spark-4.7.0.2.6.1.0-129.jar) certified to work with Spark2? I thought it was the preferred route, rather than going through JDBC. Thanks!
07-07-2017
03:45 PM
I had the same problem with a valid Linux/HDFS user as Ambari ID, the solution worked - thanks!
05-22-2017
06:36 PM
Debian 8 (Jessie) has been the current stable version for a year now. When do you plan to support it? Is there any known issue that blocks its adoption?
05-22-2017
06:09 PM
We have HDP 2.4 on Debian 7. There is no /usr/lib/python2.6/site-packages/ambari_server/os_type_check.sh installed - only os_check_type.py. And all it checks is whether the current node's OS matches the cluster's, not whether the OS version is supported. /usr/lib/ambari-server/lib/ambari_commons/resources/os_family.json seems to list the supported OS versions (e.g. RedHat 6/7, Debian 7, Ubuntu 12/14), which matches the documentation.
03-16-2017
09:20 AM
2 Kudos
First, thanks for the helpful, detailed explanation. We have a similar issue migrating from the default embedded DB to a separate PostgreSQL instance. Some comments:
- The documentation needs to be clearer: the criteria you listed for determining "embeddedness" are not intuitive and could not have been inferred from the documentation. Your write-up should have been included right there.
- The embeddedness criteria seem over-strict. Insisting that the DB be off-cluster is based on the old 3-tier architecture assumption; the Hadoop architectural principle, on the other hand, is about co-hosting data and software. On the practical side, basing such a central component off-cluster just seems needlessly inefficient and difficult to manage. Can't the best practice be to use one dedicated node for CM, CMS, and the DB? Can Cloudera provide some guidelines?
- For production use, the external DB option requires too many manual steps across multiple services. Can Cloudera Manager provide more central administration and integration, including transparent migration from the embedded DB? This again requires the DB node to be part of the cluster under CM management.

Thanks, Miles Yao
Tags:
- helpful
01-19-2017
09:33 PM
Can you elaborate a bit on how to set up the environment properly in the shell wrapper before calling spark-submit? Which login should the action run as (owner/yarn/spark/oozie)? We had a lot of problems getting the setup right when we implemented shell actions that wrap Hive queries (to process the query output). spark-submit itself is a shell wrapper that does a lot of environment initialization, so I imagine it won't be smooth. Thanks! Miles
01-04-2017
02:33 PM
We were able to install the official parcel. The only problem we encountered was that all the signature files in the repository have the extension .sha1, while our CM (5.8.3) was expecting .sha. Manually renaming them allowed the install to complete.
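For anyone else hitting this, the rename was along these lines (a hedged sketch; /opt/cloudera/parcel-repo is the usual CM parcel repository path, adjust if your repo lives elsewhere):

cd /opt/cloudera/parcel-repo
# rename every .sha1 signature file to the .sha extension CM expects
for f in *.sha1; do mv "$f" "${f%.sha1}.sha"; done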
12-13-2016
01:18 PM
Hi Cloudera folks: The new official Spark2 release looks identical to the beta version released last month. Is there any difference to expect if we already have the beta installed? Should we re-install? Thanks, Miles
Labels:
- Apache Spark
11-08-2016
10:03 AM
Yes, that works. "CSD file" sounds like a text config file; a brief note on the instruction page that it is actually a JAR would have made this clearer. Thanks again. Miles
11-03-2016
03:01 PM
The CSD download link given actually points to a jar file with the following directory tree:
$ jar tvf SPARK2_ON_YARN-2.0.0.cloudera.beta1.jar
     0 Wed Sep 21 17:24:28 CDT 2016 descriptor/
     0 Wed Sep 21 17:24:28 CDT 2016 scripts/
     0 Wed Sep 21 17:24:28 CDT 2016 aux/
     0 Wed Sep 21 17:24:28 CDT 2016 aux/client/
  3312 Wed Sep 21 17:24:28 CDT 2016 images/icon.png
  1711 Wed Sep 21 17:24:28 CDT 2016 aux/client/spark-env.sh
     0 Wed Sep 21 17:24:28 CDT 2016 images/
 18456 Wed Sep 21 17:24:28 CDT 2016 descriptor/service.sdl
     0 Wed Sep 21 17:50:46 CDT 2016 meta/
    20 Wed Sep 21 17:50:46 CDT 2016 meta/version
     0 Wed Sep 21 17:50:58 CDT 2016 META-INF/
  1813 Wed Sep 21 17:24:28 CDT 2016 scripts/control.sh
 12362 Wed Sep 21 17:24:28 CDT 2016 scripts/common.sh
   104 Wed Sep 21 17:50:58 CDT 2016 META-INF/MANIFEST.MF
Now, when the documentation specifies "install Spark2 CSD", which file(s) is it referring to exactly? Just descriptor/service.sdl, or the entire jar to /opt/cloudera/csd? The two scripts above look like operational scripts for CM.
Thanks,
Miles Yao
Labels:
- Apache Spark
09-15-2016
12:26 PM
CDH 5.7.1 - Same issue, but for configuring app-specific log4j.

Working spark-submit command line:

--master yarn-cluster --files hdfs:/user/myao/config/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --class <class> <jar>

Cannot get the log4j portion above to work in the Spark action - everything else is OK:

<action name="spark-7844">
  <spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>spark.driver.extraJavaOptions</name>
        <value>-Dlog4j.configuration=log4j.properties</value>
      </property>
      <property>
        <name>spark.executor.extraJavaOptions</name>
        <value>-Dlog4j.configuration=log4j.properties</value>
      </property>
    </configuration>
    <master>yarn-cluster</master>
    <mode>cluster</mode>
    <name>...</name>
    <class>...</class>
    <jar>...</jar>
    <spark-opts>--executor-memory 2G --files hdfs:/user/myao/config/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" </spark-opts>
  </spark>
  <ok to="End"/>
  <error to="Kill"/>
</action>

Driver stderr:

Using properties file: null
Warning: Ignoring non-spark config property: "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
Warning: Ignoring non-spark config property: "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
Parsed arguments:
  master                  yarn-cluster
  deployMode              cluster
  executorMemory          2G
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    ........
  driverExtraLibraryPath  /opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   hdfs:/user/myao/config/log4j.properties
  pyFiles                 null
  archives                null
  mainClass               ....
  primaryResource         ....
  name                    ....
  childArgs               []
  jar                     .............
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Thanks!
08-26-2016
09:09 AM
Thanks for your detailed reply. That's a valid and understandable concern. We chose Cloudera for our production Hadoop platform precisely for the quality of integration and the maturity you offer. We as users simply need some clarity from the vendor about observed feature discrepancies from the official distro, especially for such a critical component as Spark. Are there any other discrepancies or customizations that we should be aware of? Could Cloudera be more transparent in its release notes whenever features are removed or modified from the official open-source versions? Searching for "SparkR" in the CDH 5.7 release notes for Spark finds 4 JIRAs, which would give one the impression that SparkR is included. Thanks again, Miles