Member since: 01-05-2016
Posts: 60
Kudos Received: 42
Solutions: 7

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 264 | 10-18-2024 02:15 PM |
| | 1223 | 10-21-2019 05:16 AM |
| | 4485 | 01-29-2018 07:05 AM |
| | 3313 | 06-27-2017 06:42 AM |
| | 39572 | 05-26-2016 04:05 AM |
08-15-2016
05:15 AM
Thank you. Useful insight and crystal clear argumentation, as usual from you. In the meantime I had the chance to study a bit more, and in the end I came to a conclusion that matches your considerations, so I'm glad I apparently moved in the right direction. As a matter of fact, I looked at the Open Source project http://opentsdb.net and saw that, generally speaking, the approach it uses is the last one you explained. To give a practical example, in my case:
- A new record every week for the same Customer Entity
- Therefore, Column Versioning is NOT used at all (as you suggested)
- A "speaking" record key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This kind of key is not monotonically increasing, because the "CUST_ID" part is stable, so the approach should also be good from a "Table Splitting" perspective: when the table grows it will split up evenly, and all the splits will take care of a share of the future inserts, balancing the load across the machines evenly
- The same set of columns for each record, containing the newly sampled value of that field for that week, e.g. "<Total progressive time used Service X>"
This is the approach I used in the end (a minimal sketch follows at the end of this post). It has nothing to do with my original idea of using Versions, but it perfectly matches the last approach you described in your answer. Regarding the fixed values (e.g. "Name", "Surname"), I decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I'm planning to modify this structure soon and move the fixed values into another table (Hive or HBase, I don't know yet), then pick up the information I need at the moment it's needed (for instance, during data processing I'll join the relevant anagraphic data into the relevant DataFrames). I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
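A minimal sketch of the weekly insert pattern described above, assuming the happybase Thrift client and hypothetical table, column-family and qualifier names (cust_weekly, d, total_time_service_x); the "#" key separator is purely illustrative:

import datetime
import happybase  # assumption: happybase client pointed at the HBase Thrift server

connection = happybase.Connection("<HBASE_THRIFT_HOST>")
table = connection.table("cust_weekly")  # hypothetical table name

def weekly_row_key(cust_id, day):
    # Composite "speaking" key: stable CUST_ID prefix + week date suffix.
    # Keys are not monotonically increasing across customers, so region
    # splits spread future inserts evenly over the cluster.
    return "{}#{}".format(cust_id, day.strftime("%Y-%m-%d")).encode("utf-8")

def put_weekly_sample(cust_id, day, total_time_service_x):
    # One new row per customer per week; no cell versioning involved.
    table.put(weekly_row_key(cust_id, day), {
        b"d:total_time_service_x": str(total_time_service_x).encode("utf-8"),
    })

put_weekly_sample("CUST_0042", datetime.date(2016, 8, 15), 12345)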
07-31-2016
08:06 AM
Hi all, I have the following design question for my new table in HBase.

Scenario:
-------------
- The table will contain Customer information
- The table will be refreshed every week by a procedure inserting new info (see below)
- The Row Key would be "Customer ID" (fixed)
- There would be fixed-content columns, e.g. "Name", "Surname"
- There would be variable-content columns, e.g. "Credit", "No. of Services subscribed", "Total Time used Service X"

The question:
------------------
- Should I take advantage of Column Versioning, e.g. every week putting in a new version of a column such as "Total Time used Service X"? The table would then have a fixed number of columns, some of them versioned and others fixed.
- Or is it a better approach NOT to use Column Versioning and, for every new week of data coming in, just add a new column named e.g. "Total Time used Service X - WEEK YY"? In this case I'd put the week number in the column name so I can look it up in later analysis. (A small sketch of the two alternatives follows at the end of this post.)

Please keep in mind that:
----------------------------------
- The main use will be to process the "variable information" columns later with a Spark procedure, so the ability to process each and every time series easily, on the fly, is of CRITICAL IMPORTANCE, without convoluted workarounds to manage column names and then loop through columns in weird ways (this is why at the moment I'm thinking the "Column Versioning" solution would be the best one, but my knowledge of HBase is just basic and I'd like to hear other voices too before making a mistake)
- I'm proposing that the Row Key be FIXED, but I'm open to other suggestions (e.g. multiple rows with a variable key for the same Customer Entity) if that would be the best approach in the described scenario. I just didn't want to complicate things too much while describing my problem.

Any insight and/or links to examples for this particular case will be very much appreciated! Thanks
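For illustration only, here is roughly what the two alternatives above would look like as put operations, assuming the happybase Thrift client and hypothetical table, column-family and qualifier names; nothing here is meant as a recommendation:

import happybase  # assumption: happybase client pointed at the HBase Thrift server

connection = happybase.Connection("<HBASE_THRIFT_HOST>")
table = connection.table("customers")  # hypothetical table name

# Alternative 1: Column Versioning -- the same qualifier every week, each
# weekly load writing a new cell version (illustrative epoch-millis timestamp).
table.put(b"CUST_0042",
          {b"cf:total_time_service_x": b"12345"},
          timestamp=1471238100000)

# Alternative 2: one column per week -- the week number is encoded in the
# qualifier itself, and only the latest version of each cell is ever kept.
table.put(b"CUST_0042",
          {b"cf:total_time_service_x_week_33": b"12345"})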
Labels: Apache HBase
07-28-2016
03:11 AM
Thanks. It seems a good alternative, and as a matter of fact I was not aware of its availability in CDH 5.7. Marking the thread as solved, even though for now I don't yet know whether all the features I'd need will be there in the native hbase-spark connector.
05-29-2016
11:39 AM
Hi all, I wanted to experiment with the "it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3" package (you can find it at spark-packages.org). It's an interesting add-on giving RDD visibility and operations on HBase tables via Spark. If I run this extension library in a standard spark-shell (with Scala support), everything works smoothly:
spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>
scala> import it.nerdammer.spark.hbase._
import it.nerdammer.spark.hbase._
If I try to run it in a PySpark shell instead (my goal is to use the extension with Python), I'm not able to import the functions and I'm not able to use anything:
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>
In [1]: from it.nerdammer.spark.hbase import *
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-37dd5a5ffba0> in <module>()
----> 1 from it.nerdammer.spark.hbase import *
ImportError: No module named it.nerdammer.spark.hbase
I have tried different combinations of environment variables, parameters, etc. when launching PySpark, but to no avail. Maybe I'm just doing something deeply wrong here, or maybe it's simply the fact that there is no Python API for this library. As a matter of fact, the examples on the package's home page are all in Scala (but they say you can install the package in PySpark too, with the classic "--packages" parameter). Can anybody help out with the "ImportError: No module named it.nerdammer.spark.hbase" error message? Thanks for any insight.
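A small probe sketch of what seems to be going on, assuming (as the Scala-only examples on the package page suggest) that the connector ships no Python wrapper: --packages only puts the jar on the JVM classpath, so the Scala API is reachable solely through py4j handles, never as a Python module:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("nerdammer-probe"))

# Anything reached through sc._jvm is a py4j handle into the JVM,
# not a Python module importable from Python code.
jvm_pkg = sc._jvm.it.nerdammer.spark.hbase
print(jvm_pkg)  # a py4j JavaPackage object

# There is nothing named it.nerdammer.spark.hbase on sys.path, which is
# exactly why "from it.nerdammer.spark.hbase import *" raises ImportError.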
Labels: Apache HBase, Apache Spark
05-26-2016
04:05 AM
4 Kudos
Update: I got to a working solution. Here is a brief howto to get to the result:

JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN):
Spark Master: yarn-cluster
Mode: cluster
App Name: MySpark
Jars/py files: hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class: <WHATEVER_STRING_HERE> (E.g. "clear", or "org.apache.spark.examples.mllib.JavaALS"). We do not have a Main Class in our ".py" script!
Arguments: NO ARGUMENTS DEFINED

WORKFLOW SETTINGS (CLICK GEAR ICON ON TOP RIGHT OF THE WORKFLOW MAIN SCREEN):
Variables: oozie.use.system.libpath --> true
Workspace: hue-oozie-1463575878.15
Hadoop Properties: oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows: CHECKED
Version: uri.oozie.workflow.0.5
Job XML: EMPTY
SLA Configuration: UNCHECKED

JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN AND THEN THE TRIANGULAR ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):
- PROPERTIES TAB:
-----------------
Options List: --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
Prepare: NO PREPARE STEPS DEFINED
Job XML: EMPTY
Properties: NO PROPERTIES DEFINED
Retry: NO RETRY OPTIONS DEFINED
- SLA TAB:
----------
Enabled: UNCHECKED
- CREDENTIALS TAB:
------------------
Credentials: NO CREDENTIALS DEFINED
- TRANSITIONS TAB:
------------------
Ok End
Ko Kill

MANUALLY EDIT A MINIMAL "hive-site.xml" FILE TO BE PASSED TO THE SPARK-ON-HIVE CONTAINER, SO THAT THE TABLES METASTORE CAN BE ACCESSED FROM ANY NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:
vi hive-site.xml
---
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<THRIFT_HOSTNAME>:9083</value>
</property>
</configuration>
---
hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15

EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY IN THE WORKFLOW FOLDER:
vi test.py
---
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)
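# HiveContext picks up the metastore URI from the hive-site.xml shipped to the container via the --files option above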
xxx_DF = sqlCtx.table("table")
yyy_DF = xxx_DF.select("fieldname").saveAsTable("new_table")
---
hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib

NOW YOU CAN SUBMIT THE WORKFLOW IN YARN:
- Click the "PLAY" Submit Icon on top of the screen

ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":
<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
<global>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-9fa1"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-9fa1">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>clear</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
<spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

ADDITIONAL INFO: AUTO-GENERATED "job.properties":
oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
05-17-2016
02:15 PM
24 Kudos
Hi, please perform the following actions:
1) Fill in "Source Brokers List" --> "nameOfClouderaKafkaBrokerServer.yourdomain.com:9092". This is the Server (or Servers) where you configured the Kafka Broker (NOT the MirrorMaker).
2) Fill in "Destination Brokers List" --> "nameOfRemoteBrokerServer.otherdomain.com:9092". This is supposed to be a remote Cluster that will receive Topics sent over by your MirrorMaker. If you have one, put in that one. Otherwise just put in another Server in your network, whatever Server. Please note that both these Server Names must be FQDNs and resolvable by your DNS (or hosts file), otherwise you'll get other errors. Also, the format with the trailing Port Number is mandatory!
3) Click "Continue". The Service will NOT start (error). Do not navigate away from that screen.
4) Open another Cloudera Manager session in another browser pane. You should now see "Kafka" in the list of Services (red, but it should be there). Click on the Kafka Service and then "Configure".
5) Search for the "java heap space" Configuration Property. The default Java Heap Space you'll find already set is 50 MB. Put in at least 256 MB. The original value is simply not enough.
6) Now search for the "whitelist" Configuration Property. In the field, put in "(?!x)x" (without the quotation marks). That's a regular expression that does not match anything (see the small check below). Given that a Whitelist is apparently mandatory for the MirrorMaker Service to start, and assuming you don't want to replicate any topics remotely right now, just put in something that won't replicate anything, e.g. that regular expression.
7) Save the changes and go back to the original Configuration Screen in the other browser pane. Click "Retry", or whatever, or even exit that screen and manually restart the Kafka Service in Cloudera Manager.
That should work; at least it did for me! HTH
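To back up the claim that "(?!x)x" never matches anything: the lookahead requires the next character NOT to be "x" at a position where the pattern then demands an "x", which is impossible. A quick Python check (Java's regex engine, which the MirrorMaker whitelist uses, treats this pattern the same way):

import re

# "(?!x)x" can never match: the negative lookahead forbids "x" exactly
# where the pattern then requires an "x".
never_match = re.compile(r"(?!x)x")

for topic in ["", "x", "xyz", "my_topic", "xxx"]:
    print("%r -> %s" % (topic, bool(never_match.search(topic))))  # always False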
05-14-2016
03:39 PM
Update: If I use "spark-submit", the script runs successfully. Syntax used for "spark-submit": spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--executor-memory 500M \
--total-executor-cores 1 \
hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
10

Excerpt from the output log:
16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
05-14-2016
08:53 AM
Hi all, my CDH test rig is as follows: CDH 5.5.1, Spark 1.5.0, Oozie 4.1.0. I have successfully created a simple Oozie Workflow that spawns a Spark Action using the HUE Interface. My intention is to use YARN in cluster mode to run the Workflow/Action. It's a Python script, which is as follows (just a test):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
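# NOTE: a plain SQLContext has no access to the Hive metastore, so Hive tables such as "sales_fact" cannot be resolved through it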
#sqlCtx = HiveContext(sc)
### (2) IF HIVECONTEXT, EVENTUALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx .sql("use default")
### (3) CREATE MAIN DATAFRAME. TRY THE SYNTAXES ALTERNATIVELY, COMBINE WITH DIFFERENT (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")
### (4) ANOTHER DATAFRAME
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")
### (5) USELESS PRINT STATEMENT:
print 'a'
When I run the Workflow, a MapReduce Job is started. Shortly after, a Spark Job is spawned (I can see that from the Job Browser). The Spark Job fails with the following error (excerpt from the Log File of the Spark Action):
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.RuntimeException: Table Not Found: sales_fact

This is my "workflow.xml":
<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
<global>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-3ca0"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-3ca0">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
</configuration>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

This is my "job.properties":
oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020

Please note that:
1) I've also uploaded "hive-site.xml" into the same directory as the 2 files described above. As you can see from "workflow.xml", it should also be picked up.
2) The "test.py" script is under a "lib" directory in the Workspace created by HUE. It gets picked up. In that directory I also took care of uploading several Jars belonging to a Derby DB connector, probably required to collect stats, to avoid other exceptions being thrown.
3) I've tried adding the workflow property "oozie.action.sharelib.for.spark", with value "hcatalog,hive,hive2", with no success.
4) As you can see in the Python script above, I've been using alternatively an SQLContext or a HiveContext object inside the script. The results are the same (although the error message is slightly different).
5) The ShareLib should be OK too:
oozie admin -shareliblist
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
I'm suspecting the Tables Metastore is not being read; that's probably the issue. But I've run out of ideas and I'm not able to get it working... Thanks in advance for any feedback!
Labels: Apache Oozie
05-13-2016
03:03 AM
2 Kudos
I got past this! Still no cigar, though. Now I have another error, but I'm going to work on this. It's something different now...
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
java.io.FileNotFoundException: File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
Many thanks for your help. I'd never have been able to figure this out by myself!
05-13-2016
02:42 AM
Hi Ben, thanks a whole lot for your reply. May I ask where exactly you specified that setting?
- In the GUI, in some particular field?
- In "workflow.xml", in the Job's directory in HDFS? If yes: as an "arg", as a "property", or..?
- In "job.properties", in the Job's directory in HDFS? If yes: how?
- In some other file, e.g. "/etc/alternatives/spark-conf/spark-defaults.conf"? If yes: how?
A snippet of your code would be extremely appreciated! I'm asking because I've tried all of the above with your suggestion but did not succeed. Thanks again for your help.