Member since: 01-05-2016
Posts: 60
Kudos Received: 42
Solutions: 7

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 264 | 10-18-2024 02:15 PM |
| | 1223 | 10-21-2019 05:16 AM |
| | 4485 | 01-29-2018 07:05 AM |
| | 3313 | 06-27-2017 06:42 AM |
| | 39572 | 05-26-2016 04:05 AM |
08-15-2016
05:15 AM
Thank you. Useful insight and crystal clear argumentation, as usual from you. In the meantime I had the chance to study a bit more, and in the end I came to a conclusion that matches your considerations, so I'm glad I apparently moved in the right direction. As a matter of fact, I looked at the Open Source project http://opentsdb.net and saw that, generally speaking, the approach it uses is the last one you explained. To give a practical example, in my case:
- A new record every week for the same Customer Entity
- Therefore, Column Versioning is NOT used at all (as you suggested)
- A "speaking" record key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This kind of key is not monotonically increasing, because the "CUST_ID" part is stable, so the approach should also be good from a "Table Splitting" perspective: when the table grows it will split up evenly, and all the splits will take care of a share of the future inserts, balancing the load across the machines evenly
- The same set of columns for each record, containing the newly sampled value of that field for that week, e.g. "<Total progressive time used Service X>"
This is the approach I used in the end (a minimal sketch follows at the end of this post). It has nothing to do with my original idea of using Versions, but it perfectly matches the last approach you described in your answer. Regarding the fixed values (e.g. "Name", "Surname"), I decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I'm planning to modify this structure soon and move the fixed values into another table (Hive or HBase, I don't know yet), then pick up the information I need at the moment it's needed (for instance, during data processing I'll join the relevant anagraphic data into the relevant DataFrames). I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
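A minimal sketch of the weekly insert pattern described above, assuming the happybase Thrift client and hypothetical table, column-family and qualifier names (cust_weekly, d, total_time_service_x); the "#" key separator is purely illustrative:

import datetime
import happybase  # assumption: happybase client pointed at the HBase Thrift server

connection = happybase.Connection("<HBASE_THRIFT_HOST>")
table = connection.table("cust_weekly")  # hypothetical table name

def weekly_row_key(cust_id, day):
    # Composite "speaking" key: stable CUST_ID prefix + week date suffix.
    # Keys are not monotonically increasing across customers, so region
    # splits spread future inserts evenly over the cluster.
    return "{}#{}".format(cust_id, day.strftime("%Y-%m-%d")).encode("utf-8")

def put_weekly_sample(cust_id, day, total_time_service_x):
    # One new row per customer per week; no cell versioning involved.
    table.put(weekly_row_key(cust_id, day), {
        b"d:total_time_service_x": str(total_time_service_x).encode("utf-8"),
    })

put_weekly_sample("CUST_0042", datetime.date(2016, 8, 15), 12345)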
07-31-2016
08:06 AM
Hi all, I have the following design question for my new table in HBase.

Scenario:
-------------
- The table will contain Customer information
- The table will be refreshed every week by a procedure inserting new info (see below)
- The Row Key would be "Customer ID" (fixed)
- There would be fixed-content columns, e.g. "Name", "Surname"
- There would be variable-content columns, e.g. "Credit", "No. of Services subscribed", "Total Time used Service X"

The question:
------------------
- Should I take advantage of Column Versioning, e.g. every week putting in a new version of a column such as "Total Time used Service X"? The table would then have a fixed number of columns, some of them versioned and others fixed.
- Or is it a better approach NOT to use Column Versioning and, for every new week of data coming in, just add a new column named e.g. "Total Time used Service X - WEEK YY"? In this case I'd put the week number in the column name so I can look it up in later analysis. (A small sketch of the two alternatives follows at the end of this post.)

Please keep in mind that:
----------------------------------
- The main use will be to process the "variable information" columns later with a Spark procedure, so the ability to process each and every time series easily, on the fly, is of CRITICAL IMPORTANCE, without convoluted workarounds to manage column names and then loop through columns in weird ways (this is why at the moment I'm thinking the "Column Versioning" solution would be the best one, but my knowledge of HBase is just basic and I'd like to hear other voices too before making a mistake)
- I'm proposing that the Row Key be FIXED, but I'm open to other suggestions (e.g. multiple rows with a variable key for the same Customer Entity) if that would be the best approach in the described scenario. I just didn't want to complicate things too much while describing my problem.

Any insight and/or links to examples for this particular case will be very much appreciated! Thanks
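For illustration only, here is roughly what the two alternatives above would look like as put operations, assuming the happybase Thrift client and hypothetical table, column-family and qualifier names; nothing here is meant as a recommendation:

import happybase  # assumption: happybase client pointed at the HBase Thrift server

connection = happybase.Connection("<HBASE_THRIFT_HOST>")
table = connection.table("customers")  # hypothetical table name

# Alternative 1: Column Versioning -- the same qualifier every week, each
# weekly load writing a new cell version (illustrative epoch-millis timestamp).
table.put(b"CUST_0042",
          {b"cf:total_time_service_x": b"12345"},
          timestamp=1471238100000)

# Alternative 2: one column per week -- the week number is encoded in the
# qualifier itself, and only the latest version of each cell is ever kept.
table.put(b"CUST_0042",
          {b"cf:total_time_service_x_week_33": b"12345"})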
Labels: Apache HBase
07-28-2016
03:11 AM
Thanks. It seems a good alternative, and as a matter of fact I was not aware of its availability in CDH 5.7. Marking the thread as solved, even though for now I don't yet know whether all the features I'd need will be there in the native hbase-spark connector.
05-29-2016
11:39 AM
Hi all, I wanted to experiment with the "it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3" package (you can find it at spark-packages.org). It's an interesting add-on giving RDD visibility and operations on HBase tables via Spark. If I run this extension library in a standard spark-shell (with Scala support), everything works smoothly:
spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>
scala> import it.nerdammer.spark.hbase._
import it.nerdammer.spark.hbase._
If I try to run it in a PySpark shell instead (my goal is to use the extension with Python), I'm not able to import the functions and I'm not able to use anything:
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>
In [1]: from it.nerdammer.spark.hbase import *
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-37dd5a5ffba0> in <module>()
----> 1 from it.nerdammer.spark.hbase import *
ImportError: No module named it.nerdammer.spark.hbase
I have tried different combinations of environment variables, parameters, etc. when launching PySpark, but to no avail. Maybe I'm just doing something deeply wrong here, or maybe it's simply the fact that there is no Python API for this library. As a matter of fact, the examples on the package's home page are all in Scala (but they say you can install the package in PySpark too, with the classic "--packages" parameter). Can anybody help out with the "ImportError: No module named it.nerdammer.spark.hbase" error message? Thanks for any insight.
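A small probe sketch of what seems to be going on, assuming (as the Scala-only examples on the package page suggest) that the connector ships no Python wrapper: --packages only puts the jar on the JVM classpath, so the Scala API is reachable solely through py4j handles, never as a Python module:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("nerdammer-probe"))

# Anything reached through sc._jvm is a py4j handle into the JVM,
# not a Python module importable from Python code.
jvm_pkg = sc._jvm.it.nerdammer.spark.hbase
print(jvm_pkg)  # a py4j JavaPackage object

# There is nothing named it.nerdammer.spark.hbase on sys.path, which is
# exactly why "from it.nerdammer.spark.hbase import *" raises ImportError.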
Labels: Apache HBase, Apache Spark
05-26-2016
04:05 AM
4 Kudos
Update: I got to a working solution. Here is a brief howto to get to the result:

JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN):
Spark Master: yarn-cluster
Mode: cluster
App Name: MySpark
Jars/py files: hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class: <WHATEVER_STRING_HERE> (E.g. "clear", or "org.apache.spark.examples.mllib.JavaALS"). We do not have a Main Class in our ".py" script!
Arguments: NO ARGUMENTS DEFINED

WORKFLOW SETTINGS (CLICK GEAR ICON ON TOP RIGHT OF THE WORKFLOW MAIN SCREEN):
Variables: oozie.use.system.libpath --> true
Workspace: hue-oozie-1463575878.15
Hadoop Properties: oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows: CHECKED
Version: uri.oozie.workflow.0.5
Job XML: EMPTY
SLA Configuration: UNCHECKED

JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN AND THEN THE TRIANGULAR ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):
- PROPERTIES TAB:
-----------------
Options List: --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
Prepare: NO PREPARE STEPS DEFINED
Job XML: EMPTY
Properties: NO PROPERTIES DEFINED
Retry: NO RETRY OPTIONS DEFINED
- SLA TAB:
----------
Enabled: UNCHECKED
- CREDENTIALS TAB:
------------------
Credentials: NO CREDENTIALS DEFINED
- TRANSITIONS TAB:
------------------
Ok End
Ko Kill

MANUALLY EDIT A MINIMAL "hive-site.xml" FILE TO BE PASSED TO THE SPARK-ON-HIVE CONTAINER, SO THAT THE TABLES METASTORE CAN BE ACCESSED FROM ANY NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:
vi hive-site.xml
---
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<THRIFT_HOSTNAME>:9083</value>
</property>
</configuration>
---
hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15

EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY IN THE WORKFLOW FOLDER:
vi test.py
---
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)
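# HiveContext picks up the metastore URI from the hive-site.xml shipped to the container via the --files option above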
xxx_DF = sqlCtx.table("table")
yyy_DF = xxx_DF.select("fieldname").saveAsTable("new_table")
---
hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib

NOW YOU CAN SUBMIT THE WORKFLOW IN YARN:
- Click the "PLAY" Submit Icon on top of the screen

ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":
<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
<global>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-9fa1"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-9fa1">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>clear</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
<spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

ADDITIONAL INFO: AUTO-GENERATED "job.properties":
oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
05-17-2016
02:15 PM
24 Kudos
Hi, please perform the following actions:
1) Fill in "Source Brokers List" --> "nameOfClouderaKafkaBrokerServer.yourdomain.com:9092". This is the Server (or Servers) where you configured the Kafka Broker (NOT the MirrorMaker).
2) Fill in "Destination Brokers List" --> "nameOfRemoteBrokerServer.otherdomain.com:9092". This is supposed to be a remote Cluster that will receive Topics sent over by your MirrorMaker. If you have one, put in that one. Otherwise just put in another Server in your network, whatever Server. Please note that both these Server Names must be FQDNs and resolvable by your DNS (or hosts file), otherwise you'll get other errors. Also, the format with the trailing Port Number is mandatory!
3) Click "Continue". The Service will NOT start (error). Do not navigate away from that screen.
4) Open another Cloudera Manager session in another browser pane. You should now see "Kafka" in the list of Services (red, but it should be there). Click on the Kafka Service and then "Configure".
5) Search for the "java heap space" Configuration Property. The default Java Heap Space you'll find already set is 50 MB. Put in at least 256 MB. The original value is simply not enough.
6) Now search for the "whitelist" Configuration Property. In the field, put in "(?!x)x" (without the quotation marks). That's a regular expression that does not match anything (see the small check below). Given that a Whitelist is apparently mandatory for the MirrorMaker Service to start, and assuming you don't want to replicate any topics remotely right now, just put in something that won't replicate anything, e.g. that regular expression.
7) Save the changes and go back to the original Configuration Screen in the other browser pane. Click "Retry", or whatever, or even exit that screen and manually restart the Kafka Service in Cloudera Manager.
That should work; at least it did for me! HTH
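To back up the claim that "(?!x)x" never matches anything: the lookahead requires the next character NOT to be "x" at a position where the pattern then demands an "x", which is impossible. A quick Python check (Java's regex engine, which the MirrorMaker whitelist uses, treats this pattern the same way):

import re

# "(?!x)x" can never match: the negative lookahead forbids "x" exactly
# where the pattern then requires an "x".
never_match = re.compile(r"(?!x)x")

for topic in ["", "x", "xyz", "my_topic", "xxx"]:
    print("%r -> %s" % (topic, bool(never_match.search(topic))))  # always False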
05-14-2016
03:39 PM
Update: If I use "spark-submit", the script runs successfully. Syntax used for "spark-submit": spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--executor-memory 500M \
--total-executor-cores 1 \
hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
10

Excerpt from the output log:
16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
05-14-2016
08:53 AM
Hi all, my CDH test rig is as follows: CDH 5.5.1, Spark 1.5.0, Oozie 4.1.0. I have successfully created a simple Oozie Workflow that spawns a Spark Action using the HUE Interface. My intention is to use YARN in cluster mode to run the Workflow/Action. It's a Python script, which is as follows (just a test):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
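# NOTE: a plain SQLContext has no access to the Hive metastore, so Hive tables such as "sales_fact" cannot be resolved through it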
#sqlCtx = HiveContext(sc)
### (2) IF HIVECONTEXT, EVENTUALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx .sql("use default")
### (3) CREATE MAIN DATAFRAME. TRY THE SYNTAXES ALTERNATIVELY, COMBINE WITH DIFFERENT (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")
### (4) ANOTHER DATAFRAME
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")
### (5) USELESS PRINT STATEMENT:
print 'a'
When I run the Workflow, a MapReduce Job is started. Shortly after, a Spark Job is spawned (I can see that from the Job Browser). The Spark Job fails with the following error (excerpt from the Log File of the Spark Action):
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.RuntimeException: Table Not Found: sales_fact

This is my "workflow.xml":
<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
<global>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-3ca0"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-3ca0">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
</configuration>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

This is my "job.properties":
oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020

Please note that:
1) I've also uploaded "hive-site.xml" into the same directory as the 2 files described above. As you can see from "workflow.xml", it should also be picked up.
2) The "test.py" script is under a "lib" directory in the Workspace created by HUE. It gets picked up. In that directory I also took care of uploading several Jars belonging to a Derby DB connector, probably required to collect stats, to avoid other exceptions being thrown.
3) I've tried adding the workflow property "oozie.action.sharelib.for.spark", with value "hcatalog,hive,hive2", with no success.
4) As you can see in the Python script above, I've been using alternatively an SQLContext or a HiveContext object inside the script. The results are the same (although the error message is slightly different).
5) The ShareLib should be OK too:
oozie admin -shareliblist
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
I'm suspecting the Tables Metastore is not being read; that's probably the issue. But I've run out of ideas and I'm not able to get it working... Thanks in advance for any feedback!
Labels: Apache Oozie
05-13-2016
03:03 AM
2 Kudos
I got past this! Still no cigar, though. Now I have another error, but I'm going to work on this. It's something different now...
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
java.io.FileNotFoundException: File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
Many thanks for your help. I'd never have been able to figure this out by myself!
05-13-2016
02:42 AM
Hi Ben, thanks a whole lot for your reply. May I ask where exactly you specified that setting?
- In the GUI, in some particular field?
- In "workflow.xml", in the Job's directory in HDFS? If yes: as an "arg", as a "property", or..?
- In "job.properties", in the Job's directory in HDFS? If yes: how?
- In some other file, e.g. "/etc/alternatives/spark-conf/spark-defaults.conf"? If yes: how?
A snippet of your code would be extremely appreciated! I'm asking because I've tried all of the above with your suggestion but did not succeed. Thanks again for your help.