Member since: 01-05-2016
Posts: 55
Kudos Received: 37
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 802 | 10-21-2019 05:16 AM
 | 3762 | 01-29-2018 07:05 AM
 | 2574 | 06-27-2017 06:42 AM
 | 37035 | 05-26-2016 04:05 AM
 | 26679 | 05-17-2016 02:15 PM
06-26-2017
12:37 PM
As I believe the problem is definitely due to differences between CDH 5.7 and CDH 5.11 in how YARN allocates resources to containers, I've tried to follow the YARN Tuning Guide again from scratch. The latest version of the YARN Tuning Guide available is apparently for CDH 5.10: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html On that page an XLS sheet is available to help plan the various parameters in a correct and working fashion. No luck: I always find myself with jobs stuck in "ACCEPTED" state that never start running.

I also found this interesting page suggesting how to configure Dynamic Resource Pools for YARN: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cm_mc_resource_pools.html#concept_xkk_l1d_wr__section_c3f_vwf_4n I tried to limit the "number of concurrent jobs" to just 2 in the relevant configuration page of the Dynamic Resource Pools, but again, no success.

Can anybody please point out any new feature introduced in CDH 5.11 related to YARN resource allocation (and that I have not mentioned here)? My workflows were running smoothly before the upgrade, and now I'm facing heavy trouble! Workarounds are welcome too, as well as methods for monitoring/tracing resource usage that would let me understand which parameters I've set up in a way that no longer works in CDH 5.11. Thanks a lot for any hints or insights!
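For whoever hits the same wall: the most direct way I've found to see why applications sit in "ACCEPTED" is the ResourceManager REST API, which reports each application's queue and a diagnostics message. A minimal Python sketch (RM_HOST is a placeholder, 8088 is the default RM web port, and the requests library is an assumption of mine):

# Minimal sketch: list applications stuck in ACCEPTED via the ResourceManager REST API,
# together with their queue and diagnostics message (which often explains the wait).
# RM_HOST is a placeholder; 8088 is the default RM web port; 'requests' is assumed installed.
import requests

RM = "http://RM_HOST:8088"

resp = requests.get(RM + "/ws/v1/cluster/apps", params={"states": "ACCEPTED"}).json()
apps = (resp.get("apps") or {}).get("app", [])
for app in apps:
    print(app["id"], app["name"], app["queue"], app["state"])
    print("  diagnostics:", app.get("diagnostics", ""))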
06-21-2017
11:31 AM
Hello, after successfully upgrading a small (5-node) CDH 5.7 cluster to CDH 5.11, I am experiencing various problems with existing Oozie workflows that used to work correctly. The most significant example: I have a workflow scheduling 8 jobs in parallel (a mix of Hive, Shell and Sqoop actions). The 8 jobs are acquired and start running, but the 8 sub-jobs performing the actual actions are stuck in "ACCEPTED" status and never switch to the "RUNNING" state. After hours of work I've not been able to find anything significant in the logs, apart from a few warnings complaining about log4j. So I decided to upgrade the JDK from 1.7 to 1.8 too, but without any improvement. Any help or suggestion pointing me in the right direction would be very much appreciated! Thanks
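A quick way to check whether the cluster has any free memory/vcores at all is the ResourceManager REST API; here is a rough Python sketch (RM_HOST is a placeholder, 8088 is the default RM web port, and the requests library is an assumption of mine):

# Rough sketch: check overall cluster headroom via the ResourceManager REST API.
# RM_HOST is a placeholder; 8088 is the default RM web port; 'requests' is assumed installed.
import requests

RM = "http://RM_HOST:8088"

m = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("apps pending/running :", m["appsPending"], "/", m["appsRunning"])
print("memory avail/total MB:", m["availableMB"], "/", m["totalMB"])
print("vcores avail/total   :", m["availableVirtualCores"], "/", m["totalVirtualCores"])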
01-03-2017
12:38 PM
Hi AdrianMonter, sorry to say I haven't found a specific solution for the Avro file format in the meantime. I've been sticking to the Parquet file format since I hit this problem, and for now it covers all my needs... Maybe in the latest CDH/Spark releases this has been fixed? Maybe somebody from @Former Member can tell us something more?
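For completeness, this is roughly what the Parquet workaround looks like on the Spark 1.x shipped with CDH 5 (the table name and path below are placeholders of mine, not from the original thread):

# Rough sketch of the Parquet workaround mentioned above (Spark 1.x style, as shipped with CDH 5).
# The table name and path are placeholders for illustration only.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ParquetWorkaround")
sqlCtx = HiveContext(sc)

df = sqlCtx.table("source_table")             # read via the Hive metastore
df.write.parquet("/user/me/data_parquet")     # write as Parquet instead of Avro
back = sqlCtx.read.parquet("/user/me/data_parquet")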
10-17-2016
01:40 PM
2 Kudos
Hi aj, yes I did manage to solve it. Please take a look at the following thread and see if it can be of help. It may seem a bit unrelated to the "test.py not found" issue, but it contains detailed info about how to specify all the needed parameters to make the whole thing run smoothly: http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Oozie-workflow-Spark-action-using-simple-Dataframe-quot-Table/m-p/40834 HTH
08-15-2016
05:15 AM
Thank you. Useful insight and crystal clear argumentation, as usual from you. I have to say that in the meantime I had the chance to study a bit more, and in the end I came to a conclusion which matches your considerations, so I'm glad that apparently I moved in the right direction. As a matter of fact I've looked at the Open Source project http://opentsdb.net , and generally speaking the approach they use is the last one you explained. To provide a practical example, in my case:
- A new record every week for the same Customer Entity
- Therefore, column Versioning is NOT used at all (as you suggested)
- A "speaking" record key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This sort of key is not monotonically increasing, because the "CUST_ID" part is "stable", so this approach should also be good from a "Table Splitting" perspective (when the table grows, it will split up evenly and all the splits will take care of a part of the future inserts, balancing the machine load evenly)
- The same set of columns for each record, containing the new sampled value of that field for that week, e.g. "<Total progressive time used Service X>"
This is the approach I used in the end, which has nothing to do with my original idea of using Versions but perfectly matches the last approach you described in your answer. Regarding the fixed values (e.g. "Name", "Surname") I've decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I'm planning to modify this structure soon and move the fixed values into another table (Hive or HBase, I don't know yet) and pick up the information I'd eventually need at processing time (for instance, during data processing I'll join the relevant customer master data into the relevant DataFrames). I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
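To make the final layout concrete, here's a rough sketch of a weekly insert with this schema. I'm using the happybase Python client purely for illustration (it needs the HBase Thrift server); the table and column names are made up:

# Sketch of the weekly-row schema described above, using the happybase client.
# happybase, the Thrift host, the table name and the column names are assumptions for illustration.
import happybase

conn = happybase.Connection("HBASE_THRIFT_HOST")   # HBase Thrift server
table = conn.table("customer_weekly")

# One row per customer per week: key = "<CUST_ID>+<YYYY-MM-DD>", no column versions.
row_key = b"CUST0042+2016-08-15"
table.put(row_key, {
    b"metrics:credit": b"120.50",
    b"metrics:services_subscribed": b"3",
    b"metrics:total_time_service_x": b"3600",
})

# Reading the whole time series for one customer is a simple prefix scan.
for key, data in table.scan(row_prefix=b"CUST0042+"):
    print(key, data)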
07-31-2016
08:06 AM
Hi all, I have the following design question for my new table in HBase.

Scenario:
-------------
- The table will contain Customer information
- The table will be refreshed every week by a procedure inserting new info (see below)
- The row key would be "Customer ID" (fixed)
- There would be fixed-content columns, e.g. "Name", "Surname"
- There would be variable-content columns, e.g. "Credit", "No. of Services subscribed", "Total Time used Service X"

The question:
------------------
- Should I take advantage of Column Versioning, e.g. every week putting in a new version of column (e.g.) "Total Time used Service X"? The table would then have a fixed number of columns, some of them versioned and others fixed.
- Or is it a better approach NOT to use Column Versioning, and for every new week of data coming in just add a new column named (e.g.) "Total Time used Service X - WEEK YY"? In this case I'd put the week number in the column name to be able to look it up in later analysis.

Please keep in mind that:
----------------------------------
- The main use will be to process the "variable information" columns later with a Spark procedure, so it is of CRITICAL IMPORTANCE to be able to process each and every "time series" easily, on the fly, without convoluted workarounds to manage e.g. column names and then loop through columns in weird ways (this is why at the moment I'm thinking the "Column Versioning" solution would be the best one, but my knowledge of HBase is just basic and I'd like to hear other voices too before making a mistake)
- I'm proposing that the row key would be FIXED, but I'm open to other suggestions (e.g. multiple rows with a variable key for the same Customer Entity) if that would be the best approach in the described scenario. I just didn't want to mess things up too much while describing my problem.

Any insight and/or link to examples for this particular case will be very much appreciated! Thanks
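To make the two alternatives concrete, here is a rough illustration of how the weekly put would differ in the two cases (the happybase Python client, the Thrift host and all table/column names are just placeholders of mine, not part of the actual design):

# Illustration only: how the two alternatives would look with the happybase client.
# happybase, the Thrift host and the table/column names are assumptions, not part of the question.
import happybase

conn = happybase.Connection("HBASE_THRIFT_HOST")
table = conn.table("customers")

# Option 1: fixed column, one new version per week (the column family must be set to keep enough versions).
table.put(b"CUST0042", {b"usage:total_time_service_x": b"3600"})   # week N
table.put(b"CUST0042", {b"usage:total_time_service_x": b"3720"})   # week N+1, older value kept as a version
history = table.cells(b"CUST0042", b"usage:total_time_service_x", versions=52)

# Option 2: no versioning, one new column per week, the week number encoded in the qualifier.
table.put(b"CUST0042", {b"usage:total_time_service_x_w33": b"3600"})
table.put(b"CUST0042", {b"usage:total_time_service_x_w34": b"3720"})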
Labels: Apache HBase
07-28-2016
03:11 AM
Thanks. It seems a good alternative, and as a matter of fact I was not aware of its availability in CDH 5.7. I'm marking the thread as solved, even if for now I don't know yet whether all the features I'd need are there in the native hbase-spark connector.
05-29-2016
11:39 AM
Hi all, I wanted to experiment with the "it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3" package (you can find it at spark-packages.org). It's an interesting add-on giving RDD visibility/operability on HBase tables via Spark. If I run this extension library in a standard spark-shell (with Scala support), everything works smoothly:

spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>

scala> import it.nerdammer.spark.hbase._
import it.nerdammer.spark.hbase._

If I try to run it in a PySpark shell (my goal is to use the extension with Python), I'm not able to import the functions and I'm not able to use anything:

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>

In [1]: from it.nerdammer.spark.hbase import *
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-37dd5a5ffba0> in <module>()
----> 1 from it.nerdammer.spark.hbase import *
ImportError: No module named it.nerdammer.spark.hbase

I have tried different combinations of environment variables, parameters, etc. when launching PySpark, but to no avail. Maybe I'm just trying to do something deeply wrong here, or maybe it's simply that there is no Python API for this library. As a matter of fact, the examples on the package's home page are all in Scala (but they say you can install the package in PySpark too, with the classic "--packages" parameter). Can anybody help out with the "ImportError: No module named it.nerdammer.spark.hbase" error message? Thanks for any insight
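In case it helps others: one Python-only route I'm aware of, independent of the nerdammer connector (so an alternative approach rather than a fix for the import), is to read HBase through the TableInputFormat with the converter classes shipped in the Spark examples jar. A rough sketch, assuming that jar is passed with --jars and that <HBASE_HOST>/<TABLE_NAME> are replaced:

# Alternative sketch (not the nerdammer connector): read an HBase table from PySpark
# via TableInputFormat. Assumes the spark-examples jar (which provides the converters
# below) is on the classpath via --jars, and that <HBASE_HOST>/<TABLE_NAME> are replaced.
from pyspark import SparkContext

sc = SparkContext(appName="HBaseRead")

conf = {
    "hbase.zookeeper.quorum": "<HBASE_HOST>",
    "hbase.mapreduce.inputtable": "<TABLE_NAME>",
}

rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf,
)

print(rdd.take(5))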
Labels: Apache HBase, Apache Spark
05-26-2016
04:05 AM
4 Kudos
Update: I got to a working solution; this is a brief howto to get to the result.

JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN):

Spark Master: yarn-cluster
Mode: cluster
App Name: MySpark
Jars/py files: hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class: <WHATEVER_STRING_HERE> (E.g. "clear", or "org.apache.spark.examples.mllib.JavaALS"). We do not have a Main Class in our ".py" script!
Arguments: NO ARGUMENTS DEFINED

WORKFLOW SETTINGS (CLICK GEAR ICON ON TOP RIGHT OF THE WORKFLOW MAIN SCREEN):

Variables: oozie.use.system.libpath --> true
Workspace: hue-oozie-1463575878.15
Hadoop Properties: oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows: CHECKED
Version: uri.oozie.workflow.0.5
Job XML: EMPTY
SLA Configuration: UNCHECKED

JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN AND THEN THE TRIANGULAR ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):

- PROPERTIES TAB:
-----------------
Options List: --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
Prepare: NO PREPARE STEPS DEFINED
Job XML: EMPTY
Properties: NO PROPERTIES DEFINED
Retry: NO RETRY OPTIONS DEFINED
- SLA TAB:
----------
Enabled: UNCHECKED
- CREDENTIALS TAB:
------------------
Credentials: NO CREDENTIALS DEFINED
- TRANSITIONS TAB:
------------------
Ok End
Ko Kill

MANUALLY EDIT A MINIMAL "hive-site.xml" FILE TO BE PASSED TO THE SPARK-ON-HIVE CONTAINER, SO THAT THE TABLES METASTORE CAN BE ACCESSED FROM ANY NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:

vi hive-site.xml
---
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<THRIFT_HOSTNAME>:9083</value>
</property>
</configuration>
---
hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15

EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY IN THE WORKFLOW FOLDER:

vi test.py
---
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

# Build the SparkContext (master/mode are also set in the Oozie Spark action above)
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

# HiveContext finds the metastore thanks to the hive-site.xml shipped via --files
sqlCtx = HiveContext(sc)

# Read an existing Hive table and save a projection of it as a new table
xxx_DF = sqlCtx.table("table")
xxx_DF.select("fieldname").saveAsTable("new_table")
---
hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib

NOW YOU CAN SUBMIT THE WORKFLOW IN YARN:

- Click the "PLAY" Submit icon on top of the screen

ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":

<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
<global>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-9fa1"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-9fa1">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>clear</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
<spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

ADDITIONAL INFO: AUTO-GENERATED "job.properties":

oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
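Not part of the Hue steps above, but possibly useful: once submitted, the workflow status can also be polled programmatically via the Oozie Web Services API. A rough Python sketch (the Oozie host, the default port 11000, the requests library and the job id below are placeholders/assumptions of mine):

# Rough sketch: poll the Oozie Web Services API for the status of a submitted workflow.
# OOZIE_HOST, the default port 11000, 'requests' and the job id are placeholders/assumptions.
import requests

OOZIE = "http://OOZIE_HOST:11000/oozie"

# List recent workflow jobs (the same information Hue shows in its dashboard).
jobs = requests.get(OOZIE + "/v1/jobs", params={"jobtype": "wf", "len": 10}).json()
for wf in jobs.get("workflows", []):
    print(wf["id"], wf["appName"], wf["status"])

# Details of a single job, including the status of each action.
job_id = "0000000-000000000000000-oozie-oozi-W"   # placeholder
info = requests.get(OOZIE + "/v1/job/" + job_id, params={"show": "info"}).json()
print(info["status"])
for action in info.get("actions", []):
    print(action["name"], action["status"])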