Member since: 10-19-2014
Posts: 58
Kudos Received: 5
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4366 | 03-20-2016 10:41 PM
 | 8115 | 04-26-2015 02:30 AM
06-07-2017
11:33 PM
Do you receive an email when you run the mail command manually on different worker nodes in your cluster? (Remember that a shell action will run on a randomly assigned worker node.)
06-07-2017
11:27 PM
Any Oozie experts out there who can help with this question? Thanks!
05-31-2017
08:32 PM
Hello, What's the best practice for triggering an Oozie workflow based on the availability of data in an external system (e.g. an RDBMS)? The requirement is that the workflow should start as soon as (or at least very soon after) the data in the external system becomes available. One approach is to schedule the job with high frequency, check for the data and exit if it is not present, but this doesn't seem particularly elegant. Is there any other way? Thanks, Martin
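To illustrate the polling approach, here is a minimal sketch of the availability check a scheduled shell action could run; pyodbc, the DSN, the table name and the query are all placeholders for whatever the external system actually is:

# check_data_ready.py - hypothetical availability check run by a scheduled shell action.
# Assumes pyodbc and a DSN named "sourcedb"; the table and query are placeholders.
import sys
import pyodbc

def main():
    conn = pyodbc.connect("DSN=sourcedb")  # placeholder connection
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM staging_table WHERE load_date = CURRENT_DATE")
    (row_count,) = cur.fetchone()
    conn.close()
    # Emit a flag so "Capture output" can drive a decision node in the workflow.
    print("DATA_READY=%s" % ("true" if row_count > 0 else "false"))
    # Non-zero exit when nothing is there, so the run can stop early.
    sys.exit(0 if row_count > 0 else 1)

if __name__ == "__main__":
    main()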
09-21-2016
11:35 PM
Hello, Is there a rule-of-thumb recommendation for modelling Hive tables on HDFS: whether to store them as "wide" tables with lots of columns, or break them apart into smaller tables which have to be joined during query execution? Parameters to consider:
- Number of attributes, e.g. what is the max reasonable number of columns?
- Use of complex types (arrays, maps) vs parent-child relationships between tables
- Data stored as Avro vs data stored as Parquet
- Usage in Impala and Hive
Thanks, Martin
08-14-2016
06:07 AM
Thanks Harsh for confirming there is no external schema file concept in Parquet and for sharing the link for the CREATE TABLE ... LIKE PARQUET ... syntax. This seems to be specific to Impala, however; is there a generic approach to use across a stack of tools including Spark, Pig and Hive as well as Impala (with Spark and Pig not using HCatalog)? Many thanks, Martin
08-14-2016
03:00 AM
Hi, Is there a way to keep the schema for Parquet data in an external schema file, in a similar way to Avro with avsc schema files which can be referenced in CREATE TABLE statements?
Thanks,
Martin
- Tags:
- parquet
07-27-2016
01:23 AM
Hi sairamvj, I would suggest you open a new thread for your question, as it is not related to the topic of this thread. Martin
05-15-2016
09:04 PM
Hello, What is the right way to pass the -no_multiquery option to Pig from Oozie workflow developed in Hue? Thanks, Martin
04-06-2016
08:59 PM
Hello, Here's our scenario:
- Data is stored in HDFS as Avro
- Data is partitioned and there are approx. 120 partitions
- Each partition has around 3,200 files in it
- The file sizes vary, from as small as 2 kB up to 50 MB
- In total there is roughly 3 TB of data (we are well aware that such a data layout is not ideal)

Requirement: Run a query against this data to find a small set of records, maybe around 100 rows matching some criteria.

Code:

import sys
from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    # Read the Avro data using the spark-avro package
    df_input = sqlContext.read.format("com.databricks.spark.avro").load("hdfs://nameservice1/path/to/our/data")
    # Keep only the rows matching the filter criteria
    df_filtered = df_input.where("someattribute in ('filtervalue1', 'filtervalue2')")
    cnt = df_filtered.count()
    print("Record count: %i" % cnt)

Submit the code:

spark-submit --master yarn --num-executors 50 --executor-memory 2G --driver-memory 50G --driver-cores 10 filter_large_data.py

Issue: This runs for many hours without producing any meaningful output. Eventually it crashes, either with a GC error or a disk-out-of-space error, or we are forced to kill it. We've played with different values for the --driver-memory setting, up to 200 GB. This resulted in the program running for over six hours, at which point we killed it. The corresponding query in Hive or Pig takes around 1.5 - 2 hours.

Question: Where are we going wrong? 🙂 Many thanks in advance, Martin
04-06-2016
08:38 PM
Another option would be to use Oozie which now has a Spark action. Regards, Martin
- Tags:
- oo
03-28-2016
01:30 AM
Hi, a couple of suggestions for attacking the problem by simplifying it somewhat: 1. Can you try with a table and data which are not partitioned? 2. Can you try with an external table instead of a managed one, using a path different from the Hive warehouse? Martin
03-25-2016
02:59 AM
Surely, if you are accessing Impala from Python, you should be able to parse the output of "show partitions" etc. programmatically in order to achieve what you want to do?
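For example, a minimal sketch using the impyla client (host, port and table name are just illustrative assumptions):

# Sketch: query partition metadata from Python via impyla and work with it
# programmatically. Host, port and table name are illustrative assumptions.
from impala.dbapi import connect

conn = connect(host="impalad-host", port=21050)
cur = conn.cursor()
cur.execute("SHOW PARTITIONS mydb.mytable")
# Column names for the partition listing come back in cur.description
columns = [c[0] for c in cur.description]
for row in cur.fetchall():
    print(dict(zip(columns, row)))
cur.close()
conn.close()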
03-20-2016
10:41 PM
Answering my own question, found this: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_avro.html dzimka, hope this works for you too.
03-20-2016
10:29 PM
Hi Jamiet, You should be able to achieve this using impala-shell and then storing the output as a table, i.e. from the command line run something like:

impala-shell --delimited -q "show partitions database.table;" --output_file partitions.out

Then upload the output file to HDFS and create a table over it. The only downside is that it is not dynamically updated; you'd have to define some schedule for it (maybe using Oozie) at a frequency which is acceptable for your requirement. The full reference for impala-shell is here. Hope this helps, Martin
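If you want to drive the whole flow from a script, a rough Python sketch is below; the database, table, HDFS path and column layout of the metadata table are all illustrative assumptions:

# Sketch: dump partition metadata with impala-shell, push it to HDFS and
# create an external table over it. Database, table, paths and the column
# layout of the metadata table are illustrative assumptions.
import subprocess

subprocess.check_call([
    "impala-shell", "--delimited",
    "-q", "show partitions database.table;",
    "--output_file", "partitions.out",
])
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/tmp/partitions_meta"])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", "partitions.out", "/tmp/partitions_meta/"])
subprocess.check_call([
    "impala-shell", "-q",
    "CREATE EXTERNAL TABLE IF NOT EXISTS database.table_partitions "
    "(partition_line STRING) "
    "LOCATION '/tmp/partitions_meta'",
])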
11-17-2015
11:19 PM
Hello, I want to use the Streaming action in an Oozie workflow to run some Python scripts. This works great for text data; however, as all our data is stored as Avro, I am trying to figure out how to pass Avro data into Python without first having to convert it to text. What parameters need to be set in the Streaming action, and are there any jars to be explicitly added in the workflow or in the Oozie shared lib? Thanks, Martin
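For context, this is roughly what I imagine the Python side would look like, assuming the job were configured to use Avro's AvroAsTextInputFormat (from the avro-mapred jar) so that each record reaches the script as a line of JSON; whether that is the right input format and which jars it needs is exactly what I'm unsure about:

# Sketch of a streaming mapper, assuming the job uses
# org.apache.avro.mapred.AvroAsTextInputFormat so each Avro record arrives
# on stdin as one JSON-encoded line.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    # Emit a tab-separated key/value pair for the rest of the pipeline;
    # "id" and "value" are hypothetical field names.
    print("%s\t%s" % (record.get("id"), record.get("value")))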
11-16-2015
11:09 PM
Hello, Does anyone have any concrete examples of how to use the HDFS file concatenation functionality introduced in HDFS-222? Thanks in advance, Martin
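For context, one route I have been looking at is the WebHDFS CONCAT operation, which seems to expose the same functionality over REST; a rough sketch with the requests library is below (namenode host/port, user and paths are placeholders, and I have not verified the preconditions on the block sizes of the source files):

# Sketch: call the WebHDFS CONCAT operation to append source files to a
# target file. Namenode host/port, user and paths are placeholders.
import requests

namenode = "http://namenode-host:50070"
target = "/data/merged/part-all"
sources = ["/data/merged/part-00001", "/data/merged/part-00002"]

resp = requests.post(
    "%s/webhdfs/v1%s" % (namenode, target),
    params={
        "op": "CONCAT",
        "user.name": "hdfs",  # placeholder user
        "sources": ",".join(sources),
    },
)
resp.raise_for_status()
print("Concatenated %d files into %s" % (len(sources), target))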
09-14-2015
08:56 PM
I think it depends on how much of the log file you want to see. If it's just a few lines, you can read the log file in a shell script and print out those lines as LINE1=... , LINE2=... etc. Then check the "Capture output" flag on the Oozie shell action, which will create variables that you can then pick up in your email action with the syntax ${wf:actionData('YourShellAction')['LINE1']} etc. Hope this helps. Martin
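For illustration, the script run by the shell action could be as simple as this Python sketch (the log path and number of lines are hypothetical):

# Sketch: print the last few lines of a log file as KEY=VALUE pairs so the
# Oozie shell action's "Capture output" option can pick them up.
# The log path and line count are hypothetical.
import sys

LOG_PATH = "/var/log/myjob/myjob.log"
NUM_LINES = 3

with open(LOG_PATH) as f:
    last_lines = f.readlines()[-NUM_LINES:]

for i, line in enumerate(last_lines, start=1):
    # Capture-output expects java.util.Properties style KEY=VALUE lines on stdout.
    sys.stdout.write("LINE%d=%s\n" % (i, line.rstrip("\n")))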
09-14-2015
08:50 PM
Hey there, For what it's worth, have a look through this forum question. Though in the end I gave up on trying to do this, and as much as possible we are moving away from shell actions in Oozie. Martin
09-14-2015
10:00 AM
Hi Harsh, thanks for the follow-up. In my observation (on CDH 5.3.4), Hive actions are almost always run as the user who submitted the Oozie workflow in Hue, and as such any directory created by the Hive table creation is also owned by this user. But clearly sometimes it is the service "hive" user. Is it deterministic? I.e. given a Hive workflow action with a given script, can I tell which user will end up running it? Thanks, Martin
09-14-2015
09:26 AM
Hello, As per my understanding, Hive actions in Oozie workflows are run as the same user who submitted the workflow. As such, statements such as CREATE EXTERNAL TABLE t1 ... LOCATION /foo/bar/t1 will result in the necessary directory structure in HDFS being created as that user. This is what we normally observe. But in a few cases, I have seen that the directory gets created as user "hive", which then results in all sorts of HDFS permission errors such as:

org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=hive, access=WRITE, inode="/foo/bar":my_expected_user:my_expected_group:drwxr-xr-x

Has anyone come across this behaviour? Is it expected under some conditions, or is it a misconfiguration or a bug? Many thanks, Martin
09-02-2015
05:13 AM
Hi Thomas, We configured Flume to batch together 1,000 XML files and store them as a SequenceFile. Do let me know how you decide to proceed with your protobuf data, I think it is a very similar requirement. Regards, Martin
06-24-2015
07:57 PM
Thanks Harsh. So I take this as confirmation that in CDH 5.3 it is not possible to configure the workflow to ignore errors on particular actions? Thanks, Martin
06-24-2015
11:04 AM
Hi Harsh, thanks for your reply. Where exactly is this gear icon? Is it something newly introduced in the new Hue Oozie editor? I am on CDH 5.3 and I don't see a gear icon either in the workflow design pane or in the Edit Action view. Or am I just missing it? Thanks, Martin
06-22-2015
08:41 PM
Hello, Is there a way to ignore errors in Oozie workflows defined from Hue? I.e. if I have a long complex workflow, there might be some steps where an error should be noted, but it should not cause the whole workflow to fail. I suppose if the workflow.xml was developed by hand then it would be possible to specify the "error to=" transition, but I am not sure if this is in any way possible from the Hue workflow editor? Thanks, Martin
05-14-2015
04:06 AM
Thanks for the response, but unfortunately I am none the wiser 😐 Specifically, I want to run a shell action as another user. What we observe is that shell actions are not run as the user who logged in to Hue; rather, they run under the "yarn" user. Is there any way to get shell actions to run as another user? Thanks, Martin
04-29-2015
08:56 PM
Got it thanks!
04-29-2015
08:32 PM
Ah, my bad, thanks. That shows oozie_jobs_count = 100, so obviously the change is not being picked up. Is /etc/hue/conf/hue.ini the right file (CM is being used)?
04-29-2015
07:03 PM
Hi Romain, thanks for your response. Unfortunately that didn't work: I uncommented that line (under /etc/hue/conf/hue.ini) and changed it to 500, but this had no effect. I also wanted to check using dump_config, but http://hue:port/dump_config is giving me "page not found". This is on CDH 5.3.1. Thanks, Martin