Member since: 10-19-2014
Posts: 58
Kudos Received: 6
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6050 | 03-20-2016 10:41 PM
 | 11098 | 04-26-2015 02:30 AM
08-14-2016
06:07 AM
Thanks Harsh for confirming that there is no external schema file concept in Parquet, and for sharing the link for the CREATE TABLE ... LIKE PARQUET ... syntax. This seems to be specific to Impala, however; is there a generic approach that works across a stack of tools including Spark, Pig, Hive, and Impala (with Spark and Pig not using HCatalog)? Many thanks, Martin
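One generic point worth noting here: Parquet files are self-describing, with the schema embedded in each file's footer, so every tool in the stack can read it without a sidecar file. A minimal PySpark sketch (the HDFS path is a placeholder) of reading that schema from Spark without HCatalog:

```python
# Minimal sketch, assuming Spark 1.x with the SQLContext API used elsewhere
# in this thread; the HDFS path is a placeholder.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Parquet is self-describing: the schema is read from the file footers,
# not from an external schema file.
df = sqlContext.read.parquet("hdfs://nameservice1/path/to/parquet/data")
df.printSchema()
```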
08-14-2016
03:00 AM
Hi, Is there an external schema file concept for Parquet, in a similar way to Avro with avsc schema files which can be referenced in CREATE TABLE statements? Thanks, Martin
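For reference, this is the Avro pattern being alluded to: a hedged sketch of a Hive DDL statement (table name, location, and schema path are placeholders) in which the schema lives in an external .avsc file:

```sql
-- Hedged sketch; table name, location, and schema path are placeholders.
-- The Avro schema lives in an external .avsc file referenced from the DDL.
CREATE EXTERNAL TABLE events
STORED AS AVRO
LOCATION '/data/events'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/events.avsc');
```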
Labels:
- HDFS
07-27-2016
01:23 AM
Hi sairamvj, I would suggest you open a new thread for your question, as it is not related to the topic of this thread. Martin
05-15-2016
09:04 PM
Hello, What is the right way to pass the -no_multiquery option to Pig from an Oozie workflow developed in Hue? Thanks, Martin
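For context, one candidate shape for this is passing the option through the pig action's <argument> elements in the generated workflow.xml; a hedged sketch (action name, script, and transitions are placeholders):

```xml
<!-- Hedged sketch: action name, script, and transitions are placeholders.
     Each Pig CLI option goes into its own <argument> element. -->
<action name="pig-node">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>myscript.pig</script>
        <argument>-no_multiquery</argument>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>
```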
Labels:
- Apache Oozie
- Apache Pig
- Cloudera Hue
04-06-2016
08:59 PM
1 Kudo
Hello, Here's our scenario:
- Data is stored in HDFS as Avro
- Data is partitioned and there are approx. 120 partitions
- Each partition has around 3,200 files in it
- The file sizes vary, from as small as 2 kB up to 50 MB
- In total there is roughly 3 TB of data (we are well aware that such a data layout is not ideal)

Requirement: Run a query against this data to find a small set of records, maybe around 100 rows matching some criteria.

Code:

```python
import sys
from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    # Load the Avro data via the spark-avro package
    df_input = sqlContext.read.format("com.databricks.spark.avro").load("hdfs://nameservice1/path/to/our/data")
    # Keep only the rows matching the filter criteria
    df_filtered = df_input.where("someattribute in ('filtervalue1', 'filtervalue2')")
    cnt = df_filtered.count()
    print("Record count: %i" % cnt)
```

Submit the code:

```
spark-submit --master yarn --num-executors 50 --executor-memory 2G --driver-memory 50G --driver-cores 10 filter_large_data.py
```

Issue: This runs for many hours without producing any meaningful output, and eventually it crashes with a GC error or a disk-out-of-space error, or we are forced to kill it. We've tried different values for the --driver-memory setting, up to 200 GB; that run lasted over six hours, at which point we killed it. The corresponding query in Hive or Pig takes around 1.5 - 2 hours.

Question: Where are we going wrong? 🙂 Many thanks in advance, Martin
Labels:
- Apache Hive
- Apache Pig
- Apache Spark
- Apache YARN
- HDFS
03-20-2016
10:41 PM
Answering my own question, I found this: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_avro.html. dzimka, hope this works for you too.
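For anyone landing here later, the usage pattern that page documents looks roughly like this; a minimal sketch (paths are placeholders, and the spark-avro package must be on the classpath):

```python
# Minimal sketch of the spark-avro read/write pattern; paths are placeholders
# and the com.databricks:spark-avro package is assumed to be on the classpath.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Read Avro files into a DataFrame
df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/input")

# Write a DataFrame back out as Avro
df.write.format("com.databricks.spark.avro").save("/path/to/output")
```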
11-16-2015
11:09 PM
Hello, Does anyone have any concrete examples of how to use the HDFS file concatenation functionality introduced in HDFS-222? Thanks in advance, Martin
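To make the question concrete: concat is also exposed through the WebHDFS REST API as the CONCAT operation, so a sketch without Java could look like this (the namenode address and all paths are placeholders, and HDFS's usual restrictions on concat sources still apply):

```python
# Hedged sketch: invoking HDFS concat via the WebHDFS REST API.
# The namenode address and all paths are placeholders; depending on the
# cluster's security setup, a user.name parameter or authentication may
# also be required.
import requests

NAMENODE = "http://namenode.example.com:50070"
TARGET = "/data/merged/part-all"
SOURCES = "/data/merged/part-0001,/data/merged/part-0002"

# CONCAT appends the source files onto the target and removes the sources.
resp = requests.post(
    "%s/webhdfs/v1%s" % (NAMENODE, TARGET),
    params={"op": "CONCAT", "sources": SOURCES},
)
resp.raise_for_status()
```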
Labels:
- HDFS
05-14-2015
04:06 AM
Thanks for the response, unfortunately I am none the wiser 😐 Specifically, I want to run a shell action as another user. What we observe is that shell actions are not run as the user who logged in to Hue; rather, they run under the user "yarn". Is there any way to get shell actions to run as another user? Thanks, Martin
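For readers hitting the same thing: with the default YARN container executor, all containers run as the local "yarn" user. A commonly mentioned workaround for HDFS operations inside the script is setting HADOOP_USER_NAME in the shell action; a hedged sketch (script name and transitions are placeholders):

```xml
<!-- Hedged sketch: script name and transitions are placeholders. Setting
     HADOOP_USER_NAME makes HDFS operations inside the script act as the
     workflow user even though the process itself runs as "yarn". -->
<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>myscript.sh</exec>
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>myscript.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```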