Member since: 10-19-2014
Posts: 58
Kudos Received: 6
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6050 | 03-20-2016 10:41 PM
 | 11098 | 04-26-2015 02:30 AM
08-14-2016
06:07 AM
Thanks Harsh for confirming that there is no external schema file concept in Parquet, and for sharing the link for the CREATE TABLE ... LIKE PARQUET ... syntax. This seems to be specific to Impala, however; is there a generic approach that works across a stack of tools including Spark, Pig, Hive, and Impala (with Spark and Pig not using HCatalog)? Many thanks, Martin
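One generic point worth noting here: Parquet files are self-describing, with the schema embedded in each file's footer, so every tool in the stack can read it without a sidecar file. A minimal PySpark sketch (the HDFS path is a placeholder) of reading that schema from Spark without HCatalog:

```python
# Minimal sketch, assuming Spark 1.x with the SQLContext API used elsewhere
# in this thread; the HDFS path is a placeholder.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Parquet is self-describing: the schema is read from the file footers,
# not from an external schema file.
df = sqlContext.read.parquet("hdfs://nameservice1/path/to/parquet/data")
df.printSchema()
```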
08-14-2016
03:00 AM
Hi, Is there an external schema file concept for Parquet, in a similar way to Avro with avsc schema files which can be referenced in CREATE TABLE statements? Thanks, Martin
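For reference, this is the Avro pattern being alluded to: a hedged sketch of a Hive DDL statement (table name, location, and schema path are placeholders) in which the schema lives in an external .avsc file:

```sql
-- Hedged sketch; table name, location, and schema path are placeholders.
-- The Avro schema lives in an external .avsc file referenced from the DDL.
CREATE EXTERNAL TABLE events
STORED AS AVRO
LOCATION '/data/events'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/events.avsc');
```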
Labels:
- HDFS
07-27-2016
01:23 AM
Hi sairamvj, I would suggest you open a new thread for your question, as it is not related to the topic of this thread. Martin
05-15-2016
09:04 PM
Hello, What is the right way to pass the -no_multiquery option to Pig from an Oozie workflow developed in Hue? Thanks, Martin
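For context, one candidate shape for this is passing the option through the pig action's <argument> elements in the generated workflow.xml; a hedged sketch (action name, script, and transitions are placeholders):

```xml
<!-- Hedged sketch: action name, script, and transitions are placeholders.
     Each Pig CLI option goes into its own <argument> element. -->
<action name="pig-node">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>myscript.pig</script>
        <argument>-no_multiquery</argument>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>
```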
Labels:
- Apache Oozie
- Apache Pig
- Cloudera Hue
04-06-2016
08:59 PM
1 Kudo
Hello, Here's our scenario:
- Data is stored in HDFS as Avro
- Data is partitioned and there are approx. 120 partitions
- Each partition has around 3,200 files in it
- The file sizes vary, from as small as 2 kB up to 50 MB
- In total there is roughly 3 TB of data (we are well aware that such a data layout is not ideal)

Requirement: Run a query against this data to find a small set of records, maybe around 100 rows matching some criteria.

Code:

```python
import sys
from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    # Load the Avro data via the spark-avro package
    df_input = sqlContext.read.format("com.databricks.spark.avro").load("hdfs://nameservice1/path/to/our/data")
    # Keep only the rows matching the filter criteria
    df_filtered = df_input.where("someattribute in ('filtervalue1', 'filtervalue2')")
    cnt = df_filtered.count()
    print("Record count: %i" % cnt)
```

Submit the code:

```
spark-submit --master yarn --num-executors 50 --executor-memory 2G --driver-memory 50G --driver-cores 10 filter_large_data.py
```

Issue: This runs for many hours without producing any meaningful output, and eventually it crashes with a GC error or a disk-out-of-space error, or we are forced to kill it. We've tried different values for the --driver-memory setting, up to 200 GB; that run lasted over six hours, at which point we killed it. The corresponding query in Hive or Pig takes around 1.5 - 2 hours.

Question: Where are we going wrong? 🙂 Many thanks in advance, Martin
Labels:
- Apache Hive
- Apache Pig
- Apache Spark
- Apache YARN
- HDFS
03-20-2016
10:41 PM
Answering my own question, I found this: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_avro.html. dzimka, hope this works for you too.
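For anyone landing here later, the usage pattern that page documents looks roughly like this; a minimal sketch (paths are placeholders, and the spark-avro package must be on the classpath):

```python
# Minimal sketch of the spark-avro read/write pattern; paths are placeholders
# and the com.databricks:spark-avro package is assumed to be on the classpath.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Read Avro files into a DataFrame
df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/input")

# Write a DataFrame back out as Avro
df.write.format("com.databricks.spark.avro").save("/path/to/output")
```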
11-16-2015
11:09 PM
Hello, Does anyone have any concrete examples of how to use the HDFS file concatenation functionality introduced in HDFS-222? Thanks in advance, Martin
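To make the question concrete: concat is also exposed through the WebHDFS REST API as the CONCAT operation, so a sketch without Java could look like this (the namenode address and all paths are placeholders, and HDFS's usual restrictions on concat sources still apply):

```python
# Hedged sketch: invoking HDFS concat via the WebHDFS REST API.
# The namenode address and all paths are placeholders; depending on the
# cluster's security setup, a user.name parameter or authentication may
# also be required.
import requests

NAMENODE = "http://namenode.example.com:50070"
TARGET = "/data/merged/part-all"
SOURCES = "/data/merged/part-0001,/data/merged/part-0002"

# CONCAT appends the source files onto the target and removes the sources.
resp = requests.post(
    "%s/webhdfs/v1%s" % (NAMENODE, TARGET),
    params={"op": "CONCAT", "sources": SOURCES},
)
resp.raise_for_status()
```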
Labels:
- HDFS
05-14-2015
04:06 AM
Thanks for the response, unfortunately I am none the wiser 😐 Specifically, I want to run a shell action as another user. What we observe is that shell actions are not run as the user who logged in to Hue; rather, they run under the user "yarn". Is there any way to get shell actions to run as another user? Thanks, Martin
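For readers hitting the same thing: with the default YARN container executor, all containers run as the local "yarn" user. A commonly mentioned workaround for HDFS operations inside the script is setting HADOOP_USER_NAME in the shell action; a hedged sketch (script name and transitions are placeholders):

```xml
<!-- Hedged sketch: script name and transitions are placeholders. Setting
     HADOOP_USER_NAME makes HDFS operations inside the script act as the
     workflow user even though the process itself runs as "yarn". -->
<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>myscript.sh</exec>
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>myscript.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```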