Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
My Accepted Solutions
Views | Posted |
---|---|
25442 | 07-05-2016 01:19 PM |
8714 | 04-01-2016 02:16 PM |
2211 | 02-17-2016 11:54 AM |
5838 | 02-17-2016 11:50 AM |
12913 | 02-16-2016 02:08 AM |
12-11-2015
01:59 AM
For really small test environments, you can disable the Secondary NameNode. Our sandbox does not have a Secondary NameNode running.
12-11-2015
01:55 AM
1 Kudo
@Luis Antonio Torres It worked for me. Can you check the content of the /usr/hdp/current/spark-client/conf/hive-site.xml you are using? Mine looks like this:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>
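If it helps to check the file programmatically, here is a minimal sketch that reads one property out of a hive-site.xml style document using only the Python standard library. The helper name read_hive_property and the SAMPLE string are illustrative assumptions (the sample just repeats the configuration shown above); on a real node you would read the text from your own hive-site.xml instead.

```python
import xml.etree.ElementTree as ET


def read_hive_property(xml_text, name):
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Sample content copied from the post above (an assumption for illustration);
# replace it with the text of your own hive-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>"""

print(read_hive_property(SAMPLE, "hive.metastore.uris"))
```

If the function returns None for hive.metastore.uris, the file Spark is picking up probably does not point at a metastore at all.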
12-11-2015
12:48 AM
Awesome! You can check the current number of partitions with the command below:
print(csv_rdd.getNumPartitions())
12-10-2015
05:16 PM
1 Kudo
You could load your CSV directly, but I tested it here and distinct does indeed take much longer with DataFrames. Can you describe your environment?
- Hortonworks version
- Spark version
- hardware configuration
- Spark mode (local mode or Spark on YARN)
Lastly, if you have enough cores and since your file is small, Spark might be choosing a low level of parallelism. You can try increasing the parallelism, like this:
distinctValues = rawTrainData.map(lambda x: x[i]).distinct(numPartitions=15).collect()
Let me know if it got faster 🙂
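To picture what the numPartitions argument changes, here is a rough pure-Python sketch of a partitioned distinct: values are routed to buckets by hash, so equal values always meet in the same bucket and each bucket can be deduplicated independently. This is only an illustration of the idea, not Spark's code; distinct_partitioned and the sample column are hypothetical.

```python
def distinct_partitioned(values, num_partitions):
    """Deduplicate values by hash-partitioning them into independent buckets."""
    # Equal values always hash to the same bucket, so each bucket can be
    # deduplicated on its own (in Spark, each one would be a separate task).
    buckets = [set() for _ in range(num_partitions)]
    for value in values:
        buckets[hash(value) % num_partitions].add(value)
    # Concatenating the buckets yields every distinct value exactly once.
    return [value for bucket in buckets for value in bucket]


# Made-up categorical column, for illustration only.
column = ["a", "b", "a", "c", "b", "a"]
print(sorted(distinct_partitioned(column, 3)))
```

More buckets means more independent pieces of work, which only pays off when there are enough cores to process them in parallel.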
12-10-2015
02:28 PM
2 Kudos
@Vitor Batista DataFrames are supposed to be faster than Python RDD operations; see slide 20 of this presentation: http://www.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data
Could you try the code below and check whether it's faster?
from pyspark.sql import SQLContext, Row

# Assumes sc (the SparkContext) already exists, as it does in the pyspark shell
sqlContext = SQLContext(sc)

input_file = "hdfs:///tmp/your_text_file"
raw_rdd = sc.textFile(input_file)
# Split each line into a list of fields
csv_rdd = raw_rdd.map(lambda x: x.split(","))
row_data = csv_rdd.map(lambda p: Row(
    field1=p[0],
    field2=p[1],
    field3=p[2]
))
df = sqlContext.createDataFrame(row_data)

categories = {}
# idxCategories contains the indexes of the columns that hold categorical data
idxCategories = [0, 1, 2]
for i in idxCategories:
    # Go through df.rdd explicitly to map over the rows of the DataFrame
    distinctValues = df.rdd.map(lambda x: x[i]).distinct().collect()
    categories[i] = distinctValues

print(categories[0])
print(categories[1])
print(categories[2])
12-10-2015
12:46 PM
7 Kudos
@Luis Antonio Torres I ran a few tests and I think you just need to change the position of --files: it must come before your .jar file, because spark-submit passes everything after the application jar to the application as arguments.
Find my sample class here: https://github.com/gbraccialli/SparkUtils/blob/master/src/main/scala/com/github/gbraccialli/spark/HiveCommand.scala
The project is here: https://github.com/gbraccialli/SparkUtils
Sample spark-submit with Hive commands as parameters:
git clone https://github.com/gbraccialli/SparkUtils
cd SparkUtils/
mvn clean package
spark-submit \
  --class com.github.gbraccialli.spark.HiveCommand \
  --master yarn-cluster \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
  target/SparkUtils-1.0.0-SNAPSHOT.jar "show tables" "select * from sample_08"
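If you launch the job from a script, one way to avoid the ordering mistake is to assemble the command from parts so that options can never end up after the jar. The sketch below is only an illustration: build_spark_submit is a hypothetical helper, and the option list is a shortened copy of the command above.

```python
def build_spark_submit(app_jar, app_args, options):
    """Return a spark-submit argv with every option placed before the jar."""
    argv = ["spark-submit"]
    for flag, value in options:
        argv += [flag, value]
    # The application jar comes after all spark-submit options ...
    argv.append(app_jar)
    # ... and everything after it is handed to the application itself.
    argv += list(app_args)
    return argv


cmd = build_spark_submit(
    "target/SparkUtils-1.0.0-SNAPSHOT.jar",
    ["show tables", "select * from sample_08"],
    [
        ("--class", "com.github.gbraccialli.spark.HiveCommand"),
        ("--master", "yarn-cluster"),
        ("--files", "/usr/hdp/current/spark-client/conf/hive-site.xml"),
    ],
)
print(cmd)
```

On a machine where spark-submit is on the PATH, the resulting list could be passed to subprocess.call(cmd).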
12-10-2015
12:03 PM
I mean a hotfix that can be applied now to an HDP 2.3.2 installation, before upgrading.
12-10-2015
10:46 AM
1 Kudo
Thank you @Ayub Pathan. Is it possible to provide a fix?
12-10-2015
02:30 AM
Thank you @kmungee. Are there other configuration settings I should check on the NameNode service?
12-09-2015
08:23 PM
@David Kaiser Note that Hive UDFs/UDAFs/UDTFs are compatible with the Spark SQL API, so the Hive ESRI library would be a great fit for Spark SQL as well.