Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 25253 | 07-05-2016 01:19 PM |
| | 8587 | 04-01-2016 02:16 PM |
| | 2176 | 02-17-2016 11:54 AM |
| | 5740 | 02-17-2016 11:50 AM |
| | 12783 | 02-16-2016 02:08 AM |
12-11-2015
01:59 AM
For really small test environments, you can disable the Secondary NameNode; our sandbox does not have a Secondary NameNode running.
12-11-2015
01:55 AM
1 Kudo
@Luis Antonio Torres It worked for me. Can you check the content of the /usr/hdp/current/spark-client/conf/hive-site.xml you are using? Mine is like this:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>
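As a quick sanity check (just a sketch, assuming you run it from the pyspark shell on the sandbox, where sc is already defined), you can point a HiveContext at the same metastore and list the tables it sees:
from pyspark.sql import HiveContext
# HiveContext reads hive-site.xml from the Spark conf directory
hive_ctx = HiveContext(sc)
for table in hive_ctx.sql("show tables").collect():
    print table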
12-11-2015
12:48 AM
Awesome! You can check the current number of partitions with the command below:
print csv_rdd.getNumPartitions()
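If that number looks too low for your cluster, here is a small sketch of how you could raise it (15 is just an illustrative value):
# repartition returns a new RDD spread across more partitions
csv_rdd = csv_rdd.repartition(15)
print csv_rdd.getNumPartitions()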
12-10-2015
05:16 PM
1 Kudo
You could load your CSV directly, but I tested here and indeed distinct is taking much longer with data frames. Can you describe your environment?
- Hortonworks version
- Spark version
- hardware configuration
- Spark mode (local mode or Spark on YARN)
Lastly, if you have enough cores/processors and since your file is small, Spark might be choosing a low level of parallelism. You can try increasing the parallelism, like this:
distinctValues = rawTrainData.map(lambda x : x[i]).distinct(numPartitions = 15).collect()
Let me know if it got faster 🙂
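If you want to compare the two runs, here is a rough sketch of how you could time them from the shell (rawTrainData and i as in your own code):
import time
start = time.time()
distinctValues = rawTrainData.map(lambda x : x[i]).distinct(numPartitions = 15).collect()
print "distinct with 15 partitions took %.1f seconds" % (time.time() - start)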
12-10-2015
02:28 PM
2 Kudos
@Vitor Batista Data Frames are supposed to be faster than Python RDD operations; check slide 20 of this presentation: http://www.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data Could you try the code below and check if it's faster?
from pyspark.sql import SQLContext, Row
input_file = "hdfs:///tmp/your_text_file"
raw_rdd = sc.textFile(input_file)
csv_rdd = raw_rdd.map(lambda x: x.split(","))
row_data = csv_rdd.map(lambda p: Row(
    field1=p[0],
    field2=p[1],
    field3=p[2]
))
df = sqlContext.createDataFrame(row_data)
categories = {}
idxCategories = [0, 1, 2]  # indexes of the columns that contain categorical data
for i in idxCategories:
    distinctValues = df.map(lambda x: x[i]).distinct().collect()
    categories[i] = distinctValues
print categories[0]
print categories[1]
print categories[2]
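As a follow-up (a minimal sketch reusing the same df and the field names assumed above), you could also stay entirely in the DataFrame API instead of dropping back to an RDD map, which keeps the distinct inside Spark's optimized plan:
# distinct values per column through the DataFrame API
for idx, name in enumerate(["field1", "field2", "field3"]):
    categories[idx] = [row[0] for row in df.select(name).distinct().collect()]
print categories[0]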
12-10-2015
12:46 PM
7 Kudos
@Luis Antonio Torres I did a few tests and I think you just need to change the location of --files; it must come before your .jar file. Find my sample class here: https://github.com/gbraccialli/SparkUtils/blob/master/src/main/scala/com/github/gbraccialli/spark/HiveCommand.scala The project is here: https://github.com/gbraccialli/SparkUtils Sample spark-submit with hive commands as parameters:
git clone https://github.com/gbraccialli/SparkUtils
cd SparkUtils/
mvn clean package
spark-submit \
--class com.github.gbraccialli.spark.HiveCommand \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
target/SparkUtils-1.0.0-SNAPSHOT.jar "show tables" "select * from sample_08"
12-10-2015
12:03 PM
I mean a hotfix to be applied now to the HDP 2.3.2 installation, before the upgrade.
12-10-2015
10:46 AM
1 Kudo
Thank you @Ayub Pathan. Is it possible to provide a fix?
12-10-2015
02:30 AM
Thank you @kmungee. Would you have any other configuration to check on the NameNode service?
12-09-2015
08:23 PM
@David Kaiser Notice that Hive UDF/UDAF/UDTFs are compatible with the Spark SQL API, so the ESRI Hive library would work great with Spark SQL as well.
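For illustration, a minimal sketch of how one of the ESRI Hive UDFs could be registered from the pyspark shell (the jar path and class name here are assumptions; adjust them to your install):
from pyspark.sql import HiveContext
hive_ctx = HiveContext(sc)
# register an ESRI geometry UDF and call it through Spark SQL
hive_ctx.sql("ADD JAR /tmp/spatial-sdk-hive.jar")
hive_ctx.sql("CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point'")
hive_ctx.sql("SELECT ST_Point(1.0, 2.0)").show()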