Member since: 04-25-2016
Posts: 7
Kudos Received: 4
Solutions: 2
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3542 | 07-05-2016 07:54 AM |
| | 1926 | 06-06-2016 03:45 PM |
01-26-2017
09:05 AM
Hi Antin, for XML files I would recommend using the SequenceFile container format. Create one big SequenceFile containing all your XML files. A SequenceFile works much like a map, with entries consisting of a key and a value. The key of each entry in your SequenceFile could be, for example, the filename plus a date, and the value the content of the XML file. The advantage of using a container format such as SequenceFile, Avro or Parquet is that the files remain splittable even when compressed with gzip, because the compression is applied per block rather than to the file as a whole. To improve compression and decompression speed, however, I would recommend the Snappy codec. The following Stack Overflow discussion thread will also clarify things. Mark
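In case it helps, here is a minimal PySpark sketch of that idea. The HDFS paths and the date suffix in the key are placeholders, not a prescription:

```python
import os
from datetime import date

from pyspark import SparkContext

sc = SparkContext(appName="PackXmlIntoSequenceFile")

# wholeTextFiles returns (path, content) pairs, one per XML file
xml_files = sc.wholeTextFiles("hdfs:///data/xml_input/")

# Build a "filename + date" key for every entry, as suggested above
stamp = date.today().isoformat()
pairs = xml_files.map(lambda kv: (os.path.basename(kv[0]) + "_" + stamp, kv[1]))

# Write one SequenceFile, block-compressed with Snappy
pairs.saveAsSequenceFile(
    "hdfs:///data/xml_packed",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)
```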
07-06-2016
03:22 PM
As vshukla is saying, I can indeed recommend using Spark SQL with the HiveContext.
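A minimal sketch of what that looks like in PySpark (Spark 1.x API; the table name is made up):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="SparkSqlOnHive")
sqlContext = HiveContext(sc)

# Queries run against the Hive metastore and come back as DataFrames
df = sqlContext.sql("SELECT * FROM my_table LIMIT 10")
df.show()
```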
07-05-2016
07:54 AM
1 Kudo
Hi Lukas, PySpark's SparkContext (sc) also has its own methods to read data from HDFS: sc.textFile(...), sc.wholeTextFiles(...) and sc.binaryFiles(...). Why don't you try using those? You read the data from HDFS and directly get back an RDD. If you use these SparkContext methods, make sure to add your core-site.xml and hdfs-site.xml config files to the Spark conf dir; by the way, the Spark conf dir can be pointed at any desired location via the SPARK_CONF_DIR environment variable.
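For example, a short sketch of the three methods; the HDFS paths are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="ReadFromHdfs")

# One RDD element per line of text
lines = sc.textFile("hdfs:///data/logs/")

# (path, file content) pairs -- convenient for many small files
docs = sc.wholeTextFiles("hdfs:///data/docs/")

# (path, raw bytes) pairs for binary data
blobs = sc.binaryFiles("hdfs:///data/images/")

print(lines.count(), docs.count(), blobs.count())
```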
06-06-2016
03:45 PM
Hi John, I would recommend reading the paper "Spark SQL: Relational Data Processing in Spark", which describes the steps you are mentioning, and the Catalyst optimizer in particular, in more detail. https://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
05-02-2016
10:13 AM
2 Kudos
Spark DataFrames use the Catalyst optimizer under the hood. The Spark code gets transformed into an abstract syntax tree, or logical plan, to which several optimizations are applied before code is generated from it. See the following paper for the full explanation: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf. The reason Spark DataFrames are fast in all languages is that whether you use Python, Java or Scala, the implementation used under the hood is the Scala implementation of the DataFrame and the Catalyst optimizer. Whichever language you write in, the logical plans are passed to the same Catalyst optimizer.
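If you want to see the plans Catalyst produces, DataFrame.explain gives a quick look. A small sketch with made-up data:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="CatalystPlans")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# extended=True prints the parsed, analyzed and optimized logical plans
# plus the physical plan -- the same pipeline regardless of whether the
# query was written in Python, Java or Scala, since planning happens on
# the JVM side
df.filter(df.age > 30).select("name").explain(True)
```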