Member since: 04-25-2016
Posts: 7
Kudos Received: 4
Solutions: 2
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3542 | 07-05-2016 07:54 AM |
| | 1926 | 06-06-2016 03:45 PM |
01-26-2017
09:05 AM
Hi Antin, for XML files I would recommend using the SequenceFile container format. Create one big SequenceFile containing all your XML files. A SequenceFile works much like a map, with entries consisting of a key and a value. The key of each entry in your SequenceFile could be, for example, the filename plus a date, and the value the content of the XML file. The advantage of using a container format such as SequenceFile, Avro or Parquet is that the files remain splittable even when compressed with gzip, because the compression is applied per block rather than to the file as a whole. To improve compression and decompression speed, however, I would recommend the Snappy codec. The following Stack Overflow discussion thread will also clarify things. Mark
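In case it helps, here is a minimal PySpark sketch of that idea. The HDFS paths and the date suffix in the key are placeholders, not a prescription:

```python
import os
from datetime import date

from pyspark import SparkContext

sc = SparkContext(appName="PackXmlIntoSequenceFile")

# wholeTextFiles returns (path, content) pairs, one per XML file
xml_files = sc.wholeTextFiles("hdfs:///data/xml_input/")

# Build a "filename + date" key for every entry, as suggested above
stamp = date.today().isoformat()
pairs = xml_files.map(lambda kv: (os.path.basename(kv[0]) + "_" + stamp, kv[1]))

# Write one SequenceFile, block-compressed with Snappy
pairs.saveAsSequenceFile(
    "hdfs:///data/xml_packed",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)
```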
07-06-2016
03:22 PM
As vshukla is saying, I can indeed recommend using Spark SQL with the HiveContext.
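A minimal sketch of what that looks like in PySpark (Spark 1.x API; the table name is made up):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="SparkSqlOnHive")
sqlContext = HiveContext(sc)

# Queries run against the Hive metastore and come back as DataFrames
df = sqlContext.sql("SELECT * FROM my_table LIMIT 10")
df.show()
```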
07-05-2016
07:54 AM
1 Kudo
Hi Lukas, PySpark's SparkContext (sc) also has its own methods to read data from HDFS: sc.textFile(...), sc.wholeTextFiles(...) and sc.binaryFiles(...). Why don't you try using those? You read the data from HDFS and directly get back an RDD. If you use these SparkContext methods, make sure to add your core-site.xml and hdfs-site.xml config files to the Spark conf dir; by the way, the Spark conf dir can be pointed at any desired location via the SPARK_CONF_DIR environment variable.
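For example, a short sketch of the three methods; the HDFS paths are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="ReadFromHdfs")

# One RDD element per line of text
lines = sc.textFile("hdfs:///data/logs/")

# (path, file content) pairs -- convenient for many small files
docs = sc.wholeTextFiles("hdfs:///data/docs/")

# (path, raw bytes) pairs for binary data
blobs = sc.binaryFiles("hdfs:///data/images/")

print(lines.count(), docs.count(), blobs.count())
```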
06-06-2016
03:45 PM
Hi John, I would recommend reading the paper "Spark SQL: Relational Data Processing in Spark", which describes the steps you are mentioning, and the Catalyst optimizer in particular, in more detail. https://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
05-02-2016
10:13 AM
2 Kudos
Spark DataFrames use the Catalyst optimizer under the hood. The Spark code gets transformed into an abstract syntax tree, or logical plan, to which several optimizations are applied before code is generated from it. See the following paper for the full explanation: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf. The reason Spark DataFrames are fast in all languages is that whether you use Python, Java or Scala, the implementation used under the hood is the Scala implementation of the DataFrame and the Catalyst optimizer. Whichever language you write in, the logical plans are passed to the same Catalyst optimizer.
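If you want to see the plans Catalyst produces, DataFrame.explain gives a quick look. A small sketch with made-up data:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="CatalystPlans")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# extended=True prints the parsed, analyzed and optimized logical plans
# plus the physical plan -- the same pipeline regardless of whether the
# query was written in Python, Java or Scala, since planning happens on
# the JVM side
df.filter(df.age > 30).select("name").explain(True)
```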