Posts: 1954
Kudos Received: 1209
Solutions: 121
04-27-2016
03:47 PM
Great sample code. In most of my Spark apps that work with Parquet, there are a few SparkConf settings that help:
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

val sparkConf = new SparkConf()
sparkConf.set("spark.sql.parquet.compression.codec", "snappy")
sparkConf.set("spark.sql.parquet.mergeSchema", "true")
sparkConf.set("spark.sql.parquet.binaryAsString", "true")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "true")
sparkConf.set("spark.streaming.backpressure.enabled", "true")
Some of the compression settings are really important for different use cases, and you often need to turn them off or switch which codec you use depending on the use case (batch, streaming, SQL, large, small, many partitions, ...). The event log is enabled so you can look at how those Parquet files are worked with in the DAGs and metrics. Before you write some SparkSQL against that file, make sure you register a table name. If you don't want a write that will fail if the directory/file already exists, you can choose Append mode to add to it. It depends on your use case.
df1.registerTempTable("MyTableName")
val results = sqlContext.sql("SELECT name FROM MyTableName")
df1.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).parquet("data.parquet")
If you want to look at the data from the command line after you write it, you can download parquet-tools. This requires the Java JDK, Git and Maven installed.
git clone -b apache-parquet-1.8.0 https://github.com/apache/parquet-mr.git
cd parquet-mr
cd parquet-tools
mvn clean package -Plocal
04-27-2016
12:07 PM
1 Kudo
To get the current Avro Tools:
wget http://apache.claz.org/avro/avro-1.8.0/java/avro-tools-1.8.0.jar
There's some good documentation here: https://avro.apache.org/docs/1.8.0/gettingstartedjava.html#Compiling+the+schema
This article helped me: http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
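As a rough illustration of what the getting-started guide walks through (the user.avsc schema with a single name field is a made-up assumption), writing and reading a record with the Avro Java API from Scala looks roughly like this:

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Parse a schema file; user.avsc is a hypothetical schema with a "name" field.
val schema = new Schema.Parser().parse(new File("user.avsc"))

// Write one record to an Avro data file.
val user: GenericRecord = new GenericData.Record(schema)
user.put("name", "Alice")
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("users.avro"))
writer.append(user)
writer.close()

// Read it back.
val reader = new DataFileReader[GenericRecord](new File("users.avro"), new GenericDatumReader[GenericRecord](schema))
while (reader.hasNext) println(reader.next())
reader.close()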
04-27-2016
10:28 AM
1 Kudo
Repo Description: A few simple cases for teaching people to work with Spark and run various transformations and actions on different small data sets.
Repo Info:
Github Repo URL: https://github.com/tspannhw/SparkTransformations
Github account name: tspannhw
Repo name: SparkTransformations
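As a rough illustration (not code from the repo; it assumes a spark-shell session where sc is already defined), the kind of transformation-and-action exercise this covers might look like:

// Transformations are lazy; actions trigger execution (Spark 1.6 RDD API).
val numbers = sc.parallelize(1 to 10)
val evensDoubled = numbers.filter(_ % 2 == 0).map(_ * 2)
println(evensDoubled.count())          // action: 5 elements
println(evensDoubled.reduce(_ + _))    // action: 4 + 8 + 12 + 16 + 20 = 60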
04-27-2016
10:28 AM
1 Kudo
Repo Description: This will run on a sandbox, laptop or real Hortonworks Hadoop cluster with Spark 1.6.x. This example is Scala + Spark Core 1.6. It includes shell scripts and SBT for build and deploy. The program reads an Apache log file, parses it from a text file into a Scala case class, and runs a simple filter, map and reduce (see the sketch below).
Repo Info:
Github Repo URL: https://github.com/agilemobiledev/sparkworkshop
Github account name: agilemobiledev
Repo name: sparkworkshop
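As a rough sketch of that pattern (not the repo's actual code; the log format, field names and file path are simplified assumptions), parsing log lines into a case class and running filter, map and reduce might look like:

// Simplified Apache access log parsing: keep only IP, HTTP status and bytes.
case class LogLine(ip: String, status: Int, bytes: Long)

val pattern = """^(\S+) .* (\d{3}) (\d+)$""".r

val parsed = sc.textFile("access.log").flatMap {
  case pattern(ip, status, bytes) => Some(LogLine(ip, status.toInt, bytes.toLong))
  case _ => None
}

// Filter to server errors, map to byte counts, and fold them into a total.
val errorBytes = parsed.filter(_.status >= 500).map(_.bytes).fold(0L)(_ + _)
println(s"Bytes served on 5xx responses: $errorBytes")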
04-27-2016
06:18 AM
Cool, thanks. We have tried Kafka mirroring and that has had a lot of issues. I am thinking NiFi can solve a lot of these problems. I think it's a matter of budget: how many NiFi nodes, and how many extra nodes to help process the data migrating over. A few people were thinking dual ingest, but that is usually hard to keep in sync. With NiFi, that should not be a problem. I wonder if someone has a DR example in NiFi worked up already?
04-27-2016
05:15 AM
2 Kudos
I am investigating a good disaster recovery solution for banking with multiple petabytes of data. This would be data in HDFS (Parquet, Avro), Kafka, Hive and HBase. Not just the data, but keeping BI tools in sync and having Spark jobs still function. I have looked at WANdisco, but that's HBase and HDFS. Is there something to keep applications and BI items in sync?
Labels:
- Apache HBase
- Apache Hive
04-27-2016
03:50 AM
Can you post the contents of your configuration files and hosts file? I had a problem on Mac where the files were root read-only and the sandbox did not have permissions. Can you access other items on the Sandbox remotely? Sometimes anti-virus will block ports. Can you access it from Ambari?
04-26-2016
11:48 PM
Was this the Sandbox? Local? A server? Also, did you set the configuration files? http://hortonworks.com/hadoop-tutorial/how-to-install-and-configure-the-hortonworks-odbc-driver-on-mac-os-x/
04-26-2016
11:40 PM
Check your local Mac firewall settings and make sure nothing else is running on the same ports. Make sure nothing requires root access. Also, restarting the server will sometimes do the trick.
04-13-2016
01:51 PM
1 Kudo
I am going to try it on my PineA64, which is a little beefier.