1973 Posts | 1225 Kudos Received | 124 Solutions
04-29-2016
02:32 PM
This is awesome. Very nice combination of tools. Do you have the notebook and NiFi file on GitHub?
04-28-2016
03:16 AM
Hmmm, I will talk to my friends at RedisLabs and see if they want to collaborate on it.
04-27-2016
10:17 PM
1 Kudo
Has anyone done anything using Redis for aggregates and sums in the flow, or using Redis as a source for NiFi?
Labels:
- Apache NiFi
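Roughly what I have in mind, as a sketch only (assuming the Jedis client; key names and values here are made up): each event increments a running sum and a count in Redis, and anything downstream, NiFi included, could read those keys back as a source.

import redis.clients.jedis.Jedis

// Hypothetical sketch: keep running sums and counts per key in Redis.
val jedis = new Jedis("localhost", 6379)

def recordEvent(sensorId: String, value: Long): Unit = {
  jedis.incrBy(s"sum:$sensorId", value)   // running sum for this sensor
  jedis.incr(s"count:$sensorId")          // event count, useful for averages
}

recordEvent("sensor-42", 17L)
val sum = jedis.get("sum:sensor-42").toLong
val count = jedis.get("count:sensor-42").toLong
println(s"running average: ${sum.toDouble / count}")
jedis.close()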
04-27-2016
07:59 PM
A local master is not using the YARN version of Spark; it's running a local version. Is that one running? Is the green "connected" light on in the upper-right corner?
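To illustrate the difference, a minimal sketch assuming a Spark 1.6-era SparkConf (the app name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" runs an in-process Spark, not the cluster's YARN-managed Spark.
val conf = new SparkConf()
  .setAppName("NotebookCheck")   // placeholder name
  .setMaster("yarn-client")      // vs. "local[*]" for a local-only run
val sc = new SparkContext(conf)
println(sc.master)               // confirm which master is actually in use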
04-27-2016
07:24 PM
Is Spark running on the cluster? Is it on the default port? Can you access Spark? Can you get to the Spark History UI?
04-27-2016
03:47 PM
Great sample code. In most of my Spark apps, when working with Parquet I have a few configurations that help. Here are some SparkConf settings that help when working with Parquet files:

val sparkConf = new SparkConf()
sparkConf.set("spark.sql.parquet.compression.codec", "snappy")
sparkConf.set("spark.sql.parquet.mergeSchema", "true")
sparkConf.set("spark.sql.parquet.binaryAsString", "true")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "true")
sparkConf.set("spark.streaming.backpressure.enabled", "true")
Some of the compression settings are really important for different use cases. You also often need to turn them off or switch which codec you use depending on the use case (batch, streaming, SQL, large, small, many partitions, ...). EventLog is enabled so you can look at how those Parquet files are worked with in DAGs and metrics. Before you write some Spark SQL against that file, make sure you register a table name. If you don't want a write to fail when the directory/file already exists, you can choose Append mode to add to it; it depends on your use case.

df1.registerTempTable("MyTableName")
val results = sqlContext.sql("SELECT name FROM MyTableName")
df1.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).parquet("data.parquet")

If you want to look at the data from the command line after you write it, you can download parquet-tools. This requires the Java JDK, Git, and Maven installed:

git clone -b apache-parquet-1.8.0 https://github.com/apache/parquet-mr.git
cd parquet-mr
cd parquet-tools
mvn clean package -Plocal
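Putting the Spark pieces above together, a minimal end-to-end sketch (assuming Spark 1.6-era APIs; the input path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.serializer.KryoSerializer

val sparkConf = new SparkConf().setAppName("ParquetExample")
sparkConf.set("spark.sql.parquet.compression.codec", "snappy")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

// Read an existing Parquet file (path is hypothetical), query it via SQL,
// then append the result back out as Parquet.
val df1 = sqlContext.read.parquet("/tmp/people.parquet")
df1.registerTempTable("MyTableName")
val results = sqlContext.sql("SELECT name FROM MyTableName")
results.write.mode(SaveMode.Append).parquet("data.parquet")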
04-27-2016
12:07 PM
1 Kudo
To get the current Avro Tools:

wget http://apache.claz.org/avro/avro-1.8.0/java/avro-tools-1.8.0.jar

There's some good documentation here: https://avro.apache.org/docs/1.8.0/gettingstartedjava.html#Compiling+the+schema

This article helped me: http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
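And a rough sketch of reading and writing Avro from Scala with the generic API (the schema, file name, and values here are made up, loosely modeled on the getting-started guide):

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Hypothetical schema; in practice, parse your own .avsc file.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"favorite_number","type":["int","null"]}
    |]}""".stripMargin)

// Write one record to users.avro.
val user = new GenericData.Record(schema)
user.put("name", "Alyssa")
user.put("favorite_number", 256)

val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("users.avro"))
writer.append(user)
writer.close()

// Read it back.
val reader = new DataFileReader[GenericRecord](new File("users.avro"), new GenericDatumReader[GenericRecord]())
while (reader.hasNext) println(reader.next())
reader.close()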
04-27-2016
06:18 AM
Cool, thanks. We have tried Kafka mirroring and that has had a lot of issues. I am thinking NiFi can solve a lot of these problems. I think it's a matter of budget: how many NiFi nodes, plus extra nodes to help process the data migrating over. A few people were thinking of dual ingest, but that is usually hard to keep in sync. With NiFi, that should not be a problem. I wonder if someone has a DR example in NiFi worked up already?