Member since: 07-08-2013
Posts: 26
Kudos Received: 8
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 14792 | 01-09-2015 07:09 AM |
| | 10063 | 12-29-2014 09:57 AM |
| | 3617 | 12-23-2014 05:55 PM |
01-15-2015 01:24 PM
I'm not aware of an option to get it added to the classpath first. In the past, when I've had to deploy a patched jar to a core component, I've replaced the jar file in the lib directory.
01-09-2015 07:28 AM
Keep in mind that with the MemoryChannel you lose any records in the channel if Flume crashes or the system reboots.
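If durability matters, the file channel is the usual alternative to the memory channel. Here's a minimal sketch of that swap, assuming an agent named a1 and illustrative directory paths (the paths are my assumptions, not from the original post):

# File channel: persists events to disk so they survive an agent crash or reboot.
a1.channels = ch-1
a1.channels.ch-1.type = file
# Illustrative paths; point these at directories the Flume user can write to.
a1.channels.ch-1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.ch-1.dataDirs = /var/lib/flume/data

The trade-off is lower throughput than the memory channel, but events already in the channel are not lost on restart.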
01-09-2015 07:09 AM
You probably need to adjust the maxFileSize and minimumRequiredSpace settings on the file channel [1]. FWIW, transferring large files with Flume is an anti-pattern: Flume is designed for event/log transport, not large file transport. You might want to check out a new Apache project called Apache NiFi [2] that is better suited to large file transfer. There's a quick how-to blog post available here to get you started: http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/

-Joey

[1] http://flume.apache.org/FlumeUserGuide.html#file-channel
[2] http://nifi.incubator.apache.org
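For reference, a hedged sketch of the two file channel settings mentioned above, on a hypothetical channel ch-1 of agent a1 (the values shown are just examples, not tuning advice from the original post):

# Hypothetical file channel on agent a1.
a1.channels = ch-1
a1.channels.ch-1.type = file
# Maximum size of a single channel data file on disk, in bytes (example value).
a1.channels.ch-1.maxFileSize = 2146435071
# Refuse new events when free disk space falls below this many bytes (example value).
a1.channels.ch-1.minimumRequiredSpace = 524288000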
12-29-2014 09:57 AM
1 Kudo
If you want each file to remain whole, you can use the BlobDeserializer [1] for the deserializer parameter of the SpoolingDirectorySource [2]:

a1.channels = c1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = c1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
a1.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

If you need to, set deserializer.maxBlobLength to the maximum file size you'll be picking up; the default is 100 million bytes. This won't work for very large files, as the entire file contents get buffered into RAM. The file channel is the best option for reliable data flow.

If you want the output file to have the same name as the input file, you can set the basenameHeader parameter to true. This will set a header in the Flume event called basename. You can customize the name of the header by setting basenameHeaderKey. Then, in your sink configuration, you can refer to the header value in the filePrefix with something like this:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/
a1.sinks.k1.hdfs.filePrefix = %{basename}-
a1.sinks.k1.hdfs.fileType = DataStream

HTH,
-Joey

[1] http://flume.apache.org/FlumeUserGuide.html#blobdeserializer
[2] http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
12-23-2014 05:55 PM
1 Kudo
When you tell Kite to delete the dataset, it uses the Hive API to drop the table, and Hive should take care of dropping the table's data. Can you check the log of your Hive Metastore server to see if there was an error on that side? To get past the error, you can remove the directory by hand.