Member since: 02-08-2016
Posts: 36
Kudos Received: 18
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 1392 | 12-14-2017 03:09 PM |
| | 2352 | 08-03-2016 02:49 PM |
| | 4293 | 07-26-2016 10:52 AM |
| | 3616 | 03-07-2016 12:47 PM |
12-14-2017
03:09 PM
Thank you @Matt Andruff for your reply. I resolved the issue: there was another .jar in the /lib directory containing the same code under a different file name. I'm not sure how it affected the execution of the job, but after removing it everything works fine, for now at least.
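In case someone hits a similar classpath conflict, here is a minimal stand-alone sketch (not part of the original job) for checking which jar a class is actually loaded from at runtime; x.x.x.x.UnzipFile below is just the placeholder class name from the post.

import java.net.URL;

public class WhichJar {
    public static void main(String[] args) throws Exception {
        // Load the class exactly as the job would and print the jar it came from.
        // If two jars on the classpath contain the same class, this shows which copy wins.
        Class<?> clazz = Class.forName("x.x.x.x.UnzipFile");
        URL location = clazz.getProtectionDomain().getCodeSource().getLocation();
        System.out.println(location);
    }
}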
12-13-2017
02:38 PM
Hi, I have a problem with running a jar using an Oozie shell action in a Kerberized cluster. My jar has the following code for authentication:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
try {
    UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
} catch (IOException e) {
    e.printStackTrace();
}

My workflow.xml is as follows:

<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>hadoop</exec>
<argument>jar</argument>
<argument>jarfile</argument>
<argument>x.x.x.x.UnzipFile</argument>
<argument>keytab</argument>
<argument>${kerberosPrincipal}</argument>
<argument>${nameNode}</argument>
<argument>${zipFilePath}</argument>
<argument>${unzippingDir}</argument>
<env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
<file>${workdir}/lib/[keytabFileName]#keytab</file>
<file>${workdir}/lib/[JarFileName]#jarfile</file>
</shell>

The jar file and the keytab are located in HDFS in the /lib directory of the directory where the .xml is located. The problem is that on various identical runs of the Oozie workflow I sometimes get this error:

java.io.IOException: Incomplete HDFS URI, no host: hdfs://[name_node_URI]:8020keytab
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2795)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
at x.x.x.x.CompressedFilesUtilities.unzip(CompressedFilesUtilities.java:54)
at x.x.x.x.UnzipFile.main(UnzipFile.java:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
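For reference, the URI in the error is the NameNode address with keytab appended and no "/" in between, which is why the parser sees no host. Below is a minimal sketch of path joining that avoids this; the class and variable names are illustrative and not the actual CompressedFilesUtilities code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathJoinSketch {
    public static void main(String[] args) throws Exception {
        String nameNode = args[0];   // e.g. hdfs://namenode-host:8020
        String relative = args[1];   // e.g. user/oozie/app/lib/keytab (illustrative path)

        // Plain concatenation (nameNode + relative) would produce
        // "hdfs://namenode-host:8020user/..." and fail with
        // "Incomplete HDFS URI, no host", because the port is no longer numeric.
        String joined = nameNode.endsWith("/") ? nameNode + relative
                                               : nameNode + "/" + relative;

        Path path = new Path(joined);
        FileSystem fs = path.getFileSystem(new Configuration());
        System.out.println(path + " exists: " + fs.exists(path));
    }
}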
Labels:
- Apache Oozie
08-03-2016
02:49 PM
Okay, I found a workaround: I added -Duser.timezone=GMT, which changes the JVM time zone. The final flume-ng command is as follows:

flume-ng agent --conf-file spool1.properties --name agent1 --conf $FLUME_HOME/conf -Duser.timezone=GMT

The directory needed by the Oozie coordinator is now being created.
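For context, the HDFS sink resolves the %Y/%m/%d/%H escapes with the agent JVM's default time zone, so a CEST agent buckets an event two hours ahead of the UTC-based coordinator instance. A small stand-alone illustration of the effect with plain SimpleDateFormat (not Flume's actual bucketing code):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimezoneBucketSketch {
    public static void main(String[] args) {
        // Event timestamp: 2016-08-03 08:00 UTC, i.e. 10:00 CEST on the agent host.
        long ts = 1470211200000L;
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/HH");

        fmt.setTimeZone(TimeZone.getTimeZone("Europe/Paris"));
        System.out.println("CEST agent writes:   " + fmt.format(new Date(ts)));  // 2016/08/03/10

        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));  // what -Duser.timezone=GMT makes the default
        System.out.println("coordinator expects: " + fmt.format(new Date(ts)));  // 2016/08/03/08
    }
}

Depending on the Flume version, the HDFS sink may also expose an hdfs.timeZone property, which scopes the same change to the sink instead of the whole JVM.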
08-03-2016
08:43 AM
Hi all, I've created an Oozie coordinator with a synchronous dataset. The time in the cluster is set to CEST (GMT+2). I'm using Flume to collect data and create a directory in HDFS in this format: /flume/%Y/%m/%d/%H

coordinator.properties:

nameNode=hdfs://vm1.local:8020
jobTracker=vm1.local:8050
queueName=default
exampleDir=${nameNode}/user/root/oozie-wait
oozie.use.system.libpath = true
start=2016-08-03T08:01Z
end=2016-08-03T12:06Z
workflowAppUri=${exampleDir}/app
oozie.coord.application.path=${exampleDir}/app

coordinator.xml:

<coordinator-app name="every-hour-waitForData" frequency="${coord:hours(1)}" start="${start}" end="${end}" timezone="UTC"
xmlns="uri:oozie:coordinator:0.1">
<datasets>
<dataset name="ratings" frequency="${coord:hours(1)}" initial-instance="${start}" timezone="Europe/Paris">
<uri-template>hdfs://vm1.local:8020/user/root/flume/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
</dataset>
</datasets>
<input-events>
<data-in name="coordInput1" dataset="ratings">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>wfInput</name>
<value>${coord:dataIn('coordInput1')}</value>
</property>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
When running this example, Flume creates the directory /user/root/flume/2016/08/03/10/ but the coordinator is waiting for /user/root/flume/2016/08/03/08. Does anyone know how to make Flume create the directory in UTC, or how to make the coordinator read the correct directory? Thanks.
Labels:
- Apache Flume
- Apache Oozie
07-27-2016
09:17 AM
Thank you @Michael M and @Alexander Bij for your valuable help.
07-26-2016
10:52 AM
Problem solved: I changed the channel type from file to memory:

agent1.channels.channel2.type = memory

Answers about how to make it work with a file channel are still welcome.
07-26-2016
09:28 AM
Hi, I'm using Flume to collect data from a spool directory. My configuration is as follows:

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel2
agent1.sources.source1.channels = channel2
agent1.sinks.sink1.channel = channel2
agent1.sources.source1.type = spooldir
agent1.sources.source1.basenameHeader = true
agent1.sources.source1.spoolDir = /root/flume_example/spooldir
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/root/flume
agent1.sinks.sink1.hdfs.filePrefix = %{basename}
agent1.sinks.sink1.hdfs.fileSuffix = .csv
agent1.sinks.sink1.hdfs.idleTimeout = 5
agent1.sinks.sink1.hdfs.rollSize = 0
agent1.sinks.sink1.hdfs.rollCount = 100000
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.channels.channel2.type = file

When placing a 43 MB file in the spool directory, Flume starts writing files into the HDFS directory /user/root/flume:

-rw-r--r-- 3 root hdfs 7.9 M 2016-07-26 11:10 /user/root/flume/filename.csv.1469524239209.csv
-rw-r--r-- 3 root hdfs 7.6 M 2016-07-26 11:11 /user/root/flume/filename.csv.1469524239210.csv

But then a java.lang.OutOfMemoryError: Java heap space is raised:

ERROR channel.ChannelProcessor: Error while writing to required channel: FileChannel channel2 { dataDirs: [/root/.flume/file-channel/data] }
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at org.apache.flume.channel.file.EventQueueBackingStoreFile.put(EventQueueBackingStoreFile.java:338)
at org.apache.flume.channel.file.FlumeEventQueue.set(FlumeEventQueue.java:287)
at org.apache.flume.channel.file.FlumeEventQueue.add(FlumeEventQueue.java:317)
at org.apache.flume.channel.file.FlumeEventQueue.addTail(FlumeEventQueue.java:211)
at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doCommit(FileChannel.java:553)
at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:192)
at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/26 11:10:59 ERROR source.SpoolDirectorySource: FATAL: Spool Directory source source1: { spoolDir: /root/flume_example/spooldir }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at org.apache.flume.channel.file.EventQueueBackingStoreFile.put(EventQueueBackingStoreFile.java:338)
at org.apache.flume.channel.file.FlumeEventQueue.set(FlumeEventQueue.java:287)
at org.apache.flume.channel.file.FlumeEventQueue.add(FlumeEventQueue.java:317)
at org.apache.flume.channel.file.FlumeEventQueue.addTail(FlumeEventQueue.java:211)
at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doCommit(FileChannel.java:553)
at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:192)
at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any idea how I can fix this issue? Thanks.
Labels:
- Apache Flume
07-20-2016
11:06 AM
Okay, I installed the NodeManager on the 3 remaining nodes and now all the nodes are active.
07-20-2016
10:41 AM
Hi, I have a cluster with 4 nodes (NameNode: 8 GB RAM, 3 DataNodes with 4 GB RAM). In the Resource Manager UI I'm seeing only one Active Node. Is this normal? Thanks.
Labels:
- Cloudera Manager