Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5397 | 08-12-2016 01:02 PM |
| | 2200 | 08-08-2016 10:00 AM |
| | 2607 | 08-03-2016 04:44 PM |
| | 5496 | 08-03-2016 02:53 PM |
| | 1421 | 08-01-2016 02:38 PM |
01-28-2016
02:20 PM
Regarding combining: you would have to extend the existing class, modify the LineRecordReader, etc. This would only work for text files (not sequence or ORC files), because those need their own InputFormats. But the text file could contain any kind of content, since deserialization is handled by the SerDe (which we would not touch). Doing the same for sequence files should be possible; doing it for ORC files is most likely very difficult, because there the InputFormat and SerDe are much more interconnected.
01-28-2016
02:16 PM
In my opinion you would need to extend TextInputFormat, then change the LineRecordReader class so that it reads the xattributes when it opens the file split and prefixes each line with the attributes you want before the line is handed to the SerDe (you don't want to change the SerDe if you can avoid it). You could also filter out files completely at this point, and then do more advanced filtering in SQL. The existing LineRecordReader is here: http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/LineRecordReader.java#LineRecordReader A rough sketch of the idea is below.
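For illustration only, here is a minimal sketch of that idea against the old org.apache.hadoop.mapred API that Hive uses. The class names are made up, and to keep it short the "attribute" prefix is just the file path; a real version would look up the xattributes you care about (e.g. via FileSystem.getXAttrs) in initialize-time code.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical class names for illustration only.
public class AttributePrefixInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
                                                            Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new AttributePrefixRecordReader(job, (FileSplit) split);
    }

    public static class AttributePrefixRecordReader implements RecordReader<LongWritable, Text> {

        private final LineRecordReader delegate;
        private final String prefix;
        private final Text rawLine = new Text();

        public AttributePrefixRecordReader(JobConf job, FileSplit split) throws IOException {
            delegate = new LineRecordReader(job, split);
            // A real implementation would read the xattributes here, e.g. via
            // split.getPath().getFileSystem(job).getXAttrs(split.getPath()),
            // and could also decide to skip the file entirely at this point.
            prefix = split.getPath().toString() + "\t";
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (!delegate.next(key, rawLine)) {
                return false;
            }
            // Prefix each raw line before it reaches the SerDe.
            value.set(prefix + rawLine.toString());
            return true;
        }

        @Override
        public LongWritable createKey() {
            return delegate.createKey();
        }

        @Override
        public Text createValue() {
            return delegate.createValue();
        }

        @Override
        public long getPos() throws IOException {
            return delegate.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
```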
01-28-2016
11:05 AM
3 Kudos
I think your only choice would be to build a custom InputFormat that reads these properties and then adds them as columns to the data set. I don't know of any other ways to propagate metadata with the data set.
01-28-2016
10:47 AM
1 Kudo
I suppose you mean HDFS blocks, and the answer is: it depends. Normally MapReduce will start one map task for each block, but that is not always the case. MapReduce provides CombineFileInputFormat, which merges small files into single tasks; in Pig, for example, you can enable/disable this with pig.splitCombination. Tez, an alternative framework to MapReduce that is used under Hive and optionally Pig, also merges small files into map tasks; this is controlled by tez.grouping.min-size and tez.grouping.max-size. For reducers it is more complicated, since they work on map output and not on files. In MapReduce the number of reducers needs to be specified by the user (-Dmapred.reduce.tasks=x). However, frameworks like Hive estimate the optimal number of reducers and set it depending on data sizes. For example, Hive (on Tez) uses hive.exec.reducers.bytes.per.reducer to derive an optimal number of reducers; Pig does something similar, where the parameter is pig.exec.reducers.bytes.per.reducer. Hope that helps.
01-27-2016
03:28 PM
By the way, KeyedMessage can be confusing: if you call KeyedMessage("topic", "value"), your Kafka message does not have a key. The first string is the topic, the second the value of the message. That is good for most situations. If you need a message key (a DB primary key, for example) for log compaction etc., you can use KeyedMessage("topic", "key", "message"). But in your case I would assume you simply serialize the SOAP output into a String and push it into Kafka; then you need a consumer on the other side that can parse it again (for example in Spark or Storm). A rough sketch of the producer side is below.
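A minimal sketch of both constructor forms against the old 0.8-style producer API that KeyedMessage belongs to; the class name, broker address, topic, key and payload are made-up placeholders.

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class SoapToKafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:6667");               // placeholder broker
        props.put("serializer.class", "kafka.serializer.StringEncoder"); // send plain Strings

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));

        String soapPayload = "<soap:Envelope>...</soap:Envelope>";       // serialized SOAP output

        // Two arguments: topic + value, the message carries no key.
        producer.send(new KeyedMessage<String, String>("soap-topic", soapPayload));

        // Three arguments: topic + key + value, e.g. a DB primary key,
        // which is what log compaction works with.
        producer.send(new KeyedMessage<String, String>("soap-topic", "record-42", soapPayload));

        producer.close();
    }
}
```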
01-27-2016
03:25 PM
So you need to pull your data from the SOAP service, like you normally would, which will give you some data; let's assume a JSON String. There are different options to push it into Kafka: the example I used takes a String directly (there is also one taking a byte array). After that you can simply connect a consumer to Kafka and you get the messages out as you put them in; your consumer then needs to parse the String/byte array into whatever format you want. A minimal consumer sketch is below.
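For illustration, a minimal sketch of the consumer side using the matching old high-level consumer API; the class name, ZooKeeper address, group id and topic are placeholders, and a Spark or Storm job would do the equivalent through its own Kafka connector.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class SoapFromKafkaConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // placeholder ZooKeeper address
        props.put("group.id", "soap-consumers");      // placeholder consumer group

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        Map<String, Integer> topicCount = new HashMap<>();
        topicCount.put("soap-topic", 1); // one stream for the topic

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCount);
        ConsumerIterator<byte[], byte[]> it = streams.get("soap-topic").get(0).iterator();

        // Blocks and reads messages as they arrive; parse the String/bytes
        // into whatever format you need downstream.
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> record = it.next();
            String message = new String(record.message(), StandardCharsets.UTF_8);
            System.out.println(message);
        }

        connector.shutdown();
    }
}
```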
01-27-2016
09:55 AM
1 Kudo
OK, the exec tag executes a shell script in the local working directory of the Oozie action, for example /hadoop/yarn/.../oozietmp/myscript.sh. You have no idea beforehand which directory this is or on which server it is located; it is some YARN temp dir. The file tag is there to put something into this temp dir, and you can rename the file as well using the # syntax. So if your shell script is in HDFS at hdfs://tmp/myfolder/myNewScript.sh but you do not want to change the exec tag for some reason, you can do <file>/tmp/myfolder/myNewScript.sh#myscript.sh</file> and Oozie will take the file from HDFS, put it into the temp folder before execution, and rename it. You can use the file tag to upload any kind of file (like jars or other dependencies). As far as I can see, ${EXEC} is just a variable they set somewhere with no specific meaning. Last but not least, if you want to avoid the file tag, you can also simply put these files into a lib folder inside the workflow folder; Oozie uploads all of those files by default. A sketch of such a shell action is below.
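Put together, a shell action using the file tag could look roughly like the following; the action name, transitions and the ${jobTracker}/${nameNode} variables are generic placeholders, not taken from your workflow.

```xml
<action name="run-script">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>myscript.sh</exec>
        <!-- Copies the script from HDFS into the action's local working
             directory and renames it to myscript.sh before execution. -->
        <file>/tmp/myfolder/myNewScript.sh#myscript.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```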
01-27-2016
09:48 AM
4 Kudos
Kafka producers are Java programs. Here is a very simple example of a Kafka producer that uses Syslog4j, which I did a while back. In your case you need a program that can pull the data from the web service and then push it into the producer. You need to serialize your payload into a byte array for Kafka, but apart from that the main work will be connecting to the WSDL. https://github.com/benleon/SysLogProducer
01-26-2016
01:03 PM
Just be wary of potential load issues. We reached the connection limit of our consolidated PostgreSQL database because all services were pointing to the same DB, which essentially stopped Oozie and Hive at random. The biggest culprit seems to have been Ranger: if auditing to DB is switched on, it puts quite a load on the database.
01-26-2016
11:35 AM
1 Kudo
I tried to set a Falcon retention period on a feed, expecting it to delete old folders after the specified time period (in this case 7 days). However, this does not happen. Does anybody know how to debug this? Where would Falcon write log information about it: in the Oozie Falcon action surrounding the workflow, or in the Falcon server logs? Or is there anything else I need to do? The process runs every 15 minutes, so Falcon should not have to schedule separate cleanup tasks.
<clusters>
<cluster name="xxx" type="source">
<validity start="2016-01-14T12:45Z" end="2033-01-13T20:00Z"/>
<retention limit="days(7)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/xxx/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
</locations>
Labels:
- Apache Falcon