Converting HTML to XML File with Nifi

Hello,

I am trying to convert an incoming HTML file (from an InvokeHTTP processor) to an XML file with Python, using the ExecuteScript processor. The content of the FlowFile is the HTML page. I used the following Python code:

import sys
import traceback
from java.nio.charset import StandardCharsets
from org.apache.commons.io import IOUtils
from org.apache.nifi.processor.io import StreamCallback
from org.python.core.util import StringUtil
from lxml import html, etree


class TransformCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        try:
            # Read input FlowFile content
            input_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

            # Write output content
            html_file = html.fromstring(input_text)
            xml_file = etree.tostring(html_file)
            outputStream.write(StringUtil.toBytes(xml_file))
        except:
            traceback.print_exc(file=sys.stdout)
            raise


flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, TransformCallback())

    # Finish by transferring the FlowFile to an output relationship
    session.transfer(flowFile, REL_SUCCESS)

The processor does not show any incoming FlowFile and throws the following exception:

2017-05-15 00:56:12,422 WARN [StandardProcessScheduler Thread-6] o.a.n.controller.StandardProcessorNode Timed out while waiting for OnScheduled of 'ExecuteScript' processor to finish. An attempt is made to cancel the task via Thread.interrupt(). However it does not guarantee that the task will be canceled since the code inside current OnScheduled operation may have been written to ignore interrupts which may result in a runaway thread. This could lead to more issues, eventually requiring NiFi to be restarted. This is usually a bug in the target Processor 'ExecuteScript[id=015c113e-74cc-13ad-e411-f861a69dbeec]' that needs to be documented, reported and eventually fixed.
2017-05-15 00:56:12,422 ERROR [StandardProcessScheduler Thread-6] o.a.nifi.processors.script.ExecuteScript ExecuteScript[id=015c113e-74cc-13ad-e411-f861a69dbeec] ExecuteScript[id=015c113e-74cc-13ad-e411-f861a69dbeec] failed to invoke @OnScheduled method due to java.lang.RuntimeException: Timed out while executing one of processor's OnScheduled task.; processor will not be scheduled to run for 30 seconds: java.lang.RuntimeException: Timed out while executing one of processor's OnScheduled task.
2017-05-15 00:56:12,424 ERROR [StandardProcessScheduler Thread-6] o.a.nifi.processors.script.ExecuteScript 
java.lang.RuntimeException: Timed out while executing one of processor's OnScheduled task.
	at org.apache.nifi.controller.StandardProcessorNode.invokeTaskAsCancelableFuture(StandardProcessorNode.java:1447) ~[na:na]
	at org.apache.nifi.controller.StandardProcessorNode.access$100(StandardProcessorNode.java:100) ~[na:na]
	at org.apache.nifi.controller.StandardProcessorNode$1.run(StandardProcessorNode.java:1275) ~[na:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.util.concurrent.TimeoutException: null
	at java.util.concurrent.FutureTask.get(FutureTask.java:205) [na:1.8.0_121]
	at org.apache.nifi.controller.StandardProcessorNode.invokeTaskAsCancelableFuture(StandardProcessorNode.java:1432) ~[na:na]
	... 9 common frames omitted
2017-05-15 00:56:12,424 ERROR [StandardProcessScheduler Thread-6] o.a.n.controller.StandardProcessorNode Failed to invoke @OnScheduled method due to java.lang.RuntimeException: Timed out while executing one of processor's OnScheduled task.
java.lang.RuntimeException: Timed out while executing one of processor's OnScheduled task.
	at org.apache.nifi.controller.StandardProcessorNode.invokeTaskAsCancelableFuture(StandardProcessorNode.java:1447) ~[na:na]
	at org.apache.nifi.controller.StandardProcessorNode.access$100(StandardProcessorNode.java:100) ~[na:na]
	at org.apache.nifi.controller.StandardProcessorNode$1.run(StandardProcessorNode.java:1275) ~[na:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.util.concurrent.TimeoutException: null
	at java.util.concurrent.FutureTask.get(FutureTask.java:205) [na:1.8.0_121]
	at org.apache.nifi.controller.StandardProcessorNode.invokeTaskAsCancelableFuture(StandardProcessorNode.java:1432) ~[na:na]
	... 9 common frames omitted

Do you have any idea what is causing this?

Thank you in advance!

Kind Regards,

Jan


Re: Converting HTML to XML File with Nifi

I am having trouble importing the "etree" module. I have tried with brew-installed Python 2.7 and Anaconda 2.7 (where I believe the etree submodule is part of "xml", not "lxml"). Do I need any additional configuration?

Looking in the lxml package, I see some native libraries (.so files, for example). If lxml relies on native code, Jython (the "python" script engine in ExecuteScript) will not be able to load or execute it. For Jython to run the script successfully, every imported module and all of its dependencies must be pure Python, with no native code such as CPython C extensions. Perhaps there is a different library you can use?

If you are not required to use Jython/Python, consider using JavaScript, Groovy, or Clojure instead. With those engines, the Module Directory property lets you use third-party Java libraries such as NekoHTML, JTidy, or JSoup to accomplish this conversion.
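
If the script has to stay in Python, one option in line with the pure-Python requirement above is to build the XML with standard-library modules only. The following is a minimal sketch, not a tested NiFi solution: the class name HtmlToXml, the function html_to_xml, and the "document" wrapper element are hypothetical, and I have not verified it under the Jython version bundled with NiFi. It also has known limits: on Python 2 / Jython, entity and character references such as &amp; are dropped unless extra handlers are added, and attribute names are not checked for XML validity.

```python
# Pure-Python sketch: HTML text -> well-formed XML, no native dependencies.
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 / Jython 2.7
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def _attrs(attrs):
    # HTMLParser reports valueless attributes (e.g. "disabled") as None
    return dict((k, v if v is not None else "") for k, v in attrs)


class HtmlToXml(HTMLParser):  # hypothetical helper, not part of NiFi
    def __init__(self):
        HTMLParser.__init__(self)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, _attrs(attrs))
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tag such as <br/>: add it but do not open it
        ET.SubElement(self.stack[-1], tag, _attrs(attrs))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # stray end tags with no match are ignored
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text following a child element belongs in that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html_text):
    """Return the HTML text serialized as UTF-8 encoded XML bytes."""
    parser = HtmlToXml()
    parser.feed(html_text)
    parser.close()
    children = list(parser.root)
    if len(children) == 1:
        # Single top-level element (typically <html>): return it directly;
        # text outside it is discarded
        result = children[0]
        result.tail = None
    else:
        result = parser.root
    return ET.tostring(result, encoding="utf-8")
```

In the script from the question, the idea would be to call html_to_xml(input_text) in place of the two lxml calls inside process() and write the returned bytes to outputStream. Unclosed tags are simply nested rather than repaired the way a browser would, so the output is well-formed XML but may not mirror the intended HTML structure.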
