<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Python script to process files on HDFS in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269063#M206604</link>
    <description>&lt;P&gt;I'm fairly new to NiFi and am trying to execute a Python script stored on the local file system using NiFi. There are a couple of XLSB files stored in HDFS. I want to build a NiFi flow that reads the files from HDFS and passes each filename to the Python script so that it can convert them to CSV and store them back in HDFS.&lt;/P&gt;&lt;P&gt;What flow should I use to get this working? I tried ListHDFS -&amp;gt; ExecuteStreamCommand but don't know if that's correct. Also, how can I inspect the output of ListHDFS?&lt;/P&gt;</description>
    <pubDate>Tue, 27 Aug 2019 13:59:30 GMT</pubDate>
    <dc:creator>Teej</dc:creator>
    <dc:date>2019-08-27T13:59:30Z</dc:date>
    <item>
      <title>Python script to process files on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269063#M206604</link>
      <description>&lt;P&gt;I'm fairly new to NiFi and am trying to execute a Python script stored on the local file system using NiFi. There are a couple of XLSB files stored in HDFS. I want to build a NiFi flow that reads the files from HDFS and passes each filename to the Python script so that it can convert them to CSV and store them back in HDFS.&lt;/P&gt;&lt;P&gt;What flow should I use to get this working? I tried ListHDFS -&amp;gt; ExecuteStreamCommand but don't know if that's correct. Also, how can I inspect the output of ListHDFS?&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2019 13:59:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269063#M206604</guid>
      <dc:creator>Teej</dc:creator>
      <dc:date>2019-08-27T13:59:30Z</dc:date>
    </item>
    <item>
      <title>Re: Python script to process files on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269202#M206702</link>
      <description>&lt;P&gt;The normal way to process Excel files on HDFS would be with just these NiFi processors; you would not need Python:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ListHDFS -&amp;gt; FetchHDFS -&amp;gt; ConvertExcelToCSVProcessor -&amp;gt; PutHDFS&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would recommend trying this first. The documentation does not explicitly mention whether it works with XLSB, so you may actually need the Python script for the conversion. In that case, the ExecuteStreamCommand processor would indeed be a logical choice.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;Regarding the output of the first processor: in development, I find the most convenient way to see the output is to stop the downstream processor, then right-click the queue to list it and inspect the queued FlowFiles.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If stopping the downstream processor is not possible, you could also investigate via the data provenance view.&lt;/P&gt;</description>
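      <!--
Editorial sketch, not part of the original reply. ExecuteStreamCommand pipes the
incoming FlowFile content to the command's stdin and takes whatever the command
writes to stdout as the outgoing content, so a Python script in that position only
has to read stdin and write stdout. The function name clean_rows and the
whitespace-trimming step are assumptions chosen for illustration; the thread does
not say which transformations are needed.

```python
import csv
import sys


def clean_rows(in_stream, out_stream):
    """Copy CSV rows from in_stream to out_stream, trimming whitespace in each cell."""
    reader = csv.reader(in_stream)
    writer = csv.writer(out_stream, lineterminator="\n")
    for row in reader:
        # Placeholder transformation: replace with the real per-row logic.
        writer.writerow([cell.strip() for cell in row])


def main():
    # Under ExecuteStreamCommand, stdin carries the FlowFile content and
    # stdout becomes the content of the outgoing FlowFile.
    clean_rows(sys.stdin, sys.stdout)
```

To use it, add a main() call at the end of the file, then set the processor's
Command Path to the Python interpreter and put the script path in Command Arguments.
      -->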
      <pubDate>Wed, 28 Aug 2019 17:09:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269202#M206702</guid>
      <dc:creator>DennisJaheruddi</dc:creator>
      <dc:date>2019-08-28T17:09:09Z</dc:date>
    </item>
    <item>
      <title>Re: Python script to process files on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269268#M206745</link>
      <description>&lt;P&gt;Thanks, I will explore the ConvertExcelToCSVProcessor. Once the files are converted to CSV, I have to apply a couple of transformations, for which I am using a Python script. If I place the CSV in HDFS, how do I use the Python script to process the data from HDFS? Are you suggesting using ExecuteStreamCommand to get the session content and process it, or is there a better way to do it?&lt;/P&gt;</description>
      <pubDate>Thu, 29 Aug 2019 07:08:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269268#M206745</guid>
      <dc:creator>Teej</dc:creator>
      <dc:date>2019-08-29T07:08:47Z</dc:date>
    </item>
    <item>
      <title>Re: Python script to process files on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269274#M206748</link>
      <description>If you are putting the data in HDFS first, I assume the subsequent Python script is more batch than streaming.&lt;BR /&gt;&lt;BR /&gt;In that case, consider running it via a scheduler like Oozie.&lt;BR /&gt;&lt;BR /&gt;Also, if you run into scalability issues with your script, consider using something like PySpark instead.</description>
      <pubDate>Thu, 29 Aug 2019 08:02:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269274#M206748</guid>
      <dc:creator>DennisJaheruddi</dc:creator>
      <dc:date>2019-08-29T08:02:26Z</dc:date>
    </item>
    <item>
      <title>Re: Python script to process files on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269301#M206759</link>
      <description>&lt;P&gt;Alright, got it. Is there a way to access files on HDFS in Python without using PySpark?&lt;/P&gt;</description>
      <pubDate>Thu, 29 Aug 2019 08:56:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269301#M206759</guid>
      <dc:creator>Teej</dc:creator>
      <dc:date>2019-08-29T08:56:54Z</dc:date>
    </item>
    <item>
      <title>Re: Python script to process files on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269322#M206765</link>
      <description>A quick search suggests that libhdfs can do it, but I have not tried it myself.</description>
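      <!--
Editorial sketch, not part of the original reply. libhdfs itself is a C library, so
Python code normally reaches it through a binding. A dependency-free alternative,
shown here, is to shell out to the Hadoop client with hdfs dfs -cat. This assumes
the hdfs command is installed and on the PATH of the machine running the script;
the function names and the path /data/in.csv are hypothetical.

```python
import subprocess


def hdfs_cat_command(path):
    """Build the argument list for `hdfs dfs -cat <path>`."""
    return ["hdfs", "dfs", "-cat", path]


def read_hdfs_text(path, run=subprocess.run):
    """Return the content of the HDFS file at `path` as text.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without a cluster.
    """
    # check=True raises CalledProcessError if the hdfs command fails.
    result = run(hdfs_cat_command(path), capture_output=True, text=True, check=True)
    return result.stdout
```

A call such as read_hdfs_text("/data/in.csv") returns the whole file as one string,
which suits small files; for large files a streaming approach would be preferable.
      -->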
      <pubDate>Thu, 29 Aug 2019 10:11:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Python-script-to-process-files-on-HDFS/m-p/269322#M206765</guid>
      <dc:creator>DennisJaheruddi</dc:creator>
      <dc:date>2019-08-29T10:11:26Z</dc:date>
    </item>
  </channel>
</rss>

