<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Put data from Parquet files into DynamoDB with NiFi (Support Questions)</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379313#M243831</link>
    <description>&lt;DIV class="votecell post-layout--left"&gt;&lt;DIV class="js-voting-container d-flex jc-center fd-column ai-stretch gs4 fc-black-300"&gt;Hello,&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class="postcell post-layout--right"&gt;&lt;DIV class="s-prose js-post-body"&gt;&lt;P&gt;I want to integrate data into DynamoDB from Parquet files using NiFi (which I run in a Docker container). I fetch my files from AWS S3 using the ListS3 and FetchS3Object processors and then, as I understand it, convert the files to JSON using ConvertRecord and send the data using PutDynamoDB.&lt;/P&gt;&lt;P&gt;I've tried configuring the AvroSchemaRegistry, ParquetReader and JsonRecordSetWriter controllers, but I'm obviously doing it wrong... I've tried using an UpdateAttribute processor too but nothing works. I don't really understand if I have to add the schema and where to add it. Thanks to anyone who can help me!&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 21 Nov 2023 15:10:51 GMT</pubDate>
    <dc:creator>yan439</dc:creator>
    <dc:date>2023-11-21T15:10:51Z</dc:date>
    <item>
      <title>Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379313#M243831</link>
      <description>&lt;DIV class="votecell post-layout--left"&gt;&lt;DIV class="js-voting-container d-flex jc-center fd-column ai-stretch gs4 fc-black-300"&gt;Hello,&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class="postcell post-layout--right"&gt;&lt;DIV class="s-prose js-post-body"&gt;&lt;P&gt;I want to integrate data into DynamoDB from Parquet files using NiFi (which I run in a Docker container). I fetch my files from AWS S3 using the ListS3 and FetchS3Object processors and then, as I understand it, convert the files to JSON using ConvertRecord and send the data using PutDynamoDB.&lt;/P&gt;&lt;P&gt;I've tried configuring the AvroSchemaRegistry, ParquetReader and JsonRecordSetWriter controllers, but I'm obviously doing it wrong... I've tried using an UpdateAttribute processor too but nothing works. I don't really understand if I have to add the schema and where to add it. Thanks to anyone who can help me!&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 21 Nov 2023 15:10:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379313#M243831</guid>
      <dc:creator>yan439</dc:creator>
      <dc:date>2023-11-21T15:10:51Z</dc:date>
    </item>
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379320#M243832</link>
      <description>&lt;P&gt;Can you provide more details about the issue: what error message you are getting, if any, and which processor is causing the problem, along with the input you provide and the output you expect?&lt;/P&gt;&lt;P&gt;I have never used PutDynamoDB, but here are some links that can help:&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.youtube.com/watch?si=ctBH-f-JOzAPgKAJ&amp;amp;embeds_referring_euri=https%3A%2F%2Fwww.google.com%2F&amp;amp;source_ve_path=MzY4NDIsMTM5MTE3LDEzOTExNywyODY2NCwxNjQ1MDY&amp;amp;feature=emb_share&amp;amp;v=Aw6PCz8gbmA" target="_blank"&gt;https://www.youtube.com/watch?si=ctBH-f-JOzAPgKAJ&amp;amp;embeds_referring_euri=https%3A%2F%2Fwww.google.com%2F&amp;amp;source_ve_path=MzY4NDIsMTM5MTE3LDEzOTExNywyODY2NCwxNjQ1MDY&amp;amp;feature=emb_share&amp;amp;v=Aw6PCz8gbmA&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/45840156/how-putdynamodb-works-in-nifi" target="_blank"&gt;https://stackoverflow.com/questions/45840156/how-putdynamodb-works-in-nifi&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Nov 2023 16:09:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379320#M243832</guid>
      <dc:creator>SAMSAL</dc:creator>
      <dc:date>2023-11-21T16:09:45Z</dc:date>
    </item>
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379585#M243887</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/107992"&gt;@yan439&lt;/a&gt;&amp;nbsp;Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. If you are still experiencing the issue, can you provide the information @samsal has requested? Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Nov 2023 17:29:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379585#M243887</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2023-11-24T17:29:50Z</dc:date>
    </item>
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379622#M243894</link>
      <description>&lt;P&gt;Thanks for the video! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; It solved one of my problems: since I have a list of items to insert, I need to use PutDynamoDBRecord rather than PutDynamoDB. With that change I can insert data after converting one of my Parquet files.&lt;BR /&gt;But I still have a problem with another file. Here's the error:&lt;/P&gt;&lt;P&gt;UTC ERROR&lt;BR /&gt;ConvertRecord[id=92018f18-018b-1000-fd6f-0a3466abe069] Failed to process&lt;BR /&gt;FlowFile[filename=mini_de_train.parquet]; will route to failure:&lt;BR /&gt;org.apache.avro.SchemaParseException: Illegal character in: A1BG-AS1&lt;/P&gt;&lt;P&gt;Some characters are not accepted (like "-" in "A1BG-AS1"), so I've changed them all in the schema, the beginning of which is shown below (there are more than 18,000 columns):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AvroSchemaRegistry.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/39005i45CDAE5528121F7E/image-size/large?v=v2&amp;amp;px=999" role="button" title="AvroSchemaRegistry.png" alt="AvroSchemaRegistry.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;So I tried to add the schema via an UpdateAttribute processor before the ConvertRecord, where I put the name of the schema (de_train), and an AvroSchemaRegistry used by my JsonRecordSetWriter, which calls this schema:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="UpdateAttribute.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/39006iA2EFD5B5402DC8BB/image-size/large?v=v2&amp;amp;px=999" role="button" title="UpdateAttribute.png" alt="UpdateAttribute.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ConvertRecord.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/39007i026B03A2F083D520/image-size/large?v=v2&amp;amp;px=999" role="button" title="ConvertRecord.png" alt="ConvertRecord.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="JsonRecordSetWriter.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/39008i2BB21B965E01C29A/image-size/large?v=v2&amp;amp;px=999" role="button" title="JsonRecordSetWriter.png" alt="JsonRecordSetWriter.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;But after these modifications I still get the same error:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Erreur.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/39009i2550927E3D25C611/image-size/large?v=v2&amp;amp;px=999" role="button" title="Erreur.png" alt="Erreur.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;What am I doing wrong?&lt;/P&gt;</description>
      <pubDate>Sat, 25 Nov 2023 17:02:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379622#M243894</guid>
      <dc:creator>yan439</dc:creator>
      <dc:date>2023-11-25T17:02:01Z</dc:date>
    </item>
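The SchemaParseException quoted in the post above stems from Avro's identifier rule: field names must match `[A-Za-z_][A-Za-z0-9_]*`, so a hyphen as in "A1BG-AS1" is rejected outright. As a minimal sketch (the function name is my own, not from any library), a sanitizer that maps any column name onto a legal Avro name could look like this:

```python
import re

def sanitize_avro_name(name: str) -> str:
    """Map an arbitrary column name onto a legal Avro field name.

    Avro names must match [A-Za-z_][A-Za-z0-9_]*, so hyphens such as the
    one in "A1BG-AS1" make the schema parser fail.
    """
    # replace every character outside the allowed set with an underscore
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # names may not start with a digit, so prefix those with an underscore
    if re.match(r"^[0-9]", name):
        name = "_" + name
    return name

print(sanitize_avro_name("A1BG-AS1"))  # -> A1BG_AS1
```

With 18,000+ columns, a rule like this applied programmatically is far less error-prone than editing the schema text by hand.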
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379623#M243895</link>
      <description>&lt;P&gt;Sorry I haven't had much time to visit the site in the last few days. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Nov 2023 17:05:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379623#M243895</guid>
      <dc:creator>yan439</dc:creator>
      <dc:date>2023-11-25T17:05:27Z</dc:date>
    </item>
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379728#M243906</link>
      <description>&lt;P&gt;The issue happens when the ParquetReader tries to read the Parquet file: it fails on column names containing the illegal character "-". I don't know of a way to address this within NiFi itself; you probably have to fix it before the file is consumed by NiFi. For example, you can use a pandas DataFrame in Python to remove the illegal characters from the column names:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import pandas as pd

df = pd.read_parquet('source.parquet', engine='fastparquet')
# replace hyphens with underscores in column names
df.columns = df.columns.str.replace("-", "_")
df.to_parquet("target.parquet", engine='fastparquet')&lt;/LI-CODE&gt;&lt;P&gt;It's possible to do this through NiFi as well, using ExecuteStreamCommand:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/t5/Support-Questions/Can-anyone-provide-an-example-of-a-python-script-executed/td-p/192487" target="_blank"&gt;https://community.cloudera.com/t5/Support-Questions/Can-anyone-provide-an-example-of-a-python-script-executed/td-p/192487&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The steps would be:&lt;/P&gt;&lt;P&gt;1- Fetch the Parquet file from S3&lt;/P&gt;&lt;P&gt;2- Save it to a staging area with a known filename using PutFile&lt;/P&gt;&lt;P&gt;3- Run ExecuteStreamCommand and pass the filename and path to the Python script, which renames the columns as shown above and saves the final copy to a target folder&lt;/P&gt;&lt;P&gt;4- Use FetchFile to get the final Parquet file from the target folder using the same filename&lt;/P&gt;&lt;P&gt;5- ConvertRecord&lt;/P&gt;&lt;P&gt;....&lt;/P&gt;&lt;P&gt;If that helps, please &lt;STRONG&gt;accept&lt;/STRONG&gt; the solution.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 27 Nov 2023 23:08:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379728#M243906</guid>
      <dc:creator>SAMSAL</dc:creator>
      <dc:date>2023-11-27T23:08:39Z</dc:date>
    </item>
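The ExecuteStreamCommand step in the reply above can be sketched as a small standalone script. This is only an illustration under the same assumptions as the snippet in the post (pandas with the fastparquet engine); the function names are mine, and the two file paths are hypothetical arguments that the processor would pass on the command line:

```python
import sys

def clean_columns(columns):
    """Replace hyphens with underscores so every name is a legal Avro identifier."""
    return [c.replace("-", "_") for c in columns]

def rewrite_parquet(src_path, dst_path):
    # pandas is imported here, not at module level, so the pure helper above
    # stays usable even where pandas/fastparquet are not installed
    import pandas as pd
    df = pd.read_parquet(src_path, engine="fastparquet")
    df.columns = clean_columns(df.columns)
    df.to_parquet(dst_path, engine="fastparquet")

if __name__ == "__main__" and len(sys.argv) >= 3:
    # ExecuteStreamCommand would supply these two arguments:
    # the staged source file and the target file to write
    rewrite_parquet(sys.argv[1], sys.argv[2])
```

Keeping the renaming logic in its own function makes it easy to verify against a sample of the 18,000+ column names before running it over the full file.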
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379964#M243971</link>
      <description>&lt;P&gt;Thanks for your answer, SAMSAL.&lt;/P&gt;&lt;P&gt;I was hoping to be able to use a processor directly to add my schema, but if that's not possible, I'll use a script.&lt;BR /&gt;As well as changing the names of several columns, I also need to change the type of some of them, as some are of type "large_string" and one is of type "bool". I had this error, for example, when I tried to add the schema (retrieved with Python code from my Parquet file) to the ConvertRecord processor:&lt;/P&gt;&lt;P&gt;'schema-text' validated against '{&lt;BR /&gt;"type": "record",&lt;BR /&gt;"name": "de_train",&lt;BR /&gt;"fields": [&lt;BR /&gt;{&lt;BR /&gt;"name": "cell_type",&lt;BR /&gt;"type": "string"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "sm_name",&lt;BR /&gt;"type": "string"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "sm_lincs_id",&lt;BR /&gt;"type": "string"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "SMILES",&lt;BR /&gt;"type": "string"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "control",&lt;BR /&gt;"type": "bool"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "A1BG",&lt;BR /&gt;"type": "double"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "A1BG_AS1",&lt;BR /&gt;"type": "double"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "A2M",&lt;BR /&gt;"type": "double"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "A2M_AS1",&lt;BR /&gt;"type": "double"&lt;BR /&gt;},&lt;BR /&gt;{&lt;BR /&gt;"name": "A2MP1",&lt;BR /&gt;"type": "double"&lt;BR /&gt;}&lt;BR /&gt;]&lt;BR /&gt;}' is invalid because Not a valid Avro Schema: "bool" is not a defined name. The type of the "control" field must be a defined name or a {"type": ...} expression.&lt;/P&gt;&lt;P&gt;I had to change "large_string" to "string" and "bool" to "boolean" to clear the errors in the AvroSchemaRegistry.&lt;BR /&gt;So how do I change the types in a Parquet file? Is it possible to do this from the dataframe, as with the names?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 18:29:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379964#M243971</guid>
      <dc:creator>yan439</dc:creator>
      <dc:date>2023-11-30T18:29:18Z</dc:date>
    </item>
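On the question of changing types from the dataframe: yes, pandas can retype columns in the same rewrite pass that renames them, via `DataFrame.astype`. A minimal sketch, assuming the same pandas round-trip as in the earlier snippet; the frame below is a made-up miniature of de_train, purely for illustration:

```python
import pandas as pd

# Made-up miniature of de_train, just to illustrate retyping.
df = pd.DataFrame({
    "cell_type": ["NK cells"],
    "control": [0],        # stored as an integer here, but should be boolean
    "A1BG": [0.104],
})

# astype retypes columns in one pass, the same way str.replace fixed the
# names; writing the frame back to Parquet then persists the new types.
df = df.astype({"control": "bool"})
```

Note also that pandas loads Arrow `large_string` columns as ordinary string/object data, so simply reading and rewriting the file through pandas will typically produce plain `string` columns as well, though the exact output types depend on the Parquet engine used.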
    <item>
      <title>Re: Put data from Parquet files into DynamoDB with NiFi</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379975#M243973</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/107992"&gt;@yan439&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I'm not sure I understand. I thought you already had the schema defined in the registry with the correct column names and data types. Can you elaborate on how the Avro schema came about, and whether it's the same one you are using in the registry?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 23:21:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Put-data-from-Parquet-files-into-DynamoDB-with-NiFi/m-p/379975#M243973</guid>
      <dc:creator>SAMSAL</dc:creator>
      <dc:date>2023-11-30T23:21:17Z</dc:date>
    </item>
  </channel>
</rss>

