<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Best approach to ingest CSV with changing schema? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166993#M57562</link>
    <description>&lt;P&gt;If your schema changes from hour to hour, you could try this:&lt;/P&gt;&lt;P&gt;1. Use Spark with the CSV reader from Databricks to process the data. The CSV reader can infer the schema automatically.&lt;/P&gt;&lt;P&gt;2. Write the DataFrame to HBase. HBase does not require a predefined schema, and each row can have a different number of columns. When you are ready to analyze the data in HBase, you can use Apache Phoenix to define a schema on top of the HBase table.&lt;/P&gt;&lt;P&gt;3. You could also check the number of columns in the DataFrame and route the data to a Hive table accordingly. For example, route data with 7 fields to table A and data with 10 fields to table B. A Hive table has a fixed set of columns, whereas an HBase table does not.&lt;/P&gt;</description>
    <pubDate>Wed, 22 Mar 2017 01:18:29 GMT</pubDate>
    <dc:creator>bmathew</dc:creator>
    <dc:date>2017-03-22T01:18:29Z</dc:date>
    <item>
      <title>Best approach to ingest CSV with changing schema?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166992#M57561</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I'm still new to Hadoop and I'm struggling to define the best approach for the following two similar challenges.&lt;/P&gt;&lt;P&gt;As a first step, I'm trying to ingest the files into Hive. InferAvroSchema in NiFi is limited: it does not always recognize the right data type, which generates a fair number of errors when the files are ingested.&lt;/P&gt;&lt;P&gt;Switching to specifying the schema manually brings the following problems:&lt;/P&gt;&lt;P&gt;- Ingesting CSV files that have schema updates over the year. I have versioning documentation describing the schema changes, but the dates in that document do not match the dates the changes actually took effect.&lt;/P&gt;&lt;P&gt;- Ingesting hourly CSV files whose schema depends on the business activity (one set of columns is mandatory; a large set is optional and only appears when the underlying options have been used). The schema differs from hour to hour, and I can't predict which one to expect.&lt;/P&gt;&lt;P&gt;My feeling is that I have to move to a NoSQL type of database or storage, but I'm not sure how best to tackle this.&lt;/P&gt;&lt;P&gt;Has anyone faced a similar problem?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Christophe&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 11:17:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166992#M57561</guid>
      <dc:creator>ChrisV</dc:creator>
      <dc:date>2022-09-16T11:17:56Z</dc:date>
    </item>
    <item>
      <title>Re: Best approach to ingest CSV with changing schema?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166993#M57562</link>
      <description>&lt;P&gt;If your schema changes from hour to hour, you could try this:&lt;/P&gt;&lt;P&gt;1. Use Spark with the CSV reader from Databricks to process the data. The CSV reader can infer the schema automatically.&lt;/P&gt;&lt;P&gt;2. Write the DataFrame to HBase. HBase does not require a predefined schema, and each row can have a different number of columns. When you are ready to analyze the data in HBase, you can use Apache Phoenix to define a schema on top of the HBase table.&lt;/P&gt;&lt;P&gt;3. You could also check the number of columns in the DataFrame and route the data to a Hive table accordingly. For example, route data with 7 fields to table A and data with 10 fields to table B. A Hive table has a fixed set of columns, whereas an HBase table does not.&lt;/P&gt;</description>
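      <content:encoded><![CDATA[
<P>The column-count routing in step 3 can be sketched roughly as follows. This is a minimal illustration, not the poster's implementation: it uses Python's standard csv module instead of Spark so that it runs anywhere, it assumes the first row of each file is a header, and the table names (table_a, table_b, unrouted) are hypothetical placeholders.</P>

```python
import csv
import io

# Hypothetical mapping from column count to target table (names are illustrative only).
ROUTES = {7: "table_a", 10: "table_b"}


def route_by_column_count(csv_text, routes=ROUTES, default="unrouted"):
    """Return the target table name for a CSV payload based on its header width."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)  # assumption: the first row is the header
    if header is None:
        return default  # empty payload, so there is nothing to count
    # Unknown widths fall back to a default target instead of failing.
    return routes.get(len(header), default)


seven = "a,b,c,d,e,f,g\n1,2,3,4,5,6,7\n"
ten = ",".join("c%d" % i for i in range(10)) + "\n"
print(route_by_column_count(seven))
print(route_by_column_count(ten))
```

<P>With a Spark DataFrame, the same decision could presumably be driven by len(df.columns). Sending unknown widths to a fallback target rather than raising an error is one way to keep unexpected schemas from blocking the hourly load.</P>
]]></content:encoded>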
      <pubDate>Wed, 22 Mar 2017 01:18:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166993#M57562</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2017-03-22T01:18:29Z</dc:date>
    </item>
    <item>
      <title>Re: Best approach to ingest CSV with changing schema?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166994#M57563</link>
      <description>&lt;P style="margin-left: 20px;"&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/3076/bmathew.html" nodeid="3076"&gt;@Binu Mathew&lt;/A&gt;&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Thanks for your answer. I'll dive into this approach &amp;amp; post further if/when required.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Thanks!&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Christophe&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 17:27:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-approach-to-ingest-CSV-with-changing-schema/m-p/166994#M57563</guid>
      <dc:creator>ChrisV</dc:creator>
      <dc:date>2017-03-23T17:27:53Z</dc:date>
    </item>
  </channel>
</rss>

