<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: add columns to hive/parquet table in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/49292#M24853</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/11928"&gt;@DWinters&lt;/a&gt;&lt;/P&gt;&lt;P&gt;can you elaborate the way you have solved it, cause i just ran in to this issue.&lt;BR /&gt;I have 11 columns in data frame(added a timestamp column to 10 columned RDD).&lt;BR /&gt;And i have a hive table with complete 11 columns one is partitioned by timestamp.&lt;/P&gt;</description>
    <pubDate>Wed, 11 Jan 2017 10:59:07 GMT</pubDate>
    <dc:creator>jack0188</dc:creator>
    <dc:date>2017-01-11T10:59:07Z</dc:date>
    <item>
      <title>add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/39561#M24849</link>
      <description>&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I am trying to add columns to table that I created with the “saveAsTable” api.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I update the columns using sqlContext.sql(‘alter table myTable add columns (mycol string)’).&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;The next time I create a df and save it in the same table, with the new columns I get a :&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;“ParquetRelation&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;&amp;nbsp;requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.”&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Also thise two commands don t return the same columns :&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;1. sqlContext.table(‘myTable’).schema.fields &amp;nbsp; &amp;nbsp;&amp;lt;— wrong result&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;2. sqlContext.sql(’show columns in mytable’) &amp;nbsp;&amp;lt;—— good results&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;It seems to be a known bug :&amp;nbsp;&lt;A href="https://issues.apache.org/jira/browse/SPARK-9764" target="_blank"&gt;&lt;SPAN class="s2"&gt;https://issues.apache.org/jira/browse/SPARK-9764&lt;/SPAN&gt;&lt;/A&gt;&amp;nbsp;(see related bugs)&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;But I am wondering, how else can I update the columns or make sure that spark take the new columns?&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I already tried to refreshTable and to restart spark.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;thanks&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:13:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/39561#M24849</guid>
      <dc:creator>maurin</dc:creator>
      <dc:date>2022-09-16T10:13:04Z</dc:date>
    </item>
    <item>
      <title>Re: add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/43348#M24850</link>
      <description>&lt;P&gt;Hi Maurin,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Did you ever get a resolution to this issue? &amp;nbsp;I am experiencing the exact same issue. &amp;nbsp;I am attempting to programmatically add columns to an existing table when our streaming data contains new attributes.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;David&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 05:52:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/43348#M24850</guid>
      <dc:creator>DWinters</dc:creator>
      <dc:date>2016-07-29T05:52:37Z</dc:date>
    </item>
    <item>
      <title>Re: add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/43418#M24851</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I was able to do it with some code. Probably not the best solution but here it is in python :&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;PRE&gt;    def _merge_schema(self, df, db, table):
        df_schema_dict = df.schema.jsonValue()['fields']

        table_columns = self.sqlContext.table(table).schema.fields

        df_schema_list = sorted([x['name'].lower() for x in df_schema_dict])
        table_schema_list = sorted([x.name.lower() for x in table_columns])

        missing_in_table = set(df_schema_list) - set(table_schema_list)
        missing_in_df = set(table_schema_list) - set(df_schema_list)

        df_schema_dict_by_name = dict((d['name'], dict(d, index=index)) for (index, d) in enumerate(df_schema_dict))
        missing_col_array = []

        for missing_col in missing_in_table:
            my_type = df_schema_dict_by_name[missing_col]['type']
            missing_col_array.append("`" + missing_col + "` " + my_type)
        if len(missing_col_array) != 0:
            self.sqlContext.sql(
                "ALTER TABLE " + table + " ADD COLUMNS ( " + ' , '.join(missing_col_array) + ')')

        table_schema_dict_by_name = dict((d.name, d) for d in table_columns)
        for missing_col in missing_in_df:
            df = df.withColumn(missing_col, lit(None).cast(table_schema_dict_by_name[missing_col].dataType.typeName()))
        # re-order all the columns otherwise the insert bugs
        all_columns = self.sqlContext.table(table).schema.fields
        columns_names = [df[x.name.lower()] for x in all_columns]
        df = df.select(*columns_names)
        return df&lt;/PRE&gt;</description>
      <pubDate>Mon, 01 Aug 2016 18:30:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/43418#M24851</guid>
      <dc:creator>maurin</dc:creator>
      <dc:date>2016-08-01T18:30:36Z</dc:date>
    </item>
    <item>
      <title>Re: add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/43419#M24852</link>
      <description>Thanks for the reply! I was able to get it working on Friday by doing the same thing as you. I manually generate the ALTER TABLE statement by sorting the fields and determining which fields are missing in the table and the incoming data frame. Thanks again for confirming that you had to do the same thing!</description>
      <pubDate>Mon, 01 Aug 2016 19:13:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/43419#M24852</guid>
      <dc:creator>DWinters</dc:creator>
      <dc:date>2016-08-01T19:13:28Z</dc:date>
    </item>
    <item>
      <title>Re: add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/49292#M24853</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/11928"&gt;@DWinters&lt;/a&gt;&lt;/P&gt;&lt;P&gt;can you elaborate the way you have solved it, cause i just ran in to this issue.&lt;BR /&gt;I have 11 columns in data frame(added a timestamp column to 10 columned RDD).&lt;BR /&gt;And i have a hive table with complete 11 columns one is partitioned by timestamp.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2017 10:59:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/49292#M24853</guid>
      <dc:creator>jack0188</dc:creator>
      <dc:date>2017-01-11T10:59:07Z</dc:date>
    </item>
    <item>
      <title>Re: add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/52522#M24854</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/15823"&gt;@maurin&lt;/a&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am getting the below error&lt;/P&gt;&lt;P&gt;TypeError: 'StructField' object has no attribute '__getitem__'&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;at&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;df_schema_dict_by_name = dict((d['name'], dict(d, index=index)) for (index, d) in enumerate(df_schema_dict))&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;How do I rectify this&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2017 18:00:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/52522#M24854</guid>
      <dc:creator>sanjeev20</dc:creator>
      <dc:date>2017-03-22T18:00:01Z</dc:date>
    </item>
    <item>
      <title>Re: add columns to hive/parquet table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/52534#M24855</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/11928"&gt;@DWinters&lt;/a&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How were you able to overcome this issue? Could you please post a sample solution&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2017 21:44:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/add-columns-to-hive-parquet-table/m-p/52534#M24855</guid>
      <dc:creator>sanjeev20</dc:creator>
      <dc:date>2017-03-22T21:44:14Z</dc:date>
    </item>
  </channel>
</rss>

