<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Reading old parquet tables in Hive/Spark returns NULL in some columns in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/83289#M61143</link>
    <description>&lt;P&gt;Actually, I got NULL values in some columns of a newly created table for which Impala returns correct results. If you have solved your problem, could you give some suggestions for this case? Thanks.&lt;/P&gt;</description>
    <pubDate>Wed, 05 Dec 2018 12:14:54 GMT</pubDate>
    <dc:creator>Striver</dc:creator>
    <dc:date>2018-12-05T12:14:54Z</dc:date>
    <item>
      <title>Reading old parquet tables in Hive/Spark returns NULL in some columns</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/54735#M61141</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;I have an old table whose data was created by Impala (2.x). The table is accessible by Impala and the data returned by Impala is valid and correct. However, when I try to read the same table (partition) with SparkSQL or Hive, 3 out of 30 columns return NULL values. The data are there: the columns are varchars like the others, the nullability is the same, and they contain non-null values.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When I created a new table via CTAS in Impala:&amp;nbsp;&lt;/P&gt;&lt;P&gt;create table tmp.sample stored as parquet as select * from orig_table where partcol = xxx;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;then the data are read correctly by Hive and SparkSQL, all 30 columns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The old table has these properties:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;| ROW FORMAT SERDE |&lt;BR /&gt;| 'parquet.hive.serde.ParquetHiveSerDe' |&lt;BR /&gt;| STORED AS INPUTFORMAT |&lt;BR /&gt;| 'parquet.hive.DeprecatedParquetInputFormat' |&lt;BR /&gt;| OUTPUTFORMAT |&lt;BR /&gt;| 'parquet.hive.DeprecatedParquetOutputFormat' |&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The new table has a different row format serde and input/output formats:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;| ROW FORMAT SERDE |&lt;BR /&gt;| 'parquet.hive.serde.ParquetHiveSerDe' |&lt;BR /&gt;| STORED AS INPUTFORMAT |&lt;BR /&gt;| 'parquet.hive.DeprecatedParquetInputFormat' |&lt;BR /&gt;| OUTPUTFORMAT |&lt;BR /&gt;| 'parquet.hive.DeprecatedParquetOutputFormat' |&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can I tweak Hive/Beeline/Spark to read the data correctly from the original table?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PS. Using CDH5.7.1&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 11:36:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/54735#M61141</guid>
      <dc:creator>Tomas79</dc:creator>
      <dc:date>2022-09-16T11:36:31Z</dc:date>
    </item>
    <item>
      <title>Re: Reading old parquet tables in Hive/Spark returns NULL in some columns</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/54769#M61142</link>
      <description>&lt;P&gt;I have figured out the problem and the solution.&lt;/P&gt;&lt;P&gt;The problem is that Hive reads parquet files in partitions by the column names in the table's current schema definition, while Impala (I assume) reads by position.&lt;/P&gt;&lt;P&gt;The table had some old partitions created under a different schema: the column names were different, but the total number of columns and their positions remained the same.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The old table:&amp;nbsp;&lt;/P&gt;&lt;P&gt;CREATE TABLE test ( event varchar(10), event_id int, event_time timestamp );&lt;/P&gt;&lt;P&gt;.. some partitions were inserted, then the column was renamed to event_name.&lt;/P&gt;&lt;P&gt;The actual table definition:&lt;/P&gt;&lt;P&gt;CREATE TABLE test ( event_name varchar(10), event_id int, event_time timestamp );&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, if I select all partitions in Impala, the query returns all the data correctly. So I assume that Impala ignores the column names in the parquet file and accesses the first column as event_name with type varchar(10).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But SparkSQL and Beeline return NULL for the partitions created with the old definition.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I downloaded the parquet files and inspected the schema with parquet-tools; the files still carry the old column names.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, to test whether a column was renamed in a table, I created a simple script. It is important to run Spark with the mergeSchema parameter set to true, so that ALL schema definitions under the table location are read.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;spark-shell --conf spark.sql.parquet.mergeSchema=true

import scala.util.matching.Regex

// Extract the table's storage location from its DDL, then read the
// Parquet files directly to see the column names actually stored on disk.
def test(tbl: String) = {
  val tb_md = sqlContext.sql("show create table " + tbl).collect()
  val ddl = tb_md.map(x =&amp;gt; x.get(0).toString).mkString(" ")
  val pattern = new Regex("""LOCATION\s*\'(.+)\'\s+TBLPROPERTIES""")
  var loc: String = ""
  try {
    loc = (pattern findAllIn ddl).matchData.next.group(1)
  } catch {
    case e: Exception =&amp;gt; println("Could not parse LOCATION from DDL: " + e.getMessage)
  }
  val d = sqlContext.read.parquet(loc)
  val columns_parq = d.columns
  println(columns_parq.toList)
  println("Table " + tbl + " has " + columns_parq.length + " columns.")
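  // Sketch (an addition, not part of the original script): compare the
  // on-disk Parquet column names against the table's current schema as
  // Spark sees it; any name mismatch marks a column that name-based
  // readers (Hive/SparkSQL) will return as NULL for old partitions.
  val columns_tbl = sqlContext.table(tbl).columns
  columns_tbl.zip(columns_parq).filter(p =&amp;gt; p._1 != p._2).foreach {
    p =&amp;gt; println("table column '" + p._1 + "' vs parquet column '" + p._2 + "'")
  }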
}&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 16 May 2017 06:53:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/54769#M61142</guid>
      <dc:creator>Tomas79</dc:creator>
      <dc:date>2017-05-16T06:53:36Z</dc:date>
    </item>
    <item>
      <title>Re: Reading old parquet tables in Hive/Spark returns NULL in some columns</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/83289#M61143</link>
      <description>&lt;P&gt;Actually, I got NULL values in some columns of a newly created table for which Impala returns correct results. If you have solved your problem, could you give some suggestions for this case? Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 05 Dec 2018 12:14:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/83289#M61143</guid>
      <dc:creator>Striver</dc:creator>
      <dc:date>2018-12-05T12:14:54Z</dc:date>
    </item>
  </channel>
</rss>

