An Impala table on top of SequenceFiles - key column missing

Expert Contributor

I created SequenceFiles using the PySpark code below.

path='/data/seq_test2'
rdd = sc.parallelize([(1, "a1"), (2, "a2"), (3, "a3")])
rdd.saveAsSequenceFile(path)

Then I created an Impala table.

CREATE EXTERNAL TABLE seq_test2
(key_column STRING,
value_column STRING )
STORED AS SEQUENCEFILE
LOCATION '/data/seq_test2'

The query "select * from seq_test2" then shows a1, a2, a3 in key_column and NULL in value_column. I expected to see 1, 2, 3 in key_column and a1, a2, a3 in value_column.

How do I fix it?

Thank you.

 

2 REPLIES 2

Expert Contributor

Hello Seaport,

It sounds like the SequenceFile data created by PySpark is not being read correctly by the Impala table.

  • The data written from PySpark should be (1, "a1"), (2, "a2"), (3, "a3").
  • But selecting from the table in Impala shows ("a1", NULL), ("a2", NULL), ("a3", NULL).
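
A small pure-Python model may help explain that result. It rests on an assumption that is not confirmed anywhere in this thread: that a text-format SequenceFile table ignores each record's key and builds the row only from the record's value, split on the table's field delimiter (Ctrl-A, `\x01`, by default). The function name `read_row` is hypothetical and is not an Impala or Hive API.

```python
FIELD_DELIMITER = "\x01"  # assumed default field delimiter (Ctrl-A)

def read_row(record, num_columns=2, delimiter=FIELD_DELIMITER):
    """Model a reader that drops the key and splits only the value into columns."""
    _key, value = record  # the key is ignored under this assumption
    fields = value.split(delimiter)
    # Pad missing trailing columns with None, which a query would display as NULL.
    fields += [None] * (num_columns - len(fields))
    return tuple(fields[:num_columns])

# The same (key, value) pairs that the PySpark code wrote.
records = [(1, "a1"), (2, "a2"), (3, "a3")]
rows = [read_row(r) for r in records]
print(rows)
```

If the assumption holds, the value "a1" fills the first column and nothing is left for the second, which would match what the query shows.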

 

Can you test a few things to see where the discrepancy is coming from?

  • Can you create the Impala table in a different directory and then run LOAD DATA INPATH? https://impala.apache.org/docs/build/html/topics/impala_load_data.html. This is recommended over creating a new table on a path that already holds files created outside of Impala. It moves the data files from your original path into the new Impala table.
  • If the behavior is the same, can you manually insert into your Impala table, for example: insert into table seq_test2 values (1, "a1"), (2, "a2"), (3, "a3"); Then compare the file created by this insert command with the file created by PySpark. Is there any noticeable difference?
  • If it makes a difference, you could also try writing the data in PySpark using string instead of int keys, for example: rdd = sc.parallelize([("1", "a1"), ("2", "a2"), ("3", "a3")])
  • Also, run the same steps from Hive to see whether the behavior is the same.
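
As a variation on the string-based suggestion above, the sketch below packs both columns into the value of each record. It assumes, without confirmation from this thread, that the table reads only the value of each SequenceFile record and splits it on the default field delimiter Ctrl-A (`\x01`). The helper `to_row` and the path `/data/seq_test3` are hypothetical names chosen for illustration.

```python
FIELD_DELIMITER = "\x01"  # assumed default field delimiter (Ctrl-A)

def to_row(key, value, delimiter=FIELD_DELIMITER):
    """Return a (key, value) pair whose value carries the whole delimited row."""
    return (str(key), delimiter.join([str(key), str(value)]))

pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
rows = [to_row(k, v) for k, v in pairs]

# In the PySpark shell, where `sc` already exists, the write step would be:
# sc.parallelize(rows).saveAsSequenceFile('/data/seq_test3')  # hypothetical path
```

A table defined on the new location with the same two STRING columns could then be queried to check whether both columns are populated.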

 

Community Manager

@Seaport Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. If you are still experiencing the issue, can you provide the information @ezerihun has requested? Thanks.


Regards,

Diana Torres,
Community Moderator

