Created 10-05-2017 04:37 PM
I am running a Sql query in Spark:
spark.sql("select person_key, count(*) as count1 from <table_name> group by person_key order by count1 desc").show()
This throws a warning:
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, 17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, 17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, 17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, 17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, 17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, 17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
But does give correct results. I want to understand what this means. I did not find anything on the net. I want this resolved because though it is giving the correct results, it is taking very long to execute. (The same query on Hive LLAP takes 3 seconds. Spark numbers are usually comparable to Hive LLAP numbers).
I checked person_key does exist in the table (I created it so the table so I know it exists). Not sure why the warning is coming.
Created 10-05-2017 06:29 PM
That is a valid warning. Old Hive ORC writer doesn't save the correct schema into ORC files. They wrote dummy column names like, `_col1`. You are using that old ORC file. If you generate new ORC file with Hive 2, then you can not see that warning.
Created 10-05-2017 07:35 PM
Desc <Table_name>;
you'll come to know your table would have columns like _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9..
What you need to do is Add the alias to your existing table so that your column names are the actual names and not default _col1, etc.