Created on 06-03-2019 01:55 PM - edited 09-16-2022 07:25 AM
I have a scenario where i'm trying to create a table which points to an HDFS location which has a directory name starting with an "_" in the HDFS path. Now table creation goes through but If I try to read data out of the table it throws error, below is what i get:
create external table `ingest.workgroup__views2` row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' location 'hdfs://nameservice1/user/data/ingest/mdm/workgroup_i/workgroup/_views' tblproperties ('avro.schema.url'='hdfs://nameservice1/user/data/ingest/mdm/workgroup_i/workgroup/_views/_gen/_views.avsc');
No rows affected (0.232 seconds)
0: jdbc:hive2://t-hive.sys.cigna.com:25006/de> select * from ingest.workgroup__views2;
Error: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameservice1/user/data/ingest/mdm/workgroup_i/workgroup/_views (state=,code=0)
0: jdbc:hive2://t-hive.sys.cigna.com:25006/de> drop table ingest.workgroup__views2;
So i escape the special character "_" in location and the table gets created and i' able to run select to see data as below:
create external table `ingest.workgroup__views2` row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' location 'hdfs://nameservice1/user/data/ingest/mdm/workgroup_i/workgroup/\_views' tblproperties ('avro.schema.url'='hdfs://nameservice1/user/data/ingest/mdm/workgroup_i/workgroup/_views/_gen/_views.avsc');
No rows affected (0.19 seconds)
0: jdbc:hive2://t-hive.sys.cigna.com:25006/de> select * from ingest.workgroup__views2;
+-----------------------+-------------------------+-----------------------------+-------------------------------+-----------------------------+-------------------------------+--------------------------------+--------------------------+--------------------------+----------------------------+----------------------------+--+
| workgroup__views2.id | workgroup__views2.name | workgroup__views2.view_url | workgroup__views2.created_at | workgroup__views2.owner_id | workgroup__views2.owner_name | workgroup__views2.workbook_id | workgroup__views2.index | workgroup__views2.title | workgroup__views2.caption | workgroup__views2.site_id |
+-----------------------+-------------------------+-----------------------------+-------------------------------+-----------------------------+-------------------------------+--------------------------------+--------------------------+--------------------------+----------------------------+----------------------------+--+
+-----------------------+-------------------------+-----------------------------+-------------------------------+-----------------------------+-------------------------------+--------------------------------+--------------------------+--------------------------+----------------------------+----------------------------+--+
No rows selected (0.139 seconds)
Now the weird part is its only the location part which has this issue, parsing of URI mentioned under tblproperties goes through as you can see above and if I explicitly try to escape "_" in tblproperties it doesn't work.
Any comments or suggestions will be helpful on the above obesrvation.
Regards
Created on 06-06-2019 07:56 AM - edited 06-06-2019 09:39 AM
@Prav ,
This appears to have been listed as a bug (which is actually a longstanding limitation due to the definition of files and directories with _ and . being considered as "hidden" in FileInputFormat in Hadoop) of Hive since the 0.12 version:
https://issues.apache.org/jira/browse/HIVE-6431
https://stackoverflow.com/questions/19830264/which-files-are-ignored-as-input-by-mapper
If these files are needed to be seen, please consider using a pre-process script to rename them after loading.
Thanks,
Robert Justice, Technical Resolution Manager
Created 06-05-2019 07:15 AM
Hey Network,
Anyone had this issue or maybe Cloudera team in this community may share if this a known bug etc?
Regards
Created on 06-06-2019 07:56 AM - edited 06-06-2019 09:39 AM
@Prav ,
This appears to have been listed as a bug (which is actually a longstanding limitation due to the definition of files and directories with _ and . being considered as "hidden" in FileInputFormat in Hadoop) of Hive since the 0.12 version:
https://issues.apache.org/jira/browse/HIVE-6431
https://stackoverflow.com/questions/19830264/which-files-are-ignored-as-input-by-mapper
If these files are needed to be seen, please consider using a pre-process script to rename them after loading.
Thanks,
Robert Justice, Technical Resolution Manager