Query a Hive external table Failed with exception java.io.IOException:java.io.FileNotFoundException: File does not exist
Labels: Apache Flume, Apache Hive
Created 04-05-2016 11:33 PM
Dear Hortonworks Community,
I created an external table to process data coming from the Twitter Streaming API via Flume, using the following script:
ADD JAR /usr/hdp/2.3.2.0-2950/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
USE dbtwitter;
DROP TABLE IF EXISTS dbtwitter.tweets_raw;
CREATE EXTERNAL TABLE IF NOT EXISTS dbtwitter.tweets_raw (
contributors string, coordinates string,
...
... )
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/twitter/landing/bod';
Everything was OK while only a few tweets were streaming. Once I had more than approximately 40,000 tweets (sometimes even fewer), queries and every DDL command began to fail with Java exceptions.
For example, when I ran SELECT count(*) FROM tweets_raw I got the desired result,
but when I ran SELECT * FROM tweets_raw it began listing rows and then, after showing a lot of lines, failed with an exception similar to this:
java.io.IOException:java.io.FileNotFoundException: File does not exist: /user/flume/twitter/landing/bod/FlumeData.1459892472851.tmp
It seems that Hive is not able to handle the files that are landing in the LOCATION while I'm executing queries or DDL commands.
Any help will be appreciated.
Best regards,
JOSE GUILLEN
Created 04-06-2016 02:03 AM
Hi @JOSE GUILLEN, this happens because Flume keeps writing to a temporary .tmp file, and when the file is full it is renamed so that the .tmp suffix is dropped. So the .tmp file may be there when your Hive query starts, but by the time Hive comes to read it, it may already have been renamed. There is a Flume JIRA, FLUME-2458, created to move tmp files to a separate directory, but it's not resolved yet. In the meantime you can try a workaround by setting hdfs.filePrefix and hdfs.inUsePrefix in your Flume conf file, for example:
hdfs.path=/user/flume/twitter/landing
hdfs.filePrefix=bod/
hdfs.inUsePrefix=tmp/
and pre-create /user/flume/twitter/landing/tmp/bod in HDFS, where the .tmp files will be stored (please test this, since I don't have a Flume setup handy to try it).
Edit: It might be enough just to set "hdfs.inUsePrefix=.". That way tmp files will be named like .file101.tmp and will be hidden from Hive. So please try this first, and if you still have issues, try the workaround above.
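For reference, here is a minimal sketch of what the dot-prefix variant might look like in a Flume conf file. The agent name (TwitterAgent) and sink name (HDFS) are assumptions for illustration, not taken from the original setup, so substitute your own; only the hdfs.* property names come from the Flume HDFS sink. This is a fragment of the sink section only, and it is untested here.

```properties
# Hypothetical agent/sink names: replace TwitterAgent and HDFS with your own.
TwitterAgent.sinks.HDFS.type = hdfs
# Same directory the Hive external table points at.
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/twitter/landing/bod
# Prefix added only while a file is still being written, so an in-progress
# file would be named like .FlumeData.1459892472851.tmp
TwitterAgent.sinks.HDFS.hdfs.inUsePrefix = .
```

The idea is that Hadoop's file input formats skip files whose names start with "." or "_", so Hive should only see a file once Flume has closed it and renamed it without the prefix.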
Created 04-06-2016 03:49 PM
Hi, @Predrag Minovic,
Using the dot (.) as the value of hdfs.inUsePrefix is an elegant way to hide the temporary files from Hive. It also solves some problems I had when running DDL to refresh a view or recreate tables, which ended in a "file not found" exception. Thank you very much!
Best regards,
JOSE GUILLEN