Querying a Hive external table fails with exception java.io.IOException:java.io.FileNotFoundException: File does not exist

Contributor

Dear Hortonworks Community,

I created an external table to process data coming from the Twitter Streaming API via Flume, using the following script:

-- Register the JSON SerDe used by the table
ADD JAR /usr/hdp/2.3.2.0-2950/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

USE dbtwitter;

DROP TABLE IF EXISTS dbtwitter.tweets_raw;

-- External table over the directory where Flume lands the tweets
CREATE EXTERNAL TABLE IF NOT EXISTS dbtwitter.tweets_raw (
  contributors string, coordinates string,
  ...
  ... )
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/twitter/landing/bod';

Everything was OK while only a few tweets were streaming in. Once I had more than approximately 40,000 tweets (sometimes even fewer), queries and DDL commands began to fail with Java exceptions.

For example, when I ran SELECT count(*) FROM tweets_raw, I got the desired result,

but when I ran SELECT * FROM tweets_raw, it began listing rows and then, after showing many lines, failed with an exception similar to this:

java.io.IOException:java.io.FileNotFoundException: File does not exist: /user/flume/twitter/landing/bod/FlumeData.1459892472851.tmp

It seems that Hive is not able to handle files that are landing in the LOCATION while I'm executing queries or DDL commands.

Any help will be appreciated.

Best regards,

JOSE GUILLEN

1 ACCEPTED SOLUTION

Master Guru

Hi @JOSE GUILLEN, this happens because Flume keeps writing to a temporary file with a .tmp suffix, and when the file is full it is renamed so that the .tmp suffix is dropped. So the .tmp file may be there when your Hive script starts, but by the time Hive comes to read it, the file may already have been renamed. There is a Flume JIRA, FLUME-2458, created to move tmp files to a separate directory, but it's not resolved yet. In the meantime, you can try a workaround: set hdfs.filePrefix and hdfs.inUsePrefix in your Flume conf file, for example

hdfs.path=/user/flume/twitter/landing
hdfs.filePrefix=bod/
hdfs.inUsePrefix=tmp/

and pre-create /user/flume/twitter/landing/tmp/bod in HDFS, where the .tmp files will be stored (please test this, since I don't have a Flume setup handy to try it).

Edit: It might be enough just to set "hdfs.inUsePrefix=.". That way tmp files will be named like .file101.tmp and will be hidden from Hive. So please try this first, and if you still have issues, then try the workaround above.
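To make the race and the dot-prefix fix concrete, here is a minimal sketch in Python. It is not Hive or Flume code: it uses a local temporary directory in place of HDFS, and the function names (visible_to_hive, plan_then_read, simulate) are made up for illustration. The only rule it borrows is the hidden-file convention of Hadoop input formats, which skip names starting with "." or "_".

```python
import os
import tempfile
from typing import Callable, List


def visible_to_hive(name: str) -> bool:
    """Mimic the Hadoop hidden-file rule: names starting with '.' or '_' are skipped."""
    return not name.startswith((".", "_"))


def plan_then_read(directory: str, roll: Callable[[], None]) -> List[str]:
    """List the visible files (planning), let the writer roll, then try to open each one."""
    planned = sorted(n for n in os.listdir(directory) if visible_to_hive(n))
    roll()  # the writer renames its in-use file between planning and reading
    missing = []
    for name in planned:
        try:
            with open(os.path.join(directory, name)):
                pass
        except FileNotFoundError:
            missing.append(name)
    return missing


def simulate(in_use_prefix: str) -> List[str]:
    """Return the planned files that disappeared before they could be read."""
    with tempfile.TemporaryDirectory() as d:
        final = "FlumeData.1459892472851"
        in_use = in_use_prefix + final + ".tmp"
        open(os.path.join(d, in_use), "w").close()  # file currently being written

        def roll() -> None:
            # Stand-in for Flume closing the file and dropping the in-use prefix/suffix
            os.rename(os.path.join(d, in_use), os.path.join(d, final))

        return plan_then_read(d, roll)


if __name__ == "__main__":
    print(simulate(""))   # no prefix: the .tmp file is planned, then vanishes
    print(simulate("."))  # dot prefix: the in-use file is never planned
```

With an empty prefix the in-use file is picked up at listing time and is gone at read time, which corresponds to the FileNotFoundException in the question. With the "." prefix it is never listed, so the rename cannot break the read.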


2 REPLIES

Contributor

Hi, @Predrag Minovic,

It's an elegant solution to hide the temporary files from Hive by using a dot (.) as the value of hdfs.inUsePrefix. This also solves some problems I had when running DDL to refresh a view or recreate tables, which ended in a file-not-found exception. Thank you very much!

Best regards,

JOSE GUILLEN