Support Questions

Hi, I have an HDFS directory with files named like "filename-yyyy-mm-ddThh:mm:ss". I want to load the files that fall between any two given dates into my Hive table. Please help.


2 REPLIES

Re: Hi, I have an HDFS directory with files like "filename-yyyy-mm-ddThh:mm:ss". I want to load these files into my Hive table, between any two given dates. Please help.


Hello Akila Vel,

Why not create a Hive external table on top of that directory, then use the Hive virtual column (INPUT__FILE__NAME) to filter the data and load it into another table (perhaps stored as ORC)?

Kind regards.

Re: Hi, I have an HDFS directory with files like "filename-yyyy-mm-ddThh:mm:ss". I want to load these files into my Hive table, between any two given dates. Please help.

Hi @AKILA VEL. @mlanciaux's solution is very clever. It's easy to forget about the INPUT__FILE__NAME virtual column.

Working with input__file__name, datetimes, and string functions can be a little 'goofy', so here is a worked code example. Hopefully it's not confusing.

Let's take a single file with 10 records that looks like this:

1|Record 01|100.00 
2|Record 02|200.00
3|Record 03|300.00      
...
10|Record 10|1000.00
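If you want to recreate that sample file locally before copying it into HDFS, a few lines of Python will generate it (this is my own sketch; the path and format just match the example above):

```python
# Generate the 10-record, pipe-delimited sample file used in this example.
with open('/tmp/poc_data_file.txt', 'w') as f:
    for i in range(1, 11):
        f.write('%d|Record %02d|%.2f\n' % (i, i, i * 100.0))
```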

To create a sample like you describe, we'll make six identical copies of it in HDFS, following the naming convention you mention, and put them in a directory called /datedfiles.

hadoop fs -mkdir  /datedfiles/
hadoop fs -put /tmp/poc_data_file.txt  /datedfiles/datedfile_2016_05_01T00_00_00
hadoop fs -put /tmp/poc_data_file.txt  /datedfiles/datedfile_2016_05_02T00_00_00  
hadoop fs -put /tmp/poc_data_file.txt  /datedfiles/datedfile_2016_05_03T02_00_00  
hadoop fs -put /tmp/poc_data_file.txt  /datedfiles/datedfile_2016_05_04T03_00_00  
hadoop fs -put /tmp/poc_data_file.txt  /datedfiles/datedfile_2016_05_04T23_59_00  
hadoop fs -put /tmp/poc_data_file.txt  /datedfiles/datedfile_2016_05_05T00_00_00  

(Note that we replaced the colons in the timestamp portion with underscores - colons in HDFS filenames can cause misery, so avoid them if you can.)
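If your incoming file names do contain colons, you can sanitize them before the put. Here is a hypothetical helper in Python (the replacement scheme just matches the underscore convention used above, and is my own illustration):

```python
# Replace the ':' and '-' separators with '_' so the name is HDFS-safe
# and matches the datedfile_yyyy_mm_ddThh_mm_ss convention used here.
def hdfs_safe_name(name: str) -> str:
    return name.replace(':', '_').replace('-', '_')

print(hdfs_safe_name('datedfile-2016-05-01T00:00:00'))
# -> datedfile_2016_05_01T00_00_00
```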

Now in Hive you can create a Hive External Table pointed to that directory.

create external table dated_file_data 
   (id int, textval string, amount double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/datedfiles';

Now we can query the data and find the distinct file names (using INPUT__FILE__NAME):

select distinct 
INPUT__FILE__NAME
FROM dated_file_data;

input__file__name 
hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_03T02_00_00
hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_04T23_59_00
hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_05T00_00_00
hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_02T00_00_00
hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_01T00_00_00
hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_04T03_00_00

You probably don't want to reference the long URI returned by INPUT__FILE__NAME in your queries, so you can use REGEXP_EXTRACT to pull out just the file name fairly easily.

Rather than having to code the REGEXP_EXTRACT statement every time we write a HiveQL query, we create a simple Hive view on top of the external table to make querying easier and cleaner:

create view vw_dated_file_data
as select 
   REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) AS file_name,
   id, 
   textval, 
   amount
from dated_file_data; 
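To see what that pattern actually captures, you can test the same regular expression outside of Hive. A quick Python check (my own illustration, using one of the paths listed above):

```python
import re

# Greedy '.*/' backtracks so that group 1 captures the parent directory
# and group 2 captures the bare file name at the end of the URI.
pattern = r'.*/(.*)/(.*)'
path = 'hdfs://sandbox.hortonworks.com:8020/datedfiles/datedfile_2016_05_01T00_00_00'
m = re.match(pattern, path)
print(m.group(1))  # datedfiles
print(m.group(2))  # datedfile_2016_05_01T00_00_00
```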

Now we can query the data and select rows based on the date in the filename.

-- 6 files will be counted, total = 60 
select count(*) 
from vw_dated_file_data; 

_c0
60

This query counts rows from five of the six files. Note that it is the upper bound that does the excluding here: 'datedfile_2016_05_05T00_00_00' sorts after the bare prefix 'datedfile_2016_05_05', so the May 5th file fails the <= test. The May 1st file is still included, because its full name also sorts after 'datedfile_2016_05_01'.

select count(*)  as `count_value`
from vw_dated_file_data
where file_name > 'datedfile_2016_05_01'
  and file_name <= 'datedfile_2016_05_05';

count_value
50 
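The filtering works because the zero-padded date parts make plain string comparison agree with chronological order. You can sanity-check the comparison logic outside of Hive; here is a small Python illustration (my own sketch) over the six file names above:

```python
# String comparison on zero-padded names matches chronological order.
files = [
    'datedfile_2016_05_01T00_00_00',
    'datedfile_2016_05_02T00_00_00',
    'datedfile_2016_05_03T02_00_00',
    'datedfile_2016_05_04T03_00_00',
    'datedfile_2016_05_04T23_59_00',
    'datedfile_2016_05_05T00_00_00',
]
# The May 1st full name sorts after the bare prefix 'datedfile_2016_05_01',
# so it passes the lower bound; the May 5th name fails the upper bound.
selected = [f for f in files
            if 'datedfile_2016_05_01' < f <= 'datedfile_2016_05_05']
print(len(selected) * 10)  # 50 rows, from 5 of the 6 files
```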

Note that we don't need to use the view - it just makes our query much cleaner. This code is functionally equivalent to the last query:

select count(*)  as `count_value`
from dated_file_data
where REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) > 'datedfile_2016_05_01'
  and REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) <= 'datedfile_2016_05_05';

count_value
50

The next query will return detailed data from 2 of the 6 files (datedfile_2016_05_04T03_00_00 and datedfile_2016_05_04T23_59_00). It will skip the file from 2016_05_05 because of the 'less than'.

select * 
from vw_dated_file_data
where file_name >= 'datedfile_2016_05_04'
  and file_name < 'datedfile_2016_05_05'
  order by file_name, id; 

file_name id textval amount
datedfile_2016_05_04T03_00_00  1  Record 01 100.0
datedfile_2016_05_04T03_00_00  2  Record 02 200.0
datedfile_2016_05_04T03_00_00  3  Record 03 300.0
datedfile_2016_05_04T03_00_00  4  Record 04 400.0
datedfile_2016_05_04T03_00_00  5  Record 05 500.0
datedfile_2016_05_04T03_00_00  6  Record 06 600.0
datedfile_2016_05_04T03_00_00  7  Record 07 700.0
datedfile_2016_05_04T03_00_00  8  Record 08 800.0
datedfile_2016_05_04T03_00_00  9  Record 09 900.0
datedfile_2016_05_04T03_00_00  10 Record 10 1000.0
datedfile_2016_05_04T23_59_00  1  Record 01 100.0
datedfile_2016_05_04T23_59_00  2  Record 02 200.0
datedfile_2016_05_04T23_59_00  3  Record 03 300.0
datedfile_2016_05_04T23_59_00  4  Record 04 400.0
datedfile_2016_05_04T23_59_00  5  Record 05 500.0
datedfile_2016_05_04T23_59_00  6  Record 06 600.0
datedfile_2016_05_04T23_59_00  7  Record 07 700.0
datedfile_2016_05_04T23_59_00  8  Record 08 800.0
datedfile_2016_05_04T23_59_00  9  Record 09 900.0
datedfile_2016_05_04T23_59_00  10 Record 10 1000.0

Now if you wanted to take the output from the last query and load it into an ORC table with SNAPPY compression, you could run this query:

create table dated_file_data_orc
stored as orc 
tblproperties('orc.compress'='SNAPPY')
as 
select * 
from vw_dated_file_data
where file_name >= 'datedfile_2016_05_04'
  and file_name < 'datedfile_2016_05_05'
  order by file_name, id; 

Hope it helps.