Insert a new column with value based on file title - HDFS

Hi, I have multiple files in HDFS with the same schema, and I want to aggregate all of them into a single Hive table. Each file represents a date, but that date appears only in the file name. What is the best way to add the file name (the date) as a new column in these files? Java? NiFi? Thanks!

1 ACCEPTED SOLUTION

Re: Insert a new column with value based on file title - HDFS

Get date from Filename

There are some ways to get at the filename in MapReduce, but it's difficult: MapReduce by design abstracts filenames away. You have a few options:

1) Use a small preprocessing script (Python, Java, shell, whatever) OUTSIDE Hadoop that takes the date from the filename and adds it as a field to each row of each file. Easy, but not that scalable.

2) Write your own RecordReader, which can take the file path from its input split and add it to each record.

3) Pig seems to provide an option called tagsource that can do the same:

http://stackoverflow.com/questions/9751480/how-can-i-incorporate-the-current-input-filename-into-my-...

4) Hive has a hidden (virtual) column for the input filename, INPUT__FILE__NAME, so you could use that to compute a date column:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
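
For option 1, a minimal Python sketch of such a preprocessing script could look like the following. It is only an illustration under assumptions that are not in the original question: the files are comma-delimited text, and the date appears in the file name as YYYY-MM-DD (the name "sales_2016-03-15.csv" is made up). The function names are hypothetical as well.

```python
import csv
import re
from pathlib import Path

# Assumption: the date appears in the file name as YYYY-MM-DD,
# e.g. "sales_2016-03-15.csv". Adjust the pattern to the real naming scheme.
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})")


def date_from_filename(path):
    """Return the first YYYY-MM-DD substring found in the file name."""
    match = DATE_PATTERN.search(Path(path).name)
    if match is None:
        raise ValueError(f"no date found in file name: {path}")
    return match.group(1)


def append_date_column(src, dst, file_date, delimiter=","):
    """Copy delimited rows from src to dst, appending file_date to each row."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines instead of emitting a date-only row
            writer.writerow(row + [file_date])


def process_file(in_path, out_path, delimiter=","):
    """Write a copy of in_path to out_path with the date column added."""
    file_date = date_from_filename(in_path)
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        append_date_column(src, dst, file_date, delimiter)
```

Since this runs outside Hadoop, the files would have to be on a local disk first (for example copied out with hdfs dfs -get and back with hdfs dfs -put), which is why this route is easy but does not scale well.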
