Created 06-13-2016 01:13 PM
Hi, I have multiple files in HDFS with the same schema, and I want to aggregate all of them into a single Hive table. Each file represents a date, but that information only appears in the file name. What is the best way to insert the file name (the date) as a new column in these files? Java? NiFi? Thanks!
Created 06-13-2016 01:29 PM
Get date from Filename
There are some ways to get at the filename in MapReduce, but it's difficult: MapReduce by design abstracts filenames away. You have a few options:
1) Use a small preprocessing script (Python, Java, shell, whatever) OUTSIDE Hadoop that appends a field with the date, taken from the filename, to each row of each file. Easy, but not that scalable.
2) Write your own RecordReader.
3) Pig seems to provide an option called tagsource (on its PigStorage loader) that can do the same.
4) Hive has a hidden (virtual) column for the filename, INPUT__FILE__NAME, so you could use that to compute a date column:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
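For option 1, a preprocessing script could look roughly like the sketch below. It assumes things the question does not state: that the file names contain the date as YYYY-MM-DD (for example a hypothetical `sales_2016-06-13.csv`), that the files are comma-delimited text, and that there is no header row. The function names are made up for illustration, so adjust the pattern and delimiter to your real files.

```python
import re
from datetime import datetime

# Assumption: the date appears in the file name as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})")


def date_from_filename(filename):
    """Return the YYYY-MM-DD date embedded in a file name or path."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        raise ValueError("no date found in file name: %s" % filename)
    date_text = match.group(1)
    # strptime raises ValueError if the digits are not a real calendar date.
    datetime.strptime(date_text, "%Y-%m-%d")
    return date_text


def add_date_column(lines, filename, delimiter=","):
    """Yield each non-empty row with the file's date appended as a last field."""
    file_date = date_from_filename(filename)
    for line in lines:
        row = line.rstrip("\r\n")
        if row:
            yield row + delimiter + file_date


def rewrite_file(src_path, dst_path, delimiter=","):
    """Copy src_path to dst_path, appending the date column to every row."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for row in add_date_column(src, src_path, delimiter):
            dst.write(row + "\n")
```

You would run `rewrite_file` on each file locally and then put the results into HDFS. As said above, this does not scale well, so for large volumes the Hive virtual column is probably the better route.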