Creating Hive external table on specific files within folder

Contributor

I have data being dropped into HDFS on a daily basis. Each day's folder contains multiple CSV files, for example:

/data/yyyy/mm/dd/file1.csv

/data/yyyy/mm/dd/file2.csv

I want to create a Hive external table over all the file1.csv files across every folder under /data, but it does not seem to be possible to use a regex in the Hive external table command.

My next thought was to copy the files into separate structures so that Hive can read each set of files individually, such as:

/data/file1/yyyy/mm/dd/file1.csv

/data/file2/yyyy/mm/dd/file2.csv

But I am not sure what the best way of doing this would be. Whatever I use would first need to bulk-copy the existing data between these folder structures, and then be schedulable so that it copies files over daily as new folders are created.
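One way to picture the copy step is a small script that maps each source path to its target path and prints the matching "hdfs dfs" commands. This is only a sketch under assumptions: the function names target_path and copy_commands are made up for illustration, the date folders are assumed to be purely numeric (yyyy/mm/dd), and the list of source paths is assumed to come from something like `hdfs dfs -ls -R /data`.

```python
import re
import shlex
from typing import Iterable, List, Optional

# Matches the layout from the question: /data/yyyy/mm/dd/<name>.csv
# Assumption: the date folders are numeric, e.g. /data/2024/01/15/.
SOURCE_PATTERN = re.compile(
    r"^/data/(?P<yyyy>\d{4})/(?P<mm>\d{2})/(?P<dd>\d{2})/(?P<name>[^/]+)\.csv$"
)


def target_path(source: str) -> Optional[str]:
    """Map /data/yyyy/mm/dd/name.csv to /data/name/yyyy/mm/dd/name.csv.

    Returns None for paths that do not match the source layout, which also
    skips files that already live under the per-file trees.
    """
    match = SOURCE_PATTERN.match(source)
    if match is None:
        return None
    p = match.groupdict()
    return f"/data/{p['name']}/{p['yyyy']}/{p['mm']}/{p['dd']}/{p['name']}.csv"


def copy_commands(sources: Iterable[str]) -> List[str]:
    """Build the hdfs dfs commands that would copy each matching file."""
    commands = []
    for source in sources:
        target = target_path(source)
        if target is None:
            continue
        parent = target.rsplit("/", 1)[0]
        commands.append(f"hdfs dfs -mkdir -p {shlex.quote(parent)}")
        commands.append(
            f"hdfs dfs -cp {shlex.quote(source)} {shlex.quote(target)}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical listing; in practice this could come from
    # `hdfs dfs -ls -R /data`.
    listing = [
        "/data/2024/01/15/file1.csv",
        "/data/2024/01/15/file2.csv",
        "/data/file1/2024/01/15/file1.csv",  # already copied, ignored
    ]
    for command in copy_commands(listing):
        print(command)
```

The same script could serve both the initial bulk copy (full listing) and a daily scheduled run (listing of the new day's folder only). Note that `hdfs dfs -cp` refuses to overwrite an existing target unless `-f` is given, so a rerun over already-copied days would need either that flag or a check for existing targets.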

Any help would be greatly appreciated, please let me know if any of the above is unclear.

1 ACCEPTED SOLUTION

Expert Contributor

I am not sure about your use case. If you want to include just file1 in the Hive table, you have to copy those files into separate folders. Alternatively, you could include all the data in the Hive table and let Hive control what data can be selected or seen.

2 REPLIES

Contributor

Thanks for the response, Frank. I guess my question really was how to move these files into the correct folder structure easily, without manually running "hdfs dfs" commands.

Including all the data in the Hive table and then letting Hive control what can be selected or seen is an interesting concept. That might be a way of doing what we are after without having to change the underlying structure of the data in HDFS. We could then create views on top of this single Hive table to split the data, and still insert into Hive internal tables if needed.
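As a rough sketch of that idea, the helper below renders HiveQL for one raw external table over /data plus a view per file name, filtering on Hive's INPUT__FILE__NAME virtual column. Several things here are assumptions rather than anything confirmed in this thread: the table and view names (all_files, file1_lines) are made up, the table exposes each CSV line as a single string column because file1 and file2 may not share a schema, and the two SET lines for reading nested folders may or may not be needed depending on the Hive version and execution engine.

```python
# Assumption: these settings let Hive read files in nested subfolders
# (yyyy/mm/dd) below the table location; whether they are required
# depends on the Hive version and execution engine.
RECURSIVE_SETTINGS = [
    "SET hive.mapred.supports.subdirectories=true",
    "SET mapred.input.dir.recursive=true",
]


def raw_table_ddl(table: str, location: str) -> str:
    """External table that exposes every line of every file as one string."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (line STRING)\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


def file_view_ddl(view: str, table: str, file_name: str) -> str:
    """View restricted to rows that were read from files named file_name."""
    return (
        f"CREATE VIEW IF NOT EXISTS {view} AS\n"
        f"SELECT line FROM {table}\n"
        f"WHERE INPUT__FILE__NAME LIKE '%/{file_name}'"
    )


if __name__ == "__main__":
    statements = RECURSIVE_SETTINGS + [
        raw_table_ddl("all_files", "/data"),          # hypothetical table name
        file_view_ddl("file1_lines", "all_files", "file1.csv"),
        file_view_ddl("file2_lines", "all_files", "file2.csv"),
    ]
    print(";\n\n".join(statements) + ";")
```

Each view would still need to split `line` into columns (for example with Hive's split function) before it is useful for inserting into internal tables, and every query through a view would scan all files under /data, so this trades copy effort for read cost.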