I have an S3 bucket full of partitioned production data:
I spin up an EMR cluster and pull down some dirty data and clean it up, including de-duplicating it against the prod data.
Now, on my cluster, in HDFS, I have maybe
This represents new data:
I know that I can create a table with its LOCATION pointed at the bucket described above and do an "insert into" or an "insert overwrite", but this is very slow: it uses a single reducer that copies ALL the new data.
Instead, I want to use s3-dist-cp, which updates the data much more quickly.
However, my 000000_0 chunks will overwrite the old ones.
I have a script that renames the chunks (000000_0 -> BCF704E2-B8A7-4F71-8747-A68AD52E50B7), but it takes about 3 seconds per partition, which adds up to over an hour.
So, here's my question: is there an HDFS setting to change the way the chunks are named? For example, can I force the chunks to be named using the date or a GUID?
@zackriesland, you don't say what application you're using to pull and clean the new data on HDFS. The name pattern "data_day=01-01-2017/000000_0" is under the control of that application, and can be configured there. It's not an HDFS thing; HDFS is just a file system, and stores files under whatever filenames the app tells it to use.
If it is not acceptable to change the naming configuration in that app, there are other possible approaches. You say that renaming the files takes 3 seconds per chunk file. Am I correct in assuming that you're using S3 commands to rename the files after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3. You can use the "hadoop fs -mv" CLI command, very much like the Linux "mv" command, to rename files on HDFS from a script, and it should not take anywhere near 3 seconds. You can run this command in a shell script while logged into any server in the HDFS cluster, or from a workstation with, for instance, the Python HDFS client library installed and configured to talk to the HDFS cluster.
Finally, you're probably aware that filenames in S3 are actually just strings, and the slash ("/") to delimit "folders" is purely convention, except for the delimiter between the bucket name and the rest of the filename. There aren't actually any folders below the level of the bucket. So changing the "dest" location for the s3-dist-cp command to put it in a sub-folder is really just the same as specifying a prefix (the name of the "sub-folder") on the filename. Use the date, or a serial number, or whatever convention you want so each invocation of s3-dist-cp puts its files on S3 with a different prefix. It seems that would also fulfill the requirement above.
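A minimal sketch of that prefix idea, with a hypothetical bucket name and source path; --src and --dest are the standard s3-dist-cp options on EMR, and the actual call is left commented out so the sketch only prints what it would run:

```shell
# Give every s3-dist-cp run its own S3 prefix ("sub-folder") so new chunks
# can never collide with the old ones. Bucket and paths are hypothetical.
BATCH=$(date +%Y-%m-%d_%H%M%S)
DEST="s3://my-bucket/data/my/table/batch=${BATCH}/"
echo "would run: s3-dist-cp --src hdfs:///data/my/table --dest $DEST"
# s3-dist-cp --src hdfs:///data/my/table --dest "$DEST"
```

Because each invocation gets a fresh batch= prefix, the 000000_0 names inside it no longer matter.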
You don't say what application you're using to pull and clean the new data on HDFS. - I pull the data using s3-dist-cp, and then project a table over it with Hive, and run Spark and Hive queries for ETL.
It's not an HDFS thing - I suppose it's technically a Hive thing. Still, I have the same question - can I control how Hive names the chunks?
Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3. - No. I'm actually renaming on HDFS before pushing to S3. And yes, "hdfs dfs -mv ..." does take about 3 seconds per file. I can prove it if you're interested. I'll attach my script for reference.
Regarding your last comment - I do understand how S3 works. I do NOT know of a s3-dist-cp option to force a prefix or convention to the individual chunks. For example, if I have a bunch of data representing a table at /data/my/table in HDFS, I can push that to any prefix in S3, but I don't know how to specify that each chunk under /data/my/table should be renamed. I can push the chunks INDIVIDUALLY and control the name, but then my app is no longer scalable. The length of time increases linearly with the size of the data, regardless of the size of the cluster. That's why I'm trying to leverage s3-dist-cp - it's the only way I have found to push data from a cluster to s3 in a scalable way.
Hi @Zack Riesland, thanks for the clarification. I understand now that s3-dist-cp is used both coming and going, and I agree it doesn't seem to have any renaming capability built in. I strongly suspect there's a way to control the filenames by controlling the Hive partition name, but I'm not a Hive expert and maybe the application benefits from using the same partition name for raw and clean data.
Here's what I found that might help: a StackOverflow question that confirms your complaint that CLI renames of HDFS files are slow, and confirms my sense that they shouldn't be: https://stackoverflow.com/questions/14736017/batch-rename-in-hadoop The second responder (Robert) wrote a 1-page Java program that uses the Hadoop Java API and shows that it can rename several thousand files in the same 4 seconds the CLI takes to rename one file. This suggests that the majority of the time is taken up by the connection protocol, and that the Java program can use a single connection for multiple operations.
I also looked at options for using Python similarly. It appears that the hdfs3 library would do quite nicely, although you have to separately install the underlying C library open-sourced by Pivotal. Full docs for installation and API are given in the linked doc. Please note it probably would not work to use the similarly-named "hdfs" library, because the latter uses the WebHDFS interface, which is unlikely to be faster than the CLI (although I didn't test it).
Since I'm a Hortonworks employee, I should point out that the above opensource references are not supported or endorsed by the company, and are used at your own risk. I did look over the Java program and it looks fine to me, but that's my personal opinion, not the company's.
Good luck. If that doesn't get you where you need to be, and since your profile popup indicates you are a Hortonworks support customer, you'd be welcome to open a support request for advice from an SE.
#!/bin/bash
scriptname=$(basename "$0")
echo "Running $scriptname $@..."
if [ "$#" -ne 1 ]; then
    echo "Wrong number of arguments. Expected 1 but got $#"
    echo " (Usage: $scriptname <path_to_data>)"
    exit 1
fi
HDFS_PATH=$1
start=$(date +%s)
# "hdfs dfs -ls" output is word-split; keep only the tokens that are paths
for partition_name in $(hdfs dfs -ls "$HDFS_PATH"); do
    if [[ $partition_name == $HDFS_PATH* ]]; then
        echo "Looping through $partition_name"
        for chunk_name in $(hdfs dfs -ls "$partition_name"); do
            if [[ $chunk_name == $partition_name* ]]; then
                UUID=$(uuidgen)
                echo "Will rename $chunk_name to $partition_name/$UUID"
                hdfs dfs -mv "$chunk_name" "$partition_name/$UUID"
            fi
        done
    fi
done
echo "Exiting after $(( $(date +%s) - start )) seconds"