Support Questions

Find answers, ask questions, and share your expertise

Can I control naming patterns for HDFS chunks?

avatar
Super Collaborator

Here's my scenario:

I have an S3 bucket full of partitioned production data:

data_day=01-01-2017/000000_0

data_day=01-01-2017/000000_1

data_day=01-02-2017/000000_0

data_day=01-02-2017/000000_1

...

etc

I spin up an EMR cluster and pull down some dirty data and clean it up, including de-duplicating it against the prod data.

Now, on my cluster, in HDFS, I have maybe

data_day=01-01-2017/000000_0

data_day=01-02-2017/000000_0

This represents only the new data.

I know that I can create a table and point its 'location' at the bucket described above and do an "insert into" or an "insert overwrite", but this is very slow - it uses a single reducer that copies ALL the new data.

Instead, I want to use s3-dist-cp which will update the data much more quickly.
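The invocation I have in mind looks roughly like this (the bucket name is just a placeholder for my prod bucket):

s3-dist-cp --src hdfs:///data/my/table --dest s3://my-prod-bucket/table/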

However, my 000000_0 chunks will overwrite the old ones.

I have a script that renames the chunks (000000_0 -> BCF704E2-B8A7-4F71-8747-A68AD52E50B7, for example), but each rename takes about 3 seconds, which adds up to over an hour across all my partitions.

So, here's my question: is there an HDFS setting to change the way the chunks are named? For example, can I force the chunks to be named using the date or a GUID?

Thanks in advance


7 REPLIES 7

avatar

@zackriesland, you don't say what application you're using to pull and clean the new data on HDFS. The name pattern "data_day=01-01-2017/000000_0" is under the control of that application, and can be configured there. It's not an HDFS thing; HDFS is just a file system, and stores files under whatever filenames the app tells it to use.

If it is not acceptable to change the naming configuration in that app, there are other possible approaches. You say that renaming the files takes 3 seconds per chunk file. Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3. You can use the "hadoop fs -mv" CLI command, very much like the Linux "mv" command, to rename files on HDFS from a script, and it should not take anywhere near 3 seconds. You can use this command in a shell script while logged into any server in the HDFS cluster. Another way to run such a script is on a workstation with, for instance, the Python HDFS Client library installed, and configured to be able to talk to the HDFS cluster.
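For example, something along these lines, run from any node in the cluster (the paths here are illustrative):

hadoop fs -mv /data/my/table/data_day=01-01-2017/000000_0 /data/my/table/data_day=01-01-2017/$(uuidgen)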

Finally, you're probably aware that filenames in S3 are actually just strings, and the slash ("/") to delimit "folders" is purely convention, except for the delimiter between the bucket name and the rest of the filename. There aren't actually any folders below the level of the bucket. So changing the "dest" location for the s3-dist-cp command to put it in a sub-folder is really just the same as specifying a prefix (the name of the "sub-folder") on the filename. Use the date, or a serial number, or whatever convention you want so each invocation of s3-dist-cp puts its files on S3 with a different prefix. It seems that would also fulfill the requirement above.
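For instance (bucket and prefix are placeholders):

s3-dist-cp --src hdfs:///data/my/table --dest s3://prod-bucket/data/$(date +%Y-%m-%d)/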

Hope one or another of these ideas helps.

avatar
Super Collaborator

Thanks @Matt Foley

A few clarifications for whatever they're worth:

You don't say what application you're using to pull and clean the new data on HDFS.
- I pull the data using s3-dist-cp, and then project a table over it with Hive, and run Spark and Hive queries for ETL.

It's not an HDFS thing
- I suppose it's technically a Hive thing. Still, I have the same question - can I control how Hive names the chunks?

Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3.
- No. I'm actually renaming on HDFS before pushing to S3. And yes, "hdfs dfs -mv ..." does take about 3 seconds per file. I can prove it if you're interested. I'll attach my script for reference.

Regarding your last comment
- I do understand how S3 works. I do NOT know of an s3-dist-cp option to force a prefix or naming convention onto the individual chunks.
For example, if I have a bunch of data representing a table at /data/my/table in HDFS, I can push that to any prefix in S3, but I don't know how to specify that each chunk under /data/my/table should be renamed.
I can push the chunks INDIVIDUALLY and control the name, but then my app is no longer scalable. The length of time increases linearly with the size of the data, regardless of the size of the cluster.
That's why I'm trying to leverage s3-dist-cp - it's the only way I have found to push data from a cluster to S3 in a scalable way.
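To illustrate the non-scalable version (bucket and paths are made up): pushing chunk by chunk looks roughly like the loop below, and it runs serially no matter how many nodes the cluster has.

# List only the chunk files (lines starting with "-") under one partition,
# then copy each one to S3 under a fresh UUID. Assumes EMRFS-style s3:// URIs.
for chunk in $(hdfs dfs -ls /data/my/table/data_day=01-01-2017 | grep '^-' | awk '{print $NF}'); do
  hadoop fs -cp "$chunk" "s3://my-prod-bucket/data_day=01-01-2017/$(uuidgen)"
done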

avatar

Hi @Zack Riesland, thanks for the clarification. I understand now that s3-dist-cp is used both coming and going, and I agree it doesn't seem to have any renaming capability built in. I strongly suspect there's a way to control the filenames by controlling the Hive partition name, but I'm not a Hive expert and maybe the application benefits from using the same partition name for raw and clean data.

Here's what I found that might help: a StackOverflow question that confirms your complaint that CLI rename of HDFS files is slow, and confirms my sense that it shouldn't be: https://stackoverflow.com/questions/14736017/batch-rename-in-hadoop. The second responder (Robert) wrote a 1-page Java program that uses the Hadoop Java API, and shows that it can rename several thousand files in the same 4 seconds that the CLI takes to rename one file. This suggests that the majority of the time is taken up by the connection protocol, and that the Java program can use a single connection for multiple operations.

I also looked at options for using Python similarly. It appears that the hdfs3 library would do quite nicely, although you have to separately install the underlying C library open-sourced by Pivotal. Full docs for installation and the API are given in the linked doc. Please note it probably would not work to use the similarly-named "hdfs" library, because the latter uses the WebHDFS interface, which is unlikely to be faster than the CLI (although I didn't test it).

Since I'm a Hortonworks employee, I should point out that the above open-source references are not supported or endorsed by the company, and are used at your own risk. I did look over the Java program and it looks fine to me, but that's my personal opinion, not the company's.

Good luck. If that doesn't get you where you need to be, and since your profile popup indicates you are a Hortonworks support customer, you'd be welcome to open a support request for advice from an SE.

avatar
Super Collaborator

Thanks @Matt Foley,

The insight that renaming via the Java API is so much faster is especially interesting. I'll investigate that further!

avatar
Super Collaborator
#!/bin/bash

scriptname=$(basename "$0")
echo ""
echo "Running $scriptname $@..."
echo " (Usage: $scriptname <path_to_data>)"
echo ""

if [ "$#" -ne 1 ]; then
  echo "Wrong number of arguments. Expected 1 but got $#"
  exit 1
fi

SECONDS=0

HDFS_PATH="$1"

# "hdfs dfs -ls" output gets split on whitespace, so only the tokens that
# start with the search path are actual file/directory paths.
for partition_name in $(hdfs dfs -ls "$HDFS_PATH")
do
  if [[ $partition_name == $HDFS_PATH* ]]
  then
    echo "Looping through $partition_name"
    for chunk_name in $(hdfs dfs -ls "$partition_name")
    do
      if [[ $chunk_name == $partition_name* ]]
      then
        UUID=$(uuidgen)
        UUID=${UUID^^}   # uppercase the UUID

        echo "Will rename $chunk_name to $partition_name/$UUID"
        # Dry run by default; uncomment to actually perform the rename:
        # hdfs dfs -mv "$chunk_name" "$partition_name/$UUID"
      fi
    done
  fi
done

duration=$SECONDS
echo "Exiting after $duration seconds"

exit 0

avatar
New Contributor

Hello, can you please help me with a similar script for batch renaming Hadoop files? Thanks! 

avatar
Super Collaborator

I wrestled with the Java and hdfs3 options mentioned above, but getting either of them to run on EMR was pretty painful and not very bootstrap script friendly.

Finally, I figured out how to do this with Spark.

It's lightning fast, and super simple:

import java.util.UUID.randomUUID
import org.apache.hadoop.fs.{FileSystem, Path}

// "sc" is the SparkContext; args(0) is the HDFS directory to process.
val hdfs = FileSystem.get(sc.hadoopConfiguration)
val files = hdfs.listStatus(new Path(args(0)))
val originalPaths = files.map(_.getPath())

println("Will now rename files in " + args(0) + "...")

// Append a random UUID to each file name so the chunks can't collide
// with existing files when they're copied back to S3.
for (path <- originalPaths) {
  val newPath = path.suffix("." + randomUUID().toString)
  println("Will move " + path + " to " + newPath)
  hdfs.rename(path, newPath)
}
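For anyone else trying this: the snippet looks like the body of a compiled Spark app's main method (it references sc and args), so running it might look something like the line below. The class and jar names are placeholders.

spark-submit --class RenameChunks rename-chunks.jar /data/my/table/data_day=01-01-2017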