Created on 11-19-2016 01:09 AM - edited 09-16-2022 01:36 AM
This tutorial will help you get started accessing data stored on Amazon S3 from a cluster created through Hortonworks Data Cloud for AWS 1.16 (released in June 2017). The tutorial assumes no prior experience with AWS.
In this tutorial:
- We will use DistCp to copy sample data from S3 to HDFS and from HDFS to S3.
- We will also use fs shell commands.
- We will use the Landsat 8 data that AWS makes available in the s3://landsat-pds bucket in the US West (Oregon) region.
- We will also create a new S3 bucket to which we will copy data from HDFS.
- In general, when specifying a path to S3, we will follow this required convention: `s3a://bucket-name/directory/`.
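As an illustration of this path convention, here is a minimal Python sketch (the helper name and example values are my own, not part of the tutorial) that joins a bucket name and path components into an s3a:// URI:

```python
# Minimal sketch: build an s3a:// URI following the convention above.
# The function name and inputs are illustrative, not from the tutorial.
def s3a_uri(bucket, *parts):
    """Join a bucket name and path components into an s3a:// URI."""
    path = "/".join(p.strip("/") for p in parts if p)
    return f"s3a://{bucket}/{path}" if path else f"s3a://{bucket}/"

print(s3a_uri("landsat-pds", "scene_list.gz"))  # s3a://landsat-pds/scene_list.gz
```

Any of the `hadoop fs` or `hadoop distcp` commands below take URIs of exactly this shape.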
Let's get started!
Before starting this tutorial, your cloud controller needs to be running, and you must have a cluster running on AWS.
To set up the cloud controller and cluster, refer to the following tutorial: How to set up Hortonworks Data Cloud for AWS.
1. SSH to a cluster node.
You can copy the SSH information from the cloud controller UI.
2. In HDCloud clusters, after you SSH to a cluster node, the default user is cloudbreak. The cloudbreak user doesn't have write access to HDFS, so let's create a directory to which we will copy the data, and then change the owner and permissions so that the cloudbreak user can write to it:
sudo -u hdfs hdfs dfs -mkdir /user/cloudbreak
sudo -u hdfs hdfs dfs -chown cloudbreak /user/cloudbreak
sudo -u hdfs hdfs dfs -chmod 700 /user/cloudbreak
Now you will be able to copy data to the newly created directory.
We will copy the scene_list.gz file from a public S3 bucket called landsat-pds to HDFS:
1. First, let’s check if the scene_list.gz file that we are trying to copy exists in the S3 bucket:
hadoop fs -ls s3a://landsat-pds/scene_list.gz
2. You should see something similar to:
-rw-rw-rw- 1 cloudbreak 33410181 2016-11-18 17:16 s3a://landsat-pds/scene_list.gz
3. Now let's copy scene_list.gz to your current directory using the following command:
hadoop distcp s3a://landsat-pds/scene_list.gz .
4. You should see something similar to:
________________________________________________________
[cloudbreak@ip-10-0-1-208 ~]$ hadoop distcp s3a://landsat-pds/scene_list.gz .
16/11/18 22:00:50 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://landsat-pds/scene_list.gz], targetPath=, targetPathExists=true, filtersFile='null'}
16/11/18 22:00:51 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:00:51 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:00:51 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:00:53 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
16/11/18 22:00:53 INFO tools.SimpleCopyListing: Build file listing completed.
16/11/18 22:00:53 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:00:53 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:00:53 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:00:53 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:00:53 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:00:53 INFO mapreduce.JobSubmitter: number of splits:1
16/11/18 22:00:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479498757313_0009
16/11/18 22:00:54 INFO impl.YarnClientImpl: Submitted application application_1479498757313_0009
16/11/18 22:00:54 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-208.ec2.internal:8088/proxy/application_1479498757313_0009/
16/11/18 22:00:54 INFO tools.DistCp: DistCp job-id: job_1479498757313_0009
16/11/18 22:00:54 INFO mapreduce.Job: Running job: job_1479498757313_0009
16/11/18 22:01:01 INFO mapreduce.Job: Job job_1479498757313_0009 running in uber mode : false
16/11/18 22:01:01 INFO mapreduce.Job: map 0% reduce 0%
16/11/18 22:01:11 INFO mapreduce.Job: map 100% reduce 0%
16/11/18 22:01:11 INFO mapreduce.Job: Job job_1479498757313_0009 completed successfully
16/11/18 22:01:11 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=145318
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=349
HDFS: Number of bytes written=33410189
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
S3A: Number of bytes read=33410181
S3A: Number of bytes written=0
S3A: Number of read operations=3
S3A: Number of large read operations=0
S3A: Number of write operations=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=8309
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=8309
Total vcore-milliseconds taken by all map tasks=8309
Total megabyte-milliseconds taken by all map tasks=8508416
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=121
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=54
CPU time spent (ms)=3520
Physical memory (bytes) snapshot=281440256
Virtual memory (bytes) snapshot=2137710592
Total committed heap usage (bytes)=351272960
File Input Format Counters
Bytes Read=228
File Output Format Counters
Bytes Written=8
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=33410181
BYTESEXPECTED=33410181
COPY=1
[cloudbreak@ip-10-0-1-208 ~]$
________________________________________________________
5. Now let’s check if the file that we copied is present in the cloudbreak directory:
hadoop fs -ls
6. You should see something similar to:
-rw-r--r-- 3 cloudbreak hdfs 33410181 2016-11-18 21:30 scene_list.gz
Congratulations! You’ve successfully copied the file from an S3 bucket to HDFS!
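One way to double-check a DistCp run like the one above is to compare the CopyMapper counters printed at the end of its output: BYTESCOPIED should equal BYTESEXPECTED. A minimal sketch of that check (the parsing helper is hypothetical; the counter names and values come from the log above):

```python
# Minimal sketch: confirm a DistCp copy was complete by comparing the
# CopyMapper counters from the console output (BYTESCOPIED should equal
# BYTESEXPECTED). The log excerpt mirrors the output shown above.
import re

def distcp_counters(log_text):
    """Extract the CopyMapper counter values from DistCp console output."""
    counters = {}
    for name in ("BYTESCOPIED", "BYTESEXPECTED", "COPY"):
        m = re.search(rf"{name}=(\d+)", log_text)
        if m:
            counters[name] = int(m.group(1))
    return counters

log = """
BYTESCOPIED=33410181
BYTESEXPECTED=33410181
COPY=1
"""
c = distcp_counters(log)
assert c["BYTESCOPIED"] == c["BYTESEXPECTED"]  # all expected bytes arrived
```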
In this step, we will copy the scene_list.gz file from the cloudbreak directory to an S3 bucket. But before that, we need to create a new S3 bucket.
1. In your browser, navigate to the S3 Dashboard https://console.aws.amazon.com/s3/home.
2. Click Create Bucket and create a bucket. For example, here I am creating a bucket called "domitest". Since my cluster and source data are in the US West (Oregon) region, I am creating the bucket in the same region.
3. Next, navigate to the bucket and create a folder. For example, here I am creating a folder called "demo".
4. Now, from our cluster node, let’s check if the bucket and folder that we just created exist:
hadoop fs -ls s3a://domitest/
5. You should see something similar to:
Found 1 items
drwxrwxrwx - cloudbreak 0 2016-11-18 22:17 s3a://domitest/demo
Congratulations! You’ve successfully created an Amazon S3 bucket.
1. Now let’s copy the scene_list.gz file from HDFS to this newly created bucket:
hadoop distcp /user/cloudbreak/scene_list.gz s3a://domitest/demo
2. You should see something similar to:
______________________
[cloudbreak@ip-10-0-1-208 ~]$ hadoop distcp /user/cloudbreak/scene_list.gz s3a://domitest/demo
16/11/18 22:20:32 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/cloudbreak/scene_list.gz], targetPath=s3a://domitest/demo, targetPathExists=true, filtersFile='null'}
16/11/18 22:20:33 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:20:33 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:20:33 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:20:34 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
16/11/18 22:20:34 INFO tools.SimpleCopyListing: Build file listing completed.
16/11/18 22:20:34 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:20:34 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:20:34 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:20:34 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:20:34 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:20:34 INFO mapreduce.JobSubmitter: number of splits:1
16/11/18 22:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479498757313_0010
16/11/18 22:20:35 INFO impl.YarnClientImpl: Submitted application application_1479498757313_0010
16/11/18 22:20:35 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-208.ec2.internal:8088/proxy/application_1479498757313_0010/
16/11/18 22:20:35 INFO tools.DistCp: DistCp job-id: job_1479498757313_0010
16/11/18 22:20:35 INFO mapreduce.Job: Running job: job_1479498757313_0010
16/11/18 22:20:42 INFO mapreduce.Job: Job job_1479498757313_0010 running in uber mode : false
16/11/18 22:20:42 INFO mapreduce.Job: map 0% reduce 0%
16/11/18 22:20:53 INFO mapreduce.Job: map 100% reduce 0%
16/11/18 22:21:01 INFO mapreduce.Job: Job job_1479498757313_0010 completed successfully
16/11/18 22:21:01 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=145251
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=33410572
HDFS: Number of bytes written=8
HDFS: Number of read operations=10
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
S3A: Number of bytes read=0
S3A: Number of bytes written=33410181
S3A: Number of read operations=14
S3A: Number of large read operations=0
S3A: Number of write operations=4098
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=14695
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=14695
Total vcore-milliseconds taken by all map tasks=14695
Total megabyte-milliseconds taken by all map tasks=15047680
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=122
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=57
CPU time spent (ms)=4860
Physical memory (bytes) snapshot=280420352
Virtual memory (bytes) snapshot=2136977408
Total committed heap usage (bytes)=350748672
File Input Format Counters
Bytes Read=269
File Output Format Counters
Bytes Written=8
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=33410181
BYTESEXPECTED=33410181
COPY=1
______________________
3. Next, let's check if the file that we copied is present in the demo folder of the S3 bucket:
hadoop fs -ls s3a://domitest/demo
4. You should see something similar to:
Found 1 items
-rw-rw-rw- 1 cloudbreak 33410181 2016-11-18 22:20 s3a://domitest/demo/scene_list.gz
5. You will also see the file on the S3 Dashboard.
Congratulations! You’ve successfully copied the file from HDFS to the S3 bucket!
1. Try creating another bucket. Using similar syntax, you can try copying files between two S3 buckets that you created.
2. If you want to copy more files, try adding `-D fs.s3a.fast.upload=true` to the distcp command and see how this accelerates the transfer.
3. Try running more hadoop fs shell commands.
4. Learn more about the landsat-pds bucket at https://pages.awscloud.com/public-data-sets-landsat.html.
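Once scene_list.gz is copied down and decompressed, it is a plain CSV index of Landsat scenes, so you can explore it with ordinary tools. A minimal sketch (the column names are assumed from the public landsat-pds dataset's format, and the sample rows are made up for illustration):

```python
# Minimal sketch: filter a scene_list-style CSV for scenes with low cloud
# cover. Column names are assumed from the landsat-pds format; the rows
# below are invented sample data, not real entries.
import csv, io

sample = io.StringIO(
    "entityId,acquisitionDate,cloudCover,path,row\n"
    "LC80440342016259LGN00,2016-09-15,2.1,44,34\n"
    "LC80440342016275LGN00,2016-10-01,78.5,44,34\n"
)

# Keep only scenes with less than 10% cloud cover.
clear_scenes = [r["entityId"] for r in csv.DictReader(sample)
                if float(r["cloudCover"]) < 10.0]
print(clear_scenes)  # ['LC80440342016259LGN00']
```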
Any files stored on S3 or in HDFS add to your charges, so it's good to get into the habit of deleting files that you no longer need.
1. To delete the scene_list.gz file from HDFS, run:
hadoop fs -rm -skipTrash /user/cloudbreak/scene_list.gz
2. To delete the scene_list.gz file from the S3 bucket, run:
hadoop fs -rm -skipTrash s3a://domitest/demo/scene_list.gz
Or, you can delete it from the S3 Dashboard.
Visit the Cloud Data Access documentation for more information on working with Amazon S3 buckets.