Created on 11-19-2016 01:09 AM - edited 09-16-2022 01:36 AM
This tutorial will help you get started accessing data stored on Amazon S3 from a cluster created through Hortonworks Data Cloud for AWS 1.16 (released in June 2017). The tutorial assumes no prior experience with AWS.
In this tutorial:
- We will use DistCp to copy sample data from S3 to HDFS and from HDFS to S3.
- We will also use fs shell commands.
- We will use the Landsat 8 data that AWS makes available in the s3://landsat-pds bucket in the US West (Oregon) region.
- We will also create a new S3 bucket to which we will copy data from HDFS.
- In general, when specifying a path to S3, we will follow this required convention: `s3a://bucket-name/directory/`.
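As an illustration of this path convention, here is a minimal Python sketch (the helper name and example values are my own, not part of the tutorial) that joins a bucket name and path components into an s3a:// URI:

```python
# Minimal sketch: build an s3a:// URI following the convention above.
# The function name and inputs are illustrative, not from the tutorial.
def s3a_uri(bucket, *parts):
    """Join a bucket name and path components into an s3a:// URI."""
    path = "/".join(p.strip("/") for p in parts if p)
    return f"s3a://{bucket}/{path}" if path else f"s3a://{bucket}/"

print(s3a_uri("landsat-pds", "scene_list.gz"))  # s3a://landsat-pds/scene_list.gz
```

Any of the `hadoop fs` or `hadoop distcp` commands below take URIs of exactly this shape.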
Let's get started!
Before starting this tutorial, your cloud controller needs to be running, and you must have a cluster running on AWS.
To set up the cloud controller and cluster, refer to the following tutorial: How to set up Hortonworks Data Cloud for AWS.
1. SSH to a cluster node.
You can copy the SSH information from the cloud controller UI.
2. In HDCloud clusters, after you SSH to a cluster node, the default user is cloudbreak. The cloudbreak user doesn't have write access to HDFS, so let's create a directory to which we will copy the data, and then change the owner and permissions so that the cloudbreak user can write to it:
sudo -u hdfs hdfs dfs -mkdir /user/cloudbreak
sudo -u hdfs hdfs dfs -chown cloudbreak /user/cloudbreak
sudo -u hdfs hdfs dfs -chmod 700 /user/cloudbreak
Now you will be able to copy data to the newly created directory.
We will copy the scene_list.gz file from a public S3 bucket called landsat-pds to HDFS:
1. First, let’s check if the scene_list.gz file that we are trying to copy exists in the S3 bucket:
hadoop fs -ls s3a://landsat-pds/scene_list.gz
2. You should see something similar to:
-rw-rw-rw- 1 cloudbreak 33410181 2016-11-18 17:16 s3a://landsat-pds/scene_list.gz
3. Now let's copy scene_list.gz to your current directory using the following command:
hadoop distcp s3a://landsat-pds/scene_list.gz .
4. You should see something similar to:
________________________________________________________
[cloudbreak@ip-10-0-1-208 ~]$ hadoop distcp s3a://landsat-pds/scene_list.gz .
16/11/18 22:00:50 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://landsat-pds/scene_list.gz], targetPath=, targetPathExists=true, filtersFile='null'}
16/11/18 22:00:51 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:00:51 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:00:51 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:00:53 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
16/11/18 22:00:53 INFO tools.SimpleCopyListing: Build file listing completed.
16/11/18 22:00:53 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:00:53 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:00:53 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:00:53 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:00:53 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:00:53 INFO mapreduce.JobSubmitter: number of splits:1
16/11/18 22:00:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479498757313_0009
16/11/18 22:00:54 INFO impl.YarnClientImpl: Submitted application application_1479498757313_0009
16/11/18 22:00:54 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-208.ec2.internal:8088/proxy/application_1479498757313_0009/
16/11/18 22:00:54 INFO tools.DistCp: DistCp job-id: job_1479498757313_0009
16/11/18 22:00:54 INFO mapreduce.Job: Running job: job_1479498757313_0009
16/11/18 22:01:01 INFO mapreduce.Job: Job job_1479498757313_0009 running in uber mode : false
16/11/18 22:01:01 INFO mapreduce.Job: map 0% reduce 0%
16/11/18 22:01:11 INFO mapreduce.Job: map 100% reduce 0%
16/11/18 22:01:11 INFO mapreduce.Job: Job job_1479498757313_0009 completed successfully
16/11/18 22:01:11 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=145318
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=349
HDFS: Number of bytes written=33410189
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
S3A: Number of bytes read=33410181
S3A: Number of bytes written=0
S3A: Number of read operations=3
S3A: Number of large read operations=0
S3A: Number of write operations=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=8309
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=8309
Total vcore-milliseconds taken by all map tasks=8309
Total megabyte-milliseconds taken by all map tasks=8508416
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=121
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=54
CPU time spent (ms)=3520
Physical memory (bytes) snapshot=281440256
Virtual memory (bytes) snapshot=2137710592
Total committed heap usage (bytes)=351272960
File Input Format Counters
Bytes Read=228
File Output Format Counters
Bytes Written=8
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=33410181
BYTESEXPECTED=33410181
COPY=1
[cloudbreak@ip-10-0-1-208 ~]$
________________________________________________________
5. Now let’s check if the file that we copied is present in the cloudbreak directory:
hadoop fs -ls
6. You should see something similar to:
-rw-r--r-- 3 cloudbreak hdfs 33410181 2016-11-18 21:30 scene_list.gz
Congratulations! You’ve successfully copied the file from an S3 bucket to HDFS!
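One way to double-check a DistCp run like the one above is to compare the CopyMapper counters printed at the end of its output: BYTESCOPIED should equal BYTESEXPECTED. A minimal sketch of that check (the parsing helper is hypothetical; the counter names and values come from the log above):

```python
# Minimal sketch: confirm a DistCp copy was complete by comparing the
# CopyMapper counters from the console output (BYTESCOPIED should equal
# BYTESEXPECTED). The log excerpt mirrors the output shown above.
import re

def distcp_counters(log_text):
    """Extract the CopyMapper counter values from DistCp console output."""
    counters = {}
    for name in ("BYTESCOPIED", "BYTESEXPECTED", "COPY"):
        m = re.search(rf"{name}=(\d+)", log_text)
        if m:
            counters[name] = int(m.group(1))
    return counters

log = """
BYTESCOPIED=33410181
BYTESEXPECTED=33410181
COPY=1
"""
c = distcp_counters(log)
assert c["BYTESCOPIED"] == c["BYTESEXPECTED"]  # all expected bytes arrived
```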
In this step, we will copy the scene_list.gz file from the cloudbreak directory to an S3 bucket. But before that, we need to create a new S3 bucket.
1. In your browser, navigate to the S3 Dashboard https://console.aws.amazon.com/s3/home.
2. Click Create Bucket and create a bucket. For example, here I am creating a bucket called "domitest". Since my cluster and source data are in the US West (Oregon) region, I am creating the bucket in the same region.
3. Next, navigate to the bucket and create a folder. For example, here I am creating a folder called "demo".
4. Now, from our cluster node, let’s check if the bucket and folder that we just created exist:
hadoop fs -ls s3a://domitest/
5. You should see something similar to:
Found 1 items
drwxrwxrwx - cloudbreak 0 2016-11-18 22:17 s3a://domitest/demo
Congratulations! You’ve successfully created an Amazon S3 bucket.
1. Now let’s copy the scene_list.gz file from HDFS to this newly created bucket:
hadoop distcp /user/cloudbreak/scene_list.gz s3a://domitest/demo
2. You should see something similar to:
______________________
[cloudbreak@ip-10-0-1-208 ~]$ hadoop distcp /user/cloudbreak/scene_list.gz s3a://domitest/demo
16/11/18 22:20:32 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/cloudbreak/scene_list.gz], targetPath=s3a://domitest/demo, targetPathExists=true, filtersFile='null'}
16/11/18 22:20:33 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:20:33 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:20:33 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:20:34 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
16/11/18 22:20:34 INFO tools.SimpleCopyListing: Build file listing completed.
16/11/18 22:20:34 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:20:34 INFO tools.DistCp: Number of paths in the copy list: 1
16/11/18 22:20:34 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
16/11/18 22:20:34 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
16/11/18 22:20:34 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
16/11/18 22:20:34 INFO mapreduce.JobSubmitter: number of splits:1
16/11/18 22:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479498757313_0010
16/11/18 22:20:35 INFO impl.YarnClientImpl: Submitted application application_1479498757313_0010
16/11/18 22:20:35 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-208.ec2.internal:8088/proxy/application_1479498757313_0010/
16/11/18 22:20:35 INFO tools.DistCp: DistCp job-id: job_1479498757313_0010
16/11/18 22:20:35 INFO mapreduce.Job: Running job: job_1479498757313_0010
16/11/18 22:20:42 INFO mapreduce.Job: Job job_1479498757313_0010 running in uber mode : false
16/11/18 22:20:42 INFO mapreduce.Job: map 0% reduce 0%
16/11/18 22:20:53 INFO mapreduce.Job: map 100% reduce 0%
16/11/18 22:21:01 INFO mapreduce.Job: Job job_1479498757313_0010 completed successfully
16/11/18 22:21:01 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=145251
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=33410572
HDFS: Number of bytes written=8
HDFS: Number of read operations=10
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
S3A: Number of bytes read=0
S3A: Number of bytes written=33410181
S3A: Number of read operations=14
S3A: Number of large read operations=0
S3A: Number of write operations=4098
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=14695
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=14695
Total vcore-milliseconds taken by all map tasks=14695
Total megabyte-milliseconds taken by all map tasks=15047680
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=122
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=57
CPU time spent (ms)=4860
Physical memory (bytes) snapshot=280420352
Virtual memory (bytes) snapshot=2136977408
Total committed heap usage (bytes)=350748672
File Input Format Counters
Bytes Read=269
File Output Format Counters
Bytes Written=8
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=33410181
BYTESEXPECTED=33410181
COPY=1
______________________
3. Next, let's check if the file that we copied is present in the demo folder of the S3 bucket:
hadoop fs -ls s3a://domitest/demo
4. You should see something similar to:
Found 1 items
-rw-rw-rw- 1 cloudbreak 33410181 2016-11-18 22:20 s3a://domitest/demo/scene_list.gz
5. You will also see the file on the S3 Dashboard.
Congratulations! You’ve successfully copied the file from HDFS to the S3 bucket!
1. Try creating another bucket. Using similar syntax, you can try copying files between two S3 buckets that you created.
2. If you want to copy more files, try adding `-D fs.s3a.fast.upload=true` to the distcp command and see how this accelerates the transfer.
3. Try running more hadoop fs shell commands.
4. Learn more about the landsat-pds bucket at https://pages.awscloud.com/public-data-sets-landsat.html.
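Once scene_list.gz is copied down and decompressed, it is a plain CSV index of Landsat scenes, so you can explore it with ordinary tools. A minimal sketch (the column names are assumed from the public landsat-pds dataset's format, and the sample rows are made up for illustration):

```python
# Minimal sketch: filter a scene_list-style CSV for scenes with low cloud
# cover. Column names are assumed from the landsat-pds format; the rows
# below are invented sample data, not real entries.
import csv, io

sample = io.StringIO(
    "entityId,acquisitionDate,cloudCover,path,row\n"
    "LC80440342016259LGN00,2016-09-15,2.1,44,34\n"
    "LC80440342016275LGN00,2016-10-01,78.5,44,34\n"
)

# Keep only scenes with less than 10% cloud cover.
clear_scenes = [r["entityId"] for r in csv.DictReader(sample)
                if float(r["cloudCover"]) < 10.0]
print(clear_scenes)  # ['LC80440342016259LGN00']
```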
Any files stored on S3 or in HDFS add to your charges, so it's good to get into the habit of deleting files that you no longer need.
1. To delete the scene_list.gz file from HDFS, run:
hadoop fs -rm -skipTrash /user/cloudbreak/scene_list.gz
2. To delete the scene_list.gz file from the S3 bucket, run:
hadoop fs -rm -skipTrash s3a://domitest/demo/scene_list.gz
Or, you can delete it from the S3 Dashboard.
Visit the Cloud Data Access documentation for more information on working with Amazon S3 buckets.