Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Tips for optimizing export to S3(n) ?

Super Collaborator

I've been experimenting with the options for copying data from our (bare metal) cluster to S3.

I found that something like this works:

hive> create table aws.my_table
> (
> `column1` string,
> `column2` string,
  ....
> `columnX` string)
> row format delimited fields terminated by ','
> lines terminated by '\n'
> stored as textfile
> location 's3n://my_bucket/my_folder_path/';
hive> insert into table aws.my_table select * from source_db.source_table;

But only if the source data set is pretty small.

For a larger data set (tens of GB), it fails with errors like:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at 
        ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: n must be positive
        at 
        ...

I understand that pushing gigabytes (and eventually terabytes) of data to a remote server is going to be somewhat painful.

So, I'm wondering what kind of customizations are available.

Is there a way to specify compression, upload throttling, etc.?

Can anyone advise on how to get around these errors?

1 ACCEPTED SOLUTION

@Zack Riesland, have you considered trying DistCp to copy the raw files from a source hdfs: URI to a destination s3n: or s3a: URI? It's possible this would be able to move the data more quickly than the Hive insert into/select from. If it's still important to have Hive metadata referencing the table at the s3n: or s3a: location, then you could handle that by creating an external table after completion of the DistCp.
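As a rough sketch of what that could look like (not a tested recipe): the script below builds a DistCp command and the follow-up external table DDL, and by default only prints them. The source path, the map count, the bandwidth figure, and the two-column table definition are all placeholders assumed for illustration; the bucket path is the one from the question. `-m` caps the number of simultaneous copy tasks and `-bandwidth` limits throughput per map in MB/s, which is the closest thing DistCp has to upload throttling. Note that the external table's row format has to match whatever format the copied files are actually in.

```shell
#!/usr/bin/env bash
# Sketch only: SRC, MAX_MAPS, MB_PER_MAP and the column list are assumptions.
set -euo pipefail

SRC="hdfs:///apps/hive/warehouse/source_db.db/source_table"  # assumed warehouse path
DST="s3n://my_bucket/my_folder_path/"                        # bucket path from the question
MAX_MAPS=20      # -m: maximum number of simultaneous copy tasks
MB_PER_MAP=10    # -bandwidth: approximate per-map limit in MB/s

# Build the DistCp command as an array so it can be printed or executed as-is.
distcp_cmd=(hadoop distcp -m "$MAX_MAPS" -bandwidth "$MB_PER_MAP" -update "$SRC" "$DST")

# External table over the copied files. Column names are placeholders and the
# row format must match the format of the files that were copied.
ddl="CREATE EXTERNAL TABLE aws.my_table (column1 STRING, column2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${DST}';"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: show what would run without touching the cluster.
  echo "${distcp_cmd[*]}"
  echo "hive -e \"${ddl}\""
else
  "${distcp_cmd[@]}"   # copy the raw files first
  hive -e "$ddl"       # then register the table metadata
fi
```

Running it with `DRY_RUN=0` would execute both steps for real; starting with a small `-m` and `-bandwidth` and raising them gradually is probably the safer way to find what the uplink tolerates.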


11 REPLIES

Super Collaborator

Chris,

I added this fs.s3.buffer.dir property under "custom hdfs-site" under the hdfs properties in Ambari - the same place where I added my aws credentials (which are working).

But it doesn't appear to be sticking. I pointed the property at "/home/s3_temp", which I created on the edge node where I'm testing DistCp. But I never see data in there, and my uploads continue to fail with the same errors as before.

Any ideas?

cc @Chris Nauroth
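For reference, one way to rule out the Ambari setting not being picked up is to pass the property directly to the job with Hadoop's generic `-D` option. The sketch below only builds and prints such a command; the source path is a hypothetical placeholder, and whether this resolves the failure is an assumption, not something confirmed in this thread. Since DistCp runs as a MapReduce job, the buffering would presumably happen on the worker nodes running the map tasks rather than on the edge node, so the directory would likely need to exist and be writable on those nodes as well.

```shell
#!/usr/bin/env bash
# Sketch only: SRC is a placeholder; the buffer directory is the one named above.
set -euo pipefail

BUFFER_DIR="/home/s3_temp"
SRC="hdfs:///apps/hive/warehouse/source_db.db/source_table"  # assumed source path
DST="s3n://my_bucket/my_folder_path/"

# Generic -D options go before the source and destination arguments.
distcp_cmd=(hadoop distcp "-Dfs.s3.buffer.dir=${BUFFER_DIR}" "$SRC" "$DST")

# Print the command for inspection; run it with: "${distcp_cmd[@]}"
echo "${distcp_cmd[*]}"
```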

Super Collaborator

Fantastic. Thanks!