Support Questions

Find answers, ask questions, and share your expertise

Tips for optimizing export to S3 (s3n)?

Super Collaborator

I've been experimenting with the options for copying data from our (bare metal) cluster to S3.

I found that something like this works:

hive> create table aws.my_table (
    >   `column1` string,
    >   `column2` string,
    >   ...
    >   `columnX` string)
    > row format delimited fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile
    > location 's3n://my_bucket/my_folder_path/';
hive> insert into table aws.my_table select * from source_db.source_table;

But only if the source data set is pretty small.

For a larger data set (tens of GB), it fails with errors like:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at ...
        ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: n must be positive
        at ...

I understand that pushing gigabytes (and eventually terabytes) of data to a remote server is going to be somewhat painful.

So, I'm wondering what kind of customizations are available.

Is there a way to specify compression, upload throttling, etc.?

Can anyone give me instructions on getting around these errors?
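
For what it's worth, the only compression-related knobs I've found so far are the standard Hive output-compression settings, sketched below; I'm not sure whether they even apply, or help, when the target is s3n:

hive> -- enable compressed output for the insert (GzipCodec is just one possible codec)
hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> insert into table aws.my_table select * from source_db.source_table;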

1 ACCEPTED SOLUTION


@Zack Riesland, have you considered trying DistCp to copy the raw files from a source hdfs: URI to a destination s3n: or s3a: URI? It's possible this would be able to move the data more quickly than the Hive insert into/select from. If it's still important to have Hive metadata referencing the table at the s3n: or s3a: location, then you could handle that by creating an external table after completion of the DistCp.
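
A rough sketch of what that could look like - the warehouse path, map count, and column list below are illustrative and assume the source table is stored as comma-delimited text, so adjust them to your cluster and table:

# copy the raw table files from HDFS to S3; -m caps the number of parallel copy maps
hadoop distcp -m 20 \
    hdfs:///apps/hive/warehouse/source_db.db/source_table \
    s3a://my_bucket/my_folder_path/

hive> -- layer Hive metadata over the copied files; an external table means dropping it later won't delete the S3 data
hive> create external table aws.my_table (
    >   `column1` string,
    >   `column2` string,
    >   ...
    >   `columnX` string)
    > row format delimited fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile
    > location 's3a://my_bucket/my_folder_path/';

Since DistCp just moves files, the -m setting is also the closest thing to the upload throttling you asked about.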


11 REPLIES

Super Collaborator

Chris,

I added the fs.s3.buffer.dir property under "Custom hdfs-site" in the HDFS configuration in Ambari - the same place where I added my AWS credentials (which are working).

But it doesn't appear to be sticking. I pointed the property at "/home/s3_temp", which I created on the edge node where I'm testing the DistCp tool. But I never see any data in there, and my uploads continue to fail with the same errors as before.
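
For reference, the entry I added should end up as something like this in the generated hdfs-site.xml (assuming Ambari writes it out the way I expect):

<property>
  <!-- local directory used to buffer blocks before they are uploaded to S3 -->
  <name>fs.s3.buffer.dir</name>
  <value>/home/s3_temp</value>
</property>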

Any ideas?

cc @Chris Nauroth

Super Collaborator

Fantastic. Thanks!