Created 06-15-2016 02:53 PM
I've been experimenting with the options for copying data from our (bare metal) cluster to S3.
I found that something like this works:
hive> create table aws.my_table
    > (
    >   `column1` string,
    >   `column2` string,
    >   ....
    >   `columnX` string)
    > row format delimited fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile
    > location 's3n://my_bucket/my_folder_path/';
hive> insert into table aws.my_table select * from source_db.source_table;
But only if the source data set is pretty small.
For a larger data set (tens of GB), it fails with errors like:
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at ...
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: n must be positive
    at ...
I understand that pushing gigabytes (and eventually terabytes) of data to a remote server is going to be somewhat painful.
So, I'm wondering what kind of customizations are available.
Is there a way to specify compression, upload throttling, etc.?
Can anyone give me instructions on getting around these errors?
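For example, would it just be a matter of setting the usual output compression properties before the insert? Something like this (just a sketch of what I mean; I haven't verified whether these settings help with this particular failure):

hive> SET hive.exec.compress.output=true;
hive> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> insert into table aws.my_table select * from source_db.source_table;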
Created 06-15-2016 06:44 PM
@Zack Riesland, have you considered trying DistCp to copy the raw files from a source hdfs: URI to a destination s3n: or s3a: URI? That may move the data more quickly than the Hive insert into/select from. If you still need Hive metadata referencing the table at the s3n: or s3a: location, you can handle that by creating an external table over that location after the DistCp completes.
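For example, something roughly like the following. The warehouse path, bucket, and column list are just placeholders based on your example (I'm using s3n: here to match your existing table location; s3a: works the same way):

hadoop distcp hdfs:///apps/hive/warehouse/source_db.db/source_table s3n://my_bucket/my_folder_path/

hive> create external table aws.my_table
    > (
    >   `column1` string,
    >   ....
    >   `columnX` string)
    > row format delimited fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile
    > location 's3n://my_bucket/my_folder_path/';

Because the table is external, dropping it later leaves the copied files in S3 untouched.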
Created 06-20-2016 07:33 PM
Chris,
I added the fs.s3.buffer.dir property under "Custom hdfs-site" in the HDFS configs in Ambari, the same place where I added my AWS credentials (which are working).
But it doesn't appear to be sticking. I pointed the property at "/home/s3_temp", which I created on the edge node where I'm testing the DistCp tool, but I never see data in there, and my uploads continue to fail with the same errors as before.
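For reference, what I added corresponds roughly to this entry in hdfs-site.xml (the path is just the local directory I created on the edge node):

<property>
  <!-- local directory the s3/s3n filesystem uses to buffer files before uploading them to S3 -->
  <name>fs.s3.buffer.dir</name>
  <value>/home/s3_temp</value>
</property>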
Any ideas?
Created 06-16-2016 10:28 PM
Fantastic. Thanks!