Dear all, perhaps my understanding is incorrect but I'm trying to reload a table using sqoop import, that already exists in hive.
My expectation is that besides overwriting the actual data, also the table will be dropped and recreated. A table structure can of course change and as sqoop cannot detect changes unless you are using the incremental option, it should assume that this will happen - thus dropping and recreating the table with the new structure. Also a change in for instance a delimiter used to separate records or attributes can happen.
Currently the table is not recreated though. The ddl stays untouched and also the the comment when data has been loaded is not updated. Is this a bug or am I missing a command in my sqoop action?
My sqoop action looks like this:
<sqoop xmlns="uri:oozie:sqoop-action:0.4"> <arg>import</arg> <arg>--connect</arg> <arg>jdbc:sap://[server]:30015/?currentschema=[schema]</arg> <arg>--username</arg> <arg>[username]</arg> <arg>--password-file</arg> <arg>[pwd_location]</arg> <arg>--driver</arg> <arg>com.sap.db.jdbc.Driver</arg> <arg>--query</arg> <arg>SELECT * FROM "ZDLTABLES" WHERE $CONDITIONS</arg> <arg>--hive-import</arg> <arg>--hive-database</arg> <arg>[hive_schema]</arg> <arg>--hive-table</arg> <arg>zdltables</arg> <arg>--hive-delims-replacement</arg> <arg>\040</arg> <arg>--fields-terminated-by</arg> <arg>\037</arg> <arg>--hive-overwrite</arg> <arg>--compress</arg> <arg>--num-mappers</arg> <arg>1</arg> <name-node>[name_node]</name-node> <job-tracker>[job_tracker]</job-tracker> <property xmlns=""> <name>oozie.launcher.mapred.job.queue.name</name> <value>default</value> <source>programatically</source> </property> <property xmlns=""> <name>mapreduce.job.queuename</name> <value>default</value> <source>programatically</source> </property> </configuration> </sqoop>
If you have multiple Hive installations or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory.
Refer this guide as well:
Thanks rbiswas, but I'm not sure how this helps. Sqoop creates a table fine the first time. It just doesn't recreate the table with the new tabledefs and updated timestamp from when the data was loaded (as a table comment).
My question is if sqoop should recreate a table when using the --hive-overwrite function or if it is supposed to overwrite only the data.
This what will happen in the background, hence I asked about hive home.
If the Hive table already exists, you can specify the
--hive-overwrite option to indicate that existing table in hive must be replaced. After your data is imported into HDFS or this step is omitted, Sqoop will generate a Hive script containing a
CREATE TABLE operation defining your columns using Hive’s types, and a
LOAD DATA INPATHstatement to move the data files into Hive’s warehouse directory.
The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or
hive is not in your
$PATH, use the
--hive-home option to identify the Hive installation directory. Sqoop will use
$HIVE_HOME/bin/hive from here.
so if I have multiple hive installations (not sure how to check, but I can ask someone who should know this) than by adding the --hive-home to my sqoop script the table should be overwritten? So with new delimiter/columns and comment when the table was created?
Because the documentation only mentions that a create table operation is created but I'm missing the reference there that the table will be dropped first if it exists.
This is definitely a bug. Seems like the table definition doesn't change with --hive-overwrite flag afterall.