Hi! I'm new to Hadoop, I've just started to learn about its ecosystem and all the tools it has.
Currently, I'm writing a batch script to migrate a source database into Hive. I want it to copy as much data as possible, including tables that lack a primary key (such as n-to-n relation tables). I don't mind creating a new table with its own primary key in the process.
What would be the best procedure to do this? In case Sqoop and Hive are not the best tools for such a job, should I consider something else? I'd be grateful for any advice.
I'm assuming you're using Sqoop 1, since Sqoop 2 does not yet support importing into Hive. The primary key is used mainly to control how the import/export work is distributed: Sqoop splits the table into ranges of that key and assigns one range to each mapper. For a table without a primary key, you can specify the column to partition on via the --split-by argument.
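For example, an import of a keyless table into Hive might look like the sketch below. The connection string, credentials, and table/column names are all placeholders; substitute your own, but the flags themselves (`--split-by`, `--hive-import`, `--hive-table`) are standard Sqoop 1 options:

```shell
# Sketch only: connection details and names below are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table user_group \
  --split-by user_id \
  --hive-import \
  --hive-table user_group
```

Pick a `--split-by` column whose values are reasonably evenly distributed (an integer ID or timestamp works well); a badly skewed column leaves most mappers idle while one does all the work.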
Sorry for taking so long to reply. Indeed, I am using Sqoop 1. Once you explained what the primary key is used for in the import job, I realized Sqoop wasn't the tool I needed... or at least not on its own. The idea of my project is to absorb as much data as possible from the source without user intervention, so it can't rely on a manually chosen --split-by.
I worked around this by writing a Talend/bash job that automatically finds the best column to split each table by, then runs a Sqoop import for each table using that column.
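In case it helps someone else, the column-picking step can be sketched in plain bash. This is a simplified stand-in, not my actual Talend job: it reads `DESCRIBE`-style metadata (column name and type per line, fed here from a hard-coded example) and prefers the first integer column, falling back to the first column if none is found. A real job would pull the metadata from the source database and prefer indexed columns.

```shell
#!/usr/bin/env bash
# Sketch: choose a split column from "name type" lines on stdin.
# Prefers the first integer-typed column; falls back to the first column.
pick_split_column() {
  local fallback="" name type
  while read -r name type _; do
    [ -z "$fallback" ] && fallback="$name"
    case "$type" in
      int*|bigint*|smallint*|tinyint*)
        printf '%s\n' "$name"
        return
        ;;
    esac
  done
  printf '%s\n' "$fallback"
}

# Example metadata for an n-to-n relation table with no primary key.
describe_output="user_id int(11)
group_id int(11)
note varchar(255)"

split_col=$(printf '%s\n' "$describe_output" | pick_split_column)
echo "would run: sqoop import ... --split-by $split_col"
```

The loop over tables then just calls `sqoop import` once per table with the chosen `--split-by` value.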