One of customer has thousands of MS Access DBs. What is best way to migrate them to Hive/Hadoop?
Any tools/experience in the space?
I agree with Scott. Bringing data from Hadoop into Access is no big deal. We had an Access database use ODBC to retrieve data from Hadoop with effective results (fast enough).
BUT if you try to send data from Access into Hadoop - painfully slow. Row by Row. Sending it off to Sql Server requires very little code in Access, and then you have a durable store of your data. Sending it from SQL Server to Hadoop is effective and fast.
Here's a simple sample of a Sqoop script to send data from SQL Server to Hadoop. It reads table Customer from SQL Server DB SQLTestDB and copies it into Hadoop database TestDB tablename Customer. It overwrites any existing data in the Hive table. It also uses 1 mapper.
sqoop import --connect "jdbc:sqlserver://<IP Address>:1433;database=SQLTestDB" \ --username root \ --password hadoop \ --table Customer \ --hive-import --hive-overwrite \ --hive-table TestDB.Customer \ -m 1
If you can work your way through SQL Server and sqoop, I agree that's probably the cleanest option. If you're looking for something that you can automate entirely on the cluster or don't have the luxury of pushing the data through SQL Server, then here's another option.
There's a very simple Open Source toolset called mdbtools that makes it really easy to extract metadata and data from MS Access databases. In a series of about 10 lines of a shell script, you can get a list of tables in the mdb, dump the data out to text files, import those to HDFS, and wrap a generic Hive schema around the files.
Since you're going through intermediate text files, you might not be able to support some character sets and could run into an issues or two with file formats that can be cleaned up with a secondary sed or perl script. If you don't want to go through SQL Server to get the data transferred over, though, this might be a good solution for you.