I am trying to move a SQL Server table to Hive using a Spark program. The table in SQL Server is around 100 GB, so it takes a long time (around 2 hours) for the table to be created in Hive.
Is there any way to compress the data while/before fetching it from SQL Server and decompress it once it reaches Hive? That way the complete 100 GB would not need to be transferred; only the compressed data (probably 30-40 GB) would cross the wire, and the job might run faster.
Thanks in advance.
As far as I know, you will not be able to compress "on-wire" with MS SQL Server.
This will not directly answer your question, but there are a few tracks to investigate:
- Use Sqoop with incremental ingestion. I managed to ingest 600 GB in under 20 minutes (be careful with the number of mappers; Sqoop can be very hard on the SQL Server). Sqoop can write straight into Hive.
- Assuming this is a one-off bulk load, you can try to BCP the data to disk on the SQL Server side, compress the BCP files, transfer them to HDFS, then decompress and ingest.
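The Sqoop approach above could look roughly like this. This is a minimal sketch, not a tested command line: the host, database, table, split column, and password-file path are all hypothetical placeholders you would replace with your own.

```shell
# Hypothetical connection details -- replace host, database, credentials,
# table name and split column with your own.
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=mydb" \
  --username myuser \
  --password-file /user/me/sqlserver.pwd \
  --table big_table \
  --split-by id \
  --num-mappers 8 \
  --hive-import \
  --hive-table default.big_table \
  --incremental append \
  --check-column id \
  --last-value 0
```

Keep `--num-mappers` modest at first: each mapper opens its own connection and runs its own range query against SQL Server, which is where the "hard on the SQL Server" warning comes from.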
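The BCP route could be sketched as below. Again this is illustrative only: server name, paths, and table names are made up, and you would run the first steps on (or near) the SQL Server host and the last step on a cluster edge node.

```shell
# 1. Export the table as delimited text on the SQL Server side
#    (-c = character mode, -t = field terminator). Credentials elided.
bcp mydb.dbo.big_table out /data/big_table.bcp -c -t '\t' \
    -S sqlhost -U myuser

# 2. Compress before the data crosses the network
gzip /data/big_table.bcp

# 3. After copying big_table.bcp.gz to an edge node (scp, etc.),
#    load it into HDFS
hdfs dfs -put /data/big_table.bcp.gz /staging/big_table/
```

Note that gzipped delimited text can be read by a Hive external TEXTFILE table directly, so an explicit decompress step may not even be needed before ingestion.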
Hi @Sagar Morakhia,
Dump the data as-is, without performing any operations on it, using Sqoop (SequenceFiles with unique field separators), using --split-by and the maximum number of mappers you can allocate. This lands the data in multiple files as quickly as possible.
Then create an external table on top of this SequenceFile directory and insert from it into whatever target format you want.
For a 100 GB volume you may need to arrange roughly the same amount of RAM, and any data skew will cause Spark task failures and retries that slow the job down.
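The two steps above can be sketched as follows. This is a rough outline under assumed names: the connection string, table, split column, HDFS paths, and column list are hypothetical, and the column list in the DDL is deliberately left elided since it depends on your schema.

```shell
# Step 1: raw dump to SequenceFiles with an uncommon field separator,
# spread across as many mappers as the source can tolerate.
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=mydb" \
  --username myuser \
  --password-file /user/me/sqlserver.pwd \
  --table big_table \
  --split-by id \
  --num-mappers 16 \
  --as-sequencefile \
  --fields-terminated-by '\001' \
  --target-dir /staging/big_table_seq

# Step 2: external table over the dump, then convert into the target format.
hive -e "
CREATE EXTERNAL TABLE staging_big_table ( ... )   -- column list elided
STORED AS SEQUENCEFILE
LOCATION '/staging/big_table_seq';
INSERT INTO TABLE target_big_table SELECT * FROM staging_big_table;
"
```

The point of the split is that the slow part (pulling from the RDBMS) is done with no transformation at all; the format conversion then happens entirely inside the cluster.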
As a side note,
Presto is also well suited to large-volume data transfers, since it is designed to run on a distributed cluster and to query RDBMS sources directly.
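With Presto, the transfer could in principle be a single federated query, assuming a coordinator with a SQL Server connector catalog and a Hive catalog already configured. The server address, catalog names, and table identifiers below are all assumptions, not values from this thread.

```shell
# Assumes catalogs named 'sqlserver' and 'hive' are configured on the
# Presto coordinator; all identifiers here are hypothetical.
presto --server presto-coordinator:8080 --execute "
  INSERT INTO hive.default.big_table
  SELECT * FROM sqlserver.dbo.big_table
"
```

Because Presto streams rows between the connectors, there is no intermediate dump on disk, though the data still crosses the wire uncompressed from SQL Server's side.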