
SqlServer to Hive using Spark - Long running jobs

Contributor

Hi All,

I am trying to move a SQL Server table to Hive using a Spark program. The table in SQL Server is around 100 GB.

Hence, it takes a long time (around 2 hours) for the table to be created in Hive.

Is there any way to compress the data while/before fetching the table from SQL Server and decompress it once it reaches Hive? That way, the full 100 GB would not need to be transferred; only the compressed data (probably 30-40 GB) would go over the wire, and the job might run faster.
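
For reference, the job is roughly along the lines of the sketch below (simplified; the JDBC URL, credentials, table, and partition column are placeholders, not the real ones):

```
# Simplified sketch of the job; connection details, table and column
# names below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sqlserver-to-hive")
         .enableHiveSupport()
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://sqlserver-host:1433;databaseName=mydb")
      .option("dbtable", "dbo.big_table")
      .option("user", "etl_user")
      .option("password", "***")
      # parallel read settings
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")
      .load())

df.write.mode("overwrite").saveAsTable("default.big_table")
```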

Thanks in Advance.

Regards,

Sagar

6 REPLIES

Re: SqlServer to Hive using Spark - Long running jobs

Rising Star

Hi,

As far as I know, you will not be able to compress "on-wire" with MS SQL.

This will not directly answer your question, but there are a few tracks to investigate:

- Use Sqoop with incremental ingestion. I managed to ingest 600 GB in under 20 minutes (be careful with the number of threads; Sqoop can be very hard on the SQL Server). Sqoop can write straight to Hive (see the sketch after this list).

- Assuming this is a bulk load, you can try to BCP the data to disk on the SQL Server side, compress the BCP output, transfer it to HDFS, then decompress and ingest it.
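
If you try the Sqoop route, a minimal import would look something like the sketch below. The connection string, credentials, table, and split column are placeholders, and --num-mappers should stay modest so you don't hammer the SQL Server.

```
# Hypothetical Sqoop import; host, database, table and column names are placeholders.
sqoop import \
  --connect "jdbc:sqlserver://sqlserver-host:1433;databaseName=mydb" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.pwd \
  --table big_table \
  --split-by id \
  --num-mappers 8 \
  --hive-import \
  --hive-table default.big_table \
  --incremental append \
  --check-column id \
  --last-value 0
```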

Re: SqlServer to Hive using Spark - Long running jobs

Contributor

Thanks @Christophe Vico, I'll try the first option and update you.

Re: SqlServer to Hive using Spark - Long running jobs

Super Guru

Save from Spark to Hive using ORC file with compression enabled.
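
For example, something along these lines, assuming df is the DataFrame read from SQL Server (the codec and table name are just examples; zlib compresses harder, snappy is faster):

```
# Write the DataFrame out as ORC with compression enabled.
df.write \
    .mode("overwrite") \
    .format("orc") \
    .option("compression", "zlib") \
    .saveAsTable("default.big_table_orc")
```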

Re: SqlServer to Hive using Spark - Long running jobs

Contributor

@Timothy Spann That's something I am already doing, but my problem is with the transfer speed.

Re: SqlServer to Hive using Spark - Long running jobs

Super Collaborator

Hi @Sagar Morakhia,

Dump the data as-is, without performing any operations on it, using Sqoop (e.g. SequenceFiles or delimited files with a unique field separator), with --split-by and the maximum number of mappers you can allocate - this brings the data over as multiple files as quickly as possible.

Create an external table on top of that staging directory and insert from it into the target format you want (see the sketch below).
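
For the second step, the staging and conversion could look roughly like this in Hive. The column list, separator, location, and table names are placeholders, and this sketch assumes the Sqoop dump was written as delimited text:

```
-- Hypothetical staging table over the Sqoop output directory; columns,
-- separator and location are placeholders (assumes a delimited dump).
CREATE EXTERNAL TABLE staging_big_table (
  id    BIGINT,
  col1  STRING,
  col2  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/staging/big_table';

-- Convert into the target format (e.g. an ORC table) in one pass.
INSERT OVERWRITE TABLE big_table_orc
SELECT * FROM staging_big_table;
```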

For such 100 GB volumes in Spark you may need to arrange a comparable amount of RAM, and any skew will cause Spark failures and retries that slow down performance.

On a side note,

Presto is also well suited for large data transfers, as it is designed as a distributed query engine that can work directly against an RDBMS.

Re: SqlServer to Hive using Spark - Long running jobs

Contributor

@bkosaraju Will try this and update you. Thanks :)