
Comparing Hive tables with Spark

Expert Contributor

I have a request where I need to compare two Hive tables and get the unique records from each table. The tables are in different clusters (Staging and Prod). I am using Beeline to query both tables, storing the key-column values into two separate files (one per table), and then running some Unix commands to compare the files and get the unique records.

Here is the caveat: it works perfectly with a small set of data in the testing cluster, but when implemented in Staging and Prod the job hangs and I cannot get the result. Both tables have 50 GB of data. When I run the application through a shell script it takes a very long time, and the PuTTY screen freezes after a while even though the job finishes in the background; still, I can't see the output in the file. I have also set the buffer size to 8 GB.

I can't use join operations because I have to compare two tables that are in two different clusters.

Can I use Spark to solve my problem? If so, how? Or is there some other way to fix the problem I am facing?

1 ACCEPTED SOLUTION

Guru

To solve a problem like this you likely need to distribute the workload, so running the difference script on a single node is, as you found out, not going to work.

The best way to do this would be to run the job on one or both of the clusters; there are, however, some challenges here.

The big problem is that to compare the data you need to move it from one cluster to the other. You could do this by first moving the data set, perhaps from staging to prod, and then running the comparison as a job on the prod cluster. However, this may not be bandwidth-efficient. If you expect relatively few differences between the data sets, a more efficient method is to emulate what rsync does: write a job on both clusters that produces a hash of each row, or of some set of blocks in the data set. You can then use the hashes as an index for the rows and move only the hashes from one cluster to the other for comparison. Once you've done the hash comparison, you can use the result to filter the heavier data transfer. Note that this technique is not guaranteed to be perfect because of the risk of hash collisions, so you may want to choose a wide hash function; the probability of failure is, however, very low.
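For illustration, a minimal HiveQL/Spark SQL sketch of the hashing step; the table and column names (my_table, id, col1..col3) are hypothetical, and the same statement would be run on each cluster:

-- md5() is a built-in in Hive 1.3+ and in Spark SQL;
-- concat_ws() flattens the compared columns into one string per row.
CREATE TABLE my_table_hashes AS
SELECT
  id,                                   -- row key, kept in the clear
  md5(concat_ws('|',
        CAST(col1 AS STRING),
        CAST(col2 AS STRING),
        CAST(col3 AS STRING))) AS row_hash
FROM my_table;

The resulting (id, row_hash) table is small relative to the full rows, so it is cheap to copy across clusters, for example with DistCp.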

The best way to do this is probably to produce the hashes with either Spark or Hive, transfer them to one of the clusters, and then, again with Spark or Hive, figure out which rows are worth transferring. Both Spark and Hive are good tools for solving this problem.
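Once one side's hash table has been copied over, finding the keys worth transferring is an ordinary outer join; again a sketch with hypothetical table names (staging_hashes, prod_hashes):

-- Keys present in staging but missing or different in prod.
SELECT s.id
FROM staging_hashes s
LEFT JOIN prod_hashes p ON s.id = p.id
WHERE p.id IS NULL
   OR s.row_hash <> p.row_hash;

Running the same query with the table roles swapped gives the keys unique to the other cluster.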


4 REPLIES


Expert Contributor

Thank you. I was unaware of rsync. Could you also give me a brief intro to generating hashes from Hive tables? Any link to a tutorial?

Super Collaborator

You could use Spark's data federation capabilities. E.g., in Cluster-1 you have a Hive table for the data in Cluster-1, and in Spark (running on Cluster-1) you can set up a temporary table that refers to a table in Cluster-2.

For instance:

CREATE TEMPORARY TABLE jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:<URL for Cluster-2's Spark thrift server>",
  dbtable "schema.tablename"
)

So in Cluster-1's spark-sql session, create the TEMPORARY table (pointing to Cluster-2) and join it with the Hive table in Cluster-1.

This can be done via spark-sql or via the APIs. More info here: http://spark.apache.org/docs/1.6.1/sql-programming-guide.html#jdbc-to-other-databases
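With the temporary table in place, the unique records can be pulled out with a left join in the same spark-sql session; a sketch assuming a hypothetical key column id and a local Hive table named local_table:

-- Records in Cluster-1's Hive table with no matching key in Cluster-2.
SELECT l.*
FROM local_table l
LEFT JOIN jdbcTable r ON l.id = r.id
WHERE r.id IS NULL;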

Expert Contributor

Thanks for the reply. Let me try this; I will come back with an update.