
Any use case for using HDF for importing data from relational databases?


Some pointers on the performance comparison with Sqoop would help.

Also, what could we leverage for data governance and provenance if we use HDF over Sqoop in such scenarios?

Example databases could be MySQL, Oracle, SQL Server, etc.

Synopsis: What would induce me to choose HDF over Sqoop for medium-sized (a few terabytes) relational databases?

Thanks in advance.

1 ACCEPTED SOLUTION

Expert Contributor

The NiFi documentation seems to indicate transfer rates of around 50-100 MB/s:

https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.0/bk_Overview/content/performance-expectation...

NiFi is useful when data needs to be extracted frequently from several source databases, as it helps with monitoring and workflow maintenance. If some of that data needs to be routed to different tables based on a column's value, for instance, NiFi is a good choice, as Sqoop won't support this by default.
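
As a rough sketch (the processor choices and the "region" column are only assumptions for illustration), such a routing flow in NiFi could look something like:

ExecuteSQL (pull rows from the source database as Avro)
  -> ConvertAvroToJSON
  -> EvaluateJsonPath (copy the "region" field into a flowfile attribute)
  -> RouteOnAttribute with dynamic properties such as
       eu_rows : ${region:equals('EU')}
       na_rows : ${region:equals('NA')}
  -> each route wired to its own PutSQL / PutHDFS destination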

NiFi is also a good choice if data needs to be moved to multiple destinations - for example, landing data in HDFS while sending part of it to Kafka, Storm, or Spark.
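
For example (again only a sketch with assumed processor names; PublishKafka is PutKafka on older NiFi releases), a single success relationship can be connected to several downstream processors and NiFi delivers a copy of the data to each:

QueryDatabaseTable (incremental pull from the source table)
  -> PutHDFS        (land the full extract in HDFS)
  -> PublishKafka   (push the same records to a Kafka topic for Storm/Spark to consume)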

NiFi can schedule these flows easily (timer- or cron-driven per processor), while with Sqoop the scheduling has to be set up externally via crontab, Control-M, etc.
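
To make that concrete (the schedule and the Sqoop job details below are placeholders): the source processor in NiFi can simply be set to "CRON driven" on its Scheduling tab, e.g. 0 0 2 * * ? for a daily 2 AM run, whereas the equivalent Sqoop load needs an external entry such as

0 2 * * * /usr/bin/sqoop import --options-file /etc/etl/orders_import.txt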

Sqoop can use Hadoop mappers for fault tolerance and parallelism, and may achieve better transfer rates. If deduplication or similar processing is needed, NiFi is the better choice for smaller data sizes. For large table loads, Sqoop is a good choice.
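
As a sketch of how that parallelism is controlled (connection string, credentials, table and column names are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/raw/orders

Each of the 8 mappers pulls its own slice of the table based on the --split-by column, and failed map tasks are retried by the MapReduce framework, which is where the fault tolerance and throughput advantage for large tables comes from.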


2 REPLIES

Thank you, nice insights.