
Which tool do we use to copy structured and unstructured data into Hadoop?


New Contributor
 
2 REPLIES

Re: Which tool do we use to copy structured and unstructured data into Hadoop?

Super Mentor

@pavan p

There is a good discussion available on moving "Unstructured Data" to Hadoop; it gives good references:
https://www.quora.com/How-do-I-import-unstructured-data-to-Hadoop


- Sqoop is mostly used to extract structured data from databases such as Teradata, Oracle, etc. A good overview of the tool can be found at: https://hortonworks.com/apache/sqoop/

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
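For illustration, a minimal Sqoop import might look like the sketch below (the JDBC URL, credentials, table name, and target directory are hypothetical placeholders; adjust them for your environment):

    # Pull the "customers" table from MySQL into HDFS with 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user \
      --password-file /user/sqoop/.password \
      --table customers \
      --target-dir /data/raw/customers \
      --num-mappers 4

Adding --hive-import to the same command would load the rows into a Hive table instead of leaving them as files in HDFS.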


- Flume is used to ingest data from a variety of sources, such as log files and event streams, and deals mostly with unstructured data.

A good overview of the tool can be found at: https://hortonworks.com/apache/flume/

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
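As a rough sketch (the agent name, directories, and capacity below are hypothetical), a Flume agent is wired together in a properties file with a source, a channel, and a sink:

    # Components of agent "a1"
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: pick up files dropped into a local spooling directory
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/incoming
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events into date-partitioned HDFS directories
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /data/raw/logs/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1

The agent would then be started with something like: flume-ng agent --conf-file a1.properties --name a1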


Re: Which tool do we use to copy structured and unstructured data into Hadoop?

@pavan p

It's a broad question. Apart from structured vs. unstructured, you may have to look at other parameters, such as the frequency of ingestion, the size of the files, whether it is an event or batch load, and where you are picking the data up from.

In general,

Sqoop (structured) --> used to import RDBMS data into HDFS/Hive

Flume/Kafka/NiFi (unstructured) --> can be used to capture unstructured data into HDFS.
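As a quick illustration of the Kafka route (the broker address and topic name are hypothetical, and older Kafka versions use --zookeeper / --broker-list instead of --bootstrap-server), you can create a topic and publish test events from the console:

    # Create a topic, then type test events into it from stdin
    kafka-topics.sh --create --topic raw-events --bootstrap-server localhost:9092
    kafka-console-producer.sh --topic raw-events --bootstrap-server localhost:9092

Landing those events in HDFS is then a separate step, e.g. via a Flume agent with a Kafka source or a Kafka Connect HDFS sink connector.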

Choosing the tool depends on many other parameters apart from what I have mentioned above. Each tool has its own pros & cons. You may have to dig deeper if this is for more than learning purposes. Hope it helps!!
