
Hive, Sqoop, Hue and others

Rising Star

In my research I keep encountering the tools Hive, Hue, and Sqoop, and each one has its own installation requirements: a specific operating system version, a specific Hadoop version, and a specific environment to work with, like Cloudera or Ambari.

But I still don't understand the relationship between these tools. Is Hive part of Hue, or can it be a standalone tool for importing and exporting data to SQL Server? Is Sqoop another tool for data processing, and can it be used standalone?

I would like someone to explain the Hadoop infrastructure and the relationship between these tools, which I can't find explained directly by searching Google. What is the best tool for importing and exporting data to SQL Server?

1 ACCEPTED SOLUTION

Super Guru

@oula.alshiekh@gmail.com alshiekh

MapReduce: When Hadoop was first created, the only way to read and write data in Hadoop was to write a MapReduce job. I highly recommend you read the Google MapReduce paper or go over the slides here.

One limitation of MapReduce is that the API is in Java, which means you have to be a Java programmer. That limits the platform significantly. What if you want analysts who can only write SQL to be able to query data in Hadoop? That's where Hive comes in. It was created so people can run SQL on their data in Hadoop. See below for details on Hive.

Hive: A tool that enables you to run SQL on top of your tabular data in Hadoop. Imagine you have a CSV, tab-delimited, or similar file in Hadoop. Now you want to read the data in this file. You can of course "cat" the file as you would on Linux. But what if you want to fetch only a few rows? What if you have hundreds of such files and you want to combine data from them and get results the way you would from a traditional RDBMS? Hive enables you to run SQL on your data in Hadoop.

Assume you have a file called "fileA" which has data in this format:

col1,col2,col3,....coln

With Hive you can create a table by specifying the location of this file and read the data using SQL. Notice that, unlike a traditional RDBMS, you already had the file and the data before you created the table. After creating the table, you can of course bring in more data by either appending to the same file or creating new files with the same structure in the same directory as "fileA" above.
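
For example, a minimal sketch of the Hive DDL (the table name and the HDFS directory /data/filea are hypothetical, and only three columns are shown) might look like this:

CREATE EXTERNAL TABLE filea_table (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/filea';    -- the table points at the existing directory; the data is not moved

SELECT col1, col2 FROM filea_table LIMIT 10;    -- fetch only a few rows with plain SQL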

SQOOP: Name comes from "SQL" (SQ) and "Hadoop" (OOP). When companies started using Hadoop and bringing data from traditional databases into Hadoop, there was a need for a tool that helps import data from databases like Teradata, Netezza, SQL Server, Oracle and so on. Sqoop provides a command-line mechanism to import data into and export data from Hadoop to a traditional database. It uses the JDBC drivers provided by the database you are importing data from or exporting data to.
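
As a hedged illustration (the connection string, credentials, table names, and HDFS paths below are made up), a SQL Server import and export from the command line might look like:

sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username myuser --password-file /user/me/.dbpass \
  --table customers \
  --target-dir /data/customers \
  -m 4                                   # copy the table into HDFS using 4 parallel map tasks

sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username myuser --password-file /user/me/.dbpass \
  --table customer_summary \
  --export-dir /data/customer_summary    # push HDFS files back into a SQL Server table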

HUE: "Hadoop User Experience". This tool provides a nice GUI where you can run Hive queries or even Sqoop commands. It enables you to save your work and come back later. It is basically a development tool. Think of how Eclipse helps you write Java programs; in the same way, Hue helps you write Hadoop scripts, for example Hive or Pig scripts.

Pig: A scripting language to work with your data in Hadoop. Hive enables you to write SQL. But what if you want to write something similar to PL/SQL? Well, you can use Pig for that.
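
As a rough sketch (the file path and column names are invented), the same kind of filtering could be written step by step in a Pig Latin script and run with "pig script.pig":

-- script.pig: load the comma-delimited file and keep only the rows where col3 > 10
A = LOAD '/data/filea' USING PigStorage(',') AS (col1:chararray, col2:chararray, col3:int);
B = FILTER A BY col3 > 10;
DUMP B;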

Spark: If you read the MapReduce paper, you'll understand how it works. In short, it reads data from disk (lots of data, on hundreds of machines in parallel) and then, for the intermediate steps (mapper output, shuffle and sort, and finally the reducer output, which is the output of the whole job), it writes data back to disk. What this means is that you go to disk up to six times: read the input from disk, write the mapper output to disk, have shuffle/sort read the mapper output and write the sorted data back to disk, and have the reducers read the sorted data and write the final output back.

What Spark enables is that, unlike MapReduce, once the data is read from disk it stays in memory. The results of the intermediate steps also stay in memory. This in-memory processing significantly improves the performance of your jobs over MapReduce.
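
A minimal sketch of that idea in the Scala spark-shell (the path and the filter are hypothetical; cache() is what keeps the intermediate result in memory):

// read once from disk
val lines = sc.textFile("hdfs:///data/filea")
// mark the intermediate result to be kept in memory once it is computed
val bigRows = lines.filter(_.split(",")(2).toInt > 10).cache()
bigRows.count()    // first action: reads from disk and populates the cache
bigRows.first()    // second action: served from memory, no second read from disk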

HBase: (H)adoop Data(Base). Based on Google's Bigtable, it provides low-latency (single to low double-digit millisecond), high-throughput read/write access to your data in Hadoop. Thousands of records per second can be read from and written to HBase. It is massively scalable; last I knew, Facebook Messenger was powered by HBase. In 2010, 350 million users were sending around 15 billion messages per month, all powered by HBase. So if you need a fast, highly scalable and reliable technology for your system, HBase is the tool you are looking for. Also check this page.
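
As a quick hedged sketch (the table, column family, and row keys are invented), the HBase shell gives you key-based reads and writes rather than SQL:

# from the hbase shell prompt
create 'messages', 'd'                              # table with a single column family 'd'
put 'messages', 'user1-msg001', 'd:body', 'hello'   # write one cell, keyed by row key
get 'messages', 'user1-msg001'                      # low-latency read of a single row
scan 'messages', {LIMIT => 10}                      # scan the first 10 rows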


4 REPLIES


Rising Star

Thanks, that helped me a lot.

Expert Contributor

Check this out; it also gives some basic information about which tool is used for what purpose:

http://hortonworks.com/products/data-center/hdp/

Expert Contributor