My experience: I am a month old in the Hadoop world. I have dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and I have read Google's papers on MapReduce and GFS (PDF link).
I understand that-
Thanks and Regards,
You may take an interest in this community article:
I hope it helps.
To put it simply:
Hive - does not verify the data when it is loaded, but rather when a query is issued; this is known as schema on read. The initial load of data is therefore fast compared to schema on write, i.e. traditional database systems.
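To make schema on read a bit more concrete, here is a minimal Python sketch of the idea. It is not Hive code; the names RAW_DATA, load and query_ages are made up for illustration, and the sample rows are invented. Loading just stores the text untouched, and the schema (id, name, age) is only applied when a query runs. A value that does not fit the declared type comes back as None, roughly the way Hive returns NULL for such a field in a text-backed table.

```python
import csv
import io

# Hypothetical raw file contents; the second row has a bad age value.
RAW_DATA = "1,alice,34\n2,bob,not_a_number\n3,carol,29\n"


def load(raw_text):
    """Schema on read: loading does no parsing or validation at all."""
    return raw_text


def query_ages(stored_text):
    """Apply the assumed schema (id:int, name:str, age:int) at query time."""
    ages = []
    for row in csv.reader(io.StringIO(stored_text)):
        try:
            ages.append(int(row[2]))
        except (ValueError, IndexError):
            # The bad value only surfaces now, when the data is read.
            ages.append(None)
    return ages


stored = load(RAW_DATA)      # fast: nothing is checked here
print(query_ages(stored))    # [34, None, 29]
```

A schema-on-write system would have rejected the second row at load time instead, which is where the extra load cost comes from.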
Pig - This is more of a data flow programming language: you have the freedom to specify how you want your data to be transformed, step by step, based on the input relation. You also define your schema at runtime.
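As a rough illustration of the data flow style, the Python sketch below mimics a typical sequence of Pig steps (FILTER, then GROUP BY, then FOREACH ... GENERATE) on an in-memory list. This is plain Python, not Pig Latin, and the records and field names are invented for the example. The point is only that each step names an intermediate relation and transforms the previous one.

```python
from collections import defaultdict

# Hypothetical input relation: (user, category, amount)
records = [
    ("alice", "books", 12.0),
    ("bob", "games", 30.0),
    ("alice", "games", 8.0),
]

# Step 1 (like FILTER): keep purchases of at least 10.0
big = [r for r in records if r[2] >= 10.0]

# Step 2 (like GROUP ... BY user): collect amounts per user
grouped = defaultdict(list)
for user, _category, amount in big:
    grouped[user].append(amount)

# Step 3 (like FOREACH ... GENERATE): one total per user
totals = {user: sum(amounts) for user, amounts in grouped.items()}

print(totals)  # {'alice': 12.0, 'bob': 30.0}
```

In SQL you would describe the result you want in one statement; here you spell out the steps that produce it.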
Hope this suffices.
Thank you for getting involved in big data.
These are both just tools to shield people from having to be Java programmers writing raw MapReduce applications. While Pig may be more expressive in some ways than SQL, far more people know SQL in the IT industry than Pig Latin, so it has certainly had much more uptake. More importantly, Hive fits very nicely into many existing workloads that used to be run on traditional databases.
The storage mechanism underneath the processing is not what defines these products. They are simply translation tools that take something humans can understand and turn it into a MapReduce application. The storage mechanisms are changing all the time. For example, Hive can generate MapReduce applications that run against a local file system, HDFS, or Amazon S3 buckets.
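For anyone wondering what "turn it into a MapReduce application" means in practice, here is a conceptual Python sketch of how a query such as SELECT dept, COUNT(*) ... GROUP BY dept breaks down into map, shuffle and reduce steps. It is a toy model that runs in one process, not what Hive actually generates, and the function names and sample rows are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (key, 1) for every input row, keyed by department."""
    for dept, _name in rows:
        yield dept, 1


def shuffle(pairs):
    """Shuffle: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input rows: (dept, employee_name)
rows = [("sales", "ann"), ("eng", "bo"), ("sales", "cy")]

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'sales': 2, 'eng': 1}
```

Notice that nothing in these three steps cares where the rows came from, which is why the storage layer can be swapped out underneath.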
Exactly. That's the ticket.
Moving forward, there is Spark which is changing how we do big data processing. CDH Hive can currently be configured to generate Spark applications instead of MapReduce. There is some work going on now in the Pig ecosystem to do the same.
...And just to be confusing, Spark programmers can submit SQL statements directly to the Spark framework, in their code, that drives their application without having any contact with Hive.
If you're interested in the SQL arena, also be sure to check out Apache Impala. It is a high-speed compute engine that accepts SQL statements. It serves the same role as Hive (SQL on Big Data), but there are obviously some trade-offs and overhead involved in achieving that speed, so it is not currently a drop-in replacement.