Member since: 09-25-2015
Posts: 22
Kudos Received: 9
Solution: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3213 | 06-12-2018 09:26 PM |
07-28-2020 11:31 PM
"I highly recommend skimming quickly over following slides, specially starting from slide 7. http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey" This slide is not there at the path
08-28-2018 06:42 PM
@snukavarapu Thanks for the article, this worked great for me. Is this something you keep continuously updated? If so, what's your strategy for keeping the table updated?
09-21-2017 01:35 PM
2 Kudos
@Riddhi Sam
First of all, Spark is not faster than Hadoop. Hadoop is a distributed file system (HDFS), while Spark is a compute engine that runs on top of Hadoop or your local file system. Spark, however, is faster than MapReduce, which was the first compute engine created alongside HDFS. When Hadoop was created there were only two pieces: HDFS, where data is stored, and MapReduce, the only compute engine on HDFS. To understand why Spark is faster than MapReduce, you need to understand how each of them works.

When a MapReduce job starts, the first step is to read data from disk and run the mappers. The mapper output is written back to disk. The shuffle-and-sort step then reads the mapper output from disk and, when it completes, writes its result back to disk (there is also some network traffic when the keys for the reduce step are gathered on the same node, but that applies to Spark as well, so let's focus on the disk steps). Finally, the reduce step reads the shuffle-and-sort output and writes the final result to HDFS. That is six disk accesses to complete the job, and most Hadoop clusters have 7200 RPM disks, which are ridiculously slow.

Now, here is how Spark works. Where a MapReduce job needs mappers and reducers, a Spark job is built from two kinds of operations: transformations and actions. When you write a Spark job, it consists of a number of transformations and a few actions. When the job starts, Spark creates a DAG (directed acyclic graph) of the steps it is supposed to run. Suppose the first five steps are transformations: Spark remembers them in the DAG but doesn't go to disk to perform them. Then it encounters an action. At that point the job goes to disk, performs the first transformation, keeps the result in memory, performs the second transformation, keeps that result in memory, and so on until all the steps complete. The only time it goes back to disk is to write the output of the job. That's two disk accesses, and this is what makes Spark faster.

There are other things in Spark that make it faster than MapReduce. For example, its rich API lets you accomplish in one Spark job what might require two or more MapReduce jobs chained one after the other; imagine how slow that would be. There are cases where Spark will spill to disk because of the amount of data and will be slow, but it may or may not be as slow as MapReduce, thanks to that richer API.
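To make the laziness concrete, here is a minimal Scala sketch (the input and output paths are hypothetical) of a Spark job whose transformations are only recorded in the DAG until a single action triggers the actual read and computation:

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-eval-demo")
      .master("local[*]")               // local run, just for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path.
    val lines = sc.textFile("/data/input/logs.txt")

    // Transformations: nothing is read from disk yet; Spark only
    // records these steps in the DAG.
    val errors = lines.filter(_.contains("ERROR"))
    val fields = errors.map(_.split("\t"))
    val codes  = fields.map(f => (f(0), 1))
    val counts = codes.reduceByKey(_ + _)

    // Action: the DAG is executed now. Data is read once from disk,
    // flows through the chained transformations in memory, and only
    // the final result is written out.
    counts.saveAsTextFile("/data/output/error-counts")

    spark.stop()
  }
}
```

Until `saveAsTextFile` runs, no data is read; each `filter`, `map`, and `reduceByKey` call just extends the DAG, which is why the intermediate results never hit disk the way mapper and shuffle output do in MapReduce.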
11-18-2015 04:30 PM
Smaller blocks take up more space in the NameNode tables, so in a large cluster small blocks come at a price. What small block sizes can do is allow more workers to get at the data (half the block size == twice the bandwidth), but it also means that code working with more than 128 MB of data isn't going to get all of it local to one machine, so more network traffic may occur. And for apps that spin up fast, you may find that 128 MB blocks are streamed through quickly enough that the overhead of scheduling containers and starting up the JVMs outweighs the extra bandwidth opportunities. So the notion of an "optimal size" isn't really so clear cut. If you've got a big cluster and you are running out of NameNode heap space, you're going to want a bigger block size whether or not your code likes it. Otherwise, it depends on your data and the uses made of it. As an experiment, save copies of the same data with different block sizes, then see which is faster to query.
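One way to run that experiment, sketched here in Scala against the Hadoop FileSystem API (the paths and block sizes are illustrative assumptions), is to copy the same file into HDFS twice with different block sizes and then benchmark queries against each copy:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object BlockSizeExperiment {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Hypothetical source file; adjust to your cluster.
    val src = new Path("/data/events/sample.csv")

    // Write the same bytes twice, once per block size.
    for ((label, blockSize) <- Seq("64m" -> 64L * 1024 * 1024,
                                   "256m" -> 256L * 1024 * 1024)) {
      val dst = new Path(s"/tmp/blocksize-test/sample-$label.csv")
      val in  = fs.open(src)
      // create(path, overwrite, bufferSize, replication, blockSize)
      val out = fs.create(dst, true, 4096, fs.getDefaultReplication(dst), blockSize)
      IOUtils.copyBytes(in, out, 4096, true) // closes both streams
    }
    fs.close()
  }
}
```

Setting `dfs.blocksize` in the client configuration before writing would achieve the same thing; the point is simply to produce otherwise identical copies whose only difference is block size.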
06-30-2016 07:55 AM
I have the same problem: our Hive users (configured with doAs=false and security in Ranger) create lots of external tables to map their data, but Hive is unable to access that data by default. We have to grant explicit permissions for the hive user to read the HDFS data of each external table, which is very cumbersome. I don't see any best practice regarding external tables in the document you referenced. Do you have any advice on how to handle external tables in such a case? Thanks!
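For reference, the kind of explicit grant described above can be scripted. Below is a minimal Scala sketch using the Hadoop ACL API (the table path is hypothetical); it is a workaround for the doAs=false setup, not a best practice:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.{AclEntry, AclEntryScope, AclEntryType, FsAction}
import scala.collection.JavaConverters._

object GrantHiveRead {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Hypothetical external-table location.
    val tableDir = new Path("/data/landing/my_external_table")

    // Give the 'hive' service user read/execute on the directory, plus a
    // default entry so files created later inherit it. Note this is not
    // recursive: existing files underneath need the same treatment.
    val entries = List(
      new AclEntry.Builder()
        .setScope(AclEntryScope.ACCESS)
        .setType(AclEntryType.USER)
        .setName("hive")
        .setPermission(FsAction.READ_EXECUTE)
        .build(),
      new AclEntry.Builder()
        .setScope(AclEntryScope.DEFAULT)
        .setType(AclEntryType.USER)
        .setName("hive")
        .setPermission(FsAction.READ_EXECUTE)
        .build()
    ).asJava

    fs.modifyAclEntries(tableDir, entries)
    fs.close()
  }
}
```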
10-12-2015 01:56 PM
Joe, please don't mention customer names, as we discussed. Please reword your response to remove them.