Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5535 | 08-12-2016 01:02 PM
 | 2225 | 08-08-2016 10:00 AM
 | 2653 | 08-03-2016 04:44 PM
 | 5584 | 08-03-2016 02:53 PM
 | 1447 | 08-01-2016 02:38 PM
02-19-2016
12:32 PM
1 Kudo
@Sunile Manjee It depends. You can run any R function, but only a subset is supported directly on the dataframe. R functions are normally not parallelized, so to get true parallel aggregations they need to be translated into Spark code.

- You can always filter first in Spark and then copy your SparkR dataframe into a local, normal R data frame using as.data.frame (see the sketch below).
- Other similar tools support executing R code on rows/groups of data inside the cluster (groupApply, TableApply, RowApply in other MapReduce frameworks), but I do not see a way to do that in Spark: there does not seem to be an R library distributed to every node. I might be wrong, others can correct me.
- You always have the option to execute R directly from Scala and do the grouping yourself, but that would be a lot of effort: https://cran.r-project.org/web/packages/rscala/
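To make the filter-first option concrete, here is a minimal sketch in Spark's Scala API (the SparkR equivalent chains filter() and as.data.frame()); the events table, the column names, and the predicate are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object FilterThenCollect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-then-collect").getOrCreate()

    // Push the heavy filtering/aggregation into Spark, where it runs in parallel ...
    val reduced = spark.table("events")                 // hypothetical Hive table
      .filter("event_date >= '2016-01-01'")             // hypothetical predicate
      .groupBy("customer_id")
      .count()

    // ... and only copy the (now small) result back to the driver. In SparkR this is the
    // point where as.data.frame() would give you a plain, local R data frame to work with.
    val local = reduced.collect()
    println(s"Pulled ${local.length} rows back to the driver")

    spark.stop()
  }
}
```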
02-19-2016
11:20 AM
4 Kudos
You are definitely not stupid 🙂 Working with data is hard. There are some things that work really well now, and the core advantage of Hadoop is that once written you can scale your application to infinity. But in general working with data is hard. I remember once spending a day to export an XML table from DB2, and days to figure out the correct way to extract some key fields from JSON tweets (the user name can be in different fields, some fields are empty when I think they shouldn't be, some records are just broken ...).

In general Hadoop uses some of the most widely used open source Java libraries to handle XML and JSON processing, but it is not a core feature the way XML support in Postgres might be. For the JSON I would say: if it breaks here, it most likely breaks in other tools as well, since the open source Java JSON libraries are widely used.

But let's go back to your XML problem. So you have a pretty huge XML document and want to extract hundreds of fields from it as a view? And you say you didn't use a SerDe, but stored it how, as a string? And then you used the following? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF

But that is terrible. It reads the 400KB XML string, pushes it into the XPath UDF once for every single one of your XPath expressions, and parses the document over and over and over and over again. I am not surprised that it is slow or kills itself if this is what you actually did. You need to find a way to parse the document once and extract all the information you need from it, or use the SerDe, which does the same.
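To illustrate the parse-once idea outside of Hive, here is a hedged sketch in plain Scala using the JDK's built-in XML and XPath classes; the record, field names, and XPath expressions are invented, and inside Hive the XML SerDe or a custom UDF plays this role:

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource

object ParseOnce {
  def main(args: Array[String]): Unit = {
    // Hypothetical record; in reality this would be the large XML string stored in the table.
    val record =
      "<order><id>42</id><customer><name>Alice</name></customer><total>19.99</total></order>"

    // Parse the document ONCE ...
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new InputSource(new StringReader(record)))

    // ... then evaluate as many XPath expressions as you like against the parsed tree,
    // instead of re-parsing the string once per expression.
    val xpath = XPathFactory.newInstance().newXPath()
    val fields = Seq("/order/id", "/order/customer/name", "/order/total")
      .map(expr => expr -> xpath.evaluate(expr, doc))

    fields.foreach { case (expr, value) => println(s"$expr = $value") }
  }
}
```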
02-19-2016
09:40 AM
2 Kudos
Just one question: did you cut and paste the command? Word has the bad habit of replacing the - sign with a different character that looks the same. Can you verify that the - is really the normal Linux command-line hyphen?
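If you want to check quickly, printing each character's code point exposes a Word "en dash" (U+2013) immediately; a throwaway sketch (the pasted command string is just an example):

```scala
object DashCheck {
  def main(args: Array[String]): Unit = {
    // Example of a command pasted from Word: the dash before "ls" is U+2013, not the ASCII hyphen U+002D.
    val pasted = "hdfs dfs –ls /tmp"
    pasted.foreach(c => println(f"'$c' -> U+${c.toInt}%04X"))
  }
}
```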
02-19-2016
09:38 AM
7 Kudos
Hello Rukishek, unless I misunderstand something, this is not correct. HDFS does not store inside the data where the next block is. Instead, the NameNode knows which blocks make up a file and also the order of those blocks. Using this, the HDFS client knows which block to load at any time if you seek in the file. HDFS blocks are stupid, simple 128 MB cuts of the data, and DataNodes are stupid too: they only know which blocks they have. The NameNode pieces it all together using an in-memory image of all files, the blocks that make up those files, and where those blocks are stored; the clients get this information from the NameNode. Now, if you mean full-text indexing, then you should look at Solr like Rahul said.
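A hedged sketch of what that looks like from the client side, using the Hadoop FileSystem API in Scala (the path and the seek offset are made up):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SeekExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)
    val file = new Path("/tmp/some-large-file") // hypothetical path

    // The client asks the NameNode which blocks make up the file and where they live ...
    val status    = fs.getFileStatus(file)
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    locations.foreach { b =>
      println(s"offset=${b.getOffset} len=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }

    // ... and seek() simply jumps into the block that contains the requested offset.
    val in = fs.open(file)
    in.seek(3L * 128 * 1024 * 1024)             // jump ~3 blocks in (with 128 MB blocks)
    println(s"first byte after seek: ${in.read()}")
    in.close()
    fs.close()
  }
}
```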
02-19-2016
09:33 AM
8 Kudos
The short answer is no, indexes in Hive are not recommended. The reason for this is ORC. ORC has built-in indexes which allow the format to skip blocks of data during a read, and it also supports bloom filters. Together this pretty much replicates what Hive indexes did, and it happens automatically in the data format without the need to manage an external table (which is essentially what a Hive index is). I would rather spend my time setting up the ORC tables properly. Again, shameless plug: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
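As a hedged sketch of "set up the ORC table properly instead of adding an index" (Hive DDL issued through Spark's Scala API here; the table and column names are invented):

```scala
import org.apache.spark.sql.SparkSession

object OrcInsteadOfIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Plain ORC table: min/max indexes are written automatically for every column.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_orc (
        sale_id   BIGINT,
        sale_date DATE,
        amount    DECIMAL(10,2)
      )
      STORED AS ORC
    """)

    // Inserting the data sorted on the common filter column keeps the min/max ranges
    // tight, so the reader can skip most of the file for range predicates.
    spark.sql("""
      INSERT OVERWRITE TABLE sales_orc
      SELECT sale_id, sale_date, amount
      FROM sales_staging
      SORT BY sale_date
    """)

    spark.stop()
  }
}
```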
02-19-2016
09:29 AM
4 Kudos
So first: ORC indexes come in two forms, the standard indexes which are always created (min/max values for each stride of each column) and bloom filters. Normal indexes are good for range queries and work amazingly well if the data is sorted, which is normally automatic on a date column or on increasing columns like IDs. Bloom filters are great for equality queries on things like URLs, names, etc. in data that is not sorted (i.e. a customer name that shows up only here and there in the data). However, bloom filters take some time to compute, take some space in the indexes, and do not work well for most columns in a data warehouse (number fields like profit, sales, ...), so they are not created by default and need to be enabled per column with orc.bloom.filter.columns.

The stride size is the number of rows in the block of data that the ORC reader can skip during a read operation based on these indexes. 10000 is normally a good number and increasing it doesn't help you much. You can play a bit with it, but I doubt you will get big performance improvements by changing it. I would expect more impact from the block size (which affects how many mappers are created) and from compression (ZLIB is normally the best). But by far the most impact comes from good data modeling: sorting the data during insert, the correct number of ORC files in the folder, the data types used, etc. Shameless plug that explains it all a bit: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
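A hedged sketch of how those knobs typically end up on a Hive ORC table (again as Hive DDL through Spark's Scala API; the table, columns, and values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object OrcTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    spark.sql("""
      CREATE TABLE IF NOT EXISTS weblogs_orc (
        request_ts    TIMESTAMP,
        url           STRING,
        customer_name STRING,
        bytes_sent    BIGINT
      )
      STORED AS ORC
      TBLPROPERTIES (
        'orc.compress'             = 'ZLIB',               -- compression codec
        'orc.bloom.filter.columns' = 'url,customer_name',  -- equality lookups on unsorted columns
        'orc.bloom.filter.fpp'     = '0.05',               -- acceptable false-positive rate
        'orc.row.index.stride'     = '10000'               -- default stride; rarely worth changing
      )
    """)

    spark.stop()
  }
}
```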
02-18-2016
07:44 PM
1 Kudo
You can normally escape things with a \ in front of them; sometimes you need two backslashes.
02-18-2016
01:31 PM
2 Kudos
YARN is a general work scheduler that can run different types of workloads:

- Spark
- MapReduce2
- Storm
- Tez
- ...

While MapReduce is a core feature and most likely still the majority of the workloads, it is not the only one anymore: Hive and Pig use Tez, and Spark and Storm are big as well. This is the biggest advantage. Other advantages include better scalability (local NodeManagers instead of a single JobTracker bottleneck), lots of convenience features, etc.
02-17-2016
11:11 PM
1 Kudo
You mean the R functions you can use on SparkR dataframes? The problem is that the R functions used on the dataframes need to be translated into Spark functions, otherwise they would not run in parallel inside the engine. So only a subset is supported.
02-17-2016
10:57 PM
3 Kudos
The JIRA proposes 10+4 (10 data blocks with 4 additional parity blocks), which would be roughly 40% overhead. But I think I have seen 6+3 somewhere as well. In both cases you should be able to count with roughly 50% overhead. But as Neeraj says, we will have to wait to see the final figures. https://issues.apache.org/jira/browse/HDFS-7285
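The overhead numbers fall straight out of the parity-to-data ratio (a quick check of the figures above):

```latex
% Storage overhead = parity blocks / data blocks
\[
\text{10+4:}\quad \tfrac{4}{10} = 40\%\ \text{extra storage}
\qquad
\text{6+3:}\quad \tfrac{3}{6} = 50\%\ \text{extra storage}
\]
```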