Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7335 | 08-12-2016 01:02 PM |
| | 2705 | 08-08-2016 10:00 AM |
| | 3646 | 08-03-2016 04:44 PM |
| | 7199 | 08-03-2016 02:53 PM |
| | 1859 | 08-01-2016 02:38 PM |
05-10-2016
03:40 PM
1 Kudo
That is essentially master data management. There are a ton of tools out there for this (IBM MDM has three solutions alone; QualityStage also comes to mind). Some of it may be easy: for the gender fields, for example, you could write simple Scala UDFs that do the transformation. Today you may want to use DataFrames, although I am still a fan of old-fashioned RDDs. Below is an example that does parsing with a Scala UDF; you could do your cleaning in there as well. This works whenever you can check a row based on the row alone and do not need to do full person or entity matching. https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html

The moment you do not simply need to do some data cleansing but need to do full entity matching, it all gets MUCH more complicated. Here is a great answer by Henning on that topic (and a less good answer with additional details from me): https://community.hortonworks.com/questions/26849/person-matching-in-spark.html
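For the simple per-row cleaning case, here is a minimal sketch of a Scala UDF applied through the DataFrame API; the column name and the gender mappings are illustrative assumptions, not taken from the question.

```scala
// Minimal sketch: normalize a free-text gender column with a Scala UDF.
// The sample data and the mapping values are assumptions for illustration only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object GenderCleanup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GenderCleanup").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample rows standing in for the raw input
    val raw = Seq(("alice", "F"), ("bob", "male"), ("carol", " f "), ("dan", "unknown"))
      .toDF("name", "gender")

    // UDF that maps free-text gender values onto a canonical code
    val normalizeGender = udf { (g: String) =>
      Option(g).map(_.trim.toLowerCase) match {
        case Some("m") | Some("male")   => "M"
        case Some("f") | Some("female") => "F"
        case _                          => "UNKNOWN"
      }
    }

    raw.withColumn("gender", normalizeGender($"gender")).show()
    spark.stop()
  }
}
```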
05-10-2016
01:31 PM
Quick prefix: a "vertex" is a more general form of the map and reduce stages in MapReduce. In MapReduce you can only have 2 stages, so complex Pig jobs result in multiple MapReduce jobs running after each other. In Tez, multiple stages (vertexes) can be merged into the same job. So think of a vertex as a map or reduce stage. Each vertex can have multiple tasks. In Pig, the same rules apply for this number as in MapReduce (see the links below). However, it is sometimes a bit difficult to configure two reducer stages independently, so normally parameters like "number of MB per reducer" are used, and Pig then tries to compute the number of Tez tasks from the output size of the previous stage/vertex. You can also hard-set it, but then all reducer vertexes get the same number, which is not always what you want.

1) Yes.
2) Yes.
How to control this: the same as in MapReduce. This link for reducers: https://pig.apache.org/docs/r0.11.1/perf.html#reducer-estimation and this link for mappers: https://pig.apache.org/docs/r0.11.1/perf.html#combine-files
3) Not sure honestly, have you tried the Tez view?
4) Not sure what you mean: one node as in one server, or one container? One container = one task; multiple containers = one vertex. To control their parallelism, see above.
5) One vertex running 4 mappers, unless the files are small, in which case they are combined (see the link above).
6) See above (I might be misunderstanding the question; it seems to be the same as above).
7) Again, the Tez view in Ambari might help. In Hive there is a parameter, set hive.tez.exec.print.summary (or hive.tez.print.exec.summary?), which shows you all of that. No idea if something like that is available in Pig.
05-10-2016
12:32 PM
Not exactly sure what you mean by "codec mechanism". But if you mean transforming a single big .gz file into several small .gz files, or into uncompressed files, you would most likely use Pig: http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig To specify the number of writers you will need to force reducers: http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-instead-of-thousands-of-ti And here are some tips on setting the number of reducers: http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features Instead of Pig you could also write a small MapReduce job; there you are more flexible, for the price of a bit of coding. Or Spark might work too (see the sketch below). Or Hive, using the DISTRIBUTE BY keyword.
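For the Spark route, a minimal sketch of reading the one big .gz file, repartitioning it, and writing it back out as several smaller gzip parts; the paths and the partition count are illustrative assumptions.

```scala
// Sketch: split one large gzip file into several smaller gzip files with Spark.
// Input/output paths and the number of partitions are assumptions for illustration.
import org.apache.spark.sql.SparkSession

object SplitGzFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SplitGzFile").getOrCreate()

    // A single .gz file is read by one task because gzip is not splittable,
    // but repartition() then spreads the lines over the cluster.
    val lines = spark.sparkContext.textFile("/data/in/big-file.gz")

    lines
      .repartition(16) // number of output files; tune toward your target file size
      .saveAsTextFile(
        "/data/out/smaller-gz",
        classOf[org.apache.hadoop.io.compress.GzipCodec])

    spark.stop()
  }
}
```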
05-10-2016
11:58 AM
Yeah, I am not really sure about the whole hash approach. Is there a primary key here? Why not simply load in the batch of data and then recreate the base table using:

CREATE TABLE NEWTABLE AS
SELECT * FROM DAILYADD
UNION ALL
SELECT * FROM OLDTABLE
WHERE PRIMARYKEY NOT IN (SELECT PRIMARYKEY FROM DAILYADD);

If you don't have primary keys then a hash will not help you either; you might have two rows with the same values, so what would you do then? And what would you DO with the information about which rows have changed?
05-10-2016
11:54 AM
LOAD by itself doesn't do any data transformation; it essentially takes the files and puts them in the Hive table directory. (So you need to be sure you have created your Hive table with the correct storage options for the data.) If you need to transform the data, create an external table over the raw files and then use an INSERT INTO ... SELECT statement to transform it into the target table.
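A rough sketch of that external-table-then-INSERT pattern, issued here through a Hive-enabled SparkSession for consistency with the other snippets rather than the Hive CLI; the table names, columns, paths and formats are illustrative assumptions.

```scala
// Sketch: external table over raw files, managed target table, transform on insert.
// All identifiers, paths and types below are assumptions for illustration.
import org.apache.spark.sql.SparkSession

object ExternalTableInsert {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExternalTableInsert")
      .enableHiveSupport()
      .getOrCreate()

    // External table that only points at the raw files; no transformation yet
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS staging_events (id STRING, ts STRING, amount STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/raw/events'
    """)

    // Managed target table with the storage format you actually want
    spark.sql("""
      CREATE TABLE IF NOT EXISTS events (id STRING, ts TIMESTAMP, amount DOUBLE)
      STORED AS ORC
    """)

    // The transformation happens on the way into the managed table
    spark.sql("""
      INSERT INTO TABLE events
      SELECT id, CAST(ts AS TIMESTAMP), CAST(amount AS DOUBLE)
      FROM staging_events
    """)

    spark.stop()
  }
}
```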
05-10-2016
11:51 AM
2 Kudos
In addition to what Neeraj said: the data will be cut into blocks and distributed, but perhaps more relevant, you will have a SINGLE mapper reading that file (and piecing it back together). This is true for GZ, for example, which is a so-called "non-splittable" compression format. That means a map task cannot read a single block; it essentially needs to read the full file from the start. So the rule of thumb is: if you have GZ-compressed files (which is perfectly fine and often used), make sure they are not big. Be aware that each of them will be read by a single map task. Depending on compression ratio and performance SLAs you want to be below 128 MB. There are other "splittable" compression algorithms supported (mainly LZO) in case you cannot guarantee that. And some native formats like HBase HFiles, Hive ORC files, ... support compression inherently, mostly by compressing internal blocks or fields.
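If you want to see the single-task effect for yourself, here is a small sketch in Spark, which uses the same Hadoop input formats; the paths are illustrative assumptions.

```scala
// Sketch: a gzip file always arrives as a single partition (one task), while a plain
// or splittable file is cut at block boundaries. Paths are assumptions for illustration.
import org.apache.spark.sql.SparkSession

object GzSplittabilityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GzSplittabilityCheck").getOrCreate()
    val sc = spark.sparkContext

    val gz = sc.textFile("/data/big-file.gz")
    println(s"gzip partitions:  ${gz.getNumPartitions}")    // 1, regardless of file size

    val plain = sc.textFile("/data/big-file.txt")
    println(s"plain partitions: ${plain.getNumPartitions}") // roughly fileSize / blockSize

    spark.stop()
  }
}
```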
05-10-2016
11:11 AM
2 Kudos
There are two reasons behind ZooKeeper ensemble sizes:

a) Redundancy. As Predrag mentions, if you do not have HA anyway, 1 ZooKeeper will be as fast and good enough. However, you NEED to back up the ZooKeeper data directory like you would the NameNode folder; there is a lot of information in there that you need for a working cluster. Going for 3 makes life safer, I think.

b) Performance. Not what you asked for, since you mentioned a small cluster, but just for general information: adding ZooKeeper nodes makes the ensemble slower if you have more than 15-30% write operations, but if you have mostly reading clients, adding nodes makes it faster. Keep that in mind in case you ever have performance problems because of too many ZooKeeper clients (highly unlikely on a smaller cluster unless you are a heavy HBase or Kafka user, assuming an older Kafka version). http://muratbuffalo.blogspot.co.uk/2014/09/paper-summary-zookeeper-wait-free.html
05-10-2016
11:02 AM
Contrary to popular belief, Spark is not in-memory only.

a) Simple read, no shuffle (no joins, ...): For the initial reads Spark, like MapReduce, reads the data as a stream and processes it as it comes along. I.e. unless there is a reason to, Spark will NOT materialize the full RDDs in memory (you can tell it to do so if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (re-read a block from HDFS, for example), not because it is stored in memory in different locations (though that can be done too). So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.

b) Shuffle: This is done very similarly to MapReduce: the map outputs are written to disk and the reducers read them over HTTP. However, Spark uses an aggressive filesystem buffer strategy on the Linux filesystem, so if the OS has memory available the data may never actually hit the physical disk.

c) After the shuffle: RDDs after a shuffle are normally cached by the engine (otherwise a failed node or lost RDD would require a complete re-run of the job); however, as abdelkrim mentions, Spark can spill these to disk unless you overrule that.

d) Spark Streaming: This is a bit different. Spark Streaming expects all data to fit in memory unless you overwrite the settings.
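A minimal sketch illustrating points a) and c): nothing is kept in memory unless you ask for it, and persist() can be allowed to spill to disk. The path, the filter and the storage level are illustrative assumptions.

```scala
// Sketch: explicit caching vs. streaming through the data. Paths and the filter
// predicate are assumptions for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachingSketch").getOrCreate()
    val sc = spark.sparkContext

    // Streamed through, never fully materialized: only the filtered subset survives
    val errors = sc.textFile("/data/logs/*.gz").filter(_.contains("ERROR"))

    // Explicitly cache the (small) filtered RDD; MEMORY_AND_DISK allows spilling
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    println(errors.count()) // first action materializes and caches the RDD
    println(errors.count()) // second action reads from the cache

    spark.stop()
  }
}
```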
05-09-2016
06:03 PM
One question: why not simply add a "changed at" column? For adding rows you could just use partitioning and filter by the daily partition. For changing them, you plan to use ACID? It is still pretty new and not great for a high number of updates across a full table. I think a bit more detail on what you actually plan to achieve would be good.
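For what it's worth, a rough sketch of the "daily partition plus a changed-at column" idea, issued through a Hive-enabled SparkSession; the table, columns, staging source and date are illustrative assumptions.

```scala
// Sketch: daily-partitioned history table with an explicit changed_at column.
// Table names, columns, the staging table and the date literal are assumptions.
import org.apache.spark.sql.SparkSession

object DailyPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DailyPartitionSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Base table partitioned by load day
    spark.sql("""
      CREATE TABLE IF NOT EXISTS customer_history (
        id STRING, name STRING, changed_at TIMESTAMP)
      PARTITIONED BY (load_date STRING)
      STORED AS ORC
    """)

    // Appending a day's rows only touches that day's partition
    // (staging_customers is an assumed staging table holding the new batch)
    spark.sql("""
      INSERT INTO TABLE customer_history PARTITION (load_date = '2016-05-09')
      SELECT id, name, current_timestamp() FROM staging_customers
    """)

    // Reading "what arrived today" is a cheap, partition-pruned scan
    spark.sql("SELECT * FROM customer_history WHERE load_date = '2016-05-09'").show()

    spark.stop()
  }
}
```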
05-09-2016
11:26 AM
@John Cod Unfortunately that is too vague. Also, it looks to me like you are using MapReduce? May I ask which distribution you are on? If it's CDH then you won't have Tez, and Hive will be slow. Cloudera has their own query engine, Impala, and is now going to Hive on Spark, so they do not really support the latest of the open source Hive. On CDH I would go with Parquet + Impala then (or switch to Hive and HDP or any other open Hadoop distribution).