Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3991 | 08-12-2016 01:02 PM |
| | 1872 | 08-08-2016 10:00 AM |
| | 2039 | 08-03-2016 04:44 PM |
| | 4328 | 08-03-2016 02:53 PM |
| | 1132 | 08-01-2016 02:38 PM |
05-29-2016
04:31 PM
It's actually not too hard. Tungsten is an engine in Spark that processes data at a lower level, using vectorized and otherwise optimized operations. TungstenAggregate, for example, allows an efficient hash-table-based aggregation. So if you want to understand the Tungsten aggregation chain TungstenAggregate -> TungstenExchange -> TungstenAggregate, it helps to understand what map -> combiner -> reducer is in MapReduce; it's very similar. Essentially you do a GROUP BY l_returnflag, l_linestatus and then aggregate a bunch of columns. So what Spark needs to do is:

- Read local partitions of data on each block (RDD): HiveScan
- Filter locally and project: Filter, Project
- Locally aggregate the data in each RDD by your grouping keys: the first TungstenAggregate (you can see the mode=Partial tag); in MapReduce this would be the combiner
- Distribute the data by group key to a new set of RDDs, so all values for a specific group key end up in the same target RDD; in MapReduce this would be called the shuffle: TungstenExchange
- Do the final aggregation of the pre-aggregated values on the target RDDs: the second TungstenAggregate, with mode=Final
- Do some casting and type conversion: ConvertToSafe

Voila, hope that helps.
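If you want to see such a plan yourself, here is a minimal sketch (the lineitem table and its columns are the usual TPC-H ones and are assumptions here; the exact operator names depend on your Spark version):

```sql
-- TPC-H Q1-style aggregation; EXPLAIN prints the physical plan, which on
-- Spark 1.5/1.6 should show a HiveTableScan, Filter/Project, a
-- TungstenAggregate (mode=Partial), a TungstenExchange, a second
-- TungstenAggregate (mode=Final), and ConvertToSafe.
EXPLAIN
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       AVG(l_extendedprice) AS avg_price,
       COUNT(*)             AS row_count
FROM lineitem
WHERE l_shipdate <= '1998-09-02'
GROUP BY l_returnflag, l_linestatus;
```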
05-26-2016
10:05 AM
1 Kudo
"SORT BY is required to speed up the search" correct, with sort by each output file will be sorted by your sort condition, so ORC can skip whole blocks ( stripes ) of data based on where conditions "DISTRIBUTE BY to create less no. of output files(1-10 or more?," Distribute by works very similar to buckets. ( you may be better off with buckets in case you don't understand it well however it gives you more flexibility and skips some of the problems that come with buckets ). Essentially distribute by forces a reducer with a shuffle and distributes data by the key. So if you distribute by smapiname_ver you would have all values with the same smapiname in the same output file. Also if you distribute with the partition key you can make sure that each reducer only writes to a single output file. Together with forcing the number of reducers you have essentially a similar power to buckets ( and more flexibility ) . But again if you don't understand it you might be better off with buckets sorted. and the optimized sorted load. "The JOIN is on the integer column Snapshot_Id, hence, SORT BY Snapshot_Id" Hmmm no, you do not filter by the snapshot_id, you still need all of them so the predicate pushdown doesn't help you thee. "Is it necessary that the DISTRIBUTE BY column has LOW CARDINALITY" No you define the number of files with the number of reducers, the distribute by decides how data is distributed between them. Make sure that the partition key is part of the distribute by and any other key you want to add where conditions on. ( ideally still allowing parallelity.
05-25-2016
11:07 AM
You do not have to change the Ambari database. But Ambari needs the JDBC driver for MySQL available to check on Oozie and do the initial setups.

"Also if I use derby to start with, can I change it to MySQL later?"

Yes, you can switch later, but you will lose all the data inside unless you do a migration. So just do it with MySQL from the start.
05-24-2016
04:28 PM
Yes, you would need to configure user sync with LDAP/AD in the Ranger UI. Alternatively, use UNIX user sync in Ranger to sync with the local operating system. (Works as well.)
05-24-2016
04:27 PM
Yeah, LDAP, or use PAM. You can still kerberize your cluster, but you wouldn't do the Hive authentication through it. https://community.hortonworks.com/articles/591/using-hive-with-pam-authentication.html
05-24-2016
12:48 PM
1 Kudo
Alternatively, use Kerberos and kerberize the HDFS UI. In this case only SPNEGO-enabled browsers will be able to access the UI, and you will have the same filesystem access restrictions as users have when directly accessing HDFS.
05-24-2016
12:32 PM
I don't think Spark Streaming 2.0 will change your requirements too much. AFAIK it will provide an easy way to run SQL on top of the stream (might be mistaken, so Spark experts feel free to stop me). However, it will not change the underlying architecture or latency considerations. In the end I think it depends on your workload. What kind of latency do you expect? You will have a hard to impossible time getting sub-second latency out of Spark Streaming, for example; in that case I would go with Storm.

Other reasons for Storm:

- Out-of-order processing is much easier (i.e. some heavy tuples can take a long time while fast tuples are processed at the same time without blocking)
- I think it's easier to build control flows
- Essentially, any time you have a complex flow of multiple input streams that don't do complex joins but work more like control flows, I would go with Storm

Reasons for Spark Streaming:

- You have the full power of Spark at your disposal; data transformation steps like groupings, joins etc. are much more natural
- All the cool Spark tooling and features, like ML
05-24-2016
12:30 PM
1 Kudo
Hello Mike, I would not use Knox unless you have to; the HTTP protocol causes a lot of problems with clients. I would go with LDAP/PAM for authentication in Hive (this has nothing to do with either Ranger or Knox) and binary (non-HTTP) access. Then configure Ranger for authorization (or use SQLStdAuth).
05-23-2016
02:01 PM
1 Kudo
Or, if you want to do it in a dirty way (Ravi's way is obviously cleaner): replace the drive, make sure it works, and restart the DataNode. Same effect, but the DataNode will be out of the system for a shorter period of time.
05-23-2016
12:55 PM
2 Kudos
"Isn't it's role limited to the data loading(ORC table creation) phase ?" Oh one last comment here. Loading the data correctly is key to the performance of the queries. One thing to look out for is too many small files in the table location. That is deadly for performance. So correct loading is a major thing.