Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7330 | 08-12-2016 01:02 PM |
| | 2705 | 08-08-2016 10:00 AM |
| | 3645 | 08-03-2016 04:44 PM |
| | 7194 | 08-03-2016 02:53 PM |
| | 1859 | 08-01-2016 02:38 PM |
06-02-2016
07:59 AM
2 Kudos
Etc. Normally you have more mappers than reducers, for two reasons: a) in most analytical tasks you can filter out a huge percentage of the data at the source, and b) if you can choose where to compute something, it is better to do it in the mapper. Therefore you would want more reducers for any task where the heavy work happens after a group by/join and you cannot filter out data in the mapper. Examples I can think of: Running data mining inside MapReduce, for example to create one forecast model per product. In that case reading the data in the mapper is trivial, but the modelling step running in the reducer is heavy, so you would want more reducers than mappers. Inserting data into a (partitioned) ORC Hive table: creating ORC files is pretty heavy and you want one reducer per partition, potentially writing a couple of files each, while reading a delimited input file is very lightweight, so here too you want more reducers than mappers. ...
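As a rough illustration, here is a minimal Scala sketch of that first pattern, expressed in Spark rather than raw MapReduce just to make the "light map, heavy reduce" shape concrete; fitModel, the path and the column layout are made-up placeholders.

```scala
// Sketch only: "light map, heavy reduce" - one forecast model per product.
// fitModel, the path and the column layout are hypothetical placeholders;
// sc is the SparkContext from spark-shell.
val sales = sc.textFile("hdfs:///data/sales.csv")   // cheap map side: read + parse
  .map(_.split(","))
  .map(f => (f(0), f(1).toDouble))                  // (productId, value)

// A stand-in for a real, expensive forecast model
def fitModel(points: Iterable[Double]): Double = points.sum / points.size

// The heavy work happens after the shuffle, so ask for many reduce-side
// partitions - the Spark analogue of "more reducers than mappers".
val modelPerProduct = sales.groupByKey(200).mapValues(fitModel)
```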
06-01-2016
04:08 PM
Yeah, I would say something like Oozie/distcp might be your better bet here. It fits nicely into the ETL flow you would have in your cluster anyway. HDF is very powerful and in many areas much nicer to use than Oozie/Falcon. However, if you have a Hadoop cluster you normally want to do the bulk processing in it, and that would be scheduled by Oozie/Falcon, so using those frameworks to propagate results or raw files to other clusters also makes sense to me. I would see HDF more as the tool that gathers all the information and brings it into the cluster.
06-01-2016
03:40 PM
Weird, I do the same thing and it works for me. Are you sure it's not just a small mistake like a missing / or incorrect access rights? Can you try with a hardcoded path? <exec>myscript.sh</exec> ... <file>${nameNode}/myfolder/myscript.sh#myscript.sh</file>
06-01-2016
10:59 AM
1 Kudo
It's a bit like comparing apples to oranges. Falcon is used to pipe huge amounts of data between Hadoop clusters (using distcp and other tools), and, like Oozie, it can schedule transformation tasks that are supposed to run inside a cluster. HDF is a streaming solution, a bit more similar to Flume (HDF fans will hit me for that comparison), for ingesting data into a Hadoop cluster (and doing other things with it). So the question is: do you have data streams (logs, IoT data, social media data, ...) coming in from outside a Hadoop cluster? HDF is perfect for that, and you can easily add two outputs going to different clusters. Do you have a source cluster and want to move data to two target clusters and do some in-cluster computation like ETL? Falcon/Oozie with distcp. That doesn't mean you couldn't use HDF for that as well, but it would not be as natural.
06-01-2016
10:47 AM
1 Kudo
Forget indexes, but partitions would help a lot. I normally expect Hive tables to be partitioned by date, so why don't you do that? You may need to add an extra integer day column. Also, ORC files are much faster than delimited files. Finally, clustering by code doesn't help you much as long as you don't use ORC: there is no predicate pushdown at all and no performance gain. Even with ORC you might not see benefits, since you would have one file with all the rows for your code and 24 files (mappers) that would close immediately. So it might be better to cluster by something else and add a sorted by code clause. However, as said, that only helps you with ORC.
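A minimal sketch of the kind of table layout meant here, assuming a date-partitioned ORC table clustered by some other column and sorted by code; the table/column names and bucket count are made up, and the DDL is issued through a Spark HiveContext only to keep the new examples in one language.

```scala
// Sketch only: date-partitioned ORC table, clustered by another column and
// sorted by code so ORC predicate pushdown on code can help. All names and
// the bucket count are hypothetical; sc is the SparkContext from spark-shell.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS events_orc (
    code        STRING,
    customer_id BIGINT,
    amount      DOUBLE
  )
  PARTITIONED BY (day INT)
  CLUSTERED BY (customer_id) SORTED BY (code) INTO 24 BUCKETS
  STORED AS ORC
""")
```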
06-01-2016
09:41 AM
You can try to add the Hive2 credential to your Java action, but I am afraid it is not supported for a generic Java action (still worth a try): https://oozie.apache.org/docs/4.2.0/DG_ActionAuthentication.html If you need to do it yourself, you would have to programmatically read the keytab and obtain the ticket in your own code: https://www.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.admin.doc/doc/kerberos_hive.html And finally, why don't you use a more common authentication mechanism for Hive like PAM/LDAP? (ceterum censeo Carthaginem delendam esse)
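A rough Scala sketch of the "do it yourself" route, assuming a keytab-based login followed by a HiveServer2 JDBC connection; the principal, keytab path and JDBC URL are placeholders for your environment.

```scala
// Sketch only: programmatic Kerberos login from a keytab, then Hive JDBC.
// Principal, keytab path and JDBC URL are hypothetical placeholders.
import java.sql.DriverManager
import org.apache.hadoop.security.UserGroupInformation

object HiveWithKeytab {
  def main(args: Array[String]): Unit = {
    // Get the Kerberos ticket yourself from the keytab
    UserGroupInformation.loginUserFromKeytab(
      "myuser@EXAMPLE.COM", "/etc/security/keytabs/myuser.keytab")

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // HiveServer2 in Kerberos mode: the server principal goes into the URL
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver:10000/default;principal=hive/_HOST@EXAMPLE.COM")
    val rs = conn.createStatement().executeQuery("SHOW TABLES")
    while (rs.next()) println(rs.getString(1))
    conn.close()
  }
}
```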
05-30-2016
10:10 AM
1 Kudo
"(so we really have more than 1 rdd lol...?)" I was not precise here. We have one RDD ( distributed dataset ) with x partitions. So when I said RDD replace it with RDD partition. "1- Hivescan to read data stored in hdfs and create a RDD based on this data (create 1 or more rdd?)" Spark uses the underlying hadoop inputfprmats to read the files and creates a Spark RDD partition for each split ( normally a block ) of the file in hdfs. It will then try to place these RDD partitions in executors which are in the same physical machine as the file block. This is what I mean with locally. Whenever you can read data on the local machine without having to send it to other machines. "6 - Ant the last TungstenAggregate that aggregates the pre aggregation I didnt understand very well, can you explain better? http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm Lots of great explanations around. It works exactly the same way as a combiner. Instead of shipping all values around the network you do a local aggregation on each node and then only distribute the intermediate result around to do the final aggregation after the shuffle/exchange.
05-30-2016
09:30 AM
1 Kudo
It sounds to me like you didn't use the file tag correctly. Did you upload the file to HDFS? However, I am dubious whether this will work in any case, since the script will have a lot of dependencies. You might have to install the HBase client on all nodes or upload a lot of extra libraries. Alternatively, you could use an ssh action to connect to a fully working edge node and run your commands there. I would almost propose that for something non-mission-critical like scheduling compactions.
05-29-2016
10:14 PM
@Kaliyug Antagonist "Does this mean that I have to explicitly set the no. of reducers on the Hive prompt? Is it mandatory for the CORRECT insertion of data?"

It is not mandatory for correct insertion, but it matters for performance. If you have a hundred reducers you get a hundred files and the smapis are divided between them (all values for one smapi ending up in the same file); if you have 10 you will get ten files. So there is a direct correlation with load speed (and, to a lesser extent, query performance as well), and yes, buckets might be your better bet.

"Unfortunately, there is only one where condition (where smapiname_ver ='dist_1'), so I am left only with one column on which partitioning is already considered."

Once you use buckets you don't use distribute by anymore; it is either/or, and the sort is specified in the table definition: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables See how they specify the sorted by keyword in the table definition? If you then load data into the table, Hive will do the distribute/sort work itself.
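A minimal sketch of such a bucketed, sorted table, assuming made-up table/column names and bucket count; the DDL is issued through a Spark HiveContext only to keep the new examples in one language, and the load itself would be a plain Hive INSERT.

```scala
// Sketch only: bucketed + sorted table as in the linked DDL manual.
// Table/column names and the bucket count are hypothetical placeholders;
// sc is the SparkContext from spark-shell.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS events_bucketed (
    smapiname_ver STRING,
    payload       STRING
  )
  CLUSTERED BY (smapiname_ver) SORTED BY (smapiname_ver) INTO 10 BUCKETS
  STORED AS ORC
""")

// Loading would then be a plain INSERT from the Hive prompt, e.g.
//   SET hive.enforce.bucketing=true;
//   INSERT OVERWRITE TABLE events_bucketed SELECT ... FROM staging_table;
// and Hive adds the distribute/sort step itself.
```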
05-29-2016
04:50 PM
Under the covers Spark uses a Hadoop TextInputFormat to read the file. The minPartitions number is given as an input to the FileInputFormat getSplits method: http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29 This function is pretty complex and uses a goalSize, blockSize and minSize to split the file into splits, goalSize being totalSize/numSplits. Looking at it, it normally should honour your request, but you might be running into a scenario where you have a very small file and hit some rounding issues. You could try running the code with your block size to see if that is the case. It should not matter though, since Hadoop will make sure that each record is processed exactly once (by ignoring the first unfinished record of any split and over-reading the split to finish the last record).
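A quick Scala check of that behaviour, assuming a placeholder path; minPartitions is only a hint passed down to getSplits, not a guarantee.

```scala
// Sketch only: minPartitions is a hint handed to FileInputFormat.getSplits.
// The path is a placeholder; sc is the SparkContext from spark-shell.
val rdd = sc.textFile("hdfs:///data/input.txt", 8)   // ask for at least 8 splits
println(rdd.partitions.length)                       // what getSplits actually produced

// For a file much smaller than a block, the goalSize/minSize arithmetic can
// round you down to fewer partitions than requested; each record is still
// processed exactly once regardless of where the split boundaries fall.
```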