Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7358 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3672 | 08-03-2016 04:44 PM |
| | 7211 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
03-26-2016
12:23 AM
1 Kudo
"But within each of those files, sorting may help to skip blocks. Am I right?"

Yes. Often this is even desirable. For example, suppose you have 10 countries and want to filter by country. If you had distributed by country, all rows for a country would end up in one file, and the mappers for that file would take a long time because it would contain full 256 MB blocks of that country. With ten files, however, the data is aggregated in parallel: each file contains a couple of stripes with that country but can skip most of the data in the block.

"Does ORC maintain an index at the block level or the stripe level?"

It actually does so below the stripe level (10,000 rows at a time), although it has to read the stripe footer. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification

"3. And on 'Optimized', I understand the performance benefit, but it still has more reducers than the normal load, so how does it fix the small-file problem?"

Which one is "normal"? Do you mean bucketed, or map-only? A bucketed load has one reducer per bucket, so with 30 buckets and 40 partitions you end up with 1,200 files, but you wrote them with only 30 reducers, which is slow for large amounts of data. An optimized load with 30 buckets and 40 partitions also produces 1,200 files, but writes them with 1,200 reducers, which on a big cluster is 40x faster. Optimized also sorts the data, so each reducer only needs to keep one open target in memory at a time, using much less RAM than a map-only or bucketed load. (Too much memory use can result in OOM errors or in files with small stripes.)

"4. Maybe PPD is only for the ORC format, but the other concepts of partitioning, bucketing, and optimized loading apply to other formats as well?"

True. PPD only applies to ORC and Parquet, but partitioning, bucketing, etc. are valid for every format. The only difference is that delimited or SequenceFile writers do not need to keep a memory buffer per open file, so they do not have the same problems with OOM errors or small stripes.
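The sub-stripe indexing described above can be sketched in a few lines. This is an illustrative model, not the ORC reader API: each row group (10,000 rows in ORC) records the min and max of a column, and an equality filter only reads groups whose range could contain the value, which is why sorted data lets most groups be skipped.

```python
def build_row_groups(rows, group_size=10_000):
    """Split rows into groups and record min/max per group (ORC-style index)."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        groups.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return groups

def filter_equals(groups, value):
    """Scan only the row groups whose [min, max] range can match the value."""
    matched, groups_read = [], 0
    for g in groups:
        if g["min"] <= value <= g["max"]:   # cheap index check
            groups_read += 1                # expensive actual read
            matched.extend(r for r in g["rows"] if r == value)
    return matched, groups_read

# Sorted data clusters equal values, so 9 of the 10 groups are skipped.
sorted_rows = list(range(100_000))
groups = build_row_groups(sorted_rows)
hits, read = filter_equals(groups, 42)
```

With unsorted input, the value 42 could fall inside every group's min/max range, forcing a read of all ten groups; sorting is what makes the index selective.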
03-25-2016
08:09 PM
1 Kudo
Hi,

"Standard load": the daily, hourly, or weekly load you normally do during operations. This differs from an initial load since it normally loads into one partition.

Mappers/small files: that is correct. If your data contains many target partitions, you get lots of small files. For example, 30 country partitions and 100 blocks of input data result in 3,000 target files, each roughly 128 MB / 30, i.e. about 4 MB. That is the reason you should use reducers in that case.

"What is the default key used for distribution if you don't use the distribute by clause?"

If you don't use that clause, you either have just mappers (and no distribution), or some other part of the query defines the reducer key, like a group by, a join, or the sort setting discussed.

"Slide 13": that is just an overview of the three approaches covered afterwards (bucketed, optimized, or manual). The trick is to distribute by a key that fits your partitions, to make sure each reducer writes to only one partition. "Hash conflict" means that two partition keys may randomly end up in the same reducer: distribute by essentially distributes the distribute key randomly (hashed) among reducers.

"Slide 14": you have one reducer for each bucket across all partitions. This means that if you have 3,000 partitions, the ORC writer needs to keep 3,000 buffers. That is the reason for the optimized load.

"Optimized": that is how optimized differs from a normal bucketed load. You want many reducers (otherwise the load is slow), but not too many (otherwise you get small files). If each reducer writes to only one partition, you get the optimal performance/file-size ratio and can tune the file size through the number of reducers.

"Sort": not true. When you sort, you make sure that all values belonging together end up next to each other, so you can skip every block but the one containing the value.
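The "hash conflict" above can be made concrete with a toy sketch. This is an illustrative model of how DISTRIBUTE BY assigns keys to reducers, not Hive's actual hash function: each key is hashed modulo the reducer count, so with more distinct partition keys than reducers, at least two keys must share a reducer, and that reducer's writer then holds buffers for two open partitions at once. The country codes and reducer count are made up.

```python
def reducer_for(key, num_reducers):
    # Hive hashes the distribute key modulo the reducer count; a stable
    # toy hash stands in here (Python's built-in hash() is not stable).
    return sum(ord(c) for c in key) % num_reducers

countries = ["US", "DE", "FR", "IN", "BR"]
num_reducers = 3
assignment = {c: reducer_for(c, num_reducers) for c in countries}
# 5 keys into 3 reducers: by the pigeonhole principle some reducer
# receives at least two partition keys -- a "hash conflict".
```

This is also why distributing by the partition key alone does not guarantee one partition per reducer; the optimized load fixes the file-count side of this by scaling the reducer count instead.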
03-24-2016
01:06 PM
Can you paste the actual message? Normally, vectorization is Hive grouping together x (1024) records to run operations on them at once. This is much more efficient than doing operations row by row on modern CPUs because of cache behaviour and compiler optimizations. https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution Not sure about updates/deletes; they might use vectorization for some functions there too.
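The batching idea can be shown in a rough model. This is a sketch of the execution pattern only, not Hive internals: instead of interpreting an expression once per row, the engine runs a tight loop over a batch of 1024 values from a column, which is far friendlier to CPU caches and compiler optimizations. The expression and data are made up.

```python
BATCH_SIZE = 1024  # Hive's default vectorized batch size

def row_at_a_time(rows):
    # One full expression evaluation per row.
    return [r * 2 + 1 for r in rows]

def vectorized(rows):
    # Same expression, evaluated batch-by-batch in a tight inner loop.
    out = []
    for i in range(0, len(rows), BATCH_SIZE):
        batch = rows[i:i + BATCH_SIZE]
        out.extend(v * 2 + 1 for v in batch)
    return out

data = list(range(5000))
assert row_at_a_time(data) == vectorized(data)  # same result, different shape
```

In pure Python both paths cost about the same; the win in Hive comes from the generated per-batch loops avoiding per-row interpretation overhead.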
03-24-2016
11:51 AM
If you want to investigate this further, there is a Hadoop Streaming example in the book "Hadoop: The Definitive Guide" that might be of help. (They get a list of files, then spin off reducers based on those files and run Linux commands in the reducers. You could essentially do anything you want.)
03-24-2016
11:12 AM
Hello Hoda, yes, you would do basically the same thing. But there are functions on the DStream that do this for you already: saveAsTextFiles and saveAsObjectFiles. As said, they essentially do what you did before, i.e. save each RDD using a timestamp in the filename. @hoda moradi https://spark.apache.org/docs/1.1.1/api/java/org/apache/spark/streaming/dstream/DStream.html
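The timestamp-in-the-filename scheme those methods use can be sketched as follows. This is an illustrative helper, not the Spark API: saveAsTextFiles writes each batch to a path built from a prefix, the batch time, and an optional suffix, which is the same thing as saving each RDD by hand with a timestamped name. The prefix, timestamp, and suffix below are made up.

```python
def batch_output_path(prefix, batch_time_ms, suffix=""):
    """Build a per-batch output path in the saveAsTextFiles style:
    <prefix>-<batch time>[.<suffix>]."""
    name = f"{prefix}-{batch_time_ms}"
    return f"{name}.{suffix}" if suffix else name

# One output directory per streaming batch, named by batch time.
path = batch_output_path("hdfs:///tmp/stream/out", 1458815520000, "txt")
```

Because the batch time is part of the name, successive batches never overwrite each other, which is exactly what doing foreachRDD with a manual timestamp achieves.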
03-23-2016
04:57 PM
1 Kudo
Just in addition to what Artem said (Oozie stores the output of an action in its launcher logs, so you have to drill through the logs): if you want to automatically react to the output of a shell action, you can do that with the capture-output tag. Your shell command needs to print key=value pairs, and Oozie will read them and add them to the flow as variables. So if your load.sh does an "echo output=success" at the end, the flow below goes to success; otherwise the flow is killed.
<action name="load-files">
<shell xmlns="uri:oozie:shell-action:0.2">
...
<exec>load.sh</exec>
<capture-output/>
</shell>
<ok to="check-if-data"/>
<error to="kill"/>
</action>
<decision name="check-if-data">
<switch>
<case to="end">${ wf:actionData('load-files')['output'] eq 'success'}</case>
<default to="kill" />
</switch>
</decision>
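A minimal sketch of what load.sh itself needs to do for capture-output to work: print key=value pairs to stdout, with the "output" key matching the wf:actionData lookup in the decision node. The actual load step is a placeholder here.

```shell
#!/bin/sh
# Placeholder for the real load work; set load_ok based on its outcome.
load_ok=true

# Oozie's <capture-output/> reads key=value pairs from stdout and
# exposes them to the workflow via wf:actionData('load-files')['output'].
if [ "$load_ok" = "true" ]; then
    echo "output=success"
else
    echo "output=failed"
fi
```

Note that captured output has a size limit in Oozie, so keep it to small key=value status lines rather than full log output.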
03-23-2016
04:39 PM
2 Kudos
Not easily. MapReduce by definition groups input files together as it pleases and then writes one output file per mapper/reducer; those are the part files. Pig will not accommodate what you want: the whole stack is designed to put an abstraction layer over the data files it reads. What you could do is something like Hadoop Streaming, or writing your own InputFormat that somehow forwards the file boundaries to the reducers, but that will not be straightforward. https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F So the short answer: while possible, it is not easy. Sorry.
03-23-2016
04:09 PM
2 Kudos
If you use LDAP and not Kerberos, then you need a password. However, you can provide it with a password file, for example beeline -w ~/passfile. You just need to make sure the password file is only accessible by your user, for security. Then you can keep the beeline command in a shell script to execute it with a single command. Passwordless su is done in Linux using the sudoers file; passwordless ssh is done by adding the public key of the caller to the authorized_keys file of the target host, if you need those.
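A short sketch of the password-file setup, assuming LDAP authentication. The file name, JDBC URL, user, and password below are placeholders; the key points are the -w flag and locking the file down to your own user.

```shell
# Store the LDAP password in a file only your user can read.
echo 'my-ldap-password' > "$HOME/.beeline_pass"
chmod 600 "$HOME/.beeline_pass"

# Hypothetical connection; -w tells beeline to read the password
# from the file instead of prompting for it.
# beeline -u "jdbc:hive2://hiveserver2-host:10000/default" \
#         -n myuser -w "$HOME/.beeline_pass" \
#         -e "SELECT 1"
```

Putting the commented beeline line in a small shell script then gives you the single-command, non-interactive invocation described above.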
03-22-2016
11:47 AM
1 Kudo
@hoda moradi
Unfortunately I have no Java example, only Scala; the general difference is just how the fields are handled. The Extractor class is a Java or Scala class that transforms one object into another. For example, if you have columns, you could create a CSV parser that parses the file, applies any transformations, and returns a structured object containing all the fields you need. I think I should do a quick article about that sometime.
var parsedStream = inputStream.mapPartitions {
  records =>
    val extractor = new Extractor(field, regex)
    records.map { record => extractor.parse(record) }
}
03-21-2016
10:33 AM
Hello John, I think there has been some confusion: the jars need to be on the client/HiveServer2 nodes of the cluster, on the local Linux file system, in /usr/hdp/<version>/hive/auxlib. If you put a jar there, you don't need to do another ADD. If you do an ADD, you also need to have the jar on the local file system, this time depending on what you use: with the Hive client, on your client machine; with beeline or JDBC, on the machine running HiveServer2.