Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7358 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3672 | 08-03-2016 04:44 PM |
| | 7211 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
03-26-2016
12:23 AM
1 Kudo
"But within each of those files, sorting may help to skip blocks. Am I right?"

Yes. Often this is even desirable. For example, suppose you have 10 countries and want to filter by country. If you had distributed by country, all rows for a country would end up in one file, and the mappers for that file would take a long time because it would contain full 256 MB blocks of that country. With ten files, however, the data is aggregated in parallel: each file contains a couple of stripes with that country but can skip most of the data in the block.

"Does ORC maintain an index at the block level or the stripe level?"

It actually does so below the stripe level (10,000 rows at a time), although it has to read the stripe footer. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification

"3. And on 'Optimized', I understand the performance benefit, but it still has more reducers than the normal load, so how does it fix the small-file problem?"

Which one is "normal"? Do you mean bucketed, or map-only? A bucketed load has one reducer per bucket, so with 30 buckets and 40 partitions you end up with 1,200 files, but you wrote them with only 30 reducers, which is slow for large amounts of data. An optimized load with 30 buckets and 40 partitions also produces 1,200 files, but writes them with 1,200 reducers, which on a big cluster is 40x faster. Optimized also sorts the data, so each reducer only needs to keep one open target in memory at a time, using much less RAM than a map-only or bucketed load. (Too much memory use can result in OOM errors or in files with small stripes.)

"4. Maybe PPD is only for the ORC format, but the other concepts of partitioning, bucketing, and optimized loading apply to other formats as well?"

True. PPD only applies to ORC and Parquet, but partitioning, bucketing, etc. are valid for every format. The only difference is that delimited or SequenceFile writers do not need to keep a memory buffer per open file, so they do not have the same problems with OOM errors or small stripes.
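The sub-stripe indexing described above can be sketched in a few lines. This is an illustrative model, not the ORC reader API: each row group (10,000 rows in ORC) records the min and max of a column, and an equality filter only reads groups whose range could contain the value, which is why sorted data lets most groups be skipped.

```python
def build_row_groups(rows, group_size=10_000):
    """Split rows into groups and record min/max per group (ORC-style index)."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        groups.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return groups

def filter_equals(groups, value):
    """Scan only the row groups whose [min, max] range can match the value."""
    matched, groups_read = [], 0
    for g in groups:
        if g["min"] <= value <= g["max"]:   # cheap index check
            groups_read += 1                # expensive actual read
            matched.extend(r for r in g["rows"] if r == value)
    return matched, groups_read

# Sorted data clusters equal values, so 9 of the 10 groups are skipped.
sorted_rows = list(range(100_000))
groups = build_row_groups(sorted_rows)
hits, read = filter_equals(groups, 42)
```

With unsorted input, the value 42 could fall inside every group's min/max range, forcing a read of all ten groups; sorting is what makes the index selective.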
03-25-2016
08:09 PM
1 Kudo
Hi,

"Standard load": the daily, hourly, or weekly load you normally do during operations. This differs from an initial load since it normally loads into one partition.

Mappers/small files: that is correct. If your data contains many target partitions, you get lots of small files. For example, 30 country partitions and 100 blocks of input data result in 3,000 target files, each roughly 128 MB / 30, i.e. about 4 MB. That is the reason you should use reducers in that case.

"What is the default key used for distribution if you don't use the distribute by clause?"

If you don't use that clause, you either have just mappers (and no distribution), or some other part of the query defines the reducer key, like a group by, a join, or the sort setting discussed.

"Slide 13": that is just an overview of the three approaches covered afterwards (bucketed, optimized, or manual). The trick is to distribute by a key that fits your partitions, to make sure each reducer writes to only one partition. "Hash conflict" means that two partition keys may randomly end up in the same reducer: distribute by essentially distributes the distribute key randomly (hashed) among reducers.

"Slide 14": you have one reducer for each bucket across all partitions. This means that if you have 3,000 partitions, the ORC writer needs to keep 3,000 buffers. That is the reason for the optimized load.

"Optimized": that is how optimized differs from a normal bucketed load. You want many reducers (otherwise the load is slow), but not too many (otherwise you get small files). If each reducer writes to only one partition, you get the optimal performance/file-size ratio and can tune the file size through the number of reducers.

"Sort": not true. When you sort, you make sure that all values belonging together end up next to each other, so you can skip every block but the one containing the value.
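The "hash conflict" above can be made concrete with a toy sketch. This is an illustrative model of how DISTRIBUTE BY assigns keys to reducers, not Hive's actual hash function: each key is hashed modulo the reducer count, so with more distinct partition keys than reducers, at least two keys must share a reducer, and that reducer's writer then holds buffers for two open partitions at once. The country codes and reducer count are made up.

```python
def reducer_for(key, num_reducers):
    # Hive hashes the distribute key modulo the reducer count; a stable
    # toy hash stands in here (Python's built-in hash() is not stable).
    return sum(ord(c) for c in key) % num_reducers

countries = ["US", "DE", "FR", "IN", "BR"]
num_reducers = 3
assignment = {c: reducer_for(c, num_reducers) for c in countries}
# 5 keys into 3 reducers: by the pigeonhole principle some reducer
# receives at least two partition keys -- a "hash conflict".
```

This is also why distributing by the partition key alone does not guarantee one partition per reducer; the optimized load fixes the file-count side of this by scaling the reducer count instead.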
03-24-2016
01:06 PM
Can you paste the actual message? Normally, vectorization is Hive grouping together x (1024) records to run operations on them at once. This is much more efficient than doing operations row by row on modern CPUs because of cache behaviour and compiler optimizations. https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution Not sure about updates/deletes; they might use vectorization for some functions there too.
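The batching idea can be shown in a rough model. This is a sketch of the execution pattern only, not Hive internals: instead of interpreting an expression once per row, the engine runs a tight loop over a batch of 1024 values from a column, which is far friendlier to CPU caches and compiler optimizations. The expression and data are made up.

```python
BATCH_SIZE = 1024  # Hive's default vectorized batch size

def row_at_a_time(rows):
    # One full expression evaluation per row.
    return [r * 2 + 1 for r in rows]

def vectorized(rows):
    # Same expression, evaluated batch-by-batch in a tight inner loop.
    out = []
    for i in range(0, len(rows), BATCH_SIZE):
        batch = rows[i:i + BATCH_SIZE]
        out.extend(v * 2 + 1 for v in batch)
    return out

data = list(range(5000))
assert row_at_a_time(data) == vectorized(data)  # same result, different shape
```

In pure Python both paths cost about the same; the win in Hive comes from the generated per-batch loops avoiding per-row interpretation overhead.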
03-24-2016
11:51 AM
If you want to investigate this further, there is a Hadoop Streaming example in the book "Hadoop: The Definitive Guide" that might be of help. (They get a list of files, then spin off reducers based on those files and run Linux commands in the reducers. You could essentially do anything you want.)
03-24-2016
11:12 AM
Hello Hoda, yes, you would do basically the same thing. But there are functions on the DStream that do this for you already: saveAsTextFiles and saveAsObjectFiles. As said, they essentially do what you did before, i.e. save each RDD using a timestamp in the filename. @hoda moradi https://spark.apache.org/docs/1.1.1/api/java/org/apache/spark/streaming/dstream/DStream.html
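The timestamp-in-the-filename scheme those methods use can be sketched as follows. This is an illustrative helper, not the Spark API: saveAsTextFiles writes each batch to a path built from a prefix, the batch time, and an optional suffix, which is the same thing as saving each RDD by hand with a timestamped name. The prefix, timestamp, and suffix below are made up.

```python
def batch_output_path(prefix, batch_time_ms, suffix=""):
    """Build a per-batch output path in the saveAsTextFiles style:
    <prefix>-<batch time>[.<suffix>]."""
    name = f"{prefix}-{batch_time_ms}"
    return f"{name}.{suffix}" if suffix else name

# One output directory per streaming batch, named by batch time.
path = batch_output_path("hdfs:///tmp/stream/out", 1458815520000, "txt")
```

Because the batch time is part of the name, successive batches never overwrite each other, which is exactly what doing foreachRDD with a manual timestamp achieves.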
03-23-2016
04:57 PM
1 Kudo
Just in addition to what Artem said (Oozie stores the output of an action in its launcher logs, so you have to drill through the logs): if you want to automatically react to the output of a shell action, you can do that with the capture-output tag. Your shell command needs to print key=value pairs, and Oozie will read them and add them to the flow as variables. So if your load.sh does an "echo output=success" at the end, the flow below goes to success; otherwise the flow is killed.
<action name="load-files">
<shell xmlns="uri:oozie:shell-action:0.2">
...
<exec>load.sh</exec>
<capture-output/>
</shell>
<ok to="check-if-data"/>
<error to="kill"/>
</action>
<decision name="check-if-data">
<switch>
<case to="end">${ wf:actionData('load-files')['output'] eq 'success'}</case>
<default to="kill" />
</switch>
</decision>
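A minimal sketch of what load.sh itself needs to do for capture-output to work: print key=value pairs to stdout, with the "output" key matching the wf:actionData lookup in the decision node. The actual load step is a placeholder here.

```shell
#!/bin/sh
# Placeholder for the real load work; set load_ok based on its outcome.
load_ok=true

# Oozie's <capture-output/> reads key=value pairs from stdout and
# exposes them to the workflow via wf:actionData('load-files')['output'].
if [ "$load_ok" = "true" ]; then
    echo "output=success"
else
    echo "output=failed"
fi
```

Note that captured output has a size limit in Oozie, so keep it to small key=value status lines rather than full log output.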
03-23-2016
04:39 PM
2 Kudos
Not easily. MapReduce by definition groups input files together as it pleases and then writes one output file per mapper/reducer; those are the part files. Pig will not accommodate what you want: the whole stack is designed to put an abstraction layer over the data files it reads. What you could do is something like Hadoop Streaming, or writing your own InputFormat that somehow forwards the file boundaries to the reducers, but that will not be straightforward. https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F So the short answer: while possible, it is not easy. Sorry.
03-23-2016
04:09 PM
2 Kudos
If you use LDAP and not Kerberos, then you need a password. However, you can provide it with a password file, for example beeline -w ~/passfile. You just need to make sure the password file is only accessible by your user, for security. Then you can keep the beeline command in a shell script to execute it with a single command. Passwordless su is done in Linux using the sudoers file; passwordless ssh is done by adding the public key of the caller to the authorized_keys file of the target host, if you need those.
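A short sketch of the password-file setup, assuming LDAP authentication. The file name, JDBC URL, user, and password below are placeholders; the key points are the -w flag and locking the file down to your own user.

```shell
# Store the LDAP password in a file only your user can read.
echo 'my-ldap-password' > "$HOME/.beeline_pass"
chmod 600 "$HOME/.beeline_pass"

# Hypothetical connection; -w tells beeline to read the password
# from the file instead of prompting for it.
# beeline -u "jdbc:hive2://hiveserver2-host:10000/default" \
#         -n myuser -w "$HOME/.beeline_pass" \
#         -e "SELECT 1"
```

Putting the commented beeline line in a small shell script then gives you the single-command, non-interactive invocation described above.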
03-22-2016
11:47 AM
1 Kudo
@hoda moradi
Unfortunately I have no Java example, only Scala; the general difference is just how the fields are handled. The Extractor class is a Java or Scala class that transforms one object into another. For example, if you have columns, you could create a CSV parser that parses the file, applies any transformations, and returns a structured object containing all the fields you need. I think I should do a quick article about that sometime.
var parsedStream = inputStream.mapPartitions {
  records =>
    val extractor = new Extractor(field, regex)
    records.map { record => extractor.parse(record) }
}
03-21-2016
10:33 AM
Hello John, I think there has been some confusion: the jars need to be on the client/HiveServer2 nodes of the cluster, on the local Linux file system, in /usr/hdp/<version>/hive/auxlib. If you put a jar there, you don't need to do another ADD. If you do an ADD, you also need to have the jar on the local file system, this time depending on what you use: with the Hive client, on your client machine; with beeline or JDBC, on the machine running HiveServer2.