Member since
11-04-2014
13
Posts
0
Kudos Received
0
Solutions
06-06-2018
11:25 PM
Hey @Nick Xu! Not sure if you're still facing this issue, but could you try raising the values of tez.am.resource.memory.mb and tez.task.resource.memory.mb? Hope this helps! 🙂
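For reference, a minimal sketch of raising both properties for a single Hive session. The 4096 MB values are only illustrative assumptions, not recommendations; whatever you choose has to fit within your YARN container limits (for example yarn.scheduler.maximum-allocation-mb), and the AM setting may only take effect when a new Tez session is started.

```sql
-- Illustrative values only (assumption): size these to your cluster and workload.
-- Memory for the Tez ApplicationMaster, in MB.
SET tez.am.resource.memory.mb=4096;
-- Memory for each Tez task container, in MB.
SET tez.task.resource.memory.mb=4096;
```

To make the change permanent you would normally set the same properties in tez-site.xml rather than per session.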
12-19-2018
05:26 PM
@Nick Xu We are seeing similar behaviour: HiveServer2 startup is taking 30+ minutes. How did you disable the validator?
04-26-2017
01:15 PM
@Nick Xu, I don't think there is a straightforward answer to your question. The choice of approach will depend heavily on the data in those four columns and on how you store your data. If you use ORC and Tez, there are some articles about optimizations that can be done at the data and metadata level. To your approaches:
1) Join on all 4 columns: pick the field with the fewest repeating values and insert the data sorted by it (or even ordered, if that is not too heavy an operation for your cluster), run "analyze table", and use CBO for the query.
2) Same as #1. The concatenated field will be unique, which makes it a perfect case for tuning stripe and stride size to get the best performance on lookup / join.
3) What is "binary"? If you mean any type of hashing, see #4. Otherwise, if the number of bytes in the binary representation stays the same as in the string, then in my opinion it just doesn't matter; keep strings, they are easier to debug and audit.
4) Hashing can produce collisions, so you will need to check the original fields anyway. And I don't think you gain anything from having shorter values in that column, whether they come from hashing, encoding, or encryption.
So I would go with either #1 or #2; again, it depends on your data.
References: https://community.hortonworks.com/content/kbentry/75501/orc-creation-best-practices.html
At the end of that article I provided more links; check them and make your choice. Good luck in finding your way 🙂
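To make approach #1 a bit more concrete, here is a rough HiveQL sketch. All table and column names (facts_raw, facts, dims, c1..c4, payload, attr) are hypothetical, c1 is assumed to be the column with the fewest repeating values, and the stripe and stride numbers are placeholders to tune against your own data rather than recommended settings.

```sql
-- Hypothetical schema (assumption): adjust names and types to your tables.
-- ORC table with explicit stripe size (bytes) and row-index stride (rows).
CREATE TABLE facts (c1 STRING, c2 STRING, c3 STRING, c4 STRING, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="67108864", "orc.row.index.stride"="10000");

-- Load the data sorted by the most selective join column.
-- SORT BY sorts within each reducer; ORDER BY would give a total order at higher cost.
INSERT OVERWRITE TABLE facts
SELECT c1, c2, c3, c4, payload
FROM facts_raw
SORT BY c1;

-- Collect table-level and column-level statistics for the optimizer.
ANALYZE TABLE facts COMPUTE STATISTICS;
ANALYZE TABLE facts COMPUTE STATISTICS FOR COLUMNS;

-- Enable the cost-based optimizer and let it use the collected statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Join on all four columns.
SELECT f.payload, d.attr
FROM facts f
JOIN dims d
  ON f.c1 = d.c1 AND f.c2 = d.c2 AND f.c3 = d.c3 AND f.c4 = d.c4;
```

For approach #2 the same pattern should apply, with the concatenated field taking the place of c1 as the sort column.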