Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
| 13482 | 02-20-2018 12:33 PM
| 1531 | 02-19-2018 05:12 AM
| 1888 | 12-28-2017 06:13 AM
| 7187 | 09-28-2017 09:25 AM
| 12229 | 09-25-2017 11:19 AM
05-12-2017
07:49 AM
@mqureshi @Neeraj Sabharwal @Jay SenSharma Could anyone help me with this, please? Thanks in advance!
05-11-2017
03:59 PM
Hi @n c HCatalog holds a table's metadata: schema, indexes, roles, structure, bucketing, partition keys, columns, privileges, when the table was created, by whom it was created, and so on. However, it does not hold details such as how many records are stored in each table.
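For illustration, a minimal PySpark sketch of pulling that metadata (assuming a Hive-enabled Spark session; db.events is a placeholder table name):

```python
from pyspark.sql import SparkSession

# Hive-enabled session; "db.events" is a hypothetical table name.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Structural metadata (schema, partition keys, owner, creation time, storage info)
# comes from the metastore that HCatalog exposes.
spark.sql("DESCRIBE FORMATTED db.events").show(truncate=False)

# A record count is not part of that metadata; it has to be computed by a query.
print(spark.table("db.events").count())
```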
05-11-2017
01:56 PM
Consider I have a file of size 150 MB and I am loading it into a Hive table. The block size in HDFS is 128 MB. Now, how will the files be laid out underneath the Hive table? I believe the data will be split and loaded as 0000_0, 0000_1, etc. Why is it split into multiple chunks? Does each file correspond to the block size? Do the block size and the mapred split size have anything to do with the file size in Hive? If I alter the mapred split size, will the file sizes underneath Hive change? Do we have any control over the number of files created while loading a Hive table? I understand that through a merge MapReduce job we can reduce/increase it. Say I need only 10 files to be created, and not more than that, while loading a Hive table. Is that even possible? Thanks in advance.
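For illustration, the split arithmetic behind the 150 MB / 128 MB figures above (a small sketch, not part of the original environment):

```python
import math

# A 150 MB file with a 128 MB HDFS block size occupies ceil(150/128) = 2 blocks:
# one full 128 MB block plus one ~22 MB block. Each map task normally processes
# one split (roughly one block), so a plain load tends to leave one output file
# per map task, which is why multiple 0000_N files appear under the table.
file_mb, block_mb = 150, 128
print(math.ceil(file_mb / block_mb))  # 2
```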
Labels:
- Apache Hadoop
- Apache Hive
05-11-2017
12:23 PM
@Ashnee Sharma Good article. Thanks for sharing!
05-11-2017
11:53 AM
1 Kudo
Hi @Venkatesan G For 40-50 million files the block size is 256 MB, which is twice 128 MB. Naturally the number of blocks created per file decreases, and in turn the block details stored in the NameNode are also reduced. That's why only 24 GB is recommended. If you increase the block size further, the recommended size would decrease further still: the block size is inversely proportional to the recommended memory. I hope this helps.
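For a rough sense of the proportions, a back-of-the-envelope sketch; the ~150 bytes per NameNode object is a commonly quoted rule of thumb, and the 50 million files / 512 MB average file size are illustrative assumptions, not figures from this thread:

```python
import math

# Rule-of-thumb NameNode memory estimate: one ~150-byte object per file plus
# one per block. Doubling the block size halves the block objects per file.
BYTES_PER_OBJECT = 150  # rule of thumb, not an exact figure

def namenode_metadata_gb(num_files, avg_file_mb, block_mb):
    blocks_per_file = math.ceil(avg_file_mb / block_mb)
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT / 1024 ** 3

for block_mb in (128, 256, 512):
    print(f"{block_mb} MB blocks -> ~{namenode_metadata_gb(50_000_000, 512, block_mb):.1f} GB")
```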
05-10-2017
07:33 PM
1 Kudo
Hi @Ashnee Sharma As the log says, the provided credentials could not be identified. The cause may fall under any of the below:
- The user set the $KRB5_CONFIG environment variable to something other than the default value of /etc/krb5.conf. The HDFS client will not source $KRB5_CONFIG from the user's shell.
- /etc/gphd/hadoop/conf/hdfs-site.xml does not have the correct Kerberos configuration for the NameNode HDFS principal.
- DNS does not resolve the correct fully qualified domain name.
- The Kerberos client library version does not match the server.
- Kerberos encryption defaults differ between the client and the KDC.

Refer to this link for how to overcome it: https://discuss.pivotal.io/hc/en-us/articles/202210763-The-Secure-HDFS-Error-No-valid-credentials-provided-Displays-when-Running-HDFS-DFS-or-Hadoop-FS Hope it helps in solving your issue.
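For illustration, a few read-only sanity checks that map to the causes above (a sketch only; namenode.example.com is a placeholder hostname):

```python
import os
import socket
import subprocess

# 1. Is KRB5_CONFIG pointing somewhere non-default?
print("KRB5_CONFIG =", os.environ.get("KRB5_CONFIG", "/etc/krb5.conf (default)"))

# 2. Is there a valid, unexpired ticket? (klist is the standard MIT Kerberos tool.)
subprocess.run(["klist"], check=False)

# 3. Does DNS resolve the NameNode's fully qualified domain name?
print(socket.gethostbyname("namenode.example.com"))
```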
05-10-2017
07:23 PM
@Dinesh Chitlangia sort and orderBy are the same as far as Spark is concerned; they work in the same way. However, in Hive (or any other database) SORT BY and ORDER BY behave quite differently. If you want to know the differences in Hive, refer to this link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
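A quick PySpark sketch of the point (hypothetical three-row DataFrame):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, "c"), (1, "a"), (2, "b")], ["id", "name"])

# In the DataFrame API, orderBy is simply an alias for sort: both give a total ordering.
df.sort("id").show()
df.orderBy("id").show()

# The Hive-style SORT BY (ordering within each reducer/partition only) corresponds
# to sortWithinPartitions instead.
df.sortWithinPartitions("id").show()
```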
05-09-2017
03:38 PM
@mÁRIO Rodrigues Yes, even though it shows up as .deflate, it is ORC in a compressed state. I think you will be able to read the data through the Hive table in Spark SQL, but you can't use the underlying files directly because they are compressed. If you want to read the files directly, load the Hive table without any compression and then Spark can make use of the files underneath.
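For illustration, a minimal sketch of the route suggested above, reading through the Hive table definition rather than the raw files (db.orc_table is a placeholder):

```python
from pyspark.sql import SparkSession

# Going through the table lets Spark SQL pick up the storage format and
# compression details from the metastore ("db.orc_table" is hypothetical).
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.table("db.orc_table").show(5)
```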
05-09-2017
01:17 PM
@Amol Kulkarni Yes, that's a known one. The hash function in Hive behaves like a hash algorithm or hash-sort logic in data structures: just as the modulo of any odd number by 2 is always 1, the two values you provided end up with the same hash value. To generate unique values, use the md5() function in Hive instead. However, I would suggest not using this logic to generate a primary key for a table, as the values coming out of md5() will be a total mess.
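For illustration, the same collision-versus-digest contrast using Spark's built-in functions (an analogy only: Spark's hash() is Murmur3, not Hive's hash(); the two sample values are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("valueA",), ("valueB",)], ["col"])

df.select(
    "col",
    F.hash("col").alias("int_hash"),  # 32-bit integer space, so collisions are possible
    F.md5("col").alias("md5_hex"),    # 128-bit digest, collisions practically never happen
).show(truncate=False)
```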
05-08-2017
06:28 AM
@mÁRIO Rodrigues Deflate is not a format; if a file is stored in a compressed state, its extension in HDFS will be shown as .deflate. As you stated, ORC performs better when loading the table; Parquet and Avro also serve their own purposes. When I tested a table with 3 billion records, the load times for the Hive table by format, in ascending order, were ORC, Avro, Parquet, with ORC taking the least time during loading. But if your file format is dynamic, it's better to go with Parquet/Avro.
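If you want to repeat that comparison on your own data, a rough sketch (the paths and source table are placeholders, the 3-billion-row timing above came from my own test, and format("avro") needs the spark-avro package on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
src = spark.table("db.source_table")  # hypothetical source table

# Write the same data in each format and time the runs yourself.
for fmt in ("orc", "parquet", "avro"):
    src.write.format(fmt).mode("overwrite").save(f"/tmp/format_test/{fmt}")
```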