Member since: 07-16-2015
Posts: 177
Kudos Received: 28
Solutions: 19
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 14053 | 11-14-2017 01:11 AM
 | 60526 | 11-03-2017 06:53 AM
 | 4298 | 11-03-2017 06:18 AM
 | 13507 | 09-12-2017 05:51 AM
 | 1982 | 09-08-2017 02:50 AM
01-27-2017
01:09 AM
1 Kudo
We also do not like this "magic number", but we find it useful. I think you should at least investigate your cluster when you get that warning, in order to check that you do not have the "too many small files" issue. Even if we are not satisfied with the configured threshold, it is still useful as a reminder (and should only be considered as such). Having too many small files can also hurt performance, since MapReduce instantiates one separate mapper per block to read (if you use that data in jobs). By the way, for investigating this I often use the "fsck" utility. When used with a path, it gives you the block count, the total size, and the average block size. This lets you find out whether a given part of your HDFS storage has too many small files. When you have 200,000 blocks under a path with an average block size of 2 MB, that is a good indicator of having too many small files.
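As a sketch of that check (the path is a placeholder and the exact summary wording varies by Hadoop version):

hdfs fsck /user/etl/output | grep -E 'Total size|Total blocks'
# A summary like "Total blocks (validated): 200000 (avg. block size 2097152 B)"
# would point to a small-files problem under that path.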
01-27-2017
12:42 AM
2 Kudos
Did you try to drop the partition using a Hive query? It should look like this:

ALTER TABLE <table_name> DROP PARTITION (<partition_col_name>='<value>');

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropPartitions

If it does not delete the data, you will need to delete the partition's directory (in HDFS) after dropping it with the Hive query.
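As a minimal sketch, assuming a "sales" table partitioned by "dt" and the default warehouse location (both are placeholders):

hive -e "ALTER TABLE sales DROP PARTITION (dt='2017-01-01');"
# For an EXTERNAL table the files stay behind; remove the partition
# directory manually (check the table's LOCATION first):
hadoop fs -rm -r /user/hive/warehouse/sales/dt=2017-01-01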
01-19-2017
03:23 AM
I don't think Impala has such a feature (but I could be wrong). If I were you, I would try to answer these questions:
- "Why do I need this kind of output?"
- "What do I use it for?"
- "Can't I achieve my goal with another output?"
Maybe you will find another approach better suited. By the way, I guess something like this would be better (but it will not make a huge difference):

SELECT a AS col FROM tmp
UNION ALL SELECT b AS col FROM tmp
UNION ALL SELECT c AS col FROM tmp
UNION ALL SELECT d AS col FROM tmp
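If you want to try the rewrite quickly, you can run it from impala-shell (the host is a placeholder; tmp and its columns come from your question):

impala-shell -i impala-host.example.com -q "SELECT a AS col FROM tmp UNION ALL SELECT b AS col FROM tmp;"
# Prefix the query with EXPLAIN inside the shell to compare the plans of both forms.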
01-17-2017
08:27 AM
Hive is not Oracle; you should not expect the same processing capabilities. Hive is designed to run long and heavy queries, whereas it performs poorly for small queries like the one you are trying to optimize. Also note that Hive runs on top of YARN, and by design YARN takes time to instantiate containers and the JVMs inside those containers. This should answer your question "why does it take so much time to start the query job". If you want a quick reply for a basic count(*) without any filter/condition, you might want to read about Hive statistics.
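As a minimal sketch of the statistics route ("mytable" is a placeholder; hive.compute.query.using.stats lets Hive answer a bare count(*) from metadata instead of launching a job):

hive -e "
SET hive.compute.query.using.stats=true;
ANALYZE TABLE mytable COMPUTE STATISTICS;
SELECT COUNT(*) FROM mytable;"
# Once the statistics are up to date, the count(*) returns almost instantly.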
01-10-2017
07:58 AM
What kind of web-based interface do you need from HiveServer2? If it is a user interface for querying Hive, then HiveServer2 does not provide one OOTB. But know that Hue uses HiveServer2 when you submit Hive queries inside the "Hive Editor".
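For completeness, HiveServer2 exposes a JDBC/Thrift endpoint rather than a UI; the standard command-line client is beeline (the hostname is a placeholder, 10000 is the usual default port):

beeline -u "jdbc:hive2://hs2-host.example.com:10000/default"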
12-19-2016
06:06 AM
Thx, that was interesting to know!
12-19-2016
05:51 AM
Hi, if it's not working only on the edge nodes, then there might be some configuration issue causing that. What difference do you make between "cluster nodes" and "edge nodes"? Meaning: what roles are assigned to your edge nodes?
- For example, did you assign the HDFS and YARN "gateway" roles to your edge nodes?
- If not, try doing it.
- If yes, try redeploying the client configuration (a quick sanity check is sketched below).
It might be something else.
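As a quick sanity check on an edge node (paths are CDH defaults and may differ on your cluster):

hadoop fs -ls /    # should list the HDFS root, not the local filesystem
grep -A1 fs.defaultFS /etc/hadoop/conf/core-site.xml    # should show the cluster's NameNode URI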
12-19-2016
05:30 AM
You are right, I just tested it and there is no need for additional settings (other than having initialized the Kerberos ticket). From what I read in your first post, it seems the same job does run successfully for users whose HDFS home folder is not encrypted (for the same Kerberos realm)? If that is the case, I would open an SR ticket in your shoes. It would be the quickest way to get feedback from Cloudera on the matter (whether there is an incompatibility or some particular setting for this specific use case).
12-19-2016
05:09 AM
2 Kudos
If you want to "drop" the categories table you should run an hive query like this : DROP TABLE categories; If you want to "delete" the content of the table only then try "TRUNCATE TABLE categories;". It should work or try deleting the table content in HDFS directly. As for your use of "hadoop fs", you should know that "hadoop fs -ls rm" does not exist. For deleting HDFS files or folders it is directly "hadoop fs -rm".
12-19-2016
03:33 AM
Hi, I don't know if the map/reduce job you are submitting is Kerberos compatible. That is the first check to do. Then, if the job is Kerberos compatible, it might need some settings, like supplying a JAAS configuration; the kinit of a ticket is sometimes not enough. For example, when running the map/reduce job "MapReduceIndexerTool", you need to supply a JAAS configuration:

HADOOP_OPTS="-Djava.security.auth.login.config=/home/user/jaas.conf" \
hadoop jar MapReduceIndexerTool

See: https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_sg_search_security.html
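For reference, a minimal jaas.conf sketch that reuses the kinit ticket cache (the principal is a placeholder; the "Client" section name is what the linked Cloudera Search documentation uses):

cat > /home/user/jaas.conf <<'EOF'
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=false
  useTicketCache=true
  principal="user@EXAMPLE.COM";
};
EOF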