Member since
12-11-2015
245
Posts
31
Kudos Received
33
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1121 | 11-19-2025 03:12 PM |
| | 770 | 07-22-2025 07:58 AM |
| | 1685 | 01-02-2025 06:28 AM |
| | 2425 | 08-14-2024 06:24 AM |
| | 4027 | 10-02-2023 06:26 AM |
03-16-2023
03:11 PM
@Me Sorry for the confusion; I see what you mean now. Per https://impala.apache.org/docs/build/html/topics/impala_perf_stats.html#perf_stats_incremental : "In Impala 2.1.0 and higher, you can use the COMPUTE INCREMENTAL STATS and DROP INCREMENTAL STATS commands. The INCREMENTAL clauses work with incremental statistics, a specialized feature for partitioned tables. When you compute incremental statistics for a partitioned table, by default Impala only processes those partitions that do not yet have incremental statistics. By processing only newly added partitions, you can keep statistics up to date without incurring the overhead of reprocessing the entire table each time." So the drop-statistics step is intended for plain COMPUTE INCREMENTAL STATS, not for COMPUTE INCREMENTAL STATS with a PARTITION clause. May I know which version of CDP you are using, so that I can test on my end and confirm for you?
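To illustrate the two variants side by side (the table name sales and partition column year are hypothetical; statement syntax is per the Impala docs linked above):

```sql
-- Table-level: Impala scans only partitions that do not yet have incremental stats
COMPUTE INCREMENTAL STATS sales;

-- Partition-level: restrict the computation to one specific partition
COMPUTE INCREMENTAL STATS sales PARTITION (year = 2023);
```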
03-14-2023
09:14 AM
Hi, This statement in the doc, "In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition.", applies specifically to a partition for which stats are already available but to which you have added more data. If you are unsure whether stats exist for a partition, you can run show table stats <table_name>; and check the "Incremental stats" column: Query: show table stats test_part
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
| b | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
| 1 | 0 | 1 | 0B | NOT CACHED | NOT CACHED | TEXT | false | hdfs://xxxx:8020/user/hive/warehouse/test_part/b=1 |
| Total | -1 | 1 | 0B | 0B | | | | |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
Fetched 2 row(s) in 5.60s If false, you can run COMPUTE INCREMENTAL STATS with the PARTITION clause. If true and you have added more data to this partition, then you have to drop the stats first and then run COMPUTE INCREMENTAL STATS with the PARTITION clause.
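The refresh-then-recompute sequence the doc describes, sketched for the test_part table and partition b=1 shown in the output above:

```sql
-- New files were added to an existing partition: refresh the table metadata first
REFRESH test_part;

-- Drop the now-stale incremental stats for just that partition
DROP INCREMENTAL STATS test_part PARTITION (b = 1);

-- Recompute incremental stats for the changed partition only
COMPUTE INCREMENTAL STATS test_part PARTITION (b = 1);
```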
02-22-2023
12:12 PM
Hi. Yes, this is expected when multiple users share a common path for the TGT cache. Can you make the location unique for each user? I haven't tested it, but I see an option in this link: https://gpdb.docs.pivotal.io/6-3/admin_guide/kerberos-win-client.html "Set up the Kerberos credential cache file. On the Windows system, set the environment variable KRB5CCNAME to specify the file system location of the cache file. The file must be named krb5cache. This location identifies a file, not a directory, and should be unique to each login on the server. When you set KRB5CCNAME, you can specify the value in either a local user environment or within a session. For example, the following command sets KRB5CCNAME in the session: set KRB5CCNAME=%USERPROFILE%\krb5cache"
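As an untested sketch of that suggestion: %USERPROFILE% resolves to a per-user directory, so pointing the cache there makes the path unique for each login.

```bat
rem Per-session: cache file lands under the logged-in user's profile
set KRB5CCNAME=%USERPROFILE%\krb5cache

rem Or persist it for the current user across future sessions
setx KRB5CCNAME "%USERPROFILE%\krb5cache"
```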
01-09-2023
12:20 PM
2 Kudos
You can set a quota on /tmp. Once the quota is reached, further writes to the directory will fail. https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/scaling-namespaces/topics/hdfs-set-quotas-cm.html has the steps to enable quotas.
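As a command-line sketch (the limits here are example values; run these as the HDFS superuser):

```shell
# Cap /tmp at 1,000,000 names (files + directories combined)
hdfs dfsadmin -setQuota 1000000 /tmp

# Or cap raw disk usage at 100 GB (this counts all replicas)
hdfs dfsadmin -setSpaceQuota 100g /tmp

# Verify: the QUOTA / SPACE_QUOTA columns appear in the count output
hdfs dfs -count -q /tmp
```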
04-01-2020
11:19 PM
Hi @Amn_468 Please configure it in CM > HDFS > Configuration >
Java Heap Size of NameNode in Bytes
Enter a value per your requirement
Save and restart
03-23-2020
03:59 AM
"although same property (dfs.datanode.balance.max.concurrent.moves) already exists in Cloudera Manager." --> Okay, I assume you are referring to the one highlighted in the screenshot below. Yes, it is unnecessary to add dfs.datanode.balance.max.concurrent.moves in the Balancer Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml if you have already used the "Maximum Concurrent Moves" field. Also note that "Maximum Concurrent Moves" is scoped only to the balancer, not to the DataNodes. So for the DataNodes you have to set it explicitly using the "DataNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml". The reason for adding this property on both the balancer and the DataNodes is explained in my previous comment. Hope that clarifies; let me know if there are further questions. I will raise an internal Jira to correct the document and avoid the duplicate entry in the balancer safety valve.
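For reference, the DataNode-side safety-valve entry would look like the following (50 is only an example value, not a recommendation):

```xml
<!-- DataNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml -->
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>50</value> <!-- example value; tune per your requirement -->
</property>
```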
03-22-2020
11:32 PM
Yes, you can install CM offline after downloading the packages. It is documented in this link: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ig_create_local_package_repo.html#internal_package_repo Once the repo is ready, you can install the binaries using the steps in this link: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/install_cloudera_packages.html#id_z2h_pnm_25
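On a RHEL/CentOS host the repo creation boils down to roughly the following. This is only a sketch assuming yum and a local web server serving /var/www/html; the directory path is an example, and the linked doc remains the authoritative procedure.

```shell
# Place the downloaded CM .rpm files in a directory served by the web server
mkdir -p /var/www/html/cloudera-repos/cm6
# (copy the downloaded .rpm files into that directory)

# Generate the yum repository metadata
createrepo /var/www/html/cloudera-repos/cm6
```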
03-22-2020
08:48 PM
You will need to tune your heap in accordance with the number of files. The tuning guideline is in this document: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_command-line-installation/content/configuring-namenode-heap-size.html If you would like to get a count of the files, you may run hdfs dfs -count /
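For reference, the count command prints four columns, so the file count is the second number:

```shell
# Count directories, files, and bytes under the HDFS root
hdfs dfs -count /
# Output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
```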
03-22-2020
08:16 PM
Just a correction: the document suggests tuning the property dfs.datanode.balance.max.concurrent.moves, not dfs.datanode.ec.reconstruction.xmits.weight. Regarding the question of why to add dfs.datanode.balance.max.concurrent.moves again when it is already present on the DataNode and balancer: the doc says "Add the following code to the configuration field, for example, setting the value to 50." That is, 50 is just an example number; the document does not mandate setting this value to 50. You can tune it to any value your requirement calls for. Then why add it on both the balancer and the DataNode? Setting it on the HDFS Balancer (client) gives you the flexibility to change the value on the client side at runtime, i.e., you can set the property to a value less than or equal to what you have configured on the DataNode side. The reason we also set it on the server side is to impose an upper limit on how high the property can be configured: if you configure a value greater than what is set on the DataNode (server), the DataNode rejects it.
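As a sketch of the client-side override described above (the value must stay at or below the DataNode-side setting; 30 is an example):

```shell
# Override the concurrent-moves limit for this balancer run only
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=30
```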
03-22-2020
06:32 AM
The error suggests the DFSClient is unable to read the blocks due to a connection failure: either the ports are blocked or the host is unreachable from that node. From the node on which you ran the code snippet (or on which the executor ran), try reading the file using hdfs commands in debug mode, which can give further clues on which node/service the client was trying to reach prior to the connect timeout: export HADOOP_ROOT_LOGGER=DEBUG,console
hdfs dfs -cat hdfs://ec2-18-234-71-106.compute-1.amazonaws.com:8020/dataset/Tech.csv
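If the debug output points at a specific host and port, a quick reachability check from the same node can confirm whether it is a firewall issue. The first port below is the NameNode RPC port from the hdfs:// URI; the DataNode host is a placeholder, and its data-transfer port varies by Hadoop version.

```shell
# NameNode RPC port taken from the hdfs:// URI above
nc -vz ec2-18-234-71-106.compute-1.amazonaws.com 8020

# DataNode data-transfer port (50010 by default on Hadoop 2.x, 9866 on 3.x)
nc -vz <datanode-host> 50010
```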