About robert12231

robert12231 · ‎06-01-2026

A few things stand out from the numbers you've shared. On a 64-core machine, ingesting ~11 million rows in 17 minutes (around 10–11K rows/sec) is significantly below what I'd expect if the workload were effectively parallelized. Before focusing on CPU count, I'd investigate where the bottleneck actually is.

Some areas worth checking:

- Storage throughput: Is the data being written to local SSDs, network-attached storage, or slower disks? Ingest workloads are often I/O-bound rather than CPU-bound.
- File size and partitioning strategy: Large numbers of small files can severely impact write performance.
- Compression settings: Certain codecs provide better compression but consume more CPU during ingest.
- Thread parallelism: Verify that the ingestion framework is actually utilizing all available cores rather than being limited by a small worker pool.
- Memory pressure and GC activity: If the JVM is spending significant time in garbage collection, additional CPU cores won't help much.
- Network throughput: If data is being pulled from a remote source, the bottleneck may be upstream rather than on the ingest node itself.

I'd also recommend collecting:

- CPU utilization during ingest
- Disk IOPS and throughput metrics
- Memory usage and GC logs
- Number of concurrent ingest tasks
- Average file size being generated

One quick diagnostic is to look at overall CPU utilization. If the machine is only using 10–20% of available CPU during ingest, then the workload is likely blocked on I/O, synchronization, network transfers, or application-level limits rather than raw compute capacity.

Can you share:

- Which ingestion tool/framework you're using?
- The storage type (SSD, NVMe, HDD, cloud volume, etc.)?
- Average CPU utilization during the 17-minute ingest?
- Whether the target table is partitioned and, if so, by what key?

Those details would make it much easier to determine whether the bottleneck is CPU, disk, network, or configuration-related.
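To make the arithmetic and the quick CPU check concrete, here is a minimal Python sketch. The function names and the 20% cut-off parameter are my own choices for illustration (the cut-off simply mirrors the 10–20% range mentioned above), and the classification is a rough first-pass heuristic, not a substitute for real disk, network, and GC metrics.

```python
def rows_per_second(rows: int, minutes: float) -> float:
    """Average ingest rate over the whole run."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return rows / (minutes * 60)


def likely_bottleneck(avg_cpu_percent: float, low_cpu_threshold: float = 20.0) -> str:
    """Rough first-pass reading of average CPU utilization during ingest.

    low_cpu_threshold is an assumed cut-off taken from the 10-20% range
    discussed in the post; tune it for your own environment.
    """
    if not 0 <= avg_cpu_percent <= 100:
        raise ValueError("avg_cpu_percent must be between 0 and 100")
    if avg_cpu_percent <= low_cpu_threshold:
        return ("likely not CPU-bound: check I/O, synchronization, "
                "network transfers, or application-level limits")
    return "inconclusive: CPU may be a factor, compare with disk, network, and GC metrics"


if __name__ == "__main__":
    # Figures from the thread: ~11 million rows, 17 minutes, 64 cores.
    rate = rows_per_second(11_000_000, 17)
    print(f"overall: {rate:.0f} rows/sec")
    print(f"per core: {rate / 64:.1f} rows/sec")
    # 15.0 is a made-up utilization value, used only to show the call.
    print(likely_bottleneck(15.0))
```

Dividing the overall rate by the core count is only a sanity check: if the per-core figure looks implausibly small for the kind of rows being written, that is another hint that most cores are idle or waiting.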

robert12231 · ‎06-01-2026

This behavior is expected if the table is transactional (ACID-enabled). A DELETE operation does not immediately rewrite the underlying HDFS data files. Instead, Hive records the deleted row identifiers in a delete_delta directory, and query engines apply those delete markers when reading the table. As a result, the original data files remain in place and the HDFS size often stays the same immediately after a large delete.

If you deleted 450,000+ rows and only have ~13,000 rows remaining, it's normal that the table directory still occupies roughly the same amount of space. In some cases, storage consumption can even increase temporarily because the delete metadata itself must be stored.

To actually reclaim disk space, you typically need to run a major compaction. During major compaction, Hive rewrites the data files, merges delta/delete_delta information, and removes data that is no longer visible to queries. Only after that process completes will you generally see a significant reduction in HDFS usage.

One additional point: new inserts do not "overwrite" the deleted rows inside the existing files. HDFS files are immutable, so Hive creates new data files rather than modifying existing ones in place. The cleanup and consolidation happen during compaction rather than during the DELETE itself.

You may want to check:

- Whether the table is ACID/transactional.
- The contents of the delta_* and delete_delta_* directories.
- When the next automatic major compaction is scheduled, or whether a manual major compaction is appropriate for your environment.
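As a rough illustration of the directory check, the Python sketch below groups file sizes by base_*, delta_*, and delete_delta_* directory. The helper names, the sample paths, and the byte counts are all hypothetical; in practice the (path, size) pairs would come from a recursive listing of the table directory, for example the output of hdfs dfs -ls -R. The last helper only builds the HiveQL text for a manual major compaction, using a placeholder table name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def classify_acid_dir(name: str) -> str:
    """Classify one path component by Hive ACID directory prefix."""
    if name.startswith("delete_delta_"):
        return "delete_delta"
    if name.startswith("delta_"):
        return "delta"
    if name.startswith("base_"):
        return "base"
    return "other"


def summarize_acid_usage(listing: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum file sizes (bytes) per ACID directory kind.

    Paths are assumed to be relative to the table directory, so partition
    directories such as ds=2026-01-01/ may precede the ACID directory.
    """
    totals: Dict[str, int] = defaultdict(int)
    for path, size in listing:
        kind = "other"
        for part in path.strip("/").split("/"):
            kind = classify_acid_dir(part)
            if kind != "other":
                break
        totals[kind] += size
    return dict(totals)


def major_compaction_statement(table: str) -> str:
    """Build the HiveQL text that requests a major compaction."""
    return f"ALTER TABLE {table} COMPACT 'major'"


if __name__ == "__main__":
    # Hypothetical listing: sizes are invented for illustration only.
    sample = [
        ("base_0000001/bucket_00000", 900_000_000),
        ("delta_0000002_0000002_0000/bucket_00000", 40_000_000),
        ("delete_delta_0000003_0000003_0000/bucket_00000", 15_000_000),
    ]
    for kind, size in summarize_acid_usage(sample).items():
        print(f"{kind}: {size} bytes")
    # "mydb.mytable" is a placeholder table name.
    print(major_compaction_statement("mydb.mytable"))
```

If most of the bytes still sit under base_* or delta_* right after a large DELETE, that matches the behavior described above: the space is only released once a major compaction has rewritten the data and the old directories have been cleaned up. SHOW COMPACTIONS in Hive lists queued and completed compaction requests.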

Last Visited	‎06-01-2026 09:46 PM

Member Since	‎06-01-2026 09:46 PM
Posts	2

Cloudera Community

Re: Slow ingest on 64 core machine

Re: DELETE rows in table, how HDFS file size is im...