question Re: Hive little query slow in Support Questions

Hive little query slow

Dariuz82 — Sat, 17 May 2025 11:32:57 GMT

Hi all,

I'm working with Cloudera Hive for the first time, seen that my company chose to use this database.
I received the JDBC connectionString and I created a little method in java.
Here the script

public String[] getEmployee(){ String[] idcod ={"ABC123", "FLS163","XYZ001","PLE456", "ERV021", and so on......}; String[] names= new String[idcod.length]; int ncount=0; long start1, elapsedTimequery; try{ Statement s = conn.createStatement(); System.out.println("Fetch size: "+s.getFetchSize()); ResultSet r=null; for(int i=0; i<idcod.length; i++) { String query=""; query+=" SELECT code, surname, name FROM mytable where code='"+idcod[i]+"'"; ncount=0; start1 = System.nanoTime(); r = s.executeQuery(query); elapsedTimequery = System.nanoTime() - start1; System.out.println((double)(elapsedTimequery/1000/1000)/1000+" seconds"); while(r.next()){ names[i]=r.getString("surname") + " " +r.getString("name"); ncount++; } System.out.println(ncount+"--"+idcod[i]); } r.close(); s.close(); } catch(SQLException sqlex){ sqlex.printStackTrace(); } catch(Exception ex){ ex.printStackTrace(); } return names; }

For each query r = s.executeQuery(query) it takes more of 4 seconds to search data and got it, meanwhile with Oracle a query takes about 2 milliseconds with Fetch "10".

With technical referent tried ICEBERG Table but it's no change.
I tried to add "hive.fetch.task.conversion" to my connectionString with all possible values (none, minimal, more) with no change.
I can't use clause IN because my main program manage the codes row by row.

How can I avoid all this time for an hive query and to go near to Oracle timing?

Thank you to all.

Re: Hive little query slow

ggangadharan — Mon, 19 May 2025 10:02:29 GMT

Hive is not a low-latency OLTP Database like Oracle.

Hive is designed for batch processing, not fast single-row lookups.
Every Select you run triggers a full query execution plan.

From the code snippet observed , queries executing row by row. (i.e.) executeQuery() multiple times , it looks expensive.

hive.fetch.task.conversion won't help here, since it will be useful for optimizing simple SELECT's into client-side fetches, but Hive still builds a full plan behind the scenes .

Better approach would be , Refactor the loop into a single IN clause.

SELECT code, surname, name FROM mytable WHERE code IN ('ABC123', 'FLS163', 'XYZ001', ...)

Then store the results in a map.

Map<String, String> codeToName = new HashMap<>(); while (r.next()) { codeToName.put(r.getString("code"), r.getString("surname") + " " + r.getString("name")); }

Even if you must process row-by-row , fetching all data in a batch drastically reduces query overhead.

If the list is too large for IN clause, insert those values in temp Hive table.

// Insert your id list into a temp table CREATE TEMPORARY TABLE tmp_ids (code STRING); -- Then insert all your codes into tmp_ids SELECT a.code, a.surname, a.name FROM mytable a JOIN tmp_ids b ON a.code = b.code;

Hive optimize the join rather than executing multiple separate queries.

Re: Hive little query slow

ggangadharan — Mon, 19 May 2025 10:10:52 GMT

Would like to mention few more recommendations ,

Hive will be fast with columnar storage and predicate pushdown. Store the table as ORC (with Snappy/Zlib) if possible .
Ref - https://docs.cloudera.com/runtime/7.2.0/hive-performance-tuning/topics/hive_prepare_to_tune_performance.html#:~:text=,vectorized%20by%20examining%20explain%20plans
Collect statistics and enable predicate push-down (hive.optimize.ppd=true, default in Hive recent versions) so that filtering on code skips irrelevant data.
If code column has limited distinct values, consider partitioning or bucketing on it: a partitioned ORC table will read only the needed partition. Also keep vectorization enabled (hive.vectorized.execution.enabled=true), which processes rows in batches – a big speedup for scans.

Re: Hive little query slow

Dariuz82 — Mon, 26 May 2025 13:35:43 GMT

Too simple 😋
I can't use clause IN as already written in my first post.
Then my credential are "read only," I can't open a temporary tables.