Support Questions

Find answers, ask questions, and share your expertise

Hive little query slow

New Contributor

Hi all,

I'm working with Cloudera Hive for the first time, since my company chose to use it as a database.
I received the JDBC connection string and wrote a small method in Java.
Here is the code:

public String[] getEmployee() {
	String[] idcod = {"ABC123", "FLS163", "XYZ001", "PLE456", "ERV021" /* and so on... */};
	String[] names = new String[idcod.length];
	int ncount = 0;
	long start1, elapsedTimequery;
	try {
		Statement s = conn.createStatement();
		System.out.println("Fetch size: " + s.getFetchSize());
		ResultSet r = null;
		for (int i = 0; i < idcod.length; i++) {
			String query = "SELECT code, surname, name FROM mytable WHERE code='" + idcod[i] + "'";

			ncount = 0;
			start1 = System.nanoTime();
			r = s.executeQuery(query);
			elapsedTimequery = System.nanoTime() - start1;
			System.out.println((double) (elapsedTimequery / 1000 / 1000) / 1000 + " seconds");
			while (r.next()) {
				names[i] = r.getString("surname") + " " + r.getString("name");
				ncount++;
			}
			System.out.println(ncount + "--" + idcod[i]);
		}
		r.close();
		s.close();
	} catch (SQLException sqlex) {
		sqlex.printStackTrace();
	} catch (Exception ex) {
		ex.printStackTrace();
	}
	return names;
}

Each query r = s.executeQuery(query) takes more than 4 seconds to find and return the data, while the same lookup in Oracle takes about 2 milliseconds with a fetch size of 10.

With our technical contact we tried an Iceberg table, but nothing changed.
I tried adding "hive.fetch.task.conversion" to my connection string with every possible value (none, minimal, more), with no change.
I can't use an IN clause because my main program manages the codes row by row.

How can I avoid all this overhead on a Hive query and get close to Oracle's timing?

Thank you to all.

1 ACCEPTED SOLUTION

Master Collaborator

Hive is not a low-latency OLTP database like Oracle.

  • Hive is designed for batch processing, not fast single-row lookups.
  • Every SELECT you run triggers a full query execution plan.

In your code snippet the queries execute row by row, i.e. executeQuery() is called many times, which is expensive.

hive.fetch.task.conversion won't help here: it optimizes simple SELECTs into client-side fetches, but Hive still builds a full plan behind the scenes.

A better approach would be to refactor the loop into a single IN clause:

SELECT code, surname, name FROM mytable WHERE code IN ('ABC123', 'FLS163', 'XYZ001', ...)

Then store the results in a map:

Map<String, String> codeToName = new HashMap<>();
while (r.next()) {
    codeToName.put(r.getString("code"), r.getString("surname") + " " + r.getString("name"));
}

Even if you must process the results row by row, fetching all the data in one batch drastically reduces query overhead.
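A sketch of how the batched lookup could look in Java, reusing the table and column names from the original snippet; the helper names (buildInQuery, lookupNames) are mine, and the IN-clause builder is plain string logic with one JDBC placeholder per code:

```java
import java.sql.*;
import java.util.*;

public class BatchLookup {
    // Build "SELECT ... WHERE code IN (?, ?, ...)" with one placeholder per code.
    static String buildInQuery(int n) {
        StringBuilder sb = new StringBuilder(
            "SELECT code, surname, name FROM mytable WHERE code IN (");
        for (int i = 0; i < n; i++) {
            sb.append(i == 0 ? "?" : ", ?");
        }
        return sb.append(")").toString();
    }

    // Run one query for all codes and return code -> "surname name".
    static Map<String, String> lookupNames(Connection conn, String[] codes)
            throws SQLException {
        Map<String, String> codeToName = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(buildInQuery(codes.length))) {
            for (int i = 0; i < codes.length; i++) {
                ps.setString(i + 1, codes[i]); // JDBC parameters are 1-based
            }
            try (ResultSet r = ps.executeQuery()) {
                while (r.next()) {
                    codeToName.put(r.getString("code"),
                            r.getString("surname") + " " + r.getString("name"));
                }
            }
        }
        return codeToName;
    }
}
```

Using a PreparedStatement also avoids concatenating the codes into the SQL string, which the original loop did.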

If the list is too large for an IN clause, insert the values into a temporary Hive table:

-- Stage your id list in a temp table
CREATE TEMPORARY TABLE tmp_ids (code STRING);
-- Then insert all your codes into tmp_ids

SELECT a.code, a.surname, a.name
FROM mytable a
JOIN tmp_ids b ON a.code = b.code;

Hive optimizes the join rather than executing multiple separate queries.
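A minimal sketch of how the temp-table statements could be assembled from Java, assuming Hive's multi-row INSERT INTO ... VALUES syntax is available; the helper buildInsert and its quote-doubling are my assumptions, not part of the original reply:

```java
import java.util.*;

public class TempTableSql {
    // DDL for the staging table (name tmp_ids taken from the reply above).
    static final String CREATE =
        "CREATE TEMPORARY TABLE tmp_ids (code STRING)";

    // Build a single multi-row INSERT for all codes; single quotes are
    // doubled as a minimal guard against breaking the string literal.
    static String buildInsert(List<String> codes) {
        StringJoiner rows = new StringJoiner(", ", "INSERT INTO tmp_ids VALUES ", "");
        for (String c : codes) {
            rows.add("('" + c.replace("'", "''") + "')");
        }
        return rows.toString();
    }

    static final String JOIN =
        "SELECT a.code, a.surname, a.name FROM mytable a " +
        "JOIN tmp_ids b ON a.code = b.code";
}
```

Each of the three statements would then be run over the same JDBC connection, since a TEMPORARY table is only visible to the session that created it.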


3 REPLIES


New Contributor

Too simple 😋
I can't use an IN clause, as already written in my first post.
Also, my credentials are "read only," so I can't create temporary tables.

Master Collaborator

I'd like to mention a few more recommendations:

  • Hive is fast with columnar storage and predicate pushdown. Store the table as ORC (with Snappy/Zlib compression) if possible.
    Ref - https://docs.cloudera.com/runtime/7.2.0/hive-performance-tuning/topics/hive_prepare_to_tune_performa...
  • Collect statistics and enable predicate pushdown (hive.optimize.ppd=true, the default in recent Hive versions) so that filtering on code skips irrelevant data.
  • If the code column has a limited number of distinct values, consider partitioning or bucketing on it: a partitioned ORC table reads only the needed partition. Also keep vectorization enabled (hive.vectorized.execution.enabled=true), which processes rows in batches, a big speedup for scans.
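Session-level settings like these can often be passed in the Hive JDBC URL itself, whose general form is jdbc:hive2://host:port/db;sessionVars?hiveConfList#hiveVars. A sketch, assuming the host, port, and database are placeholders and that the cluster allows clients to set this property:

```java
public class HiveUrl {
    // Hive conf overrides go after the '?' in the JDBC URL,
    // e.g. to keep vectorized execution on for this session.
    static String withVectorization(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db
             + "?hive.vectorized.execution.enabled=true";
    }
}
```

Note that some deployments restrict which properties clients may override, so a setting that is silently ignored is worth verifying with SET on the session.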