Support Questions

Find answers, ask questions, and share your expertise

Hive little query slow

New Contributor

Hi all,

I'm working with Cloudera Hive for the first time, since my company chose to use it as a database.
I received the JDBC connection string and wrote a small method in Java.
Here is the code:

public String[] getEmployee() {
	String[] idcod = {"ABC123", "FLS163", "XYZ001", "PLE456", "ERV021" /* and so on... */};
	String[] names = new String[idcod.length];
	int ncount = 0;
	long start1, elapsedTimequery;
	try {
		Statement s = conn.createStatement();
		System.out.println("Fetch size: " + s.getFetchSize());
		ResultSet r = null;
		for (int i = 0; i < idcod.length; i++) {
			String query = "SELECT code, surname, name FROM mytable WHERE code='" + idcod[i] + "'";

			ncount = 0;
			start1 = System.nanoTime();
			r = s.executeQuery(query);
			elapsedTimequery = System.nanoTime() - start1;
			System.out.println((double) (elapsedTimequery / 1000 / 1000) / 1000 + " seconds");
			while (r.next()) {
				names[i] = r.getString("surname") + " " + r.getString("name");
				ncount++;
			}
			System.out.println(ncount + "--" + idcod[i]);
		}
		r.close();
		s.close();
	} catch (SQLException sqlex) {
		sqlex.printStackTrace();
	} catch (Exception ex) {
		ex.printStackTrace();
	}
	return names;
}

Each query r = s.executeQuery(query) takes more than 4 seconds to find and return the data, while the same lookup in Oracle takes about 2 milliseconds with a fetch size of 10.

With our technical contact we tried an Iceberg table, but nothing changed.
I tried adding "hive.fetch.task.conversion" to my connection string with every possible value (none, minimal, more), with no change.
I can't use an IN clause because my main program manages the codes row by row.

How can I avoid all this overhead on a Hive query and get close to Oracle's timing?

Thank you to all.

1 ACCEPTED SOLUTION

Master Collaborator

Hive is not a low-latency OLTP database like Oracle.

  • Hive is designed for batch processing, not fast single-row lookups.
  • Every SELECT you run triggers a full query execution plan.

In your code snippet the queries execute row by row, i.e. executeQuery() is called many times, which is expensive.

hive.fetch.task.conversion won't help here: it optimizes simple SELECTs into client-side fetches, but Hive still builds a full plan behind the scenes.

A better approach would be to refactor the loop into a single IN clause:

SELECT code, surname, name FROM mytable WHERE code IN ('ABC123', 'FLS163', 'XYZ001', ...)

Then store the results in a map:

Map<String, String> codeToName = new HashMap<>();
while (r.next()) {
    codeToName.put(r.getString("code"), r.getString("surname") + " " + r.getString("name"));
}

Even if you must process the results row by row, fetching all the data in one batch drastically reduces query overhead.
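A sketch of how the batched lookup could look in Java, reusing the table and column names from the original snippet; the helper names (buildInQuery, lookupNames) are mine, and the IN-clause builder is plain string logic with one JDBC placeholder per code:

```java
import java.sql.*;
import java.util.*;

public class BatchLookup {
    // Build "SELECT ... WHERE code IN (?, ?, ...)" with one placeholder per code.
    static String buildInQuery(int n) {
        StringBuilder sb = new StringBuilder(
            "SELECT code, surname, name FROM mytable WHERE code IN (");
        for (int i = 0; i < n; i++) {
            sb.append(i == 0 ? "?" : ", ?");
        }
        return sb.append(")").toString();
    }

    // Run one query for all codes and return code -> "surname name".
    static Map<String, String> lookupNames(Connection conn, String[] codes)
            throws SQLException {
        Map<String, String> codeToName = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(buildInQuery(codes.length))) {
            for (int i = 0; i < codes.length; i++) {
                ps.setString(i + 1, codes[i]); // JDBC parameters are 1-based
            }
            try (ResultSet r = ps.executeQuery()) {
                while (r.next()) {
                    codeToName.put(r.getString("code"),
                            r.getString("surname") + " " + r.getString("name"));
                }
            }
        }
        return codeToName;
    }
}
```

Using a PreparedStatement also avoids concatenating the codes into the SQL string, which the original loop did.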

If the list is too large for an IN clause, insert the values into a temporary Hive table:

-- Stage your id list in a temp table
CREATE TEMPORARY TABLE tmp_ids (code STRING);
-- Then insert all your codes into tmp_ids

SELECT a.code, a.surname, a.name
FROM mytable a
JOIN tmp_ids b ON a.code = b.code;

Hive optimizes the join rather than executing multiple separate queries.
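A minimal sketch of how the temp-table statements could be assembled from Java, assuming Hive's multi-row INSERT INTO ... VALUES syntax is available; the helper buildInsert and its quote-doubling are my assumptions, not part of the original reply:

```java
import java.util.*;

public class TempTableSql {
    // DDL for the staging table (name tmp_ids taken from the reply above).
    static final String CREATE =
        "CREATE TEMPORARY TABLE tmp_ids (code STRING)";

    // Build a single multi-row INSERT for all codes; single quotes are
    // doubled as a minimal guard against breaking the string literal.
    static String buildInsert(List<String> codes) {
        StringJoiner rows = new StringJoiner(", ", "INSERT INTO tmp_ids VALUES ", "");
        for (String c : codes) {
            rows.add("('" + c.replace("'", "''") + "')");
        }
        return rows.toString();
    }

    static final String JOIN =
        "SELECT a.code, a.surname, a.name FROM mytable a " +
        "JOIN tmp_ids b ON a.code = b.code";
}
```

Each of the three statements would then be run over the same JDBC connection, since a TEMPORARY table is only visible to the session that created it.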


3 REPLIES


New Contributor

Too simple 😋
I can't use an IN clause, as already written in my first post.
Also, my credentials are "read only," so I can't create temporary tables.

Master Collaborator

I'd like to mention a few more recommendations:

  • Hive is fast with columnar storage and predicate pushdown. Store the table as ORC (with Snappy/Zlib compression) if possible.
    Ref - https://docs.cloudera.com/runtime/7.2.0/hive-performance-tuning/topics/hive_prepare_to_tune_performa...
  • Collect statistics and enable predicate pushdown (hive.optimize.ppd=true, the default in recent Hive versions) so that filtering on code skips irrelevant data.
  • If the code column has a limited number of distinct values, consider partitioning or bucketing on it: a partitioned ORC table reads only the needed partition. Also keep vectorization enabled (hive.vectorized.execution.enabled=true), which processes rows in batches, a big speedup for scans.
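Session-level settings like these can often be passed in the Hive JDBC URL itself, whose general form is jdbc:hive2://host:port/db;sessionVars?hiveConfList#hiveVars. A sketch, assuming the host, port, and database are placeholders and that the cluster allows clients to set this property:

```java
public class HiveUrl {
    // Hive conf overrides go after the '?' in the JDBC URL,
    // e.g. to keep vectorized execution on for this session.
    static String withVectorization(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db
             + "?hive.vectorized.execution.enabled=true";
    }
}
```

Note that some deployments restrict which properties clients may override, so a setting that is silently ignored is worth verifying with SET on the session.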