
Retrieving data from hive is very slow when creating pandas dataframe in jupyter



Hi to all,

Recently I noticed that creating a pandas DataFrame in Jupyter is very slow. I'm using HDP for data storage, and the data is stored in Hive tables in Parquet format. The table is not large: 11 columns and 145719 rows.

When I connect from a Jupyter cell to HiveServer (using the pyhive library), data retrieval is very slow (the query is SELECT * FROM <table_name>).

I also tried other connection libraries, pyhs2 and impyla, and the cell execution time did not improve. It takes about 6 minutes to execute the Jupyter cell and create the pandas DataFrame.

When I run the query in Hue, the execution time is in milliseconds.
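To narrow down where the 6 minutes go, it may help to time query execution and row transfer separately. The sketch below assumes only a standard DB-API cursor (pyhive, impyla and pyhs2 all expose one); the function name timed_fetch and the batch size are illustrative choices, not part of any of those libraries.

```python
import time


def timed_fetch(cursor, query, batch_size=10000):
    """Run `query` on a DB-API cursor and time execute and fetch separately.

    Returns (rows, column_names, timings). `batch_size` is an illustrative
    default; the best value depends on the driver and the row width.
    """
    t0 = time.perf_counter()
    cursor.execute(query)
    t1 = time.perf_counter()

    rows = []
    while True:
        # Pull rows in batches instead of one huge fetchall() call.
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)
    t2 = time.perf_counter()

    # DB-API: the first element of each description entry is the column name.
    columns = [d[0] for d in cursor.description]
    timings = {"execute_s": t1 - t0, "fetch_s": t2 - t1}
    return rows, columns, timings
```

With a pyhive cursor this would be called as timed_fetch(cursor, "SELECT * FROM <table_name>"), and the result can be passed to pd.DataFrame(rows, columns=columns). If fetch_s dominates, the time is probably spent moving rows to the client rather than running the query, which could also explain why Hue, which may only display the first page of results, looks fast.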

I tried configuring Hive as follows, and nothing changed:

  • TEZ as execution engine
  • CBO enabled
  • set hive.exec.parallel=true
  • set hive.vectorized.execution.enabled=true
  • set hive.vectorized.execution.reduce.enabled=true
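For reference, settings like the ones above can also be applied per session from the client, so that the Jupyter connection is certain to use them. This is a minimal sketch: parse_set_statements and open_connection are hypothetical helper names, the host and port are placeholders, and mapping "TEZ" and "CBO" to hive.execution.engine and hive.cbo.enable is an assumption about what was configured. The configuration argument of pyhive's hive.Connection accepts a dict of session properties.

```python
def parse_set_statements(lines):
    """Turn 'set key=value' lines into a dict of Hive session properties."""
    config = {}
    for line in lines:
        stmt = line.strip().rstrip(";")
        if stmt.lower().startswith("set "):
            stmt = stmt[4:]
        key, _, value = stmt.partition("=")
        config[key.strip()] = value.strip()
    return config


HIVE_SETTINGS = parse_set_statements([
    "set hive.execution.engine=tez",   # assumed property for "TEZ as execution engine"
    "set hive.cbo.enable=true",        # assumed property for "CBO enabled"
    "set hive.exec.parallel=true",
    "set hive.vectorized.execution.enabled=true",
    "set hive.vectorized.execution.reduce.enabled=true",
])


def open_connection(host, port=10000):
    """Open a pyhive connection with the session settings applied.

    Imported lazily so the rest of this sketch runs without pyhive installed.
    `host` is a placeholder for the HiveServer2 host.
    """
    from pyhive import hive
    return hive.Connection(host=host, port=port, configuration=HIVE_SETTINGS)
```

Note that these properties influence how the query is planned and executed; they are unlikely to change how quickly result rows are transferred to the client.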

I also tried computing table statistics, and nothing changed:

  • set hive.compute.query.using.stats=true
  • set hive.stats.fetch.column.stats=true
  • set hive.stats.fetch.partition.stats=true
  • ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS
  • ANALYZE TABLE <table_name> COMPUTE STATISTICS

I don't know whether there are network or memory issues, because Jupyter does not run on the server where the data is stored. The hardware configuration of both servers is reasonably good.
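As a rough sanity check on the network question, the figures in this post can be turned into an effective transfer rate. The helper below is only arithmetic on the numbers given here (145719 rows, 11 columns, about 6 minutes); effective_throughput is a hypothetical name.

```python
def effective_throughput(rows, columns, seconds):
    """Return the effective row and cell rates for a completed fetch."""
    return {
        "rows_per_s": rows / seconds,
        "cells_per_s": rows * columns / seconds,
    }


# Figures from this post: 145719 rows x 11 columns in about 6 minutes.
stats = effective_throughput(rows=145719, columns=11, seconds=6 * 60)
print(round(stats["rows_per_s"]), round(stats["cells_per_s"]))
```

This works out to roughly 405 rows per second. That is a very low rate for a table of this size, so per-row or per-batch overhead in the client or driver seems a more likely suspect than raw bandwidth, although a direct network test between the two servers would be needed to confirm it.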

Can someone share some information or ideas about what is going on?

Thanks!
