Recently I noticed that creating a pandas DataFrame in Jupyter is very slow. I'm using HDP for storage, and the data lives in Hive tables in Parquet format. The table isn't big: 11 columns and 145,719 rows.
When I connect from a Jupyter cell to HiveServer2 (using the pyhive library), data retrieval is very slow (the query is SELECT * FROM table_name).
I also tried different connection libraries (pyhs2 and impyla), with no reduction in cell execution time. It takes about 6 minutes to execute the Jupyter cell and create the pandas DataFrame.
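For reference, the retrieval pattern I'm using looks roughly like the sketch below. Here sqlite3 stands in for the HiveServer2 connection (so the snippet runs anywhere); with pyhive the pattern is the same, since both expose a DB-API cursor. Fetching in large chunks via fetchmany() instead of one fetchall() is one thing I'm experimenting with, since rows seem to come back from the server in batches and a bigger batch should mean fewer round trips:

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for the HiveServer2 connection here; with pyhive you
# would build the cursor via hive.Connection(...).cursor() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, "x") for i in range(1000)])

cursor = conn.cursor()
cursor.execute("SELECT * FROM t")

# Fetch in large chunks rather than a single fetchall(): over Thrift the
# rows arrive in batches, so a bigger chunk size means fewer round trips
# for the ~145k rows in my table.
chunks = []
while True:
    rows = cursor.fetchmany(10_000)
    if not rows:
        break
    chunks.append(pd.DataFrame(rows, columns=[d[0] for d in cursor.description]))

df = pd.concat(chunks, ignore_index=True)
```

The table name and chunk size are just illustrative; the real table has 11 columns and 145,719 rows.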
When I run the same query in Hue, execution time is in milliseconds.
I tried configuring Hive as below, and nothing changed:
- TEZ as the execution engine
- CBO enabled
- set hive.exec.parallel=true
- set hive.vectorized.execution.enabled=true
- set hive.vectorized.execution.reduce.enabled=true
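One thing I'm unsure about: SET commands are per-session in Hive, so settings applied in Hue or beeline presumably don't carry over to the pyhive connection in the notebook. A small helper I use to re-issue them on the same DB-API cursor that runs the query (the helper name is mine, not from any library):

```python
# Hive SET commands only affect the session they are issued in, so the
# settings above have to be applied on the same connection that the
# notebook query uses, not in Hue.
HIVE_SESSION_SETTINGS = [
    "set hive.execution.engine=tez",
    "set hive.exec.parallel=true",
    "set hive.vectorized.execution.enabled=true",
    "set hive.vectorized.execution.reduce.enabled=true",
]


def apply_hive_settings(cursor, settings=HIVE_SESSION_SETTINGS):
    """Issue session-level SET statements on a DB-API cursor before querying."""
    for stmt in settings:
        cursor.execute(stmt)
```

In the notebook I call apply_hive_settings(cursor) right after opening the pyhive cursor and before executing the SELECT.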
I also tried computing table statistics, and nothing changed:
ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS
ANALYZE TABLE <table_name> COMPUTE STATISTICS
I don't know whether it's a network or a memory issue, since Jupyter is not on the server where the data lives. The hardware configuration of both servers is decent.
Can anyone share some info or ideas about what is going on?