
Downloading huge results from Hue

Explorer

Hi,

 

If I run a query in Hue that returns a huge number of rows, is it possible to download them through the UI? I tried it with a Hive query and .csv; the download was successful, but it turned out the file had exactly 100000001 rows, while the actual result should be bigger. Is 100 million some kind of limit, and if so, could it be lifted?

 

I was also thinking about storing the results in HDFS and downloading them through the file browser, but the problem is that when you click "save in HDFS", the whole query runs again from scratch, so effectively you need to run it twice (and I haven't checked whether the result would be stored as one file and whether Hue could download it).
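For what it's worth, one workaround sketch, assuming Hive 0.11+ and beeline access (the host, credentials, output path, and table name web_logs are placeholders, not from the original setup): have Hive write the result set to an HDFS directory, then merge the part files locally, bypassing the Hue web process entirely.

# Sketch: write the result set to an HDFS directory via Hive.
beeline -u jdbc:hive2://bla:10000 -n user -p password -e "
INSERT OVERWRITE DIRECTORY '/tmp/query_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM web_logs;"

# Hive usually writes several part files; merge them into one local CSV.
hdfs dfs -getmerge /tmp/query_out results.csv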

 

In short, is such a use case possible in Hue?

 


13 REPLIES

Explorer

Erratum: the file had only 1 million lines, not 100 million.

Super Guru
This is https://issues.cloudera.org/browse/HUE-2142

In short, right now Hue will not perform well when downloading or streaming a lot of data to a browser, as it is not designed for that.

Explorer

But I don't need to see that data in a browser; I just want to download it to my PC...

Super Guru
The web server is sending it to your browser, and a web server is supposed to just send some web pages.

Explorer

I can download gigs of data from Google Drive or file-hosting websites using my browser, so why wouldn't it be possible here?

 

This means my only alternative is to tell users to install Hive and have them run something like

 

beeline -u jdbc:hive2://bla:10000 -n user -p password -f yourscript.q > yourresults.txt

 

which is a bit crap... (not to mention that until Hive 13, beeline doesn't report any progress on the operation). Or let them log in to my server directly and wreak havoc there 😕
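If beeline ends up being the fallback anyway, the output can at least be made CSV-like; a sketch, assuming Hive 0.14+ where the csv2 output format exists (host, credentials, and file names reuse the placeholders above):

beeline -u jdbc:hive2://bla:10000 -n user -p password \
  --silent=true --outputformat=csv2 \
  -f yourscript.q > yourresults.csv

# --silent=true keeps log chatter out of the redirected file;
# --outputformat=csv2 emits comma-separated rows instead of the ASCII table.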

 

All that Hue already gives you is awesome, but it needs to do more!

Super Guru
Please read the above JIRA for more details. Hue is just one lightweight Python server; Google, Dropbox, etc. have tens of servers dedicated to serving files rather than web pages (the download happens from another machine).

In Hue 4 we will very probably introduce some new types of Hue servers that will take care of this part.

Romain

Explorer

I see. Maybe there should also be an option like "execute and save to HDFS", where Hue doesn't dump the results to the browser but puts them in a single file in HDFS directly, so the user can get them by other means? I recently managed to store the results and then download a 600 MB CSV file from HDFS using Hue, and it kind of worked (9 million lines, a new record). Although a few minutes later the service went down (not sure if it was because of that, or because I had just started presenting Hue to my boss), so I'm not sure this would hold up.

 

I guess we are going to instruct users to always use a LIMIT clause in their queries, telling them it is to avoid overloading our servers (which is technically true).

 

Thanks for your help!

Super Guru
Hue has the option to save the results to HDFS, and it is very scalable: Hive does the writing to HDFS, and downloading from HDFS afterwards does not require much computation from Hue.

But it does indeed re-execute the SQL, as an INSERT INTO ... or CREATE TABLE AS SELECT ...
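As a sketch of what that re-execution looks like (the table name my_export, the table web_logs, and the connection details are hypothetical), the CTAS form writes the rows straight into the new table's HDFS directory:

beeline -u jdbc:hive2://bla:10000 -n user -p password -e "
CREATE TABLE my_export
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM web_logs;"

# The table's files then sit under the Hive warehouse directory
# (commonly /user/hive/warehouse/my_export) and can be fetched from HDFS.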

Hive and Impala do not offer a way to both show the data on the Hue screen and make it easy to download.

In the next version we should have some optimizations that make downloading more stable, or bump the limit.

In Hue 4, which is a big release, we will tackle this, as it would require a new twin server.

So for now, for large result sets, we recommend downloading directly from HDFS by redoing the query, and not bumping the 'download_row_limit' limit.
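For reference, that property lives in hue.ini; a sketch of where it would be set, assuming the [beeswax] section used by the Hive app (the value shown is illustrative, not a recommendation):

[beeswax]
# Maximum number of result rows a user can download from a query.
download_row_limit=1000000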

Romain

Explorer

Got it. We will go this way; ironically, it turned out that due to some regulatory stuff, downloading raw data from our system shouldn't be too easy, so... we are going with the good old 'it's not a bug, it's a feature' 😉

 

FYI, I also tried this:

 

beeline -u jdbc:hive2://hname:10000 -n bla -p bla -f query.q > results.txt

 

but it didn't do much, it just hung. Maybe hive2 (or beeline?) isn't powerful enough either.
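A plausible cause, for anyone hitting the same hang: by default beeline buffers the whole result set in memory before printing (to compute column widths), which can stall or run out of memory on large results. A sketch of the streaming variant, assuming the --incremental flag is available in this version of beeline:

beeline -u jdbc:hive2://hname:10000 -n bla -p bla \
  --incremental=true \
  -f query.q > results.txt

# --incremental=true prints rows as they arrive instead of buffering
# the entire result set, at the cost of column alignment in table output.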

 

Thanks for all the clarifications!