
Downloading huge results from Hue

Explorer

Hi,

 

If I run a query in Hue that returns a huge number of rows, is it possible to download them through the UI? I tried it with a Hive query and .csv; the download was successful, but it turned out the file had exactly 100000001 rows, while the actual result should be bigger. Is 100 million some kind of limit, and if so, could it be lifted?

 

I was also thinking about storing the results in HDFS and downloading them through the file browser, but the problem is that when you click "save in HDFS", the whole query runs again from scratch, so effectively you need to run it twice (and I haven't checked whether the result would be stored as one file and whether Hue could download it).
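For what it's worth, one workaround sketch, assuming Hive 0.11+ and beeline access (the host, credentials, output path, and table name web_logs are placeholders, not from the original setup): have Hive write the result set to an HDFS directory, then merge the part files locally, bypassing the Hue web process entirely.

# Sketch: write the result set to an HDFS directory via Hive.
beeline -u jdbc:hive2://bla:10000 -n user -p password -e "
INSERT OVERWRITE DIRECTORY '/tmp/query_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM web_logs;"

# Hive usually writes several part files; merge them into one local CSV.
hdfs dfs -getmerge /tmp/query_out results.csv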

 

In short, is such a use case possible in Hue?

 


13 REPLIES

Explorer

Erratum: the file had only 1 million lines, not 100 million.

Super Guru
This is https://issues.cloudera.org/browse/HUE-2142

In short, right now Hue will not perform well when downloading or streaming a lot of data to a browser, as it is not designed for that.

Explorer

But I don't need to see that data in a browser; I just want to download it to my PC...

Super Guru
The web server is sending it to your browser, and a web server is supposed to just send some web pages.

Explorer

I can download gigs of data from Google Drive or file-hosting websites using my browser, so why wouldn't it be possible here?

 

This means my only alternative is to tell users to install Hive and have them run something like

 

beeline -u jdbc:hive2://bla:10000 -n user -p password -f yourscript.q > yourresults.txt

 

which is a bit crap... (not to mention that until Hive 13, beeline doesn't report any progress on the operation). Or let them log in to my server directly and wreak havoc there 😕
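If beeline ends up being the fallback anyway, the output can at least be made CSV-like; a sketch, assuming Hive 0.14+ where the csv2 output format exists (host, credentials, and file names reuse the placeholders above):

beeline -u jdbc:hive2://bla:10000 -n user -p password \
  --silent=true --outputformat=csv2 \
  -f yourscript.q > yourresults.csv

# --silent=true keeps log chatter out of the redirected file;
# --outputformat=csv2 emits comma-separated rows instead of the ASCII table.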

 

All that Hue already gives you is awesome, but it needs to do more!

Super Guru
Please read the above JIRA for more details. Hue is just one lightweight Python server; Google, Dropbox, etc. have tens of servers dedicated to serving files rather than web pages (the download happens from another machine).

In Hue 4 we will very probably introduce some new types of Hue servers that will take care of this part.

Romain

Explorer

I see. Maybe there should also be an option like "execute and save to HDFS", where Hue doesn't dump the results to the browser but puts them in a single file in HDFS directly, so the user can get them by other means? I recently managed to store the results and then download a 600 MB CSV file from HDFS using Hue, and it kind of worked (9 million lines, a new record). Although a few minutes later the service went down (not sure if it was because of that, or because I had just started presenting Hue to my boss), so I'm not sure this would hold up.

 

I guess we are going to instruct users to always use a LIMIT clause in their queries, telling them it is to avoid overloading our servers (which is technically true).

 

Thanks for your help!

Super Guru
Hue has the option to save the results to HDFS, and it is very scalable: Hive does the writing to HDFS, and downloading from HDFS afterwards does not require much computation from Hue.

But it does indeed re-execute the SQL, as an INSERT INTO ... or CREATE TABLE AS SELECT ...
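As a sketch of what that re-execution looks like (the table name my_export, the table web_logs, and the connection details are hypothetical), the CTAS form writes the rows straight into the new table's HDFS directory:

beeline -u jdbc:hive2://bla:10000 -n user -p password -e "
CREATE TABLE my_export
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM web_logs;"

# The table's files then sit under the Hive warehouse directory
# (commonly /user/hive/warehouse/my_export) and can be fetched from HDFS.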

Hive and Impala do not offer a way to both show the data on the Hue screen and make it easy to download.

In the next version we should have some optimizations that make downloading more stable, or bump the limit.

In Hue 4, which is a big release, we will tackle this, as it would require a new twin server.

So for now, for large result sets, we recommend downloading directly from HDFS by redoing the query, and not bumping the 'download_row_limit' limit.
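For reference, that property lives in hue.ini; a sketch of where it would be set, assuming the [beeswax] section used by the Hive app (the value shown is illustrative, not a recommendation):

[beeswax]
# Maximum number of result rows a user can download from a query.
download_row_limit=1000000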

Romain

Explorer

Got it. We will go this way; ironically, it turned out that due to some regulatory stuff, downloading raw data from our system shouldn't be too easy, so... we are going with the good old 'it's not a bug, it's a feature' 😉

 

FYI, I also tried this:

 

beeline -u jdbc:hive2://hname:10000 -n bla -p bla -f query.q > results.txt

 

but it didn't do much, it just hung. Maybe hive2 (or beeline?) isn't powerful enough either.
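A plausible cause, for anyone hitting the same hang: by default beeline buffers the whole result set in memory before printing (to compute column widths), which can stall or run out of memory on large results. A sketch of the streaming variant, assuming the --incremental flag is available in this version of beeline:

beeline -u jdbc:hive2://hname:10000 -n bla -p bla \
  --incremental=true \
  -f query.q > results.txt

# --incremental=true prints rows as they arrive instead of buffering
# the entire result set, at the cost of column alignment in table output.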

 

Thanks for all the clarifications!