Support Questions

Find answers, ask questions, and share your expertise

How does zeppelin storage query result?

Rising Star

I am working on designing an HDFS query system based on Spark that includes a paging function, and Zeppelin seems to be a good example for me.

Now I have a question. I see that Spark and Spark SQL query results persist even after I refresh or reopen the notebook, so the results must be saved somewhere.

So I am wondering where this result data is saved. If it is saved in a database, won't a very large result set cause database performance problems?

1 ACCEPTED SOLUTION

Super Guru

@Junfeng Chen,

Yes. Zeppelin notebook results are stored in JSON format on HDFS (from HDP 2.6 onward) and on the native filesystem in earlier versions.

Since it is stored in HDFS, size will not be a problem even if the output is huge. You can check the output here:

Native FS path : /usr/hdp/current/zeppelin-server/notebook/{notebook-id}/note.json
HDFS path:  /user/zeppelin/notebook/{notebook-id}/note.json

You can check the results key in note.json.
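To see what that results key looks like in practice, here is a minimal sketch. The note structure below is illustrative (the exact schema varies across Zeppelin versions, and older versions use a singular "result" key), but it shows how each paragraph carries its own cached output:

```python
import json

# Illustrative miniature of a Zeppelin note.json -- the real file
# lives at the HDFS/native-FS paths mentioned above. Field names
# here follow the common HDP 2.6-era layout; treat them as an
# assumption, not a spec.
sample = json.dumps({
    "id": "2ABCDEF12",
    "paragraphs": [
        {
            "text": "%sql select * from logs limit 10",
            "results": {
                "code": "SUCCESS",
                "msg": [{"type": "TABLE", "data": "col1\tcol2\n1\t2\n"}],
            },
        }
    ],
})

note = json.loads(sample)

# Each paragraph stores its rendered output under "results" -> "msg",
# which is why results survive a refresh or notebook reopen.
for para in note.get("paragraphs", []):
    for msg in para.get("results", {}).get("msg", []):
        print(msg["type"], "->", len(msg["data"]), "bytes of stored output")
```

Pointing `json.load` at the real note.json instead of the inline sample would let you measure how much output each paragraph has cached.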


If this helps, please take a moment to log in and "Accept" the answer.


5 REPLIES


Rising Star

@Aditya Sirna Thanks, Aditya.

So what about paging? Since the whole result is saved on HDFS in JSON format, if I need to load only part of it, do I just load the whole JSON file and cut out a slice in memory based on the given page size and page number? In practice, will Zeppelin run out of memory if the result is too large?
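The slice-in-memory approach described above can be sketched in a few lines (a hypothetical illustration of the asker's idea, not how Zeppelin itself pages results). It also makes the memory concern concrete: slicing is cheap, but the full result must already be loaded, which is exactly why a size cap matters:

```python
def page(rows, page_size, page_number):
    """Return one page of rows (page_number is 1-based).

    The slice itself is cheap, but `rows` -- the whole result --
    must already be in memory, which is the out-of-memory risk
    the question raises.
    """
    start = (page_number - 1) * page_size
    return rows[start:start + page_size]

rows = list(range(95))      # stand-in for a parsed query result
print(page(rows, 10, 1))    # first page
print(page(rows, 10, 10))   # last, partial page
```

A production design would push the limit/offset down into the query (e.g. into Spark SQL) rather than materializing everything first.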

Super Guru

@Junfeng Chen,

There are interpreter-level properties. For example, Spark has zeppelin.spark.maxResult, whose default value is 1000, so even if a query returns more than 1000 rows, only 1000 will be fetched. If you need more rows, you can increase the limit.

You may need to tweak these properties (zeppelin.interpreter.output.limit, zeppelin.websocket.max.text.message.size, ZEPPELIN_MEM, ZEPPELIN_INTP_MEM) according to your output size. Refer to this link for more info on all the properties:

https://zeppelin.apache.org/docs/0.7.2/install/configuration.html
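For the two environment variables, a typical place to set them is zeppelin-env.sh (the heap values below are illustrative, not recommendations; the interpreter properties like zeppelin.spark.maxResult are set in the interpreter settings UI or zeppelin-site.xml instead):

```shell
# zeppelin-env.sh -- raise driver-side heap if large results
# are being held in the Zeppelin server / interpreter JVMs.
# Values here are examples only; size them to your workload.
export ZEPPELIN_MEM="-Xmx4096m"
export ZEPPELIN_INTP_MEM="-Xmx4096m"
```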

Rising Star
@Aditya Sirna

So by default, up to 1000 rows of results are stored on HDFS for each query?

If I increase the limit, will there be negative effects, such as slow HTTP transfers or failures to receive the result?

Super Guru

1000 is the default for Spark. You can set common.max_count at a global level. Increasing the limit should not have negative effects, but if your data size is very large, you may need to tweak the above-mentioned parameters accordingly.