Created 10-10-2018 03:52 AM
I am designing an HDFS query system based on Spark that includes a paging function, and Zeppelin seems to be a good example for me to follow.
Now I have a question. I see that Spark or Spark SQL query results still exist even after I refresh or reopen the notebook, so the results must be saved somewhere.
Where is this result data saved? If it is saved in a database, what happens when the result data is very large? Would that cause a database performance problem?
Created 10-10-2018 04:35 AM
Yes. Zeppelin notebook results are stored in JSON format: in HDFS from HDP 2.6 onward, and on the native filesystem prior to that version.
Because the results are stored in HDFS, it will not be a problem even if the size is huge. You can check the output here:
Native FS path: /usr/hdp/current/zeppelin-server/notebook/{notebook-id}/note.json
HDFS path: /user/zeppelin/notebook/{notebook-id}/note.json
You can look for the results key in note.json.
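As a rough illustration of looking for the results key, here is a minimal Python sketch. It assumes nothing about the layout of note.json beyond it being valid JSON, so it simply walks the whole document and collects every value stored under a key named results. The function name find_results and the sample note content (including the paragraphs and text fields) are hypothetical and only for demonstration; a real note.json may be organized differently.

```python
import json


def find_results(node, found=None):
    """Recursively collect every value stored under a "results" key."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "results":
                found.append(value)
            else:
                find_results(value, found)
    elif isinstance(node, list):
        for item in node:
            find_results(item, found)
    return found


# Hypothetical stand-in for the contents of a note.json file.
# For a real notebook you would read the file instead, for example:
#   with open("note.json") as f:
#       note = json.load(f)
sample_note = json.loads("""
{"name": "demo", "paragraphs": [
  {"text": "%sql select 1", "results": {"code": "SUCCESS"}},
  {"text": "%md hello"}
]}
""")

results = find_results(sample_note)
print(len(results))
```

If the file lives in HDFS, you would first need to copy it to the local filesystem (or read it through an HDFS client) before parsing it this way.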
Created 10-10-2018 04:49 AM
@Aditya Sirna Thanks Aditya
So what about paging? Since the whole result is saved on HDFS in JSON format, if I need to load only part of it, do I just load the whole JSON file and cut out the part I need in memory, based on a given page size and page number? In practice, will Zeppelin run into out-of-memory problems if the result is too large?
Created 10-10-2018 05:26 AM
There are interpreter-level properties for this. For example, Spark has zeppelin.spark.maxResult, whose default value is 1000. So even if there are more than 1000 rows, it will fetch only 1000 rows. If you need more rows, you can increase the limit.
You may need to tweak these properties (zeppelin.interpreter.output.limit, zeppelin.websocket.max.text.message.size, ZEPPELIN_MEM, ZEPPELIN_INTP_MEM) according to your output size. Refer to this link for more information on all the properties:
https://zeppelin.apache.org/docs/0.7.2/install/configuration.html
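To make the idea of a row limit plus in-memory paging concrete, here is a small Python sketch. It is not how Zeppelin implements anything; it only shows that once a result is capped at a fixed number of rows (1000 here, mirroring the default mentioned above), cutting out a page by page number and page size is a cheap slice. The helper names cap_rows and get_page are hypothetical.

```python
from itertools import islice


def cap_rows(rows, max_result=1000):
    """Yield at most max_result rows, without consuming the rest of the input."""
    return islice(rows, max_result)


def get_page(rows, page_number, page_size):
    """Return one page of rows; page_number starts at 1."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be >= 1")
    start = (page_number - 1) * page_size
    return list(islice(rows, start, start + page_size))


# Stand-in for a large query result: 5000 rows numbered 0..4999.
rows = range(5000)

capped = list(cap_rows(rows))         # only the first 1000 rows are kept
page_3 = get_page(capped, 3, 100)     # rows 200..299 of the capped result
print(len(capped), page_3[0], page_3[-1])
```

With a cap in place, the memory needed for paging is bounded by the cap rather than by the full size of the query result.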
Created 10-10-2018 05:57 AM
So by default, up to 1000 rows of results are stored on HDFS for each query?
If I increase the limit, will it have any negative effects, such as slow HTTP transfers or failures when receiving the results?
Created 10-10-2018 06:05 AM
The 1000 limit is for Spark. You can set common.max_count at a global level. You should not see negative effects if you increase the limit, but if your data size is very large you may need to tweak the parameters mentioned above accordingly.