How does Zeppelin store query results?
Labels: Apache Spark, Apache Zeppelin
Created 10-10-2018 03:52 AM
I am working on designing an HDFS query system based on Spark that includes a paging function, and Zeppelin seems to be a good example for me.
Now I have a question. I see that Spark or Spark SQL query results persist even when I refresh or reopen the notebook, so the results must be saved somewhere.
I am wondering where this result data is saved. If it is saved in a database, what happens if the result data is so large that it causes database performance problems?
Created 10-10-2018 04:35 AM
Yes. Zeppelin notebook results are stored in JSON format on HDFS (from HDP 2.6 onwards) and on the native filesystem in earlier versions.
Since the results are stored in HDFS, it will not be a problem even if the size is huge. You can check the output here:
Native FS path: /usr/hdp/current/zeppelin-server/notebook/{notebook-id}/note.json
HDFS path: /user/zeppelin/notebook/{notebook-id}/note.json
You can check the results key in note.json.
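For example, from the Zeppelin server host you could peek at the stored output like this (a rough sketch using the native FS path above; the exact keys inside note.json vary between Zeppelin versions, so treat the key names here as assumptions):

```python
import json

# Native FS location mentioned above; substitute the real notebook id.
note_path = "/usr/hdp/current/zeppelin-server/notebook/{notebook-id}/note.json"

with open(note_path) as f:
    note = json.load(f)

# Each paragraph of the note keeps its own stored output (if it has any).
# The key is "results" in newer Zeppelin versions and "result" in older ones.
for paragraph in note.get("paragraphs", []):
    stored = paragraph.get("results") or paragraph.get("result")
    if stored:
        print(paragraph.get("id"), "->", str(stored)[:200])
```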
If this helps, please take a moment to log in and "Accept" the answer.
Created 10-10-2018 04:49 AM
@Aditya Sirna Thanks Aditya.
So what about paging? Since the whole result is saved on HDFS in JSON format, if I only need to load part of it, do I just load the whole JSON file and cut out a slice in memory based on the given page size and page number (roughly as in the sketch below)? In practice, does Zeppelin run into out-of-memory problems if the result size is very large?
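For illustration, this is roughly the slicing I have in mind; the file name and the shape of the loaded rows are just placeholders, not Zeppelin's actual note.json layout:

```python
import json

def get_page(rows, page_size, page_number):
    """Cut one page out of the full in-memory result (page_number is 1-based)."""
    start = page_size * (page_number - 1)
    return rows[start:start + page_size]

# "rows.json" is a placeholder for whatever row list gets extracted from note.json;
# the whole result still has to be loaded into memory before slicing.
with open("rows.json") as f:
    rows = json.load(f)

page = get_page(rows, page_size=100, page_number=3)
print(len(page), "rows on this page")
```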
Created 10-10-2018 05:26 AM
These are controlled by interpreter-level properties. For example, Spark has zeppelin.spark.maxResult, whose default value is 1000, so even if the query produces more than 1000 rows, only 1000 rows are fetched. If you need more rows, you can increase the limit (the sketch below illustrates how such a row limit behaves).
You may also need to tweak zeppelin.interpreter.output.limit, zeppelin.websocket.max.text.message.size, ZEPPELIN_MEM, and ZEPPELIN_INTP_MEM according to your output size. Refer to this link for more info on all the properties:
https://zeppelin.apache.org/docs/0.7.2/install/configuration.html
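For illustration only (this is not Zeppelin's actual code), the effect of a row limit like zeppelin.spark.maxResult is that at most that many rows are ever brought back for display, however large the full result is. A minimal PySpark sketch, assuming a running Spark session and a hypothetical table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("max-result-demo").getOrCreate()

max_result = 1000  # mirrors the default of zeppelin.spark.maxResult

# "some_table" is a hypothetical table used only for illustration.
df = spark.sql("SELECT * FROM some_table")

# Only the first max_result rows are collected back to the driver for display;
# the remaining rows are never fetched.
rows = df.take(max_result)
print(len(rows), "rows fetched for display")
```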
Created 10-10-2018 05:57 AM
So by default, up to 1000 rows of results are stored on HDFS for each query?
If I increase the limit, will there be negative effects, such as slow HTTP transfers or failures when receiving the result?
Created 10-10-2018 06:05 AM
1000 is the default for Spark. You can set common.max_count at the global level. You should not see negative effects if you increase the limit, but if your data size is very large you may need to tweak the above-mentioned parameters accordingly.
