Created on 10-03-2014 05:13 PM - edited 09-16-2022 02:09 AM
Hi Sean,
I tried to import a CSV file (about 150K of data, 3000 entries) through the Oryx /ingest UI (browse to a file and upload).
I can see the data saved into /tmp/Oryx (as a .csv.gz file).
However, in the server console, the upload of the data never seems to happen. The console messages are below.
---------------------------------
Fri Oct 03 16:47:53 PDT 2014 INFO Initializing ProtocolHandler ["http-nio-8091"]
Fri Oct 03 16:47:53 PDT 2014 INFO Using a shared selector for servlet write/read
Fri Oct 03 16:47:53 PDT 2014 INFO Starting service Tomcat
Fri Oct 03 16:47:53 PDT 2014 INFO Starting Servlet Engine: Apache Tomcat/7.0.55
Fri Oct 03 16:47:53 PDT 2014 INFO Serving Layer console available at http://192.168.2.6:8091
Fri Oct 03 16:47:53 PDT 2014 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Fri Oct 03 16:47:53 PDT 2014 INFO Namespace prefix: file:
Fri Oct 03 16:47:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 16:47:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 16:47:53 PDT 2014 INFO Starting ProtocolHandler ["http-nio-8091"]
Fri Oct 03 16:54:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 17:01:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 17:08:53 PDT 2014 INFO No available generation, nothing to do
----------------------------
When I "Contrl+C" to terminate the process, then I see the data uploading message (as below). Any idea why ?
^CFri Oct 03 17:10:51 PDT 2014 INFO Caught signal INT (2)
Fri Oct 03 17:10:51 PDT 2014 INFO Pausing ProtocolHandler ["http-nio-8091"]
Fri Oct 03 17:10:52 PDT 2014 INFO Stopping service Tomcat
Fri Oct 03 17:10:52 PDT 2014 INFO Uploading /tmp/Oryx/oryx-append-2428716983456732298.csv.gz to /Users/............/00000/inbound/oryx-append-2428716983456732298.csv.gz
Fri Oct 03 17:10:52 PDT 2014 INFO Uploaded to /....../00000/inbound/oryx-append-2428716983456732298.csv.gz
Fri Oct 03 17:10:52 PDT 2014 INFO Stopping ProtocolHandler ["http-nio-8091"]
Fri Oct 03 17:10:53 PDT 2014 INFO Destroying ProtocolHandler ["http-nio-8091"]
--------------------------------
Created 10-04-2014 01:32 AM
The serving layer holds on to data until a certain amount has been written, and then uploads it to HDFS. If the process quits while some data has not yet been uploaded to HDFS, it will still try to upload it on shutdown. You can configure the write threshold to be lower and thus upload more frequently (at the cost of creating more, smaller files).
The serving layer is just saying that it does not yet see any model from the computation layer. That's probably true, since the computation layer hasn't received any data yet.
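To make the tuning concrete, here is a minimal sketch of an oryx.conf override (HOCON) that lowers the upload threshold. The serving-layer prefix and the value shown are assumptions for illustration; check reference.conf in your Oryx version for the exact key path and default.
---------------------------------
# Sketch only: the key path is assumed to mirror reference.conf's
# "writes-between-upload" setting; verify the exact prefix before use.
serving-layer {
  # Upload buffered ingest data to HDFS after fewer writes, producing
  # more frequent (but smaller) files under the generation's inbound/ dir.
  writes-between-upload = 10000
}
---------------------------------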
Created 10-04-2014 09:30 AM
Is "writes-between-upload" of the file reference.conf controlling this ?
Also, additional questions related to model in general...
(1) Is it possible to delete a model (say, from API) withoout restarting Oryx ?
(2) Relate to (1), is it possible to force to rebuild a model using different parameters on the fly (say, through API or Java call) without restarting Oryx ?
Thanks..
Jason
Created 10-05-2014 06:21 AM
Yes, that's it.
Can you delete the model while it's running? No; it would leave the instance unable to serve anything, so I haven't seen the point of that versus just shutting down the server.
No, you can't rebuild with different parameters without restarting, although the computation layer is what rebuilds the model; the serving layer just loads what it is given.
In the next version (the 2.x rewrite), it will try to find the best parameters automatically anyway.
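In practice, "rebuild with different parameters" means stopping the computation layer and starting it again with an edited config. A rough sketch of what that could look like; the ALS key names under model and the jar name are illustrative assumptions, not taken from reference.conf:
---------------------------------
# Sketch of an edited oryx.conf (HOCON); key names below are assumptions
# for illustration, not verified against reference.conf.
model {
  features = 50    # e.g. a different number of latent features
  lambda = 0.1     # e.g. a different regularization value
}
# Then restart the layer with the new config (standard Typesafe Config
# system property; the jar name/path is illustrative):
# java -Dconfig.file=oryx.conf -jar oryx-computation-x.y.z.jar
---------------------------------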
Created 10-10-2014 08:45 AM
Sean,
Follow-up questions on model generations...
Let's say we have data arriving at about 100,000 associations (user, item, preference) every hour.
Based on my reading of the documentation/code, my understanding of the data flow is:
(1) As data arrives, it is temporarily written into /tmp/Oryx.
(2) Meanwhile, the serving layer approximates recommendations based on this new data.
(3) Once the data accumulates to the "writes-between-upload" amount, it is moved to inbound.
(4) If there is new data in inbound, the computation layer is triggered to create a new generation.
Questions:
(a) Please correct the data flow above if anything is wrong.
(b) About (4): I think it's related to the time-threshold and data-threshold in the configuration file, but these two parameters are not clearly explained in the documentation. Can you explain more? For example, how can I set the computation layer to rebuild the model daily?
(c) About the generations created by the computation layer: I think each generation uses a full snapshot of the data at the moment the ALS computation is triggered... so, basically, the data used for generation 00000 is a subset of the data used for generation 00001 (not considering data removal).
Thanks.
Created 10-10-2014 09:06 AM
Yes, that's all correct. Set time-threshold to 24 hours (1440 minutes) to rebuild once a day, regardless of the amount of data that has been written. Yes, the amount of data used to build the model is always increasing (unless you are manually deleting or decaying data). It does sum up all counts for each user-item pair, so it is somewhat compacted that way.
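As a concrete illustration of the daily-rebuild setting described above, a config sketch follows; the model prefix is an assumption, so confirm the exact key paths and units against reference.conf for your version.
---------------------------------
# Sketch of an oryx.conf override (HOCON); key prefixes are assumptions.
model {
  # Trigger a new generation roughly once per day (value in minutes),
  # regardless of how much new data has arrived in the meantime.
  time-threshold = 1440
  # data-threshold triggers a rebuild based on the amount of data written
  # instead; leave it at its reference.conf default if you only want the
  # time-based trigger.
}
---------------------------------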