Question on /ingest service

Explorer

Hi Sean,

 

I tried to import a CSV file (150K of data, 3000 entries) through the /ingest page of the Oryx UI (browse to a file and upload).

I can see the data saved into /tmp/Oryx (as a .csv.gz file).

However, in the server console the upload never seems to happen. The console output is below.

 

---------------------------------

Fri Oct 03 16:47:53 PDT 2014 INFO Initializing ProtocolHandler ["http-nio-8091"]
Fri Oct 03 16:47:53 PDT 2014 INFO Using a shared selector for servlet write/read
Fri Oct 03 16:47:53 PDT 2014 INFO Starting service Tomcat
Fri Oct 03 16:47:53 PDT 2014 INFO Starting Servlet Engine: Apache Tomcat/7.0.55
Fri Oct 03 16:47:53 PDT 2014 INFO Serving Layer console available at http://192.168.2.6:8091
Fri Oct 03 16:47:53 PDT 2014 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Fri Oct 03 16:47:53 PDT 2014 INFO Namespace prefix: file:
Fri Oct 03 16:47:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 16:47:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 16:47:53 PDT 2014 INFO Starting ProtocolHandler ["http-nio-8091"]
Fri Oct 03 16:54:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 17:01:53 PDT 2014 INFO No available generation, nothing to do
Fri Oct 03 17:08:53 PDT 2014 INFO No available generation, nothing to do

 

----------------------------

 

When I press Ctrl+C to terminate the process, I then see the upload message (below). Any idea why?

 

^CFri Oct 03 17:10:51 PDT 2014 INFO Caught signal INT (2)
Fri Oct 03 17:10:51 PDT 2014 INFO Pausing ProtocolHandler ["http-nio-8091"]
Fri Oct 03 17:10:52 PDT 2014 INFO Stopping service Tomcat
Fri Oct 03 17:10:52 PDT 2014 INFO Uploading /tmp/Oryx/oryx-append-2428716983456732298.csv.gz to /Users/............/00000/inbound/oryx-append-2428716983456732298.csv.gz
Fri Oct 03 17:10:52 PDT 2014 INFO Uploaded to /....../00000/inbound/oryx-append-2428716983456732298.csv.gz
Fri Oct 03 17:10:52 PDT 2014 INFO Stopping ProtocolHandler ["http-nio-8091"]
Fri Oct 03 17:10:53 PDT 2014 INFO Destroying ProtocolHandler ["http-nio-8091"]

 

--------------------------------

Re: Question on /ingest service

Master Collaborator

The serving layer holds on to data until a certain amount has been written, and then uploads it to HDFS. If the process quits while some data has still not been uploaded to HDFS, however, it will try to upload it at that point. You can configure the number of writes to be lower and thus upload more frequently (but perhaps creating more, smaller files).

 

The serving layer is just saying that it does not yet see any model from the computation layer. That's probably true since the computation layer hasn't gotten data yet.
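
To make the buffering behavior described above concrete, here is a minimal Python sketch. It is not Oryx code: the class `BufferedUploader`, its methods, and the threshold value of 10000 are all hypothetical, chosen only to show why a 3000-entry upload can sit in the local buffer until the process shuts down.

```python
class BufferedUploader:
    """Toy model of a writer that batches records before uploading them.

    Hypothetical illustration only; this is not the Oryx implementation.
    """

    def __init__(self, writes_between_upload, upload):
        # Illustrative stand-in for the "writes-between-upload" setting.
        self.writes_between_upload = writes_between_upload
        # Callable that receives one batch (a list of records).
        self.upload = upload
        self.pending = []

    def write(self, record):
        # Buffer the record; upload only once the threshold is reached.
        self.pending.append(record)
        if len(self.pending) >= self.writes_between_upload:
            self.flush()

    def flush(self):
        # Upload whatever is buffered, if anything.
        if self.pending:
            self.upload(self.pending)
            self.pending = []

    def close(self):
        # On shutdown, push any data that has not been uploaded yet.
        self.flush()


uploaded = []  # each element is one uploaded batch
uploader = BufferedUploader(writes_between_upload=10000, upload=uploaded.append)

# 3000 writes stay below the (assumed) threshold, so nothing is uploaded yet.
for i in range(3000):
    uploader.write(("user%d" % i, "item", 1.0))
batches_before_close = len(uploaded)

# Shutting down flushes the remaining buffered records as one batch.
uploader.close()
batches_after_close = len(uploaded)
```

In this sketch, lowering `writes_between_upload` makes uploads happen sooner, at the cost of more, smaller batches, which mirrors the trade-off mentioned in the reply.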

Re: Question on /ingest service

Explorer

Is this controlled by "writes-between-upload" in the file reference.conf?

 

Also, some additional questions about models in general:

(1) Is it possible to delete a model (say, through the API) without restarting Oryx?

(2) Related to (1), is it possible to force a rebuild of a model with different parameters on the fly (say, through the API or a Java call) without restarting Oryx?

 

Thanks..

 

Jason

Re: Question on /ingest service

Master Collaborator

Yes, that's it.

 

Can you delete the model while it's running? No, it would leave the instance unable to serve anything, so I hadn't seen the point of that versus just shutting down the server.

 

No, you can't rebuild with different params without restarting, although the computation layer is the thing that rebuilds the model. The serving layer just loads what it is given.

 

In the next version (the 2.x rewrite) it will try to find the best params automatically anyway.

Re: Question on /ingest service

Explorer

Sean,

 

Some follow-up questions on model generations.

 

Let's say data arrives in the system at a rate of about 100,000 associations (user, item, preference) every hour.

 

Based on my reading of the documentation and code, my understanding of the data flow is:

(1) As data arrives, it is temporarily written to /tmp/Oryx.

(2) Meanwhile, the serving layer approximates recommendations based on this new data.

(3) Once the data has accumulated to the "writes-between-upload" amount, it is moved to inbound.

(4) If there is new data in inbound, the computation layer is triggered to create a new generation.

 

Questions:

(a) Please correct the data flow above if it is wrong.

(b) About (4), I think it's related to time-threshold and data-threshold in the configuration file. However, these two parameters are not clearly explained in the documentation. Can you explain more? For example, how can I set the computation to rebuild the model daily?

(c) About generations created by the computation layer: I think each generation uses a full snapshot of the data at the moment the ALS computation is triggered. So, basically, the data used for generation 00000 is a subset of the data used for generation 00001 (ignoring removal of data).

 

Thanks.

Re: Question on /ingest service

Master Collaborator

Yes, that's all correct. Set time-threshold to 24 hours (1440 minutes) to rebuild once a day, regardless of the amount of data that has been written. Yes, the amount of data used to build the model is always increasing (unless you are manually deleting data or decaying data). It does sum up all counts for each user-item pair, so it is somewhat compacted this way.
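
As a rough illustration of the compaction mentioned above (summing all counts for each user-item pair), here is a small Python sketch. The function name `compact` and the sample records are made up for this example; this is not how Oryx itself is written, only the general idea.

```python
from collections import defaultdict


def compact(records):
    """Sum the strength values for each (user, item) pair.

    Hypothetical helper for illustration; not part of Oryx.
    """
    totals = defaultdict(float)
    for user, item, strength in records:
        totals[(user, item)] += strength
    return dict(totals)


# Made-up sample input: four associations, two for the same user-item pair.
records = [
    ("u1", "i1", 1.0),
    ("u1", "i1", 2.0),
    ("u2", "i1", 1.0),
    ("u1", "i2", 0.5),
]

compacted = compact(records)
```

Here the four input rows collapse into three entries, because the two rows for ("u1", "i1") are merged into one. The input still grows over time, but repeated user-item pairs do not add new rows after this step.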
