How to use Oryx 1 to detect spam email

JasonChen1114 — Sat, 05 Mar 2016 22:08:14 GMT

Hi Sean,

We are trying to use Oryx 1 to detect spam email. We have training data (spam emails with subject and email body as text).

Can we use Oryx 1 classification to resolve such a problem?

If so, how?

Thanks.

Chien

Re: How to use Oryx 1 to detect spam email

srowen — Sun, 06 Mar 2016 19:34:03 GMT

Yes, though I would describe Oryx as support for productionizing some
kind of learning system. Just making a model is something you should
do with other tools whose purpose is to build models. Oryx 1 is not
exactly deprecated, but Oryx 2 is the only version in active
development, and I'd really encourage you to look there. The good news
is that it's a lot easier in 2.x to reuse a model building process you
created in, say, Spark. In 1.x it's not possible.

Re: How to use Oryx 1 to detect spam email

JasonChen1114 — Sun, 06 Mar 2016 21:38:47 GMT

Thanks for the reply.

We will try Oryx 2. However, my question is how to use Oryx (1 or 2) to support spam email classification?

From the Oryx classification example, it looks like it requires training examples with numeric values, while the email body is textual. One way is to convert the email body into TF-IDF features, so that the Oryx classifier can be applied to numeric values.

The vector dimensionality looks too high, though. Any suggestions? Thanks.

Re: How to use Oryx 1 to detect spam email

srowen — Mon, 07 Mar 2016 09:52:03 GMT

It includes an implementation of classification using random decision
forests. Decision forests actually support both categorical and
numeric features. However, for text classification, you're correct
that you typically transform your text into numeric vectors via TF-IDF
first. This is something you'd have to do separately. Yes, the
dimensionality is high. Decision forests can be fine with this, but
they're not the most natural choice for text classification.
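
To make the separate TF-IDF step concrete, here is a minimal sketch in plain Python. It is not part of Oryx; the function names (`tokenize`, `tfidf_vectors`), the simple tokenizer, and the unsmoothed weighting `tf * log(N / df)` are all assumptions chosen for illustration, and a real pipeline would more likely use an existing library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of raw text documents.

    Hypothetical helper: each vector has one entry per vocabulary term,
    weighted as term frequency times log(N / document frequency).
    """
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    index = {term: i for i, term in enumerate(vocabulary)}
    n_docs = len(documents)

    vectors = []
    for tokens in tokenized:
        vec = [0.0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            vec[index[term]] = tf * idf
        vectors.append(vec)
    return vocabulary, vectors


# Made-up example emails (subject and body already joined into one string).
emails = [
    "Win a free prize now",
    "Meeting agenda for Monday",
    "Free prize inside, claim now",
]
vocabulary, vectors = tfidf_vectors(emails)
```

Each vector has one dimension per distinct term in the corpus, which is where the high dimensionality comes from. The resulting numeric rows, together with a spam/not-spam label, are what a classifier would then be trained on; how they are fed to Oryx is not shown here.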

You may see what I mean that Oryx is not a tool for classification,
but a tool for productionizing, which happens to have an
implementation of a classifier.

In 2.x, you also have an implementation of decision forests, and it
also doesn't have magic TF-IDF built in or anything. However, the
architecture is much more supportive of putting your own Spark-based
pipeline and model build into the framework. 1.x did not support this.