Member since: 04-04-2016
Posts: 147
Kudos Received: 40
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
| 1175 | 07-22-2016 12:37 AM
| 4241 | 07-21-2016 11:48 PM
| 1605 | 07-21-2016 11:28 PM
| 2245 | 07-21-2016 09:53 PM
| 3358 | 07-08-2016 07:56 PM
11-21-2016
09:37 PM
Special thanks to Michael Young for mentoring me through this.

Step 1: cd /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/

Step 2: vi managed-schema and add these three field definitions:

<field name="_timestamp_" type="date" indexed="true" stored="true" multiValued="false" />
<field name="_ttl_" type="string" indexed="true" multiValued="false" stored="true" />
<field name="_expire_at_" type="date" multiValued="false" indexed="true" stored="true" />

Step 3: vi solrconfig.xml in the same directory. Replace these three lines:
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- UUIDUpdateProcessorFactory will generate an id if none is present in
       the incoming document -->
  <processor class="solr.UUIDUpdateProcessorFactory" />

with the following chain:

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">_timestamp_</str>
  </processor>
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">_ttl_</str>
    <str name="value">+30SECONDS</str>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <str name="ttlFieldName">_ttl_</str>
    <str name="ttlParamName">_ttl_</str>
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="expirationFieldName">_expire_at_</str>
  </processor>
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">_expire_at_</str>
  </processor>
  <!-- UUIDUpdateProcessorFactory will generate an id if none is present in
       the incoming document -->
  <processor class="solr.UUIDUpdateProcessorFactory" />

Hope that helps. Thanks, Sujitha
08-18-2016
05:43 AM
1 Kudo
Solr indexing a MySQL database table on HDP 2.5 Tech Preview. Solr version used: 4.9.0.

Step 1: Download solr-4.9.0.zip from https://archive.apache.org/dist/lucene/solr/4.9.0/

Step 2: Extract the file.

Step 3: Modify solrconfig.xml and schema.xml and add db-data-config.xml, as described in (a), (b) and (c) below.

Step 4: Add the MySQL connector jar where the <lib> directive in (a) expects it.

a. vi solrconfig.xml: add these lines between the config tags:

<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
<lib dir="../../../lib/" regex="mysql-connector-java-5.0.8-bin.jar" />

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

b. vi schema.xml: add the line below:

<dynamicField name="*_name" type="text_general" multiValued="false" indexed="true" stored="true" />

c. Create a file called db-data-config.xml at the same path. It points at the employees database that gets created in MySQL in Step 5 below:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/employees"
              user="root"
              password="hadoop" />
  <document>
    <entity name="id"
            query="select emp_no as 'id', first_name, last_name from employees limit 1000;" />
  </document>
</dataConfig>

d. After this is complete, run the command below to start Solr, then check that it is up and running at the URL below (8983 is Solr's default port):

java -jar start.jar
http://localhost:8983/solr/#/

e. Select collection1 in the core selector.

f. Click on Data Import, expand Configuration, and check that it is pointing to the db-data-config.xml file we created.

g. After completing Step 5 below, click Execute on that page.

Step 5: Set up the database by importing an already-available database into MySQL (ref: https://dev.mysql.com/doc/employee/en/employees-installation.html):

shell> tar -xjf employees_db-full-1.0.6.tar.bz2
shell> cd employees_db/
shell> mysql -t < employees.sql

With this, the installation of the employees db in MySQL is complete.
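Once the employees data is loaded, you can also trigger and monitor the import over HTTP instead of clicking Execute in the UI. A minimal sketch with Python's requests library, assuming the collection1 core and the /dataimport handler configured in (a):

import requests

DIH = "http://localhost:8983/solr/collection1/dataimport"

# Kick off a full import using the query defined in db-data-config.xml.
requests.get(DIH, params={"command": "full-import"})

# Check the handler status; re-run this until it reports "idle" again.
status = requests.get(DIH, params={"command": "status", "wt": "json"}).json()
print(status.get("status"))

# Count the indexed employee documents once the import has finished.
resp = requests.get("http://localhost:8983/solr/collection1/select",
                    params={"q": "*:*", "rows": 0, "wt": "json"}).json()
print("indexed docs:", resp["response"]["numFound"])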
Step 6: With this, our indexing using Solr is complete. To do: I will try indexing the MySQL tables using the latest version of Solr.

Reference: http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/

Hope this helps. Thanks, Sujitha
08-16-2016
09:58 AM
1 Kudo
@sujitha sanku: "hadoop" is the root password of the MySQL server in the HDP 2.5 sandbox.
07-29-2016
03:10 AM
4 Kudos
I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo these projects by copying the note.json link into the Zeppelin Hub Viewer.
When working with text / unstructured data, there are a few things to keep in mind:
Cleaning the text is important (remove stopwords, remove punctuation, lowercase or uppercase all words consistently, account for stemming, tag parts of speech, etc.). Part-of-speech tagging is an advanced option, but it can improve accuracy when used on the right use case.
Most text analytics projects involve creating a term-document matrix weighted by TF-IDF (term frequency, inverse document frequency). This is typically done within Spark using the HashingTF function.
From here, you can feed the TF-IDF vectors into a clustering algorithm such as k-means or LDA, or, a really good option, use SVD (singular value decomposition); see the sketch after this list.
You could also pair the TF-IDF matrix with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a decision tree model, random forest, etc.
This process will help you understand your text by (1) finding data-driven topics using matrix reduction / clustering techniques, or (2) using the term-document matrix to predict an outcome (probability of failure, likelihood to churn, etc.).
You may also want to check out Word2Vec (I have an example in my github).
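To make the TF-IDF-plus-clustering path concrete, here is a minimal PySpark sketch against the RDD-based MLlib API; the four toy documents are invented for illustration, and numFeatures/k are arbitrary small values:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="tfidf_kmeans_demo")

# Toy corpus, already "cleaned"; real input would need the stopword /
# punctuation / lowercasing steps described above.
docs = sc.parallelize([
    "pump failed after the bearing overheated",
    "bearing temperature exceeded the safe threshold",
    "customer called about billing and invoice errors",
    "invoice amount was wrong on the billing statement",
]).map(lambda line: line.lower().split())

# Term-document matrix: hash terms into a fixed feature space (TF),
# then reweight by inverse document frequency (IDF).
tf = HashingTF(numFeatures=1000).transform(docs)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Cluster the TF-IDF vectors into data-driven "topics".
model = KMeans.train(tfidf, k=2, maxIterations=10)
for terms, cluster in zip(docs.collect(), model.predict(tfidf).collect()):
    print(cluster, " ".join(terms))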
Hope this helps!
07-22-2016
01:18 PM
I changed the Hive-Tez Java Opts from 200 MB to 512 MB and it worked. Thanks
07-21-2016
11:28 PM
1 Kudo
Hi @Johnny Fugers,

Input file dataset.csv:

563355,1388481000000
563355,1388481000000
563355,1388481000000
563356,1388481000000

This gives the answer in CET (the machine's local time zone):

a = load '/tmp/dataset.csv' using PigStorage(',') as (id:chararray, at:chararray);
b = foreach a generate id, ToString(ToDate((long)at), 'yyyy-MM-dd HH:mm:ss');
c = group b by id;
dump c;

And this is how it works in GMT:

a = load '/tmp/dataset.csv' using PigStorage(',') as (id:chararray, at:chararray);
b = foreach a generate id, ToDate(ToString(ToDate((long)at), 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd HH:mm:ss', 'GMT');
c = group b by id;
dump c;

Hope that helps, Thanks, Sujitha
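As a quick sanity check of the epoch conversion outside Pig (not part of the Pig job itself), the same millisecond timestamp can be decoded with a few lines of Python:

from datetime import datetime, timezone

millis = 1388481000000  # value from dataset.csv
dt = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2013-12-31 09:10:00 (GMT)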