Member since: 04-04-2016
Posts: 147
Kudos Received: 40
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1163 | 07-22-2016 12:37 AM
 | 4187 | 07-21-2016 11:48 PM
 | 1597 | 07-21-2016 11:28 PM
 | 2219 | 07-21-2016 09:53 PM
 | 3314 | 07-08-2016 07:56 PM
11-21-2016
09:37 PM
Special thanks to Michael Young for mentoring me.

Step 1: Change to the Solr config directory:

```
cd /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/
```

Step 2: vi managed-schema and add these three field definitions:

```
<field name="_timestamp_" type="date" indexed="true" stored="true" multiValued="false" />
<field name="_ttl_" type="string" indexed="true" multiValued="false" stored="true" />
<field name="_expire_at_" type="date" multiValued="false" indexed="true" stored="true" />
```

Step 3: vi solrconfig.xml in the same directory. Replace this block:
```
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- UUIDUpdateProcessorFactory will generate an id if none is present in
       the incoming document -->
  <processor class="solr.UUIDUpdateProcessorFactory" />
```

with the chain below (the class attributes on the first, second, and fourth processors did not survive the page conversion; the classes shown are the standard Solr update processor factories that take these parameters, so treat them as a reconstruction):

```
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- stamp each document with its arrival time -->
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">_timestamp_</str>
  </processor>
  <!-- give every document a default time-to-live of 30 seconds -->
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">_ttl_</str>
    <str name="value">+30SECONDS</str>
  </processor>
  <!-- compute _expire_at_ from _ttl_ and delete expired documents every 30s -->
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <str name="ttlFieldName">_ttl_</str>
    <str name="ttlParamName">_ttl_</str>
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="expirationFieldName">_expire_at_</str>
  </processor>
  <!-- keep a single value in _expire_at_ -->
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">_expire_at_</str>
  </processor>
  <!-- UUIDUpdateProcessorFactory will generate an id if none is present in
       the incoming document -->
  <processor class="solr.UUIDUpdateProcessorFactory" />
  <!-- ...rest of the existing chain unchanged... -->
```
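To see the expiration work end to end, you can index a test document and query for it twice; a minimal sketch, assuming a collection named test_collection (substitute your own):

```
# _ttl_ defaults to +30SECONDS via the chain above; the UUID processor supplies the id
curl 'http://localhost:8983/solr/test_collection/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"title":"this document self-destructs"}]'

# Run now and again after ~60 seconds: the document should be gone
curl 'http://localhost:8983/solr/test_collection/select?q=title:self-destructs'
```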
class="solr.UUIDUpdateProcessorFactory" /> screen-shot-2016-11-21-at-101045-am.png Hope that helps. Thanks, Sujitha
08-18-2016
05:43 AM
1 Kudo
Solr indexing a MySQL database table on HDP 2.5 Tech Preview. Solr version used: 4.9.0.

Step 1: Download solr-4.9.0.zip from https://archive.apache.org/dist/lucene/solr/4.9.0/

Step 2: Extract the file.

Step 3: Modify solrconfig.xml and schema.xml and add db-data-config.xml, as described in a-c below.

Step 4: Add the MySQL connector JAR (mysql-connector-java-5.0.8-bin.jar) to the lib/ directory matched by the third <lib> rule below.

a. vi solrconfig.xml: add these lines between the <config> tags:

```
<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
<lib dir="../../../lib/" regex="mysql-connector-java-5.0.8-bin.jar" />

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
```
b. vi schema.xml: add the line below:

```
<dynamicField name="*_name" type="text_general" multiValued="false" indexed="true" stored="true" />
```
c. Create a file called db-data-config.xml at the same path. (The employees database it references is created in MySQL later, in Step 5.) Add this content:

```
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/employees"
              user="root"
              password="hadoop" />
  <document>
    <entity name="id"
            query="select emp_no as 'id', first_name, last_name from employees limit 1000;" />
  </document>
</dataConfig>
```

d. After this is complete, run the command below to start Solr, then check that it is up and running at the URL below (8983 is Solr's default port):

```
java -jar start.jar
```

http://localhost:8983/solr/#/
e. Select collection1 in the core selector.

f. Click on Data Import, expand Configuration, and check that it is pointing to the db-data-config.xml file we created.

g. After completing Step 5 below, click Execute on that page (or trigger the import over HTTP, as sketched below).
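If you prefer the command line to the admin UI, the DataImportHandler can also be driven over HTTP; a minimal sketch, assuming the core is named collection1 as above:

```
# Kick off a full import (add &clean=false to keep existing documents)
curl 'http://localhost:8983/solr/collection1/dataimport?command=full-import'

# Poll the import status until it reports the documents added
curl 'http://localhost:8983/solr/collection1/dataimport?command=status'
```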
Step 5: Set up the database by importing an already-built sample database into MySQL (ref: https://dev.mysql.com/doc/employee/en/employees-installation.html):

```
shell> tar -xjf employees_db-full-1.0.6.tar.bz2
shell> cd employees_db/
shell> mysql -t < employees.sql
```

With this, the installation of the employees db in MySQL is complete.

Step 6: With this, our indexing using Solr is complete.

To do: I will try indexing the MySQL tables using the latest version of Solr.

Reference: http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/

Hope this helps. Thanks, Sujitha
08-16-2016
09:58 AM
1 Kudo
@sujitha sanku: "hadoop" is the root password of the MySQL server in the HDP 2.5 sandbox.
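For anyone who wants to verify it, a quick check from the sandbox shell (letting mysql prompt for the password keeps it out of your shell history):

```
mysql -u root -p
# Enter password: hadoop
mysql> status
```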
07-29-2016
03:10 AM
4 Kudos
I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo these projects by copying the note.json link into the Zeppelin Hub Viewer.
When working with text / unstructured data, there are a few things to keep in mind:
Cleaning the text is important: remove stopwords and punctuation; typically you will want to lowercase (or uppercase) all words, account for stemming, tag parts of speech, etc. Part-of-speech tagging is an advanced option, but it can improve accuracy when used on the right use case.
Most text analytics projects involve creating a term-document matrix (TF-IDF: term frequency-inverse document frequency). Within Spark this is typically done with the HashingTF function (see the sketch after this list).
From here, you can feed the TF-IDF vectors into a clustering algorithm such as k-means or LDA; a really good option is SVD (singular value decomposition).
You could also pair the TF-IDF matrix with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a decision tree model, a random forest, etc.
This process will help you understand your text by (1) finding data-driven topics via matrix reduction / clustering techniques, or (2) using the term-document matrix to predict an outcome (probability of failure, likelihood to churn, etc.).
You may also want to check out Word2Vec (I have an example in my github).
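To make these steps concrete, here is a minimal sketch of the cleanup, TF-IDF, and k-means flow in pyspark.ml. The toy documents, column names, and parameter values are illustrative, not taken from the repo above, and it assumes a pyspark/Zeppelin session where sqlContext is already defined:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.clustering import KMeans

# Toy corpus; in practice this would be your unstructured text column
docs = sqlContext.createDataFrame([
    (0, "The pump failed after the bearing overheated"),
    (1, "Customer called to cancel the service contract"),
    (2, "Bearing temperature alarm preceded the pump failure"),
], ["id", "text"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")             # lowercase + split on whitespace
remover   = StopWordsRemover(inputCol="words", outputCol="filtered")  # drop stopwords
tf        = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1000)
idf       = IDF(inputCol="tf", outputCol="features")                  # TF-IDF vectors
kmeans    = KMeans(featuresCol="features", k=2, seed=1)               # data-driven topics

model = Pipeline(stages=[tokenizer, remover, tf, idf, kmeans]).fit(docs)
model.transform(docs).select("id", "prediction").show()
```

Swapping KMeans for LDA, or feeding the features column plus a label into a classifier like Naive Bayes, follows the same pipeline pattern.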
Hope this helps!
07-22-2016
01:18 PM
I changed the Hive-Tez Java Opts from 200 MB to 512 MB and it worked. Thanks.
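For context, in a stock HDP/Ambari setup this field should map to hive.tez.java.opts (an assumption worth verifying on your cluster), so the change amounts to raising the Tez task JVM heap, e.g.:

```
hive.tez.java.opts=-Xmx512m
```

Keep the -Xmx value below hive.tez.container.size so the JVM fits inside its container.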
07-21-2016
11:28 PM
1 Kudo
Hi @Johnny Fugers,

Input file data (dataset.csv):

```
563355,1388481000000
563355,1388481000000
563355,1388481000000
563356,1388481000000
```

This gives the answer in CET (the local timezone):

```
a = load '/tmp/dataset.csv' using PigStorage(',') as (id:chararray, at:chararray);
b = foreach a generate id, ToString(ToDate((long)at), 'yyyy-MM-dd HH:mm:ss');
c = group b by id;
dump c;
```

And this is how it works in GMT:

```
a = load '/tmp/dataset.csv' using PigStorage(',') as (id:chararray, at:chararray);
b = foreach a generate id, ToDate(ToString(ToDate((long)at), 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd HH:mm:ss', 'GMT');
c = group b by id;
dump c;
```

Hope that helps, Thanks, Sujitha
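As a quick sanity check on the conversion: 1388481000000 ms is 1388481000 seconds after the epoch, which works out to 2013-12-31 09:10:00 in GMT; CET is UTC+1 in winter, so the first script prints the same instant as 2013-12-31 10:10:00.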