Member since: 02-09-2016
Posts: 559
Kudos Received: 422
Solutions: 98
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2106 | 03-02-2018 01:19 AM |
 | 3425 | 03-02-2018 01:04 AM |
 | 2338 | 08-02-2017 05:40 PM |
 | 2334 | 07-17-2017 05:35 PM |
 | 1693 | 07-10-2017 02:49 PM |
04-21-2017
08:11 PM
@Stefan Schuster Can you confirm that Zeppelin is actually running? Are you accessing Ambari via "http://sandbox.hortonworks.com:8080"? If not, have you added sandbox.hortonworks.com to your local computer's hosts file? Assuming you are running the Sandbox on your local computer, you can also try "http://localhost:9995" to see if that works for you.
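A quick way to check from your local machine (a sketch; the port and hostname assume the Sandbox defaults):

# Check whether the Zeppelin UI responds on the default Sandbox port:
curl -I http://localhost:9995

# If you want to use the sandbox.hortonworks.com name, it must resolve
# locally. On Windows, add this line to
# C:\Windows\System32\drivers\etc\hosts:
#   127.0.0.1   sandbox.hortonworks.com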
04-13-2017
01:19 PM
@Kelvin Tong Based on the screenshots you are providing, you are attempting to push the file to HDFS using the command line within the Sandbox itself. However, you are specifying a file path that is local to the computer running the VirtualBox Sandbox VM. That won't work: the Sandbox has no way of knowing how to access "C:\". You must first push the file to the Sandbox using WinSCP. Then you can use the hdfs dfs -put command with a local directory within the Sandbox (something like /root/<my filename>).
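A sketch of the two steps, using a hypothetical file name (WinSCP's GUI works equally well for the first step; scp is shown for brevity):

# Step 1: from your Windows machine, copy the file into the Sandbox:
scp C:\data\myfile.csv root@sandbox.hortonworks.com:/root/

# Step 2: inside the Sandbox, push the local copy into HDFS:
hdfs dfs -put /root/myfile.csv /user/root/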
03-29-2017
11:51 PM
@Girish Mane The JournalNodes are for shared edits. They are responsible for keeping the Active and Standby NameNodes in sync in terms of filesystem edits. You do not need a JournalNode for each of your DataNodes. The normal approach is to use 3 JournalNodes to give the greatest level of high availability; it's the same idea behind 3x replication of data.
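For reference, the NameNodes find the JournalNode quorum through the shared-edits setting in hdfs-site.xml. A minimal sketch, assuming three hypothetical JournalNode hosts (jn1, jn2, jn3) and a nameservice called mycluster:

dfs.namenode.shared.edits.dir=qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster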
03-21-2017
08:40 PM
@tuxnet Preemption will not kill existing tasks that are running. As tasks for any given job finish, those resources are then made available to the jobs in the queue that are relying on preemption. The idea behind using the queues is to assign a minimum amount of cluster resources to a given user/job. With preemption enabled, jobs can get access to a larger percentage of resources when they are available. If a new job comes in that requires a larger minimum share of resources than is currently available, those resources will be made available as the currently running jobs' individual tasks complete. What do your capacity scheduler queues look like in terms of percentage of cluster resources? What are the min and max values?
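For illustration, a two-queue setup might look like this (a sketch with hypothetical queue names; in Ambari these appear as key=value entries in the capacity-scheduler configuration):

# Guaranteed (minimum) and maximum capacities per queue, in percent:
yarn.scheduler.capacity.root.queues=prod,adhoc
yarn.scheduler.capacity.root.prod.capacity=70
yarn.scheduler.capacity.root.prod.maximum-capacity=100
yarn.scheduler.capacity.root.adhoc.capacity=30
yarn.scheduler.capacity.root.adhoc.maximum-capacity=50
# Preemption itself is switched on at the ResourceManager level:
yarn.resourcemanager.scheduler.monitor.enable=true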
03-21-2017
08:31 PM
@Peter Teunissen If you log into the Cloudbreak Deployer node via SSH, you can access the logs in /var/lib/cloudbreak-deployer. You can also run the cbd logs command (as root) to see the log output in real time as the cluster is deploying.
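A sketch (the host name is a placeholder for your deployer node):

ssh root@cloudbreak-deployer-host
cd /var/lib/cloudbreak-deployer
cbd logs        # streams the deployer containers' logs to the terminal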
03-19-2017
03:22 PM
1 Kudo
@mqureshi @james.jones I recommend you read up on information about SolrCloud. The reference guide provides a good overview of how it works, starting on page 419: http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf

A SolrCloud cluster uses Zookeeper for cluster coordination: keeping track of which nodes are up, how many shards a collection has, which hosts are currently serving those shards, and so on. Zookeeper is also used to store configuration sets. These are the index and schema configuration files used by your indexes. When you create a collection using the Solr scripts, the configuration files for the collection are uploaded to Zookeeper. A collection is composed of 1 or more shard indexes and 0 or more replica indexes.

When you use HDFS to store the indexes, it is much easier to add/remove SolrCloud nodes in your cluster. You don't have to copy the indexes, which are normally stored locally. The new SolrCloud node is configured to coordinate with Zookeeper. Upon startup, the new SolrCloud node will be told by Zookeeper which shards it is responsible for and will then use the respective indexes stored on HDFS.

All of the index data itself is stored within the index directories on HDFS. These directories are self-contained. Solr stores collections within index directories, where each index has its own directory within the top-level Solr index directory. This is true for local storage and HDFS. When you replicate your HDFS index directories to another HDFS cluster, all of the data is maintained within the respective index directories.

HDFS: /solr/collectionname_shard1_replica1/<index files>
HDFS: /solr/collectionname_shard2_replica1/<index files>

1. In the case of having Solr running on a DR cluster, you would need to ensure the index configuration (schemas, configuration sets, etc.) is updated in the DR Solr Zookeeper. If you create collections on your primary cluster, then you would need to similarly create collections on the DR cluster. This is primarily to ensure the collection metadata exists in both clusters. As long as these settings are in sync, copying the index directories from one HDFS cluster to the other is all you need to do to keep the DR cluster in sync with the production cluster. As I mentioned above, both clusters will be configured to store indexes in an HDFS location. As long as the index directories exist, the SolrCloud nodes will read the indexes from those HDFS directories. Solr creates those index directories based on the name of the collection/index; that is how it knows which data goes with which index.

2. Yes, you should be able to do this. If you need to "restore" a collection from backup, then you have to copy each of the collection's index shards. If you create a collection with 5 shards, then you will have 5 index directories that you need to restore from DR.

Using something like Cross Data Center Replication in SolrCloud 6 is the easiest way to get Solr DR in place. Second to that, using the native Backup/Restore functionality in SolrCloud 5 is a viable alternative. Unfortunately, SolrCloud 4 has neither of these more user-friendly approaches. I highly recommend upgrading to at least Solr 5 to get a better handle on backups and disaster recovery.
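A sketch of the two moving parts for point 1 (host names, collection name, and paths are illustrative; in Solr 4.x the collection is created through the Collections API):

# On BOTH clusters: create the collection so its metadata and
# configuration set exist in each cluster's Zookeeper:
curl 'http://solrhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=5&collection.configName=myconfig'

# Then copy the HDFS index directories from production to DR:
hadoop distcp hdfs://prod-nn:8020/solr hdfs://dr-nn:8020/solr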
03-18-2017
12:46 AM
@Yogesh Sharma The _all field is analyzed by default, so you shouldn't have problems performing case-insensitive queries. You are also specifying the analyze_wildcard: true parameter, which will attempt to analyze the query string with wildcards before running the query. As you have shown, the query itself returns hits, so the problem is with the aggregations. For your aggregations you are using the include parameter, which expects a regular expression rather than a wildcard pattern. Can you try using ".*drama.*" as the include value instead of "*drama*"?
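A sketch of the adjusted request (the aggregation name and field are placeholders for whatever your original request used):

curl -XGET 'http://localhost:9200/movies/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "genre_terms": {
      "terms": {
        "field": "genre",
        "include": ".*drama.*"
      }
    }
  }
}'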
03-17-2017
09:44 PM
@mqureshi If Solr is storing the indexes on HDFS, then you have a fairly easy way of doing backups. You can use HDFS snapshots to take incremental backups of the Solr index directories on HDFS and then use distcp to copy those snapshots to another HDFS cluster. That provides the ability to have local backup copies and remote backup copies. If you didn't want to perform the HDFS snapshots, you could simply use distcp to replicate the HDFS data to another cluster. However, you lose the easy ability to restore an HDFS snapshot from a local backup.
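A sketch of that flow, assuming the indexes live under /solr on HDFS (paths and names are illustrative):

# One-time: allow snapshots on the Solr index directory:
hdfs dfsadmin -allowSnapshot /solr

# Take a named snapshot (incremental at the HDFS level):
hdfs dfs -createSnapshot /solr backup-20170317

# Copy the snapshot to the remote cluster with distcp:
hadoop distcp /solr/.snapshot/backup-20170317 hdfs://dr-nn:8020/solr-backups/backup-20170317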
03-17-2017
02:21 PM
1 Kudo
@mqureshi Cross Data Center Replication for Solr was released in Solr 6.x; it is not available in version 4.10.3. Take a look at page 409 of the reference guide, which covers using the ReplicationHandler to make backup copies of indexes: http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf You can always use standard filesystem methods for performing backups, but it isn't as clean as CDCR in Solr 6.x. Solr 5.x introduced the ability to back up and restore your indexes using the API. I would encourage customers to upgrade to at least Solr 5.x.
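In 4.x, a ReplicationHandler backup is triggered per core over HTTP. A sketch, with hypothetical host, core name, and backup location:

curl 'http://solrhost:8983/solr/mycollection_shard1_replica1/replication?command=backup&location=/backups&name=20170317'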
03-08-2017
02:48 PM
@Yogesh Sharma Have you disabled the _all field? That is the catch-all field used for a query when you don't specify a field. Your queries are not specifying a specific field, so they should be going against the _all field. By default the _all field should be able to handle mixed-case queries. Have you verified the query returns results without any of the aggregations?

GET /movies/_search?pretty
{
  "size": 10,
  "_source": false,
  "query": {
    "query_string": {
      "analyze_wildcard": true,
      "query": "*drama*"
    }
  }
}