Member since: 01-09-2014
Posts: 283
Kudos Received: 70
Solutions: 50
My Accepted Solutions

Views | Posted
---|---
1717 | 06-19-2019 07:50 AM
2762 | 05-01-2019 08:07 AM
2811 | 04-10-2019 08:49 AM
2711 | 03-20-2019 09:30 AM
2366 | 01-23-2019 10:58 AM
03-15-2017
03:58 PM
1. Simply copying the production (PRO) machine's index and collection folder in HDFS to the DR cluster — will it work?

This will not work, unfortunately. The Solr index and tlog files are in a constant state of being updated, and there is no way to ensure a consistent snapshot while Solr is running. This could be done if Solr were shut down; however, the core_node directories that exist under /solr/<collection_name> in HDFS are mapped to specific shards/replicas, and you would have to ensure that when creating the corresponding collection in DR, you map the core_node directories to the same shards/replicas at collection creation time.

2. Is there any possibility of keeping the CDH 5.4.8 and CDH 5.4.8 DR machines always in sync on index and collection?

Prior to CDH 5.9, the best way to do this is to have your indexing jobs publish documents to both collections. As of CDH 5.9, there is the ability to back up and restore collections, either locally or in DR: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/search_backup_restore.html

3. What is the recommended way to take a backup of the production Solr indexes and collections to the DR cluster?

If you can't upgrade to CDH 5.9, then the recommended way to back up the Solr indexes is to stop the Solr service and do an HDFS snapshot or distcp to copy the indexes to a backup location (see the sketch below). If you need to run the same collection at the backup location, you would need to create it with the createNodeSet property (for Solr 4.10.3) to ensure the collection gets created on the proper nodes, and you'd have to verify that the core_node directories map to the same shards in the clusterstate.json as in production. -pd
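A minimal sketch of that stop-then-copy approach, assuming the Solr data lives under /solr and using a made-up DR NameNode address (dr-nn.example.com). Stop the Solr service first so the index and tlog files are quiescent:

```bash
# Allow and take an HDFS snapshot of the Solr data directory
hdfs dfsadmin -allowSnapshot /solr
hdfs dfs -createSnapshot /solr solr-backup-20170315

# Copy the snapshot to the DR cluster with distcp
hadoop distcp \
  /solr/.snapshot/solr-backup-20170315 \
  hdfs://dr-nn.example.com:8020/solr-backups/

# Restart the Solr service once the snapshot is taken
```

The snapshot gives you a read-consistent view of the stopped index, so the distcp can run while you bring production Solr back up.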
03-13-2017
09:11 AM
Based on this error:

17/03/11 23:35:34 WARN conf.FlumeConfiguration: Could not configure sink agent-sink due to: No channel configured for sink: agent-sink
org.apache.flume.conf.ConfigurationException: No channel configured for sink: agent-sink

Sinks can only have one channel that they are attached to, so the property is singular. Change the following line:

agent.sinks.agent-sink.channels = agent-chan

to:

agent.sinks.agent-sink.channel = agent-chan
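For context, a minimal sketch of the full wiring (the netcat source and memory channel are placeholder assumptions; the singular/plural distinction on the channel properties is the point):

```properties
agent.sources = agent-src
agent.channels = agent-chan
agent.sinks = agent-sink

agent.sources.agent-src.type = netcat
agent.sources.agent-src.bind = 0.0.0.0
agent.sources.agent-src.port = 44444
# A source can fan out to multiple channels, so its property is plural
agent.sources.agent-src.channels = agent-chan

agent.channels.agent-chan.type = memory

agent.sinks.agent-sink.type = logger
# A sink drains exactly one channel, so its property is singular
agent.sinks.agent-sink.channel = agent-chan
```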
03-13-2017
09:10 AM
Are you following this plugin directory architecture: http://flume.apache.org/FlumeUserGuide.html#the-plugins-d-directory

If you look in the Flume stderr.log, you should see the plugins path on the command line:

stderr.log:+ exec /opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p0.19/lib/flume-ng/bin/flume-ng agent --conf /var/run/cloudera-scm-agent/process/434-flume-AGENT --classpath /var/run/cloudera-scm-agent/process/434-flume-AGENT/hbase-conf:/var/run/cloudera-scm-agent/process/434-flume-AGENT/hadoop-conf: --conf-file /var/run/cloudera-scm-agent/process/434-flume-AGENT/flume.conf --name tier2 -Djava.net.preferIPv4Stack=true -Duser.home=/var/lib/flume-ng -Xms1073741824 -Xmx1073741824 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/FLUME-1_FLUME-AGENT-1_pid30399.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dflume.monitoring.type=HTTP -Dflume.monitoring.port=24001 --plugins-path /usr/lib/flume-ng/plugins.d:/var/lib/flume-ng/plugins.d
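For reference, the layout the user guide describes looks like this (the plugin name my-custom-source is a made-up example; each plugin gets its own subdirectory):

```
plugins.d/
  my-custom-source/
    lib/            # the plugin's own jar(s)
      my-custom-source.jar
    libext/         # the plugin's dependency jars
      some-dependency.jar
    native/         # any native libraries
      libsomething.so
```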
02-15-2017
02:15 PM
2 Kudos
numFound is the number that should be returned each time. If it is different, there are a couple of possibilities:

1. You are indexing in real time, so numFound would keep increasing, or, if you're using the Lily HBase indexer, docs could be deleted.

2. The replicas for a given shard are out of sync. You can find out if this is the case by sending the same query to each replica in the shard, adding the distrib=false property to the URL string so each replica answers only from its own index:

http://solr.server/solr/collection1_shard1_replica1/select?q=*:*&distrib=false
http://solr2.server/solr/collection1_shard1_replica2/select?q=*:*&distrib=false

If those return different results and you aren't doing real-time indexing, then there is likely an issue, and you can use DELETEREPLICA and ADDREPLICA to re-create the out-of-sync replica: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cm_mc_solr_service.html#id_s15_n33_45 -pd
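A quick sketch for comparing the counts from a shell, using the two example replica URLs above (hostnames are the placeholder values from this post):

```bash
for url in \
  "http://solr.server/solr/collection1_shard1_replica1" \
  "http://solr2.server/solr/collection1_shard1_replica2"
do
  echo -n "$url -> "
  # rows=0 skips the documents themselves; we only want the count
  curl -s "$url/select?q=*:*&distrib=false&rows=0&wt=json" \
    | grep -o '"numFound":[0-9]*'
done
```

Matching numFound values across replicas of the same shard means they are in sync.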
02-13-2017
10:54 AM
You are correct: once the batch of messages has been read from the queue and confirmed delivered to the channel (flushed to disk), the messages are marked as acknowledged and, depending on your settings in IBM MQ, can be deleted. -pd
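For illustration, a rough sketch of a Flume JMS source pointed at MQ. The JNDI context factory and provider URL are environment-specific assumptions (here, a file-based .bindings directory), and batchSize controls how many messages are read before the batch is committed to the channel and acknowledged:

```properties
agent.sources = mq-src
agent.sources.mq-src.type = jms
# JNDI lookup details are assumptions; adjust for your MQ setup
agent.sources.mq-src.initialContextFactory = com.sun.jndi.fscontext.RefFSContextFactory
agent.sources.mq-src.providerURL = file:///opt/mq/jndi
agent.sources.mq-src.connectionFactory = ConnectionFactory
agent.sources.mq-src.destinationName = MY.QUEUE
agent.sources.mq-src.destinationType = QUEUE
# Messages are acknowledged per committed batch of this size
agent.sources.mq-src.batchSize = 100
# mq-chan is a channel assumed to be defined elsewhere in the config
agent.sources.mq-src.channels = mq-chan
```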
01-31-2017
08:06 AM
Take a look at the following settings:

num.streams
num.producers

Increasing num.streams will increase the number of consumer threads that you have running, and increasing num.producers will allow you to produce more messages to the destination in parallel. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330 -pd
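As a sketch, both settings can be passed on the MirrorMaker command line (the property file names and topic whitelist are placeholders; the flag names follow the wiki page above for the older MirrorMaker):

```bash
kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config source-consumer.properties \
  --producer.config target-producer.properties \
  --num.streams 4 \
  --num.producers 4 \
  --whitelist '.*'
```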
01-06-2017
12:13 PM
As I stated before, Flume can't consume from a remote HTTP server. You would need to have something that can consume from the remote server and then post to Flume. -pd
01-06-2017
12:11 PM
It seems like you are having problems even reaching HDFS. Have you tried a simple 'hdfs dfs -ls' from that Flume node? Are you running iptables? Can you ping/traceroute to the NameNode? -pd
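A quick checklist of those commands, run from the Flume node (the NameNode hostname is a placeholder):

```bash
hdfs dfs -ls /              # basic HDFS access
ping -c 3 nn.example.com    # basic network reachability to the NameNode
traceroute nn.example.com   # shows where the route breaks, if it does
iptables -L -n              # any local firewall rules in the way
```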
01-06-2017
11:51 AM
Take a look at the preferred leader election tool: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools

This assumes that the desired leaders are listed first in the partition's replica list. If broker 30 is listed first, it will still be the leader for all those partitions after the election. -pd
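A sketch of triggering the election (the ZooKeeper address is a placeholder):

```bash
# Elect the first-listed (preferred) replica as leader for all partitions
kafka-preferred-replica-election.sh --zookeeper zk1.example.com:2181
```

To change which broker is preferred, the replica list itself has to be reordered first, e.g. with kafka-reassign-partitions.sh, so the desired broker appears first.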
12-02-2016
12:05 PM
Flume doesn't have the ability to poll an HTTP service; however, it can act as an HTTP service itself (http://flume.apache.org/FlumeUserGuide.html#http-source) that you can post JSON data (or other formats) to. I would suggest reviewing the documentation here to see some examples and different configuration options: http://flume.apache.org/FlumeUserGuide.html

In Cloudera Manager, you will be editing the Configuration File section, and that is the configuration that is read when Flume starts up. -pd
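A minimal sketch of an HTTP source with the default JSON handler (the agent/channel/sink names, port, and logger sink are placeholder assumptions):

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# HTTP source listening for POSTed events
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

The default handler expects a JSON array of events, each with optional headers and a body, so a test post could look like:

```bash
curl -X POST http://flume-host.example.com:5140 \
  -H 'Content-Type: application/json' \
  -d '[{"headers": {"source": "test"}, "body": "hello flume"}]'
```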