Member since: 01-09-2014
Posts: 283
Kudos Received: 68
Solutions: 50

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 812 | 06-19-2019 07:50 AM |
| | 1373 | 05-01-2019 08:07 AM |
| | 1422 | 04-10-2019 08:49 AM |
| | 1030 | 03-20-2019 09:30 AM |
| | 1345 | 01-23-2019 10:58 AM |
06-19-2019
04:52 PM
2 Kudos
How many brokers have you configured? If it is fewer than 3, you need to make sure that offsets.topic.replication.factor is reduced to match. If that isn't the problem, there should be some indication in the broker logs of what the issue is. -pd
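For reference, a minimal sketch of the broker-side setting, assuming a hypothetical single-broker setup (apply it via server.properties or the equivalent Kafka safety valve in your deployment):

```
# Assumed single-broker setup: the internal __consumer_offsets topic cannot be
# created with the default replication factor of 3, so reduce it to match.
offsets.topic.replication.factor=1
```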
06-19-2019
07:50 AM
Moving them 10% at a time would be a good plan. You'll want to make sure they stay on the same filesystem, just in a different directory, so the move isn't copying data across filesystems, only relinking inodes. Going forward, it would be recommended to add a Flume channel trigger to alert you when the channel starts filling up because your downstream agent isn't accepting events. -pd
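As an illustration only, with hypothetical paths, a same-filesystem move that only relinks inodes:

```
# Both directories are assumed to be on the same mount point, so mv only
# updates directory entries (inodes) rather than copying data blocks.
df --output=target /data1/incoming /data1/flume/spool   # should print the same mount
mv /data1/incoming/batch-001* /data1/flume/spool/
```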
06-05-2019
08:35 AM
Can you please provide the reassign-partitions command and files that you are using to migrate? What version of CDK are you using? -pd
05-01-2019
08:07 AM
No, if you only have one sink, you would have one file (assuming you don't use header variable buckets). The sink will consume from all three partitions and may deliver those in one batch to one file. -pd
05-01-2019
08:06 AM
No, the hdfs.path and any variables used in it determine how many files get created in HDFS. Whether you use headers (for example a %{topic} header) in the hdfs.path or hdfs.filePrefix determines how many files get written. The sink consumes events from the channel and won't differentiate between topics. A Kafka channel can only have one topic, and the sink can only have one channel, so effectively one topic. If you use the Flume Kafka source with multiple topics, then all of those events will end up in the one channel that the sink pulls from. -pd
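A hedged sketch, with hypothetical agent/component names, of what per-topic files could look like when a Kafka source sets the 'topic' header:

```
# %{topic} is resolved from the event header added by the Flume Kafka source,
# so events from different topics land in different directories and files.
tier1.sinks.hdfssink1.type = hdfs
tier1.sinks.hdfssink1.channel = channel1
tier1.sinks.hdfssink1.hdfs.path = /data/flume/%{topic}/%Y-%m-%d
tier1.sinks.hdfssink1.hdfs.filePrefix = %{topic}
tier1.sinks.hdfssink1.hdfs.useLocalTimeStamp = true
```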
04-10-2019
08:49 AM
1 Kudo
What version of CDH are you using? In newer versions it displays a warning, but still allows you to save the changes without a source. -pd
04-09-2019
08:26 AM
Can you provide some more of the logs from the agent that should be sending but isn't? It's hard to tell whether you have any errors from such a small snippet of logs. Did you verify that you are overriding the agent name to 'a1' and 'a2' in each Flume instance's configuration? I would also recommend using the taildir source instead of the exec source; it's much more reliable, and you can use patterns to match the files that you want to send: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#taildir-source -pd
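A minimal sketch of a taildir source, with hypothetical agent/component names and paths, following the linked user guide:

```
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/lib/flume-ng/taildir_position.json
a1.sources.r1.filegroups = fg1
# Regex-style pattern matching the files to tail; adjust to your log layout.
a1.sources.r1.filegroups.fg1 = /var/log/myapp/.*\.log
```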
03-22-2019
08:28 AM
1 Kudo
Unfortunately, we don't support either SafeNet or Elasticsearch. The recommendation would be to run Elasticsearch on nodes that are separate from the CDH cluster; then you can configure SafeNet without any concern about other service users. Since Elasticsearch provides all interaction through its API, other service users shouldn't need any access to decrypt the data that Elasticsearch is using. Alternatively, you could use Cloudera Navigator Encrypt [1] to encrypt the data at rest and Solr as your search engine, which is fully integrated into CDH. -pd [1] https://www.cloudera.com/documentation/enterprise/latest/topics/sg_navigator_encrypt.html#concept_navigator_encrypt
03-22-2019
08:16 AM
Try running your query against each replica, adding "distrib=false" to the custom parameters. Are you seeing the same numDocs on each replica in the shard? -pd
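For example (hypothetical host, collection, and replica names), querying one replica core directly:

```
# distrib=false keeps the query on this core only, so the count reflects just
# that replica rather than a distributed result.
curl "http://solr-host-1.example.com:8983/solr/collection1_shard1_replica1/select?q=*:*&rows=0&distrib=false"
```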
03-20-2019
09:30 AM
The snapshots are part of the indexes, representing a point-in-time list of the segments in the index. When you perform the backup, the metadata (information about the cluster) and the specified snapshot indicate which set of index files is to be backed up/copied to the destination HDFS directory (as specified in the <backup> section of the source solr.xml). This blog walks through the process: https://blog.cloudera.com/blog/2017/05/how-to-backup-and-disaster-recovery-for-apache-solr-part-i/

When you run --prepare-snapshot-export, it creates a copy of the metadata and a copy listing of all the files that will be copied by the distcp command to the remote cluster. Then, when you execute the snapshot export, the distcp command copies those files to the remote cluster. The -b on the restore command is just the name of the directory (represented by the snapshot name) that was created and copied by distcp. -pd
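A hedged sketch of the end-to-end flow with hypothetical collection, snapshot, and path names; the exact flags should be verified against `solrctl collection --help` and the linked blog post:

```
# 1. Create a point-in-time snapshot of the collection's index segments.
solrctl collection --create-snapshot mysnap -c mycollection
# 2. Prepare the export: writes the metadata and the copy listing used by distcp.
solrctl collection --prepare-snapshot-export mysnap -c mycollection -d /backups/mycollection
# 3. Copy the listed files to the remote (DR) cluster.
hadoop distcp hdfs://source-nn:8020/backups/mycollection hdfs://dr-nn:8020/backups/mycollection
# 4. Restore on the remote cluster; -b is the snapshot/directory name created above.
solrctl collection --restore mycollection_restored -l /backups/mycollection -b mysnap -i restore-req-1
```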
03-19-2019
04:10 PM
You are correct that there isn't a predictable or guaranteed order for the core_node names. The recommendation would be to use the Solr backup and restore functionality (which uses distcp to transfer the index files and metadata) between your source cluster and your target cluster: https://www.cloudera.com/documentation/enterprise/latest/topics/search_backup_restore.html -pd
02-27-2019
11:29 AM
That's odd that the VM is read-only. Are you making the change in CM, in the Flume logging safety valve? -pd
02-27-2019
11:26 AM
There isn't any way to run extra scripts when CM restarts Kafka, so it sounds like your Python solution may be the best option. You could also consider creating a custom CSD that would run your scripts/daemons: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_addon_services.html -pd
02-25-2019
11:52 AM
Have you enabled log4j DEBUG to see if there is any additional information? If you review the /data/flume/positions/tuzla2kafka-taildir_position.json file, do you see reference to the missing files there? -pd
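One quick way to inspect that position file (path taken from above) and see which files taildir is currently tracking:

```
# Pretty-print the taildir position file; each entry shows a tracked file and
# the byte offset Flume has read up to.
python -m json.tool /data/flume/positions/tuzla2kafka-taildir_position.json
```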
01-23-2019
02:16 PM
1 Kudo
Just realized, the log4j setting should go in the Flume logging safety valve, not the broker's. Also, make sure you can run a kafka-console-consumer and connect to the topic as well, just to make sure it's not something with Kafka. -pd
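A quick sanity check, with hypothetical broker and topic names:

```
# If this returns messages, the topic and brokers are fine and the problem is
# on the Flume side; --max-messages keeps the check short.
kafka-console-consumer --bootstrap-server broker1.example.com:9092 --topic mytopic --from-beginning --max-messages 10
```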
01-23-2019
10:58 AM
1 Kudo
Morphlines would be the preferred way to selectively choose the data that passes through the source to the sink. You can use the morphline removeFields command [1] to selectively drop the fields you don't want. If you need to review what is happening with the data, you can turn on morphline TRACE by adding the following to the Flume logging safety valve: log4j.logger.org.kitesdk.morphline=TRACE -pd [1] http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html#removeFields
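A hedged sketch of a morphline using removeFields; the field names here are hypothetical and only illustrate the pattern:

```
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Drop fields you don't want forwarded; "literal:" matches an exact field name.
      { removeFields { blacklist : ["literal:debug_info", "literal:internal_id"] } }
    ]
  }
]
```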
01-23-2019
10:49 AM
1 Kudo
What is your channel size reported as on the Flume metrics page? Is it decreasing? Flume keeps at least the two most recent log files in the Flume file channel at all times, regardless of whether it is fully drained or not. The best approach is to review the channel size on the Flume metrics page, or on the channel size charts. -pd
01-23-2019
10:47 AM
1 Kudo
The problem is usually that the Kafka consumer is not configured properly and is failing silently while it is running. You can verify whether the Flume consumer group is actually connected to partitions by running the "kafka-consumer-groups" command. You could also turn on log4j.logger.org.apache.kafka=DEBUG in the broker logging safety valve and review the messages when Flume tries to connect to Kafka. A lot of "errors" are retryable, meaning they won't throw an exception, but you won't see any output. -pd
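For example, with hypothetical broker and group names:

```
# Shows which partitions (if any) the Flume consumer group owns, plus its
# current offset and lag per partition.
kafka-consumer-groups --bootstrap-server broker1.example.com:9092 --describe --group flume
```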
01-17-2019
01:19 PM
The recommended path in this situation is to just comment out the sources line that specifies which sources are configured (e.g. # tier1.sources = kafkasource1 kafkasource2). The Flume agent can function without any sources and will then drain the channel through the sinks, without adding any new data to the channel. -pd
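A minimal sketch, with hypothetical component names, of what the drained-down configuration could look like:

```
# Sources commented out; channels and sinks stay defined so the channel drains.
# tier1.sources = kafkasource1 kafkasource2
tier1.channels = channel1
tier1.sinks = hdfssink1
tier1.sinks.hdfssink1.channel = channel1
```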
12-27-2018
01:59 PM
1 Kudo
It is possible to use it, although Kafka Connect isn't officially supported by Cloudera: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_known_issues.html -pd
12-27-2018
01:50 PM
1 Kudo
You can use the Flume JMS source (http://flume.apache.org/FlumeUserGuide.html#jms-source) to consume messages off the IBM MQ queue and either use a Kafka channel or a Kafka sink to send those messages to Kafka. -pd
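A hedged sketch of a JMS source feeding a Kafka channel; the JNDI/MQ values below are placeholders and must be replaced with ones matching your IBM MQ setup:

```
a1.sources = jms1
a1.channels = kc1
a1.sources.jms1.type = jms
a1.sources.jms1.channels = kc1
# JNDI settings below are placeholders for an IBM MQ bindings-file setup.
a1.sources.jms1.initialContextFactory = com.sun.jndi.fscontext.RefFSContextFactory
a1.sources.jms1.connectionFactory = GenericConnectionFactory
a1.sources.jms1.providerURL = file:///opt/mq/jndi-bindings
a1.sources.jms1.destinationName = MY.QUEUE
a1.sources.jms1.destinationType = QUEUE
a1.channels.kc1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc1.kafka.bootstrap.servers = broker1.example.com:9092
a1.channels.kc1.kafka.topic = mq-events
a1.channels.kc1.parseAsFlumeEvent = false
```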
12-26-2018
03:50 PM
2 Kudos
A word of caution: Flume isn't really designed for transferring large files. It would be recommended to use Oozie or an NFS gateway with cron to transfer files on a regular basis, especially if you want each file preserved in its entirety. One thing you will observe is that if Flume has any temporary transmission errors, it will attempt to resend parts of those files, which will result in duplicates (a standard and expected scenario when using Flume), so your resulting files in HDFS would contain those duplicates. Additionally, when you do have interruptions, existing HDFS files are closed and new ones are opened. -pd
11-08-2018
01:08 PM
The issue is not whether Kerberos is used, rather that the curl command expects Kerberos support to be there (since it is there by default with the standard OS distribution of curl). Since it is not there, the curl command fails, and thus the solrctl script fails. If you run the following, what is your result?

curl --version

If you are running Red Hat, can you also run the following and provide the output?

which curl
yum whatprovides curl

-pd
11-05-2018
04:18 PM
CDH6 has rebased to Solr 7. Given the large new set of features, it is included in a major release and not a minor release. If you need the functionality in Solr 7, the recommendation would be to upgrade to CDH6. -pd
11-05-2018
04:16 PM
That's your problem: you are using a version of curl that doesn't support Kerberos. You should see something like this for the curl --version command:

[root@nightly515-1 ~]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.21 Basic ECC zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz unix-sockets

It needs to support "GSS-Negotiate". It's likely you installed a custom version of curl, or updated to a version that doesn't support it. -pd
11-02-2018
12:53 PM
Does the curl command I noted return an actual web page? From the output, it is possible there is something wrong with the curl binaries that you are using... -pd
11-01-2018
09:58 AM
It looks like it's failing to contact the Solr nodes. Are you able to run this successfully from the host where the solrctl command is running?

curl -i --retry 5 -s -L -k --negotiate -u : http://ip-172-31-82-140.ec2.internal:8983/solr

-pd
11-01-2018
08:37 AM
Can you run with the --trace option and see if there's any indication of why the ZK_ENSEMBLE is not being used? -pd
08-31-2018
09:01 AM
FLUME-3027 has been backported to CDH 5.11.0 and above, so if you are able to upgrade, it would prevent the issue of offsets bouncing back and forth. One thing you may want to consider: if you are getting rebalances, it may be because your sink is taking too long to deliver before polling Kafka again. You may want to lower your sink batch size in order to deliver and ack the messages in a timely fashion. Additionally, if you upgrade to CDH 5.14 or higher, the Flume Kafka client is 0.10.2, and you would be able to set max.poll.records to match the batchSize you are using for the Flume sink. You could also increase max.poll.interval.ms, which is decoupled from session.timeout.ms in 0.10.0 and above. This would prevent the rebalancing from occurring, since the client would still heartbeat without having to do a poll to pull more records before session.timeout.ms expires. -pd
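A hedged sketch, with hypothetical component names, of how those overrides could be passed through a Flume Kafka source on CDH 5.14+ (Kafka client 0.10.2), assuming kafka.consumer.* keys are forwarded to the client:

```
# Keep the Flume batch size and max.poll.records aligned, and allow more time
# between polls before a rebalance is triggered.
tier1.sources.kafkasource1.batchSize = 500
tier1.sources.kafkasource1.kafka.consumer.max.poll.records = 500
tier1.sources.kafkasource1.kafka.consumer.max.poll.interval.ms = 600000
tier1.sinks.hdfssink1.hdfs.batchSize = 500
```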
08-30-2018
01:22 PM
You can have multiple Flume agents running on multiple hosts. If they share the same Flume-configured group.id, the messages will be distributed across all the agents (not duplicated). If you don't need to do any processing on the events (via an interceptor), you could just use a Kafka channel and an HDFS sink, which would deliver events directly from the channel. In that case you can only use one topic per channel, but you could then have an associated sink delivering just that topic. If you did want to use a Flume Kafka source, it adds a 'topic' header that specifies the topic name that the message was consumed from, and you could put that in the hdfs.path or hdfs.filePrefix as %{topic}. -pd
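A hedged sketch, with hypothetical names, of the source-less layout described above: a Kafka channel feeding an HDFS sink directly, run on several hosts with the same group.id so partitions are spread across agents:

```
tier1.channels = kc1
tier1.sinks = hdfs1
tier1.channels.kc1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.kc1.kafka.bootstrap.servers = broker1.example.com:9092
tier1.channels.kc1.kafka.topic = mytopic
tier1.channels.kc1.kafka.consumer.group.id = flume-hdfs
# parseAsFlumeEvent=false because the messages are produced outside Flume.
tier1.channels.kc1.parseAsFlumeEvent = false
tier1.sinks.hdfs1.type = hdfs
tier1.sinks.hdfs1.channel = kc1
tier1.sinks.hdfs1.hdfs.path = /data/kafka/mytopic/%Y-%m-%d
tier1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
```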