Member since
03-16-2016
12-24-2016
08:50 PM
1 Kudo
@Ashnee Sharma In addition to what Sagar provided, be aware that in the case of Oracle Directory Server Enterprise Edition 11g (a few other LDAP servers have the same issue), Ambari uses LDAP result paging controls when synchronizing a large number of LDAP users/groups. If that is your case, set the authentication.ldap.pagination.enabled property to false in the /etc/ambari-server/conf/ambari.properties file to disable result paging controls. This will limit the maximum number of entities that can be imported at any given time to the maximum result limit of the LDAP server. To work around this, import sets of users or groups using the --users and --groups options, as Sagar already included in his commands. Also, when syncing LDAP, local user accounts with a matching username will switch to the LDAP type, which means they will authenticate against the external LDAP and not against the local Ambari user store. Be advised! LDAP sync only syncs up to 1000 users. If your LDAP contains over 1000 users and you plan to import more than 1000 users, you must use the --users option when syncing and specify a filtered list of users to perform the import in batches. This is another thing to be aware of.
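A batched sync along those lines could look like the following sketch (the CSV file names are placeholders, and the commands assume a configured Ambari server):

```shell
# Disable LDAP result paging (needed for LDAP servers such as ODSEE 11g),
# set in /etc/ambari-server/conf/ambari.properties:
#   authentication.ldap.pagination.enabled=false

# Sync one filtered batch of users/groups; each CSV file is a
# comma-separated list of names small enough to stay under the
# LDAP server's result limit:
ambari-server sync-ldap --users users-batch1.csv --groups groups-batch1.csv

# Repeat with the next batch file until all users are imported:
ambari-server sync-ldap --users users-batch2.csv
```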
12-24-2016
08:36 PM
2 Kudos
@Sampath Kumar Have you followed the prerequisites to prepare all the nodes? The error "Unable to determine the IP address of the Ambari server 'node1.example.com:8080'" indicates that you did not meet the FQDN requirement. Read this: https://community.hortonworks.com/questions/42911/fqdns-are-they-necessary.html. Let me know.
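As a quick check (the hostname and IP below are examples), each host should report and resolve its own FQDN:

```shell
# Verify the host reports a fully qualified domain name:
hostname -f          # should print e.g. node1.example.com

# If it does not, add an entry to /etc/hosts, for example:
#   192.168.1.10  node1.example.com  node1
# and set the hostname to the FQDN.
```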
12-24-2016
08:09 PM
@R c Of course you can. It is not ideal, but it is usable. If it helped, pls vote/accept best answer.
12-24-2016
05:05 AM
1 Kudo
@Huahua Wei The statement in the recommendation is related to the configuration that @Michael Young already mentioned. What @ddharam describes is a different issue, applicable to RHEL 6; good to know too.
12-23-2016
03:38 PM
@Boris Demerov Yes. The reference provided by @Randy Gelhausen is awesome, one of the best articles in HCC. It covers Kafka tuning practices beyond the scope of your question, but it is a must-read article. I'd like to point out a few good rules of thumb related to your question, which Wes also covered in his article (I extracted and commented on a few):

- Set num.io.threads to at least the number of disks you are going to use; the default is 8. It can be higher than the number of disks. A common broker server has 8 disks; that is my current experience, but this number can be increased.
- Set num.network.threads higher based on the number of concurrent producers, consumers, and the replication factor. The default value of 3 was set based on field experience, but you can take an iterative approach and test different values until you find what is optimal for your case.
- Ideally you want to assign the default number of partitions (num.partitions) to at least n-1 servers. This can break up the write workload and allows for greater parallelism on the consumer side. Remember that Kafka does total ordering within a partition, not over multiple partitions, so make sure you partition intelligently on the producer side to parcel up units of work that might span multiple messages/events. Consumers benefit from this approach; on the producer side, careful design is recommended. You need to balance the benefits between producers and consumers based on your business needs.
- Kafka is designed for small messages. I recommend you avoid using Kafka for larger messages. If that's not avoidable, there are several ways to go about sending larger messages, like 1 MB. Use compression: if the original message is JSON, XML, or text, compression is the best option to reduce the size. Large messages will affect your performance and throughput. Check your topic partitions and replica.fetch.size to make sure it doesn't go over your physical RAM. Another approach is to break the message into smaller chunks and use the same message key to send them to the same partition. This way you are sending small messages, and they can be re-assembled at the consumer side. This complicates your producer and consumer code in the case of very large messages. Design carefully how producers and consumers deal with large messages. There are many ways to implement compression or chunking, as well as decompression and assembly; choose after testing your approach. For example, a high compression ratio is most of the time an advantage, but it comes with a price paid in compression/decompression time. It is sometimes more efficient to use a lighter but faster compression, as long as you can reduce the size of the message under 1 MB. It all comes down to your SLAs, whether they are milliseconds or seconds.

I am sure I did not cover the domain completely, but hopefully this helps.
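To illustrate the chunking idea above (plain shell, not a Kafka client; the file names are made up), a large payload can be split into sub-1 MB chunks that would all be sent with the same message key, and re-assembled in order on the consumer side:

```shell
# Stand-in for a ~2 MB message that is too large to send as-is:
head -c 2000000 /dev/urandom > large-payload.bin

# Producer side: split the payload into 900 KB chunks. All chunks
# would be sent with the same message key so they land in the same
# partition and Kafka preserves their relative order.
split -b 900k large-payload.bin chunk_

# Consumer side: re-assemble the chunks in order.
cat chunk_* > reassembled.bin

# The re-assembled payload is identical to the original:
cmp large-payload.bin reassembled.bin && echo "payloads match"
```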
12-23-2016
02:59 AM
12 Kudos
Introduction

The producer sends data directly to the broker that is the leader for the partition, without any intervening routing tier.

Optimization Approach

Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64 KB or 10 ms). This allows the accumulation of more bytes to send, and fewer, larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput. In order to find the optimal batch size and latency, iterative testing supported by producer statistics monitoring is needed.

Enable Monitoring

Start the producer with the JMX parameters enabled: JMX_PORT=10102 bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testtopic

Producer Metrics

Use the jconsole application via JMX at port number 10102. Tip: run jconsole remotely to avoid impact on the broker machine. See the metrics in the MBeans tab. The <strong>clientId</strong> parameter is the producer client ID for which you want the statistics.

<strong>kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestRateAndTimeMs,clientId=console-producer</strong>

This MBean gives values for the rate of producer requests taking place as well as the latencies involved in that process. It gives latencies as a mean and as the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles. It also gives the time taken to produce the data as a mean, one-minute average, five-minute average, and fifteen-minute average, as well as the count.

<strong>kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestSize,clientId=console-producer</strong>

This MBean gives the request size for the producer: the count, mean, max, min, standard deviation, and the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles of request sizes.

<strong>kafka.producer:type=ProducerStats,name=FailedSendsPerSec,clientId=console-producer</strong>

This gives the number of failed sends per second, as a count, mean rate, one-minute average, five-minute average, and fifteen-minute average.

<strong>kafka.producer:type=ProducerStats,name=SerializationErrorsPerSec,clientId=console-producer</strong>

This gives the number of serialization errors per second, as a count, mean rate, one-minute average, five-minute average, and fifteen-minute average.

<strong>kafka.producer:type=ProducerTopicMetrics,name=MessagesPerSec,clientId=console-producer</strong>

This gives the number of messages produced per second, as a count, mean rate, one-minute average, five-minute average, and fifteen-minute average.

References
https://kafka.apache.org/documentation.html#monitoring
Apache Kafka Cookbook by Saurabh Minni, 2015
12-23-2016
02:54 AM
9 Kudos
@Boris Demerov Usually you don't need to modify these settings; however, if you want to extract every last bit of performance from your machines, changing some of them can help. You may have to tweak some of the values, but these have worked in 80% of cases for me:

message.max.bytes=1000000
num.network.threads=3
num.io.threads=8
background.threads=10
queued.max.requests=500
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=1

Quick explanations of the numbers:

- message.max.bytes: This sets the maximum size of message the server can receive. It should be set to prevent any producer from inadvertently sending extra-large messages and swamping the consumers. The default is 1000000.
- num.network.threads: This sets the number of threads handling network requests. If you are going to have too many requests coming in, you need to change this value; otherwise you are good to go. Its default value is 3.
- num.io.threads: This sets the number of threads spawned for IO operations. It should be set to at least the number of disks present. Its default value is 8.
- background.threads: This sets the number of threads running various background jobs, including deleting old log files. Its default value is 10, and you might not need to change it.
- queued.max.requests: This sets the size of the queue that holds pending messages while others are being processed by the IO threads. If the queue is full, the network threads will stop accepting any more messages. If your application has erratic loads, set queued.max.requests to a value at which it will not throttle.
- socket.send.buffer.bytes: This sets the SO_SNDBUF buffer size used for socket connections.
- socket.receive.buffer.bytes: This sets the SO_RCVBUF buffer size used for socket connections.
- socket.request.max.bytes: This sets the maximum size of request the server can receive. It should be smaller than the Java heap size you have set.
- num.partitions: This sets the default number of partitions for a topic created without an explicit partition count. You may need more than 1 partition for reliability and parallelism, but for a raw (even if not realistic :)) performance test, 1 is better.

These are no silver bullets :). You could test these changes on a test topic at 1,000/10,000/100,000 messages per second to see the difference between default and adjusted values, varying some of them. You may also need to configure your Java installation for maximum performance, including the settings for heap, socket size, and so on. *** Hope it helps. Pls vote/accept best answer
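One way to run such a test (the topic name and broker address are examples) is the producer perf tool bundled with the Kafka distribution:

```shell
# Send 100,000 1 KB messages as fast as possible (-1 = no throttle)
# and report throughput plus latency percentiles:
bin/kafka-producer-perf-test.sh \
  --topic testtopic \
  --num-records 100000 \
  --record-size 1000 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092
```

Run it once with the default broker settings and once with the adjusted values to compare.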
12-22-2016
03:00 AM
@ARUN I agree with what @Xi Wang responded and suggested. Please also see this doc link (assuming you use Ambari 2.1.1.0; change to the proper link if you use an earlier version): http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_Ambari_Users_Guide/content/_adding_hosts_to_a_cluster.html Also, keep in mind that only new data will use the new datanodes unless you execute the HDFS rebalance command, which redistributes the existing data across all datanodes. The default threshold is 10 (percent), but you can change it to your desired threshold. You may want to execute it during off-peak hours.
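The rebalance can be run from the command line (the threshold shown is the default, in percent):

```shell
# Rebalance HDFS blocks across datanodes; a threshold of 10 means each
# datanode's disk utilization should end up within 10% of the cluster
# average. Lower values rebalance more aggressively but take longer.
hdfs balancer -threshold 10
```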
12-22-2016
02:27 AM
3 Kudos
@vpemawat If you don't change anything in your process and logging approach (e.g. separating workloads so they do not compete for the same disk IOPS, timing, etc.), the only option left is SSDs, which will increase IOPS significantly. Even then, it is good to separate the workloads to avoid contention. One of your challenges is driven by the quite high number of files written. If you used a tool like NiFi (or at least Flume) to ingest the logs, write a smaller number of output files, and spread those log folders across dedicated drives, you could see some improvement. There is no magic bullet.
12-21-2016
07:28 PM
@Anitha R Support for Windows has been deprecated starting with HDP 2.4. The response was true for versions before 2.4. There are no plans to support Windows.