About aervits

aervits · ‎02-15-2016

@Brenden Cobb since this is an after-effect, I'd open a ticket with support. In my experience, I would back up all configs on every node, then restart the agent on one node at a time, since the agent advertises the node's current config to the Ambari server. Once you confirm everything is restored for that node, move on to the next node.

aervits · ‎02-15-2016

@Pedro Gandola HDP ships with a 10 GB region size by default. Having more regions, on the order of 100-200 per RS, is recommended. If your region size is 30 GB and you have fewer regions than that, consider reducing it. How many nodes do you have? The balancer will handle data locality until a major compaction happens. I wouldn't mess with that. How often do you expect to apply config changes and do rolling restarts? To minimize the impact, you can increase the time between RS restarts; you can increase the replication factor, but that may be overkill; or you can enable read replicas so that read-only copies are available for better data availability.
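
To see where a cluster lands against the 100-200 regions per RS guideline, a rough back-of-the-envelope calculation helps. The sketch below is only an illustration: the function name, the 12 TB data size and the 10 RegionServers are made-up example values, and it assumes every region is filled to its maximum size and that regions are spread evenly, which a real cluster will not match exactly.

```python
import math


def regions_per_server(total_data_gb, region_size_gb, region_servers):
    """Estimate the average number of regions hosted by each RegionServer.

    Assumes (hypothetically) that every region is full and that regions
    are distributed evenly across the RegionServers.
    """
    if region_size_gb <= 0 or region_servers <= 0:
        raise ValueError("region size and server count must be positive")
    total_regions = math.ceil(total_data_gb / region_size_gb)
    return total_regions / region_servers


# Hypothetical cluster: 12 TB of table data on 10 RegionServers.
for size_gb in (10, 30):
    avg = regions_per_server(12_000, size_gb, 10)
    in_range = 100 <= avg <= 200
    print(f"{size_gb} GB regions: ~{avg:.0f} per RS, within 100-200: {in_range}")
```

With these example numbers, 10 GB regions give about 120 regions per RS, while 30 GB regions give about 40, which is the "fewer regions" case where reducing the region size is worth considering.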

aervits · ‎02-14-2016

@Paul Boal use this guide to work with Hive UDFs in Spark: http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/ And here's an example of invoking CSVSerde: https://community.hortonworks.com/content/kbentry/8313/apache-hive-csv-serde-example.html

aervits · ‎02-14-2016

@Andrea Squizzato it's a JVM program and Windows is supported. Here's the admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

aervits · ‎02-14-2016

@Mark Herring no response since 02/02.

aervits · ‎02-14-2016

@vshukla @Ram Sriharsha

aervits · ‎02-14-2016

@Pedro Gandola splitting occurs when your regions grow to the max size (hbase.hregion.max.filesize) as defined in your hbase-site.xml: http://hbase.apache.org/book.html#disable.splitting When you run a major compaction, data locality is restored. On a busy system, run major compactions in off-peak hours. The balancer distributes regions across the cluster and runs every 5 minutes by default; do not turn it off. You can implement your own balancer and replace the default StochasticLoadBalancer class, but that is not recommended unless you know what you're doing. Another option is to enable read replicas, so essentially you're duplicating data on a different region server. The secondary replicas are read-only and maximize your data availability. All in all, it's more art than science, and you need to experiment with many HBase properties to get the best result.

aervits · ‎02-14-2016

@Revathy Mourouguessane the spooling directory source is good when you want to watch a directory for new files. The syslog source listens on a port. So if your logs land in a directory, you would use spooling dir. To write to HDFS, you would use the HDFS sink. Once you master Flume, check out Apache NiFi.

aervits · ‎02-14-2016

@Jim Fratzke what does your datanode log say?

aervits · ‎02-13-2016

@Zaher Mahdhi your question has many answers. I suggest you read our cluster planning guide: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations_chapter.html

Member Since	‎10-01-2015 11:46 AM
Last Visited	‎08-15-2019 06:35 AM
Posts	3,933
Kudos received	1074

Cloudera Community

Re: Where can I get latest resource_management.c...

Re: How to Kerberize Flume?

Re: Load Hive Table form Pig Output File.

Re: HDP 2.6 Cluster Issues with Hive Metastore

Re: which HDP release will storm 1.1.0 be packaged...

Re: Ambari settings after database restore

Re: How to keep data locality after a HBase Region...

Re: When to Use Hive CSVSerde

Re: Import data from multiple servers

Re: How to use Encrypted zone vi NFS?

Re: PYSPARK with different python versions on yarn...

Re: How to keep data locality after a HBase Region...

Re: Data ingestion using flume - Visualize website...

Re: Cannot copy from local machine to VM datanode ...

Re: Best cluster configuration