Member since: 07-31-2019
Posts: 346 | Kudos Received: 259 | Solutions: 62

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2937 | 08-22-2018 06:02 PM
 | 1691 | 03-26-2018 11:48 AM
 | 4215 | 03-15-2018 01:25 PM
 | 5085 | 03-01-2018 08:13 PM
 | 1433 | 02-20-2018 01:05 PM
08-01-2016
04:14 PM
Building out a cluster is a bit of a puzzle, and it gets especially hairy when the cluster is small, say fewer than 12 nodes. For better or worse, this is how I tend to generalize my approach:

1. There are master services (NameNode, ResourceManager) and there are client services (Spark, Hive). Think HA and redundancy for master services. It's best not to co-locate multiple master services, since that could create a single point of failure. Do not co-locate master and worker (HDFS) services.
2. Services such as Storm, HBase, and Solr do better on dedicated servers because of their high resource requirements. Not required, of course, but be cognizant of the trade-offs.
3. Spark is memory bound, Kafka is IO bound, and Storm is CPU bound. When looking at co-locating services, try to mix and match: don't put 2 memory-bound services on a single server.
4. I prefer to have a small, dedicated Ambari server. It seems cleaner to me, but your mileage may vary.
5. Try to use existing database infrastructure (e.g. Oracle) for all your metastores.
6. Never use a SAN.
7. Think about virtualizing master services, edge nodes, and dev environments.

This list is by no means exhaustive, and every architect will have additional details (e.g. placing the Spark History Server on the same server as HiveServer2). When it really comes down to it, you can plan for the worst and hope for the best. Your cluster WILL change over time...guaranteed. Of course, you could just deploy in Azure HDInsight and be done with it.... 😉
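The mix-and-match rule in point 3 can be sketched as a quick placement check. This is a minimal, hypothetical sketch: the service-to-resource-profile mapping and the node layouts are illustrative assumptions, not a real planner.

```shell
#!/bin/sh
# Flag nodes that are assigned two services with the same dominant resource
# (memory / IO / CPU). Mapping and layouts below are illustrative only.

profile() {
  case "$1" in
    spark|hbase) echo memory ;;
    kafka|hdfs)  echo io ;;
    storm)       echo cpu ;;
    *)           echo other ;;
  esac
}

check_node() {  # usage: check_node <node> <service> [<service> ...]
  node="$1"; shift
  seen=""
  for svc in "$@"; do
    p=$(profile "$svc")
    case " $seen " in
      *" $p "*) echo "$node: conflict - two ${p}-bound services" ;;
    esac
    seen="$seen $p"
  done
}

check_node worker1 spark kafka   # memory + IO: a fine mix, prints nothing
check_node worker2 spark hbase   # two memory-bound services: flagged
```

Running this prints a conflict only for worker2, matching the rule that two memory-bound services should not share a server.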
08-01-2016
03:23 PM
2 Kudos
Hi @Christopher Amatulli. I'd strongly advise against siloing your cluster into separate storage, processing, and services tiers. That goes against the idea of a cluster and moves you back into traditional application silos. Think of it instead as a single cluster with distributed, shared storage and processing. You may want to assign certain servers to certain services based on high-availability requirements or IO/CPU/memory profiles, but the cluster as a whole sits under a single operations and management service (Ambari) as well as a single resource layer (YARN). For small clusters you may have 2 master servers, an edge node, and n data nodes. You should review our cluster planning guide http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_cluster-planning-guide/content/ch_hardware-recommendations_chapter.html as well as any number of good design articles on HCC. Hope this helps.
07-31-2016
01:05 PM
1 Kudo
Hi @Jon Maestas. Executing the following should resolve the issue.

# Set file-max: max open files for a single user
sudo sh -c 'echo "* soft nofile 200000" >> /etc/security/limits.conf'
sudo sh -c 'echo "* hard nofile 200000" >> /etc/security/limits.conf'
sudo sh -c 'echo "200000" > /proc/sys/fs/file-max'
sudo sh -c 'echo "fs.file-max=200000" >> /etc/sysctl.conf'

# Set process-max
sudo sh -c 'echo "* soft nproc 8192" >> /etc/security/limits.conf'
sudo sh -c 'echo "* hard nproc 16384" >> /etc/security/limits.conf'
sudo sh -c 'echo "* soft nproc 16384" >> /etc/security/limits.d/90-nproc.conf'

# Per-service ulimit adjustments
sudo sh -c 'echo "hdfs - nofile 32768" >> /etc/security/limits.conf'
sudo sh -c 'echo "mapred - nofile 32768" >> /etc/security/limits.conf'
sudo sh -c 'echo "hbase - nofile 32768" >> /etc/security/limits.conf'
sudo sh -c 'echo "hdfs - nproc 32768" >> /etc/security/limits.conf'
sudo sh -c 'echo "mapred - nproc 32768" >> /etc/security/limits.conf'
sudo sh -c 'echo "hbase - nproc 32768" >> /etc/security/limits.conf'
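After applying the settings, a quick read-only check confirms what the system actually sees. Note that limits.conf changes only take effect for NEW login sessions, so log out and back in (or restart the affected service) before checking. This is a minimal sketch; the /proc path is Linux-specific.

```shell
# Read-only sanity checks (Linux) after applying the limits above.
ulimit -n                                                   # open-file limit for the current shell
[ -r /proc/sys/fs/file-max ] && cat /proc/sys/fs/file-max   # system-wide open-file ceiling
true
```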
07-12-2016
02:16 PM
1 Kudo
@Sunit Gupta Here are some good resources:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
07-11-2016
06:32 PM
@mrizvi could you please attach your code? Thanks.
07-11-2016
12:36 PM
1 Kudo
@Anurag Setia This doesn't directly answer your question, but I would advise against installing HDP on Windows; that option will be deprecated. If you'd like to get familiar with the HDP platform, I'd suggest installing the sandbox: http://hortonworks.com/products/sandbox/#downloads
06-29-2016
04:33 PM
3 Kudos
@Rahul Mishra You may want to start with this documentation: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_cluster-planning-guide/content/ch_hardware-recommendations_chapter.html. For small clusters like yours, where HA isn't a concern, you are basically dealing with only 2 types of nodes: master and worker nodes. I certainly wouldn't over-architect it. For an 8-node cluster you would have your Ambari server (which can also host your client services), 2 master nodes, and finally 5 worker nodes. In a homogeneous cluster like yours, where each node has limited resources, your primary concern is avoiding co-locating services that require the same type of resource. For example, it would be fine to have an in-memory service like Spark coexist with a more IO-intensive service, but not 2 memory-intensive services on the same node. In your case you'll just have to build it out, monitor it, and be aware that running certain operations together may cause performance issues. The good thing about HDP is its ability to scale, so you are never really "locked in" to a particular architecture.
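Before laying out services, it's worth confirming the cluster really is homogeneous. A minimal inventory sketch (Linux-only; run on each candidate node) might look like this:

```shell
# Report this node's CPU and RAM so you can compare nodes for homogeneity.
cores=$(nproc)
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
mem_gb=$((mem_kb / 1024 / 1024))
echo "$(hostname): ${cores} cores, ${mem_gb} GB RAM"
```

Collecting one line per node makes resource mismatches obvious at a glance.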
06-14-2016
07:52 PM
@charan tej At one point you could use Microsoft System Center to monitor HDP on Windows: https://cwiki.apache.org/confluence/display/AMBARI/Ambari+SCOM+Management+Pack. It looks like it hasn't been updated in a while. Because of the lack of Ambari and Kerberos support, we recommend not running HDP on Windows.
06-13-2016
12:53 PM
@sankar rao It indicates there is no server component. Many services consist of a server component and a client component. For example, MapReduce has a History Server component, Hive has HiveServer2, etc. Services such as Tez, Sqoop, and Pig do not have any server component. You can see this by clicking on the service and noting that only a client component is running. This is important when considering management and operations, especially stopping and starting services and understanding where each service runs. Many server components run on different nodes than the client components; clients run on all data nodes. Hope this helps.
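A quick way to see the distinction from a node's shell is to check whether a server daemon is actually running there. This is a hedged sketch: the component names passed in are just examples, and a daemon may simply live on a different node.

```shell
# Report whether a named daemon process is running on THIS node.
component_type() {
  if pgrep -f "$1" >/dev/null 2>&1; then
    echo "$1: server component running on this node"
  else
    echo "$1: no server process here (client-only, or it runs on another node)"
  fi
}

component_type HiveServer2   # Hive has a server component somewhere
component_type Pig           # Pig is client-only, so this never finds a daemon
```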
06-13-2016
11:38 AM
2 Kudos
@charan tej You're correct: HDP on Windows does not support Ambari. You could try using a third-party tool such as SQuirreL for Hive, or access Hive using SQL Server 2016 PolyBase; HiveServer2 accepts ODBC connections.