Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Views | Posted
---|---
5121 | 09-21-2018 09:54 PM
6487 | 03-31-2018 03:59 AM
1967 | 03-31-2018 03:55 AM
2174 | 03-31-2018 03:31 AM
4808 | 03-27-2018 03:46 PM
07-21-2016
06:31 PM
2 Kudos
@Arunkumar Dhanakumar Since the replication factor is 3 by default, using Benjamin's approach for sizing, that works out to 10 GB per data node. That is very simplified and assumes big files. In addition to Benjamin's response, keep in mind that block size matters. The calculation above is a rough order of magnitude and does not account for data stored as many small files that do not fill their blocks. For example, if your block size is 256 MB and you stored 100 files of 1 KB each, that could cost you on the order of 100 x 256 MB in block allocations rather than a fraction of a single block. Compression also plays a role here; it depends on the type of compression used, etc. And if your data is stored as ORC, your 10 GB of data could be reduced to around 3 GB even without compression.
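To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Java. The numbers (10 GB of data, replication factor 3, an assumed ORC reduction to roughly 30%) are illustrative only, and the helper itself is not part of any Hadoop API:

public class StorageEstimate {
    public static void main(String[] args) {
        double rawDataGb = 10.0;           // logical size of the data set
        int replicationFactor = 3;         // HDFS default (dfs.replication)
        double orcReduction = 0.3;         // assumed ORC size relative to raw, e.g. 10 GB -> ~3 GB

        double storedPerCopyGb = rawDataGb * orcReduction;              // ~3 GB per copy
        double totalOnClusterGb = storedPerCopyGb * replicationFactor;  // ~9 GB across the cluster
        System.out.printf("Per copy: %.1f GB, total on cluster: %.1f GB%n",
                storedPerCopyGb, totalOnClusterGb);
    }
}

Small files and partially filled blocks would push the real footprint above this estimate, which is exactly the caveat above.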
07-20-2016
07:01 PM
1 Kudo
@Srinivasan Hariharan I am sure you are already aware that WebHDFS is based on HTTP operations such as GET, PUT, POST and DELETE. That is where you run into performance implications, because every call goes through the embedded HTTP server (Jetty). The FileSystem Shell, by contrast, is a Java application that uses the Java FileSystem class to provide file system operations, and it creates RPC connections for those operations.
Here are some numbers, though this is not a serious benchmarking study. I am not surprised to see the results for <10 MB files; that is what I would expect. You can run the test yourself. If your files are in that range, <10 MB should be fine. In my past experience, performance became a concern for large files, visible from about 1 GB and up. http://wittykeegan.blogspot.com/2013/10/webhdfs-vs-native-performance.html I'm checking the Hortonworks docs for a newer comparison and will post the link if I find one. If this is a reasonable response, please vote it or accept it as a best answer.
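To illustrate the two transports side by side, here is a rough Java sketch that reads the same file once via the WebHDFS REST endpoint (plain HTTP) and once via the org.apache.hadoop.fs.FileSystem API (RPC). The hostnames, ports, and file path are placeholders, and the WebHDFS port differs between versions:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadComparison {
    public static void main(String[] args) throws Exception {
        // 1. WebHDFS: an HTTP GET against the NameNode's REST endpoint (served by Jetty).
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/tmp/sample.txt?op=OPEN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // OPEN redirects the client to a DataNode
        try (InputStream in = conn.getInputStream()) {
            System.out.println("WebHDFS first byte: " + in.read());
        }

        // 2. Native FileSystem API: RPC to the NameNode, data streamed from DataNodes.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/tmp/sample.txt"))) {
            System.out.println("Native first byte: " + in.read());
        }
    }
}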
07-15-2016
10:55 PM
2 Kudos
@kishore sanchina Timing only, no special technical reasons. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC. https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions If this is what you were looking for, please vote and accept this as the best answer.
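To make that concrete, a table has to be stored as ORC (and, in this Hive generation, bucketed) before UPDATE/DELETE can be used against it. Here is a hedged sketch using the Hive JDBC driver; the connection URL, credentials, and table name are placeholders, not something from your environment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TransactionalOrcTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // ACID (UPDATE/DELETE) currently requires an ORC, bucketed, transactional table.
            stmt.execute(
                "CREATE TABLE demo_txn (id INT, name STRING) " +
                "CLUSTERED BY (id) INTO 4 BUCKETS " +
                "STORED AS ORC " +
                "TBLPROPERTIES ('transactional'='true')");
        }
    }
}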
07-15-2016
10:43 PM
@Leonid Zavadskiy You are dealing with this issue: https://issues.apache.org/jira/browse/HADOOP-3733 As a workaround, first set the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties; the URI can then simply be s3://mybucket/dest. Putting credentials on the command line is not very secure anyway.
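A minimal Java sketch of that workaround, assuming the old s3:// block filesystem from that Hadoop generation; the key values and bucket name are placeholders, and the same two properties can just as well go into core-site.xml:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3CredentialsWorkaround {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; in practice set these in core-site.xml, not in code.
        conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

        // With the keys in the configuration, the URI no longer has to embed them.
        try (FileSystem fs = FileSystem.get(new URI("s3://mybucket/"), conf)) {
            System.out.println(fs.exists(new Path("s3://mybucket/dest")));
        }
    }
}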
07-15-2016
01:27 AM
1 Kudo
@Rita McKissick Logging output is written to a target by using an appender. If no appenders are attached to a category nor to any of its ancestors, you will get the following message when trying to log:
log4j:No appenders could be found for category (some.category.name).
log4j:Please initialize the log4j system properly.
Log4j does not have a default logging target. It is the user's responsibility to ensure that all categories can inherit an appender. This can easily be achieved by attaching an appender to the root category. You can find info on how to configure the root logger (log4j.rootLogger) in the log4j documentation; basically, add something as simple as this at the beginning of the file:
log4j.rootLogger=debug, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=...
This should clear those WARN messages you get on startup (make sure you don't already have an appender named stdout; also be careful what level you give the root logger, because debug is very verbose and every library in your app will start writing to the console). In a nutshell, you are missing log4j.properties or log4j.xml from your classpath. You can also bypass this by calling BasicConfigurator.configure();, but beware that this will ONLY log to System.out and is not recommended. You should really use one of the files above and write to a log file. A very simple example of log4j.properties would be:
#Log to Console as STDOUT
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss}%-5p%c %3x-%m%n
#Log to file FILE
log4j.appender.file=org.apache.log4j.DailyRollingFileAppender
log4j.appender.file.File=logfile.log
log4j.appender.file.DatePattern='.'yyyy-MM-dd
log4j.appender.file.append=true
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss}%-5p%c %3x-%m%n
#RootLogger
log4j.rootLogger=INFO, stdout, file
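For completeness, a minimal Java class that would pick up the log4j.properties above from the classpath; the class name is just a placeholder:

import org.apache.log4j.Logger;

public class LoggingExample {
    // log4j finds log4j.properties (or log4j.xml) on the classpath automatically.
    private static final Logger LOG = Logger.getLogger(LoggingExample.class);

    public static void main(String[] args) {
        LOG.info("Application started");  // goes to the console and logfile.log per the config above
        LOG.debug("Not written, because the root logger level above is INFO");
    }
}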
07-14-2016
02:21 AM
Multi-Polygon example: SELECT st_astext(ST_MultiPolygon('multipolygon (((0 0, 0 1, 1 0, 0 0)), ((2 2, 2 3, 3 2, 2 2)))'))
from YourTable LIMIT 1;
07-13-2016
06:32 PM
5 Kudos
@Alpha3645 This could be an entire questionnaire. However, if I were an enterprise architect and needed to provide a 100,000 ft view number, assuming a basic data lake to support 25 TB with room to grow another 25 TB (data replication factor of 3), average workloads across several services (e.g. HDFS, Hive, HBase), 3 master nodes (16 cores, 128 GB RAM, 2 x 2 TB) plus 7 data nodes (16 cores, 256 GB RAM, 12 x 2 TB), and a 10 Gbps network, the cost for hardware would be anywhere between $60K and $100K. Add the cost of a Hortonworks Data subscription (check with a sales rep for a better number), and your budget (excluding labor) would be anywhere between $100K and $150K. Check cluster capacity planning here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_cluster-planning-guide/content/ch_hardware-recommendations_chapter.html More: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/index.html If this is the ballpark you were looking for, vote this answer and accept it as a best answer.
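As a quick sanity check on that sizing, here is a tiny illustrative calculation; the node count, disks, and target come from the estimate above, and the helper itself is only a sketch:

public class CapacityCheck {
    public static void main(String[] args) {
        int dataNodes = 7;
        int disksPerNode = 12;
        double tbPerDisk = 2.0;
        int replicationFactor = 3;
        double targetTb = 25.0 + 25.0; // current data plus expected growth

        double rawTb = dataNodes * disksPerNode * tbPerDisk;    // 168 TB raw
        double afterReplicationTb = rawTb / replicationFactor;  // ~56 TB of usable data capacity
        // Some of that still has to be reserved for temp space, OS, and logs.
        System.out.printf("Raw: %.0f TB, after replication: %.0f TB, target: %.0f TB%n",
                rawTb, afterReplicationTb, targetTb);
    }
}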
07-13-2016
03:04 PM
3 Kudos
@ripunjay godhani It depends on how much data you plan to write to and read from disk; compare that with what your disks can actually deliver and double it just to be safe. That is driven by your design. For example, a SATA 3 Gbps interface tops out around 300 MB/s in theory, but a single spinning drive will not come close to that under a real mixed workload; I would count on something like 30 MB/s of effective throughput per drive. If you need about 100 MB/s at peak and you double that, you need about 200 MB/s, which would be roughly 6-7 drives. This is very simplistic, because in a cluster there is a lot going on due to block replication. If your network is 1 Gbps, the network will cap you at about 100 MB/s, but you still need each server in the cluster to provide enough IOPS for various local operations. Vote the answer and accept it as the best answer using the arrow up next to the response.
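The drive-count arithmetic above as a tiny helper; the 30 MB/s per-drive figure is the rough assumption from the answer, not a measured value:

public class DriveCount {
    public static void main(String[] args) {
        double peakMbPerSec = 100.0;              // expected peak throughput
        double safetyFactor = 2.0;                // double it to be safe
        double effectiveMbPerSecPerDrive = 30.0;  // assumed effective throughput per SATA drive

        double requiredMbPerSec = peakMbPerSec * safetyFactor;                       // 200 MB/s
        int drives = (int) Math.ceil(requiredMbPerSec / effectiveMbPerSecPerDrive);  // ~7 drives
        System.out.println("Drives needed: " + drives);
    }
}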
07-13-2016
02:14 PM
2 Kudos
@ripunjay godhani Just knowing the amount of data (250 GB) is insufficient for capacity planning. You also need to know the ingest and output rates and the data processing requirements. Workloads, concurrency, expected response time, resiliency, and availability are also important factors; those determine what CPU, RAM, network, and disk I/O you need. Account for how much data will be read from disk versus from memory, based on your SLA and design. This is all the art of estimation. For the NameNode it is best to use reliable hardware, and RAID is a good option. Anyhow, it is usually good to begin small and gain experience by measuring actual workloads during a pilot project. We recommend starting with a relatively small pilot cluster, provisioned for a "balanced" workload. For pilot deployments, you can start with 1U/machine and use the following recommendations: two quad-core CPUs | 12 GB to 24 GB memory | four to six disk drives of 2 terabyte (TB) capacity. Even though you only have 250 GB, it is multiplied by a replication factor of 3, you need temporary space, and you need room to grow. Multiple spindles will also give you enough IOPS for disk operations. Don't think only in terms of "I need 250 GB of storage." The minimum network requirement is 1 GigE all-to-all, which can easily be achieved by connecting all of your nodes to a Gigabit Ethernet switch. To keep the spare socket available for adding more CPUs in the future, you can also consider using either a six-core or an eight-core CPU. For more, check the following references: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_cluster-planning-guide/content/balanced-workloads-deployments.1.html https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Sys_Admin_Guides/content/ch_clust_capacity.html https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Sys_Admin_Guides/content/ch_hbase_io.html If you find this response helpful, accept it as a best answer.
07-11-2016
07:31 PM
4 Kudos
@mrizvi Your SQL statement uses a reserved word, key. In general, it is good practice to wrap such identifiers in backticks, e.g. `key`, to avoid issues with reserved keywords or special characters. These are the backtick characters at the upper left of your keyboard. Could you rewrite your query to use at least `key` instead of just key? There may be a similar problem with the use of "$.key". Additionally, I would use '$.Foo' instead of "$.Foo", i.e. single quotes instead of double quotes, just as a good practice.
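For illustration only, here is how that quoting might look; the table name, column name, and the use of get_json_object are assumptions about the original query, not taken from it:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BacktickQuotingExample {
    public static void main(String[] args) throws Exception {
        // Reserved word quoted with backticks, JSON path in single quotes.
        String sql = "SELECT get_json_object(json_col, '$.key') AS `key` "
                   + "FROM json_table LIMIT 10";
        // Placeholder HiveServer2 connection details.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("key"));
            }
        }
    }
}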