Please help me with minimal hardware requirements for our small cluster.
We have decided to build a very small, highly available production cluster for archiving purposes, based on Cloudera CDH 6.3.3 (community edition).
Planned storage size is about 10-20 TiB. The workload and components are:
- every 2 minutes, an ETL job loads about 500-1000 rows from an external Oracle database into local Parquet files (see the sketch after this list)
- occasional (very rare) analytic Hive queries that scan all of the Parquet files
- occasional (very rare) ad-hoc Spark jobs with the same goals as above
- Cloudera Manager
- YARN with MR2
- StreamSets parcel (installed as a Cloudera parcel)
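To give an idea of the load, the ETL step is roughly equivalent to the sketch below. This is only an illustration assuming a plain Spark JDBC read; the host, service, table, column and path names, credentials, and timestamp are placeholders, and it assumes the Oracle JDBC driver jar is available to Spark.

```python
from pyspark.sql import SparkSession

# Minimal sketch of the 2-minute ETL step: pull a small batch of rows from
# Oracle over JDBC and append it to a Parquet dataset on HDFS.
# All names, credentials and paths below are placeholders.
spark = (SparkSession.builder
         .appName("oracle-to-parquet-archive")
         .getOrCreate())

last_run_ts = "2020-01-01 00:00:00"  # in practice, taken from a checkpoint

batch = (spark.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")
         .option("dbtable",
                 f"(SELECT * FROM SRC_TABLE "
                 f"WHERE LOAD_TS > TIMESTAMP '{last_run_ts}') t")
         .option("user", "etl_user")
         .option("password", "***")
         .option("driver", "oracle.jdbc.OracleDriver")
         .load())

# Roughly 500-1000 rows per run, appended for later Hive/Spark queries.
batch.write.mode("append").parquet("hdfs:///data/archive/src_table/")
```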
We want to use only 3 hosts (no more), and the failure of any one host must not bring down the whole system.
So we plan to place all of the above components on every host.
In other words, each component will run on each host.
Is this reasonable and workable, or can someone advise a different layout?
We would also like to know whether we can place the HDFS NameNode and Cloudera Manager on only 2 hosts, or whether these components should also go on all three.
And finally, what are the minimal RAM, CPU, and disk storage requirements for each of these three hosts?
Big thanks in advance!
Thank you for the reply! I had read in the official Hive docs that ORC is the native format for Hive, so I preferred ORC and rebuilt my ETL from Parquet to ORC. But you have shown me that Cloudera's Hive is somewhat different from Hive in general, and I am very surprised by that ) OK, I will switch back to Parquet.

By the way, if I create an external table STORED AS ORC and run an INSERT from Hive, everything is fine: Cloudera's Hive creates 000000_0 ORC files and works with them very well. But ORC written from the outside world (StreamSets, Spark) is indeed not accepted by Hive. I have some problems with Hive + Parquet processing too (that is actually why I switched to ORC in the first place), but that is another question and another story 🙂 Thank you again! I spent a lot of time trying to understand what was wrong with Hive and ORC. So classic is classic, I will use Parquet for now.
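For reference, here is roughly what I mean by the two cases. This is a simplified sketch: the table, column and path names are made up, and the working case is shown only as the statements that were run in Hive itself.

```python
from pyspark.sql import SparkSession

# The working case was done inside Hive itself (e.g. via beeline), not by an
# external writer; shown here only as the statements that were run:
#
#   CREATE EXTERNAL TABLE archive_orc (id BIGINT, payload STRING)
#   STORED AS ORC
#   LOCATION 'hdfs:///data/archive_orc/';
#   INSERT INTO archive_orc VALUES (1, 'written by hive');  -- 000000_0 ORC files appear
#
# The failing case is ORC produced by an external writer (Spark, StreamSets)
# that Hive only reads. So the external side goes back to writing Parquet;
# schema and path below are placeholders:
spark = SparkSession.builder.appName("external-archive-writer").getOrCreate()

batch = spark.range(10).selectExpr("id", "cast(id as string) AS payload")
batch.write.mode("append").parquet("hdfs:///data/archive_parquet/")
```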
I cannot solve a compatibility problem between externally created ORC files and Cloudera's Hive. I have Cloudera Express 6.3.2 with Hive 2.1.1. In general it is strange: I downloaded the latest version of Cloudera, and it still ships the old Hive 2.1.1.

The case:
- Externally, I create an ORC file (I tried creating it both in local Spark and in the same Cloudera cluster through a MapReduce job, with the same result).
- I try to read this ORC in my Cloudera cluster, even just through orcfiledump.
- I get: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
- I downloaded the orc-tools-1.5.5-uber.jar utility locally to my computer, and also downloaded the problematic ORC file there.
- I ran: java -jar orc-tools-1.5.5-uber.jar meta msout2o12.orc
- The uber jar, with its own Hadoop inside, read the ORC just fine:
  Structure for msout2o12.orc
  File Version: 0.12 with ORC_135
  Rows: 242
  Compression: ZLIB
  Compression size: 262144

Without any table creation at all, Hive in Cloudera simply cannot read this ORC with its own utility. The problem started when I created an external table and a HiveQL query over the ORC produced this error; here I have just reduced the problem to a minimum: plain hive --orcfiledump cannot read the ORC. How can I make Cloudera read ORC files normally? What should I adjust in my Cloudera setup?
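For completeness, this is roughly how the problematic ORC was produced on the external side. It is only a simplified sketch: the real job and schema differ, the paths are illustrative, and the row count just matches the 242 rows reported by orc-tools.

```python
from pyspark.sql import SparkSession

# Rough reproduction of the external writer side (local Spark in my case;
# the real schema differs, paths are illustrative).
spark = SparkSession.builder.appName("orc-compat-repro").getOrCreate()

df = spark.range(242).selectExpr("id", "cast(id as string) AS payload")
df.coalesce(1).write.mode("overwrite").orc("/tmp/msout2o12_orc/")

# After copying the resulting part file to the cluster as msout2o12.orc:
#
#   hive --orcfiledump /tmp/msout2o12.orc
#     -> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
#        at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
#
# while the standalone tool reads the same file without problems:
#
#   java -jar orc-tools-1.5.5-uber.jar meta msout2o12.orc
```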