Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1968 | 07-09-2019 12:53 AM |
| | 11845 | 06-23-2019 08:37 PM |
| | 9132 | 06-18-2019 11:28 PM |
| | 10108 | 05-23-2019 08:46 PM |
| | 4566 | 05-20-2019 01:14 AM |
08-15-2016
07:28 PM
Thanks Harsh! Nice explanation!
08-15-2016
10:42 AM
1 Kudo
The cause of the crashes would be unrelated to this observation. I'd recommend starting a new topic and posting the logs of the service that crashes for you, specifically the earliest FATAL message it produces before it aborts, if there is one.
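A quick way to locate that earliest FATAL entry is to grep the role's log file; a minimal sketch below, with the log path purely illustrative (the actual path varies per service and host):

```sh
# Hypothetical log path; point this at the log of the role that is crashing.
# -m1 stops at the first FATAL hit, -B5/-A20 print surrounding context lines.
grep -n -m1 -B5 -A20 'FATAL' /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-example.log.out
```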
08-15-2016
09:15 AM
Thank you for a quick response. I will make the change and post if I have any issues. Thanks, Vivek
08-15-2016
08:16 AM
1 Kudo
These fixes will appear in CDH 5.8.2 onwards for the 5.8.x series. They were pulled as bug-fixes into that branch after 5.8.0 was cut. 5.7.2 has already been released, while 5.8.2 will arrive later; our release schedules run in parallel for each minor version line, which explains what you are observing.
08-15-2016
05:15 AM
Thank you. Useful insight and crystal-clear argumentation, as usual from you. In the meanwhile I had the chance to study a bit more, and in the end I came to a conclusion that matches your considerations, so I'm glad I apparently moved in the right direction. As a matter of fact I've looked at the open-source project at http://opentsdb.net , and generally speaking the approach it uses is the last one you explained. To give a practical example, in my case:
- A new record every week for the same Customer entity
- Therefore, column versioning is NOT used at all (as you suggested)
- A "speaking" row key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This kind of key is not monotonically increasing, because the "CUST_ID" part is stable, so the approach should also be good from a table-splitting perspective: as the table grows it will split up evenly, and every split will receive a share of the future inserts, balancing the load across machines
- The same set of columns for each record, holding the newly sampled value for that field for that week, e.g. "<Total progressive time used Service X>"
This is the approach I used in the end; it has nothing to do with my original idea of using versions, but it perfectly matches the last approach you described in your answer. Regarding the fixed values (e.g. "Name", "Surname"), I've decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I plan to modify this structure soon and move the fixed values to another table (Hive or HBase, I don't know yet), picking up the information only when I actually need it (for instance, during data processing I'll join the relevant customer master data into the relevant DataFrames). I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
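A minimal sketch of that layout in the HBase shell, assuming hypothetical table, column-family and column names; the "CUST_ID|YYYY-MM-DD" composite row key and single-version column family mirror the design described above:

```sh
# All names below are illustrative; weekly samples are keyed by "<CUST_ID>|<YYYY-MM-DD>".
hbase shell <<'EOF'
create 'customer_usage', {NAME => 'm', VERSIONS => 1}
put 'customer_usage', 'CUST0042|2016-08-08', 'm:svc_x_total_secs', '3600'
put 'customer_usage', 'CUST0042|2016-08-15', 'm:svc_x_total_secs', '4230'
# All weekly rows for one customer form a contiguous range scan:
scan 'customer_usage', {STARTROW => 'CUST0042|', STOPROW => 'CUST0042|~'}
EOF
```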
08-14-2016
04:01 PM
1 Kudo
Try stopping all roles and restarting your CM agent on just this host. Ensure you do not use the init.d script directly, and instead use the recommended service command approach: ~> service cloudera-scm-agent restart (i.e., never do this: "~> /etc/init.d/cloudera-scm-agent restart")
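The same two commands laid out as a block for convenience (the "~>" above is just the shell prompt):

```sh
# Recommended: restart the agent via the service wrapper
service cloudera-scm-agent restart

# Never call the init script directly:
#   /etc/init.d/cloudera-scm-agent restart
```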
08-14-2016
03:43 PM
The error "Temporary failure in name resolution" comes out of the DNS lookup sub-system on your OS, and likely indicates a fault of some sort when accessing one or more of your nameservers (defined in /etc/resolv.conf). If this is a repeating yet intermittent problem, I'd recommend contacting the DNS maintainers to find out if there are maintenance events or other downtime related issues ongoing with their servers. You can also check your /var/log/messages or "dmesg" contents for more clues about this lower-env trouble. The RM and other alerts you see coming out as a result of this failure is an avalanche effect. The agent polls metrics and states from the roles it runs, by contacting their webserver end-points. Since that's failing to resolve (its really a local address, shouldn't have to go through DNS if your /etc/nsswitch.conf is setup right) the alert gets flagged too. Its worth also running a local nameservice caching daemon (Such as nscd, etc.) to help cushion such effects to a certain degree and also to prevent overloading the DNS with too many queries which could also cause this potentially.
08-14-2016
11:35 AM
1 Kudo
The whole support around Parquet is documented at http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_parquet.html

Impala's support for Parquet is ahead of Hive's at the moment, and https://issues.apache.org/jira/browse/HIVE-8950 will help it catch up in future. In Hive you will still need to specify the columns manually, but you may alternatively create the table in Impala and then use it from Hive. Parquet's loader in Pig supports reading the schema off the file [1] [2], as does Spark's Parquet support [3]. None of the ecosystem approaches use an external schema file, as was the case with Avro storage.

[1] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoader.java#L90-L95
[2] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java#L94-L97
[3] - http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
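As a quick illustration of that schema inference, a minimal sketch assuming a hypothetical HDFS path; the Pig line presumes the parquet-pig loader jar is available to Pig (REGISTER it first otherwise):

```sh
# Hypothetical path; the schema is read from the Parquet file footer, no external schema file needed.

# Pig: ParquetLoader infers the schema from the file itself
pig -e "A = LOAD '/user/foo/data.parquet' USING parquet.pig.ParquetLoader(); DESCRIBE A;"

# Spark: the DataFrame reader likewise picks the schema up from the files
echo 'sqlContext.read.parquet("/user/foo/data.parquet").printSchema()' | spark-shell
```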
08-13-2016
08:35 PM
Thank you Harsh! Yes, rhbase 1.2.1 works with Thrift 0.8.0, and I tested it on Cloudera QuickStart 5.8.0.
08-13-2016
04:40 AM
1 Kudo
ExportSnapshot is an MR job, and as a result it will run across your NodeManager hosts. Providing its destination as a local filesystem URI, such as your file:///local_linux_fs_dir, will only work if that path is visible with the same consistent content across all your cluster hosts. You could achieve this by mounting the same NFS export on all hosts and then using a controlled ExportSnapshot parallelism so the writes do not overload it (limit the number of maps to something low enough). If that's not desirable, you can instead run the MR job in local mode, which would still be parallel but only to a limited degree, by passing -Dmapreduce.framework.name=local to ExportSnapshot before any other option.
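A minimal sketch of both variants, with the snapshot name, mount point and map count purely illustrative:

```sh
# Variant 1: export to an NFS mount that is visible at the same path on every cluster host,
# keeping the map count low so the share is not overloaded (names and paths are hypothetical).
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot my_table_snapshot \
  -copy-to file:///mnt/shared_nfs/hbase_backups \
  -mappers 4

# Variant 2: run the MR job in local mode on the submitting host;
# note the -D option must come before any other option.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -Dmapreduce.framework.name=local \
  -snapshot my_table_snapshot \
  -copy-to file:///local_linux_fs_dir
```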