Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1968 | 07-09-2019 12:53 AM |
| | 11845 | 06-23-2019 08:37 PM |
| | 9132 | 06-18-2019 11:28 PM |
| | 10108 | 05-23-2019 08:46 PM |
| | 4566 | 05-20-2019 01:14 AM |
08-15-2016
07:28 PM
Thanks Harsh! Nice explanation!
08-15-2016
10:42 AM
1 Kudo
The cause of the crashes would be unrelated to this observation. I'd recommend starting a new topic and posting the logs of the service that crashes for you, specifically the earliest FATAL message it produces before it aborts, if there is one.
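A quick way to locate that earliest FATAL entry is to grep the role's log file; a minimal sketch below, with the log path purely illustrative (the actual path varies per service and host):

```sh
# Hypothetical log path; point this at the log of the role that is crashing.
# -m1 stops at the first FATAL hit, -B5/-A20 print surrounding context lines.
grep -n -m1 -B5 -A20 'FATAL' /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-example.log.out
```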
08-15-2016
09:15 AM
Thank you for a quick response. I will make the change and post if I have any issues. Thanks, Vivek
08-15-2016
08:16 AM
1 Kudo
These fixes will appear in CDH 5.8.2 onwards for the 5.8.x series. They were pulled as bug-fixes into that branch after 5.8.0 was cut. 5.7.2 has already been released, while 5.8.2 will arrive later; our release schedules run in parallel for each minor version line, which explains what you are observing.
08-15-2016
05:15 AM
Thank you. Useful insight and crystal-clear argumentation, as usual from you. In the meanwhile I had the chance to study a bit more, and in the end I came to a conclusion that matches your considerations, so I'm glad I apparently moved in the right direction. As a matter of fact I've looked at the open-source project at http://opentsdb.net , and generally speaking the approach it uses is the last one you explained. To give a practical example, in my case:
- A new record every week for the same Customer entity
- Therefore, column versioning is NOT used at all (as you suggested)
- A "speaking" row key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This kind of key is not monotonically increasing, because the "CUST_ID" part is stable, so the approach should also be good from a table-splitting perspective: as the table grows it will split up evenly, and every split will receive a share of the future inserts, balancing the load across machines
- The same set of columns for each record, holding the newly sampled value for that field for that week, e.g. "<Total progressive time used Service X>"
This is the approach I used in the end; it has nothing to do with my original idea of using versions, but it perfectly matches the last approach you described in your answer. Regarding the fixed values (e.g. "Name", "Surname"), I've decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I plan to modify this structure soon and move the fixed values to another table (Hive or HBase, I don't know yet), picking up the information only when I actually need it (for instance, during data processing I'll join the relevant customer master data into the relevant DataFrames). I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
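A minimal sketch of that layout in the HBase shell, assuming hypothetical table, column-family and column names; the "CUST_ID|YYYY-MM-DD" composite row key and single-version column family mirror the design described above:

```sh
# All names below are illustrative; weekly samples are keyed by "<CUST_ID>|<YYYY-MM-DD>".
hbase shell <<'EOF'
create 'customer_usage', {NAME => 'm', VERSIONS => 1}
put 'customer_usage', 'CUST0042|2016-08-08', 'm:svc_x_total_secs', '3600'
put 'customer_usage', 'CUST0042|2016-08-15', 'm:svc_x_total_secs', '4230'
# All weekly rows for one customer form a contiguous range scan:
scan 'customer_usage', {STARTROW => 'CUST0042|', STOPROW => 'CUST0042|~'}
EOF
```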
08-14-2016
04:01 PM
1 Kudo
Try stopping all roles and restarting your CM agent on just this host. Ensure you do not use the init.d script directly, and instead use the recommended service command approach: ~> service cloudera-scm-agent restart (i.e., never do this: "~> /etc/init.d/cloudera-scm-agent restart")
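The same two commands laid out as a block for convenience (the "~>" above is just the shell prompt):

```sh
# Recommended: restart the agent via the service wrapper
service cloudera-scm-agent restart

# Never call the init script directly:
#   /etc/init.d/cloudera-scm-agent restart
```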
08-14-2016
03:43 PM
The error "Temporary failure in name resolution" comes out of the DNS lookup sub-system on your OS, and likely indicates a fault of some sort when accessing one or more of your nameservers (defined in /etc/resolv.conf). If this is a repeating yet intermittent problem, I'd recommend contacting the DNS maintainers to find out if there are maintenance events or other downtime related issues ongoing with their servers. You can also check your /var/log/messages or "dmesg" contents for more clues about this lower-env trouble. The RM and other alerts you see coming out as a result of this failure is an avalanche effect. The agent polls metrics and states from the roles it runs, by contacting their webserver end-points. Since that's failing to resolve (its really a local address, shouldn't have to go through DNS if your /etc/nsswitch.conf is setup right) the alert gets flagged too. Its worth also running a local nameservice caching daemon (Such as nscd, etc.) to help cushion such effects to a certain degree and also to prevent overloading the DNS with too many queries which could also cause this potentially.
08-14-2016
11:35 AM
1 Kudo
The whole support around Parquet is documented at http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_parquet.html

Impala's support for Parquet is ahead of Hive's at the moment, and https://issues.apache.org/jira/browse/HIVE-8950 will help it catch up in future. In Hive you will still need to specify the columns manually, but you may alternatively create the table in Impala and then use it from Hive. Parquet's loader in Pig supports reading the schema off the file [1] [2], as does Spark's Parquet support [3]. None of the ecosystem approaches use an external schema file, as was the case with Avro storage.

[1] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoader.java#L90-L95
[2] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java#L94-L97
[3] - http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
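As a quick illustration of that schema inference, a minimal sketch assuming a hypothetical HDFS path; the Pig line presumes the parquet-pig loader jar is available to Pig (REGISTER it first otherwise):

```sh
# Hypothetical path; the schema is read from the Parquet file footer, no external schema file needed.

# Pig: ParquetLoader infers the schema from the file itself
pig -e "A = LOAD '/user/foo/data.parquet' USING parquet.pig.ParquetLoader(); DESCRIBE A;"

# Spark: the DataFrame reader likewise picks the schema up from the files
echo 'sqlContext.read.parquet("/user/foo/data.parquet").printSchema()' | spark-shell
```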
08-13-2016
08:35 PM
Thank you Harsh! Yes, rhbase 1.2.1 works with Thrift 0.8.0, and I tested it on Cloudera QuickStart 5.8.0.
08-13-2016
04:40 AM
1 Kudo
ExportSnapshot is an MR job, and as a result it will run across your NodeManager hosts. Providing its destination as a local filesystem URI, such as your file:///local_linux_fs_dir, will only work if that path is visible with the same consistent content across all your cluster hosts. You could achieve this by mounting the same NFS export on all hosts and then using a controlled ExportSnapshot parallelism so the writes do not overload it (limit the number of maps to something low enough). If that's not desirable, you can instead run the MR job in local mode, which would still be parallel but only to a limited degree, by passing -Dmapreduce.framework.name=local to ExportSnapshot before any other option.
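A minimal sketch of both variants, with the snapshot name, mount point and map count purely illustrative:

```sh
# Variant 1: export to an NFS mount that is visible at the same path on every cluster host,
# keeping the map count low so the share is not overloaded (names and paths are hypothetical).
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot my_table_snapshot \
  -copy-to file:///mnt/shared_nfs/hbase_backups \
  -mappers 4

# Variant 2: run the MR job in local mode on the submitting host;
# note the -D option must come before any other option.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -Dmapreduce.framework.name=local \
  -snapshot my_table_snapshot \
  -copy-to file:///local_linux_fs_dir
```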