Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1543 | 07-09-2019 12:53 AM |
| | 9292 | 06-23-2019 08:37 PM |
| | 8050 | 06-18-2019 11:28 PM |
| | 8676 | 05-23-2019 08:46 PM |
| | 3473 | 05-20-2019 01:14 AM |
03-06-2019
05:20 PM
The issue appears to crop up when distributing certain configuration files in preparation for installing packages. Could you check, or share here, the failure reported in the log files under /tmp/scm_prepare_node.*/*?
03-06-2019
04:30 PM
Currently, Hive's connections to LDAP do not support the StartTLS extension [1]. It does make sense as a feature request, however. Could you log your request over at https://issues.apache.org/jira/projects/HIVE please?

[1] - https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/auth/ldap/LdapSearchFactory.java#L52-L62
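For reference, here is a rough JNDI sketch (not Hive code; the host and port are placeholders) of what a StartTLS negotiation looks like on an otherwise plain LDAP connection. The factory linked in [1] builds its directory context without issuing this extended operation, which is why the extension is currently unsupported.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.ldap.InitialLdapContext;
import javax.naming.ldap.LdapContext;
import javax.naming.ldap.StartTlsRequest;
import javax.naming.ldap.StartTlsResponse;

public class StartTlsSketch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.com:389"); // placeholder LDAP host/port

        LdapContext ctx = new InitialLdapContext(env, null);
        // The StartTLS extension: upgrade the existing plaintext connection to TLS.
        StartTlsResponse tls = (StartTlsResponse) ctx.extendedOperation(new StartTlsRequest());
        tls.negotiate();

        // ... bind and search over the now-encrypted connection ...

        tls.close();
        ctx.close();
    }
}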
03-05-2019
09:46 PM
1 Kudo
> Clear Cache
> This is the one I am not too sure what happens

It appears to clear the cached entries within the Hue frontend, so the metadata for the assist and views is loaded again from its source (Impala, etc.). I don't see it calling a refresh on the tables, but it is possible I missed some implicit action.

> Perform Incremental Metadata Update
> I assume this issues a refresh command for all tables within the current database that is being viewed? If no database is viewed, does it do it for everything?

This will compare the HMS listing against Impala's for the DB in context and run a specific "INVALIDATE METADATA [[db][.table]];" for each one missing in Impala. Yes, if no DB is in context, it equates to running a global "INVALIDATE METADATA;".

> Invalidate All Metadata and Rebuild Index

This runs a plain "INVALIDATE METADATA;"
03-05-2019
08:58 PM
The CMS + Parallel New (ParNew) collector does a decent job for mid-to-high heap sizes. It remains our default, although that may change with the introduction of newer JDK (LTS) support in the future. G1 is a great newer collector that has kept improving since its inception, and if you are going to use it, I advise running the latest JDK 8 version available. Most of the cases where we've had to recommend G1 over the stock defaults arose out of specific workload and heap pattern analysis. Are you facing long pauses with the CMS + ParNew collectors? What are the pauses caused by, according to your GC logging (allocation failure? too low a new-size? etc.)? It's worth measuring what's hurting the current heap collector configuration, as a simple switch will bring only limited improvements and may not truly/automatically solve your existing problems.
03-05-2019
08:35 PM
2 Kudos
You will be connecting to a remote cluster, so you need a machine that can run a browser and a terminal, with a stable internet connection. Check out the exam-specific FAQs here for more insight: https://www.cloudera.com/about/training/certification/faq.html#launch
02-26-2019
01:02 AM
Is Hive a good choice for the analysis you're attempting to perform on this dataset? Wouldn't something more expressive, such as Spark, be more useful? You cannot insert binaries as part of a simple INSERT statement. You'll need to create a writer that converts the raw forms into SequenceFiles, and then use LOAD DATA to place those files into the table.
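As a starting point, here is a minimal sketch of such a writer. The paths, key/value layout, and table name are assumptions to adapt; the Hive table definition and SerDe pairing are up to you.

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BinaryToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/blobs.seq"); // hypothetical output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Read one local binary file and append it as a key/value record.
            byte[] payload = Files.readAllBytes(Paths.get("sample.bin")); // hypothetical input file
            writer.append(new Text("sample-001"), new BytesWritable(payload));
        }
        // Afterwards, in Hive (hypothetical table name):
        //   LOAD DATA INPATH '/tmp/blobs.seq' INTO TABLE raw_blobs;
    }
}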
02-20-2019
01:58 AM
1 Kudo
The HBase shell currently prints out only the printable ASCII range of characters, and not Unicode, to make it easier to pass values around. In practice, HBase keys are often not designed to be readable and are binary forms (such as encoded integers, hashed values, etc.). That said, the HBase shell is a programmable JRuby console, so you can use the HBase Java APIs within it to get the desired output if you are going to rely on the shell for your scripting work. Here's a simple example:
hbase(main):013:0> config = org.apache.hadoop.hbase.HBaseConfiguration.create
=> #<Java::OrgApacheHadoopConf::Configuration:0x4a864d4d>
hbase(main):014:0> table = org.apache.hadoop.hbase.client.HTable.new(config, 't')
=> #<Java::OrgApacheHadoopHbaseClient::HTable:0x5e85c21b>
hbase(main):015:0> scanner = table.getScanner(Scan.new())
=> #<Java::OrgApacheHadoopHbaseClient::ClientScanner:0x5aa76ad2>
hbase(main):030:0> scanner.each do |row|
hbase(main):031:1* key = String.from_java_bytes(row.getRow())
hbase(main):032:1> puts "'#{key}'"
hbase(main):033:1> end
'我'
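If you outgrow scripting inside the shell, the same scan can be written as a small standalone Java program. This is only a sketch against the newer Connection/Table client API (rather than the HTable constructor used above), assuming the table is still named 't':

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanRowKeys {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result row : scanner) {
                // Decode the row key as UTF-8 rather than the shell's escaped ASCII form.
                System.out.println("'" + Bytes.toString(row.getRow()) + "'");
            }
        }
    }
}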
02-19-2019
06:25 AM
1 Kudo
> 1. What about connection-manager

When there's no specialized connection manager, Sqoop will use its generic/standard built-in one. There's a chance this may be adequate.

> 2. "driver" looks like jdbc:mysql://<host>:<port>/<db>, so "mysql" is the strange part of it. How do I replace it?

Your JDBC driver download should have come with a manual that shows how and what to use as its connection string. Searching over the web reveals http://support.sas.com/documentation/cdl/en/jdbcref/63713/HTML/default/viewer.htm#p1a7nrsg36gaf8n1j33tfwf7nsj7.htm which seems to suggest jdbc:sasiom://host:port (and other options).

> 3. Where must I put the drivers (jar files) to use them from Sqoop?

Place them directly under /var/lib/sqoop/, preferably on all hosts (if your Sqoop command may run from any host).
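Before wiring it into Sqoop, it can also help to verify the driver jar and connection string with a tiny standalone JDBC check. This is a sketch only: the driver class name, host, port, and credentials below are assumptions to replace with the values from your driver's manual.

import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        Class.forName("com.sas.rio.MVADriver"); // assumed IOM driver class; confirm in your manual
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sasiom://sas-host.example.com:8591", "user", "password")) { // placeholders
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}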
02-15-2019
07:13 PM
1 Kudo
The share-lib in Oozie is modular, so it only adds the necessary jars for each action type. The java action is the most generic of all action types, and therefore receives none of the other action types' dependencies (such as hive, pig, distcp, spark, etc.). The article you've linked to carries an answer to the question of 'how do I further include jars from action type X into my action type Y', which I've quoted below for convenience:

"""
For example, if you want all Pig actions in one of your Workflows to include the HCatalog ShareLib, you would add oozie.action.sharelib.for.pig=pig,hcatalog to your job.properties.
"""

So in your case, you may want to try adding:

oozie.action.sharelib.for.java=java,hive
02-15-2019
06:58 PM
2 Kudos
If you are dealing with unordered partitioning from a data source, you can end up creating a lot of files in parallel as the partitioning is attempted.

In HDFS, when a file (or more specifically, its block) is open, the DataNode performs a logical reservation of its target block size. So if your configured block size is 128 MiB, then every concurrently open block will deduct that value (logically) from the available remaining space the DataNode publishes to the NameNode. This reservation is done to help manage space and to guarantee a full block write to a client, so that a client that has begun writing its file never runs into an out-of-space exception mid-way.

Note: when the file is closed, only the actual length is persisted, and the reservation calculation is adjusted to reflect the reality of used and available space. However, while the file block remains open, it is always considered to be holding a full block size.

Further, the NameNode will only select a DataNode for a write if it can guarantee the full target block size. It will ignore any DataNodes it deems (based on their reported values and metrics) unfit for the requested write's parameters. Your error shows that the NameNode has stopped considering your only live DataNode when trying to allocate a new block request.

As an example, 70 GiB of available space will prove insufficient if there are more than 560 concurrently open files (70 GiB divided into 128 MiB block sizes). So the DataNode will 'appear full' at the point of ~560 open files, and will no longer serve as a valid target for further file requests. Per your description of the insert, this is likely: each of the 300 chunks of the dataset may still carry varied IDs, resulting in a lot of open files requested per parallel task, for insert into several different partitions.

You could 'hack' your way around this by reducing the requested block size within the query (e.g. set dfs.blocksize to 8 MiB), influencing the reservation calculation. However, this may not be a good idea for larger datasets as you scale, since it will drive up the file:block count and increase memory costs for the NameNode.

A better way to approach this would be to perform a pre-partitioned insert (sort first by partition and then insert in a partitioned manner). Hive, for example, provides this as an option: hive.optimize.sort.dynamic.partition [1], and if you use plain Spark or MapReduce then their default partitioning strategy does exactly this.

[1] - https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.sort.dynamic.partition
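To make the reservation arithmetic concrete, here is a back-of-the-envelope sketch; the numbers simply mirror the example above, and your actual remaining space and dfs.blocksize will differ:

public class BlockReservationEstimate {
    public static void main(String[] args) {
        // How many concurrently open files (blocks) a DataNode can accept before the
        // logical reservations exhaust its advertised remaining capacity.
        long remainingBytes = 70L * 1024 * 1024 * 1024;  // 70 GiB reported as available
        long blockSizeBytes = 128L * 1024 * 1024;        // dfs.blocksize = 128 MiB
        long maxOpenBlocks = remainingBytes / blockSizeBytes;
        System.out.println(maxOpenBlocks + " concurrently open blocks"); // prints 560
    }
}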