Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1543 | 07-09-2019 12:53 AM |
 | 9292 | 06-23-2019 08:37 PM |
 | 8050 | 06-18-2019 11:28 PM |
 | 8676 | 05-23-2019 08:46 PM |
 | 3473 | 05-20-2019 01:14 AM |
03-21-2019 05:48 PM
The search-based HDFS find tool has been removed and is superseded in C6 by the native "hdfs dfs -find" command, documented here: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
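For example, a minimal invocation (the path and name pattern here are just placeholders):

hdfs dfs -find /user/alice -name '*.csv' -print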
03-20-2019 04:12 AM
Flume scripts need to be run under a Bash shell environment, but it appears that you are trying to use PowerShell on Windows.
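Under Bash, the agent is typically started along these lines (the config file and agent name below are the placeholders used in the Flume User Guide example):

flume-ng agent --conf ./conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console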
03-17-2019 06:49 PM
1 Kudo
Thank you for confirming the details. Does the subject part of your klist output precisely match the username added to the HBase Superusers configuration? If your user is in a different realm than the cluster services, is that realm name listed under HDFS -> Configuration -> 'Trusted Realms'? Are all commands run as the superuser failing, or only some? Which HBase shell command/operation specifically leads to your quoted error?

As to adding groups, it can be done in the same field, except you need to add an '@' prefix to the name. For example, if your group is cluster_administrators, add it as '@cluster_administrators' in the HBase Superusers config. When using usernames, the '@' must not be specified. Both approaches should work, though.

P.S. If you'll be relying on groups, ensure all cluster hosts return consistent group lookup output for 'id <user>' commands, as the authorization check is distributed across the HBase cluster roles.
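For illustration (the user and group names below are hypothetical), the Superusers field would hold entries like these, and you can sanity-check group resolution on each host:

# Value entered in the HBase Superusers (hbase.superuser) field:
#   hbaseadmin,@cluster_administrators
# Verify every cluster host resolves the group membership identically:
id hbaseadmin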
03-13-2019 06:52 PM
Please create a new thread for distinct questions, instead of bumping an older, resolved thread.

As to your question, the error is clear, as is the documentation, quoted below:

"""
Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
"""

- https://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#spooling-directory-source

It appears that you can get around this by using ExecSource with a script or command that reads the files, but you'll have to sacrifice reliability. It may instead be worth investing in an approach that makes filenames unique (`uuidgen`-named softlinks in another folder, etc.), roughly as sketched below.
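A rough sketch of that softlink idea (untested; the directory paths here are hypothetical):

# Link each incoming file into the Flume spooling directory under a unique name,
# so a reused source filename never collides with an already-ingested one:
for f in /data/incoming/*.log; do
  ln -s "$f" "/data/flume-spool/$(uuidgen).log"
done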
03-13-2019 06:46 PM
Could you share your CDH version? I'm unable to reproduce this on recent CDH 6.x releases with a username added (without the '@' prefix) to the config you've mentioned. By 're-deployed', did you mean restarted? I had to restart the service for all hosts to see the new superuser config.
03-07-2019 09:08 PM
It appears that you're trying to use Sqoop's internal handling of DATE/TIMESTAMP data types, rather than the Strings the Oracle connector converts them to by default. Have you tried the option described at https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_java_sql_timestamp?

-Doraoop.timestamp.string=false

You shouldn't need to map the column types manually with this approach.
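For example, an import along these lines (the connection string, credentials, table, and target dir are placeholders, and this assumes the Oracle connector is being used via --direct):

sqoop import \
  -Doraoop.timestamp.string=false \
  --connect jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL \
  --username scott -P \
  --table SALES.ORDERS \
  --direct \
  --target-dir /user/etl/orders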
03-07-2019 06:23 PM
1 Kudo
You'll need to use lsof with a pid specifier (lsof -p PID). The PID must be that of your target RegionServer's java process (find it via 'ps aux | grep REGIONSERVER' or similar). In the output, you should be able to classify the items as network (sockets) / filesystem (files) / etc., and the interest is in whatever holds the highest share. For example, if you see a lot more sockets hanging around, check their state (CLOSE_WAIT, etc.). Or, if it is local filesystem files, investigate whether those files appear relevant. If you can pastebin your lsof result somewhere, I can take a look.
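A quick way to summarise it (assuming a single RegionServer on the host; lsof's TYPE column position can vary between versions):

# Find the RegionServer's java PID:
RS_PID=$(pgrep -f HRegionServer | head -1)
# Count open descriptors by type (REG, IPv4, FIFO, ...):
lsof -p "$RS_PID" | awk '{print $5}' | sort | uniq -c | sort -rn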
03-06-2019 11:42 PM
1 Kudo
MapReduce jobs can be submitted with ease, as mostly all they require is the correct config on the classpath (such as under src/main/resources for Maven projects). Spark/PySpark relies heavily on its script tooling to submit to a remote cluster, so it is a little more involved to achieve this. IntelliJ IDEA has a remote execution option in its run targets that can be configured to copy over the built jar and invoke an arbitrary command on an edge host. This can perhaps be combined with remote debugging to get an experience comparable to MR. Another option is to use a web-based editor such as CDSW.
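For reference, the command such a run target would ultimately invoke on the edge host could look like this (the class and jar names are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app-assembly.jar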
03-06-2019 07:30 PM
1 Kudo
> can we deploy the HttpFS role on more than one node? is it best practice?

Yes. The HttpFS role is an end-point for REST API access to HDFS, so you can deploy multiple instances and also consider load balancing them (you might need sticky sessions for data read paging).

> we can see that new logs are created on opt/hadoop/dfs/nn/current on the active namenode on node01 but no new files on the standby namenode on node02 - is it OK??

Yes, this is normal. The new edit logs are written redundantly to that local directory only by the active NameNode. At all times, the edits are primarily written into the JournalNode directories.
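As an illustration of the REST end-point (the hostname and username below are placeholders; 14000 is the default HttpFS port):

curl "http://httpfs-lb.example.com:14000/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice"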
03-06-2019 06:33 PM
It is not normal for the file descriptor limit to be exhausted, or even approached, unless you have an overload problem of some form. I'd recommend checking via 'lsof' what the major contributor to the FD count for your RegionServer process is - chances are it is avoidable (a bug, a flawed client, etc.). The number should be proportional to your total region store file count and the number of connecting clients. While the article at https://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ focuses on DN data transceiver threads in particular, the formulae at the end can be applied similarly to file descriptors in general.
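To see how close the process actually is to its limit (a Linux /proc sketch; assumes one RegionServer on the host):

RS_PID=$(pgrep -f HRegionServer | head -1)
# Current number of open descriptors:
ls /proc/"$RS_PID"/fd | wc -l
# The per-process soft/hard limits:
grep 'open files' /proc/"$RS_PID"/limits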