Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1543 | 07-09-2019 12:53 AM
 | 9305 | 06-23-2019 08:37 PM
 | 8055 | 06-18-2019 11:28 PM
 | 8681 | 05-23-2019 08:46 PM
 | 3477 | 05-20-2019 01:14 AM
10-22-2018
12:15 AM
Should be doable in Spark using the CSV and Avro reader/writer. Your header is quite odd, with quoting characters surrounding its column names, so it cannot be understood directly ('"' is an illegal character for an Avro field name). We can have the Spark CSV reader ignore that line as a comment, since no other line should start with a '"' character. Your data is expressed as quoted values, with the quote character being '|'. Something like the below can achieve the conversion on CDH5:

~> spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:4.0.0
> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
> // Manual schema declaration of the 'co' and 'id' column names and types
> val customSchema = StructType(Array(
StructField("co", StringType, true),
StructField("id", IntegerType, true)))
> val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
> df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
> // Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
10-08-2018
07:19 PM
1 Kudo
The Balancer command under CM -> HDFS -> Actions is ad-hoc. There is no schedule it runs on, so you'll need to invoke it manually to trigger the HDFS Balancer work. If you'd like to set up a frequency, you can use the CM API to trigger it via crontab/etc.
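As a sketch of the crontab approach, the snippet below builds the CM API request that would trigger the rebalance. The host, cluster and service names, API version, credentials, and the 'hdfsRebalance' command name are all assumptions here; check the API reference for your CM version before relying on them.

```python
# Sketch: build (but do not send) a Cloudera Manager API request to trigger
# the HDFS Balancer. All names below are placeholders -- verify the exact
# command name and API version against your CM's API documentation.
import base64
import urllib.request

def balancer_request(cm_host, cluster, service, user, password):
    """Return a prepared POST request for the (assumed) hdfsRebalance command."""
    url = (f"http://{cm_host}:7180/api/v19/clusters/{cluster}"
           f"/services/{service}/commands/hdfsRebalance")
    req = urllib.request.Request(url, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = balancer_request("cm.example.com", "Cluster1", "hdfs", "admin", "admin")
print(req.full_url)
# To actually fire it: urllib.request.urlopen(req)
# A crontab entry could then run this script on whatever schedule you need.
```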
09-27-2018
12:39 AM
1 Kudo
I'm not aware of an option to disable use of .archive, but you should certainly not be running a service with different minor versions on its hosts.
09-25-2018
03:58 PM
Data balancing uses the relative measure of percentage for heterogeneous nodes so that it scales based on actual capacities. Are you looking to balance by byte count instead? That doesn't sound like a good idea for such a wide difference in space. Could you help further explain your goal here?
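To illustrate the percentage-based measure, here is a minimal sketch (not the Balancer's actual implementation) of how "balanced" can be defined as every node's utilization sitting within a threshold of the cluster-wide average, regardless of raw capacity:

```python
# Sketch of percentage-based balancing over heterogeneous nodes: a node is
# "out of balance" when its utilization deviates from the cluster average
# utilization by more than a threshold (in percentage points).
def over_threshold(nodes, threshold=10.0):
    """nodes: dict of name -> (used_bytes, capacity_bytes).
    Returns the names whose utilization %% deviates from the cluster-wide
    average by more than `threshold` percentage points."""
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    avg = 100.0 * total_used / total_cap
    out = []
    for name, (used, cap) in nodes.items():
        util = 100.0 * used / cap
        if abs(util - avg) > threshold:
            out.append(name)
    return out

# Two nodes with very different capacities, both ~50% full: nothing to move,
# even though their absolute free space differs by tens of terabytes.
nodes = {"small": (5 * 10**12, 10 * 10**12),    # 10 TB node, 50% used
         "big":   (50 * 10**12, 100 * 10**12)}  # 100 TB node, 50% used
print(over_threshold(nodes))  # -> []
```

A byte-count rule would flag these nodes as wildly unbalanced; the percentage rule correctly leaves them alone.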
09-24-2018
07:48 PM
1 Kudo
> tools.DiskBalancer: java.lang.IllegalArgumentException: Unable to find the specified node. lrcdhdn009

Please always use the fully qualified domain name of the target DataNode, e.g. ip-10-16-113-100.ec2.internal, lrcdhdn009.company.com, etc.
09-24-2018
06:06 AM
1 Kudo
There is no way to specify a raw scan when creating a remote scanner via the HBase REST API. The backing server implementation does not carry arguments that toggle raw scans, but it could be added as a new feature. Please file a feature request at https://issues.apache.org/jira/browse/HBASE for this. The existing service-side implementation that builds the scanner on the REST server is at (upstream) https://github.com/apache/hbase/blob/master/hbase-rest/src/main/java/org/apache/hadoop/hbase/rest/ScannerResultGenerator.java#L74-L105
09-14-2018
03:33 AM
2 Kudos
Glad you were able to resolve this! Since you are using Cloudera Manager, you can also perform these fixes (creating the necessary system directories) via the UI:

- YARN - Actions - Create Job History Directory
- YARN - Actions - Create NodeManager Remote Application Log Directory

These would create the directories for you on HDFS with the exact required permission setup if they do not pre-exist.
09-13-2018
05:41 PM
1 Kudo
Start by looking for FATAL/ERROR logs in each of those roles (ResourceManager first perhaps). You should see a potential cause in the logs right before the crash time. The logs are typically under /var/log/hadoop-yarn/ if you are using CDH. If you have trouble interpreting the logs once you've located them, please share them here (via pastebin or such if they are large).
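As an illustration, a quick filter like the below (a sketch, with a made-up log excerpt) is all it takes to pull out the FATAL/ERROR lines once you have located the file:

```python
# Sketch: extract FATAL/ERROR lines from a role log. The sample text below
# is invented for illustration; real logs live under /var/log/hadoop-yarn/
# on CDH.
import re

def crash_clues(log_text, levels=("FATAL", "ERROR")):
    """Return the log lines whose level field matches one of `levels`."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(levels))
    return [line for line in log_text.splitlines() if pattern.search(line)]

sample = """\
2018-09-13 17:02:11,003 INFO  org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2018-09-13 17:02:12,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent
2018-09-13 17:02:12,119 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting"""

for line in crash_clues(sample):
    print(line)
```

The lines immediately before the last FATAL entry are usually where the actual cause is stated.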
09-12-2018
03:46 PM
The following pattern is often seen when running DR-like HDFS DistCp jobs on secure clusters:

1. Define an HDFS admin group in your user identity backend (let's call it 'hdfsadmin').
2. Add qualified (strictly administrative) users to the new 'hdfsadmin' group, and ensure all hosts in the cluster show the new group when running an 'id username' command.
3. On both clusters, alter dfs.permissions.supergroup via the HDFS - Configuration - "Superuser Group" field in CM to use "hdfsadmin", which allows members of this group to act as the HDFS superuser (equivalent to the 'hdfs' user when it comes to filesystem access activities).
4. Run DistCp as any user who has been granted membership of the 'hdfsadmin' group.
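The "Superuser Group" field in step 3 corresponds to the dfs.permissions.supergroup property in hdfs-site.xml; outside of CM the equivalent setting (with the example 'hdfsadmin' group name from above) would look like:

```xml
<property>
  <name>dfs.permissions.supergroup</name>
  <value>hdfsadmin</value>
</property>
```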
09-12-2018
04:58 AM
2 Kudos
In that case the balancer can be run later, but the compaction may still help with keeping the HBase application request latency low, if that is an immediate concern in the cluster.