Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1543 | 07-09-2019 12:53 AM
 | 9305 | 06-23-2019 08:37 PM
 | 8055 | 06-18-2019 11:28 PM
 | 8681 | 05-23-2019 08:46 PM
 | 3477 | 05-20-2019 01:14 AM
10-22-2018
12:15 AM
Should be doable in Spark using the CSV and Avro reader/writer. Your header is quite odd, with quoting characters surrounding its column names, so it cannot be understood directly ('"' is an illegal character for an Avro field name). We can have the Spark CSV reader ignore that line as a comment, since no other line should start with a '"' character. Your data is expressed as quoted values, with the quote character being '|'. Something like the below can achieve the conversion on CDH5:

~> spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:4.0.0
> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
> // Manual schema declaration of the 'co' and 'id' column names and types
> val customSchema = StructType(Array(
StructField("co", StringType, true),
StructField("id", IntegerType, true)))
> val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
> df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
> // Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
10-08-2018
07:19 PM
1 Kudo
The Balancer command under CM -> HDFS -> Actions is ad-hoc. There is no schedule it runs on, so you'll need to invoke it manually to trigger the HDFS Balancer work. If you'd like to set up a frequency, you can use the CM API to trigger it via crontab/etc.
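As a sketch of the crontab approach, the snippet below builds the CM API request that would trigger the rebalance. The host, cluster and service names, API version, credentials, and the 'hdfsRebalance' command name are all assumptions here; check the API reference for your CM version before relying on them.

```python
# Sketch: build (but do not send) a Cloudera Manager API request to trigger
# the HDFS Balancer. All names below are placeholders -- verify the exact
# command name and API version against your CM's API documentation.
import base64
import urllib.request

def balancer_request(cm_host, cluster, service, user, password):
    """Return a prepared POST request for the (assumed) hdfsRebalance command."""
    url = (f"http://{cm_host}:7180/api/v19/clusters/{cluster}"
           f"/services/{service}/commands/hdfsRebalance")
    req = urllib.request.Request(url, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = balancer_request("cm.example.com", "Cluster1", "hdfs", "admin", "admin")
print(req.full_url)
# To actually fire it: urllib.request.urlopen(req)
# A crontab entry could then run this script on whatever schedule you need.
```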
09-27-2018
12:39 AM
1 Kudo
I'm not aware of an option to disable use of .archive, but you should certainly not be running a service with different minor versions on its hosts.
09-25-2018
03:58 PM
Data balancing uses the relative measure of percentage for heterogeneous nodes so that it scales based on actual capacities. Are you looking to balance by byte count instead? That doesn't sound like a good idea for such a wide difference in space. Could you help further explain your goal here?
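To illustrate the percentage-based measure, here is a minimal sketch (not the Balancer's actual implementation) of how "balanced" can be defined as every node's utilization sitting within a threshold of the cluster-wide average, regardless of raw capacity:

```python
# Sketch of percentage-based balancing over heterogeneous nodes: a node is
# "out of balance" when its utilization deviates from the cluster average
# utilization by more than a threshold (in percentage points).
def over_threshold(nodes, threshold=10.0):
    """nodes: dict of name -> (used_bytes, capacity_bytes).
    Returns the names whose utilization %% deviates from the cluster-wide
    average by more than `threshold` percentage points."""
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    avg = 100.0 * total_used / total_cap
    out = []
    for name, (used, cap) in nodes.items():
        util = 100.0 * used / cap
        if abs(util - avg) > threshold:
            out.append(name)
    return out

# Two nodes with very different capacities, both ~50% full: nothing to move,
# even though their absolute free space differs by tens of terabytes.
nodes = {"small": (5 * 10**12, 10 * 10**12),    # 10 TB node, 50% used
         "big":   (50 * 10**12, 100 * 10**12)}  # 100 TB node, 50% used
print(over_threshold(nodes))  # -> []
```

A byte-count rule would flag these nodes as wildly unbalanced; the percentage rule correctly leaves them alone.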
09-24-2018
07:48 PM
1 Kudo
> tools.DiskBalancer: java.lang.IllegalArgumentException: Unable to find the specified node. lrcdhdn009

Please always use the fully qualified domain name of the target DataNode, e.g. ip-10-16-113-100.ec2.internal, lrcdhdn009.company.com, etc.
09-24-2018
06:06 AM
1 Kudo
There is no way to specify a raw scan when creating a remote scanner via the HBase REST API. The backing server implementation does not carry arguments that toggle raw scans, but it could be added as a new feature. Please file a feature request at https://issues.apache.org/jira/browse/HBASE for this. The existing service-side implementation that builds the scanner on the REST server is at (upstream) https://github.com/apache/hbase/blob/master/hbase-rest/src/main/java/org/apache/hadoop/hbase/rest/ScannerResultGenerator.java#L74-L105
09-14-2018
03:33 AM
2 Kudos
Glad you were able to resolve this! Since you are using Cloudera Manager, you can also perform these fixes (creating the necessary system directories) via the UI:

- YARN - Actions - Create Job History Directory
- YARN - Actions - Create NodeManager Remote Application Log Directory

These would create the directories for you on HDFS with the exact required permission setup if they do not pre-exist.
09-13-2018
05:41 PM
1 Kudo
Start by looking for FATAL/ERROR logs in each of those roles (ResourceManager first perhaps). You should see a potential cause in the logs right before the crash time. The logs are typically under /var/log/hadoop-yarn/ if you are using CDH. If you have trouble interpreting the logs once you've located them, please share them here (via pastebin or such if they are large).
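As an illustration, a quick filter like the below (a sketch, with a made-up log excerpt) is all it takes to pull out the FATAL/ERROR lines once you have located the file:

```python
# Sketch: extract FATAL/ERROR lines from a role log. The sample text below
# is invented for illustration; real logs live under /var/log/hadoop-yarn/
# on CDH.
import re

def crash_clues(log_text, levels=("FATAL", "ERROR")):
    """Return the log lines whose level field matches one of `levels`."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(levels))
    return [line for line in log_text.splitlines() if pattern.search(line)]

sample = """\
2018-09-13 17:02:11,003 INFO  org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2018-09-13 17:02:12,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent
2018-09-13 17:02:12,119 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting"""

for line in crash_clues(sample):
    print(line)
```

The lines immediately before the last FATAL entry are usually where the actual cause is stated.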
09-12-2018
03:46 PM
The following pattern is often seen when running DR-like HDFS DistCp jobs on secure clusters:

1. Define an HDFS admin group in your user identity backend (let's call it 'hdfsadmin').
2. Add qualified (strictly administrative) users to the new 'hdfsadmin' group, and ensure all hosts in the cluster show the new group when running an 'id username' command.
3. On both clusters, alter dfs.permissions.supergroup via the HDFS - Configuration - "Superuser Group" field in CM to use "hdfsadmin", which allows members of this group to act as the HDFS superuser (equivalent to the 'hdfs' user when it comes to filesystem access activities).
4. Run DistCp as any user who has been granted membership of the 'hdfsadmin' group.
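The "Superuser Group" field in step 3 corresponds to the dfs.permissions.supergroup property in hdfs-site.xml; outside of CM the equivalent setting (with the example 'hdfsadmin' group name from above) would look like:

```xml
<property>
  <name>dfs.permissions.supergroup</name>
  <value>hdfsadmin</value>
</property>
```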
09-12-2018
04:58 AM
2 Kudos
In that case the balancer can be run later, but the compaction may still help with keeping the HBase application request latency low, if that is an immediate concern in the cluster.