Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26

About
Steve's a Hadoop committer mostly working on cloud integration
My Accepted Solutions
Title | Views | Posted
--- | --- | ---
 | 2626 | 02-27-2018 04:47 PM
 | 5103 | 03-03-2017 10:04 PM
 | 2655 | 02-16-2017 10:18 AM
 | 1401 | 01-20-2017 02:15 PM
 | 10565 | 01-20-2017 02:02 PM
02-17-2017
01:45 PM
It's not so much that hot swap is difficult, but that with a 3-node cluster, a copy of every block is kept on every node. A cold swap, where HDFS notices things are missing, is the traumatic one, as it cannot re-replicate all the blocks and will complain about under-replication. If you can do a hot swap in the OS and hardware, then you should stop the DN before doing that and start it afterwards. It will examine its directories and report all the blocks it has to the namenode. If the cluster has under-replicated blocks, the DN will be told to copy them from the other two datanodes, which will take a time dependent on the number of blocks which were on the swapped disk (and which haven't already been considered missing and re-replicated onto other disks on the same datanode). Maybe @Arpit Agarwal has some other/different advice. Arpit, presumably the new HDD will be unbalanced compared to the rest of the disks on the DN. What can be done about that in HDFS?
02-16-2017
10:18 AM
Unless the hardware supports hot swapping, you are going to have to shut the server down. If you do this quickly enough, HDFS won't overreact by trying to re-replicate data: it will give you 10-15 minutes to get the machine opened up, the new disk inserted and mounted, and then the services restarted. It's good to format the disk in advance, to save that bit of the process. Expect a lot of alert messages from the YARN side of things, which will notice within 60s, and from Spark, which will react even faster. Spark is likely to fail the job. It is probably safest to turn off the YARN service for that node (and HBase if it is running there), so the scheduler doesn't get upset; Spark will be told of the decommissioning event and not treat failures there as a problem. There's a more rigorous approach documented in "replacing disks on a data node"; it recommends a full HDFS node decommission. On a three-node cluster that's likely to be problematic: there won't be anywhere to re-replicate the third copy of every triple-replicated block.
02-16-2017
10:07 AM
Better to use different file extensions and patterns for each, e.g. .csv and .pipe, to make each format its own RDD. Spark parallelises based on the number of sources; .csv files aren't splittable, so the maximum number of executors you get depends on the file count. Tip: use the inferSchema option to scan through a reference CSV file, look at the output and then convert that to a hard-coded schema, as in the sketch below. The inference process involves a scan through the entire file, and is not something you want to repeat on a stable CSV format.
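A minimal sketch of that infer-then-hard-code workflow, assuming Spark 2.x; the paths and column names here are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-schema").getOrCreate()

// One-off: let Spark infer the schema from a small reference file
// (this triggers a full scan of that file).
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/ref.csv")            // hypothetical reference file
inferred.printSchema()             // copy this output into a hard-coded schema

// Steady state: declare the schema up front so no inference scan is needed.
// Column names/types are placeholders.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)))

val records = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/data/*.csv")              // hypothetical data pattern
```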
02-13-2017
07:27 PM
1 Kudo
Buckets are sharded in some way: the more load you put on the same bucket, the less bandwidth you apparently get from each request. This sharding is based on the object name: the more diverse your filenames, the better the bandwidth is likely to be. This is listed in the AWS docs, but they are deliberately vague about what happens, so that they have the freedom to change the sharding policy without complaints about backwards compatibility. You are also going to get different bandwidth numbers depending on the network capacity of your VM, and a faster read rate than write rate. When Netflix talk about their performance, assume multiple buckets and many, many readers.
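As a hypothetical illustration of the filename-diversity point (not something the S3A client does for you), one common trick is to prepend a short hash to each key so that names spread out instead of clustering under a single prefix:

```scala
import java.security.MessageDigest

// Hypothetical helper: prefix each object name with a couple of hash bytes so
// keys are spread across the bucket's keyspace rather than sharing one prefix.
def shardedKey(name: String): String = {
  val digest = MessageDigest.getInstance("MD5").digest(name.getBytes("UTF-8"))
  val prefix = digest.take(2).map("%02x".format(_)).mkString
  s"$prefix/$name"
}

// e.g. "3f7a/logs/2017-02-13/part-0000.csv" rather than "logs/2017-02-13/part-0000.csv"
println(shardedKey("logs/2017-02-13/part-0000.csv"))
```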
02-13-2017
07:21 PM
1 Kudo
There's also the documentation here: https://hortonworks.github.io/hdp-aws/s3-spark/
02-02-2017
03:50 PM
Looks to me like there are some conflicting versions of the servlet API on your classpath, and even though the Spark Jetty version is shaded, the servlet API classes which are being loaded aren't. I can see that you've already excluded the one in hadoop-common, so I'm not sure where else it can be coming from, except maybe HBase. Alternatively, no servlet API is being pulled in at all, and you need to get one on the classpath. Try running `mvn dependency:tree -Dverbose > target/dependencies.txt` and then examine the dependencies.txt file to see where it's being pulled in.
01-20-2017
02:15 PM
Without Kerberos, you don't have any authentication, hence no real security. Even if you encrypt the data, there's nothing to stop anyone talking to the cluster claiming to be the administrative user, and so able to do lots of damage to the system. Same for YARN: everything is executed in the cluster as the same user, so code submitted by user Alice, running on the same host as user Bob's, can use OS-level permissions and debuggers to get at all the secrets Bob's code has (including decryption keys). I would recommend embracing Kerberos as the first step to having a secure cluster.
01-20-2017
02:12 PM
...and if it was done from the command line, it shouldn't have been deleted outright: it should have been moved to the user's .Trash folder.
01-20-2017
02:08 PM
1 Kudo
There's also HAR files, "Hadoop archive files", which are a halfway house between a tar file and unexpanded files: everything lives in a single .har file, but the work can still be split up, as the archived contents are presented to the analytics code as independent files. See Har files.
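A small sketch of how archived files stay individually addressable, using the Hadoop FileSystem API; the archive path is made up, and har:/// assumes the archive lives on the default filesystem:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical archive: /user/steve/logs.har on the default filesystem.
// Addressing it through the har:// scheme lets normal FileSystem calls
// (and hence MapReduce/Spark input formats) see the archived files individually.
val harPath = new Path("har:///user/steve/logs.har")
val fs = FileSystem.get(harPath.toUri, new Configuration())

fs.listStatus(harPath).foreach(status => println(status.getPath))
```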
01-20-2017
02:02 PM
1 Kudo
Pretty low level. Looking into the source, it looks like the assertion is that:
assert(expectedAttrs.length == attrs.length)
What does that mean? I'm not entirely sure. Looking through Google shows up:

1. Stack Overflow: http://stackoverflow.com/questions/38740862/not-able-to-fetch-result-from-hive-transaction-enabled-table-through-spark-sql
2. SPARK-18355: Spark SQL fails to read data from an ORC hive table that has a new column added to it

If #2 is the cause, there's no obvious workaround right now. There are some details on #1 about how you might avoid the problem.
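If you want to check whether #2 applies, a diagnostic sketch along these lines may help; the table name "events" and warehouse path are hypothetical, and this assumes Spark 2.x with Hive support enabled:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Columns as the Hive metastore reports them for the (hypothetical) table...
val metastoreColumns = spark.catalog.listColumns("events").collect().map(_.name)
// ...versus the schema Spark reads from the ORC files themselves.
val orcColumns = spark.read.orc("/apps/hive/warehouse/events").schema.fieldNames

println(s"metastore: ${metastoreColumns.mkString(", ")}")
println(s"orc files: ${orcColumns.mkString(", ")}")
// A column present in the metastore but missing from the files is the SPARK-18355 pattern.
```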