Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve is a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2626 | 02-27-2018 04:47 PM |
| | 5098 | 03-03-2017 10:04 PM |
| | 2655 | 02-16-2017 10:18 AM |
| | 1400 | 01-20-2017 02:15 PM |
| | 10563 | 01-20-2017 02:02 PM |
02-17-2017
01:45 PM
It's not so much that hotswap is difficult, but that with a 3-node cluster, a copy of every block is kept on every node. A cold swap, where HDFS notices things are missing, is the traumatic one, as it cannot re-replicate all the blocks and will keep complaining about under-replication.

If you can do a hot swap in the OS and hardware, then you should stop the DataNode before doing it, and start it afterwards. It will examine its directories and report all the blocks it has to the NameNode. If the cluster has under-replicated blocks, the DataNode will be told to copy them from the other two DataNodes, which will take time proportional to the number of blocks that were on the swapped disk (and which haven't already been considered missing and re-replicated onto other disks on the same DataNode).

Maybe @Arpit Agarwal has some other/different advice. Arpit: presumably the new HDD will be unbalanced compared to the rest of the disks on the DN. What can be done about that in HDFS?
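If you want to watch the re-replication catch up afterwards, here's a rough sketch of mine (not from the original question) that polls the NameNode's block counters from a Scala shell; the hdfs:// URI is made up and it assumes the DistributedFileSystem counter methods are present in your Hadoop version:

```scala
// Hedged sketch: poll the NameNode's view of missing/under-replicated blocks
// while it re-replicates after a disk swap. The hdfs:// URI is hypothetical,
// and the counter methods are assumed to exist in your Hadoop version.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.hdfs.DistributedFileSystem

val fs = FileSystem.get(new URI("hdfs://namenode:8020/"), new Configuration())
  .asInstanceOf[DistributedFileSystem]

var under = fs.getUnderReplicatedBlocksCount()
var missing = fs.getMissingBlocksCount()
while (under > 0 || missing > 0) {
  println(s"under-replicated=$under missing=$missing")
  Thread.sleep(30000)   // check every 30 seconds
  under = fs.getUnderReplicatedBlocksCount()
  missing = fs.getMissingBlocksCount()
}
println("all blocks fully replicated")
```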
02-16-2017
10:18 AM
Unless the hardware supports hot-swapping, you are going to have to shut the server down. If you do this quickly enough, HDFS won't overreact by trying to re-replicate data: it will give you 10-15 minutes to get the machine opened up, the new disk inserted and mounted, and the services restarted. It's good to format the disk in advance, to save that part of the process.

Expect a lot of alert messages from the YARN side of things, which will notice within 60 seconds, and from Spark, which will react even faster; Spark is likely to fail the job. It is probably safest to turn off the YARN service for that node (and HBase if it is running there), so the scheduler doesn't get upset; Spark will be told of the decommissioning event and won't treat failures there as a problem.

There's a more rigorous approach documented in "replacing disks on a data node"; it recommends a full HDFS node decommission. On a three-node cluster that's likely to be problematic: there won't be anywhere to re-replicate the third copy of every triple-replicated block.
02-16-2017
10:07 AM
Better to use different file extensions and patterns for each, e.g. .csv and .pipe, so each becomes its own RDD. Spark parallelises based on the number of sources; .csv files aren't splittable, so the maximum number of executors you get depends on the file count.

Tip: use the inferSchema option to scan through a reference CSV file, look at the output, and then convert that to a hard-coded schema. The inference process involves a scan through the entire file and is not something you want to repeat on a stable CSV format.
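As a rough sketch of that workflow with the DataFrame CSV reader (the paths and column names below are invented, not from your data):

```scala
// Rough sketch for spark-shell: infer the schema once from a reference file,
// then reuse a hard-coded schema for the real loads. Paths and columns are
// hypothetical.
import org.apache.spark.sql.types._

// One-off: let Spark scan a reference file and print what it inferred
val reference = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/reference/sample.csv")
reference.printSchema()

// Hard-code the result so production runs skip the full inference scan
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

// Separate loads per format, keyed off the file extension
val csvData = spark.read.schema(schema)
  .option("header", "true")
  .csv("/data/incoming/*.csv")

val pipeData = spark.read.schema(schema)
  .option("header", "true")
  .option("sep", "|")
  .csv("/data/incoming/*.pipe")
```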
02-13-2017
07:27 PM
1 Kudo
Buckets are sharded internally in some way: the more load you put on the same bucket, the less bandwidth each individual client apparently gets. This sharding is based on the filename: the more diverse your filenames, the better the bandwidth is likely to be. This is listed in the AWS docs, but they are deliberately vague about what actually happens, so that they have the freedom to change the sharding policy without complaints about backwards compatibility. You are also going to get different bandwidth numbers depending on the network capacity of your VM, and a faster read rate than write rate. When Netflix talk about their performance, assume multiple buckets and many, many readers.
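As a purely illustrative sketch of one way to diversify key names (this is my own example, not anything from the AWS docs): prepend a short hash of the object name so keys differ early in the string:

```scala
// Hedged sketch: prefix each object key with a short hash so names diverge
// early in the string, which is what the bucket partitioning appears to key
// off. The key layout is made up for illustration.
import java.security.MessageDigest

def diversifiedKey(originalKey: String): String = {
  val digest = MessageDigest.getInstance("MD5").digest(originalKey.getBytes("UTF-8"))
  val prefix = digest.take(2).map("%02x".format(_)).mkString
  s"$prefix/$originalKey"
}

// "logs/2017/02/13/part-0000" -> something like "9c1e/logs/2017/02/13/part-0000"
println(diversifiedKey("logs/2017/02/13/part-0000"))
```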
02-13-2017
07:21 PM
1 Kudo
There's also the documentation here: https://hortonworks.github.io/hdp-aws/s3-spark/
02-02-2017
03:50 PM
Looks to me like there are some conflicting versions of the servlet API on your classpath: even though the Jetty version in Spark is shaded, the servlet API classes being loaded aren't. I can see that you've already excluded the one in hadoop-common, so I'm not sure where else it can be coming from, except maybe HBase. Alternatively, no servlet API is being pulled in at all, and you need to get one onto the classpath.

Try running mvn dependency:tree -Dverbose > target/dependencies.txt and then examine the dependencies.txt file to see where it's being pulled in.
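If the tree doesn't make it obvious, a quick runtime check I'd add (not something from your stack trace) is to ask the JVM where it is actually loading the servlet API from:

```scala
// Hedged diagnostic: print which jar (if any) is supplying the servlet API.
// A NoClassDefFoundError here means no servlet API made it onto the classpath;
// an unexpected jar path points at the dependency to exclude.
val location = Option(classOf[javax.servlet.Servlet].getProtectionDomain.getCodeSource)
  .flatMap(cs => Option(cs.getLocation))
  .map(_.toString)
  .getOrElse("bootstrap classpath / unknown")
println(s"javax.servlet.Servlet loaded from: $location")
```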
01-20-2017
02:15 PM
Without Kerberos, you don't have any authentication, hence no real security. Even if you encrypt the data, there's nothing to stop anyone talking to the cluster while claiming to be the administrative user, and so doing lots of damage to the system. The same goes for YARN: everything is executed in the cluster as the same user, so code submitted by user Alice, running on the same host as user Bob's code, can use OS-level permissions and debuggers to get at all the secrets Bob's code has (including decryption keys).

I would recommend embracing Kerberos as the first step to having a secure cluster.
01-20-2017
02:12 PM
...and if it was done from the command line, it shouldn't have been deleted outright; it should have been moved to the user's .Trash folder.
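For reference, here's a rough sketch of the programmatic equivalent, using the Hadoop Trash API; the path is made up:

```scala
// Hedged sketch: the programmatic equivalent of what the command-line delete
// does, moving a path into the user's .Trash rather than deleting it outright.
// The path is hypothetical; fs.trash.interval must be > 0 on the cluster.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

val conf = new Configuration()
val path = new Path("/user/alice/reports/2017-01.csv")
val fs = path.getFileSystem(conf)

// Returns true if the file was moved under the user's .Trash/Current folder
val movedToTrash = Trash.moveToAppropriateTrash(fs, path, conf)
println(s"moved to trash: $movedToTrash")
```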
01-20-2017
02:08 PM
1 Kudo
There are also HAR files, "Hadoop archive files", which are a halfway house between a tar file and unexpanded files: they all live in a single .har archive, but analytics code can still split up the work and address the archived files independently. See the Hadoop Archives (HAR) documentation.
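As a small illustration (the archive path here is invented), the har:// filesystem lets you list and open the archived files as if they were still separate:

```scala
// Hedged sketch: list the contents of a Hadoop archive through the har://
// filesystem. The archive path is made up for illustration.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val archive = new Path("har:///user/steve/logs.har")
val harFs = FileSystem.get(archive.toUri, conf)

// Each archived file keeps its own path and length inside the archive
harFs.listStatus(archive).foreach { status =>
  println(s"${status.getPath} (${status.getLen} bytes)")
}
```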
01-20-2017
02:02 PM
1 Kudo
Pretty low level. Looking into the source, it looks like the assertion is:
assert(expectedAttrs.length == attrs.length)
What does that mean? I'm not entirely sure. Searching around turns up:
1. A Stack Overflow question: http://stackoverflow.com/questions/38740862/not-able-to-fetch-result-from-hive-transaction-enabled-table-through-spark-sql
2. SPARK-18355: Spark SQL fails to read data from an ORC Hive table that has a new column added to it
If #2 is the cause, there's no obvious workaround right now. There are some details on #1 about how the problem can perhaps be avoided.