Member since
09-26-2015
135
Posts
85
Kudos Received
26
Solutions
About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2666 | 02-27-2018 04:47 PM |
| | 5133 | 03-03-2017 10:04 PM |
| | 2710 | 02-16-2017 10:18 AM |
| | 1425 | 01-20-2017 02:15 PM |
| | 10627 | 01-20-2017 02:02 PM |
12-24-2015
10:56 AM
Hue has something behind the scenes called Livy, a little REST server that does the work. They haven't teased that out and made it standalone, which is a shame. There's actually something very interesting starting in the Apache incubator: IBM's Spark Kernel code (which will be renamed during the incubation process). This lets you wire up Jupyter directly, and also offers the ability to upload code callbacks into the Spark cluster itself. I think that's pretty nice and will be keeping an eye on it, though I don't know when it will be ready for broad use.
12-24-2015
10:52 AM
1 Kudo
That doc is a bit confusing: I read it myself and wasn't too sure. I've filed a JIRA on reviewing and updating it. Bearing in mind that the Python agent-side code is not something I know my way around, I think that comment about hostname:port is actually describing how site configurations can be built up. I believe that Python installation code running in a container can actually push out any quicklink values it wants. Client apps do have to be aware that (a) the data isn't there until the container is up and running, and (b) after failover the outdated entries will hang around until replaced.
12-23-2015
12:35 PM
I thought that on a secure cluster Zeppelin can only make queries as the user hosting the web UI, though I'm not sure about that. Spark SQL doesn't do user authentication in general, and not via the Thrift server (JDBC and especially ODBC) either. Nor does it do column-level access control as Hive does. It just goes straight at the files themselves, so it's not that locked down.
12-19-2015
05:18 PM
1 Kudo
If it's a networking problem, just download the JAR file yourself and use the --jars option to add it to the classpath. It looks like it lives under https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/
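As a rough sketch of that workaround: fetch the JAR once on a machine that can reach Maven Central, then pass the local path to --jars. The file name below assumes the standard Maven `<artifactId>-<version>.jar` naming, and `download_jar` and `spark_submit_command` are hypothetical helper names used only for illustration.

```python
import shutil
import urllib.request
from pathlib import Path

# Directory URL quoted in the post.
REPO_DIR = "https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/"
# Assumption: standard Maven file naming, <artifactId>-<version>.jar.
JAR_NAME = "spark-csv_2.10-1.1.0.jar"


def download_jar(dest_dir="."):
    """Fetch the JAR into dest_dir unless it is already there; return its path."""
    dest = Path(dest_dir) / JAR_NAME
    if not dest.exists():
        with urllib.request.urlopen(REPO_DIR + JAR_NAME) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
    return dest


def spark_submit_command(app, jars):
    """Build a spark-submit argument list; --jars takes a comma-separated list."""
    return ["spark-submit", "--jars", ",".join(str(j) for j in jars), app]


# Example (my_job.py is a placeholder application name):
#   jar = download_jar("/tmp")
#   subprocess.run(spark_submit_command("my_job.py", [jar]), check=True)
```

A manually downloaded JAR does not bring its dependencies along, so any libraries it needs would have to be added to the same comma-separated list.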
12-18-2015
08:28 PM
1 Kudo
It won't; Java doesn't look at the OS proxy settings. (There are a couple of exceptions, but they don't usually surface in a world where applets are disabled.)
12-18-2015
08:27 PM
3 Kudos
If you use the s3a:// client, you can set the fs.s3a.proxy settings (host, port, username, password, domain, workstation) to get through. See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
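A minimal sketch of how those settings could be written out, assuming each one is named by appending the suffix to `fs.s3a.proxy.` as described in the linked hadoop-aws documentation. The helper names and the `proxy.example.com` / `8080` values are placeholders, not real endpoints.

```python
from xml.etree import ElementTree as ET

# Suffixes listed in the post; each is appended to "fs.s3a.proxy.".
PROXY_KEYS = ("host", "port", "username", "password", "domain", "workstation")


def s3a_proxy_properties(**settings):
    """Map keyword arguments such as host=... to fs.s3a.proxy.* property names."""
    unknown = set(settings) - set(PROXY_KEYS)
    if unknown:
        raise ValueError("unknown proxy settings: %s" % sorted(unknown))
    return {"fs.s3a.proxy." + k: str(settings[k]) for k in PROXY_KEYS if k in settings}


def to_site_xml(props):
    """Render the properties as a Hadoop-style <configuration> fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder proxy host and port, for illustration only.
props = s3a_proxy_properties(host="proxy.example.com", port=8080)
print(to_site_xml(props))
```

The resulting properties would go into the client's Hadoop configuration; only the settings your proxy actually needs have to be present.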
12-17-2015
12:05 PM
I'd put that down to DNS being in a mess, or to you not having a principal of the form service/host@REALM for the host in question. See: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
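One way to check both suspects is to see what DNS says about the host and then compare the principal you would expect against what is in your keytab or KDC. This is only a sketch: the function names are hypothetical, and the service, host and realm values in the comments are placeholders.

```python
import socket


def expected_principal(service, host, realm):
    """Format a principal of the form service/host@REALM."""
    return "%s/%s@%s" % (service, host, realm)


def dns_round_trip(host):
    """Forward-resolve host, then reverse-resolve the address it maps to.

    Returns (address, reverse_name). If reverse_name is not the name you
    started with, DNS is a likely culprit.
    """
    address = socket.gethostbyname(host)
    reverse_name = socket.gethostbyaddr(address)[0]
    return address, reverse_name


# Example with placeholder values:
#   address, name = dns_round_trip("node1.example.com")
#   print(expected_principal("myservice", name, "EXAMPLE.COM"))
```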
12-17-2015
11:53 AM
I find my ZK logs end up under /var/log/zookeeper, at least with the HD installations. Make sure the ZK account has permission to write to the log directory; if it doesn't, you won't see any logs.
12-17-2015
11:49 AM
9 Kudos
The hdfs fsck operation doesn't check blocks for corruption; that takes too long. It looks at the directory structures alone.

Blocks are checked for corruption whenever they are read: little CRC checksum files are created for parts of a block and validated on read() operations. If you work with the file:// filesystem you can see these same files in your local FS. If a block is found to be corrupt on a read, the DFS client reports this to the namenode and asks for another copy of the block, which is used instead. As Chris said, the namenode then schedules the uncorrupted block for re-replication, as if it were under-replicated. The corrupted block doesn't get deleted until that replication succeeds. Why not? If all copies are corrupt, then maybe you can salvage something from all the corrupt copies of the block.

Datanodes scan all files in the background; they just do it fairly slowly by default so that applications don't suffer. The scan ensures that corrupted blocks are usually found before programs read them, and that problems with "cold" data are found at all. It's designed to avoid the problem of all replicas getting corrupted and you not noticing until it's too late to fix. Look in the HDFS XML description for the details on the two options you need to adjust:

- dfs.datanode.scan.period.hours
- dfs.block.scanner.volume.bytes.per.second

How disks fail and data gets corrupted is a fascinating problem. Here are some links if you really want to learn more about it:

- Did you really want that data (an old presentation of mine)
- Failure Trends in a Large Disk Drive Population (Google)
- A Large-Scale Study of Flash Memory Failures in the Field (a recent Facebook paper on flash failures; it shows they are less common than you'd fear)

I'd also recommend you look at some of the work on memory corruption; that's enough to make you think that modern laptops and desktops should be using ECC RAM.
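The read-time check can be illustrated with a toy model: one checksum per fixed-size chunk, verified on every read, with a fallback to another replica when a mismatch is found. This is not HDFS code; HDFS has its own checksum type and metadata format, and the 512-byte chunk size, `zlib.crc32` and all function names here are assumptions made for illustration.

```python
import zlib

# Assumption for illustration: one checksum per 512-byte chunk.
BYTES_PER_CHECKSUM = 512


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return one CRC32 per chunk, as would be stored beside the data."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def first_corrupt_chunk(data, checksums, chunk_size=BYTES_PER_CHECKSUM):
    """Re-check every chunk; return the index of the first mismatch, or None."""
    for index, expected in enumerate(checksums):
        chunk = data[index * chunk_size:(index + 1) * chunk_size]
        if zlib.crc32(chunk) != expected:
            return index
    return None


def read_with_fallback(replicas, checksums):
    """Try replicas in order; return (data, bad_replica_indexes)."""
    bad = []
    for i, data in enumerate(replicas):
        if first_corrupt_chunk(data, checksums) is None:
            return data, bad
        bad.append(i)  # a real client would report this replica to the namenode
    raise IOError("all replicas failed checksum verification")


# Toy usage: a four-chunk block with one flipped byte in the first replica.
block = bytes(range(256)) * 8
sums = chunk_checksums(block)
damaged = bytearray(block)
damaged[700] ^= 0xFF  # byte 700 falls in the second chunk (index 1)
data, bad = read_with_fallback([bytes(damaged), block], sums)
```

In this toy run the first replica fails verification and the second is returned, which mirrors the "report, then use another copy" behaviour described in the post. The background scanner applies the same kind of check to data that nobody has read recently.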
12-14-2015
07:34 PM
3 Kudos
The .hwx version is one which has a security fix in it and no other bug fix; it's not published to the Maven Central repo, so it's not easy to pick up. We do have a repo which has it, but I think it's an internal one whose URLs don't resolve. You can build with the normal one just by going -Djetty.version=6.1.26