Member since 09-26-2015
      
135 Posts
85 Kudos Received
26 Solutions

About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3466 | 02-27-2018 04:47 PM |
|  | 5931 | 03-03-2017 10:04 PM |
|  | 3555 | 02-16-2017 10:18 AM |
|  | 1885 | 01-20-2017 02:15 PM |
|  | 11904 | 01-20-2017 02:02 PM |

12-24-2015 10:56 AM
Hue has something behind the scenes called Livy, which is a little REST server doing the work...they haven't teased that out and made it standalone, which is a shame. There's actually something very interesting starting in the Apache incubator, IBM's Spark Kernel code (which will be renamed during the incubation process)...this lets you wire up Jupyter directly, but also offers the ability to upload code callbacks into the Spark cluster itself. I think that's pretty nice, and I'll be keeping an eye on it, though I don't know when it will be ready for broad use.
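To give a feel for what a standalone REST job server buys you, here's a rough sketch against Livy-style endpoints (POST /sessions to start an interpreter, POST /sessions/<id>/statements to run code). The host, port and payloads are made up for illustration and the API may well change as things mature:

```
# create an interactive pyspark session (8998 is the usual Livy port; adjust to taste)
curl -X POST -H "Content-Type: application/json" \
     -d '{"kind": "pyspark"}' \
     http://livy-host:8998/sessions

# once that session (say, id 0) is idle, push a statement at it and poll for the result
curl -X POST -H "Content-Type: application/json" \
     -d '{"code": "sc.parallelize(range(100)).count()"}' \
     http://livy-host:8998/sessions/0/statements
```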
						
					

12-24-2015 10:52 AM
1 Kudo
that doc is a bit confusing: I read it myself and wasn't too sure. I've filed a JIRA on reviewing and updating it. Bearing in mind the Python agent-side code is not something I know my way around, I think that comment about hostname:port is actually describing how site configurations can be built up. I believe that Python installation code running in a container can actually push out any quicklink values it wants. Client apps do have to be aware that (a) that data isn't there until the container is up and running, and (b) after failover the outdated entries will hang around until replaced.
						
					

12-23-2015 12:35 PM
I thought that on a secure cluster Zeppelin can only make queries as the user hosting the web UI...though I'm not sure there. Spark SQL doesn't do user authentication in general, not via the thrift server (JDBC and especially ODBC). Nor does it do column-level access control as Hive does. It's just going straight at the files themselves. So it's not that locked down.
						
					

12-19-2015 05:18 PM
1 Kudo
if it's networking that's the problem, just download the JAR file yourself and use the --jars option to add it to the classpath. Looks like it lives under https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/
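Roughly like this; the artifact name follows the usual Maven layout under that URL, and the local path is just an example (note spark-csv has a couple of dependencies of its own, e.g. commons-csv, which you may need to grab the same way):

```
# pull the artifact straight off Maven central
wget https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.jar

# put it on the driver and executor classpaths at launch time
spark-shell --jars /path/to/spark-csv_2.10-1.1.0.jar
```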
						
					

12-18-2015 08:28 PM
1 Kudo
it won't; Java doesn't look at the OS proxy settings. (There are a couple of exceptions, but they don't usually surface in a world where applets are disabled.)
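You have to hand the proxy to the JVM yourself, via the standard http.proxyHost/http.proxyPort system properties. Something like the line below, where the proxy address and jar name are obviously placeholders and the exact launcher (plain java, HADOOP_OPTS, spark-submit driver options) depends on what you're running:

```
# Java only picks up proxy settings passed in as system properties
java -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
     -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080 \
     -jar my-app.jar
```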
						
					

12-18-2015 08:27 PM
3 Kudos
if you use the s3a:// client, then you can set the fs.s3a.proxy.* settings (host, port, username, password, domain, workstation) to get through. See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
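In core-site.xml that looks something like the snippet below; the host and port are placeholders, and the username/password (and NTLM domain/workstation) entries are only needed if the proxy demands authentication:

```
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>
<!-- only needed if the proxy requires authentication -->
<property>
  <name>fs.s3a.proxy.username</name>
  <value>alice</value>
</property>
<property>
  <name>fs.s3a.proxy.password</name>
  <value>secret</value>
</property>
```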
						
					

12-17-2015 12:05 PM
I'd put that down to DNS being in a mess, or you not having a principal of the form service/host@REALM for the host in question. See: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
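A quick sanity check on the principal side is to list what's actually in the service keytab and try to log in with it; the keytab path, principal and realm below are just examples of a typical layout:

```
# list the principals actually present in the keytab
klist -kt /etc/security/keytabs/nn.service.keytab

# try to authenticate as the service/host principal for this machine
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
```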
						
					

12-17-2015 11:53 AM
I find my ZK logs end up under /var/log/zookeeper, at least with the HDP installations. Make sure that the log directory has the permissions to be written to by the ZK account; if it doesn't, you won't see logs.
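Something like this for checking; the zookeeper user and group names are whatever your installation actually runs the service as:

```
# see who owns the log directory and whether it's writable by the ZK account
ls -ld /var/log/zookeeper

# if not, hand it back to the account running the ZooKeeper service
chown -R zookeeper:hadoop /var/log/zookeeper
```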
						
					

12-17-2015 11:49 AM
9 Kudos
The hdfs fsck operation doesn't check blocks for corruption; that takes too long. It looks at the directory structures alone.

Blocks are checked for corruption whenever they are read; there are little CRC checksum files created for parts of a block which are validated on read() operations. If you work with the file:// filesystem you can see these same files in your local FS. If a block is found to be corrupt on a read, the DFS client will report this to the namenode and ask for another replica, which will be used instead. As Chris said, the namenode then schedules the uncorrupted block for re-replication, as if it were under-replicated. The corrupted block doesn't get deleted until that replication succeeds. Why not? If all replicas are corrupt, then maybe you can salvage something from all the corrupt copies of the block.

Datanodes also scan all their blocks in the background; they just do it fairly slowly by default so that applications don't suffer. The scan ensures that corrupted blocks are usually found before programs read them, and that problems with "cold" data are found at all. It's designed to avoid the problem of all replicas getting corrupted and you not noticing until it's too late to fix. Look in the HDFS XML description for the details on the two options you need to adjust: dfs.datanode.scan.period.hours and dfs.block.scanner.volume.bytes.per.second.
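For example, in hdfs-site.xml on the datanodes; the values here are purely illustrative (the defaults are a three-week scan period and a deliberately throttled scan rate):

```
<!-- rescan every block at least once a week -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>168</value>
</property>

<!-- let the scanner read up to 4 MB/s per volume -->
<property>
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <value>4194304</value>
</property>
```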
How disks fail/data gets corrupted is a fascinating problem. Here are some links if you really want to learn more about it:

- Did you really want that data (an old presentation of mine)
- Failure Trends in a Large Disk Drive Population (Google)
- A Large-Scale Study of Flash Memory Failures in the Field (a recent Facebook paper on flash failures; shows they are less common than you'd fear)

I'd also recommend you look at some of the work on memory corruption; that's enough to make you think that modern laptops and desktops should be using ECC RAM.
						
					

12-14-2015 07:34 PM
3 Kudos
the .hwx version is one which has a security fix in it; nothing else changed. It's not published to the Maven central repo, so it's not easy to pick up. We do have a repo which has it, but I think it's some internal one whose URLs don't resolve. You can build with the normal one just by passing -Djetty.version=6.1.26
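i.e. something along these lines when you kick the build off; the exact goals and profile flags depend on which part of the stack you're building:

```
# build against the stock Jetty release rather than the .hwx one
mvn clean install -DskipTests -Djetty.version=6.1.26
```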
						
					