Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer mostly working on cloud integration.
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2662 | 02-27-2018 04:47 PM
 | 5133 | 03-03-2017 10:04 PM
 | 2704 | 02-16-2017 10:18 AM
 | 1424 | 01-20-2017 02:15 PM
 | 10626 | 01-20-2017 02:02 PM
09-30-2024 07:40 AM
The non-secure cluster is blocked from RPC communication, so use the webhdfs protocol: hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true webhdfs://nn1:50070/foo/bar hdfs://nn2:8020/bar/foo
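If you would rather not pass the flag on every command line, a minimal sketch of setting the same property programmatically on the client configuration (assuming a Hadoop client on the classpath):

```scala
import org.apache.hadoop.conf.Configuration

// Let a secure client fall back to simple auth when talking to an
// insecure cluster; equivalent to the -D flag on the distcp command.
val conf = new Configuration()
conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true)
```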
06-23-2021 10:09 AM
@Arjun_bedi I'm afraid you've just hit a problem which we've only just started encountering: HADOOP-17771, "S3AFS creation fails: Unable to find a region via the region provider chain." This failure surfaces when _all_ of the following conditions are met:
* the deployment is outside EC2
* the configuration option `fs.s3a.endpoint` is unset
* the file `~/.aws/config` does not exist, or does not set a region
* the JVM system property `aws.region` does not declare a region
* the environment variable `AWS_REGION` does not declare a region
You can make this go away by setting the S3 endpoint to s3.amazonaws.com in core-site.xml:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
or in your Scala code: sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com") Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to the regional endpoint. This will save an HTTP request to the central endpoint whenever an S3A filesystem instance is created. We are working on the fix for this and will be backporting it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
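As a sketch of the regional-endpoint variant from Spark, assuming (purely for illustration) a bucket named my-example-bucket living in us-west-2; swap in your own bucket and your region's endpoint:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed example: the data lives in us-west-2, so point S3A straight
// at that region's endpoint rather than the central one.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")

// Creating the filesystem now resolves the region from the endpoint
// instead of walking the region provider chain.
val fs = FileSystem.get(new URI("s3a://my-example-bucket/"), sc.hadoopConfiguration)
fs.listStatus(new Path("s3a://my-example-bucket/")).foreach(st => println(st.getPath))
```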
11-03-2020 07:30 AM
Thanks. What I am experiencing is that the complete file, if 300 GB, has to be assembled before upload to S3, which requires either 300 GB of memory or disk. Distcp does not create a part file per block, and I have not witnessed any file splitting being done. Multipart uploads require you to get an upload ID, upload many part files with a numeric extension, and at the end ask S3 to put them back together. I do not see any of this being done. I admit I do not know much about all this, and it could be happening out of my sight.
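For reference, this is roughly what the multipart sequence described above (get an upload ID, upload numbered parts, ask S3 to reassemble them) looks like when driven directly through the AWS SDK for Java v1; a sketch only, with the bucket, key, file path and part size all invented for illustration:

```scala
import java.io.File
import java.util.{ArrayList => JArrayList}
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CompleteMultipartUploadRequest,
  InitiateMultipartUploadRequest, PartETag, UploadPartRequest}

// Hypothetical names for illustration only.
val bucket = "example-bucket"
val key = "backups/bigfile.bin"
val source = new File("/data/bigfile.bin")
val partSize = 128L * 1024 * 1024   // 128 MB per part

val s3 = AmazonS3ClientBuilder.defaultClient()

// 1. Ask S3 for an upload ID.
val uploadId = s3.initiateMultipartUpload(
  new InitiateMultipartUploadRequest(bucket, key)).getUploadId

// 2. Upload the file as numbered parts, collecting the ETag of each.
val etags = new JArrayList[PartETag]()
var offset = 0L
var partNumber = 1
while (offset < source.length()) {
  val size = math.min(partSize, source.length() - offset)
  val result = s3.uploadPart(new UploadPartRequest()
    .withBucketName(bucket).withKey(key).withUploadId(uploadId)
    .withPartNumber(partNumber).withFile(source)
    .withFileOffset(offset).withPartSize(size))
  etags.add(result.getPartETag)
  offset += size
  partNumber += 1
}

// 3. Ask S3 to stitch the parts back into a single object.
s3.completeMultipartUpload(
  new CompleteMultipartUploadRequest(bucket, key, uploadId, etags))
```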
03-11-2020 09:08 AM
I'm going to point you at the S3A troubleshooting docs, where we try to match error messages to root causes, though "bad request" is a broad issue, and one AWS doesn't provide details on for security reasons: https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A For a us-west-2 endpoint you can/should just stick with the main endpoint. If you do change it, you may have to worry about S3 signing algorithms. Depending on the specific version of CDH you are using, that's a Hadoop config option; for the older versions, it's a JVM property, which is tricky to propagate across Hadoop application deployments. Summary:
* try to just stick to the central endpoint
* if you need to use a "V4 only endpoint", try to use the most recent version of CDH you can and use the fs.s3a.signing.algorithm option
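A sketch of the config-option route on a recent enough release, from Spark; eu-central-1 is just an assumed example of a V4-only region, and the signer name is the usual V4 signer for the v1 AWS SDK, so treat both as assumptions to check against the S3A docs for your version:

```scala
// Sketch only: point S3A at a regional, V4-signing endpoint.
// Verify the signer name against the S3A documentation for your release.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.signing.algorithm", "AWS4SignerType")
```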
11-09-2018 06:23 AM
@Venu Shanmukappa how did you add the proxy? Could you please explain?
03-24-2017 11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in HDP 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, so the bigger the mismatch.

In HDP 2.5 you can get away with tuning the fast uploader to use less heap. It's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all; it was just too confusing, and we couldn't come up with good defaults which would work reliably. Which is why I rewrote it completely for HDP 2.6. The HDP 2.6/Apache Hadoop 2.8 (and already in HDCloud) block output stream can buffer on disk (the default) or via byte buffers, as well as on heap, and tries to do better queueing of writes.

For HDP 2.5, the tuning options are covered in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while changing fs.s3a.multipart.size to something like 10485760 (10 MB) and setting fs.s3a.multipart.threshold to the same value reduces the amount buffered before the uploads begin. Like I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value and decreases with the multipart threshold and size values. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. You will invariably find out on the big uploads that there are limits.

As a result, in HDP 2.5 I'd recommend avoiding fast upload except in the special case where you have a very high speed connection to an S3 server in the same infrastructure, and you use it for code generating data, rather than for big distcp operations, which can read data as fast as it can be streamed off multiple disks.
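A sketch of the kind of HDP 2.5-era tuning described above, using the option names and the 10 MB figure from the post; the thread counts are illustrative assumptions, and the right numbers depend on your bandwidth and heap:

```scala
import org.apache.hadoop.conf.Configuration

// Illustrative values only: fewer uploader threads plus a small multipart
// size/threshold keep the number and size of heap-buffered blocks down.
val conf = new Configuration()
conf.setBoolean("fs.s3a.fast.upload", true)            // the option being tuned
conf.setInt("fs.s3a.threads.core", 2)                  // assumed low thread counts
conf.setInt("fs.s3a.threads.max", 4)
conf.setLong("fs.s3a.multipart.size", 10485760L)       // 10 MB parts
conf.setLong("fs.s3a.multipart.threshold", 10485760L)  // start uploading at 10 MB
```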
01-09-2017 09:45 AM
Like I said, if you are running on EC2, you should be able to play with Netflix's Chaos Monkey directly. I haven't used it for a while; https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide covers starting it. I think it's got more complex than in the early days, when it was more of a CLI thing.
03-14-2016 03:30 PM
There's no WS-* code, hence no need for the WS-* stuff. OAuth? Maybe some time in the future. Note also: SASL, SPNEGO
03-07-2017 10:46 PM
Using KEYRING was state of the art at the time Kerberos was bundled for RHEL 7. However, moving forward into the world of containers, using KEYRING becomes a challenge, so SSSD is building an internal ticket cache that will be supported by the system Kerberos libraries. So in general, the recommendation nowadays is to use the native OS Kerberos libraries; they are the most recent and will provide the latest functionality and experience.
01-14-2016 05:57 PM
If you were logged in via Kerberos when you submitted the work, the jobs usually pick up your credentials, then request Hadoop tokens from the various services. Try using "kdestroy" to remove your Kerberos tickets and repeating your operations, to see what happens then.
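As a quick way to see what the client believes about its own login state before and after a kdestroy, a sketch using Hadoop's UserGroupInformation (the principal and keytab path in the commented-out line are hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Report who the Hadoop client thinks it is and whether it currently
// holds Kerberos credentials; run before and after kdestroy to compare.
UserGroupInformation.setConfiguration(new Configuration())
val ugi = UserGroupInformation.getCurrentUser
println(s"user=${ugi.getUserName} kerberos=${ugi.hasKerberosCredentials}")

// Optional: log in explicitly from a keytab instead of the ticket cache.
// UserGroupInformation.loginUserFromKeytab("alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab")
```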