Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer mostly working on cloud integration.
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2662 | 02-27-2018 04:47 PM
 | 5133 | 03-03-2017 10:04 PM
 | 2704 | 02-16-2017 10:18 AM
 | 1424 | 01-20-2017 02:15 PM
 | 10626 | 01-20-2017 02:02 PM
09-30-2024 07:40 AM
The non-secure cluster is blocked from RPC communication, so use the webhdfs protocol: hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true webhdfs://nn1:50070/foo/bar hdfs://nn2:8020/bar/foo
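If you would rather not pass the flag on every command line, a minimal sketch of setting the same property programmatically on the client configuration (assuming a Hadoop client on the classpath):

```scala
import org.apache.hadoop.conf.Configuration

// Let a secure client fall back to simple auth when talking to an
// insecure cluster; equivalent to the -D flag on the distcp command.
val conf = new Configuration()
conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true)
```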
06-23-2021 10:09 AM
@Arjun_bedi I'm afraid you've just hit a problem which we've only just started encountering: HADOOP-17771, "S3AFS creation fails: Unable to find a region via the region provider chain." This failure surfaces when _all_ of the following conditions are met:
* the deployment is outside EC2
* the configuration option `fs.s3a.endpoint` is unset
* the file `~/.aws/config` does not exist, or does not set a region
* the JVM system property `aws.region` does not declare a region
* the environment variable `AWS_REGION` does not declare a region
You can make this go away by setting the S3 endpoint to s3.amazonaws.com in core-site.xml:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
or in your Scala code: sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com") Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to the regional endpoint. This will save an HTTP request to the central endpoint whenever an S3A filesystem instance is created. We are working on the fix for this and will be backporting it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
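As a sketch of the regional-endpoint variant from Spark, assuming (purely for illustration) a bucket named my-example-bucket living in us-west-2; swap in your own bucket and your region's endpoint:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed example: the data lives in us-west-2, so point S3A straight
// at that region's endpoint rather than the central one.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")

// Creating the filesystem now resolves the region from the endpoint
// instead of walking the region provider chain.
val fs = FileSystem.get(new URI("s3a://my-example-bucket/"), sc.hadoopConfiguration)
fs.listStatus(new Path("s3a://my-example-bucket/")).foreach(st => println(st.getPath))
```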
11-03-2020 07:30 AM
Thanks. What I am experiencing is that the complete file, if 300 GB, has to be assembled before upload to S3, which requires either 300 GB of memory or disk. Distcp does not create a part file per block, and I have not witnessed any file splitting being done. Multipart uploads require you to get an upload ID, upload many part files with a numeric extension, and at the end ask S3 to put them back together. I do not see any of this being done. I admit I do not know much about all this, and it could be happening out of my sight.
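For reference, this is roughly what the multipart sequence described above (get an upload ID, upload numbered parts, ask S3 to reassemble them) looks like when driven directly through the AWS SDK for Java v1; a sketch only, with the bucket, key, file path and part size all invented for illustration:

```scala
import java.io.File
import java.util.{ArrayList => JArrayList}
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CompleteMultipartUploadRequest,
  InitiateMultipartUploadRequest, PartETag, UploadPartRequest}

// Hypothetical names for illustration only.
val bucket = "example-bucket"
val key = "backups/bigfile.bin"
val source = new File("/data/bigfile.bin")
val partSize = 128L * 1024 * 1024   // 128 MB per part

val s3 = AmazonS3ClientBuilder.defaultClient()

// 1. Ask S3 for an upload ID.
val uploadId = s3.initiateMultipartUpload(
  new InitiateMultipartUploadRequest(bucket, key)).getUploadId

// 2. Upload the file as numbered parts, collecting the ETag of each.
val etags = new JArrayList[PartETag]()
var offset = 0L
var partNumber = 1
while (offset < source.length()) {
  val size = math.min(partSize, source.length() - offset)
  val result = s3.uploadPart(new UploadPartRequest()
    .withBucketName(bucket).withKey(key).withUploadId(uploadId)
    .withPartNumber(partNumber).withFile(source)
    .withFileOffset(offset).withPartSize(size))
  etags.add(result.getPartETag)
  offset += size
  partNumber += 1
}

// 3. Ask S3 to stitch the parts back into a single object.
s3.completeMultipartUpload(
  new CompleteMultipartUploadRequest(bucket, key, uploadId, etags))
```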
03-11-2020 09:08 AM
I'm going to point you at the S3A troubleshooting docs, where we try to match error messages to root causes, though "bad request" is a broad issue, and one AWS doesn't provide details on for security reasons: https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A For a us-west-2 endpoint you can/should just stick with the main endpoint. If you do change it, you may have to worry about S3 signing algorithms. Depending on the specific version of CDH you are using, that's a Hadoop config option; for the older versions, it's a JVM property, which is tricky to propagate across Hadoop application deployments. Summary:
* try to just stick to the central endpoint
* if you need to use a "V4 only endpoint", try to use the most recent version of CDH you can and use the fs.s3a.signing.algorithm option
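A sketch of the config-option route on a recent enough release, from Spark; eu-central-1 is just an assumed example of a V4-only region, and the signer name is the usual V4 signer for the v1 AWS SDK, so treat both as assumptions to check against the S3A docs for your version:

```scala
// Sketch only: point S3A at a regional, V4-signing endpoint.
// Verify the signer name against the S3A documentation for your release.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.signing.algorithm", "AWS4SignerType")
```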
11-09-2018 06:23 AM
@Venu Shanmukappa how did you add the proxy? Could you please explain?
03-24-2017 11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in HDP 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, so the bigger the mismatch.

In HDP 2.5 you can get away with tuning the fast uploader to use less heap. It's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all; it was just too confusing, and we couldn't come up with good defaults which would work reliably. Which is why I rewrote it completely for HDP 2.6. The HDP 2.6/Apache Hadoop 2.8 (and already in HDCloud) block output stream can buffer on disk (the default) or via byte buffers, as well as on heap, and tries to do better queueing of writes.

For HDP 2.5, the tuning options are covered in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while changing fs.s3a.multipart.size to something like 10485760 (10 MB) and setting fs.s3a.multipart.threshold to the same value reduces the amount buffered before the uploads begin. Like I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value and decreases with the multipart threshold and size values. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. You will invariably find out on the big uploads that there are limits.

As a result, in HDP 2.5 I'd recommend avoiding fast upload except in the special case where you have a very high speed connection to an S3 server in the same infrastructure, and you use it for code generating data, rather than for big distcp operations, which can read data as fast as it can be streamed off multiple disks.
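A sketch of the kind of HDP 2.5-era tuning described above, using the option names and the 10 MB figure from the post; the thread counts are illustrative assumptions, and the right numbers depend on your bandwidth and heap:

```scala
import org.apache.hadoop.conf.Configuration

// Illustrative values only: fewer uploader threads plus a small multipart
// size/threshold keep the number and size of heap-buffered blocks down.
val conf = new Configuration()
conf.setBoolean("fs.s3a.fast.upload", true)            // the option being tuned
conf.setInt("fs.s3a.threads.core", 2)                  // assumed low thread counts
conf.setInt("fs.s3a.threads.max", 4)
conf.setLong("fs.s3a.multipart.size", 10485760L)       // 10 MB parts
conf.setLong("fs.s3a.multipart.threshold", 10485760L)  // start uploading at 10 MB
```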
01-09-2017 09:45 AM
Like I said, if you are running on EC2, you should be able to play with Netflix's Chaos Monkey directly. I haven't used it for a while; https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide covers starting it. I think it's got more complex than in the early days, when it was more of a CLI thing.
03-14-2016 03:30 PM
There's no WS-* code, hence no need for the WS-* stuff. OAuth? Maybe some time in the future. Note also: SASL, SPNEGO
03-07-2017 10:46 PM
Using KEYRING was state of the art at the time Kerberos was bundled for RHEL 7. However, moving forward into the world of containers, using KEYRING becomes a challenge, so SSSD is building an internal ticket cache that will be supported by the system Kerberos libraries. So in general, the recommendation nowadays is to use the native OS Kerberos libraries; they are the most recent and will provide the latest functionality and experience.
01-14-2016 05:57 PM
If you were logged in via Kerberos when you submitted the work, the jobs usually pick up your credentials, then request Hadoop tokens from the various services. Try using "kdestroy" to remove your Kerberos tickets and repeating your operations, to see what happens then.
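As a quick way to see what the client believes about its own login state before and after a kdestroy, a sketch using Hadoop's UserGroupInformation (the principal and keytab path in the commented-out line are hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Report who the Hadoop client thinks it is and whether it currently
// holds Kerberos credentials; run before and after kdestroy to compare.
UserGroupInformation.setConfiguration(new Configuration())
val ugi = UserGroupInformation.getCurrentUser
println(s"user=${ugi.getUserName} kerberos=${ugi.hasKerberosCredentials}")

// Optional: log in explicitly from a keytab instead of the ticket cache.
// UserGroupInformation.loginUserFromKeytab("alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab")
```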