The root repo URLs are not browsable like a file system directory, hence the 404/Not Found responses. (It looks like the public repos are hosted on AWS CloudFront as objects.) Yum constructs the paths to metadata and RPMs from the root URLs.
To test whether your sandbox can access the repos, try a curl from the sandbox.
Either of these URLs can be used to test connectivity:
http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/188.8.131.52/repodata/repomd.xml
Another test is to add the hdp.repo config to /etc/yum.repos.d manually and try searching for packages via yum search.
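As a rough sketch of the same connectivity check in script form: yum appends repodata/repomd.xml to the repo baseurl, so fetching that one file is a reasonable reachability test. The function names below are made up for illustration, and the baseurl you pass in is whatever your hdp.repo file contains.

```python
import urllib.error
import urllib.request


def repomd_url(baseurl: str) -> str:
    """Return the metadata URL yum derives from a repo baseurl."""
    return baseurl.rstrip("/") + "/repodata/repomd.xml"


def repo_reachable(baseurl: str, timeout: float = 10.0) -> bool:
    """True if repomd.xml under baseurl answers with HTTP 200."""
    try:
        with urllib.request.urlopen(repomd_url(baseurl), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or an HTTP error status
        return False
```

Calling repo_reachable() with the baseurl from the sandbox tells you the same thing as the curl: if it returns False, the problem is network access from the sandbox rather than yum itself.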
In regards to sampling, I would perform a random sample on the Hadoop side and upload that sample to Teradata: tens of thousands of records as opposed to millions or billions. Then perform the full outer join on the Teradata side, with the small data set joined against the large data set. If the primary key of the data set matches the Teradata primary index, you should get reasonable join performance. Depending on capacity, you could also go the other way, sampling from Teradata and comparing to the full Hadoop dataset.

There are a number of online tools you can use to generate sample sizes for different confidence levels. The processing capacity required by either system to generate a sample set will depend on how truly random you need the sample to be, or whether a sampling 'short-cut' technique will work. For example, on one of our systems we have a multi-billion row table that's very well distributed. Selecting the top-n rows from each block or AMP gives a useful sample, though not a statistically random one.

Assuming you're not concerned with validating field values, only record counts (e.g. you trust the field transforms), once you have the joined sample you have the following pieces of information, based on key matching:
- count of records found in the old data sets, but not in the new data sets (null)
- count of records not found in the old data sets (null), but found in the new data sets
- count of records found in both old and new data sets

Depending on how many different datasets or schemas you're dealing with, the key matching technique is something that can be generated more easily programmatically. You just need to specify key fields, as opposed to having to enumerate all fields. Assuming you included the fields in the subset, not just the keys, you could then use this matched subset to also validate field transformations. Any of the measures you've listed above could be used to get confidence between the two data sets.
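To make the key matching idea concrete, here is a minimal sketch in plain Python rather than Teradata SQL. It assumes each record is a dict, that the key fields uniquely identify a record, and that both sets fit in memory; key_of and key_match_counts are illustrative names, not part of any tool.

```python
def key_of(record: dict, key_fields: list) -> tuple:
    """Build a (possibly composite) key from the named fields."""
    return tuple(record[field] for field in key_fields)


def key_match_counts(old_rows, new_rows, key_fields):
    """Emulate the counts a full outer join on the key fields would give."""
    old_keys = {key_of(row, key_fields) for row in old_rows}
    new_keys = {key_of(row, key_fields) for row in new_rows}
    return {
        "old_only": len(old_keys - new_keys),  # in old, missing from new
        "new_only": len(new_keys - old_keys),  # in new, missing from old
        "both": len(old_keys & new_keys),      # matched on key
    }
```

Only the key field names change from one dataset to the next, which is what makes this easy to generate for many tables.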
It depends on how rigorous you need the result to be, and whether you need to validate the migration of the data or the migration of the code (stored procs), i.e. the data processed after the migration. With a 95% tolerance target, you've got room to move.

Our best case migration involves a parallel run on both old and new systems. Then we compare record counts by some grouping dimension on both systems. If the record counts match, that gives us a good measure of confidence. You can also sample record subsets and compare against the other system. A full outer join will provide a count of mismatches on both the new and old system. For larger data volumes, assuming well distributed data, 10K - 20K record samples will give reasonable coverage. Again, depending on the relative capacity of the Teradata and Hadoop systems, you'll probably get the fastest result by pulling the samples from the Hadoop system into the Teradata system and performing the outer joins and aggregations there.
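The record count comparison by grouping dimension could look roughly like this. It is a sketch that assumes the rows (or pre-aggregated extracts) from both systems are available as dicts; the dimension name load_date in the usage note is a made-up example.

```python
from collections import Counter


def count_by(rows, dimension):
    """Record count per value of the grouping dimension."""
    return Counter(row[dimension] for row in rows)


def count_mismatches(old_rows, new_rows, dimension):
    """Return {group: (old_count, new_count)} for groups whose counts differ."""
    old_counts = count_by(old_rows, dimension)
    new_counts = count_by(new_rows, dimension)
    groups = sorted(set(old_counts) | set(new_counts))
    return {
        group: (old_counts.get(group, 0), new_counts.get(group, 0))
        for group in groups
        if old_counts.get(group, 0) != new_counts.get(group, 0)
    }
```

An empty result from count_mismatches(old_rows, new_rows, "load_date") means every group matched; anything else points you at the groups worth sampling in more detail.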
We're going through this process now, migrating a non-trivial amount of data from an older cluster onto a new cluster and environment. We have a couple of requirements and constraints that limited some of the options:
- The datanodes on the 2 clusters don't have network connectivity. Each cluster resides in its own private firewalled network. (As an added complication, we also use the same hostnames in each of the two private environments.) distcp at scale requires the datanodes in the 2 clusters to be able to communicate directly.
- We have different security models in the two clusters. The old cluster uses simple authentication. The new cluster uses Kerberos for authentication. I've found that getting some of the tools to work with 2 different authentication models can be difficult.
- I want to preserve the file metadata from the old cluster on the new cluster, e.g. file create time, ownership, file system permissions. Some of the options can move the data from the source cluster, but they write 'new' files on the target cluster. The old cluster has been running for around 2 years, so there's a lot of useful information in those file timestamps.
- I need to perform a near-live migration. I have to keep the old cluster running in parallel while migrating data and users to the new cluster. Can't just cut access to the old cluster.

After trying a number of tools and combinations, including WebHDFS and Knox combinations, we've settled on the following:
- Access the old cluster via NFS gateways. We lock the NFS access controls to only allow the edge servers on the new cluster to mount the HDFS NFS volume. The edge servers in our target cluster are Airflow workers running as a grid. We've created a source NFS gateway for each target edge server Airflow worker, enabling a degree of scale-out. Not as good as distcp scale-out, but better than a single point pipe.
- Copy the data with the old fashioned hdfs dfs -copyFromLocal -p <old_cluster_nfs_dir> <new_cluster_hdfs_dir>. This enables us to preserve the file timestamps as well as ownerships.

As part of managing the migration process, we're also making use of HDFS snapshots on both source and target to enable consistency management. Our migration jobs take snapshots at the beginning and end of each migration job and issue delta or difference reports to identify if data was modified and possibly missed during the migration process. I'm expecting that some of our larger data sets will take hours to complete; for the largest few, possibly more than 24 hours. In order to perform the snapshot management we also added some additional wrapper code. WebHDFS can be used to create and list snapshots, but it doesn't yet have an operation for returning a snapshot difference report.

For the Hive metadata, the majority of our Hive DDL exists in git/source code control. We're actually using this migration as an opportunity to enforce this for our production objects. For end user objects, e.g. analysts' data labs, we're exporting the DDL on the old cluster and re-playing the DDL on the new cluster, with tweaks for any reserved word collisions.

We don't have HBase operating on our old cluster, so I didn't have to come up with a solution for that problem.
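For anyone wanting a starting point for the snapshot wrapper, here is a minimal sketch of the target-side sequence: snapshot, copy from the NFS mount, snapshot again, then a difference report. This is not our actual wrapper code. The function names and the job_id-based snapshot naming are assumptions for illustration, and the target directory is assumed to already be snapshottable (hdfs dfsadmin -allowSnapshot). The same snapshot and diff commands would apply on the source cluster.

```python
import subprocess


def snapshot_command(path, name):
    """Create a named snapshot of a snapshottable HDFS directory."""
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def copy_command(nfs_dir, hdfs_dir):
    """Copy from the NFS-mounted old cluster; -p preserves times, owner, perms."""
    return ["hdfs", "dfs", "-copyFromLocal", "-p", nfs_dir, hdfs_dir]


def diff_command(path, from_snapshot, to_snapshot):
    """Report what changed under path between two snapshots."""
    return ["hdfs", "snapshotDiff", path, from_snapshot, to_snapshot]


def migration_plan(nfs_dir, hdfs_parent_dir, job_id):
    """Ordered commands for one migration job (hypothetical naming scheme)."""
    start, end = f"{job_id}-start", f"{job_id}-end"
    return [
        snapshot_command(hdfs_parent_dir, start),
        copy_command(nfs_dir, hdfs_parent_dir),
        snapshot_command(hdfs_parent_dir, end),
        diff_command(hdfs_parent_dir, start, end),
    ]


def run_plan(plan, dry_run=True):
    """Print the commands, or execute them and stop at the first failure."""
    for command in plan:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

Running run_plan(migration_plan(...)) with the default dry_run=True only prints the commands, which is a cheap way to review a job before letting it loose on a multi-hour copy.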
If you're looking to improve access to back-end service UIs for the ops team, as opposed to exposing the services to the larger user base, we make use of ssh tunneling via our admin jump hosts to effectively create personal SOCKS proxies for each ops/admin user. We then use one of the dynamic proxy config plugins in Chrome or Firefox to direct requests to those services based on hostname, or in our case the domain of the Hadoop environment. This has the advantage of being very transparent, and service URLs all tend to resolve correctly, including https based services. The disadvantage is that the person using this approach needs to know how to set up an ssh tunnel and how to configure their browser to use that tunnel for the Hadoop services.
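The two pieces of that setup can be sketched as follows. The first function builds the ssh command for a personal SOCKS proxy (-D opens a local dynamic forward, -N skips running a remote command); the second mimics the kind of domain rule a browser proxy plugin applies. The user, jump host, port 1080 and domain are placeholders, and the function names are made up for this example.

```python
def socks_tunnel_command(user, jump_host, local_port=1080):
    """ssh command that opens a SOCKS proxy on localhost:local_port."""
    return ["ssh", "-N", "-D", str(local_port), f"{user}@{jump_host}"]


def via_proxy(hostname, hadoop_domain):
    """True if a request for hostname should be sent through the tunnel."""
    host = hostname.lower().rstrip(".")
    domain = hadoop_domain.lower().lstrip(".")
    # Match the domain itself or any host below it, but not lookalike suffixes
    return host == domain or host.endswith("." + domain)
```

Everything outside the Hadoop domain goes direct, so normal browsing is unaffected while the tunnel is up.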