Member since: 01-09-2019
Posts: 401
Kudos Received: 163
Solutions: 80
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2079 | 06-21-2017 03:53 PM
 | 3142 | 03-14-2017 01:24 PM
 | 1985 | 01-25-2017 03:36 PM
 | 3162 | 12-20-2016 06:19 PM
 | 1580 | 12-14-2016 05:24 PM
04-20-2016
07:50 PM
True that only he knows the requirements, and if it's a plain compare, it's easier to just run Hive queries on both datasets. The solution I gave does not require distcp. It's a way to compare and merge data between two datasets without copying data, by having a Hive table that points to data on the other cluster's HDFS. This still involves data transfer across the cluster on the map side, but it works well for upsert/merge scenarios (which is generally where you end up once the comparison shows there are differences in the data).
04-20-2016
05:43 PM
Ideally any comparison happens within Hadoop (Hive/Pig/Spark on YARN) instead of exporting CSVs and doing a diff on a client node. Export would still work for smaller datasets, but it's better to do the comparison within the cluster rather than take data out of the two clusters and compare at the client level. We have done some comparisons of this kind using Pig. With this approach, any corrective measures on your dataset, like upsert and merge, will be easier than with a client-based CSV diff.
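As a rough illustration of an in-cluster comparison (our actual jobs were in Pig, but a Hive query run from the shell works the same way; the table and column names here are hypothetical), a full outer join on the key columns will surface rows that differ:

```bash
# Hedged sketch: prod_db.orders, staging_db.orders, order_id and row_checksum
# are placeholder names, not from the original thread.
hive -e "
SELECT COALESCE(p.order_id, s.order_id) AS order_id,
       CASE
         WHEN p.order_id IS NULL THEN 'missing_in_prod'
         WHEN s.order_id IS NULL THEN 'missing_in_staging'
         ELSE 'value_mismatch'
       END AS diff_type
FROM prod_db.orders p
FULL OUTER JOIN staging_db.orders s
  ON p.order_id = s.order_id
WHERE p.order_id IS NULL
   OR s.order_id IS NULL
   OR p.row_checksum <> s.row_checksum;
"
```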
04-20-2016
03:34 PM
A shortcut you can take without copying data across is to copy only the metadata. This only works if all nodes in Staging can communicate with all nodes in Prod. Create a replica of the Prod table in Staging, but with the Hive LOCATION pointing to the full HDFS path in Prod. (If Prod runs NameNode HA, you need to set up the client configuration in Staging so the HA nameservice path resolves.) You then see both Hive tables in Staging, but one of them actually points to HDFS in Prod. Compute in this case happens in Staging, while the data for the Prod table comes from Prod. There will not be any data-local processing, but you avoid copying the data. You can do it the other way around as well, where the comparison happens on Prod, but running the comparison on Staging is better since you are not adding any compute load to Prod.
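A minimal sketch of what that replica table can look like, assuming a non-HA Prod NameNode and hypothetical host, database, table, and column names:

```bash
# Run on the Staging cluster; only the metadata lives here, the LOCATION points
# at Prod HDFS. Hostname, port, paths, and columns are placeholders.
hive -e "
CREATE EXTERNAL TABLE staging_db.orders_prod_replica (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC
LOCATION 'hdfs://prod-nn.example.com:8020/apps/hive/warehouse/prod_db.db/orders';
"
```

If Prod runs NameNode HA, the LOCATION would reference the nameservice instead of a single host, and Staging's client hdfs-site.xml needs the corresponding dfs.nameservices and failover proxy provider entries so that the path resolves.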
04-20-2016
03:09 PM
1 Kudo
From your example, you seem to be using Tez. Check this article https://community.hortonworks.com/articles/22419/hive-on-tez-performance-tuning-determining-reducer.html which has more detail on how the number of reducers is controlled. This is different from how it works in MapReduce: hive.exec.reducers.bytes.per.reducer specifies how much data goes to each reducer, which in turn determines the number of reducers.
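For example, you can adjust the per-reducer input size per query; the values and query below are illustrative placeholders, not recommendations:

```bash
# 268435456 bytes = 256 MB per reducer; hive.exec.reducers.max caps the reducer count.
hive -e "
SET hive.exec.reducers.bytes.per.reducer=268435456;
SET hive.exec.reducers.max=1009;
SELECT col, COUNT(*) FROM my_db.my_table GROUP BY col;
"
```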
04-19-2016
08:51 PM
2 Kudos
Most likely, your install of ambari-agent didn't go through. See if /var/lib/ambari-agent is empty; ambari-python-wrap lives in this directory. You can check the yum transaction log to see whether the install completed, or try a reinstall of ambari-agent.
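A quick sequence of checks along those lines (the package name assumes a standard Ambari repo install; the transaction id is a placeholder):

```bash
ls -la /var/lib/ambari-agent        # empty or missing suggests the install didn't complete
yum history list ambari-agent       # find the transaction for the ambari-agent install
yum history info <transaction-id>   # confirm it finished without errors
yum reinstall -y ambari-agent       # reinstall if the transaction was incomplete
```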
04-18-2016
04:07 PM
That topic is mostly about Change Data Capture, and we use similar techniques for that use case, but my question was not related to it. Most of our cases are full data loads. We are looking to make this process easier since we have hundreds of tables. Sqoop has a good way to create the table metadata, which we are using, but this ends up as text files, and we have to manually create another set of tables to write the data out as ORC.
04-18-2016
03:04 PM
1 Kudo
Right now, we use a two-step process to import data via Sqoop into ORC tables. Step 1: use Sqoop to import the raw data (in text format) into Hive tables. Step 2: use INSERT OVERWRITE ... SELECT to write this into a Hive table stored as ORC. With this approach, we have to manually create the ORC-backed tables that Step 2 writes into, and we also end up with raw data in text format that we don't really need. Is there a way to write directly into Hive tables in ORC format? Also, is there a way to avoid manually creating the ORC-backed tables from the text-file-backed ones?
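For reference, a sketch of the two-step flow described above; the JDBC URL, credentials, and table names are hypothetical placeholders:

```bash
# Step 1: Sqoop import into a text-backed Hive table (default database).
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username etl_user -P \
  --table ORDERS \
  --hive-import \
  --hive-table orders_text

# Step 2: manually created ORC-backed table, then rewrite the text data into it.
hive -e "
CREATE TABLE IF NOT EXISTS orders_orc (
  order_id BIGINT,
  amount   DOUBLE
) STORED AS ORC;
INSERT OVERWRITE TABLE orders_orc
SELECT order_id, amount FROM orders_text;
"
```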
Labels:
- Apache Sqoop
03-02-2016
06:32 PM
1 Kudo
It's good if you separate this out as a new question. Right now there is no support for time-based queue capacity changes. However, we were able to run a cron-based job that applies manual changes to the Capacity Scheduler configuration and refreshes the queues. Be aware that if you do this and someone restarts the RMs and/or refreshes the queues from Ambari, your cron-based changes will be overwritten.
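A rough sketch of that workaround, assuming hypothetical script and config paths (the real capacity-scheduler.xml location depends on your install):

```bash
# daytime-capacity.sh: swap in the daytime capacities and refresh the queues.
cp /etc/hadoop/conf/capacity-scheduler-daytime.xml /etc/hadoop/conf/capacity-scheduler.xml
yarn rmadmin -refreshQueues

# crontab entries: daytime capacities at 08:00, nighttime capacities at 20:00.
# 0 8  * * * /opt/scripts/daytime-capacity.sh
# 0 20 * * * /opt/scripts/nighttime-capacity.sh
```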
01-21-2016
09:06 PM
If the job completed, most likely it is connected to a different cluster. You can check the YARN logs on the cluster to see if the job was submitted to this cluster. Also check the RM UI (the RM host on port 8088 by default) to see whether the job shows up there.
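For example (the job name and application id below are placeholders):

```bash
yarn application -list -appStates ALL | grep my_job_name   # was it submitted to this cluster?
yarn logs -applicationId application_1461000000000_0001    # aggregated logs, if it ran here
```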
01-21-2016
02:55 PM
DO NOT REFORMAT the filesystem because of missing blocks. If it's not a test cluster, you need to identify how you ended up with missing blocks; one possible reason is that you changed the data directories and removed some of them. Once you have identified the root cause and are fine with it, just get the missing files from their local source and put them back into HDFS. The files under /user/ambari-qa that you listed can simply be deleted.
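The checks and cleanup described above boil down to something like this (the paths are placeholders for the files you listed):

```bash
hdfs fsck / -list-corruptfileblocks                           # enumerate files with missing blocks
hdfs dfs -rm -r /user/ambari-qa/<listed-file>                 # drop the disposable ambari-qa files
hdfs dfs -put /local/source/mydata.csv /apps/data/mydata.csv  # re-upload files you still need
```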