Member since: 01-09-2019
Posts: 401
Kudos Received: 163
Solutions: 80
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2079 | 06-21-2017 03:53 PM
 | 3142 | 03-14-2017 01:24 PM
 | 1985 | 01-25-2017 03:36 PM
 | 3162 | 12-20-2016 06:19 PM
 | 1580 | 12-14-2016 05:24 PM
04-20-2016
07:50 PM
True that only he knows the requirements, and if it's a plain compare, it's easier to just run Hive queries on both datasets. The solution I gave does not require distcp. It's a way to compare and merge data between two datasets without copying data, by having a Hive table that points to data on the other cluster's HDFS. This still involves data transfer across the cluster on the map side, but it works well for upsert/merge scenarios (which is generally where you end up once the comparison shows there are differences in the data).
04-20-2016
05:43 PM
Ideally any comparison happens within Hadoop (Hive/Pig/Spark on YARN) instead of exporting CSVs and doing a diff on a client node. Export would still work for smaller datasets, but it's better to do the comparison within the cluster rather than take data out of the two clusters and compare at the client level. We have done some comparisons of this kind using Pig. With this approach, any corrective measures on your dataset, like upsert and merge, will be easier than with a client-based CSV diff.
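As a rough illustration of an in-cluster comparison (our actual jobs were in Pig, but a Hive query run from the shell works the same way; the table and column names here are hypothetical), a full outer join on the key columns will surface rows that differ:

```bash
# Hedged sketch: prod_db.orders, staging_db.orders, order_id and row_checksum
# are placeholder names, not from the original thread.
hive -e "
SELECT COALESCE(p.order_id, s.order_id) AS order_id,
       CASE
         WHEN p.order_id IS NULL THEN 'missing_in_prod'
         WHEN s.order_id IS NULL THEN 'missing_in_staging'
         ELSE 'value_mismatch'
       END AS diff_type
FROM prod_db.orders p
FULL OUTER JOIN staging_db.orders s
  ON p.order_id = s.order_id
WHERE p.order_id IS NULL
   OR s.order_id IS NULL
   OR p.row_checksum <> s.row_checksum;
"
```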
04-20-2016
03:34 PM
A shortcut you can take without copying data across is to copy only the metadata. This only works if all nodes in Staging can communicate with all nodes in Prod. Create a replica of the Prod table in Staging, but with the Hive LOCATION pointing to the full HDFS path in Prod. (If Prod runs NameNode HA, you need to set up the client configuration in Staging so the HA nameservice path resolves.) You then see both Hive tables in Staging, but one of them actually points to HDFS in Prod. Compute in this case happens in Staging, while the data for the Prod table comes from Prod. There will not be any data-local processing, but you avoid copying the data. You can do it the other way around as well, where the comparison happens on Prod, but running the comparison on Staging is better since you are not adding any compute load to Prod.
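A minimal sketch of what that replica table can look like, assuming a non-HA Prod NameNode and hypothetical host, database, table, and column names:

```bash
# Run on the Staging cluster; only the metadata lives here, the LOCATION points
# at Prod HDFS. Hostname, port, paths, and columns are placeholders.
hive -e "
CREATE EXTERNAL TABLE staging_db.orders_prod_replica (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC
LOCATION 'hdfs://prod-nn.example.com:8020/apps/hive/warehouse/prod_db.db/orders';
"
```

If Prod runs NameNode HA, the LOCATION would reference the nameservice instead of a single host, and Staging's client hdfs-site.xml needs the corresponding dfs.nameservices and failover proxy provider entries so that the path resolves.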
04-20-2016
03:09 PM
1 Kudo
From your example, you seem to be using Tez. Check this article https://community.hortonworks.com/articles/22419/hive-on-tez-performance-tuning-determining-reducer.html which has more detail on how the number of reducers is controlled. This is different from how it works in MapReduce: hive.exec.reducers.bytes.per.reducer specifies how much data goes to each reducer, which in turn determines the number of reducers.
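For example, you can adjust the per-reducer input size per query; the values and query below are illustrative placeholders, not recommendations:

```bash
# 268435456 bytes = 256 MB per reducer; hive.exec.reducers.max caps the reducer count.
hive -e "
SET hive.exec.reducers.bytes.per.reducer=268435456;
SET hive.exec.reducers.max=1009;
SELECT col, COUNT(*) FROM my_db.my_table GROUP BY col;
"
```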
04-19-2016
08:51 PM
2 Kudos
Most likely, your install of ambari-agent didn't go through. See if /var/lib/ambari-agent is empty; ambari-python-wrap lives in this directory. You can check the yum transaction log to see whether the install completed, or try a reinstall of ambari-agent.
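A quick sequence of checks along those lines (the package name assumes a standard Ambari repo install; the transaction id is a placeholder):

```bash
ls -la /var/lib/ambari-agent        # empty or missing suggests the install didn't complete
yum history list ambari-agent       # find the transaction for the ambari-agent install
yum history info <transaction-id>   # confirm it finished without errors
yum reinstall -y ambari-agent       # reinstall if the transaction was incomplete
```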
04-18-2016
04:07 PM
That topic is mostly about Change Data Capture, and we use similar techniques for that use case, but my question was not related to it. Most of our cases are full data loads. We are looking to make this process easier since we have hundreds of tables. Sqoop has a good way to create the table metadata, which we are using, but this ends up as text files, and we have to manually create another set of tables to write the data out as ORC.
04-18-2016
03:04 PM
1 Kudo
Right now, we use a two-step process to import data via Sqoop into ORC tables. Step 1: use Sqoop to import the raw data (in text format) into Hive tables. Step 2: use INSERT OVERWRITE ... SELECT to write this into a Hive table stored as ORC. With this approach, we have to manually create the ORC-backed tables that Step 2 writes into, and we also end up with raw data in text format that we don't really need. Is there a way to write directly into Hive tables in ORC format? Also, is there a way to avoid manually creating the ORC-backed tables from the text-file-backed ones?
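For reference, a sketch of the two-step flow described above; the JDBC URL, credentials, and table names are hypothetical placeholders:

```bash
# Step 1: Sqoop import into a text-backed Hive table (default database).
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username etl_user -P \
  --table ORDERS \
  --hive-import \
  --hive-table orders_text

# Step 2: manually created ORC-backed table, then rewrite the text data into it.
hive -e "
CREATE TABLE IF NOT EXISTS orders_orc (
  order_id BIGINT,
  amount   DOUBLE
) STORED AS ORC;
INSERT OVERWRITE TABLE orders_orc
SELECT order_id, amount FROM orders_text;
"
```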
Labels:
- Apache Sqoop
03-02-2016
06:32 PM
1 Kudo
It's good if you separate this out as a new question. Right now there is no support for time-based queue capacity changes. However, we were able to run a cron-based job that applies manual changes to the Capacity Scheduler configuration and refreshes the queues. Be aware that if you do this and someone restarts the RMs and/or refreshes the queues from Ambari, your cron-based changes will be overwritten.
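A rough sketch of that workaround, assuming hypothetical script and config paths (the real capacity-scheduler.xml location depends on your install):

```bash
# daytime-capacity.sh: swap in the daytime capacities and refresh the queues.
cp /etc/hadoop/conf/capacity-scheduler-daytime.xml /etc/hadoop/conf/capacity-scheduler.xml
yarn rmadmin -refreshQueues

# crontab entries: daytime capacities at 08:00, nighttime capacities at 20:00.
# 0 8  * * * /opt/scripts/daytime-capacity.sh
# 0 20 * * * /opt/scripts/nighttime-capacity.sh
```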
01-21-2016
09:06 PM
If the job completed, most likely it is connected to a different cluster. You can check the YARN logs on the cluster to see if the job was submitted to this cluster. Also check the RM UI (the RM host on port 8088 by default) to see whether the job shows up there.
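For example (the job name and application id below are placeholders):

```bash
yarn application -list -appStates ALL | grep my_job_name   # was it submitted to this cluster?
yarn logs -applicationId application_1461000000000_0001    # aggregated logs, if it ran here
```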
01-21-2016
02:55 PM
DO NOT REFORMAT the filesystem because of missing blocks. If it's not a test cluster, you need to identify how you ended up with missing blocks; one possible reason is that you changed the data directories and removed some of them. Once you have identified the root cause and are fine with it, just get the missing files from their local source and put them back into HDFS. The files under /user/ambari-qa that you listed can simply be deleted.
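The checks and cleanup described above boil down to something like this (the paths are placeholders for the files you listed):

```bash
hdfs fsck / -list-corruptfileblocks                           # enumerate files with missing blocks
hdfs dfs -rm -r /user/ambari-qa/<listed-file>                 # drop the disposable ambari-qa files
hdfs dfs -put /local/source/mydata.csv /apps/data/mydata.csv  # re-upload files you still need
```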