Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5344 | 08-12-2016 01:02 PM
 | 2186 | 08-08-2016 10:00 AM
 | 2565 | 08-03-2016 04:44 PM
 | 5446 | 08-03-2016 02:53 PM
 | 1396 | 08-01-2016 02:38 PM
01-04-2016
12:23 PM
Are you sure that dump behaves the same? If I do (using your data):

a = load '/tmp/test' using PigStorage(',') as (year,month,day);
dump a;

I get:

(2015,,08)(2015,,09)...

And if I do:

b = foreach a generate month;
dump b;

I get:

()()()

So it looks to me like PigStorage works perfectly fine with dump. If I use illustrate, everything goes wrong though. After using illustrate, even the dump command fails with a NullPointerException, so not only does it not work correctly, it also breaks the grunt shell until I restart it. I think the problem is the illustrate command, which is not too surprising given the warning on top of it in the Pig docs: "Illustrate: (Note! This feature is NOT maintained at the moment. We are looking for someone to adopt it.)"
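Not strictly necessary, but a quick sanity check of the raw file rules out the input itself (path taken from the example above):

```bash
# Confirm the empty month field is really present in the source data
hdfs dfs -cat /tmp/test | head -5
```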
01-04-2016
12:12 PM
I assume you are using HDP? In that case, PIG_HOME is set when the pig command is executed. If you cat /usr/bin/pig, you will find the line export PIG_HOME=${PIG_HOME:-/usr/hdp/2.3.2.0-2950/pig}, so you could run that export manually.
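For example, a minimal sketch (the version directory is the one from this example, and the script name is just a placeholder; adjust both to your cluster):

```bash
# Set PIG_HOME the same way /usr/bin/pig does, then run pig as usual
export PIG_HOME=${PIG_HOME:-/usr/hdp/2.3.2.0-2950/pig}
echo $PIG_HOME
pig -x local myscript.pig
```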
01-04-2016
11:40 AM
2 Kudos
If you have 4 nodes, HDFS will not be able to create 8 replicas. It looks like some tmp files from Accumulo. While I do not know Accumulo too well, some small files like jars normally get a high replication factor so they are locally available on most nodes. You can check the filesystem with:

hadoop fsck / -files -blocks -locations

Normally programs honor a parameter called max replication, which in your case should be 4, but it seems Accumulo doesn't always do that: https://issues.apache.org/jira/browse/ACCUMULO-683 Is this causing any problems, or are you just worried about the errors in the logs?
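A rough sketch of how you could find the affected files and lower their replication by hand (the file path below is only a placeholder):

```bash
# Look for files whose requested replication exceeds the number of datanodes
hadoop fsck / -files -blocks -locations | grep -i "replica" | head

# Lower the replication factor of a specific file to match the 4-node cluster
hdfs dfs -setrep -w 4 /accumulo/tmp/some-file.jar
```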
01-04-2016
11:31 AM
1 Kudo
There are different reasons why a Hive action in Oozie might fail: missing jars (Oozie needs to start it with the sharelib, and it needs to be configured correctly), security (is Kerberos configured?), or just bad SQL. You would normally find more information in the logs, either in the YARN logs of the Oozie launcher task or the Hive task, or in the Oozie logs. Hue lets you click through all of them pretty conveniently.
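If you do not have Hue, something like the following gets you to the same logs from the command line (the Oozie URL, workflow id, and application id are placeholders):

```bash
# Show the failed action and its external (YARN) application id
oozie job -oozie http://oozieserver:11000/oozie -info 0000123-151216123456789-oozie-oozi-W

# Oozie server log for that workflow
oozie job -oozie http://oozieserver:11000/oozie -log 0000123-151216123456789-oozie-oozi-W

# YARN logs of the launcher / Hive job
yarn logs -applicationId application_1450123456789_0042
```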
12-17-2015
09:27 AM
I implemented something similar to that. I wanted to run a data load every hour but load a dimension table from a database every 12 hours. I couldn't use two coordinators, since the load would fail if the dimension table was being loaded at the same time, so doing it in the same workflow was better. Instead of having a coordinator that starts two workflows, I have a parameter in the coordinator that is passed to the workflow and contains the hour, like this: 2015121503 ( 2015-12-15-03 )

<property>
  <name>hour</name>
  <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddHH')}</value>
</property>

I then use a decision node in the workflow to only do the sqoop action every 12 hours (in this case) and do the load alone in all other cases. The sqoop action obviously continues with the load action.

<start to="decision"/>
<decision name="decision">
  <switch>
    <case to="sqoop">
      ${(hour % 100) % 12 == 0}
    </case>
    <default to="load"/>
  </switch>
</decision>
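To make the branch condition concrete, the same arithmetic on two sample values (plain shell, just to illustrate the EL expression above):

```bash
# hour = yyyyMMddHH, so (hour % 100) extracts the HH part
hour=2015121503; echo $(( (hour % 100) % 12 ))   # 3  -> default branch, load only
hour=2015121512; echo $(( (hour % 100) % 12 ))   # 0  -> sqoop branch, then load
```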
12-16-2015
11:12 AM
3 Kudos
And as a workaround, you should be able to remove it from the hive.exec.pre/post/failure.hooks parameters in Ambari (Hive -> Configs -> Advanced) if this is really what is causing the error. Perhaps a bug?
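Before removing anything, it may be worth checking which hooks are actually set; a quick way to do that (the connection string is a placeholder):

```bash
# Printing a property without a value shows its current setting in Hive
beeline -u jdbc:hive2://hiveserver:10000 -e \
  "set hive.exec.pre.hooks; set hive.exec.post.hooks; set hive.exec.failure.hooks;"
```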
12-16-2015
10:09 AM
3 Kudos
I would think this is not enough information to give you an answer. Tez will normally be faster even on big amounts of data, but your setup is pretty unusual: huge amounts of memory (Hive is normally CPU or IO bound, so more nodes are in general a better idea) and very big tasks, which could lead to garbage collection issues (Tez reuses tasks, which would not happen in MapReduce). Also, you don't change the sort memory, for example, so that could still be small.

The biggest question I have is cluster and CPU utilization when you run both jobs. Is the cluster fully utilized (run top on a node) when you run the MapReduce job but not with Tez? Or is it waiting on a specific node? Tez dynamically adjusts the number of reducers, so it is possible that it decided on fewer tasks. Running set hive.tez.exec.print.summary=true; can help you figure out which part of your query took longest.

The second question would be query complexity. Tez allows the reuse of tasks during execution, so complex queries work better. It is always possible that there is a query that will run better on MapReduce.
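A few quick checks along those lines (the container and sort-buffer sizes below are only placeholders to show which knobs I mean, not recommended values):

```bash
# While the query runs, check on a worker node whether CPUs are actually busy
top

# From the Hive session: per-vertex timing of the Tez DAG
#   set hive.tez.exec.print.summary=true;

# Example knobs for task and sort memory (tune to your container sizes)
#   set hive.tez.container.size=4096;
#   set tez.runtime.io.sort.mb=1024;
```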
12-16-2015
09:38 AM
3 Kudos
Just a question for clarification: can you do an hdfs dfs -ls /user/oozie? If the test1 folder is not owned by the admin user (admin only has rwx permissions but is not the owner), then admin cannot change the ownership either; that is the same in Linux. I suppose this is not the case here, but I just wanted to clarify.
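For reference, a quick way to check and, if needed, fix the ownership (the group name here is an assumption):

```bash
# Show owner and group of the folder in question
hdfs dfs -ls /user/oozie

# Only the owner or a superuser can chown; the hdfs user is the HDFS superuser
sudo -u hdfs hdfs dfs -chown -R admin:hadoop /user/oozie/test1
```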
12-08-2015
10:14 AM
3 Kudos
I have a question about monitoring an application that reads files from one data source (Oozie) and streams logs through a streaming pipeline: Logs -> Flume -> Kafka -> Storm -> HDFS. I would like to monitor this application, and apart from email actions in the Oozie workflow and Ambari monitoring errors in the setup, I was wondering if anybody had done something like that before.

1) Storm topology / Kafka broker / Flume agent failures: is there any way to add Ambari alerts or an Ambari view that shows all of this in one place?

2) Data problems, for example if data stops flowing from the source.

Some things I have tried: pushing the number of received and inserted tuples from Storm into Ambari Metrics and showing it in Ambari. Has anybody done something like that? Are custom charts in Ambari supported now? Any simpler solutions?
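For the "push tuple counts into Ambari Metrics" part, what I have in mind is roughly the following (the collector host, metric names, app id, and the count value are assumptions; the JSON layout is the timeline metrics format the AMS collector accepts, as far as I know):

```bash
# Post a custom metric (tuples received in the last interval) to the Ambari Metrics Collector
NOW=$(date +%s%3N)   # current time in milliseconds
curl -s -H "Content-Type: application/json" -X POST \
  "http://ams-collector.example.com:6188/ws/v1/timeline/metrics" \
  -d @- <<EOF
{
  "metrics": [{
    "metricname": "storm.custom.tuples_received",
    "appid": "logpipeline",
    "hostname": "$(hostname -f)",
    "starttime": ${NOW},
    "metrics": { "${NOW}": 1234 }
  }]
}
EOF
```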
Labels:
- Apache Flume
- Apache Kafka
- Apache Storm
11-12-2015
01:04 PM
How do you connect to SFTP? It's not supported by DistCp, and I didn't want to always load the data into a temp folder on the edge node first. So in the end I used sshfs at customer_xyz; it worked pretty well.
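For reference, the basic approach (host names and paths are made up, and the mount options are just the ones I would start with):

```bash
# Mount the customer's SFTP endpoint read-only on the edge node
sshfs transfer_user@sftp.customer.example:/outgoing /mnt/customer_xyz -o ro,reconnect

# Stream the files straight into HDFS, no intermediate copy on local disk
hdfs dfs -put /mnt/customer_xyz/export_*.csv /data/landing/customer_xyz/

# Unmount when done
fusermount -u /mnt/customer_xyz
```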