Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5344 | 08-12-2016 01:02 PM
 | 2186 | 08-08-2016 10:00 AM
 | 2565 | 08-03-2016 04:44 PM
 | 5446 | 08-03-2016 02:53 PM
 | 1396 | 08-01-2016 02:38 PM
01-04-2016
12:23 PM
Are you sure that dump behaves the same? If I do (using your data):

a = load '/tmp/test' using PigStorage(',') as (year,month,day);
dump a;

I get:

(2015,,08)(2015,,09)...

And if I do:

b = foreach a generate month;
dump b;

I get:

()()()

So it looks to me like PigStorage works perfectly fine with dump. If I use illustrate, everything goes wrong though. After using illustrate, even the dump command fails with a NullPointerException, so not only does it not work correctly, it also breaks the grunt shell until I restart it. I think the problem is the illustrate command, which is not too surprising given the warning on top of it in the Pig docs: "Illustrate: (Note! This feature is NOT maintained at the moment. We are looking for someone to adopt it.)"
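Not strictly necessary, but a quick sanity check of the raw file rules out the input itself (path taken from the example above):

```bash
# Confirm the empty month field is really present in the source data
hdfs dfs -cat /tmp/test | head -5
```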
01-04-2016
12:12 PM
I assume you are using HDP? In that case, PIG_HOME is set when the pig command is executed. If you cat /usr/bin/pig, you will find the line export PIG_HOME=${PIG_HOME:-/usr/hdp/2.3.2.0-2950/pig}, so you could run that export manually.
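For example, a minimal sketch (the version directory is the one from this example, and the script name is just a placeholder; adjust both to your cluster):

```bash
# Set PIG_HOME the same way /usr/bin/pig does, then run pig as usual
export PIG_HOME=${PIG_HOME:-/usr/hdp/2.3.2.0-2950/pig}
echo $PIG_HOME
pig -x local myscript.pig
```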
01-04-2016
11:40 AM
2 Kudos
If you have 4 nodes, HDFS will not be able to create 8 replicas. It looks like some tmp files from Accumulo. While I do not know Accumulo too well, some small files like jars normally get a high replication factor so they are locally available on most nodes. You can check the filesystem with:

hadoop fsck / -files -blocks -locations

Normally programs honor a parameter called max replication, which in your case should be 4, but it seems Accumulo doesn't always do that: https://issues.apache.org/jira/browse/ACCUMULO-683 Is this causing any problems, or are you just worried about the errors in the logs?
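A rough sketch of how you could find the affected files and lower their replication by hand (the file path below is only a placeholder):

```bash
# Look for files whose requested replication exceeds the number of datanodes
hadoop fsck / -files -blocks -locations | grep -i "replica" | head

# Lower the replication factor of a specific file to match the 4-node cluster
hdfs dfs -setrep -w 4 /accumulo/tmp/some-file.jar
```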
01-04-2016
11:31 AM
1 Kudo
There are different reasons why a Hive action in Oozie might fail: missing jars (Oozie needs to start it with the sharelib, and it needs to be configured correctly), security (is Kerberos configured?), or just bad SQL. You would normally find more information in the logs, either in the YARN logs of the Oozie launcher task or the Hive task, or in the Oozie logs. Hue lets you click through all of them pretty conveniently.
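If you do not have Hue, something like the following gets you to the same logs from the command line (the Oozie URL, workflow id, and application id are placeholders):

```bash
# Show the failed action and its external (YARN) application id
oozie job -oozie http://oozieserver:11000/oozie -info 0000123-151216123456789-oozie-oozi-W

# Oozie server log for that workflow
oozie job -oozie http://oozieserver:11000/oozie -log 0000123-151216123456789-oozie-oozi-W

# YARN logs of the launcher / Hive job
yarn logs -applicationId application_1450123456789_0042
```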
12-17-2015
09:27 AM
I implemented something similar to that. I wanted to run a data load every hour but load a dimension table from a database every 12 hours. I couldn't use two coordinators, since the load would fail if the dimension table was being loaded at the same time, so doing it in the same workflow was better. Instead of having a coordinator that starts two workflows, I have a parameter in the coordinator that is passed to the workflow and contains the hour, like this: 2015121503 ( 2015-12-15-03 )

<property>
  <name>hour</name>
  <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddHH')}</value>
</property>

I then use a decision node in the workflow to only do the sqoop action every 12 hours (in this case) and do the load alone in all other cases. The sqoop action obviously continues with the load action.

<start to="decision"/>
<decision name="decision">
  <switch>
    <case to="sqoop">
      ${(hour % 100) % 12 == 0}
    </case>
    <default to="load"/>
  </switch>
</decision>
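To make the branch condition concrete, the same arithmetic on two sample values (plain shell, just to illustrate the EL expression above):

```bash
# hour = yyyyMMddHH, so (hour % 100) extracts the HH part
hour=2015121503; echo $(( (hour % 100) % 12 ))   # 3  -> default branch, load only
hour=2015121512; echo $(( (hour % 100) % 12 ))   # 0  -> sqoop branch, then load
```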
12-16-2015
11:12 AM
3 Kudos
And as a workaround, you should be able to remove it from the hive.exec.pre/post/failure.hooks parameters in Ambari (Hive -> Configs -> Advanced) if this is really what is causing the error. Perhaps a bug?
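Before removing anything, it may be worth checking which hooks are actually set; a quick way to do that (the connection string is a placeholder):

```bash
# Printing a property without a value shows its current setting in Hive
beeline -u jdbc:hive2://hiveserver:10000 -e \
  "set hive.exec.pre.hooks; set hive.exec.post.hooks; set hive.exec.failure.hooks;"
```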
12-16-2015
10:09 AM
3 Kudos
I would think this is not enough information to give you an answer. Tez will normally be faster even on big amounts of data, but your setup is pretty unusual: huge amounts of memory (Hive is normally CPU or IO bound, so more nodes are in general a better idea) and very big tasks, which could lead to garbage collection issues (Tez reuses tasks, which would not happen in MapReduce). Also, you don't change the sort memory, for example, so that could still be small.

The biggest question I have is cluster and CPU utilization when you run both jobs. Is the cluster fully utilized (run top on a node) when you run the MapReduce job but not with Tez? Or is it waiting on a specific node? Tez dynamically adjusts the number of reducers, so it is possible that it decided on fewer tasks. Running set hive.tez.exec.print.summary=true; can help you figure out which part of your query took longest.

The second question would be query complexity. Tez allows the reuse of tasks during execution, so complex queries work better. It is always possible that there is a query that will run better on MapReduce.
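A few quick checks along those lines (the container and sort-buffer sizes below are only placeholders to show which knobs I mean, not recommended values):

```bash
# While the query runs, check on a worker node whether CPUs are actually busy
top

# From the Hive session: per-vertex timing of the Tez DAG
#   set hive.tez.exec.print.summary=true;

# Example knobs for task and sort memory (tune to your container sizes)
#   set hive.tez.container.size=4096;
#   set tez.runtime.io.sort.mb=1024;
```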
12-16-2015
09:38 AM
3 Kudos
Just a question for clarification: can you do an hdfs dfs -ls /user/oozie? If the test1 folder is not owned by the admin user (admin only has rwx permissions but is not the owner), then admin cannot change the ownership either; that is the same in Linux. I suppose this is not the case here, but I just wanted to clarify.
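For reference, a quick way to check and, if needed, fix the ownership (the group name here is an assumption):

```bash
# Show owner and group of the folder in question
hdfs dfs -ls /user/oozie

# Only the owner or a superuser can chown; the hdfs user is the HDFS superuser
sudo -u hdfs hdfs dfs -chown -R admin:hadoop /user/oozie/test1
```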
12-08-2015
10:14 AM
3 Kudos
I have a question about monitoring an application that reads files from one data source (Oozie) and streams logs through a streaming pipeline: Logs -> Flume -> Kafka -> Storm -> HDFS. I would like to monitor this application, and apart from email actions in the Oozie workflow and Ambari monitoring errors in the setup, I was wondering if anybody had done something like that before.

1) Storm topology / Kafka broker / Flume agent failures: is there any way to add Ambari alerts or an Ambari view that shows all of this in one place?

2) Data problems, for example if data stops flowing from the source.

Some things I have tried: pushing the number of received and inserted tuples from Storm into Ambari Metrics and showing it in Ambari. Has anybody done something like that? Are custom charts in Ambari supported now? Any simpler solutions?
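For the "push tuple counts into Ambari Metrics" part, what I have in mind is roughly the following (the collector host, metric names, app id, and the count value are assumptions; the JSON layout is the timeline metrics format the AMS collector accepts, as far as I know):

```bash
# Post a custom metric (tuples received in the last interval) to the Ambari Metrics Collector
NOW=$(date +%s%3N)   # current time in milliseconds
curl -s -H "Content-Type: application/json" -X POST \
  "http://ams-collector.example.com:6188/ws/v1/timeline/metrics" \
  -d @- <<EOF
{
  "metrics": [{
    "metricname": "storm.custom.tuples_received",
    "appid": "logpipeline",
    "hostname": "$(hostname -f)",
    "starttime": ${NOW},
    "metrics": { "${NOW}": 1234 }
  }]
}
EOF
```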
Labels:
- Apache Flume
- Apache Kafka
- Apache Storm
11-12-2015
01:04 PM
How do you connect to SFTP? It's not supported by DistCp, and I didn't want to always load the data into a temp folder on the edge node first. So in the end I used sshfs at customer_xyz; it worked pretty well.
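For reference, the basic approach (host names and paths are made up, and the mount options are just the ones I would start with):

```bash
# Mount the customer's SFTP endpoint read-only on the edge node
sshfs transfer_user@sftp.customer.example:/outgoing /mnt/customer_xyz -o ro,reconnect

# Stream the files straight into HDFS, no intermediate copy on local disk
hdfs dfs -put /mnt/customer_xyz/export_*.csv /data/landing/customer_xyz/

# Unmount when done
fusermount -u /mnt/customer_xyz
```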