Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7563 | 08-12-2016 01:02 PM |
| | 2763 | 08-08-2016 10:00 AM |
| | 3776 | 08-03-2016 04:44 PM |
| | 7352 | 08-03-2016 02:53 PM |
| | 1903 | 08-01-2016 02:38 PM |
02-22-2016
08:57 AM
1 Kudo
You have a Shell action which fails, for whatever reason, and you need to find the logs of the application. There are essentially three logs in Oozie:

- The Oozie log (of the server itself).
- The launcher log, which is most likely the interesting one for you. You can find it in Hue, in the Oozie UI (under the action logs), or in YARN (look for an Oozie launcher application in the Resource Manager). This should have some information on what went wrong.
- The application log. This is not applicable for a Shell action, but it is relevant for example for a Pig action or other MapReduce actions. Again available in Hue or YARN.
02-22-2016
08:51 AM
1 Kudo
OK, just tried it out. Your problem is that you didn't use double quotes on the outside. The Linux command line needs double quotes to escape the string; you can use single quotes inside. @Pooja Chawda

[root@sandbox ~]# more test.sql
select * from sample_07 where ${cond};
[root@sandbox ~]# beeline -u jdbc:hive2://sandbox:10000/default -f test.sql --hivevar cond="description='All Occupations'"
...
+-----------------+------------------------+----------------------+-------------------+--+
| sample_07.code  | sample_07.description  | sample_07.total_emp  | sample_07.salary  |
+-----------------+------------------------+----------------------+-------------------+--+
| 00-0000         | All Occupations        | 134354250            | 40690             |
+-----------------+------------------------+----------------------+-------------------+--+
02-21-2016
11:31 AM
1 Kudo
That is cool. I didn't find it in the original JIRA patch, which is weird.
02-20-2016
03:06 PM
2 Kudos
Yeah I agree sometimes the marketing is a bit ahead of things 🙂
02-20-2016
03:05 PM
2 Kudos
The trick with a UDF would be similar to the SerDe: you could write a GenericUDTF that returns all columns of the document. I.e. instead of parsing the document 400 times, it would parse it once. You could also decide how to provide the XPath information to it: as an input parameter, through a config file in HDFS, or hard coded in the code. A similar approach works in Pig. Essentially, if you can write a standalone Java program that can parse your doc, you can put that program into the UDF (or Pig function) and execute it in parallel. A rough sketch of such a UDTF is below.

Regarding automatic execution: this is normally done with Oozie. It's the standard workflow scheduler in Hadoop. You can, for example, define an input folder that is partitioned by timestamp:

/input/2016-12-01
/input/2016-12-02
...

and have Oozie schedule a job for every one of these folders when it becomes available. I am currently working on a project which does that in 15-minute increments; lower is not that efficient. You still need something that uploads the data into HDFS first, though. This can be done with Flume, NiFi, or manually (a cron job that puts the file into the timestamped folder in HDFS). Or you do the transformations directly in NiFi, Storm, or Flume if you have to have it in real time.
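A minimal sketch of what such a GenericUDTF could look like, here in Scala (the class name XmlExplodeUDTF, the example XPath expressions, and the column names are all made up for illustration; only the Hive UDTF API itself is real):

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPath, XPathFactory}

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.xml.sax.InputSource

import scala.collection.JavaConverters._

class XmlExplodeUDTF extends GenericUDTF {

  // The XPath expressions and output column names are hard coded here;
  // they could just as well be read from a config file in HDFS.
  private val xpaths   = Seq("/doc/title/text()", "/doc/author/text()", "/doc/created/text()")
  private val colNames = Seq("title", "author", "created")

  @transient private lazy val xpath: XPath = XPathFactory.newInstance().newXPath()

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    if (argOIs.length != 1)
      throw new UDFArgumentException("expects exactly one string column containing the XML document")
    val fieldOIs: java.util.List[ObjectInspector] =
      colNames.map(_ => PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector).asJava
    ObjectInspectorFactory.getStandardStructObjectInspector(colNames.asJava, fieldOIs)
  }

  override def process(args: Array[AnyRef]): Unit = {
    val xml = args(0).toString
    // Parse the document once per row ...
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new InputSource(new StringReader(xml)))
    // ... then evaluate every XPath against the already parsed DOM
    // and forward all columns as a single output row.
    val row: Array[AnyRef] = xpaths.map(expr => xpath.evaluate(expr, doc): AnyRef).toArray
    forward(row)
  }

  override def close(): Unit = {}
}
```

Once packaged in a JAR, you would ADD JAR it, expose it with CREATE TEMPORARY FUNCTION, and call it like any other table-generating function such as explode().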
02-20-2016
02:54 PM
2 Kudos
Just to make sure: when you use a Kafka consumer on your test topic, do you see messages coming in? Below is the code of the KafkaWordCount example; the parameters of the stream look good. (I would add the ZooKeeper port number, :2181, just to be sure, but it would complain if it couldn't connect to ZooKeeper. Although if it's only a warning you wouldn't see that, since you switched off logging.) I also don't trust your map function. I know it might be some Scala magic and do exactly what it should, but it seems needlessly complicated. So my suggestions:

- Use a consumer to test whether data comes in (see the sketch after the code below).
- Turn on logging.
- Add the ZooKeeper port.
- Replace your map function with the simple .map(_._2) from the word count example.
- Print the results and see what happens.

val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L))
.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordCounts.print()
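For the "use a consumer to test whether data comes in" step, the kafka-console-consumer shell script is the quickest option; as an alternative, here is a rough stand-alone Scala sketch using the plain Kafka client API (the broker address sandbox:6667, the group id, and the topic name test are assumptions, adjust them to your setup):

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer

object TopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "sandbox:6667")  // assumed broker address (not ZooKeeper)
    props.put("group.id", "topic-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")      // also show messages produced before we started

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("test"))

    // Print whatever arrives; if nothing ever shows up here, the problem is
    // on the producer / topic side, not in the Spark streaming code.
    while (true) {
      val records = consumer.poll(1000).iterator()
      while (records.hasNext) println(records.next().value())
    }
  }
}
```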
02-19-2016
05:43 PM
3 Kudos
Oh, and finally: the learning curve is just harsh. Hadoop has so many possibilities. That is an amazing strength, because you can do anything you want, and once you do something you can do it on humongous amounts of data. But it is also a weakness: because the feature space is so huge and it comes from so many different providers, some areas will be more polished than others. So hang in there. The start is hard, but once you've got it, it's great.
02-19-2016
05:33 PM
3 Kudos
Second, regarding how you should do it: honestly, if the dynamic extraction fails, I would extract your data once. If you can write a Java program that extracts what you want, you can put it in MapReduce or Pig. In Hadoop, space is cheap. Write some Spark/MapReduce/Pig and extract the fields once; after that you can query the columns you want. Hive or Pig UDFs work too, and they are surprisingly easy to write. I know that goes against the "analyze the data just as it is" idea, but to be fair that is never completely true in reality: ORC files are strongly typed (and much faster than text files), many users transform their data into Avro, etc. So if you do something that pushes it, like running 400 XPaths on big XML documents, you might have to do a transformation. A rough Spark sketch of that "extract once" approach follows below.
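A minimal Spark sketch of the "extract once" idea, assuming one XML document per file under a hypothetical /input/xml directory; the XPath expressions and the output path are made up for illustration:

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

import org.apache.spark.{SparkConf, SparkContext}
import org.xml.sax.InputSource

object ExtractOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExtractOnce"))

    // Illustrative XPaths only; swap in the fields you actually need.
    val xpaths = Seq("/doc/title/text()", "/doc/author/text()", "/doc/created/text()")

    // wholeTextFiles keeps each XML document in one piece as a (path, content) pair.
    val extracted = sc.wholeTextFiles("/input/xml").map { case (_, xml) =>
      val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))          // parse each document once
      val xp = XPathFactory.newInstance().newXPath()
      xpaths.map(expr => xp.evaluate(expr, doc)).mkString("\t") // one delimited row per document
    }

    // Write plain tab-separated text; a Hive table (or ORC via Spark SQL) can sit on top of it,
    // so later queries read plain columns instead of re-running 400 XPaths.
    extracted.saveAsTextFile("/output/extracted")
    sc.stop()
  }
}
```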
02-19-2016
05:25 PM
3 Kudos
First, regarding the SerDe: not sure if it would work. However, the XPath is not provided during the query but during table creation, so it's a bit different. You will need to stay below 8 or 16 KB (I think) of TBLPROPERTIES, however (the HCatalog table storing the properties has a column length limit; it's possible to alter that by logging in to MySQL and changing the length of the column, but that is obviously not really clean). 400 columns might be pushing it. Where the SerDe is better is that the XML document is only parsed ONCE, not 400 times.
02-19-2016
01:10 PM
1 Kudo
@Pavel Benes I know it has been forever, but it's possible to provide a password file to Beeline, so they might not have to log on. Similar to a keytab, the file would be stored in their account.