Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7563 | 08-12-2016 01:02 PM |
| | 2763 | 08-08-2016 10:00 AM |
| | 3776 | 08-03-2016 04:44 PM |
| | 7352 | 08-03-2016 02:53 PM |
| | 1903 | 08-01-2016 02:38 PM |
02-22-2016
08:57 AM
1 Kudo
You have a Shell action which fails, for whatever reason, and you need to find the logs of the application. There are essentially three logs in Oozie:

- The Oozie log (of the server itself).
- The launcher log, which is most likely the interesting one for you. You can find it in Hue, in the Oozie UI (under the action logs), or in YARN (look for an Oozie launcher application in the Resource Manager). This should have some information on what went wrong.
- The application log. This is not applicable for a Shell action, but it is relevant for example for a Pig action or other MapReduce actions. Again available in Hue or YARN.
02-22-2016
08:51 AM
1 Kudo
OK, just tried it out. Your problem is that you didn't use double quotes on the outside. The Linux command line needs double quotes to escape the string; you can use single quotes inside. @Pooja Chawda

[root@sandbox ~]# more test.sql
select * from sample_07 where ${cond};
[root@sandbox ~]# beeline -u jdbc:hive2://sandbox:10000/default -f test.sql --hivevar cond="description='All Occupations'"
...
+-----------------+------------------------+----------------------+-------------------+--+
| sample_07.code  | sample_07.description  | sample_07.total_emp  | sample_07.salary  |
+-----------------+------------------------+----------------------+-------------------+--+
| 00-0000         | All Occupations        | 134354250            | 40690             |
+-----------------+------------------------+----------------------+-------------------+--+
02-21-2016
11:31 AM
1 Kudo
That is cool. I didn't find it in the original JIRA patch, which is weird.
02-20-2016
03:06 PM
2 Kudos
Yeah I agree sometimes the marketing is a bit ahead of things 🙂
02-20-2016
03:05 PM
2 Kudos
The trick with a UDF would be similar to the SerDe: you could write a GenericUDTF that returns all columns of the document. I.e. instead of parsing the document 400 times, it would parse it once. You could also decide how to provide the XPath information to it: as an input parameter, through a config file in HDFS, or hard coded in the code. A similar approach works in Pig. Essentially, if you can write a standalone Java program that can parse your doc, you can put that program into the UDF (or Pig function) and execute it in parallel. A rough sketch of such a UDTF is below.

Regarding automatic execution: this is normally done with Oozie. It's the standard workflow scheduler in Hadoop. You can, for example, define an input folder that is partitioned by timestamp:

/input/2016-12-01
/input/2016-12-02
...

and have Oozie schedule a job for every one of these folders when it becomes available. I am currently working on a project which does that in 15-minute increments; lower is not that efficient. You still need something that uploads the data into HDFS first, though. This can be done with Flume, NiFi, or manually (a cron job that puts the file into the timestamped folder in HDFS). Or you do the transformations directly in NiFi, Storm, or Flume if you have to have it in real time.
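A minimal sketch of what such a GenericUDTF could look like, here in Scala (the class name XmlExplodeUDTF, the example XPath expressions, and the column names are all made up for illustration; only the Hive UDTF API itself is real):

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPath, XPathFactory}

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.xml.sax.InputSource

import scala.collection.JavaConverters._

class XmlExplodeUDTF extends GenericUDTF {

  // The XPath expressions and output column names are hard coded here;
  // they could just as well be read from a config file in HDFS.
  private val xpaths   = Seq("/doc/title/text()", "/doc/author/text()", "/doc/created/text()")
  private val colNames = Seq("title", "author", "created")

  @transient private lazy val xpath: XPath = XPathFactory.newInstance().newXPath()

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    if (argOIs.length != 1)
      throw new UDFArgumentException("expects exactly one string column containing the XML document")
    val fieldOIs: java.util.List[ObjectInspector] =
      colNames.map(_ => PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector).asJava
    ObjectInspectorFactory.getStandardStructObjectInspector(colNames.asJava, fieldOIs)
  }

  override def process(args: Array[AnyRef]): Unit = {
    val xml = args(0).toString
    // Parse the document once per row ...
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new InputSource(new StringReader(xml)))
    // ... then evaluate every XPath against the already parsed DOM
    // and forward all columns as a single output row.
    val row: Array[AnyRef] = xpaths.map(expr => xpath.evaluate(expr, doc): AnyRef).toArray
    forward(row)
  }

  override def close(): Unit = {}
}
```

Once packaged in a JAR, you would ADD JAR it, expose it with CREATE TEMPORARY FUNCTION, and call it like any other table-generating function such as explode().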
02-20-2016
02:54 PM
2 Kudos
Just to make sure: when you use a Kafka consumer on your test topic, do you see messages coming in? Below is the code of the KafkaWordCount example; the parameters of the stream look good. (I would add the ZooKeeper port number, :2181, just to be sure, but it would complain if it couldn't connect to ZooKeeper. Although if it's only a warning you wouldn't see that, since you switched off logging.) I also don't trust your map function. I know it might be some Scala magic and do exactly what it should, but it seems needlessly complicated. So my suggestions:

- Use a consumer to test whether data comes in (see the sketch after the code below).
- Turn on logging.
- Add the ZooKeeper port.
- Replace your map function with the simple .map(_._2) from the word count example.
- Print the results and see what happens.

val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L))
.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordCounts.print()
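For the "use a consumer to test whether data comes in" step, the kafka-console-consumer shell script is the quickest option; as an alternative, here is a rough stand-alone Scala sketch using the plain Kafka client API (the broker address sandbox:6667, the group id, and the topic name test are assumptions, adjust them to your setup):

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer

object TopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "sandbox:6667")  // assumed broker address (not ZooKeeper)
    props.put("group.id", "topic-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")      // also show messages produced before we started

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("test"))

    // Print whatever arrives; if nothing ever shows up here, the problem is
    // on the producer / topic side, not in the Spark streaming code.
    while (true) {
      val records = consumer.poll(1000).iterator()
      while (records.hasNext) println(records.next().value())
    }
  }
}
```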
02-19-2016
05:43 PM
3 Kudos
Oh, and finally: the learning curve is just harsh. Hadoop has so many possibilities. That is an amazing strength, because you can do anything you want, and once you do something you can do it on humongous amounts of data. But it is also a weakness: because the feature space is so huge and it comes from so many different providers, some areas will be more polished than others. So hang in there. The start is hard, but once you've got it, it's great.
02-19-2016
05:33 PM
3 Kudos
Second, regarding how you should do it: honestly, if the dynamic extraction fails, I would extract your data once. If you can write a Java program that extracts what you want, you can put it in MapReduce or Pig. In Hadoop, space is cheap. Write some Spark/MapReduce/Pig and extract the fields once; after that you can query the columns you want. Hive or Pig UDFs work too, and they are surprisingly easy to write. I know that goes against the "analyze the data just as it is" idea, but to be fair that is never completely true in reality: ORC files are strongly typed (and much faster than text files), many users transform their data into Avro, etc. So if you do something that pushes it, like running 400 XPaths on big XML documents, you might have to do a transformation. A rough Spark sketch of that "extract once" approach follows below.
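A minimal Spark sketch of the "extract once" idea, assuming one XML document per file under a hypothetical /input/xml directory; the XPath expressions and the output path are made up for illustration:

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

import org.apache.spark.{SparkConf, SparkContext}
import org.xml.sax.InputSource

object ExtractOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExtractOnce"))

    // Illustrative XPaths only; swap in the fields you actually need.
    val xpaths = Seq("/doc/title/text()", "/doc/author/text()", "/doc/created/text()")

    // wholeTextFiles keeps each XML document in one piece as a (path, content) pair.
    val extracted = sc.wholeTextFiles("/input/xml").map { case (_, xml) =>
      val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))          // parse each document once
      val xp = XPathFactory.newInstance().newXPath()
      xpaths.map(expr => xp.evaluate(expr, doc)).mkString("\t") // one delimited row per document
    }

    // Write plain tab-separated text; a Hive table (or ORC via Spark SQL) can sit on top of it,
    // so later queries read plain columns instead of re-running 400 XPaths.
    extracted.saveAsTextFile("/output/extracted")
    sc.stop()
  }
}
```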
02-19-2016
05:25 PM
3 Kudos
First, regarding the SerDe: not sure if it would work. However, the XPath is not provided during the query but during table creation, so it's a bit different. You will need to stay below 8 or 16 KB (I think) of TBLPROPERTIES, however (the HCatalog table storing the properties has a column length limit; it's possible to alter that by logging in to MySQL and changing the length of the column, but that is obviously not really clean). 400 columns might be pushing it. Where the SerDe is better is that the XML document is only parsed ONCE, not 400 times.
02-19-2016
01:10 PM
1 Kudo
@Pavel Benes I know it has been forever, but it's possible to provide a password file to Beeline, so they might not have to log on. Similar to a keytab, the file would be stored in their account.