About kvasko

kvasko · 02-25-2016

@Neeraj Sabharwal There are two errors I've been fighting with while trying to get access to all of these columns in the same query:

1. Error writing to server https://gist.github.com/kur1j/513e5a1499eef6c727a1
2. FULL head https://gist.github.com/kur1j/217eae2065c7953d9cf7

The second one I *thought* I had a workaround for by disabling security (unchecking the security box in Ambari for Hive), but it keeps showing back up. Here is the defect which I think I'm running into for the FULL head issue: https://issues.apache.org/jira/browse/HIVE-11720

UPDATE: I'm about 99.99% sure I figured out the problem! I started looking further into the ERROR logs. This line, "at org.apache.atlas.security.SecureClientUtils$1$1.run(SecureClientUtils.java:103)", tipped me off that ATLAS was being interacted with in some way. I disabled ATLAS by turning off the atlas service and removing hive.exec.failure.hooks=org.apache.atlas.hive.hook.HiveHook. I ran my entire query and it worked without issue! I would venture to say that this is an issue with ATLAS not being able to handle really long queries.
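Editor's sketch: a rough way to check which Hive hook properties still reference the Atlas hook named in the post above. It assumes the configuration is available as hive-site.xml text; the sample XML, the function names, and the presence of ATSHook in the sample are illustrative assumptions, not taken from the poster's cluster.

```python
import xml.etree.ElementTree as ET

# Hook class named in the post above.
ATLAS_HOOK = "org.apache.atlas.hive.hook.HiveHook"
# Hive properties that take a comma-separated list of hook classes.
HOOK_PROPERTIES = ("hive.exec.pre.hooks", "hive.exec.post.hooks", "hive.exec.failure.hooks")

# Hypothetical hive-site.xml fragment, for illustration only.
SAMPLE = (
    "<configuration>"
    "<property><name>hive.exec.failure.hooks</name>"
    "<value>org.apache.atlas.hive.hook.HiveHook</value></property>"
    "<property><name>hive.exec.post.hooks</name>"
    "<value>org.apache.hadoop.hive.ql.hooks.ATSHook, org.apache.atlas.hive.hook.HiveHook</value></property>"
    "<property><name>hive.exec.pre.hooks</name>"
    "<value>org.apache.hadoop.hive.ql.hooks.ATSHook</value></property>"
    "</configuration>"
)


def split_hooks(value):
    """Split a comma-separated hook list into trimmed class names."""
    return [c.strip() for c in value.split(",") if c.strip()]


def find_atlas_hooks(hive_site_xml):
    """Return {property name: value} for hook properties that list the Atlas hook."""
    root = ET.fromstring(hive_site_xml)
    hits = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in HOOK_PROPERTIES and ATLAS_HOOK in split_hooks(value):
            hits[name] = value
    return hits


def strip_atlas_hook(value):
    """Return the hook list with the Atlas hook removed and other hooks kept."""
    return ",".join(c for c in split_hooks(value) if c != ATLAS_HOOK)


if __name__ == "__main__":
    for name, value in find_atlas_hooks(SAMPLE).items():
        print(name, "->", repr(strip_atlas_hook(value)))
```

On an Ambari-managed cluster the change would normally be made through Ambari rather than by editing the file by hand; the sketch only shows where to look.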

kvasko · 02-25-2016

Since there isn't really any hard limit, and 400 columns shouldn't be enough to cause OOM issues, I'm not quite sure what else to do. To me this looks purely like a configuration issue or a bug in Hive or its dependencies. I posted this issue on the user mailing list but I haven't heard anything. Any suggestions?

kvasko · 02-24-2016

Thanks, but same issue. How can I increase the limit on how long a query string Hive will accept? I created a SimpleUDF that takes the XML string as input, does all the xpath parsing on it, and returns a map type. I was hoping that getting rid of all the xpath calls would eliminate the issue, but it didn't work. I can now do SELECT m["key"] FROM (SELECT myfunc(xmldata) AS m FROM xmlSource) t. But when I do SELECT m["key1"]....m["key400"] FROM ...(...) I'm back at the "full HEAD" issue for some reason.
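Editor's sketch: the UDF in the post above is a Hive (Java) UDF whose source is not shown. The Python below is only an analog of the idea it describes: parse the XML once and return a map from path to value. The function name, the path-style keys, and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_map(xmldata):
    """Parse an XML string once and return {path: value} for attributes and leaf text.

    Repeated sibling tags share a path, so a later value overwrites an earlier
    one; a real implementation would need indexed paths for such documents.
    """
    result = {}

    def walk(elem, path):
        # Record every attribute under path/@name.
        for name, value in elem.attrib.items():
            result[f"{path}/@{name}"] = value
        children = list(elem)
        text = (elem.text or "").strip()
        # Record text only for leaf elements.
        if not children and text:
            result[path] = text
        for child in children:
            walk(child, f"{path}/{child.tag}")

    root = ET.fromstring(xmldata)
    walk(root, "/" + root.tag)
    return result


if __name__ == "__main__":
    doc = '<sim id="7"><run><speed>42</speed><label>fast</label></run></sim>'
    for key, value in xml_to_map(doc).items():
        print(key, "=", value)
```

Note that this moves the parsing into one call but, as the post observes, selecting 400 map keys still produces a very long query string.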

kvasko · 02-23-2016

I have not tried. I'll try it and see.

kvasko · 02-19-2016

Thank you. I agree that it has a lot of possibilities and I'm not giving up by any means. The post was just to get feedback on others' real-world experience. I ask because I look at a lot of examples/tutorials/blogs/forums etc., and it is made out to look so simple: "Hey look, I just uploaded this super simple example into HDFS, wrote this 25-line Spark job, and it took 15 minutes to do! You can do it too!" When working in the real world you spend 5 hours trying to debug some random issue because a NULL value is in a field or the script can't handle line endings properly. Thanks again for the feedback.

kvasko · 02-19-2016

Thank you for the quick response! I'm not quite sure how a Hive UDF would help in this instance. I thought XPATH was already a UDF? The data I'm extracting is extremely simplistic (e.g. values from 1 to 100 and short strings). The problem is just getting to that data. I'm working through the process of trying the SerDe to dynamically extract the data, but if dynamic extraction fails, the only remaining option is extracting the data in a separate step.

If I write a Spark/MR/Pig job to extract the information, that opens up different options/tradeoffs, such as storing it in HBase, an ORC backed Hive table, etc. But it also seems completely overkill, as we could easily extract the information in the originating application and insert it into an ORC backed Hive table.

What are my options for running a job/process automatically as soon as data is finished uploading to HDFS?

1) C# application that generates the XML uploads the data to HDFS.
2.a) C# calls some type of Spark/Pig job to process the data? (I do have to worry about authenticating with KNOX.)
2.b) Use Oozie and Falcon to create some type of scheduled workflow.
2.c) Apache NiFi/DataFlow?

kvasko · 02-19-2016

My comment about trying to use AVRO was based on that example you linked. It wouldn't work properly with a large schema. I am not really dealing with single "huge" XML files, just ones with a ton of columns. I don't have to process them directly as XML. That is just the format I'm going to get them in. Here is the logical process I was trying to follow:

1) Client uploads XML to HDFS.
2) Create an external table on the HDFS folder where the XML data is stored.
3) Create an ORC based table.
4) Query the external table, extracting all of the data (SerDe, view on top of the table with xpath, AVRO, ???) and loading it into the ORC table.
5) Query the ORC table as needed for analytics.

I'm really having issues with step 4, extracting the data within the ecosystem. I can go write an external program to extract all of the information before it gets sent to Hadoop...but that destroys the whole point of "processing unstructured data".

The other issue I see is having to recreate the ORC table constantly. Since there is no way for me to know about "new data", I can't simply append the new XML documents to the ORC table without deleting the original documents from the EXTERNAL table.
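Editor's sketch: one possible way to approach the "new data" concern in the post above is to keep a record of which files have already been loaded and restrict the INSERT to the rest using Hive's INPUT__FILE__NAME virtual column. This is an assumption about a workable approach, not something the thread confirms. The table names, select list, and file URIs are hypothetical, and INPUT__FILE__NAME values are full URIs, so the listing must use the same form.

```python
def new_files(listing, processed):
    """Return files in the current listing that have not been loaded yet, sorted."""
    return sorted(set(listing) - set(processed))


def quote(value):
    """Quote a value as a HiveQL string literal (backslash-escaped)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def incremental_insert_sql(orc_table, ext_table, select_list, files):
    """Build an INSERT that loads only rows coming from the given files.

    Returns None when there is nothing new to load.
    """
    if not files:
        return None
    file_list = ", ".join(quote(f) for f in files)
    return (
        f"INSERT INTO TABLE {orc_table}\n"
        f"SELECT {select_list}\n"
        f"FROM {ext_table}\n"
        f"WHERE INPUT__FILE__NAME IN ({file_list})"
    )


if __name__ == "__main__":
    # Hypothetical listing of the external table's folder and a manifest of loaded files.
    listing = ["hdfs://nn/data/xml/a.xml", "hdfs://nn/data/xml/b.xml"]
    processed = {"hdfs://nn/data/xml/a.xml"}
    print(incremental_insert_sql(
        "sim_orc", "sim_xml_ext",
        "xpath_string(rawxml, '/my/xpath/to/value')",
        new_files(listing, processed),
    ))
```

The manifest itself could live anywhere durable (a small Hive table, a file in HDFS); that choice is left open here.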

kvasko · 02-19-2016

Thanks for the feedback. You are correct in your analysis. That is the way I was initially doing it. I just wanted something simple to get up and working.

How would you suggest "parsing" the document and extracting all the information I need within the Hadoop ecosystem? If I'm extracting all of the XML fields and information in the source application, wouldn't that remove a major selling point of this ecosystem, which is not doing traditional ETL on the datasets and simply working on the data "as is"? I've never had a solid answer on the best where, what, and how for ingesting data into this ecosystem.

As for the other method you mentioned (using a SerDe), I don't see how it wouldn't run into the exact same issue I ran into. If I use this SerDe https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources I would end up having 400 xpaths in my "WITH SERDEPROPERTIES()". If the error I'm getting above is caused by too long a query request, I don't see how the problem wouldn't show up here as well.

kvasko · 02-19-2016

I am really just wanting to get a feeling from others in the community on whether my issues are due to my lack of knowledge, really bad luck, or if my experiences are par for the course. I seem to be constantly running into issues setting up datasets to perform analysis on. So much so that I'm not even working with any significant amount of "unstructured data" (what this stuff is supposed to be designed to work with), but fighting with errors, exceptions, configuration issues, and incompatibilities. I spend the vast majority of my time trying to get the stuff to work properly on my most simplistic case (e.g. 1 file). On top of that, the stuff I've been working with isn't exotic by any means (processing XML documents, JSON documents, etc.). I will get a decent start most of the time by finding a guide, but that's about where it will stop.

Take this for example. We want to build a data warehouse for some simulation data that is stored in XML (1 simulation can generate hundreds or even thousands of XML documents). We are throwing away GBs upon GBs of potentially good simulation data that could provide some feedback. So we wanted to set up a simplistic case of storing and retrieving data from these XML files. We set up HDFS + Hive with HDP 2.3.2 to do some simplistic evaluation. We found an example that shows an easy way of doing this by creating an external table and storing the XML data of an entire file in a single column, then wrote a few selects on the data with xpath to grab information. We were able to successfully achieve that (10 to 20 XML attributes). But that is about where it stops. A query against the external table holding a single 400KB XML file took well over 30 seconds (with defaults). But I figured I would come back to that.

I wrote a program to grab all xpaths out of my XML documents so I could create a view on top of the data (instead of us having to write out all the xpaths each time we queried the data). The XML file ends up having 400ish xpaths. I put the 400 xpaths into my select, get an internal Hive error, find out it's a bug (supposedly fixed in a newer version of HDP), spend 3 hours tracking the issue down, and then spend another 3ish hours trying to find a workaround to continue. I find a workaround, run the query again, and then boom, no surprise, another internal Hive error. It looks to be barfing on the size of the select statement. The xpaths are fine. I can do SELECT <first 50 xpaths> FROM tbl no problem. I can then do the next 50...no problem. As soon as I put 75+ xpaths in the select, Hive blows up. I spent another 4 or 5 hours trying to figure out a workaround for that and still haven't figured it out.

I tried using AVRO and had constant problems with the schema erroring on some of the data in the fields. I used a tool I found to generate the schema based on the XSD. When I tried querying the data with it, it would error out (I don't even remember the error now). I built a schema by hand for a few attributes/fields, but it would error out on others (and this was only a few fields, not the entire 400+ fields). That was another 4-6 hours spent troubleshooting and not really getting anywhere. A SerDe is the next route I'm going to take, but I have a very good feeling that when I try to create a table defining the xpaths it will error out. We'll see.

Another issue: Twitter data stored as JSON with multiple JSON objects per file (500 tweets or so). I go get a SerDe for JSON data and a tool to generate the JSON schema structure. The schema generator errors out on the JSON data because some of the fields are "NULL". I spend 4+ hours trying to find a different schema generator that works. I find one, generate the external table, and run a simple select `user` from twitterData. It worked! Yes! I add a few more columns thinking I have it working. Looking at the data now, all of the values are in random columns, and what I have determined to be the issue is that the JSONSerDe is incorrectly parsing characters in the JSON documents. Back to the drawing board...trying to get a JSONSerDe to work or spending time writing my own.

I have spent hours on this, probably totaling days, getting to this point, with only half-assed results that aren't even correct. These problems seem extremely trivial (XML and JSON processing) and I'm struggling with getting them working properly. I can get the most simplistic case to work (1 file with 2 or 3 hand-picked values and a select that would fit on a single page), but that's pointless. It seems that once anything even slightly complex comes into play, everything is a constant fight to try and get working. Much less deliver on the selling point of "throw unstructured data at a wall and query it all!"

For my XML problem, I could have taken the program that generated my xpaths and, instead of only outputting the xpaths, had it grab the data as well and shove it into a PostgreSQL database, which I know I wouldn't have these issues with. Yeah, it might choke on 1TB of data...once we get there...but at least I could query it in < 4 hours.

Does anyone else have these problems or similar experiences, or is it just me?
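Editor's sketch: for the Twitter JSON case described above, where a schema generator failed on null fields, a null-tolerant type scan can be a useful sanity check before defining a table. This assumes one JSON object per line, which the post does not state, and the sample records are invented. The mapping to Hive types is a simplification: nested objects and arrays fall back to string here.

```python
import json

# Simplified mapping from Python type names to Hive scalar types (assumption).
HIVE_SCALARS = {"str": "string", "int": "bigint", "float": "double", "bool": "boolean"}

# Invented sample data: one JSON object per line, with null fields.
SAMPLE = (
    '{"id": 1, "user": "alice", "geo": null, "score": 1}\n'
    '{"id": 2, "user": null, "geo": null, "score": 2.5}\n'
)


def parse_json_lines(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def infer_types(records):
    """Map each top-level field to the set of non-null type names seen for it."""
    seen = {}
    for rec in records:
        for key, value in rec.items():
            types = seen.setdefault(key, set())
            if value is not None:  # nulls carry no type information
                types.add(type(value).__name__)
    return seen


def hive_column_type(type_names):
    """Pick a Hive type for a field; fall back to string when unsure."""
    if type_names == {"int", "float"}:
        return "double"
    if len(type_names) == 1:
        return HIVE_SCALARS.get(next(iter(type_names)), "string")
    return "string"  # only nulls seen, or mixed types


if __name__ == "__main__":
    for field, names in infer_types(parse_json_lines(SAMPLE)).items():
        print(field, hive_column_type(names))
```

Fields that are null in every sampled record get a placeholder type, so the scan is only as good as the sample it is fed.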

kvasko · 02-18-2016

Using the HDP 2.3.2 sandbox. This is the second error I got trying to get this working. The first error and "solution" can be found here: https://community.hortonworks.com/questions/18007/hive-fails-with-hive-internal-error-message-full-h.html

I have an external table defined over a folder that contains XML documents. The table has 1 column, which holds each document's data as a string. I am trying to create a view on top of the XML data with xpaths. So for example:

CREATE VIEW myview (column1,...Column N) AS SELECT xpath_string(rawxml, '/my/xpath/to/value'), xpath_string(rawxml, '/another/xpath') FROM myxmltable;

The XML document has 400+ xpaths that I want to grab and put into the view. I can do about 60 columns worth of xpaths before I get this error:

FAILED: Hive Internal Error: com.sun.jersey.api.client.ClientHandlerException(java.io.IOException: java.io.IOException: Error writing to server)
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: java.io.IOException: Error writing to server
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at com.sun.jersey.api.client.Client.handle(Client.java:648)

My cursory research seems to indicate that the query string is too long and is breaking something. I am writing these queries in the Hive CLI, so I'm not sure how else I can fix this. I also tried using beeline and got the same error.
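Editor's sketch: the view described above can be generated from a list of (column name, xpath) pairs, and splitting that list into batches gives a simple way to find the column count at which the error starts. The helper names and the batch size are assumptions; the table, view, and column names follow the post. This does not fix the underlying error, it only helps narrow it down.

```python
def quote(value):
    """Quote a value as a HiveQL string literal (backslash-escaped)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_view_sql(view, table, xml_column, columns):
    """Build a CREATE VIEW statement from (column_name, xpath) pairs."""
    names = ", ".join(f"`{name}`" for name, _ in columns)
    selects = ",\n  ".join(
        f"xpath_string({xml_column}, {quote(xpath)})" for _, xpath in columns
    )
    return f"CREATE VIEW {view} ({names}) AS\nSELECT\n  {selects}\nFROM {table}"


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical column list; a real one would come from the xpath extraction program.
    columns = [(f"column{i}", f"/my/xpath/to/value{i}") for i in range(1, 121)]
    for n, batch in enumerate(chunked(columns, 50)):
        sql = build_view_sql(f"myview_part{n}", "myxmltable", "rawxml", batch)
        print(f"-- batch {n}: {len(batch)} columns, {len(sql)} characters")
```

Printing the statement length alongside the column count makes it easier to tell whether the failure tracks the number of columns or the size of the query text.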

Member Since	01-23-2016 03:23 AM
Last Visited	05-09-2018 03:27 PM
Posts	51
Kudos received	41

Cloudera Community
