Am I stupid or does anyone else have constant issues setting up projects in the Hadoop ecosystem?

Expert Contributor

I really just want to get a feeling from others in the community on whether my issues are due to my lack of knowledge, really bad luck, or whether my experiences are par for the course. I seem to constantly run into issues setting up datasets to perform analysis on. So much so that I'm not even working with any significant amount of "unstructured data" (what this stuff is supposed to be designed for); I'm fighting with errors, exceptions, configuration issues, and incompatibilities. I spend the vast majority of my time trying to get things to work properly on my simplest case (e.g. 1 file). On top of that, the stuff I've been working with isn't exotic by any means (processing XML documents, JSON documents, etc.).

Most of the time I get a decent start by finding a guide, but that's about where it stops. Take this for example. We want to build a data warehouse for some simulation data that is stored in XML (one simulation can generate hundreds or even thousands of XML documents). We are throwing away GBs upon GBs of potentially good simulation data that could provide some feedback. So we wanted to set up a simple case of storing and retrieving data from these XML files. We set up HDFS + Hive on HDP 2.3.2 to do a basic evaluation. We found an example that shows an easy way of doing this: create an external table, store the XML of an entire file in a single column, and write a few selects that use xpath to grab information. We got that working (10 to 20 XML attributes). But that is about where it stopped. A query against the external table holding a single 400KB XML file took well over 30 seconds (with defaults). I figured I would come back to that.

I wrote a program to grab all the xpaths out of my XML documents so I could create a view on top of the data (instead of us having to write out all the xpaths each time we queried the data). The XML file ends up having 400ish xpaths. I put the 400 xpaths into my select and got an internal Hive error. It turned out to be a bug (supposedly fixed in a newer version of HDP). I spent 3 hours tracking the issue down and another 3ish hours finding a workaround. I ran the query again with the workaround and, no surprise, got another internal Hive error. This time it looks to be barfing on the size of the select statement. The xpaths themselves are fine: I can do SELECT <first 50 xpaths> FROM tbl no problem. I can then do the next 50...no problem. As soon as I put 75+ xpaths in the select, Hive blows up. I spent another 4 or 5 hours trying to figure out a workaround for that and still haven't found one.

I tried using Avro and had constant problems with the schema erroring on some of the data in the fields. I used a tool I found to generate the schema from the XSD, but querying the data with it failed with an error (I don't even remember the error now). I then built a schema by hand for a few attributes/fields, but it would error out on others (and this was only a few fields, not the entire 400+). That was another 4-6 hours spent troubleshooting without really getting anywhere.

A SerDe is the next route I'm going to take, but I have a very strong feeling that it will error out when I try to create a table defining the xpaths. We'll see.

Another issue: Twitter data stored as JSON, with multiple JSON objects per file (500 tweets or so). I went and got a SerDe for JSON data and a tool to generate the JSON schema structure. The schema generator errored out on the JSON data because some of the fields are "NULL". I spent 4+ hours trying to find a different schema generator that works. I found one, generated the external table, and ran a simple select `user` from twitterData. It worked! Yes! I added a few more columns, thinking I had it working. Looking at the data now, the values are landing in the wrong columns, and as far as I can tell the cause is that the JSONSerDe is incorrectly parsing characters in the JSON documents. Back to the drawing board...either getting a JSONSerDe to work or spending time writing my own. The hours I have spent on this issue probably add up to days, with only half-assed results that aren't even correct.

These problems seem extremely trivial (XML and JSON processing), and yet I'm struggling to get them working properly. I can get the simplest case to work (1 file with 2 or 3 hand-picked values and a select that fits on a single page), but that's pointless. It seems that once anything even slightly complex comes into play, everything is a constant fight to get working. Much less the selling point of "throw unstructured data at a wall and query it all!"

For my XML problem, I could have taken the program that generated my xpaths, had it grab the data along with each xpath instead of just outputting the paths, and shoved it all into a PostgreSQL database, which I know I wouldn't have these issues with. Yeah, it might choke on 1TB of data...once we get there...but at least I could query it in < 4 hours.

Does anyone else have these problems or similar experiences, or is it just me?

1 ACCEPTED SOLUTION

Master Guru

You are definitely not stupid 🙂

Working with data is hard. There are some things that work really well now and the core advantage of Hadoop is that once written you can scale your application to infinity.

But in general, working with data is hard. I remember once spending a day exporting an XML table from DB2, and spending days figuring out the correct way to extract some key fields from JSON tweets (the user name can be in different fields, some fields are empty when I think they shouldn't be, some records are just broken...).
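To make the tweet example concrete, here is a minimal Python sketch of that kind of defensive extraction. It assumes one JSON object per line, and the field names (`user.screen_name`, `user.name`, `from_user`) are hypothetical stand-ins for wherever the user name turns up in your data; it is an illustration, not what any particular SerDe does.

```python
import json

def user_name(tweet):
    """Return the first non-empty user name among a few candidate fields."""
    user = tweet.get("user")
    if not isinstance(user, dict):  # "user" may be missing or null
        user = {}
    # Candidate locations are assumptions for illustration only.
    for candidate in (user.get("screen_name"), user.get("name"), tweet.get("from_user")):
        if candidate:
            return candidate
    return None

def extract_names(lines):
    """Parse one JSON object per line; count broken records instead of failing."""
    names, broken = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            broken += 1
            continue
        if not isinstance(tweet, dict):
            broken += 1
            continue
        names.append(user_name(tweet))
    return names, broken

# Hypothetical sample records, including a null field and a truncated record.
SAMPLE_LINES = [
    '{"user": {"screen_name": "alice"}}',
    '{"user": {"screen_name": null, "name": "Bob"}}',
    '{"user": null, "from_user": "carol"}',
    '{"text": "no user at all"}',
    '{"user": {"screen_name": "broken"',
]

names, broken = extract_names(SAMPLE_LINES)
print(names, broken)
```

Counting broken records rather than aborting on the first one makes it easier to see how dirty the input really is before deciding how to handle it.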

In general, Hadoop uses some of the most commonly used open source Java libraries to handle XML and JSON processing, but it is not a core feature the way XML in PostgreSQL might be. For the JSON, I would say that if it breaks here, it most likely breaks in other tools as well. The open source Java JSON libraries are widely used.

But let's go back to your XML problem. So you have a pretty huge XML and want to extract hundreds of fields from them as a view? And you say you didn't use a Serde but stored it how? As a String? And then you used the following?

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF

But that is terrible. Hive would read the 400KB XML string, push it into the XPath UDF for every single one of your xpath expressions, and parse the document over and over and over and over again. I'm not surprised that it is slow or kills itself if this is what you actually did. You need to find a way to parse the document once and extract all the information you need from it. Or use the SerDe, which does the same.
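As a rough illustration of "parse once, extract everything", here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample document, column names, and paths are made up for the example, and ElementTree only supports a limited subset of XPath, so treat it as a sketch of the idea rather than a replacement for the Hive UDFs or a SerDe.

```python
import xml.etree.ElementTree as ET

def extract_fields(xml_text, paths):
    """Parse the document once, then evaluate every path against the same tree."""
    root = ET.fromstring(xml_text)  # the only parse of the document
    row = {}
    for column, path in paths.items():
        node = root.find(path)  # limited XPath subset, evaluated on the parsed tree
        row[column] = node.text if node is not None else None
    return row

# Hypothetical document and paths, for illustration only.
SAMPLE = """<simulation>
  <run><id>42</id><duration>3.5</duration></run>
  <result><status>ok</status></result>
</simulation>"""

PATHS = {
    "run_id": "./run/id",
    "duration": "./run/duration",
    "status": "./result/status",
    "missing": "./result/does_not_exist",
}

row = extract_fields(SAMPLE, PATHS)
print(row)
```

With 400 paths the document is still parsed only once; each additional path costs a tree lookup instead of another full parse of the 400KB string.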

12 REPLIES

Master Guru

Oh, and finally: the learning curve is just harsh. Hadoop has so many possibilities. That is an amazing strength, because you can do anything you want, and once you do something you can do it on humongous amounts of data. But it is also a weakness: because the feature space is so huge and comes from so many different providers, some areas will be more polished than others. So hang in there. The start is hard, but once you've got it, it's great.

Expert Contributor

Thank you. I agree that it has a lot of possibilities, and I'm not giving up by any means. The post was just to get feedback on others' real-world experience. I ask because I look at a lot of examples/tutorials/blogs/forums etc., and it is made out to look so simple. "Hey look, I just uploaded this super simple example into HDFS, wrote this 25-line Spark job, and it took 15 minutes to do! You can do it too!" When working in the real world, you spend 5 hours trying to debug some random issue because a NULL value is in a field or the script can't handle line endings properly. Thanks again for the feedback.

Master Guru

Yeah, I agree, sometimes the marketing is a bit ahead of things 🙂