
Am I stupid, or does anyone else have constant issues setting up projects in the Hadoop ecosystem?

Contributor

I just really want to get a feel from others in the community for whether my issues are due to my lack of knowledge, really bad luck, or whether my experiences are simply par for the course. I seem to constantly run into issues setting up datasets to perform analysis on. So much so that I'm not even working with any significant amount of "unstructured data" (what this stuff is supposedly designed for); I'm fighting with errors, exceptions, configuration issues, and incompatibilities. I spend the vast majority of my time trying to get things working properly in my most simplistic case (e.g. one file). On top of that, nothing I've been working with is exotic by any means (processing XML documents, JSON documents, etc.).

Most of the time I can get a decent start by finding a guide, but that's about where it stops. Take this example. We want to build a data warehouse for some simulation data that is stored in XML (one simulation can generate hundreds or even thousands of XML documents). We are throwing away GBs upon GBs of potentially good simulation data that could provide some feedback, so we want to set up a simplistic case of storing and retrieving data from these XML files. We set up HDFS + Hive on HDP 2.3.2 to do some simple evaluation, found an example that shows an easy way of doing this by creating an external table and storing the XML of an entire file in a single column, and wrote a few selects using xpath to grab information. We were able to achieve that successfully (10 to 20 XML attributes). But that is about where it stopped.

Performance on a single 400KB XML file in the external table was well over 30 seconds (with defaults), but I figured I would come back to that. I wrote a program to grab all of the XPaths out of my XML documents so I could create a view on top of the data (instead of having to write out all the XPaths each time we queried it). The XML file ends up having roughly 400 XPaths. I put the 400 XPaths into my SELECT and got an internal Hive error; I found out it's a known bug (supposedly fixed in a newer version of HDP). I spent 3 hours tracking the issue down, then another 3 or so hours finding a workaround so I could continue. I found one, ran the query again, and then, boom, no surprise: another internal Hive error, this time apparently choking on the size of the SELECT statement. The XPaths themselves are fine: I can do SELECT <first 50 xpaths> FROM tbl with no problem, and then the next 50, no problem. But as soon as I put 75+ XPaths in the SELECT, Hive blows up. I spent another 4 or 5 hours trying to figure out a workaround for that and still haven't.

I tried using Avro and had constant problems with the schema erroring out on some of the data in the fields. I used a tool I found to generate the schema from the XSD, but it would error out when I tried querying the data (I don't even remember the error now). I built a schema by hand on a few attributes/fields, but it would error out on others (and this was only a few fields, not the entire 400+). That was another 4-6 hours spent troubleshooting without really getting anywhere.

A SerDe is the next route I'm going to take, though I have a very good feeling that when I try to create a table defining the XPaths it will error out too. We'll see.


Another example: Twitter data stored as JSON, with multiple JSON objects per file (500 tweets or so). I grabbed a SerDe for JSON data and a tool to generate the JSON schema structure. The schema generator errored out on the JSON data because some of the fields were null. I spent 4+ hours trying to find a different schema generator that worked. I found one, generated the external table, and ran a simple SELECT `user` FROM twitterData. It worked! Yes! I added a few more columns thinking I had it working, but looking at the output, the values were landing in random columns, and what I have determined to be the issue is that the JSON SerDe is incorrectly parsing characters in the JSON documents. So it's back to the drawing board: trying to get a JSON SerDe to work, or spending the time writing my own. The hours I've spent on this probably total in the days, with only half-baked results that aren't even correct.
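For what it's worth, the null-field problem is usually handled by making the extraction tolerant rather than the schema strict. Here is a minimal Python sketch (the sample tweets and dotted field paths are invented for illustration, not the actual data or SerDe from this thread) that pulls fields out of JSON records and simply yields None for null or missing values instead of failing:

```python
import json

# Hypothetical tweets: some fields are null, some are missing entirely --
# the two cases that tripped up the schema generator described above.
tweets = [
    '{"user": {"screen_name": "alice"}, "text": "hello", "coordinates": null}',
    '{"user": {"screen_name": "bob"}, "text": "hi", "retweeted_status": {"id": 1}}',
]

def extract(tweet_json, dotted_paths):
    """Pull fields by dotted path, returning None for null or absent keys
    instead of erroring out the way a strict generated schema would."""
    obj = json.loads(tweet_json)
    row = {}
    for path in dotted_paths:
        cur = obj
        for key in path.split("."):
            cur = cur.get(key) if isinstance(cur, dict) else None
            if cur is None:
                break
        row[path] = cur
    return row

fields = ["user.screen_name", "text", "coordinates", "retweeted_status.id"]
rows = [extract(t, fields) for t in tweets]
# rows[0]["coordinates"] is None; rows[0]["retweeted_status.id"] is None
```

A schema generator that hard-fails on null is doing the opposite of this: trying to infer a type from a value that has none.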

These problems seem extremely trivial (XML and JSON processing), and I'm struggling to get them working properly. I can get the most simplistic case to work: one file with 2 or 3 hand-picked values and a select that fits on a single page, but that's pointless. It seems that once anything even slightly complex comes into play, everything is a constant fight to get working, much less the selling point of "throw unstructured data at the wall and query it all!"

For my XML problem, I could have had the program that generated my XPaths grab the data along with them and shove it into a PostgreSQL database, where I know I wouldn't have these issues. Yeah, it might choke on 1TB of data, once we ever get there, but at least I could be querying it in under 4 hours.

Does anyone else have these problems or similar experiences, or is it just me?


12 REPLIES

Hi @Kevin Vasko, yes, such things can happen, especially early on, and XML and JSON, while kind of trivial, are not so suitable for direct processing in Hadoop. Also, in the case of your JSON efforts, it looks like you are having a classic "last 10%" problem: the main functionality works early on, but then you realize you have to handle many special cases.

Regarding the problem at hand, I found an interesting presentation about XML suggesting pre-parsing the XML (or JSON) files to Avro and processing them using Pig, and I think I agree with that approach. Also, when reading huge XML files, instead of XPath you can extract the desired elements using SAX. And these days, instead of Avro you can also use ORC files, which provide much better performance. Pre-parsing will take a while, but it only has to be done once. After that you end up with Hadoop-friendly input files which you can process repeatedly in many ways.
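To illustrate the SAX suggestion, here is a rough Python sketch of a one-pass streaming parse that flattens records out of a simulation-style document (the tag names and fields are invented for the example). The same pattern, written in Java, is what a Pig/MapReduce pre-parsing job would run before writing Avro or ORC:

```python
import xml.sax

# Hypothetical simulation output; the real documents would have ~400 fields.
XML = b"""<simulation>
  <run><id>42</id><temperature>98.6</temperature></run>
  <run><id>43</id><temperature>99.1</temperature></run>
</simulation>"""

class FlattenHandler(xml.sax.ContentHandler):
    """Stream the document once with SAX, emitting one flat dict per record
    element. A pre-parsing job would write these rows to Avro/ORC instead
    of collecting them in memory."""
    def __init__(self, record_tag, fields):
        super().__init__()
        self.record_tag, self.fields = record_tag, set(fields)
        self.rows, self.row, self.current, self.buf = [], None, None, []

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.row = {}                      # new record begins
        elif self.row is not None and name in self.fields:
            self.current, self.buf = name, []  # start capturing a field

    def characters(self, content):
        if self.current:
            self.buf.append(content)

    def endElement(self, name):
        if name == self.current:
            self.row[name] = "".join(self.buf).strip()
            self.current = None
        elif name == self.record_tag:
            self.rows.append(self.row)         # record complete
            self.row = None

handler = FlattenHandler("run", ["id", "temperature"])
xml.sax.parseString(XML, handler)
# handler.rows now holds one flat dict per <run> element
```

Because SAX streams the input, memory use stays flat no matter how large the file is, which is the point of using it over DOM/XPath for huge documents.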

Just keep on going, and I'm sure you will overcome these early issues pretty quickly!

Contributor

My comment about trying to use AVRO was based on that example you linked. It wouldn't work properly with a large schema.

I am not really dealing with single "huge" XML files, just ones with a ton of columns. I don't have to process them directly as XML; that is just the format I'm going to get them in. Here is the logical process I was trying to follow:

1) Client uploads XML to HDFS.
2) Create an external table on the HDFS folder where the XML data is stored.
3) Create an ORC-backed table.
4) Query the external table, extracting all of the data (SerDe, a view with XPaths on top of the table, Avro, ???), and load it into the ORC table.
5) Query the ORC table as needed for analytics.

I'm really having issues with step 4, extracting the data within the ecosystem. I could write an external program to extract all of the information before it gets sent to Hadoop, but that defeats the whole point of "processing unstructured data".

The other issue I see is having to recreate the ORC table constantly. Since there is no way for me to know about "new data", I can't simply append a new XML document to the ORC table without deleting the original documents from the EXTERNAL table.

You are definitely not stupid 🙂

Working with data is hard. There are some things that work really well now, and the core advantage of Hadoop is that, once written, you can scale your application to infinity.

But in general, working with data is hard. I remember once spending a day exporting an XML table from DB2, and days figuring out the correct way to extract some key fields from JSON tweets (the user name can be in different fields, some fields are empty when I think they shouldn't be, some records are just broken...).

In general, Hadoop uses some of the most commonly used open-source Java libraries to handle XML and JSON processing, but it is not a core feature the way XML support in Postgres is. For the JSON, I would say that if it breaks here, it most likely breaks in other tools as well; the open-source Java JSON libraries are widely used.

But let's go back to your XML problem. So you have a pretty huge XML document and want to extract hundreds of fields from it as a view? And you say you didn't use a SerDe, so how did you store it? As a string? And then you used the following?

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF

But that is terrible. Hive would read the 400KB XML string, push it into the xpath UDF for every single one of your XPath expressions, and parse the document over and over and over again. I'm not surprised that it is slow or kills itself, if this is what you actually did. You need to find a way to parse the document once and extract all the information you need from it, or use the SerDe, which does the same.
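To make the cost argument concrete, here is a small Python sketch (standing in for Hive's behavior; ElementTree only supports a subset of XPath, and the document and paths here are toy stand-ins) contrasting re-parsing the document per expression, which is effectively what the xpath UDF does, with parsing once and evaluating every path against the same tree:

```python
import xml.etree.ElementTree as ET

XML = "<doc><a>1</a><b>2</b><c>3</c></doc>"   # stand-in for the 400KB file
paths = [".//a", ".//b", ".//c"]              # stand-in for ~400 XPaths

# Anti-pattern: one full parse of the document per XPath expression.
# With 400 expressions on a 400KB file, that is 400 parses per row.
per_xpath = [ET.fromstring(XML).findtext(p) for p in paths]

# Parse once, then evaluate every expression against the same tree.
tree = ET.fromstring(XML)
parse_once = [tree.findtext(p) for p in paths]

assert per_xpath == parse_once == ["1", "2", "3"]
```

Both approaches return identical values; only the number of parses differs, and that factor scales with the number of extracted fields.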

Contributor

Thanks for the feedback. Your analysis is correct; that is the way I was initially doing it. I just wanted something simple to get up and working.

How would you suggest parsing the document and extracting all the information I need within the Hadoop ecosystem? If I extract all of the XML fields and information in the source application, wouldn't that remove a major selling point of this ecosystem, namely not doing traditional ETL on the datasets and simply working on the data "as is"? I've never had a solid answer on the best where, what, and how for ingesting data into this ecosystem.

As for the other method you mentioned (using a SerDe), I don't see how it wouldn't run into the same exact issue. If I use this SerDe https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources I would end up having 400 XPaths in my WITH SERDEPROPERTIES(). If the error I'm getting above comes from too long a query, I don't see why the problem wouldn't show up here as well.

First, regarding the SerDe: I'm not sure if it would work. However, the XPaths are provided during table creation, not during the query, so it's a bit different. You will need to stay under the table properties length limit, which I think is 8 or 16KB (the metastore table storing the properties has a length limit; it's possible to alter it by logging in to MySQL and changing the length of the column, but that is obviously not really clean). 400 columns might be pushing it.

The place where the SerDe is better is that the XML document is only parsed once, not 400 times.

Second, regarding how you should do it: honestly, if the dynamic extraction fails, I would extract your data once. If you can write a Java program that extracts what you want, you can put it in MapReduce or Pig. In Hadoop, space is cheap. Write some Spark/MapReduce/Pig, extract the fields once, and after that you can query the columns you want. Hive or Pig UDFs work too; they are surprisingly easy to write.

I know that goes against the "analyze the data just as it is" thing, but to be fair, that is never completely true in reality. ORC files are strongly typed (and much faster than text files), many users transform their data into Avro, etc. So if you do something that pushes the limits, like running 400 XPaths on big XML documents, you might well have to do a transformation.

Contributor

Thank you for the quick response! I'm not quite sure how a Hive UDF would help in this instance; isn't xpath already a UDF? The data I'm extracting is extremely simplistic (e.g. values from 1 to 100 and short strings). The problem is just getting to that data.

I'm working through the process of trying the SerDe to dynamically extract the data, but if that fails, the only option left is extracting the data up front. If I write a Spark/MR/Pig job to extract the information, that opens up different options/tradeoffs, such as storing it in HBase, an ORC-backed Hive table, etc. But it also seems like complete overkill, as we could easily extract the information in the originating application and insert it into an ORC-backed Hive table.

What are my options for running a job/process automatically as soon as data has finished uploading to HDFS?

1) The C# application that generates the XML uploads the data to HDFS.
2.a) C# calls some type of Spark/Pig job to process the data? (I do have to worry about authenticating with Knox.)
2.b) Use Oozie and Falcon to create some type of scheduled workflow.
2.c) Apache NiFi/DataFlow?

The trick with a UDF would be similar to a SerDe: you could write a GenericUDTF that returns all columns of the document, i.e. instead of parsing the document 400 times, it would parse it once. You could also decide how to supply the XPath information: as an input parameter, through a config file in HDFS, or hard-coded. The same goes for Pig. Essentially, if you can write a standalone Java program that can parse your document, you can put that program into a UDF (or Pig function) and execute it in parallel.
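The shape of that GenericUDTF (the real thing would be Java against Hive's UDTF API; this is only a Python sketch of the logic, with invented tag names): take the XML string for one document, parse it once, and emit every configured column as a single wide row.

```python
import xml.etree.ElementTree as ET

# Hypothetical column -> XPath mapping; in a real UDTF this could be
# hard-coded, passed as a parameter, or loaded from a config file on HDFS.
FIELDS = {"id": ".//id", "temperature": ".//temperature"}

def xml_to_row(xml_string, fields=FIELDS):
    """UDTF-style helper: one parse of the document, all columns out as a
    single tuple -- instead of one full parse per XPath expression."""
    root = ET.fromstring(xml_string)
    return tuple(root.findtext(path) for path in fields.values())

row = xml_to_row("<run><id>42</id><temperature>98.6</temperature></run>")
```

With 400 configured paths, the per-row cost is still one parse plus 400 cheap lookups against the in-memory tree.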

Regarding automatic execution: this is normally done with Oozie. It's the standard workflow scheduler in Hadoop. You can, for example, define an input folder that is partitioned by timestamp:

/input/2016-12-01

/input/2016-12-02

...

And have Oozie schedule a job for every one of these folders as it becomes available. I currently have a project that does this in 15-minute increments; going lower than that is not very efficient. You still need something that puts the data into HDFS first, which can be done with Flume, NiFi, or manually (a cron job that puts the file into the timestamped HDFS folder).

Or you can do the transformations directly in NiFi, Storm, or Flume if you really need results in (near) real time.
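The cron-style upload into the timestamped folders mentioned above could look something like this Python sketch; the helper name and base folder are illustrative, though `hdfs dfs -put` is the real client command.

```python
import datetime
import subprocess

def upload_to_partition(local_file, base="/input", when=None, run=False):
    """Build (and optionally execute) the 'hdfs dfs -put' command a cron
    job would issue to drop a file into the dated folder Oozie watches."""
    when = when or datetime.date.today()
    target = f"{base}/{when:%Y-%m-%d}/"
    cmd = ["hdfs", "dfs", "-put", local_file, target]
    if run:  # left off by default so the sketch runs without a cluster
        subprocess.run(cmd, check=True)
    return cmd

cmd = upload_to_partition("sim_0001.xml", when=datetime.date(2016, 12, 1))
# cmd is ['hdfs', 'dfs', '-put', 'sim_0001.xml', '/input/2016-12-01/']
```

Oozie's coordinator then fires once the dataset instance for a given date exists, so the folder name doubles as the handshake between uploader and scheduler.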

Contributor

I'm going to accept your answer for this question, as I ended up writing a UDF to solve the potential slowness of evaluating all of the XPaths multiple times. The general gist of the thread still applies, just with different problems. I partially "solved" the issue of having 300 columns in a table (in the Hive CLI) by disabling Apache Atlas in HDP. Apparently Atlas was intercepting the queries and blowing up when the query became too long. I would venture to guess this is a bug in Atlas.

After fixing that, I worked on writing the UDF and making it permanent so it could be used by the application over an ODBC connection. I used the CREATE FUNCTION statement, and that works... except it only made the function permanent in the Hive CLI context; in an ODBC or even Hue context the function doesn't exist. I ended up having to just run the CREATE FUNCTION statement in the Hue/ODBC application context. Unless I'm missing a configuration setting that I'm not aware of, I assume this is another bug.

Once I did that, I was able to get the Hive CLI to work with all 400+ columns through the UDF. I thought I was done, but unfortunately I ran into another issue when I tried to run the same query that worked in the Hive CLI from the Hue/ODBC app. It is similar to the first error: if I only have ~250 columns in the query, it works in Hue/the ODBC application. I'm currently investigating this problem.

But these are exactly the kinds of examples behind the sentiment of my original post.

2016-03-04 10:47:55,417 WARN  [HiveServer2-HttpHandler-Pool: Thread-34]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) - Error fetching results:
org.apache.hive.service.cli.HiveSQLException: Expected state FINISHED, but found ERROR
        at org.apache.hive.service.cli.operation.Operation.assertState(Operation.java:161)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:334)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:221)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
        at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy19.fetchResults(Unknown Source)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.thrift.server.TServlet.doPost(TServlet.java:83)
        at org.apache.hive.service.cli.thrift.ThriftHttpServlet.doPost(ThriftHttpServlet.java:171)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:565)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:479)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1031)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:965)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:349)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:449)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:925)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:76)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:609)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:45)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Oh, and finally: the learning curve is just harsh. Hadoop has so many possibilities. That is an amazing strength, because you can do anything you want, and once you do something, you can do it on humongous amounts of data. But it is also a weakness, because the feature space is so huge, and it comes from so many different providers, that some areas will be more polished than others. So hang in there? The start is hard, but once you've got it, it's great.

Contributor

Thank you. I agree that it has a lot of possibilities, and I'm not giving up by any means. The post was just to get feedback on others' real-world experience. I ask because I look at a lot of examples/tutorials/blogs/forums, etc., and it is made out to look so simple: "Hey look, I just uploaded this super simple example into HDFS, wrote this 25-line Spark job, and it took 15 minutes to do! You can do it too!" In the real world, you spend 5 hours trying to debug some random issue because a NULL value is in a field or the script can't handle line endings properly. Thanks again for the feedback.

Yeah I agree sometimes the marketing is a bit ahead of things 🙂
