Member since: 01-23-2016
Posts: 51
Kudos Received: 41
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 847 | 02-18-2016 04:34 PM
03-21-2018
02:01 PM
I ended up not using NiFi for this. Looking back, I was trying to force a solution out of NiFi that wasn't a good fit. I spent several weeks, entirely too long, trying to solve the simplest case of this project (formatting some text and dumping it into a database). I can certainly see NiFi being useful for moving the source data files around between the folders I'm working with (copying, moving, etc.), but doing any amount of logic or manipulation of anything beyond the happy path is extremely tedious and seemingly difficult. Knowing that I was going to have to do a lot more work on the data to make it even close to usable, I scrapped NiFi and implemented it in Python. After dealing with this data and repeatedly running into edge cases I wasn't even aware of when I wrote this topic, the data in my opinion was just too dirty and had too many exceptions to deal with in NiFi. On top of that, this was only the import of the data, not even the use of it, so I would have needed another tool to process the data into a usable form anyway. Appreciate the response; you took the time to answer, so I figured it was reasonable to follow up even though I didn't end up using the solution.
01-05-2018
08:51 PM
I've googled everywhere for this, and everything I run across is super complicated; it should be relatively simple to do. The recommendations point to the "Example_With_CSV.xml" template from https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates, but there is nothing in there about actually getting each column's value out.

So given a flowfile that's a CSV:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

I need:

$date = 2017-09-20 23:49:38.637
$id = 162929511757
...
$instanceid = 36095
$comment = "Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

OR

$csv.date = ...
$csv.id = ...
...
$csv.instanceid = ...
$csv.comment = ...

Is there another, easier option for this besides regex? I can't stand doing anything with regex because of how unreadable and overly complicated it is. To me there should be a significantly easier way of doing this than regex.
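For context, outside of NiFi this split is trivial; here's the sort of thing I mean, as a rough Python sketch (the attribute names are just the ones from my example above, and the quoted comment is handled by the csv module rather than by hand):

```python
import csv
import io

# The sample line from above: the last field is quoted, so the embedded
# commas belong to the comment rather than to extra columns.
line = ('2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
        '"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"')

# csv.reader respects the quoting, so no regex is needed to keep the
# comment together as a single value.
fields = next(csv.reader(io.StringIO(line)))

attributes = {
    'csv.date': fields[0],
    'csv.id': fields[1],
    'csv.instanceid': fields[3],
    'csv.comment': fields[-1],
}
print(attributes)
```

That's the level of effort I'd expect for this, which is why the regex-heavy answers surprise me.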
Labels:
- Apache NiFi
01-05-2018
08:34 PM
There is no example in the "Working_With_CSV" template of how to extract each individual field into attributes.
01-04-2018
05:25 PM
1 Kudo
Thanks! That seems to work correctly. I'll mark this as the answer since it produces the result I'm looking for.
01-03-2018
10:25 PM
2 Kudos
@Shu Thank you for the great, detailed response. The first part works, but I don't think the regex will work for my case. (Side note, no fault of yours: I just absolutely despise regex, since it's unreadable to me and extremely difficult to debug, if it can be debugged at all.) I should have mentioned this, but the only thing I know about the CSV file is that there are X columns before the string, so I could see something like:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max, attempts,(1=>3), fit failing entirely,(Fit Failure=True),

Specifically, there are 13 columns (commas) before the string, and the string always has a trailing "," (it has always been the last column in the row from what I've seen). The other issue is that I tried using (.*), for every column so I could then feed the values into a database insert, but the regex seems to blow up and stop working with that many columns (the original data has about 150 columns in it; I just truncated it down here).
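To make "not regex" concrete, this is the approach I have in mind, as a rough Python sketch (it only assumes the two things I know: 13 fixed columns, then a free-text column with a trailing comma):

```python
# Sample row: 13 fixed columns, then a free-text string that can contain
# commas anywhere and always ends with a trailing comma.
line = ('23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
        'Failed to fit max, attempts,(1=>3), fit failing entirely,(Fit Failure=True),')

NUM_FIXED_COLUMNS = 13

# Split on the first 13 commas only; everything after them is the comment.
parts = line.split(',', NUM_FIXED_COLUMNS)
fixed = parts[:NUM_FIXED_COLUMNS]                # the 13 known columns
comment = parts[NUM_FIXED_COLUMNS].rstrip(',')   # the rest, minus the trailing comma

print(fixed)
print(comment)
```

With the real data the fixed-column count would just be larger, but the idea is the same.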
01-03-2018
04:46 PM
1 Kudo
I have a CSV file that is messy. I need to:

1. Get the date from the filename, use that as my date, and prepend it to one of the columns.
2. Parse the CSV file to get the columns, where the very last column is a string that contains the separator "," inside it.

The data looks like this.

Filename: ExampleFile_2017-09-20.LOG

Content:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),
23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,Command Measure, Targets complete - Elapsed: 76064 ms,

The following is what will need to be inserted into the database:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"
2017-09-20 23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,"Command Measure, Targets complete - Elapsed: 76064 ms"

Would I need to do this inside NiFi, or with some external script by calling some type of ExecuteScript?
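In case it helps frame the question, this is roughly the transform I mean, written outside NiFi as a rough Python sketch (the 13-column count comes from the example above; the real files are wider):

```python
# Example file name and one raw line from it.
filename = 'ExampleFile_2017-09-20.LOG'
line = ('23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
        'Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),')

NUM_FIXED_COLUMNS = 13

# 1. Pull the date out of the file name (ExampleFile_2017-09-20.LOG -> 2017-09-20).
date = filename.split('_')[-1].split('.')[0]

# 2. Split off the fixed columns; the remainder is the free-text comment,
#    which keeps its embedded commas and loses only the trailing one.
parts = line.split(',', NUM_FIXED_COLUMNS)
fixed = parts[:NUM_FIXED_COLUMNS]
comment = parts[NUM_FIXED_COLUMNS].rstrip(',')

# 3. Prepend the date to the timestamp column and quote the comment so it
#    becomes a single column again.
fixed[0] = f'{date} {fixed[0]}'
print(','.join(fixed + [f'"{comment}"']))
# -> 2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"
```

The question is whether something like this belongs in an ExecuteScript-style processor or outside NiFi entirely.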
Labels:
- Apache NiFi
10-17-2017
02:57 PM
The picture is getting quite blurry between all of the pipeline/ETL tools available. Specifically:

* NiFi
* StreamSets
* Kafka (?)
* Luigi
* Airflow
* Falcon
* Oozie
* A Microsoft solution?

I've got several projects where I could see a use for a pipeline/flow tool, where ETL is the point of the entire project. So what are the strengths and weaknesses of each? Where should I be using one rather than another? Where does one shine where another would be difficult to manage, or be overkill for the project? Which is the most lightweight of the tools?

I have several projects, but two stick out in my mind. They are completely unrelated and do NOT overlap at all.

1) The first project is a simple ETL for XML data. In simple terms, 20 or so machines write XML log data to their local drives, which are shared on the network. A Python application connects to each machine's share and copies the data to the local system to archive the raw data. The same application reads the XML data from the files, extracts all of the relevant content, and stores it in a Microsoft SQL Server database. Currently the application runs every 20 minutes through a Huey cron-job task in Python to look for new data on the shares. This is a Windows-only application/ecosystem, so using something from the MS world isn't out of the question either (hence why I included it).

2) The second project is more of a "pipeline". We have about 2 million files that need to run through a process: a) original format --> b) converted to an industry-standard format --> c) data massaged to fit our needs --> d) data converted --> e) intermediate results written out to disk --> f) data used to train a deep learning model. For inference on a file, steps a) through e) would be performed, step f) would be replaced by inference against the model, and the results would be passed down to g) (another application). This will initially be done on Linux, but they (potentially) want to end up on Windows, so that could be a consideration.

So for these two projects, what would you choose? From everything I have read and researched, NiFi could handle the get and put of the data files easily, but how would NiFi handle calling the Python code that extracts the data and puts it in the database? It also looks to me like NiFi/StreamSets are fairly heavyweight and usually operate within the Hadoop ecosystem; I'm not working with Hadoop/HDFS in either of these two applications. Any input on the strengths/weaknesses/specific use cases for these examples would be greatly appreciated!
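For concreteness, project 1's Python step boils down to something like the following (a made-up sketch, not the actual code; the paths, XML element names, and connection string are placeholders), and the question is really whether a flow tool should replace this or just schedule/call it:

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

import pyodbc  # assumes an MS SQL Server ODBC driver is installed

SHARES = [Path(r'\\machine01\logs'), Path(r'\\machine02\logs')]  # ~20 machines in reality
ARCHIVE = Path(r'D:\archive\raw_xml')
CONN_STR = ('DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=dbhost;DATABASE=logs;Trusted_Connection=yes')


def run_once():
    """One polling pass: archive new XML files and load their contents."""
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        for share in SHARES:
            for xml_file in share.glob('*.xml'):
                archived = ARCHIVE / xml_file.name
                if archived.exists():
                    continue                      # already ingested on a previous pass
                shutil.copy2(xml_file, archived)  # keep the raw file for archival
                root = ET.parse(archived).getroot()
                for rec in root.iter('record'):   # element/attribute names are made up
                    cur.execute(
                        'INSERT INTO log_data (machine, ts, value) VALUES (?, ?, ?)',
                        rec.get('machine'), rec.get('timestamp'), rec.get('value'),
                    )
        conn.commit()


if __name__ == '__main__':
    run_once()  # Huey currently schedules this every 20 minutes
```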
Labels:
- Apache NiFi
05-25-2016
05:34 AM
What do I need to set in hive-env.sh? It seems that anything I touch gets overwritten. This has to be a bug in Ambari where it won't save the hive.heapsize value. How can I get it to persist?
05-25-2016
04:49 AM
The hive.heapsize setting does not exist in my hive-site.xml for some reason, and whenever I add it to the file it keeps getting overwritten.
05-25-2016
04:09 AM
@Divakar Annapureddy Correct, but if you look at my comments, I posted a picture and it shows it has been changed to 12GB in the UI. The services have been restarted (the complete server has been restarted).
05-25-2016
03:27 AM
hive 24964 0.2 1.7 2094636 566148 ? Sl 17:03 0:56 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91.x86_64/bin/java -Xmx1024m -Dhdp.version=2.3.2.0-2950 -Djava.net.preferIPv4Stack=true -Dhdp.version=2.3.2.0-2950 -Dhadoop.log.dir=/var/log/hadoop/hive -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.3.2.0-2950/hadoop -Dhadoop.id.str=hive -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx1024m -XX:MaxPermSize=512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/hdp/2.3.2.0-2950/hive/lib/hive-service-1.2.1.2.3.2.0-2950.jar org.apache.hive.service.server.HiveServer2 --hiveconf hive.aux.jars.path=file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar -hiveconf hive.metastore.uris= -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=/var/log/hive

So I can see that the process is running with -Xmx1024m, even though the setting in the UI is set to a really large value: http://imgur.com/3oXfpPj
05-25-2016
01:48 AM
I am issuing a command that executes about 1500 xpaths on a single XML file (it is about 10MB in size), and I am getting the error in the title. I have tried increasing just about every configuration setting I know of related to Hive/Tez Java heap space, e.g. https://community.hortonworks.com/questions/5780/hive-on-tez-query-map-output-outofmemoryerror-java.html. Nothing seems to work. I restart the server after every configuration change. I also went and changed hive-env.sh to -Xmx8g and it still doesn't fix the issue. I ran -verbose:gc and see that the GC stops at ~1000MB. Why wouldn't that go up to 8G if I changed -Xmx to 8g? Is there any way to tell whether it is the client that is breaking and needs more heap, or the map jobs?
Labels:
- Apache Hive
- Apache Tez
03-05-2016
02:46 AM
1 Kudo
I can't seem to reply to your last comment, but that was exactly the problem.
03-05-2016
02:45 AM
1 Kudo
Thanks, I found it; it was already set to true, so that still wasn't the issue. I went into Hue and ran the CREATE FUNCTION command (the same command as I ran in the Hive CLI), the command worked, and I was able to run the function within Hue. This looks to me like some type of context issue where the persistent function added in the CLI doesn't work in the other contexts (ODBC and Hue). I have no idea how to solve that.
03-04-2016
05:47 PM
1 Kudo
I'm going to accept your answer for this question, as I ended up writing a UDF to solve the potential slowness of doing all the XPaths multiple times. The general gist of the thread still applies, just with different problems.

I ended up partially "solving" the issue of having 300 columns in a table (in the Hive CLI) by disabling Apache Atlas in HDP. Apparently Atlas was intercepting the queries and blowing up when the query became too long; I would venture to guess this is a bug in Atlas. After fixing that, I worked on writing the UDF and making it permanent so it could be used by the application over an ODBC connection. I used the CREATE FUNCTION statement and that works... except it only made the function permanent in the Hive CLI context; in an ODBC or even Hue context the function doesn't exist. I ended up having to run the CREATE FUNCTION statement in the Hue/ODBC application context as well. Unless I'm missing a configuration setting I'm not aware of, I assume this is another bug.

Once I did that, I was able to get the Hive CLI to work with all 400+ columns with the UDF. I thought I was done, but unfortunately I ran into another issue when I tried to run the same query that worked in the Hive CLI in the Hue/ODBC app. It is similar to the first error: if I only have ~250 columns in the query, it works in Hue/the ODBC application. I'm currently investigating this problem. But these are all examples of the sentiment in my original post.

2016-03-04 10:47:55,417 WARN [HiveServer2-HttpHandler-Pool: Thread-34]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) - Error fetching results:
org.apache.hive.service.cli.HiveSQLException: Expected state FINISHED, but found ERROR
at org.apache.hive.service.cli.operation.Operation.assertState(Operation.java:161)
at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:334)
at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:221)
at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy19.fetchResults(Unknown Source)
at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TServlet.doPost(TServlet.java:83)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.doPost(ThriftHttpServlet.java:171)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:565)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:479)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1031)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:965)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:349)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:449)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:925)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:76)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:609)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:45)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
03-04-2016
03:21 PM
1 Kudo
It was not created with a specific database. If I run SHOW FUNCTIONS within the Hive CLI, it shows up as default.<myfunction>. If I run SHOW FUNCTIONS in Hue, the function does NOT show up, even though I'm using the "default" database. Is there a way I can make it not be under "default." and just be "<function>"? Hue and the app using ODBC have no problem using built-in functions (e.g. count()). If I add the jar file in Hue (on the left sidebar) along with the function/class information, it all works.
03-04-2016
03:19 PM
1 Kudo
I don't even see hive.server2.enable.doAs. Would it be under the Hive configuration settings?
03-01-2016
03:55 AM
1 Kudo
I'll check this. I am using HDP 2.3.2 (sandbox), which I believe comes with Hive 1.2.1, so that defect *shouldn't* be the problem.
02-26-2016
10:56 PM
1 Kudo
I'm logging in with the same username in Hue as I am with the Hive CLI. I'm getting this error:

Error occurred executing hive query: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:155 Invalid function
02-26-2016
10:51 PM
1 Kudo
Sorry, yeah, using Hue or my ODBC application it says it can't find the function. I'm logging into the application in Hue with the same username I use with the Hive CLI. To be specific:

Error occurred executing hive query: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:155 Invalid function
02-26-2016
06:17 PM
2 Kudos
I'm using the Hortonworks Hive ODBC driver in my application. I did:

CREATE FUNCTION MyFunc AS 'com.my.udf.class' USING JAR 'hdfs:///user/location/to/my.jar';

That worked. When I close my Hive CLI session and open it back up, I can immediately run SELECT myfunc(data) FROM tbl; and it loads the class and functions correctly. However, it doesn't work inside Hue or in the ODBC connection within my app.
Labels:
- Apache Hive
02-25-2016
02:12 PM
@Neeraj Sabharwal There are two errors I've been fighting with while trying to get access to all of these columns in the same query:

1. Error writing to server: https://gist.github.com/kur1j/513e5a1499eef6c727a1
2. FULL head: https://gist.github.com/kur1j/217eae2065c7953d9cf7

The second one I *thought* I had a workaround for by disabling security (unchecking the security box in Ambari for Hive), but it keeps showing back up. Here is the defect I think I'm running into for the FULL head issue: https://issues.apache.org/jira/browse/HIVE-11720

UPDATE: I'm about 99.99% sure I figured out the problem! I started looking further into the ERROR logs. This line, "at org.apache.atlas.security.SecureClientUtils$1$1.run(SecureClientUtils.java:103)", tipped me off that Atlas was being interacted with in some way. I disabled Atlas by turning off the Atlas service and removing hive.exec.failure.hooks=org.apache.atlas.hive.hook.HiveHook. I ran my entire query and it worked without issue! I would venture to say this is an issue with Atlas not being able to handle really long queries.
02-25-2016
12:55 AM
1 Kudo
Since there isn't really any hard limit, and 400 columns shouldn't be enough to cause OOM issues, I'm not quite sure what else to do. To me this looks purely like configuration issues/bugs in Hive or its dependencies. I posted this issue on the user mailing list but haven't heard anything. Any suggestions?
02-24-2016
11:46 PM
1 Kudo
Thanks, but same issue. How can I increase how long a query string Hive can accept? I created a simple UDF that takes the XML string as input, does all the xpath parsing on that file, and returns a map type. I was hoping that getting rid of all the xpath() calls would eliminate the issue, but it didn't work. I can now do SELECT m["key"] FROM (SELECT myfunc(xmldata) FROM xmlSource), but when I do SELECT m["key1"], ..., m["key400"] FROM (...) I'm back at the "FULL head" issue for some reason.
02-23-2016
02:05 PM
1 Kudo
I have not tried. I'll try it and see.
02-19-2016
06:39 PM
1 Kudo
Thank you. I agree that it has a lot of possibilities, and I'm not giving up by any means. The post was just to get feedback on others' real-world experience. I ask because I look at a lot of examples/tutorials/blogs/forums etc. and it is made out to look so simple: "Hey look, I just uploaded this super simple example into HDFS, wrote this 25-line Spark job, and it took 15 minutes to do! You can do it too!" When working in the real world, you spend 5 hours trying to debug some random issue because a NULL value is in a field or the script can't handle line endings properly. Thanks again for the feedback.
02-19-2016
06:33 PM
2 Kudos
Thank you for the quick response! I'm not quite sure how a Hive UDF would help in this instance; isn't xpath already a UDF? The data I'm extracting is extremely simplistic (e.g. values from 1 to 100 and short strings); the problem is just getting to that data. I'm working through the process of trying the SerDe to dynamically extract the data, but if that fails, the only option is extracting the data myself. If I write a Spark/MR/Pig job to extract the information, that opens up different options/trade-offs, such as storing it in HBase, an ORC-backed Hive table, etc. But it also seems like complete overkill, since we could easily extract the information in the originating application and insert it into an ORC-backed Hive table.

What are my options for running a job/process automatically as soon as data is finished uploading to HDFS?

1) The C# application that generates the XML uploads the data to HDFS.
2.a) C# calls some type of Spark/Pig job to process the data? (I do have to worry about authenticating with Knox.)
2.b) Use Oozie and Falcon to create some type of scheduled workflow.
2.c) Apache NiFi/DataFlow?
02-19-2016
03:27 PM
1 Kudo
My comment about trying to use Avro was based on the example you linked; it wouldn't work properly with a large schema. I am not really dealing with single "huge" XML files, just ones with a ton of columns. I don't have to process them directly as XML; that is just the format I'm going to get them in. Here is the logical process I was trying to follow:

1) Client uploads XML to HDFS.
2) Create an external table on the HDFS folder where the XML data is stored.
3) Create an ORC-based table.
4) Query the external table, extracting all of the data (SerDe, a view on top of the table with xpath, Avro, ???), and load it into the ORC table.
5) Query the ORC table as needed for analytics.

I'm really having issues with step 4, extracting the data within the ecosystem. I could go write an external program to extract all of the information before it gets sent to Hadoop, but that destroys the whole point of "processing unstructured data". The other issue I see is having to recreate the ORC table constantly: since there is no way for me to know about "new data", I can't simply append the new XML documents to the ORC table without deleting the original documents from the EXTERNAL table.
02-19-2016
03:00 PM
1 Kudo
Thanks for the feedback. You are correct in your analysis; that is the way I was initially doing it. I just wanted something simple to get up and working. How would you suggest I "parse" the documents and extract all the information I need within the Hadoop ecosystem? If I extract all of the XML fields and information in the source application, wouldn't that remove a major selling point of this ecosystem, which is not doing traditional ETL on the datasets and simply working on the data "as is"? I've never had a solid answer on the best where, what, and how for ingesting data into this ecosystem. As for the other method you mentioned (using a SerDe), I don't see how it wouldn't run into the exact same issue I ran into. If I use this SerDe, https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources, I would end up with 400 xpaths in my WITH SERDEPROPERTIES(). If the error I'm getting above comes from too long a query, I don't see how the problem wouldn't show up here as well.
02-19-2016
02:29 AM
5 Kudos
I really just want to get a feeling from others in the community on whether my issues are due to my lack of knowledge, really bad luck, or if my experiences are par for the course. I seem to be constantly running into issues just setting up datasets to perform analysis on. So much so that I'm not even working with any significant amount of "unstructured" data (what this stuff is supposedly designed to work with); I'm fighting with errors, exceptions, configuration issues, and incompatibilities. The vast majority of my time is spent trying to get things to work properly in my most simplistic case (e.g. one file). On top of that, the stuff I've been working with isn't exotic by any means (processing XML documents, JSON documents, etc.). I'll usually get a decent start by finding a guide, but that's about where it stops.

Take this for example. We want to build a data warehouse for some simulation data that is stored in XML (one simulation can generate hundreds or even thousands of XML documents). We are throwing away GBs upon GBs of potentially good simulation data that could provide some feedback. So we wanted to set up a simplistic case of storing and retrieving data from these XML files. We set up HDFS + Hive on HDP 2.3.2 to do some simple evaluation, found an example that shows an easy way of doing this by creating an external table and storing the XML of an entire file in a single column, and wrote a few selects on the data with xpath to grab information. We were able to achieve that successfully (10 to 20 XML attributes), but that is about where it stops. Performance on a single 400KB XML file in the external table was well over 30 seconds (with defaults), but I figured I would come back to that.

I wrote a program to grab all the xpaths out of my XML documents so I could create a view on top of the data (instead of having to write out all the xpaths each time we query it); the XML file ends up having 400-ish xpaths. I put the 400 xpaths into my select and got an internal Hive error. I found out it's a bug (supposedly fixed in a newer version of HDP), spent 3 hours tracking the issue down, and then spent another 3-ish hours trying to find a workaround to continue. I found a workaround, ran the query again, and then, boom, no surprise, another internal Hive error. It looks to be barfing on the size of the select statement; the xpaths themselves are fine. I can do SELECT <first 50 xpaths> FROM tbl no problem, and then the next 50, no problem. As soon as I put 75+ xpaths in the select, Hive blows up. I spent another 4 or 5 hours trying to figure out a workaround for that and still haven't figured it out.

I tried using Avro and had constant problems with the schema erroring out on some of the data in the fields. I used a tool I found to generate the schema based on the XSD; it would error out with some error when I tried querying the data (I don't even remember the error now). I built a schema by hand on a few attributes/fields, but it would error out on others (and this was only a few fields, not the entire 400+). That was another 4 to 6 hours spent troubleshooting and not really getting anywhere. A SerDe is the next route I'm going to take, but I have a very good feeling that when I try to create a table defining the xpaths it will error out. We'll see.

Another issue: Twitter data stored as JSON with multiple JSON objects per file (500 tweets or so). Go get a SerDe for JSON data, and get a tool to generate the JSON schema structure. The schema generator errors out on the JSON data because some of the fields are "NULL". Spend 4+ hours trying to find a different schema generator that works. Find one, generate the external table, run a simple select `user` from twitterData. It worked! Yes! Add a few more columns thinking I have it working. Looking at the data now, the values are landing in random columns, and what I have determined is that the JSON SerDe is incorrectly parsing characters in the JSON documents. Back to the drawing board: try to get a JSON SerDe to work, or spend time writing my own. I've spent hours on this issue, probably totalling days, getting to this point with only half-assed results that aren't even correct.

These problems seem extremely trivial (XML and JSON processing) and I'm struggling to get them working properly, beyond the most simplistic case of one file with 2 or 3 hand-picked values and a select that fits on a single page, which is pointless. It seems that once anything even slightly complex comes into play, everything is a constant fight to get working, much less the selling point of "throw unstructured data at a wall and query it all!" For my XML problem, I could have taken the program that generated my xpaths and, instead of outputting the xpaths, grabbed the data along with them and shoved it into a PostgreSQL database that I know I wouldn't have these issues with. Yeah, it might choke on 1TB of data, once we get there, but at least I could be querying it in under 4 hours.

Does anyone else have these problems or similar experiences, or is it just me?
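P.S. For anyone curious, the program I mentioned that grabs all the xpaths out of an XML document is nothing fancy; it's roughly along these lines (a simplified sketch, not the actual code, and the file name is a placeholder):

```python
import xml.etree.ElementTree as ET


def collect_xpaths(element, prefix=''):
    """Walk the tree and yield a /path/to/leaf for every leaf element."""
    path = f'{prefix}/{element.tag}'
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from collect_xpaths(child, path)


if __name__ == '__main__':
    root = ET.parse('simulation_output.xml').getroot()  # placeholder file name
    for xpath in sorted(set(collect_xpaths(root))):
        print(xpath)  # these became the ~400 xpath() calls in the view
```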
Labels:
- Apache Hadoop
- Apache Hive