Member since: 01-23-2016
Posts: 51
Kudos Received: 41
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 847 | 02-18-2016 04:34 PM
03-21-2018
02:01 PM
I ended up not using NiFi for this. Looking back, I was trying to force a solution out of NiFi that wasn't a good fit. I spent several weeks, entirely too long, trying to solve the simplest case of this project (formatting some text and dumping it into a database). I can certainly see NiFi being useful for moving the source data files around between the folders I'm working with (copying, moving, etc.), but doing any amount of logic or manipulation of anything beyond the happy path is extremely tedious and seemingly difficult. Knowing that I was going to have to do a lot more work on the data to make it even close to usable, I scrapped NiFi and implemented it in Python. After dealing with this data and repeatedly running into edge cases I wasn't even aware of when I wrote this topic, the data in my opinion was just too dirty and had too many exceptions to deal with in NiFi. On top of that, this was only the import of the data, not even the use of it, so I would have needed another tool to process the data into a usable form anyway. Appreciate the response; you took the time to answer, so I figured it was reasonable to follow up even though I didn't end up using the solution.
01-05-2018
08:51 PM
I've googled everywhere for this, and everything I run across is super complicated; it should be relatively simple to do. The recommendations point to the "Example_With_CSV.xml" template from https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates, but there is nothing in there about actually getting each column's value out.

So given a flowfile that's a CSV:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

I need:

$date = 2017-09-20 23:49:38.637
$id = 162929511757
...
$instanceid = 36095
$comment = "Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

OR

$csv.date = ...
$csv.id = ...
...
$csv.instanceid = ...
$csv.comment = ...

Is there another, easier option for this besides regex? I can't stand doing anything with regex because of how unreadable and overly complicated it is. To me there should be a significantly easier way of doing this than regex.
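For context, outside of NiFi this split is trivial; here's the sort of thing I mean, as a rough Python sketch (the attribute names are just the ones from my example above, and the quoted comment is handled by the csv module rather than by hand):

```python
import csv
import io

# The sample line from above: the last field is quoted, so the embedded
# commas belong to the comment rather than to extra columns.
line = ('2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
        '"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"')

# csv.reader respects the quoting, so no regex is needed to keep the
# comment together as a single value.
fields = next(csv.reader(io.StringIO(line)))

attributes = {
    'csv.date': fields[0],
    'csv.id': fields[1],
    'csv.instanceid': fields[3],
    'csv.comment': fields[-1],
}
print(attributes)
```

That's the level of effort I'd expect for this, which is why the regex-heavy answers surprise me.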
Labels:
- Apache NiFi
01-05-2018
08:34 PM
There is no example in the "Working_With_CSV" template of how to extract each individual field into attributes.
01-04-2018
05:25 PM
1 Kudo
Thanks! That seems to work correctly. I'll mark this as the answer since it produces the result I'm looking for.
01-03-2018
10:25 PM
2 Kudos
@Shu Thank you for the great, detailed response. The first part works, but I don't think the regex will work for my case. (Side note, no fault of yours: I just absolutely despise regex, since it's unreadable to me and extremely difficult to debug, if it can be debugged at all.) I should have mentioned this, but the only thing I know about the CSV file is that there are X columns before the string, so I could see something like:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max, attempts,(1=>3), fit failing entirely,(Fit Failure=True),

Specifically, there are 13 columns (commas) before the string, and the string always has a trailing "," (it has always been the last column in the row from what I've seen). The other issue is that I tried using (.*), for every column so I could then feed the values into a database insert, but the regex seems to blow up and stop working with that many columns (the original data has about 150 columns in it; I just truncated it down here).
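To make "not regex" concrete, this is the approach I have in mind, as a rough Python sketch (it only assumes the two things I know: 13 fixed columns, then a free-text column with a trailing comma):

```python
# Sample row: 13 fixed columns, then a free-text string that can contain
# commas anywhere and always ends with a trailing comma.
line = ('23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
        'Failed to fit max, attempts,(1=>3), fit failing entirely,(Fit Failure=True),')

NUM_FIXED_COLUMNS = 13

# Split on the first 13 commas only; everything after them is the comment.
parts = line.split(',', NUM_FIXED_COLUMNS)
fixed = parts[:NUM_FIXED_COLUMNS]                # the 13 known columns
comment = parts[NUM_FIXED_COLUMNS].rstrip(',')   # the rest, minus the trailing comma

print(fixed)
print(comment)
```

With the real data the fixed-column count would just be larger, but the idea is the same.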
01-03-2018
04:46 PM
1 Kudo
I have a CSV file that is messy. I need to:

1. Get the date from the filename, use that as my date, and prepend it to one of the columns.
2. Parse the CSV file to get the columns, where the very last column is a string that contains the separator "," inside it.

The data looks like this.

Filename: ExampleFile_2017-09-20.LOG

Content:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),
23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,Command Measure, Targets complete - Elapsed: 76064 ms,

The following is what will need to be inserted into the database:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"
2017-09-20 23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,"Command Measure, Targets complete - Elapsed: 76064 ms"

Would I need to do this inside NiFi, or with some external script by calling some type of ExecuteScript?
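In case it helps frame the question, this is roughly the transform I mean, written outside NiFi as a rough Python sketch (the 13-column count comes from the example above; the real files are wider):

```python
# Example file name and one raw line from it.
filename = 'ExampleFile_2017-09-20.LOG'
line = ('23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
        'Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),')

NUM_FIXED_COLUMNS = 13

# 1. Pull the date out of the file name (ExampleFile_2017-09-20.LOG -> 2017-09-20).
date = filename.split('_')[-1].split('.')[0]

# 2. Split off the fixed columns; the remainder is the free-text comment,
#    which keeps its embedded commas and loses only the trailing one.
parts = line.split(',', NUM_FIXED_COLUMNS)
fixed = parts[:NUM_FIXED_COLUMNS]
comment = parts[NUM_FIXED_COLUMNS].rstrip(',')

# 3. Prepend the date to the timestamp column and quote the comment so it
#    becomes a single column again.
fixed[0] = f'{date} {fixed[0]}'
print(','.join(fixed + [f'"{comment}"']))
# -> 2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"
```

The question is whether something like this belongs in an ExecuteScript-style processor or outside NiFi entirely.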
Labels:
- Apache NiFi
10-17-2017
02:57 PM
The picture is getting quite blurry between all of the pipeline/ETL tools available. Specifically:

* NiFi
* StreamSets
* Kafka (?)
* Luigi
* Airflow
* Falcon
* Oozie
* A Microsoft solution?

I've got several projects where I could see a use for a pipeline/flow tool, where ETL is the point of the entire project. So what are the strengths and weaknesses of each? Where should I be using one rather than another? Where does one shine where another would be difficult to manage, or be overkill for the project? Which is the most lightweight of the tools?

I have several projects, but two stick out in my mind. They are completely unrelated and do NOT overlap at all.

1) The first project is a simple ETL for XML data. In simple terms, 20 or so machines write XML log data to their local drives, which are shared on the network. A Python application connects to each machine's share and copies the data to the local system to archive the raw data. The same application reads the XML data from the files, extracts all of the relevant content, and stores it in a Microsoft SQL Server database. Currently the application runs every 20 minutes through a Huey cron-job task in Python to look for new data on the shares. This is a Windows-only application/ecosystem, so using something from the MS world isn't out of the question either (hence why I included it).

2) The second project is more of a "pipeline". We have about 2 million files that need to run through a process: a) original format --> b) converted to an industry-standard format --> c) data massaged to fit our needs --> d) data converted --> e) intermediate results written out to disk --> f) data used to train a deep learning model. For inference on a file, steps a) through e) would be performed, step f) would be replaced by inference against the model, and the results would be passed down to g) (another application). This will initially be done on Linux, but they (potentially) want to end up on Windows, so that could be a consideration.

So for these two projects, what would you choose? From everything I have read and researched, NiFi could handle the get and put of the data files easily, but how would NiFi handle calling the Python code that extracts the data and puts it in the database? It also looks to me like NiFi/StreamSets are fairly heavyweight and usually operate within the Hadoop ecosystem; I'm not working with Hadoop/HDFS in either of these two applications. Any input on the strengths/weaknesses/specific use cases for these examples would be greatly appreciated!
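For concreteness, project 1's Python step boils down to something like the following (a made-up sketch, not the actual code; the paths, XML element names, and connection string are placeholders), and the question is really whether a flow tool should replace this or just schedule/call it:

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

import pyodbc  # assumes an MS SQL Server ODBC driver is installed

SHARES = [Path(r'\\machine01\logs'), Path(r'\\machine02\logs')]  # ~20 machines in reality
ARCHIVE = Path(r'D:\archive\raw_xml')
CONN_STR = ('DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=dbhost;DATABASE=logs;Trusted_Connection=yes')


def run_once():
    """One polling pass: archive new XML files and load their contents."""
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        for share in SHARES:
            for xml_file in share.glob('*.xml'):
                archived = ARCHIVE / xml_file.name
                if archived.exists():
                    continue                      # already ingested on a previous pass
                shutil.copy2(xml_file, archived)  # keep the raw file for archival
                root = ET.parse(archived).getroot()
                for rec in root.iter('record'):   # element/attribute names are made up
                    cur.execute(
                        'INSERT INTO log_data (machine, ts, value) VALUES (?, ?, ?)',
                        rec.get('machine'), rec.get('timestamp'), rec.get('value'),
                    )
        conn.commit()


if __name__ == '__main__':
    run_once()  # Huey currently schedules this every 20 minutes
```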
Labels:
- Apache NiFi
05-25-2016
05:34 AM
What do I need to set in hive-env.sh? It seems that anything I touch gets overwritten. This has to be a bug in Ambari where it won't save the hive.heapsize value. How can I get it to persist?
05-25-2016
04:49 AM
The hive.heapsize setting does not exist in my hive-site.xml for some reason, and whenever I add it to the file it keeps getting overwritten.
05-25-2016
04:09 AM
@Divakar Annapureddy Correct, but if you look at my comments, I posted a picture and it shows it has been changed to 12GB in the UI. The services have been restarted (the complete server has been restarted).
05-25-2016
03:27 AM
hive 24964 0.2 1.7 2094636 566148 ? Sl 17:03 0:56 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91.x86_64/bin/java -Xmx1024m -Dhdp.version=2.3.2.0-2950 -Djava.net.preferIPv4Stack=true -Dhdp.version=2.3.2.0-2950 -Dhadoop.log.dir=/var/log/hadoop/hive -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.3.2.0-2950/hadoop -Dhadoop.id.str=hive -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx1024m -XX:MaxPermSize=512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/hdp/2.3.2.0-2950/hive/lib/hive-service-1.2.1.2.3.2.0-2950.jar org.apache.hive.service.server.HiveServer2 --hiveconf hive.aux.jars.path=file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar -hiveconf hive.metastore.uris= -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=/var/log/hive

So I can see that the process is running with -Xmx1024m, even though the setting in the UI is set to a really large value: http://imgur.com/3oXfpPj
05-25-2016
01:48 AM
I am issuing a command that executes about 1500 xpaths on a single XML file (it is about 10MB in size), and I am getting the error in the title. I have tried increasing just about every configuration setting I know of related to Hive/Tez Java heap space, e.g. https://community.hortonworks.com/questions/5780/hive-on-tez-query-map-output-outofmemoryerror-java.html. Nothing seems to work. I restart the server after every configuration change. I also went and changed hive-env.sh to -Xmx8g and it still doesn't fix the issue. I ran -verbose:gc and see that the GC stops at ~1000MB. Why wouldn't that go up to 8G if I changed -Xmx to 8g? Is there any way to tell whether it is the client that is breaking and needs more heap, or the map jobs?
Labels:
- Apache Hive
- Apache Tez
03-05-2016
02:46 AM
1 Kudo
I can't seem to reply to your last comment, but that was exactly the problem.
03-05-2016
02:45 AM
1 Kudo
Thanks, I found it; it was already set to true, so that still wasn't the issue. I went into Hue and ran the CREATE FUNCTION command (the same command as I ran in the Hive CLI), the command worked, and I was able to run the function within Hue. This looks to me like some type of context issue where the persistent function added in the CLI doesn't work in the other contexts (ODBC and Hue). I have no idea how to solve that.
03-04-2016
05:47 PM
1 Kudo
I'm going to accept your answer for this question, as I ended up writing a UDF to solve the potential slowness of doing all the XPaths multiple times. The general gist of the thread still applies, just with different problems.

I ended up partially "solving" the issue of having 300 columns in a table (in the Hive CLI) by disabling Apache Atlas in HDP. Apparently Atlas was intercepting the queries and blowing up when the query became too long; I would venture to guess this is a bug in Atlas. After fixing that, I worked on writing the UDF and making it permanent so it could be used by the application over an ODBC connection. I used the CREATE FUNCTION statement and that works... except it only made the function permanent in the Hive CLI context; in an ODBC or even Hue context the function doesn't exist. I ended up having to run the CREATE FUNCTION statement in the Hue/ODBC application context as well. Unless I'm missing a configuration setting I'm not aware of, I assume this is another bug.

Once I did that, I was able to get the Hive CLI to work with all 400+ columns with the UDF. I thought I was done, but unfortunately I ran into another issue when I tried to run the same query that worked in the Hive CLI in the Hue/ODBC app. It is similar to the first error: if I only have ~250 columns in the query, it works in Hue/the ODBC application. I'm currently investigating this problem. But these are all examples of the sentiment in my original post.

2016-03-04 10:47:55,417 WARN [HiveServer2-HttpHandler-Pool: Thread-34]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) - Error fetching results:
org.apache.hive.service.cli.HiveSQLException: Expected state FINISHED, but found ERROR
at org.apache.hive.service.cli.operation.Operation.assertState(Operation.java:161)
at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:334)
at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:221)
at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy19.fetchResults(Unknown Source)
at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TServlet.doPost(TServlet.java:83)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.doPost(ThriftHttpServlet.java:171)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:565)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:479)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1031)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:965)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:349)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:449)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:925)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:76)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:609)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:45)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
03-04-2016
03:21 PM
1 Kudo
It was not created with a specific database. If I run SHOW FUNCTIONS within the Hive CLI, it shows up as default.<myfunction>. If I run SHOW FUNCTIONS in Hue, the function does NOT show up, even though I'm using the "default" database. Is there a way I can make it not be under "default." and just be "<function>"? Hue and the app using ODBC have no problem using built-in functions (e.g. count()). If I add the jar file in Hue (on the left sidebar) along with the function/class information, it all works.
03-04-2016
03:19 PM
1 Kudo
I don't even see hive.server2.enable.doAs. Would it be under the Hive configuration settings?
03-01-2016
03:55 AM
1 Kudo
I'll check this. I am using HDP 2.3.2 (sandbox), which I believe comes with Hive 1.2.1, so that defect *shouldn't* be the problem.
02-26-2016
10:56 PM
1 Kudo
I'm logging in with the same username in Hue as I am with the Hive CLI. I'm getting this error:

Error occurred executing hive query: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:155 Invalid function
02-26-2016
10:51 PM
1 Kudo
Sorry, yeah, using Hue or my ODBC application it says it can't find the function. I'm logging into the application in Hue with the same username I use with the Hive CLI. To be specific:

Error occurred executing hive query: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:155 Invalid function
02-26-2016
06:17 PM
2 Kudos
I'm using the Hortonworks Hive ODBC driver in my application. I did:

CREATE FUNCTION MyFunc AS 'com.my.udf.class' USING JAR 'hdfs:///user/location/to/my.jar';

That worked. When I close my Hive CLI session and open it back up, I can immediately run SELECT myfunc(data) FROM tbl; and it loads the class and functions correctly. However, it doesn't work inside Hue or in the ODBC connection within my app.
Labels:
- Apache Hive
02-25-2016
02:12 PM
@Neeraj Sabharwal There are two errors I've been fighting with while trying to get access to all of these columns in the same query:

1. Error writing to server: https://gist.github.com/kur1j/513e5a1499eef6c727a1
2. FULL head: https://gist.github.com/kur1j/217eae2065c7953d9cf7

The second one I *thought* I had a workaround for by disabling security (unchecking the security box in Ambari for Hive), but it keeps showing back up. Here is the defect I think I'm running into for the FULL head issue: https://issues.apache.org/jira/browse/HIVE-11720

UPDATE: I'm about 99.99% sure I figured out the problem! I started looking further into the ERROR logs. This line, "at org.apache.atlas.security.SecureClientUtils$1$1.run(SecureClientUtils.java:103)", tipped me off that Atlas was being interacted with in some way. I disabled Atlas by turning off the Atlas service and removing hive.exec.failure.hooks=org.apache.atlas.hive.hook.HiveHook. I ran my entire query and it worked without issue! I would venture to say this is an issue with Atlas not being able to handle really long queries.
02-25-2016
12:55 AM
1 Kudo
Since there isn't really any hard limit, and 400 columns shouldn't be enough to cause OOM issues, I'm not quite sure what else to do. To me this looks purely like configuration issues/bugs in Hive or its dependencies. I posted this issue on the user mailing list but haven't heard anything. Any suggestions?
02-24-2016
11:46 PM
1 Kudo
Thanks, but same issue. How can I increase how long a query string Hive can accept? I created a simple UDF that takes the XML string as input, does all the xpath parsing on that file, and returns a map type. I was hoping that getting rid of all the xpath() calls would eliminate the issue, but it didn't work. I can now do SELECT m["key"] FROM (SELECT myfunc(xmldata) FROM xmlSource), but when I do SELECT m["key1"], ..., m["key400"] FROM (...) I'm back at the "FULL head" issue for some reason.
02-23-2016
02:05 PM
1 Kudo
I have not tried. I'll try it and see.
02-19-2016
06:39 PM
1 Kudo
Thank you. I agree that it has a lot of possibilities, and I'm not giving up by any means. The post was just to get feedback on others' real-world experience. I ask because I look at a lot of examples/tutorials/blogs/forums etc. and it is made out to look so simple: "Hey look, I just uploaded this super simple example into HDFS, wrote this 25-line Spark job, and it took 15 minutes to do! You can do it too!" When working in the real world, you spend 5 hours trying to debug some random issue because a NULL value is in a field or the script can't handle line endings properly. Thanks again for the feedback.
02-19-2016
06:33 PM
2 Kudos
Thank you for the quick response! I'm not quite sure how a Hive UDF would help in this instance; isn't xpath already a UDF? The data I'm extracting is extremely simplistic (e.g. values from 1 to 100 and short strings); the problem is just getting to that data. I'm working through the process of trying the SerDe to dynamically extract the data, but if that fails, the only option is extracting the data myself. If I write a Spark/MR/Pig job to extract the information, that opens up different options/trade-offs, such as storing it in HBase, an ORC-backed Hive table, etc. But it also seems like complete overkill, since we could easily extract the information in the originating application and insert it into an ORC-backed Hive table.

What are my options for running a job/process automatically as soon as data is finished uploading to HDFS?

1) The C# application that generates the XML uploads the data to HDFS.
2.a) C# calls some type of Spark/Pig job to process the data? (I do have to worry about authenticating with Knox.)
2.b) Use Oozie and Falcon to create some type of scheduled workflow.
2.c) Apache NiFi/DataFlow?
02-19-2016
03:27 PM
1 Kudo
My comment about trying to use Avro was based on the example you linked; it wouldn't work properly with a large schema. I am not really dealing with single "huge" XML files, just ones with a ton of columns. I don't have to process them directly as XML; that is just the format I'm going to get them in. Here is the logical process I was trying to follow:

1) Client uploads XML to HDFS.
2) Create an external table on the HDFS folder where the XML data is stored.
3) Create an ORC-based table.
4) Query the external table, extracting all of the data (SerDe, a view on top of the table with xpath, Avro, ???), and load it into the ORC table.
5) Query the ORC table as needed for analytics.

I'm really having issues with step 4, extracting the data within the ecosystem. I could go write an external program to extract all of the information before it gets sent to Hadoop, but that destroys the whole point of "processing unstructured data". The other issue I see is having to recreate the ORC table constantly: since there is no way for me to know about "new data", I can't simply append the new XML documents to the ORC table without deleting the original documents from the EXTERNAL table.
02-19-2016
03:00 PM
1 Kudo
Thanks for the feedback. You are correct in your analysis; that is the way I was initially doing it. I just wanted something simple to get up and working. How would you suggest I "parse" the documents and extract all the information I need within the Hadoop ecosystem? If I extract all of the XML fields and information in the source application, wouldn't that remove a major selling point of this ecosystem, which is not doing traditional ETL on the datasets and simply working on the data "as is"? I've never had a solid answer on the best where, what, and how for ingesting data into this ecosystem. As for the other method you mentioned (using a SerDe), I don't see how it wouldn't run into the exact same issue I ran into. If I use this SerDe, https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources, I would end up with 400 xpaths in my WITH SERDEPROPERTIES(). If the error I'm getting above comes from too long a query, I don't see how the problem wouldn't show up here as well.
02-19-2016
02:29 AM
5 Kudos
I really just want to get a feeling from others in the community on whether my issues are due to my lack of knowledge, really bad luck, or if my experiences are par for the course. I seem to be constantly running into issues just setting up datasets to perform analysis on. So much so that I'm not even working with any significant amount of "unstructured" data (what this stuff is supposedly designed to work with); I'm fighting with errors, exceptions, configuration issues, and incompatibilities. The vast majority of my time is spent trying to get things to work properly in my most simplistic case (e.g. one file). On top of that, the stuff I've been working with isn't exotic by any means (processing XML documents, JSON documents, etc.). I'll usually get a decent start by finding a guide, but that's about where it stops.

Take this for example. We want to build a data warehouse for some simulation data that is stored in XML (one simulation can generate hundreds or even thousands of XML documents). We are throwing away GBs upon GBs of potentially good simulation data that could provide some feedback. So we wanted to set up a simplistic case of storing and retrieving data from these XML files. We set up HDFS + Hive on HDP 2.3.2 to do some simple evaluation, found an example that shows an easy way of doing this by creating an external table and storing the XML of an entire file in a single column, and wrote a few selects on the data with xpath to grab information. We were able to achieve that successfully (10 to 20 XML attributes), but that is about where it stops. Performance on a single 400KB XML file in the external table was well over 30 seconds (with defaults), but I figured I would come back to that.

I wrote a program to grab all the xpaths out of my XML documents so I could create a view on top of the data (instead of having to write out all the xpaths each time we query it); the XML file ends up having 400-ish xpaths. I put the 400 xpaths into my select and got an internal Hive error. I found out it's a bug (supposedly fixed in a newer version of HDP), spent 3 hours tracking the issue down, and then spent another 3-ish hours trying to find a workaround to continue. I found a workaround, ran the query again, and then, boom, no surprise, another internal Hive error. It looks to be barfing on the size of the select statement; the xpaths themselves are fine. I can do SELECT <first 50 xpaths> FROM tbl no problem, and then the next 50, no problem. As soon as I put 75+ xpaths in the select, Hive blows up. I spent another 4 or 5 hours trying to figure out a workaround for that and still haven't figured it out.

I tried using Avro and had constant problems with the schema erroring out on some of the data in the fields. I used a tool I found to generate the schema based on the XSD; it would error out with some error when I tried querying the data (I don't even remember the error now). I built a schema by hand on a few attributes/fields, but it would error out on others (and this was only a few fields, not the entire 400+). That was another 4 to 6 hours spent troubleshooting and not really getting anywhere. A SerDe is the next route I'm going to take, but I have a very good feeling that when I try to create a table defining the xpaths it will error out. We'll see.

Another issue: Twitter data stored as JSON with multiple JSON objects per file (500 tweets or so). Go get a SerDe for JSON data, and get a tool to generate the JSON schema structure. The schema generator errors out on the JSON data because some of the fields are "NULL". Spend 4+ hours trying to find a different schema generator that works. Find one, generate the external table, run a simple select `user` from twitterData. It worked! Yes! Add a few more columns thinking I have it working. Looking at the data now, the values are landing in random columns, and what I have determined is that the JSON SerDe is incorrectly parsing characters in the JSON documents. Back to the drawing board: try to get a JSON SerDe to work, or spend time writing my own. I've spent hours on this issue, probably totalling days, getting to this point with only half-assed results that aren't even correct.

These problems seem extremely trivial (XML and JSON processing) and I'm struggling to get them working properly, beyond the most simplistic case of one file with 2 or 3 hand-picked values and a select that fits on a single page, which is pointless. It seems that once anything even slightly complex comes into play, everything is a constant fight to get working, much less the selling point of "throw unstructured data at a wall and query it all!" For my XML problem, I could have taken the program that generated my xpaths and, instead of outputting the xpaths, grabbed the data along with them and shoved it into a PostgreSQL database that I know I wouldn't have these issues with. Yeah, it might choke on 1TB of data, once we get there, but at least I could be querying it in under 4 hours.

Does anyone else have these problems or similar experiences, or is it just me?
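P.S. For anyone curious, the program I mentioned that grabs all the xpaths out of an XML document is nothing fancy; it's roughly along these lines (a simplified sketch, not the actual code, and the file name is a placeholder):

```python
import xml.etree.ElementTree as ET


def collect_xpaths(element, prefix=''):
    """Walk the tree and yield a /path/to/leaf for every leaf element."""
    path = f'{prefix}/{element.tag}'
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from collect_xpaths(child, path)


if __name__ == '__main__':
    root = ET.parse('simulation_output.xml').getroot()  # placeholder file name
    for xpath in sorted(set(collect_xpaths(root))):
        print(xpath)  # these became the ~400 xpath() calls in the view
```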
Labels:
- Apache Hadoop
- Apache Hive