Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
Views | Posted |
---|---|
3133 | 08-25-2017 03:09 PM |
1996 | 08-22-2017 06:52 PM |
3454 | 08-09-2017 01:10 PM |
8131 | 08-04-2017 02:34 PM |
8174 | 08-01-2017 11:35 AM |
11-24-2016
12:47 PM
1 Kudo
Unfortunately, when Pig loads data it does so line by line. It also processes data line by line and does not hold it in memory, so there is no way to operate over multiple lines. Similarly, when applying a regex, it ignores the newline: once records are loaded you are forced to operate on a record-by-record basis (though of course you can aggregate into sum, average, etc.).

There is one possibility for processing multiple lines, but it will not work in your case: if you have fields in double quotes that have a newline inside the field, then you can use piggybank's CSVExcelStorage to remove them. Since you are using log data this will not work for you.
https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/piggybank/storage/class-use/CSVExcelStorage.Multiline.html

You will have to preprocess the data using another programming paradigm to group your lines (INFO and the next n lines) together. Suggestions are:
- Spark
- map-reduce program where you implement your own InputFormat or RecordReader
- NiFi (using ExtractText processor and regex, where Enable Multiline Mode = false), typically outside of hadoop
- awk or sed (outside of hadoop)
- java or groovy (outside of hadoop)
- python, R, etc. (outside of hadoop)

These look like good solutions for you (using Spark):
http://stackoverflow.com/questions/32408123/how-to-parse-log-lines-using-spark-that-could-span-multiple-lines
http://apache-spark-user-list.1001560.n3.nabble.com/multi-line-elements-td51.html

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
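As a rough illustration of the preprocessing idea above (grouping each INFO line with the lines that follow it), here is a minimal Python sketch that runs outside of hadoop. It assumes a new record starts on any line containing a log level such as INFO; the pattern, the function name and the separator are my own placeholders, so adjust them to your actual log format.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: a line containing one of these log levels starts a new record.
# This is a guess at the log format, not something taken from your data.
RECORD_START = re.compile(r"\b(INFO|WARN|ERROR|DEBUG)\b")


def group_records(lines: Iterable[str], sep: str = " ") -> Iterator[str]:
    """Yield one string per logical record, joining continuation lines with sep."""
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        # A record-start line closes the record collected so far.
        if RECORD_START.search(line) and current:
            yield sep.join(current)
            current = []
        current.append(line)
    # Emit the last record, if any.
    if current:
        yield sep.join(current)
```

Each yielded string is a single line, so the output can be written to a file and then loaded by Pig one record per line as usual.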
11-24-2016
12:20 PM
1 Kudo
Not sure if this helps, just tossing this out there, but here are a few things which you probably already know.

From https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
In order for Hive Streaming to work the following has to be in place:
- Table is stored as ORC
- Transactional Property is set to “True”
- The Table is Bucketed

When I ran that demo I noticed, for the InferAvroSchema processor:
- City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal produced nulls in the Hive table for the columns in caps
- I made them all lower case and got the values in the Hive table

Finally, a recent post: https://community.hortonworks.com/questions/68068/hive-streamaing.html
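If you want to lower-case the header before the file reaches NiFi, something like the following Python sketch would do it. This is only one way to make the change I described (I edited the names by hand in the demo); the function name is hypothetical and it assumes the first row of the CSV is the header.

```python
import csv
import io


def lowercase_header(csv_text: str) -> str:
    """Return the CSV text with the first (header) row lower-cased."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(reader):
        # Only the header row is changed; data rows pass through untouched.
        writer.writerow([name.lower() for name in row] if index == 0 else row)
    return out.getvalue()
```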
11-24-2016
01:00 AM
Answers to questions:
- Templates are stored on the NiFi cluster where they are saved or uploaded. See article visuals on how to access them.
- Correct ... down/uploaded via UI or API. (I believe the community is working on a central repository for templates, accessible by all UI instances ... in the works however.)
- It is similar to code. Keep in mind when discussing templates you should think of them as either reusable assets (checked out to be used across the team or enterprise for reuse) or full flows (checked out after passing testing, to be promoted to a new environment, e.g. QA to UAT or Prod). See diagram in SDLC section of article.
- From reading NiFi literature, I think system and OS properties are splitting hairs around properties we can retrieve directly from the OS (system property like line separator in java, OS property like what you set with export).
- Good question ... will test and update this answer.
- No, but you could handle this by giving your property names namespaces for a process group, etc. E.g. you could prepend the process group name or last 10 digits of uid in front of the property name: 034g345d2.filepath, 034g345d2.threshold and 034f423ee1.filepath, 034f423ee1.threshold to specify the same properties per process group.
- No, that is the really powerful part about this ... as the article states, each processor, process group or connection has a UUID with the first part a global id and the second part an instance id. When you download the template, the instance id is replaced by all 0s. When you upload to the canvas, the instance id is given a unique sequence. In that way you can reuse these templates as many times and at as many hierarchy levels as you want. (Works the same way with copy-paste ... you can copy any processor, or connections of processors (subflow), or process group and paste into flows as many times as you wish. Works because each paste creates a new instance id for each component.)
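To make the property-namespacing idea above concrete, here is a small Python sketch that prefixes the same property names per process group and renders them as properties-file lines. The helper names and the values (/data/in, 10) are made up for illustration; only the prefixes and property names come from my example above.

```python
from typing import Dict


def namespaced(prefix: str, props: Dict[str, str]) -> Dict[str, str]:
    """Prefix every property name with a process-group identifier."""
    return {f"{prefix}.{name}": value for name, value in props.items()}


def to_properties(props: Dict[str, str]) -> str:
    """Render properties as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


# Hypothetical values; the same two properties are defined once per process group.
group_a = namespaced("034g345d2", {"filepath": "/data/in", "threshold": "10"})
group_b = namespaced("034f423ee1", {"filepath": "/data/other", "threshold": "25"})
print(to_properties({**group_a, **group_b}))
```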
11-23-2016
10:27 PM
Given that constraint, it looks like your best option is to call it on the source system, dump the results to a table in the source db and then sqoop from there.

If you cannot dump to a table, you could call it from a java program, dump the results to a local file system and get this to hadoop using linux shell hdfs commands. The java program would have to be on an edge node connected to hadoop.
linux shell hdfs commands: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html
edge node: http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/

If the java program cannot be on an edge node, you would need to ftp to the edge node and then put to hadoop, or transfer the results straight to hadoop via the WebHdfs REST api. These last two are the least favorable for performance reasons.
java hdfs api: https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html
WebHdfs REST api: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
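For the WebHdfs option, the first step is building the CREATE request URL. Below is a minimal Python sketch of that step only; the host name, port, path and user are placeholders I made up, so substitute your own namenode details. Per the WebHdfs docs, the upload itself is a two-step PUT: the namenode answers the first request with a 307 redirect, and the file content is then sent to the datanode URL given in the Location header.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host: str, port: int, hdfs_path: str, user: str,
                       overwrite: bool = False) -> str:
    """Build the namenode URL for the WebHdfs CREATE operation."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    # quote() keeps the slashes in the HDFS path and escapes other characters.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Placeholder values, not taken from any real cluster.
print(webhdfs_create_url("namenode.example.com", 50070, "/tmp/results.csv", "etl"))
```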
11-23-2016
10:18 PM
One approach is to build the notification process group as a single reusable asset that is checked out from a repository and implemented in new flows at whatever hierarchy level and however many times it is needed among all flows. As such, it is change-managed and preconfigured with properties that are dynamic and ready to go for each environment. Additionally, configurations can be modified for each instance where it is used in a flow. See the following link: https://community.hortonworks.com/articles/60868/enterprise-nifi-implementing-reusable-components-a.html Let me know if this is possibly what you are looking for; else, follow up with additional requirements/needs.
11-23-2016
10:10 PM
2 Kudos
(Assuming you are running NiFi locally and putting to the sandbox) I had the same issue when putting anything to the sandbox (PutHiveStreaming, PutHDFS; for PutHDFS, NiFi began writing the file but would immediately suffer a broken pipe, leaving 0 byte files in HDFS). I solved this by following Simon Ball's article http://www.simonellistonball.com/technology/nifi-sandbox-hdfs-hdp/ which simplifies communicating with the vm by using remote process groups on each side. Very straightforward and worked the first time.
11-23-2016
07:17 PM
2 Kudos
Definitely not advisable nor worth considering. It also would not be covered by a Hortonworks support license. The minimum cluster size for a production environment is typically seen as 3 management nodes that hold master services like namenode, zookeepers, etc. + 4 data nodes that hold data in hdfs and also slave services. The sandbox is a single node with everything, great for installing quickly, learning skills and perhaps doing simple demos ... but not for production high availability, throughput, processing, etc. See this post for a discussion of minimal deployment: https://community.hortonworks.com/questions/48572/physical-layout-of-architecture.html
11-23-2016
07:00 PM
2 Kudos
@bala krishnan It works for me when I set Replacement Strategy to "Literal Replace": my input file has control-a (but no \001) and my output file has control-a followed by test. When I use the default Replacement Strategy ("Regex Replace") my output file has \001test.
11-23-2016
05:00 PM
I am not clear on the access parts here. Typically the offloading is done by rewriting the logic with a hadoop tool as mentioned. It seems like your choice is between easiest and most scalable. Easiest is to simply process the stored proc wherever/however as long as the result lands in hdfs. Most scalable is the rewrite on hadoop alternatives. (Not sure if this is clear ... let me know if not)
11-23-2016
04:03 PM
2 Kudos
Unfortunately you cannot. See the following for alternatives, which must offload stored procedure logic to hadoop.
http://stackoverflow.com/questions/39217329/sql-stored-procedure-to-scala-spark-streaming
https://community.hortonworks.com/questions/68083/data-ingestion-from-mssql-server-to-hdfs.html

Note: there are advantages to offloading the stored proc processing to hadoop:
- it typically takes much less time on hadoop (parallel processing)
- it frees resources on your source system and thus improves performance on that side