Member since
07-10-2017
78
Posts
6
Kudos Received
4
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1821 | 10-17-2017 12:17 PM |
 | 2411 | 09-13-2017 12:36 PM |
 | 3069 | 07-14-2017 09:57 AM |
 | 1058 | 07-13-2017 12:52 PM |
07-03-2018
02:53 PM
Hi @Ya ko, Why not consider the new ORC? https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487 That way you will get the best performance when querying from Hive. And yes, you have to define your table with all the fields. Slide 20 shows how to specify the new ORC library; you just have to adjust the location setting to point to where your data will be stored in HDFS. Michel
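To illustrate (the table name, columns and HDFS path below are made up, not taken from the original question), an external ORC table definition would look roughly like this:
    # minimal sketch, assuming a HiveServer2 at hiveserver:10000 and an invented schema
    beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
    CREATE EXTERNAL TABLE events (
      id BIGINT,
      event_time TIMESTAMP,
      payload STRING
    )
    STORED AS ORC
    LOCATION 'hdfs:///data/events';"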
06-13-2018
02:38 PM
Hi @Oleg Parkhomenko, The following link describes how you can secure a YARN queue so that only specific users can submit jobs to a specific queue; it is done with Ranger: https://community.hortonworks.com/articles/10797/apache-ranger-and-yarn-setup-security.html Normally, if you are in a Kerberos environment, you should not have jobs running as dr.who. Michel
06-13-2018
02:29 PM
Hi @rajat puchnanda, Based on your example, you are trying to do a "join". NiFi is not an ETL tool but more a flow manager: it allows you to move data across systems and to do some very simple transformations, like CSV to Avro. You should not do computations or joins with NiFi. For your use case it would be better to use another tool like Hive, Spark, ... Best regards, Michel
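For example (the table and column names here are invented, just to show the idea), the join could be done in Hive instead of NiFi:
    # minimal sketch with made-up table and column names
    beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
    SELECT c.id, c.name, o.amount
    FROM customers c
    JOIN orders o ON c.id = o.customer_id;"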
06-13-2018
02:22 PM
Hi @Zack Riesland, Indeed, increasing the number of buckets will increase the parallelism when writing to HDFS (and then to the disks). If I were you I would have a look at the disk/IOPS usage: if you try to load a lot of data and you have only one disk, it can take a long time. Generally it's recommended to have multiple disks per node to avoid IOPS congestion. What's the exact query that you are running to insert the data? Does it contain some casting? What's the size of your data? Also, a good optimisation is to use an ORC table and not Avro. During the loading phase it should not change a lot, but when you query your data it will make the difference. Michel
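As a quick way to check whether the disks are the bottleneck (assuming the sysstat package is installed on the data nodes):
    # minimal sketch: extended disk stats on a data node, three samples at 5-second intervals
    iostat -x 5 3
    # high %util and long await times on the data disks usually point to IOPS congestion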
06-13-2018
02:09 PM
Hi @rajat puchnanda, If by merging you mean doing a union, you can use the MergeContent processor if the two CSVs have the same structure. Best regards, Michel
06-13-2018
02:07 PM
Hi @Oleg Parkhomenko, You should be able to kill all the jobs waiting in the queue with this script:
for app in `yarn application -list | awk '$6 == "ACCEPTED" { print $1 }'`; do yarn application -kill "$app"; done
Just put it in a .sh script and run it with a user that is allowed to kill the applications. Best regards, Michel
06-13-2018
02:01 PM
Hi, Usually timeouts happen because the cluster is undersized, there are no dedicated nodes for HBase, or the ingestion is so fast that HBase needs to do a lot of region splits.
- Do you manage a lot of data with HBase? If yes, did you pre-split your table?
- If I were you I would also have a look at the CPU, memory and disk I/O usage. If you don't have any dedicated nodes for HBase, other Hadoop components like Spark, Hive, etc. can have an impact.
As a general best practice, you should have dedicated nodes for HBase with enough CPU and several disks. Best regards, Michel
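For reference (the table name, column family and split points below are invented), pre-splitting can be done from the HBase shell, for example:
    # minimal sketch: create a table pre-split into 4 regions (made-up name, family and split keys)
    echo "create 'events', 'cf', SPLITS => ['25000000', '50000000', '75000000']" | hbase shell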
12-10-2017
06:18 PM
Hi @Ashish Singh, Can you show the command that you used to submit your spark application? Michel
11-15-2017
10:16 AM
@Arti Wadhwani Do you have the answer to your question? I'm trying to do that: connecting with ZooKeeper discovery and specifying the Tez queue, but it doesn't work.
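For context, this is the kind of connection string I am trying (the ZooKeeper hosts and queue name below are placeholders):
    # minimal sketch: HiveServer2 via ZooKeeper discovery, with a Tez queue passed as a Hive conf
    beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=myqueue"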
11-06-2017
03:51 PM
Hi @Ennio Sisalli, Before running the query that saves the result in HDFS, can you try to set the following parameter: set hive.cli.print.header=true; Best regards, Michel
11-01-2017
03:19 PM
Hi @Simon Jespersen, Did you restart NiFi once you added the new NiFi property or modified the file? NiFi needs to be restarted in order to load the new parameter. Michel
10-17-2017
12:25 PM
Hello, I'm trying to ingest data into Hive with NiFi
(from JSON data => ConvertJSONToSQL => PutHiveQL) and I get this error message from the PutHiveQL processor:
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException: line 1:221 cannot recognize input near '?' ',' '?' in value row constructor
If I look at the input flowfile of the PutHiveQL, it has the correct insert query:
INSERT INTO nifilog (objectid, platform, bulletinid, bulletincategory, bulletingroupid, bulletinlevel, bulletinmessage, bulletinnodeid, bulletinsourceid, bulletinsourcename, bulletinsourcetype, bulletintimestamp) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Each flowfile has all the needed attributes: sql.args.N.type & sql.args.N.value. Any idea how to debug/solve this?
10-17-2017
12:17 PM
The solution is to use the "SiteToSiteBulletinReportingTask" as a reporting task. It can send all the bulletins to a NiFi instance, which can be the same instance as the NiFi that generated them. It sends them to a specific input port in JSON, and then you are able to process them. It has all the attributes needed. Here is an example: [{"objectId":"9c8e75e6-eb5a-4a52-9d4a-a3d3b7f0c80f",
"platform":"nifi",
"bulletinId":305,
"bulletinCategory":"Log Message",
"bulletinGroupId":"24a8726b-015f-1000-ffff-ffffae66ea1c",
"bulletinLevel":"ERROR",
"bulletinMessage":"PutHDFS[id=24b463f8-015f-1000-ffff-ffffd09bd856] PutHDFS[id=24b463f8-015f-1000-ffff-ffffd09bd856] failed to invoke @OnScheduled method due to java.lang.RuntimeException: Failed while executing one of processor's OnScheduled task.; processor will not be scheduled to run for 30 seconds: java.lang.RuntimeException: Failed while executing one of processor's OnScheduled task.",
"bulletinNodeId":"ede4721c-30fe-4879-b22e-20bfe602c615",
"bulletinSourceId":"24b463f8-015f-1000-ffff-ffffd09bd856",
"bulletinSourceName":"PutHDFS",
"bulletinSourceType":"PROCESSOR",
"bulletinTimestamp":"2017-10-17T08:16:48.945Z"},
10-16-2017
12:50 PM
Hi @Abdelkrim Hadjidj, Thanks for your reply. My objective is to get the error message, which can be many things (host not found, parsing error, connection refused, etc.) for the same failure relationship. Michel
10-16-2017
12:33 PM
Hi @Gayathri Devi, I can't give you more ideas than in my previous comment, because it depends on the system specifications that you have, the other load on the cluster, the size of the data, the size of each line, etc. The percentages that I gave you are based on benchmarks that I made in previous projects and on blogs/forums that I read in the past. The best that you can do is a test: I would recommend running one test with compression and another without, to see the impact it has on your environment. Moreover, be careful with Hive on top of HBase. You might get bad performance because it often starts a full scan of the HBase table, which is an expensive operation. Michel
10-16-2017
12:25 PM
Hi, If a processor fails and routes the flowfile to the failure relationship, is there an "error" attribute? If some processors have it, how can I tell which ones? For example, for PutHDFS I don't see anything in the documentation (doc puthdfs). Is there another way to have the reason of the failure attached to the flowfile? Thanks, Michel
10-13-2017
12:59 PM
1 Kudo
Hi @Gayathri Devi, The compression/decompression operations will increase the CPU load by around 5-10%. On the storage side, compression will decrease the disk space used by around 70%; moreover, since the size on disk is smaller, you will need fewer IOPS. Because of that you should see a general improvement in your performance. Michel
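As an illustration only (the table name and columns are invented, and I'm assuming Hive ORC tables here, which the thread doesn't state explicitly), enabling compression on a table could look like this:
    # minimal sketch: an ORC table with ZLIB compression (made-up schema)
    beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
    CREATE TABLE page_views (
      user_id BIGINT,
      url STRING,
      view_time TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB');"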
10-06-2017
08:53 AM
Hi @Gobi Subramani, Is this normal in your code?: String node = "x.x.x.x:6667"; I think it should be an IP or a hostname. Michel
09-27-2017
01:48 PM
@Hemant, You said that you were able to interact with HDFS from the host that has NiFi. How did you get the ticket to interact with HDFS? Are you able to create a ticket with the user and keytab mentioned in the configuration of the processor? (Just to be sure that the keytab is working well.)
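For example (the principal and keytab path below are placeholders, use the ones from your processor configuration):
    # minimal sketch: request a ticket with the keytab to verify it works
    kinit -kt /etc/security/keytabs/nifi.keytab nifi/host.example.com@MY_REALM
    klist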
09-26-2017
03:25 PM
@Hemant for the user do you have this structure: hive/FQDN@MY_REALM ?
09-26-2017
02:55 PM
For info, I think that once you configure that property, you need to restart nifi
09-26-2017
02:54 PM
Hi @Hemant, Did you configure the nifi.kerberos.krb5.file in your nifi.properties?
09-26-2017
02:39 PM
Hi @Hemant, No, NiFi doesn't need to be kerberized, but you need to install the Kerberos client on the OS (where NiFi is installed) in order to be able to request a ticket. Michel
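For example, on a RHEL/CentOS host this would be something like the following (package names differ on other distributions):
    # minimal sketch: install the Kerberos client tools and check the configuration
    yum install -y krb5-workstation
    cat /etc/krb5.conf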
09-21-2017
07:44 AM
@nallen Is pcap_replay installed as a service by default with HCP 1.2? If not, how do I install it manually? Thanks
09-16-2017
12:20 PM
Hi @Rahul Gupta, Did you manage this? If yes, can you accept the answer? 🙂 Thanks, Michel
09-16-2017
12:18 PM
Hi @n c, You are welcome! 🙂 I don't think there are other objects in Hive (but I'm not sure). There are the UDFs: for those you need to export the jar that you use for the UDF in your first cluster. May I ask you to accept my answer? 🙂 Thanks! Michel
09-16-2017
12:12 PM
Hi @Piyush Chauhan, Do you need more info? If not, can you accept the answer? 🙂 Michel
09-15-2017
01:20 PM
Hi, I saw that it's possible to use the pycapa script in order to capture data and send it to Kafka. Do you know if there's an easy way to directly ingest a pcap file that has been generated by another system? Like a program that reads the pcap file and sends it to Kafka? Or another way to do it? Thanks, Michel
- Tags:
- CyberSecurity
- Metron
09-14-2017
09:51 AM
Hi @Nagesh Gollapudi, Can you take a screenshot of the configuration of your different processors in NiFi? 🙂 Michel
09-14-2017
09:45 AM
Hi @Piyush Chauhan, In the Hortonworks stack you have the following:
- HDFS ACLs: you manage the access rights on HDFS yourself. This can very quickly become a huge amount of work if you have a lot of users, and it only protects access to HDFS.
- HDFS TDE (encryption): this is an HDFS feature that encrypts, in a completely transparent manner, all the files in a folder. It provides strong protection for any data stored on HDFS, whether it comes from Hive, HBase, etc.
- Ranger: the most interesting part! It's a tool that helps you manage access to the different Hadoop components. For example, you can create a policy so that a specific group of users defined in your enterprise AD has read-only, write, or denied access to an HDFS folder. It can also restrict access to Hive, HBase, Solr, Kafka, etc. Ranger is really powerful and helps manage security by reducing the time needed to do it. Moreover, it provides an audit feature: if it is enabled, you can see who accessed what and when (you can also see if the permission was denied).
- Knox: you can see Knox as a kind of proxy. Every request from every user is sent to the Knox server, which redirects it to the correct service/server. It's useful if you don't want your users to know the network topology of your cluster, and if you don't want them to have direct access to the servers hosting the services (like the Hive database).
I would recommend the combination of Ranger + Knox, plus TDE encryption if you have the need. Best regards, Michel
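Just to illustrate the HDFS ACL point (the path, user and group below are made up), managing permissions by hand looks like this:
    # minimal sketch: grant a user read/execute and a group read-only on an invented folder
    # (requires ACLs to be enabled on the NameNode: dfs.namenode.acls.enabled=true)
    hdfs dfs -setfacl -m user:alice:r-x /data/project
    hdfs dfs -setfacl -m group:analysts:r-- /data/project
    hdfs dfs -getfacl /data/project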