Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4126 | 10-18-2017 10:19 PM |
| | 4360 | 10-18-2017 09:51 PM |
| | 14905 | 09-21-2017 01:35 PM |
| | 1859 | 08-04-2017 02:00 PM |
| | 2431 | 07-31-2017 03:02 PM |
09-06-2016
12:42 AM
@Andrew Grande So, I have done steps 1 and 2 as you suggested, but I am not sure how to implement step 3. Using a Jolt spec, I have flattened my JSON by moving the following objects out of the array so that each is now an attribute of the record:

"event_1_detail":{"type":"comma separated values","Description":"<important value>",<many other attributes>}
"event_2_detail":{"type":"comma separated values","Description":"<important value>",<many other attributes>}
"event_3_detail":{"type":"comma separated values","Description":"<important value>",<many other attributes>}

and so on; the set of events is different for each JSON record. I have created a flow-file attribute with the EvaluateJsonPath processor that holds the Description value. Now, out of the elements shown above, I need to drop everything and retain only the one where "Description":"<the value I want to retain>". In other words, of the "event_x_detail" objects, only the one that matches my Description criterion should be kept; the rest of the JSON of course stays as it is, and only the other "event_x_detail" objects are dropped. How do I do this using RouteOnAttribute?
09-04-2016
07:09 PM
3 Kudos
Hi @gkeys I think NiFi and Sqoop are two different tools serving two different use cases and cannot directly be compared, at least not yet.

Sqoop is bundled with bulk loading adapters developed by database vendors and/or Hortonworks together. The purpose of Sqoop is bulk loading of data to and from an RDBMS, and it uses fast connectors designed for that. Sqoop's performance is effectively the performance of the bulk loading tool it is using. Since these are specialized bulk loading tools designed for batch jobs, Sqoop really shines in those use cases (see the sketch at the end of this answer).

NiFi, on the other hand, is a system designed to move data within the organization, bring data in from outside sources, and facilitate data movement between data centers. The data NiFi moves is usually live data from applications, logs, devices, and other sources producing event data. Because NiFi is so rich in its features, you can also use it to fetch data from many other sources, including databases and files. For reading data from databases, NiFi uses a JDBC adapter, which lets you move x number of records at a time from a database; the bottleneck is the JDBC adapter itself.

When we measure NiFi's performance, we are not including the performance of fetching data from the source. What we are measuring is how fast NiFi can move data across as soon as it gets it. That performance is documented here, and it is about 50 MB/s of read/write on a typical server. Can a JDBC source deliver data at this rate? Honestly, I doubt it, but this has nothing to do with NiFi; it is more a function of the driver, the database, and many other variables, just as in any other JDBC program.
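To make the bulk-loading side concrete, here is a hedged sketch of a typical Sqoop import; the JDBC URL, credentials, table name, and target directory are placeholders rather than anything from this thread.

```bash
# Hypothetical example: parallel bulk import of one table into HDFS.
sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl_user -P \
  --table orders \
  --num-mappers 8 \
  --direct \
  --target-dir /data/raw/orders
# --num-mappers splits the import into parallel map tasks;
# --direct asks Sqoop to use the vendor's native bulk-load path where one exists.
```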
09-03-2016
02:29 AM
2 Kudos
@Mike Thomsen The hbase service principal is usually of the form hbase/_HOST@REALM.COM, and I don't see the host part in your principal. Is this how you have set up your hbase principal? Also, what are the permissions on your /etc/security/keytabs/hbase.service.keytab file?
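If it helps, a quick way to check both points, assuming the keytab path above:

```bash
# List the principals stored in the keytab (expect entries like hbase/host.fqdn@REALM.COM).
klist -kt /etc/security/keytabs/hbase.service.keytab
# Check ownership and permissions; on a typical HDP install this is hbase:hadoop with mode 400.
ls -l /etc/security/keytabs/hbase.service.keytab
```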
09-02-2016
04:54 PM
Thank you @Andrew Grande. Quick question: if I don't need to do any filtering in Jolt, why do I need it at all? Just to flatten the JSON?
09-02-2016
04:44 AM
I have a complex JSON document and I am using the JoltTransformJSON processor to flatten it. I have looked at the following links, which are very helpful:
https://docs.google.com/presentation/d/1sAiuiFC4Lzz4-064sg1p8EQt2ev0o442MfEbvrpD1ls/edit#slide=id.g9798b391_00
https://github.com/bazaarvoice/jolt/blob/master/jolt-core/src/main/java/com/bazaarvoice/jolt/Shiftr.java
But what I still can't find is an example that transforms some attributes and not others. The only thing I really want to do is keep the records that contain a certain value and filter out the rest, while everything else in the JSON stays the way it is. Can someone share a quick example of this?
Labels:
- Apache NiFi
09-01-2016
02:51 AM
1 Kudo
@Farhad Heybati There are a number of components in Hadoop and its ecosystem. Each of them has its own high availability/failover strategy and different implications in case of a failure. You already mentioned the NameNode and the YARN ResourceManager, but there are others such as HiveServer2, the Hive Metastore, and the HBase HMaster if you are using HBase. Each has its own documentation available on the Hortonworks website.
1. NameNode. The NameNode, also known as the master node, is the linchpin of Hadoop. If the NameNode fails, your cluster is effectively lost. To avoid this scenario, you must configure a standby NameNode. Instructions to set up NameNode HA can be found here.
2. YARN ResourceManager. YARN manages your cluster resources; how much memory/CPU each job or application gets is allocated through YARN, so that's pretty important. While YARN also has the concepts of ApplicationMaster, NodeManager, and container, the real single point of failure is the ResourceManager, so you need HA for it. Check these two links: link 1 and link 2.
3. HiveServer2. What if you are using Hive (SQL) to query structured data in Hadoop? Assume you have multiple concurrent jobs running, or ad hoc users running their queries, and they connect to Hive through HiveServer2. What if HiveServer2 goes down? You need redundancy for that. Here is how you do it.
4. HBase HMaster. Are you using HBase? HBase has a component called HMaster. It is not quite as crucial as HiveServer2 or the ResourceManager, but if HMaster goes down you may see an impact, especially if a region server also goes down before you are able to bring HMaster back up. So you need to set up HA for HMaster as well. Check this link.
I hope this helps. If you have any follow-up questions, please feel free to ask.
09-01-2016
12:37 AM
1 Kudo
@Madhu B I think you need to use HiveContext and not SQLContext. https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/hive/HiveContext.html
08-31-2016
02:42 AM
1 Kudo
@Cameron Warren You need to first scp your file to Azure. Once that's done, you can use "copyFromLocal" to copy the file into HDFS: hdfs dfs -copyFromLocal /path/to/file /dest/path/on/hdfs
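Put together, the two steps look roughly like this; the host name and paths below are placeholders.

```bash
# 1. Copy the file from your machine to a node in Azure (edge node / head node).
scp /local/path/to/file azureuser@azure-edge-node:/tmp/file
# 2. From that node, copy the file into HDFS.
hdfs dfs -copyFromLocal /tmp/file /dest/path/on/hdfs
```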
08-30-2016
08:37 PM
@mkataria So I am assuming this is the sequence of your commands:
1. Assuming you are root, you run "su hdfs", so now you are the hdfs user.
2. kinit -k -t hdfs-prod@ABC.NET
3. You run "hdfs balancer -threshold <your threshold>".
If my assumptions are right, my question would be whether your hdfs-prod principal has the authority to impersonate other users in core-site.xml, with something like the following:
<property>
  <name>hadoop.proxyuser.hdfs-prod.hosts</name>
  <value>host1,host2</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfs-prod.groups</name>
  <value>group1,group2,supergroup,hdfs</value>
</property>
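For reference, a sketch of that sequence with a keytab-based kinit; the keytab path and threshold value are placeholders, not taken from this thread.

```bash
# 1. Switch to the hdfs user (assuming you start as root).
su - hdfs
# 2. Obtain a ticket for the hdfs-prod principal from its keytab (keytab path is a placeholder).
kinit -kt /etc/security/keytabs/hdfs-prod.headless.keytab hdfs-prod@ABC.NET
# 3. Run the balancer with your chosen threshold.
hdfs balancer -threshold 10
# If you add or change the hadoop.proxyuser.* properties above, have the NameNode reload them:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
```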
08-30-2016
08:05 PM
@mkataria Your error says you are running it as "hdfs-prod". You might have made this user a member of the supergroup, but that still doesn't quite make it the superuser. There is only one superuser, and that's whoever started the NameNode, as @emaxwell pointed out (likely the username "hdfs"). Think of the root user in Linux: you can create other users and make them members of the root group, but there is still only one root user. In Hadoop, that user is hdfs and no other. More details here.