Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4126 | 10-18-2017 10:19 PM |
| | 4360 | 10-18-2017 09:51 PM |
| | 14905 | 09-21-2017 01:35 PM |
| | 1859 | 08-04-2017 02:00 PM |
| | 2431 | 07-31-2017 03:02 PM |
09-06-2016
12:42 AM
@Andrew Grande So, I have done steps 1 and 2 as you suggested, but I am not sure how to implement step 3. Using a Jolt spec, I have flattened my JSON by moving the following objects out of the array so that each is now an attribute of the record:

"event_1_detail":{"type":"comma separated values","Description":"<important value>",<many other attributes>}
"event_2_detail":{"type":"comma separated values","Description":"<important value>",<many other attributes>}
"event_3_detail":{"type":"comma separated values","Description":"<important value>",<many other attributes>}

and so on; the set of events is different for each JSON record. I have created a flow-file attribute with the EvaluateJsonPath processor that holds the Description value. Now, out of the elements shown above, I need to drop everything and retain only the one where "Description":"<the value I want to retain>". In other words, of the "event_x_detail" objects, only the one that matches my Description criterion should be kept; the rest of the JSON of course stays as it is, and only the other "event_x_detail" objects are dropped. How do I do this using RouteOnAttribute?
09-04-2016
07:09 PM
3 Kudos
Hi @gkeys I think NiFi and Sqoop are two different tools serving two different use cases and cannot directly be compared, at least not yet.

Sqoop is bundled with bulk loading adapters developed by database vendors and/or Hortonworks together. The purpose of Sqoop is bulk loading of data to and from an RDBMS, and it uses fast connectors designed for that. Sqoop's performance is effectively the performance of the bulk loading tool it is using. Since these are specialized bulk loading tools designed for batch jobs, Sqoop really shines in those use cases (see the sketch at the end of this answer).

NiFi, on the other hand, is a system designed to move data within the organization, bring data in from outside sources, and facilitate data movement between data centers. The data NiFi moves is usually live data from applications, logs, devices, and other sources producing event data. Because NiFi is so rich in its features, you can also use it to fetch data from many other sources, including databases and files. For reading data from databases, NiFi uses a JDBC adapter, which lets you move x number of records at a time from a database; the bottleneck is the JDBC adapter itself.

When we measure NiFi's performance, we are not including the performance of fetching data from the source. What we are measuring is how fast NiFi can move data across as soon as it gets it. That performance is documented here, and it is about 50 MB/s of read/write on a typical server. Can a JDBC source deliver data at this rate? Honestly, I doubt it, but this has nothing to do with NiFi; it is more a function of the driver, the database, and many other variables, just as in any other JDBC program.
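To make the bulk-loading side concrete, here is a hedged sketch of a typical Sqoop import; the JDBC URL, credentials, table name, and target directory are placeholders rather than anything from this thread.

```bash
# Hypothetical example: parallel bulk import of one table into HDFS.
sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl_user -P \
  --table orders \
  --num-mappers 8 \
  --direct \
  --target-dir /data/raw/orders
# --num-mappers splits the import into parallel map tasks;
# --direct asks Sqoop to use the vendor's native bulk-load path where one exists.
```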
09-03-2016
02:29 AM
2 Kudos
@Mike Thomsen The hbase service principal is usually of the form hbase/_HOST@REALM.COM, and I don't see the host part in your principal. Is this how you have set up your hbase principal? Also, what are the permissions on your /etc/security/keytabs/hbase.service.keytab file?
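If it helps, a quick way to check both points, assuming the keytab path above:

```bash
# List the principals stored in the keytab (expect entries like hbase/host.fqdn@REALM.COM).
klist -kt /etc/security/keytabs/hbase.service.keytab
# Check ownership and permissions; on a typical HDP install this is hbase:hadoop with mode 400.
ls -l /etc/security/keytabs/hbase.service.keytab
```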
09-02-2016
04:54 PM
Thank you @Andrew Grande. Quick question: if I don't need to do any filtering in Jolt, why do I need it at all? Just to flatten the JSON?
09-02-2016
04:44 AM
I have a complex JSON document and I am using the JoltTransformJSON processor to flatten it. I have looked at the following links, which are very helpful:
https://docs.google.com/presentation/d/1sAiuiFC4Lzz4-064sg1p8EQt2ev0o442MfEbvrpD1ls/edit#slide=id.g9798b391_00
https://github.com/bazaarvoice/jolt/blob/master/jolt-core/src/main/java/com/bazaarvoice/jolt/Shiftr.java
But what I still can't find is an example that transforms some attributes and not others. The only thing I really want to do is keep the records that contain a certain value and filter out the rest, while everything else in the JSON stays the way it is. Can someone share a quick example of this?
Labels:
- Apache NiFi
09-01-2016
02:51 AM
1 Kudo
@Farhad Heybati There are a number of components in Hadoop and its ecosystem. Each of them has its own high availability/failover strategy and different implications in case of a failure. You already mentioned the NameNode and the YARN ResourceManager, but there are others such as HiveServer2, the Hive Metastore, and the HBase HMaster if you are using HBase. Each has its own documentation available on the Hortonworks website.
1. NameNode. The NameNode, also known as the master node, is the linchpin of Hadoop. If the NameNode fails, your cluster is effectively lost. To avoid this scenario, you must configure a standby NameNode. Instructions to set up NameNode HA can be found here.
2. YARN ResourceManager. YARN manages your cluster resources; how much memory/CPU each job or application gets is allocated through YARN, so that's pretty important. While YARN also has the concepts of ApplicationMaster, NodeManager, and container, the real single point of failure is the ResourceManager, so you need HA for it. Check these two links: link 1 and link 2.
3. HiveServer2. What if you are using Hive (SQL) to query structured data in Hadoop? Assume you have multiple concurrent jobs running, or ad hoc users running their queries, and they connect to Hive through HiveServer2. What if HiveServer2 goes down? You need redundancy for that. Here is how you do it.
4. HBase HMaster. Are you using HBase? HBase has a component called HMaster. It is not quite as crucial as HiveServer2 or the ResourceManager, but if HMaster goes down you may see an impact, especially if a region server also goes down before you are able to bring HMaster back up. So you need to set up HA for HMaster as well. Check this link.
I hope this helps. If you have any follow-up questions, please feel free to ask.
09-01-2016
12:37 AM
1 Kudo
@Madhu B I think you need to use HiveContext and not SQLContext. https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/hive/HiveContext.html
08-31-2016
02:42 AM
1 Kudo
@Cameron Warren You need to first scp your file to Azure. Once that's done, you can use "copyFromLocal" to copy the file into HDFS: hdfs dfs -copyFromLocal /path/to/file /dest/path/on/hdfs
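Put together, the two steps look roughly like this; the host name and paths below are placeholders.

```bash
# 1. Copy the file from your machine to a node in Azure (edge node / head node).
scp /local/path/to/file azureuser@azure-edge-node:/tmp/file
# 2. From that node, copy the file into HDFS.
hdfs dfs -copyFromLocal /tmp/file /dest/path/on/hdfs
```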
08-30-2016
08:37 PM
@mkataria So I am assuming this is the sequence of your commands:
1. Assuming you are root, you run "su hdfs", so now you are the hdfs user.
2. kinit -k -t hdfs-prod@ABC.NET
3. You run "hdfs balancer -threshold <your threshold>".
If my assumptions are right, my question would be whether your hdfs-prod principal has the authority to impersonate other users in core-site.xml, with something like the following:
<property>
  <name>hadoop.proxyuser.hdfs-prod.hosts</name>
  <value>host1,host2</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfs-prod.groups</name>
  <value>group1,group2,supergroup,hdfs</value>
</property>
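For reference, a sketch of that sequence with a keytab-based kinit; the keytab path and threshold value are placeholders, not taken from this thread.

```bash
# 1. Switch to the hdfs user (assuming you start as root).
su - hdfs
# 2. Obtain a ticket for the hdfs-prod principal from its keytab (keytab path is a placeholder).
kinit -kt /etc/security/keytabs/hdfs-prod.headless.keytab hdfs-prod@ABC.NET
# 3. Run the balancer with your chosen threshold.
hdfs balancer -threshold 10
# If you add or change the hadoop.proxyuser.* properties above, have the NameNode reload them:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
```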
08-30-2016
08:05 PM
@mkataria Your error says you are running it as "hdfs-prod". You might have made this user a member of the supergroup, but that still doesn't quite make it the superuser. There is only one superuser, and that's whoever started the NameNode, as @emaxwell pointed out (likely the username "hdfs"). Think of the root user in Linux: you can create other users and make them members of the root group, but there is still only one root user. In Hadoop, that user is hdfs and no other. More details here.