Member since
05-30-2018
1322
Posts
715
Kudos Received
148
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4058 | 08-20-2018 08:26 PM |
| | 1954 | 08-15-2018 01:59 PM |
| | 2380 | 08-13-2018 02:20 PM |
| | 4116 | 07-23-2018 04:37 PM |
| | 5026 | 07-19-2018 12:52 PM |
08-26-2016
03:43 PM
@Pierre Villard can you please try creating the job like this:

```shell
sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
  --table mytable
```
08-25-2016
09:12 PM
4 Kudos
Often a timestamp field is added to a Phoenix table to support various business requirements. The pattern I generally see is simply adding a TIMESTAMP column. What many may not realize is that a timestamp is already part of the HBase data model: by default, every cell written to HBase is versioned with the timestamp at which the cell was created. This is out of the box.

Apache Phoenix provides a way of mapping HBase's native row timestamp to a Phoenix column. "This leverages various optimizations which HBase provides for time ranges on the store files as well as various query optimization capabilities built within Phoenix." - https://phoenix.apache.org/rowtimestamp.html

Based on your use case, set this "column" to a value of your liking and take advantage of the built-in data model design without adding yet another timestamp field. Commonly I find that many end up creating a secondary index on these additional timestamp fields, since time is almost always a variable at query time.

Let's take a look at a simple data model: a customer entity with various attributes. Typically you would see the following CREATE TABLE statement in Phoenix:

```sql
CREATE TABLE IF NOT EXISTS customer (
    firstName VARCHAR,
    lastName VARCHAR,
    address VARCHAR,
    ssn INTEGER NOT NULL,
    effective_date TIMESTAMP,
    ACTIVE_IND CHAR(1),
    CONSTRAINT pk PRIMARY KEY (ssn)
) KEEP_DELETED_CELLS=false;
```
Often, once the table is created and populated, SQL queries start pouring in. Soon a pattern is established where effective_date is the most commonly used predicate at query time, and the DBA creates a secondary index:

```sql
CREATE INDEX my_idx ON CUSTOMER (effective_date DESC);
```

You then have to determine whether this should be a global or a local index. I won't go into those details now; my point is that there may be an easier way: leverage the row timestamp! Instead of creating an additional column, this time I will assign effective_date to the row timestamp, which is baked into the HBase data model. This is how it is done:

```sql
CREATE TABLE IF NOT EXISTS customer (
    firstName VARCHAR,
    lastName VARCHAR,
    address VARCHAR,
    ssn INTEGER NOT NULL,
    effective_date TIMESTAMP NOT NULL,
    ACTIVE_IND CHAR(1),
    CONSTRAINT pk PRIMARY KEY (ssn, effective_date ROW_TIMESTAMP)
) KEEP_DELETED_CELLS=false;
```

Now you can query the customer table using effective_date while avoiding a secondary index. There may be use cases where a secondary index makes more sense than leveraging the core row timestamp; your use cases will drive that decision. The flexibility is there and you have choices.
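Under the hood, a ROW_TIMESTAMP column maps to the HBase cell timestamp, which is stored as milliseconds since the Unix epoch. A quick Python sketch of that mapping (the helper name is mine, purely illustrative):

```python
from datetime import datetime, timezone

def to_hbase_timestamp(dt):
    # HBase cell timestamps are milliseconds since the Unix epoch (UTC).
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

# An effective_date of 2016-08-25 14:30:00 UTC maps to this cell timestamp:
print(to_hbase_timestamp(datetime(2016, 8, 25, 14, 30, 0)))
```

This is why Phoenix can translate a time-range predicate on effective_date directly into an HBase scan time range on the store files.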
08-25-2016
08:16 PM
1 Kudo
@milind pandit loaded question. First you have to define what the unique entity is. Once that is solved, you can use various tools like Pig to parse through the data and produce a single record. This can also be done in Hive by using a GROUP BY on your natural key to return a single record from the source. Lastly, you can use tools like Informatica or Talend to do the same.
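As a sketch of the group-by-natural-key idea, here is the same dedup logic in plain Python (the field names and the "keep the latest record" rule are hypothetical; your entity definition drives both):

```python
# Raw records containing duplicates on the natural key (here: ssn).
records = [
    {"ssn": 1, "name": "Ann",  "updated": "2016-01-01"},
    {"ssn": 1, "name": "Anna", "updated": "2016-06-01"},
    {"ssn": 2, "name": "Bob",  "updated": "2016-03-01"},
]

# Group by the natural key and keep the most recent record per key,
# the way a Hive GROUP BY (or a tool like Pig) would collapse duplicates.
latest = {}
for rec in records:
    key = rec["ssn"]
    if key not in latest or rec["updated"] > latest[key]["updated"]:
        latest[key] = rec

print([r["name"] for r in latest.values()])
```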
08-25-2016
08:14 PM
2 Kudos
@Donna Suddeth it is the password you set during the launch of the VM.
08-25-2016
02:58 PM
@kishore sanchina you will need to use a protocol. If you simply want to "push" local files to NiFi, you can use the ListenHTTP processor and then simply curl the file to NiFi.
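To illustrate the push pattern itself (not NiFi), here is a self-contained Python sketch: a throwaway local HTTP endpoint stands in for ListenHTTP, and an stdlib POST stands in for the curl call. The host, port, path, and payload are all hypothetical:

```python
import http.server
import threading
import urllib.request

received = []

class Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the pushed file body, the way ListenHTTP turns a POST into a FlowFile.
        length = int(self.headers["Content-Length"])
        received.append(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

# Port 0 lets the OS pick a free port; a real ListenHTTP uses its configured port.
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/contentListener" % server.server_address[1]
req = urllib.request.Request(url, data=b"local file contents", method="POST")
urllib.request.urlopen(req)
server.shutdown()
print(received[0])
```

With a real NiFi instance, the client side collapses to something like `curl -X POST --data-binary @myfile http://nifi-host:port/contentListener`.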
08-25-2016
02:43 PM
@Rajib Mandal for example, on Red Hat you can run `ipa krbtpolicy-mod theuser --maxlife=3600`
08-25-2016
02:08 PM
6 Kudos
In recent weeks I have tested Hadoop on various IaaS providers in hopes of finding additional performance insights. BigStep blew away my expectations in terms of Hadoop performance on IaaS, so I wanted to take the testing a step further and quantify the performance gained by adding nodes to a cluster. Even for a small 1TB dataset, would 5 nodes perform far better than 3? I have heard a few times that when it comes to small datasets, adding more nodes may not have an impact. So this led me to test a 3-node cluster against a 5-node cluster using a 1TB dataset. Do the extra 2 nodes increase processing and IO performance? Let's find out.

I started the testing with DFSIO, a distributed IO benchmark tool. Here are the results:

- From 3 to 5 data nodes, IO read performance increased approx. 36%.
- From 3 to 5 data nodes, IO write performance increased approx. 49%.

With 2 additional data nodes, an IO write throughput improvement of 49%! Wish I had more boxes to play with; I can't imagine where that would take the measures.

Next, TeraGen on 3 vs. 5 data nodes. TeraGen is a map/reduce program that generates the data.

- From 3 to 5 data nodes, TeraGen performance increased approx. 65%.

Next, TeraSort on 3 vs. 5 data nodes. TeraSort samples the input data and uses map/reduce to sort the data into a total order.

- From 3 to 5 data nodes, TeraSort performance increased approx. 54%.

Finally, TeraValidate on 3 vs. 5 data nodes. TeraValidate is a map/reduce program that validates that the output is sorted.

- From 3 to 5 data nodes, TeraValidate performance increased approx. 64%.

The DFSIO write, TeraGen, TeraSort, and TeraValidate tests each saw roughly a 50% or better performance increase, and DFSIO read about 36%. So the theory that throwing more nodes at Hadoop increases performance seems to be justified, and yes, that is with a small dataset. You do have to consider your use case before applying a blanket statement like that. However, the physics and software engineering principles of Hadoop support the idea of horizontal scalability, and therefore these results make complete sense to me. Hope this provided some insight into the relationship between node count and expected performance. All my test results are here.
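The percentage figures above are simple relative improvements between the 3-node and 5-node runs. As a sanity check on the arithmetic (the throughput numbers below are made up for illustration, not my measured results):

```python
def pct_increase(before, after):
    # Relative improvement going from the 3-node run to the 5-node run.
    return (after - before) / before * 100

# Hypothetical MB/s throughputs for a 3-node vs. 5-node write test:
print(round(pct_increase(100.0, 149.0)))  # a 49% improvement
```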
08-24-2016
03:58 PM
@Ayub Pathan I assume the topic above was created using non-embedded mode. What if I am using embedded?
08-24-2016
02:59 PM
1 Kudo
@Sushant Bharti you have options. The first configuration option is the Scheduling Strategy. There are three possible options for scheduling components:

- Timer driven: This is the default mode. The Processor will be scheduled to run on a regular interval. The interval at which the Processor is run is defined by the 'Run schedule' option.
- Event driven: When this mode is selected, the Processor will be triggered to run by an event, and that event occurs when FlowFiles enter Connections feeding this Processor. This mode is currently considered experimental and is not supported by all Processors. When this mode is selected, the 'Run schedule' option is not configurable, as the Processor is not triggered to run periodically but as the result of an event. Additionally, this is the only mode for which the 'Concurrent tasks' option can be set to 0; in this case, the number of threads is limited only by the size of the Event-Driven Thread Pool that the administrator has configured.
- CRON driven: When using the CRON driven scheduling mode, the Processor is scheduled to run periodically, similar to the Timer driven scheduling mode. However, the CRON driven mode provides significantly more flexibility at the expense of increased configuration complexity. This value is made up of six fields, each separated by a space.

More info here.
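For example, a six-field Quartz-style expression for the Run schedule might look like this (fields are second, minute, hour, day-of-month, month, day-of-week; the schedule shown is hypothetical):

```
0 0 13 * * ?
```

which would fire the Processor at 1:00 PM every day.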