Member since
05-30-2018
1322
Posts
715
Kudos Received
148
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4058 | 08-20-2018 08:26 PM |
| | 1954 | 08-15-2018 01:59 PM |
| | 2380 | 08-13-2018 02:20 PM |
| | 4116 | 07-23-2018 04:37 PM |
| | 5026 | 07-19-2018 12:52 PM |
08-26-2016
03:43 PM
@Pierre Villard can you please try creating the job like this:

```shell
sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
  --table mytable
```
08-25-2016
09:12 PM
4 Kudos
Often a timestamp field is added to a Phoenix table to support various business requirements. The pattern I generally see is simply adding a TIMESTAMP column. What many may not realize is that a timestamp is already part of the HBase data model: by default, every cell written to HBase is versioned with the timestamp at which the cell was created. This is out of the box.

Apache Phoenix provides a way of mapping HBase's native row timestamp to a Phoenix column. "This leverages various optimizations which HBase provides for time ranges on the store files as well as various query optimization capabilities built within Phoenix." - https://phoenix.apache.org/rowtimestamp.html

Based on your use case, set this "column" to a value of your liking and take advantage of the built-in data model design without adding yet another timestamp field. Commonly I find that many end up creating a secondary index on these additional timestamp fields, since time is almost always a variable at query time.

Let's take a look at a simple data model: a customer entity with various attributes. Typically you would see the following CREATE TABLE statement in Phoenix:

```sql
CREATE TABLE IF NOT EXISTS customer (
    firstName VARCHAR,
    lastName VARCHAR,
    address VARCHAR,
    ssn INTEGER NOT NULL,
    effective_date TIMESTAMP,
    ACTIVE_IND CHAR(1),
    CONSTRAINT pk PRIMARY KEY (ssn)
) KEEP_DELETED_CELLS=false;
```
Often, once the table is created and populated, SQL queries start pouring in. Soon a pattern is established where effective_date is the most commonly used predicate at query time, and the DBA creates a secondary index:

```sql
CREATE INDEX my_idx ON CUSTOMER (effective_date DESC);
```

You then have to determine whether this should be a global or a local index. I won't go into those details now; my point is that there may be an easier way: leverage the row timestamp! Instead of creating an additional column, this time I will assign effective_date to the row timestamp, which is baked into the HBase data model. This is how it is done:

```sql
CREATE TABLE IF NOT EXISTS customer (
    firstName VARCHAR,
    lastName VARCHAR,
    address VARCHAR,
    ssn INTEGER NOT NULL,
    effective_date TIMESTAMP NOT NULL,
    ACTIVE_IND CHAR(1),
    CONSTRAINT pk PRIMARY KEY (ssn, effective_date ROW_TIMESTAMP)
) KEEP_DELETED_CELLS=false;
```

Now you can query the customer table using effective_date while avoiding a secondary index. There may be use cases where a secondary index makes more sense than leveraging the core row timestamp; your use cases will drive that decision. The flexibility is there and you have choices.
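Under the hood, a ROW_TIMESTAMP column maps to the HBase cell timestamp, which is stored as milliseconds since the Unix epoch. A quick Python sketch of that mapping (the helper name is mine, purely illustrative):

```python
from datetime import datetime, timezone

def to_hbase_timestamp(dt):
    # HBase cell timestamps are milliseconds since the Unix epoch (UTC).
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

# An effective_date of 2016-08-25 14:30:00 UTC maps to this cell timestamp:
print(to_hbase_timestamp(datetime(2016, 8, 25, 14, 30, 0)))
```

This is why Phoenix can translate a time-range predicate on effective_date directly into an HBase scan time range on the store files.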
08-25-2016
08:16 PM
1 Kudo
@milind pandit loaded question. First you have to define what the unique entity is. Once that is solved, you can use various tools like Pig to parse through the data and produce a single record. This can also be done in Hive by using a GROUP BY on your natural key to return a single record from the source. Lastly, you can use tools like Informatica or Talend to do the same.
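As a sketch of the group-by-natural-key idea, here is the same dedup logic in plain Python (the field names and the "keep the latest record" rule are hypothetical; your entity definition drives both):

```python
# Raw records containing duplicates on the natural key (here: ssn).
records = [
    {"ssn": 1, "name": "Ann",  "updated": "2016-01-01"},
    {"ssn": 1, "name": "Anna", "updated": "2016-06-01"},
    {"ssn": 2, "name": "Bob",  "updated": "2016-03-01"},
]

# Group by the natural key and keep the most recent record per key,
# the way a Hive GROUP BY (or a tool like Pig) would collapse duplicates.
latest = {}
for rec in records:
    key = rec["ssn"]
    if key not in latest or rec["updated"] > latest[key]["updated"]:
        latest[key] = rec

print([r["name"] for r in latest.values()])
```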
08-25-2016
08:14 PM
2 Kudos
@Donna Suddeth it is the password you set during the launch of the VM.
08-25-2016
02:58 PM
@kishore sanchina you will need to use a protocol. If you simply want to "push" local files to NiFi, you can use the ListenHTTP processor and then simply curl the file to NiFi.
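To illustrate the push pattern itself (not NiFi), here is a self-contained Python sketch: a throwaway local HTTP endpoint stands in for ListenHTTP, and an stdlib POST stands in for the curl call. The host, port, path, and payload are all hypothetical:

```python
import http.server
import threading
import urllib.request

received = []

class Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the pushed file body, the way ListenHTTP turns a POST into a FlowFile.
        length = int(self.headers["Content-Length"])
        received.append(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

# Port 0 lets the OS pick a free port; a real ListenHTTP uses its configured port.
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/contentListener" % server.server_address[1]
req = urllib.request.Request(url, data=b"local file contents", method="POST")
urllib.request.urlopen(req)
server.shutdown()
print(received[0])
```

With a real NiFi instance, the client side collapses to something like `curl -X POST --data-binary @myfile http://nifi-host:port/contentListener`.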
08-25-2016
02:43 PM
@Rajib Mandal for example, on Red Hat you can run `ipa krbtpolicy-mod theuser --maxlife=3600`
08-25-2016
02:08 PM
6 Kudos
In recent weeks I have tested Hadoop on various IaaS providers in hopes of finding additional performance insights. BigStep blew away my expectations in terms of Hadoop performance on IaaS, so I wanted to take the testing a step further and quantify the performance gained by adding nodes to a cluster. Even for a small 1TB dataset, would 5 nodes perform far better than 3? I have heard a few times that when it comes to small datasets, adding more nodes may not have an impact. So this led me to test a 3-node cluster against a 5-node cluster using a 1TB dataset. Do the extra 2 nodes increase processing and IO performance? Let's find out.

I started the testing with DFSIO, a distributed IO benchmark tool. Here are the results:

- From 3 to 5 data nodes, IO read performance increased approx. 36%.
- From 3 to 5 data nodes, IO write performance increased approx. 49%.

With 2 additional data nodes, an IO write throughput improvement of 49%! Wish I had more boxes to play with; I can't imagine where that would take the measures.

Next, TeraGen on 3 vs. 5 data nodes. TeraGen is a map/reduce program that generates the data.

- From 3 to 5 data nodes, TeraGen performance increased approx. 65%.

Next, TeraSort on 3 vs. 5 data nodes. TeraSort samples the input data and uses map/reduce to sort the data into a total order.

- From 3 to 5 data nodes, TeraSort performance increased approx. 54%.

Finally, TeraValidate on 3 vs. 5 data nodes. TeraValidate is a map/reduce program that validates that the output is sorted.

- From 3 to 5 data nodes, TeraValidate performance increased approx. 64%.

The DFSIO write, TeraGen, TeraSort, and TeraValidate tests each saw roughly a 50% or better performance increase, and DFSIO read about 36%. So the theory that throwing more nodes at Hadoop increases performance seems to be justified, and yes, that is with a small dataset. You do have to consider your use case before applying a blanket statement like that. However, the physics and software engineering principles of Hadoop support the idea of horizontal scalability, and therefore these results make complete sense to me. Hope this provided some insight into the relationship between node count and expected performance. All my test results are here.
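The percentage figures above are simple relative improvements between the 3-node and 5-node runs. As a sanity check on the arithmetic (the throughput numbers below are made up for illustration, not my measured results):

```python
def pct_increase(before, after):
    # Relative improvement going from the 3-node run to the 5-node run.
    return (after - before) / before * 100

# Hypothetical MB/s throughputs for a 3-node vs. 5-node write test:
print(round(pct_increase(100.0, 149.0)))  # a 49% improvement
```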
08-24-2016
03:58 PM
@Ayub Pathan I assume the topic above was created using non-embedded mode. What if I am using embedded?
08-24-2016
02:59 PM
1 Kudo
@Sushant Bharti you have options. The first configuration option is the Scheduling Strategy. There are three possible options for scheduling components:

- Timer driven: This is the default mode. The Processor will be scheduled to run on a regular interval. The interval at which the Processor is run is defined by the 'Run schedule' option.
- Event driven: When this mode is selected, the Processor will be triggered to run by an event, and that event occurs when FlowFiles enter Connections feeding this Processor. This mode is currently considered experimental and is not supported by all Processors. When this mode is selected, the 'Run schedule' option is not configurable, as the Processor is not triggered to run periodically but as the result of an event. Additionally, this is the only mode for which the 'Concurrent tasks' option can be set to 0; in this case, the number of threads is limited only by the size of the Event-Driven Thread Pool that the administrator has configured.
- CRON driven: When using the CRON driven scheduling mode, the Processor is scheduled to run periodically, similar to the Timer driven scheduling mode. However, the CRON driven mode provides significantly more flexibility at the expense of increased configuration complexity. This value is made up of six fields, each separated by a space.

More info here.
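For example, a six-field Quartz-style expression for the Run schedule might look like this (fields are second, minute, hour, day-of-month, month, day-of-week; the schedule shown is hypothetical):

```
0 0 13 * * ?
```

which would fire the Processor at 1:00 PM every day.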