Member since 09-25-2015 · 112 Posts · 88 Kudos Received · 12 Solutions
01-24-2018
03:28 PM
Hi @Ron Lee. Thanks for this writeup. One question: on a kerberized cluster, will the keytabs need to be regenerated if the service account name is changed?
07-19-2017
07:54 PM
Thanks @Wynner!
07-19-2017
07:05 PM
Additional information from @Matt Clarke: Three files are crucial to a new node successfully joining an existing cluster: flow.xml.gz, users.xml, and authorizations.xml. All three must match before a node will be allowed to join.

The flow.xml.gz file contains everything you have added while interfacing with the UI, and all nodes must have matching flow.xml.gz files in order to join the cluster. All you need to do is copy flow.xml.gz from an original cluster node to the new node, make sure ownership is correct, and restart the new node. Normally the cluster will hand these files out to any new node that has none of them; however, if Ambari metrics are enabled and a flow.xml.gz does not exist, Ambari generates a flow.xml.gz that contains only the Ambari reporting task. Because of this, the new node's flow will not match and it will be unable to join the cluster. A NiFi cluster will never overwrite an existing flow.xml.gz on a new node with its own.

Secured NiFi clusters also require that the users.xml and authorizations.xml files match when file-based authorization is used; these two files only come into play when NiFi is secured and using the local file-based authorizer. If secured, the cluster will likewise only hand out the users and authorizations XML files if they don't already exist on the new node.

Bottom line: if you add a new NiFi host via Ambari, it will try to join the cluster. If it fails and shuts back down, copy the above files from one of the existing nodes to the new node and restart it via Ambari.
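The copy step can be sketched as a small helper. This is a sketch, not NiFi tooling: the conf-directory paths you pass in are assumptions for your install (on a real cluster you would first fetch the existing node's files, e.g. with scp), and `copy_cluster_state` is a hypothetical name.

```python
import shutil
from pathlib import Path

# The three cluster-state files that must match for a node to join.
CLUSTER_STATE_FILES = ["flow.xml.gz", "users.xml", "authorizations.xml"]

def copy_cluster_state(src_conf, dst_conf):
    """Copy the cluster-state files from an existing node's conf dir
    (already fetched locally) into the new node's conf dir.
    Returns the list of destination paths actually copied."""
    copied = []
    for name in CLUSTER_STATE_FILES:
        src = Path(src_conf) / name
        # users.xml/authorizations.xml only exist on secured clusters
        # using file-based authorization, so skip what isn't there.
        if src.exists():
            copied.append(shutil.copy2(src, Path(dst_conf) / name))
    return copied
```

After copying, remember the ownership fix and restart from the post above; this helper only handles the file transfer.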
11-08-2016
09:56 PM
Hi @vamsi valiveti. The example above uses the exact same source file in the exact same location for both external tables. Both test_csv_serde_using_CSV_Serde_reader and test_csv_serde read the external file(s) stored in the directory '/user/<uname>/elt/test_csvserde/'. The file I used was pipe delimited and contains 62,000,000 rows, so I didn't attach it. 😉 It would look like Option 2 above, but of course with 4 columns:

121|Hello World|4567|34345
232|Text|5678|78678
343|More Text|6789|342134
10-19-2016
01:31 PM
Hi @Laurent Edel - nice article! I do have a question: are there performance issues when using this method (HCatalog integration) to go from Sqoop directly to ORC format? In other words, does Option A perform as well as (or better than) Option B?

Option A: Sqoop -> directly to an ORC-format table via HCatalog integration
Option B: Sqoop -> text files/external Hive table -> Hive CTAS/insert into an ORC-format table

I'd like to ensure the best possible Sqoop performance. Thanks!
11-19-2015
12:30 AM
Excellent! Not only is it a great feature but it shows how quickly Ambari views are improving & adding functionality.
11-05-2015
12:49 AM
Nice writeup, and very timely too. My current client is looking for this info right now - will bring it to them tomorrow.
11-04-2015
09:46 PM
9 Kudos
I've seen some postings (including this one) where people are using CSVSerde for processing input data. CSVSerde is a magical piece of code, but it isn't meant to be used for all input CSVs.

tl;dr - Use CSVSerde only when you have quoted text or really strange delimiters (such as blanks) in your input data - otherwise you will take a rather substantial performance hit...

When to use CSVSerde:
For example, say we have a text file with the following data:

col1 col2 col3
----------------------
121 Hello World 4567
232 Text 5678
343 More Text 6789

Pipe delimited, it would look like:

121|Hello World|4567|
232|Text|5678|
343|More Text|6789|

but blank delimited with quoted text it would look like this (don't laugh - Progress database dumps are blank delimited and text quoted in this exact format):

121 'Hello World' 4567
232 Text 5678
343 'More Text' 6789

Notice that the text may or may not have quote marks around it - text only needs to be quoted if it contains a blank. This is a particularly nasty set of data. You need custom coding - unless you use CSVSerde, which can handle this data with ease. Blank-delimited, quoted-text files are parsed perfectly without any coding when you use the following table declaration:
CREATE TABLE my_table(col1 string, col2 string, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = " ",
"quoteChar" = "'")
Performance Hit when Using CSVSerde on conventional CSV data
tl;dr - Using CSVSerde for conventional CSV files is about 3X slower...

The following shows timings encountered when processing a simple pipe-delimited CSV file. One Hive table definition uses conventional delimiter processing, and one uses CSVSerde. The timings were taken on a small cluster (28 data nodes). The file used for testing had 62,825,000 rows - again, rather small.

Table DDL using conventional delimiter definition:

CREATE external TABLE test_csv_serde (
`belnr` string,
`bukrs` string,
`budat` string,
`bstat` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location '/user/<uname>/elt/test_csvserde/';
-- Load the data one-time
insert overwrite table test_csv_serde
select * from <large table>;

Table DDL using CSVSerde (same file/source data as the other table):

CREATE external TABLE test_csv_serde_using_CSV_Serde_reader (
`belnr` string,
`bukrs` string,
`budat` string,
`bstat` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = "|")
location '/user/<uname>/elt/test_csvserde/';

Results:

hive> select count(*) from test_csv_serde;
Time taken: 8.683 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 27.442 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde;
Time taken: 8.707 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 27.41 seconds, Fetched: 1 row(s)
hive> select min(belnr) from test_csv_serde;
Time taken: 10.267 seconds, Fetched: 1 row(s)
hive> select min(belnr) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 29.271 seconds, Fetched: 1 row(s)
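As a quick sanity check, the "about 3X" figure follows directly from the timings above:

```python
# Timings (seconds) copied from the hive runs above.
delimited = {"count(*) run 1": 8.683, "count(*) run 2": 8.707, "min(belnr)": 10.267}
csv_serde = {"count(*) run 1": 27.442, "count(*) run 2": 27.41, "min(belnr)": 29.271}

# Ratios come out near 3x: ~3.16, ~3.15, and ~2.85 respectively.
for query in delimited:
    ratio = csv_serde[query] / delimited[query]
    print(f"{query}: {ratio:.2f}x slower with CSVSerde")
```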