Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4067 | 08-20-2018 08:26 PM |
| | 1963 | 08-15-2018 01:59 PM |
| | 2390 | 08-13-2018 02:20 PM |
| | 4139 | 07-23-2018 04:37 PM |
| | 5046 | 07-19-2018 12:52 PM |
12-16-2016
10:27 PM
1 Kudo
@milind pandit is right. If you do not have Ranger enabled, add the nifi user to the Linux group that owns /tmp; this is ACL-level security. If you have Ranger enabled, do not do this.
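A minimal sketch of that change, assuming the owning group turns out to be named hadoop (check first; the group name here is only an example):

# Check which group owns /tmp, then add the nifi user to it.
stat -c '%G' /tmp              # prints the owning group, e.g. "hadoop" (example only)
sudo usermod -aG hadoop nifi   # append the nifi user to that group
id nifi                        # verify the new group membership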
12-15-2016
11:20 PM
7 Kudos
A continuation of my IaaS Hadoop performance testing; my previous performance test was on BigStep.

Objective

Test 1 terabyte of data using the Tera suite (TeraGen, TeraSort, and TeraValidate) on similar hardware profiles, with core baseline settings, across multiple IaaS providers and Hadoop-as-a-Service offerings. Here we capture EMR performance statistics using EMRFS (S3), an object store.

AWS EMR

The natural next step is to test the Tera suite on AWS EMR, Amazon's Hadoop-as-a-Service offering. I used EMR with "EMRFS, which is an implementation of HDFS which allows EMR clusters to store data on Amazon S3". Object storage has not traditionally performed well with Hadoop, so I was very interested in testing the new EMRFS. EMRFS/S3 was chosen as the storage layer for this test because much of the allure of S3 with EMR is EMR's ability to store and process data directly on S3. Using EMR's local storage (not S3) may increase performance.

Hardware

| Instance Type | vCPU | RAM | Disk |
|---|---|---|---|
| i2.4xlarge | 16 | 122 GB | 4 x 800 GB SSD |

1 master and 3 data nodes.

Observation

I have run the same core scripts on other platforms (hundreds of times) without any modification; that is the objective of these tests: run the same job/script on similar hardware profiles and node counts. With EMR that was not the case. I had to change various script settings, the MapReduce jar file, and timeout settings for the scripts to work on EMR. Jobs on EMR failed using 1 terabyte of data (issue posted here on the AWS forum). I set mapred.task.timeout=12000000 to get around the EMRFS connection-reset issue; this issue did not occur for smaller datasets.

TeraGen results: 26 minutes, 45 seconds
TeraSort results: 2 hours, 57 minutes, 49 seconds
TeraValidate results: 23 minutes, 55 seconds

Performance Numbers

| IaaS | TeraGen | TeraSort | TeraValidate |
|---|---|---|---|
| AWS EMR (EMRFS/S3) | 26 mins, 45 secs | 2 hrs, 57 mins, 49 secs | 23 mins, 55 secs |
| BigStep/HDP (DAS) | 11 mins, 49 secs | 51 mins, 12 secs | 4 mins, 42 secs |

Note: the BigStep test used local disk, while the EMR test used EMRFS. These numbers show the difference in performance between local storage and EMRFS (S3). Performance statistics for EMR using local storage (non-S3/EMRFS) are not provided here.

The objective of the test was to capture performance statistics using the same jobs/scripts with the same configuration on similar hardware and document the results. That's it. Keep it simple. This is not a reflection of the capabilities of a specific IaaS provider. All my scripts are located here. The EMR-specific scripts are here.
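As an illustration of the timeout workaround above, the override can be passed straight to the Tera suite job on the command line. This is a sketch, not the exact command used in the test; the examples jar path and HDFS directories are placeholders.

# Sketch only: jar path and HDFS directories are placeholders, not the exact ones from this test.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort \
  -Dmapred.task.timeout=12000000 \
  /benchmarks/teragen-1T /benchmarks/terasort-1T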
12-15-2016
05:46 AM
Convert XML to JSON using this https://community.hortonworks.com/articles/29474/nifi-converting-xml-to-json.html and then convert JSON to CSV using this https://github.com/sunileman/NiFi-Json2Csv
12-15-2016
03:58 AM
2 Kudos
Here is what I suggest. The code is simple. The processor calls HiveJdbcCommon.convertToCsvStream, which is a custom utility class. From the processor:

public void process(final OutputStream out) throws IOException {
    try {
        logger.debug("Executing query {}", new Object[]{selectQuery});
        final ResultSet resultSet = st.executeQuery(selectQuery);
        if (AVRO.equals(outputFormat)) {
            nrOfRows.set(HiveJdbcCommon.convertToAvroStream(resultSet, out));
        } else if (CSV.equals(outputFormat)) {
            nrOfRows.set(HiveJdbcCommon.convertToCsvStream(resultSet, out));
        } else {
            nrOfRows.set(0L);
            throw new ProcessException("Unsupported output format: " + outputFormat);
        }
        // ... (remainder of the callback omitted)
The relevant method of the custom class is here:

public static long convertToCsvStream(final ResultSet rs, final OutputStream outStream, String recordName, ResultSetRowCallback callback)
        throws SQLException, IOException {
    final ResultSetMetaData meta = rs.getMetaData();
    final int nrOfColumns = meta.getColumnCount();
    List<String> columnNames = new ArrayList<>(nrOfColumns);
    for (int i = 1; i <= nrOfColumns; i++) {
        String columnNameFromMeta = meta.getColumnName(i);
        // Hive returns table.column for column name. Grab the column name as the string after the last period
        int columnNameDelimiter = columnNameFromMeta.lastIndexOf(".");
        columnNames.add(columnNameFromMeta.substring(columnNameDelimiter + 1));
    }
    // Write column names as header row
    outStream.write(StringUtils.join(columnNames, ",").getBytes(StandardCharsets.UTF_8));
    outStream.write("\n".getBytes(StandardCharsets.UTF_8));
    // Iterate over the rows
    long nrOfRows = 0;
    while (rs.next()) {
        if (callback != null) {
            callback.processRow(rs);
        }
        List<String> rowValues = new ArrayList<>(nrOfColumns);
        for (int i = 1; i <= nrOfColumns; i++) {
            final int javaSqlType = meta.getColumnType(i);
            final Object value = rs.getObject(i);
            // String types are quoted and CSV-escaped, complex types are escaped as-is,
            // and everything else falls through to toString()
            switch (javaSqlType) {
                case CHAR:
                case LONGNVARCHAR:
                case LONGVARCHAR:
                case NCHAR:
                case NVARCHAR:
                case VARCHAR:
                    String valueString = rs.getString(i);
                    if (valueString != null) {
                        rowValues.add("\"" + StringEscapeUtils.escapeCsv(valueString) + "\"");
                    } else {
                        rowValues.add("");
                    }
                    break;
                case ARRAY:
                case STRUCT:
                case JAVA_OBJECT:
                    String complexValueString = rs.getString(i);
                    if (complexValueString != null) {
                        rowValues.add(StringEscapeUtils.escapeCsv(complexValueString));
                    } else {
                        rowValues.add("");
                    }
                    break;
                default:
                    if (value != null) {
                        rowValues.add(value.toString());
                    } else {
                        rowValues.add("");
                    }
            }
        }
        // Write row values
        outStream.write(StringUtils.join(rowValues, ",").getBytes(StandardCharsets.UTF_8));
        outStream.write("\n".getBytes(StandardCharsets.UTF_8));
        nrOfRows++;
    }
    return nrOfRows;
}
The code is very simple. Basically, wherever it adds a comma, you can replace that with your delimiter. Build the NAR (very simple) and there you go. In reality, I believe this code could easily be enhanced to accept an input parameter, i.e. the delimiter, so the processor would emit the result set using whatever delimiter you specify. If I have an hour or so next week I will post the custom NAR info here.
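For reference, once the delimiter change is made, building and deploying the custom NAR might look roughly like the following. The bundle directory name and NiFi install path are placeholders, not taken from the original post.

# Rough sketch only: bundle directory name and NiFi paths are hypothetical placeholders.
cd my-custom-hive-bundle                 # the Maven bundle containing the modified HiveJdbcCommon
mvn clean install                        # produces the .nar artifact under the *-nar module's target directory
cp ./*-nar/target/*.nar /opt/nifi/lib/   # copy the NAR into NiFi's lib directory (path varies by install)
/opt/nifi/bin/nifi.sh restart            # restart NiFi so it loads the new NAR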
12-14-2016
10:24 PM
1 Kudo
This post https://community.hortonworks.com/questions/71481/finding-and-purging-flow-files.html shows how to purge FlowFiles from NiFi's UI. Is the content purged as well? I assume yes, but would like confirmation.
Labels:
- Apache NiFi
12-13-2016
04:05 PM
You can definitely use Hive, but you are not using the easy button. "Best practice" is an abused term in our industry; a best practice for customer A may not be a best practice for customer B. It all comes down to cluster size, hardware configuration, and use case, which determine the "best practice" for your specific situation. If you want to transform data, the entire industry is moving to Spark, which is nice since it offers multiple APIs over the same dataset. I recommend you open another HCC question if you are looking for a "best practice" for a specific use case. I recommend NiFi for what you have identified.
12-13-2016
04:00 PM
No problem. Enjoy Atlas!
12-13-2016
03:54 PM
HWX has a Pig tutorial that comes with data and a script. I recommend you try this: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
12-13-2016
03:16 PM
I have experienced this before. Please verify that Ambari Infra, HBase, and Kafka are up and running, and restart them even if they are up. After the restart, try again; if the issue continues, please provide the error from the log.
12-13-2016
03:50 AM
6 Kudos
I have written articles in the past benchmarking Hadoop cloud environments such as BigStep and AWS. What I didn't dive into in those articles is how I ran the scripts. I built scripts to rapidly launch TeraGen, TeraSort, and TeraValidate. Why? I found myself running the same script over and over and over again. Why not make it easier by simply executing a shell script? All the scripts I mention are located here. Grab the following files: teragen.sh, terasort.sh, and validate.sh.

To run TeraGen, TeraSort, and TeraValidate you need to decide on the volume of data and the number of records. For example, you can generate 500 GB of data with 5,000,000,000 rows. The script comes with the following predefined sets:

#SIZE=500G
#ROWS=5000000000
#SIZE=100G
#ROWS=1000000000
#This will be used as it is the only value left uncommented
SIZE=1T
ROWS=10000000000
#SIZE=10G
#ROWS=100000000
#SIZE=1G
#ROWS=10000000

Above, SIZE=1T (for 1 terabyte) and ROWS=10000000000 are uncommented, meaning the script will generate 1 TB of data with 10,000,000,000 rows. If you want a different dataset size and row count, simply comment out all other SIZE and ROWS lines, leaving only the one you want. Only one SIZE and one ROWS should be set (uncommented). This applies to all scripts (teragen.sh, terasort.sh, validate.sh), and all scripts must have the same SIZE and ROWS settings.

Logs

A logs directory is created relative to where you run the script; run output and stats are stored there. For example, if you run /home/sunile/teragen.sh, it will create the logs directory at /home/sunile/logs. All the logs from TeraGen, TeraSort, and TeraValidate will reside there.

Parameters

This is an important piece for tuning. To benchmark your environment, parameters should be configured. Much of this is trial and error; I would say experience is required here, i.e. knowing how each parameter impacts a MapReduce job. Get help here. For tuning, change or add parameters in the script. For ease of first-time execution, use the ones already set in the script. Run it as is and grab your stats. If the stats are acceptable, move on. What is acceptable? Take a look at the articles I published on BigStep and AWS. If the stats are not acceptable, start tuning.

Run the jobs in the following order (see the end-to-end sketch at the end of this post):

1. TeraGen (teragen.sh)
2. TeraSort (terasort.sh)
3. TeraValidate (validate.sh)

Hope these scripts help you quickly benchmark your environment. Now go build some cool stuff!
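As a rough end-to-end illustration (not taken from the repository), running the suite could look like this, assuming the scripts were downloaded to the /home/sunile example path used above:

# Hypothetical run; adjust SIZE/ROWS inside each script first and keep them identical across all three.
cd /home/sunile
chmod +x teragen.sh terasort.sh validate.sh
./teragen.sh     # 1. generate the dataset
./terasort.sh    # 2. sort the generated dataset
./validate.sh    # 3. validate the sorted output
ls logs/         # run output and stats are written to the logs directory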