Member since: 02-26-2016
Posts: 100
Kudos Received: 111
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2591 | 05-04-2017 12:38 PM |
| | 5210 | 03-21-2017 06:18 PM |
| | 16004 | 03-21-2017 01:14 AM |
| | 5464 | 02-14-2017 06:21 PM |
| | 8867 | 02-09-2017 03:49 AM |
02-06-2017 07:25 PM
Thank you. Will review the links.
10-25-2017 05:03 PM
One thing I wish I had known when starting with Python UDFs is that you can write to stderr to assist in debugging, then look in the YARN RM for the logs:

```python
import sys
sys.stderr.write('>>>> Read a line \n' + line + '\n')
```
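To expand on that, here is a minimal sketch of a streaming-style Python UDF (for example, one invoked via Hive's TRANSFORM clause) that logs to stderr this way. The tab-separated layout and the uppercasing transform are illustrative assumptions, not code from this thread:

```python
#!/usr/bin/env python
# Hypothetical streaming UDF: reads tab-separated rows from stdin, logs a
# debug message per line to stderr (surfaced in the YARN container logs),
# and writes the transformed row back to stdout.
import sys

for line in sys.stdin:
    sys.stderr.write('>>>> Read a line\n' + line + '\n')  # debugging aid
    fields = line.rstrip('\n').split('\t')
    # Assumed transformation for illustration only: uppercase every field.
    sys.stdout.write('\t'.join(f.upper() for f in fields) + '\n')
```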
06-28-2017 09:19 AM
Hi, thanks for the tutorial. Do you know if it is possible to use "StoreInKiteDataset" with Kerberos to write to HDFS?
08-12-2018 01:50 PM
Great post, Binu! What storage format would you suggest if you plan on loading the Hive table into a DataFrame and running an iterative process (machine learning algorithm X) against the data? I'm hard-pressed to find any discussion of this concept.
11-30-2016 06:45 AM
1 Kudo
The RM keeps some completed application state in memory to render the UI. This is controlled by yarn.resourcemanager.max-completed-applications, whose default value is 10000, so at any time the RM needs roughly 1 GB of memory to hold these applications. You can try lowering yarn.resourcemanager.max-completed-applications to see the drop in RM heap utilization.
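As an illustration, lowering the property in yarn-site.xml would look like this (the value 1000 is only an example, not a recommendation from the post):

```xml
<property>
  <!-- Number of completed applications the RM keeps in memory for the UI.
       Default is 10000; lowering it reduces RM heap usage. -->
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>1000</value>
</property>
```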
11-22-2016 08:41 AM
3 Kudos
Hi @Binu Mathew, we have two different kinds of external database support in Cloudbreak for provisioned clusters: one for Ambari (technical preview) and the other for the Hive metastore. Unfortunately, to use either of them you have to use the Cloudbreak shell, because the first is officially available only in Cloudbreak and the second is an HDCloud-exclusive feature.

To configure RDS for Ambari, execute the database configure shell command before executing the cluster create command:

```
database configure --vendor --host --port --name --username --password
```

Hive metastore RDS configuration is available in the cluster create command:

```
cluster create --databaseType --connectionUserName --connectionPassword --connectionURL --hdpVersion
```

[update] RDS configuration for Cloudbreak itself is configurable in the Cloudbreak deployer's Profile file by adding the environment variables below:

```
CB_DB_PORT_5432_TCP_ADDR
CB_DB_PORT_5432_TCP_PORT
CB_DB_ENV_USER
CB_DB_ENV_PASS
CB_DB_ENV_DB
```
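A hypothetical Profile snippet with placeholder values (the variable names come from the post above; every value shown is illustrative):

```bash
# Cloudbreak deployer Profile: external RDS connection settings (placeholders)
export CB_DB_PORT_5432_TCP_ADDR=mydb.example.com  # database host
export CB_DB_PORT_5432_TCP_PORT=5432              # database port
export CB_DB_ENV_USER=cbuser                      # database user
export CB_DB_ENV_PASS=changeme                    # database password
export CB_DB_ENV_DB=cbdb                          # database name
```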
12-31-2016 03:25 AM
@Binu Mathew: Thanks for sharing the awesome article. Would you mind sharing the sample data?
07-10-2018 10:42 AM
Will it work for Spark version 2.3.0? Could you please update it for this version?
08-25-2016 02:08 PM
7 Kudos
SYNOPSIS
Which is faster when analyzing data using Spark 1.6.1: HDP with HDFS for storage, or EMR?
Testing shows that HDP using HDFS outperforms EMR: HDP/HDFS beat EMR by 46% when tested against one full day of 37 GB of Clickstream (Web) data.

| | HDP/HDFS | EMR |
|---|---|---|
| Time Elapsed | 3 mins, 29 sec | 5 mins, 5 sec |
* See the end of this article for validation and screen prints showing the Resource Manager logs.
HDP
Hortonworks Data Platform (HDP) is the industry's only truly secure, enterprise-ready open source Apache Hadoop distribution. The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system used to store all data in HDP.
EMR
Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
S3 is an inexpensive object store that can theoretically scale out infinitely, without the limitations inherent in a hierarchical block storage file system. Objects are not stored in file systems; instead, users create objects and associate keys with them. Object storage also gives you the option of tagging your data with metadata.
TEST
The test used Spark (PySpark) with DataFrames to get a count of page views by operating system (desktop and mobile OS types) against a full day of Clickstream data (24 hours), listing the top 20 most-used operating systems. The same Spark code was run against an AWS HDP cluster on EC2 instances with the data stored in HDFS, and against an AWS EMR cluster with the data stored in S3.
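For reference, here is a minimal sketch of the kind of DataFrame query described above, using the Spark 1.6-era API. This is not the article's exact code; the input path and the operating-system field name ("os") are assumptions:

```python
# Sketch of the benchmark query (Spark 1.6-era API); the path and the "os"
# field name are assumptions, not the article's actual schema.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="PageViewsByOS")
sqlContext = SQLContext(sc)

# Read the JSON page-view records (on EMR this would be an s3:// URI).
df = sqlContext.read.json("hdfs:///data/clickstream/")

# Count page views per operating system and show the 20 most used.
(df.groupBy("os")
   .count()
   .orderBy("count", ascending=False)
   .show(20))
```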
Test Data
COMPANY X is a global online marketplace connecting consumers with merchants.

- One full day of page view data (24 hours of Clickstream logs)
- 22.3 million page view records from 13 countries in North America, Latin America, and Asia
- Data is in JSON format and uncompressed
- 143 files totaling 37 GB; each file averages 256 MB
- All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR
Platform Versions
- HDP 2.3.0 (Hadoop version 2.7.1)
- EMR 4.5.0 (Hadoop version 2.7.2)
AWS HDP and EMR clusters were sized and configured similarly: m4.2xlarge instances, with 1 master and 4 worker nodes.
TEST RESULTS
Spark 1.6.1 on HDP/HDFS outperformed Spark 1.6.1 on EMR by 46%.

- Total elapsed time for HDP/HDFS: 3 minutes, 29 seconds
- Total elapsed time for EMR: 5 minutes, 5 seconds
TESTING VALIDATION
Sample JSON record
Total disk usage in HDFS consumed by all files is 37 GB
Source data consists of 143 JSON files. Each file averages 256 MB for a total data volume of 37 GB
Output produced. Operating system and total page view count:
HDP Resource Manager log
EMR Resource Manager log
09-05-2016 08:10 PM
The Databricks CSV library skips using core Spark. The map function in PySpark is run through a Python subprocess on each executor, whereas with Spark SQL and the Databricks CSV library everything goes through the Catalyst optimizer and the output is Java bytecode. Scala/Java is about 40% faster than Python when using core Spark, and I would guess that is the reason the second implementation is much faster. The CSV library is probably also much more efficient at breaking up the records, likely applying the split partition by partition as opposed to record by record.
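To make the contrast concrete, here is a sketch of the two approaches being compared (Spark 1.x-era API; the paths and options are placeholders, not the original poster's code):

```python
# Contrast of the two CSV-parsing approaches discussed above (Spark 1.x API).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="CsvParsingComparison")
sqlContext = SQLContext(sc)

# 1) Core Spark: the lambda runs in a Python subprocess on each executor,
#    splitting records one at a time.
rdd = sc.textFile("hdfs:///data/input.csv").map(lambda line: line.split(","))

# 2) Spark SQL with the Databricks CSV library (spark-csv): parsing is
#    planned by the Catalyst optimizer and executes as generated Java bytecode.
df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("hdfs:///data/input.csv"))
```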