Member since: 02-26-2016
Posts: 100
Kudos Received: 111
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2591 | 05-04-2017 12:38 PM |
| | 5210 | 03-21-2017 06:18 PM |
| | 16004 | 03-21-2017 01:14 AM |
| | 5464 | 02-14-2017 06:21 PM |
| | 8867 | 02-09-2017 03:49 AM |
02-06-2017 07:25 PM
Thank you. Will review the links.
10-25-2017 05:03 PM
One thing I wish I had known when starting with Python UDFs is that you can write to stderr to assist in debugging, then look in the YARN RM for the logs:

```python
import sys
sys.stderr.write('>>>> Read a line \n' + line + '\n')
```
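To expand on that, here is a minimal sketch of a streaming-style Python UDF (for example, one invoked via Hive's TRANSFORM clause) that logs to stderr this way. The tab-separated layout and the uppercasing transform are illustrative assumptions, not code from this thread:

```python
#!/usr/bin/env python
# Hypothetical streaming UDF: reads tab-separated rows from stdin, logs a
# debug message per line to stderr (surfaced in the YARN container logs),
# and writes the transformed row back to stdout.
import sys

for line in sys.stdin:
    sys.stderr.write('>>>> Read a line\n' + line + '\n')  # debugging aid
    fields = line.rstrip('\n').split('\t')
    # Assumed transformation for illustration only: uppercase every field.
    sys.stdout.write('\t'.join(f.upper() for f in fields) + '\n')
```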
06-28-2017 09:19 AM
Hi, thanks for the tutorial. Do you know if it is possible to use "StoreInKiteDataset" with Kerberos to write to HDFS?
08-12-2018 01:50 PM
Great post, Binu! What storage format would you suggest if you plan on loading the Hive table into a DataFrame and running an iterative process (machine learning algorithm X) against the data? I'm hard-pressed to find any discussion of this concept.
11-30-2016 06:45 AM
1 Kudo
The RM keeps some completed application state in memory to render the UI. This is controlled by yarn.resourcemanager.max-completed-applications, whose default value is 10000, so at any time the RM needs roughly 1 GB of memory to hold these applications. You can try lowering yarn.resourcemanager.max-completed-applications to see the drop in RM heap utilization.
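As an illustration, lowering the property in yarn-site.xml would look like this (the value 1000 is only an example, not a recommendation from the post):

```xml
<property>
  <!-- Number of completed applications the RM keeps in memory for the UI.
       Default is 10000; lowering it reduces RM heap usage. -->
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>1000</value>
</property>
```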
11-22-2016 08:41 AM
3 Kudos
Hi @Binu Mathew, we have two different kinds of external database support in Cloudbreak for provisioned clusters: one for Ambari (technical preview) and the other for the Hive metastore. Unfortunately, to use either of them you have to use the Cloudbreak shell, because the first is officially available only in Cloudbreak and the second is an HDCloud-exclusive feature.

To configure RDS for Ambari, execute the database configure shell command before executing the cluster create command:

```
database configure --vendor --host --port --name --username --password
```

Hive metastore RDS configuration is available in the cluster create command:

```
cluster create --databaseType --connectionUserName --connectionPassword --connectionURL --hdpVersion
```

[update] RDS configuration for Cloudbreak itself is configurable in the Cloudbreak deployer's Profile file by adding the environment variables below:

```
CB_DB_PORT_5432_TCP_ADDR
CB_DB_PORT_5432_TCP_PORT
CB_DB_ENV_USER
CB_DB_ENV_PASS
CB_DB_ENV_DB
```
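A hypothetical Profile snippet with placeholder values (the variable names come from the post above; every value shown is illustrative):

```bash
# Cloudbreak deployer Profile: external RDS connection settings (placeholders)
export CB_DB_PORT_5432_TCP_ADDR=mydb.example.com  # database host
export CB_DB_PORT_5432_TCP_PORT=5432              # database port
export CB_DB_ENV_USER=cbuser                      # database user
export CB_DB_ENV_PASS=changeme                    # database password
export CB_DB_ENV_DB=cbdb                          # database name
```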
12-31-2016 03:25 AM
@Binu Mathew: Thanks for sharing the awesome article. Would you mind sharing the sample data?
07-10-2018 10:42 AM
Will it work for Spark version 2.3.0? Could you please update it for this version?
08-25-2016 02:08 PM
7 Kudos
SYNOPSIS
Which is faster when analyzing data using Spark 1.6.1: HDP with HDFS for storage, or EMR?
Testing shows that HDP using HDFS outperforms EMR: HDP/HDFS beat EMR by 46% when tested against one full day of 37 GB of Clickstream (Web) data.

| | HDP/HDFS | EMR |
|---|---|---|
| Time Elapsed | 3 mins, 29 sec | 5 mins, 5 sec |
* See the end of this article for validation and screen prints showing the Resource Manager logs.
HDP
Hortonworks Data Platform (HDP) is the industry's only truly secure, enterprise-ready open source Apache Hadoop distribution. The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system used to store all data in HDP.
EMR
Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
S3 is an inexpensive object store that can theoretically scale out infinitely, without the limitations inherent in a hierarchical block storage file system. Objects are not stored in file systems; instead, users create objects and associate keys with them. Object storage also gives you the option of tagging your data with metadata.
TEST
The test used Spark (PySpark) with DataFrames to get a count of page views by operating system (desktop and mobile OS types) against a full day of Clickstream data (24 hours), listing the top 20 most-used operating systems. The same Spark code was run against an AWS HDP cluster on EC2 instances with the data stored in HDFS, and against an AWS EMR cluster with the data stored in S3.
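For reference, here is a minimal sketch of the kind of DataFrame query described above, using the Spark 1.6-era API. This is not the article's exact code; the input path and the operating-system field name ("os") are assumptions:

```python
# Sketch of the benchmark query (Spark 1.6-era API); the path and the "os"
# field name are assumptions, not the article's actual schema.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="PageViewsByOS")
sqlContext = SQLContext(sc)

# Read the JSON page-view records (on EMR this would be an s3:// URI).
df = sqlContext.read.json("hdfs:///data/clickstream/")

# Count page views per operating system and show the 20 most used.
(df.groupBy("os")
   .count()
   .orderBy("count", ascending=False)
   .show(20))
```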
Test Data
COMPANY X is a global online marketplace connecting consumers with merchants.

- One full day of page view data (24 hours of Clickstream logs)
- 22.3 million page view records from 13 countries in North America, Latin America, and Asia
- Data is in JSON format and uncompressed
- 143 files totaling 37 GB; each file averages 256 MB
- All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR
Platform Versions
- HDP 2.3.0 (Hadoop version 2.7.1)
- EMR 4.5.0 (Hadoop version 2.7.2)
AWS HDP and EMR clusters were sized and configured similarly: m4.2xlarge instances, with 1 master and 4 worker nodes.
TEST RESULTS
Spark 1.6.1 on HDP/HDFS outperformed Spark 1.6.1 on EMR by 46%.

- Total elapsed time for HDP/HDFS: 3 minutes, 29 seconds
- Total elapsed time for EMR: 5 minutes, 5 seconds
TESTING VALIDATION
Sample JSON record
Total disk usage in HDFS consumed by all files is 37 GB
Source data consists of 143 JSON files. Each file averages 256 MB for a total data volume of 37 GB
Output produced. Operating system and total page view count:
HDP Resource Manager log
EMR Resource Manager log
09-05-2016 08:10 PM
The Databricks CSV library skips using core Spark. The map function in PySpark is run through a Python subprocess on each executor, whereas with Spark SQL and the Databricks CSV library everything goes through the Catalyst optimizer and the output is Java bytecode. Scala/Java is about 40% faster than Python when using core Spark, and I would guess that is the reason the second implementation is much faster. The CSV library is probably also much more efficient at breaking up the records, likely applying the split partition by partition as opposed to record by record.
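To make the contrast concrete, here is a sketch of the two approaches being compared (Spark 1.x-era API; the paths and options are placeholders, not the original poster's code):

```python
# Contrast of the two CSV-parsing approaches discussed above (Spark 1.x API).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="CsvParsingComparison")
sqlContext = SQLContext(sc)

# 1) Core Spark: the lambda runs in a Python subprocess on each executor,
#    splitting records one at a time.
rdd = sc.textFile("hdfs:///data/input.csv").map(lambda line: line.split(","))

# 2) Spark SQL with the Databricks CSV library (spark-csv): parsing is
#    planned by the Catalyst optimizer and executes as generated Java bytecode.
df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("hdfs:///data/input.csv"))
```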