Member since: 02-29-2016
23 Posts
6 Kudos Received
0 Solutions
04-17-2017
11:39 PM
Basic, but have you tried restarting the history server? ./sbin/stop-history-server.sh followed by ./sbin/start-history-server.sh
03-14-2017
09:50 PM
I have a 12-node cluster of D12v2 instances (4 cores, 28 GB RAM, 200 GB local disk). We're copying about 100 GB from the cluster's local HDFS to ADLS or WASB. I believe Blob Storage is capped at 1000 IOPS per account, so I was wondering what throughput to expect when copying to ADLS with a conservative number of mappers (36-48). Thanks!
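For reference, the copy itself would be a plain DistCp run along these lines; a minimal sketch, where the account name, paths and mapper count are placeholders to adapt:
hadoop distcp -m 36 hdfs:///apps/data adl://myadls.azuredatalakestore.net/data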
03-14-2017
05:47 AM
@Sachin Ambardekar, the doc above may be slightly dated. As a rule of thumb, 4 GB per core seems to be the sweet spot for the memory-intensive workloads that are getting more common nowadays; for example, a 16-core worker node would warrant roughly 64 GB of RAM.
03-14-2017
05:31 AM
Thanks man, it wasn't clear whether CB 1.6 could support ADLS. Definitely a better option than sharding WASB accounts with DASH.
03-10-2017
07:51 PM
Quick update: DASH is a package available from MSFT that allows "sharding" across multiple storage accounts: https://github.com/MicrosoftDX/Dash/tree/master/DashServer
03-10-2017
06:56 PM
1 Kudo
Hi, I have a use case where an HDP cluster on Azure is used for dev and test. Ideally, we would like to separate the dev and test data into 2 different WASB storage accounts. Is there a way to define multiple accounts and keys in core-site.xml? And how would that map onto the file system? Would it simply be wasb://mybucket[1-2]? Thanks!
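For anyone landing here, a minimal sketch of what multiple accounts could look like in core-site.xml; the account names devaccount and testaccount are placeholders, and each account is then addressed on the file system as wasb://<container>@<account>.blob.core.windows.net/<path>:
<property>
  <name>fs.azure.account.key.devaccount.blob.core.windows.net</name>
  <value>DEV_ACCOUNT_KEY</value>
</property>
<property>
  <name>fs.azure.account.key.testaccount.blob.core.windows.net</name>
  <value>TEST_ACCOUNT_KEY</value>
</property>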
02-10-2017
07:02 PM
Correct. However, these are not necessarily physical disks.
02-09-2017
02:44 AM
1 Kudo
I had a similar use case recently. You have to approach this understanding that it's a different paradigm:
You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

# Read the log file from HDFS and keep each row's single column as a plain string
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])

You don't programmatically iterate over the data per se; instead you supply a function to process each value (lines in this case). So the code where you iterate over lines could be put inside a function:

def virtualPortFunction(line):
    # Do something, return the processed output of a line
    return line

# Split each line into tokens and run the function over each one
virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))

This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route. Also look at the PySpark examples that ship with the Spark distribution; they're a good place to start. https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
02-09-2017
01:59 AM
Assuming you're using the HDP 2.5 sandbox, another option would be to deploy the Zeppelin service in Ambari. The modules above are also included in Zeppelin.
02-06-2017
09:32 PM
Vitaly, it looks like these are the Maven artifacts used to package the HDP distribution. I suggest using the ones from Maven Central: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22
02-02-2017
07:27 AM
Vishal, you should be able to access Vora with HANA's JDBC driver. You then just need to map it to a PySpark or SparkR context. Look it up, it's easy to find online... or let me know if you're stuck.
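For the PySpark side, a minimal sketch of reading a table over JDBC; the host, port, table name, credentials and the com.sap.db.jdbc.Driver class are assumptions to adapt to your setup, and the HANA JDBC jar must be on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VoraJdbc").getOrCreate()

# Read a table over JDBC into a Spark DataFrame (connection details are placeholders)
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sap://hana-host:30015") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("dbtable", "MYSCHEMA.MYTABLE") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .load()
df.show()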
12-14-2016
08:03 PM
Houssam, dynamic allocation should help. How many executors are you currently using?
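If it helps, a minimal sketch of turning dynamic allocation on from PySpark; the executor bounds are placeholders to tune, and it also requires the external shuffle service to be enabled on the NodeManagers:

from pyspark.sql import SparkSession

# Enable dynamic allocation with loose bounds on the executor count (values are placeholders)
spark = SparkSession.builder \
    .appName("DynamicAllocationExample") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "20") \
    .getOrCreate()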
10-24-2016
10:34 PM
1 Kudo
This is pretty cool Simon, thank you. I've been playing with the idea of building a "podcast indexer" using NiFi, voice-to-text, and Solr. What if I could search all podcasts from the last week mentioning "election" and get the audio link and seek position... 🙂
10-24-2016
10:03 PM
Good point Tim. Each "SQL on Hadoop" implementation obviously has its pros and cons. General rules of thumb:
SparkSQL --> good for iterative processing and accessing existing Hive tables, provided results fit in memory
HAWQ --> good for "traditional" BI-like queries, star schemas, OLAP cubes
Hive LLAP --> good at petabyte scale mixed with smaller tables requiring sub-second queries
Phoenix --> a good way to interact with HBase tables; good with time series, good indexing
Drill, Presto --> query-federation-like capabilities but limited SQL syntax; performance varies quite a bit
09-07-2016
06:13 PM
Great article cduby. Thanks!
06-09-2016
06:45 PM
Rajkumar, have you tried connecting directly with the Hive JDBC driver? I suspect it's a jar conflict somewhere. Here's my Hive driver config in IntelliJ; I obviously took the shotgun approach and added all the client jars, but the main ones required are hive-common and hive-jdbc.
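For reference, a minimal sketch of the driver settings, assuming the default HiveServer2 port (adjust host, port and database to your cluster):
Driver class: org.apache.hive.jdbc.HiveDriver
JDBC URL: jdbc:hive2://<hiveserver2-host>:10000/default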
06-09-2016
06:33 PM
Rupak, unfortunately we don't; the main reason being the 4 GB memory limitation of a 32-bit OS, and HDP requires at least 8 GB of RAM to run smoothly. Are you using VMware Workstation or VirtualBox? Remember, even if your physical machine has an x86-64 CPU, the OS on which you run the virtual host must also be 64-bit, which I suspect is why you still have the issue even after turning on the CPU options in the BIOS. I would suggest looking into the Azure Marketplace; the HDP 2.4 sandbox is available there and you can get a 30-day free trial.
06-09-2016
06:27 PM
I have not recently, but you've raised my curiosity :). Doing some research, their sandbox appears to be built on top of HDP 2.2, so the functionality the HDP sandbox provides should also be available (Ambari, Hive, etc.). I'm going to download it and take a look; I'll keep you posted. If you're looking for an entity discovery/analytics/lineage product in general, I would suggest looking into Novetta, another partner of ours. https://hortonworks.com/wp-content/uploads/2014/05/Novetta-Entity-Analytics-and-Hortonworks-Solution-Overview.pdf They have a sandbox available on the AWS Marketplace as well, built on HDP. It comes with a comprehensive tutorial too.
06-09-2016
02:41 AM
1 Kudo
Manoj, here's a helpful webcast on how HDP and Waterline can work together to implement a data lake with data governance. http://bit.ly/1XbjOzq
05-13-2016
03:22 AM
2 Kudos
@rbalam, what you're trying to do looks good on paper, but remember that cloud provisioning is always "instance" based (yes, the obvious :)). Without getting too philosophical here, think about how Docker would manage a pool of physical hardware and know where to deploy those containers. So, long story short, I would strongly suggest setting up OpenStack; it's relatively straightforward and will save you time. Been there, done that a few times. Feel free to reach out if you need any help.