Member since: 02-29-2016
Posts: 23
Kudos Received: 6
Solutions: 0
04-17-2017
11:39 PM
Basic, but have you tried restarting the history server? The script has no restart option, so stop and start it: ./sbin/stop-history-server.sh && ./sbin/start-history-server.sh
03-14-2017
05:47 AM
@Sachin Ambardekar, the doc above may be slightly dated. Rule of thumb: 4 GB per core seems to be the sweet spot for memory-intensive workloads, which are getting more common nowadays.
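For example (illustrative numbers, not from the doc): under that rule a 16-core worker node would be sized at roughly 16 × 4 GB = 64 GB of memory available to containers.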
03-14-2017
05:31 AM
Thanks man, it wasn't clear whether CB 1.6 could support ADLS. A better option than WASB shards with DASH, for sure.
03-10-2017
07:51 PM
Quick update: DASH is a package available from MSFT that allows "sharding" across multiple accounts: https://github.com/MicrosoftDX/Dash/tree/master/DashServer
03-10-2017
06:56 PM
1 Kudo
Hi, I have a use case where an HDP cluster on Azure is used for dev and test. Ideally, we would like to separate the dev and test data into 2 different WASB storage accounts. Is there a way to define multiple accounts and keys in core-site.xml? And how would it map onto the file system? Would it simply be wasb://mybucket[1-2]? Thanks!
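Something like the following sketch is what I have in mind for core-site.xml, with two hypothetical storage accounts named devstorage and teststorage (the hadoop-azure connector keys everything by account name rather than by a bucket index):

<!-- one key property per storage account; names and keys are placeholders -->
<property>
  <name>fs.azure.account.key.devstorage.blob.core.windows.net</name>
  <value>DEV_ACCOUNT_KEY</value>
</property>
<property>
  <name>fs.azure.account.key.teststorage.blob.core.windows.net</name>
  <value>TEST_ACCOUNT_KEY</value>
</property>

Paths would then name the account explicitly rather than an index, e.g. wasb://data@devstorage.blob.core.windows.net/path and wasb://data@teststorage.blob.core.windows.net/path.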
Labels:
- Hortonworks Data Platform (HDP)
02-09-2017
02:44 AM
1 Kudo
I had a similar use case recently. You have to approach this understanding that it's a different paradigm:
You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So for the first step, opening the log file and creating an RDD would look something like this:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CheckData")\
    .getOrCreate()
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])
You don't programmatically iterate on the data per se; instead you supply a function to process each value (lines, in this case). So your code where you iterate over lines could be put inside a function:

def virtualPortFunction(line):
    # Do something, return the processed output of a line
    ...

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))
This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route. Also look at the PySpark examples that ship with the Spark distribution; they're a good place to start: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
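Putting those pieces together, a minimal end-to-end sketch (the HDFS path and the per-token logic are placeholders, not from the original question):

from pyspark.sql import SparkSession

def virtualPortFunction(token):
    # Placeholder: replace with the real per-token processing
    return token.upper()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CheckData").getOrCreate()
    # Read the log file from HDFS as an RDD of lines
    lines = spark.read.text("hdfs:///tmp/Virtual_Ports.log").rdd.map(lambda r: r[0])
    # Split lines into tokens and apply the processing function to each
    results = lines.flatMap(lambda line: line.split(" ")).map(virtualPortFunction)
    # Pull a small sample back to the driver to eyeball the output
    for value in results.take(10):
        print(value)
    spark.stop()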
02-09-2017
01:59 AM
Assuming you're using the HDP 2.5 sandbox, another option would be to deploy the Zeppelin service in Ambari. The modules above are also included in Zeppelin.
10-24-2016
10:03 PM
Good point Tim. Each "SQL on Hadoop" implementation obviously has pros and cons... general rules of thumb:
SparkSQL --> good for iterative processing and accessing existing Hive tables, given the results fit in memory
HAWQ --> good for "traditional" BI-like queries, star schemas, OLAP cubes
Hive LLAP --> good for petabyte scale mixed with smaller tables requiring sub-second queries
Phoenix --> a good way to interact with HBase tables; good with time series, good indexing
Drill, Presto --> query-federation-like capabilities but limited SQL syntax; performance varies quite a bit
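As a quick illustration of the SparkSQL case, querying an existing Hive table from PySpark (the table name here is hypothetical):

from pyspark.sql import SparkSession

# enableHiveSupport lets Spark read tables registered in the Hive metastore
spark = SparkSession.builder \
    .appName("HiveQueryExample") \
    .enableHiveSupport() \
    .getOrCreate()

# "web_logs" is a made-up table name for illustration
df = spark.sql("SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")
df.show()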
09-07-2016
06:13 PM
Great article, cduby. Thanks!