Member since: 02-29-2016
23 Posts
6 Kudos Received
0 Solutions
04-17-2017
11:39 PM
Basic, but have you tried restarting the history server? ./sbin/stop-history-server.sh followed by ./sbin/start-history-server.sh
03-14-2017
09:50 PM
I have a 12-node cluster of D12v2 instances (4 cores, 28 GB RAM, 200 GB local disk). We're copying about 100 GB from the cluster's local HDFS to ADLS or WASB. I believe Blob Storage is capped at 1000 IOPS per account, so I was wondering what throughput to expect when copying to ADLS with a conservative number of mappers (36-48). Thanks!
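For reference, the copy itself would be a plain DistCp run along these lines; a minimal sketch, where the account name, paths and mapper count are placeholders to adapt:
hadoop distcp -m 36 hdfs:///apps/data adl://myadls.azuredatalakestore.net/data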
03-14-2017
05:47 AM
@Sachin Ambardekar, the doc above may be slightly dated. As a rule of thumb, 4 GB per core seems to be the sweet spot for the memory-intensive workloads that are getting more common nowadays; for example, a 16-core worker node would warrant roughly 64 GB of RAM.
03-14-2017
05:31 AM
Thanks man, it wasn't clear whether CB 1.6 could support ADLS. Definitely a better option than sharding WASB accounts with DASH.
03-10-2017
07:51 PM
Quick update: DASH is a package available from MSFT that allows "sharding" across multiple storage accounts: https://github.com/MicrosoftDX/Dash/tree/master/DashServer
03-10-2017
06:56 PM
1 Kudo
Hi, I have a use case where an HDP cluster on Azure is used for dev and test. Ideally, we would like to separate the dev and test data into 2 different WASB storage accounts. Is there a way to define multiple accounts and keys in core-site.xml? And how would that map onto the file system? Would it simply be wasb://mybucket[1-2]? Thanks!
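For anyone landing here, a minimal sketch of what multiple accounts could look like in core-site.xml; the account names devaccount and testaccount are placeholders, and each account is then addressed on the file system as wasb://<container>@<account>.blob.core.windows.net/<path>:
<property>
  <name>fs.azure.account.key.devaccount.blob.core.windows.net</name>
  <value>DEV_ACCOUNT_KEY</value>
</property>
<property>
  <name>fs.azure.account.key.testaccount.blob.core.windows.net</name>
  <value>TEST_ACCOUNT_KEY</value>
</property>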
02-10-2017
07:02 PM
Correct. However, these are not necessarily physical disks.
02-09-2017
02:44 AM
1 Kudo
I had a similar use case recently. You have to approach this understanding that it's a different paradigm:
You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

# Read the log file from HDFS and keep each row's single column as a plain string
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])

You don't programmatically iterate over the data per se; instead you supply a function to process each value (lines in this case). So the code where you iterate over lines could be put inside a function:

def virtualPortFunction(line):
    # Do something, return the processed output of a line
    return line

# Split each line into tokens and run the function over each one
virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))

This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route. Also look at the PySpark examples that ship with the Spark distribution; they're a good place to start. https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
02-09-2017
01:59 AM
Assuming you're using the HDP 2.5 sandbox, another option would be to deploy the Zeppelin service in Ambari. The modules above are also included in Zeppelin.
02-06-2017
09:32 PM
Vitaly, it looks like these are the Maven artifacts used to package the HDP distribution. I suggest using the ones from Maven Central: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22
02-02-2017
07:27 AM
Vishal, you should be able to access Vora with HANA's JDBC driver. You then just need to map it to a PySpark or SparkR context. Look it up, it's easy to find online... or let me know if you're stuck.
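For the PySpark side, a minimal sketch of reading a table over JDBC; the host, port, table name, credentials and the com.sap.db.jdbc.Driver class are assumptions to adapt to your setup, and the HANA JDBC jar must be on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VoraJdbc").getOrCreate()

# Read a table over JDBC into a Spark DataFrame (connection details are placeholders)
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sap://hana-host:30015") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("dbtable", "MYSCHEMA.MYTABLE") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .load()
df.show()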
12-14-2016
08:03 PM
Houssam, dynamic allocation should help. How many executors are you currently using?
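If it helps, a minimal sketch of turning dynamic allocation on from PySpark; the executor bounds are placeholders to tune, and it also requires the external shuffle service to be enabled on the NodeManagers:

from pyspark.sql import SparkSession

# Enable dynamic allocation with loose bounds on the executor count (values are placeholders)
spark = SparkSession.builder \
    .appName("DynamicAllocationExample") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "20") \
    .getOrCreate()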
10-24-2016
10:34 PM
1 Kudo
This is pretty cool Simon, thank you. I've been playing with the idea of building a "podcast indexer" using NiFi, voice-to-text, and Solr. What if I could search all podcasts from the last week mentioning "election" and get the audio link and seek position... 🙂
10-24-2016
10:03 PM
Good point Tim. Each "SQL on Hadoop" implementation obviously has its pros and cons. General rules of thumb:
SparkSQL --> good for iterative processing and accessing existing Hive tables, provided results fit in memory
HAWQ --> good for "traditional" BI-like queries, star schemas, OLAP cubes
Hive LLAP --> good at petabyte scale mixed with smaller tables requiring sub-second queries
Phoenix --> a good way to interact with HBase tables; good with time series, good indexing
Drill, Presto --> query-federation-like capabilities but limited SQL syntax; performance varies quite a bit
09-07-2016
06:13 PM
Great article cduby. Thanks!
06-09-2016
06:45 PM
Rajkumar, have you tried connecting directly with the Hive JDBC driver? I suspect it's a jar conflict somewhere. Here's my Hive driver config in IntelliJ; I obviously took the shotgun approach and added all the client jars, but the main ones required are hive-common and hive-jdbc.
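For reference, a minimal sketch of the driver settings, assuming the default HiveServer2 port (adjust host, port and database to your cluster):
Driver class: org.apache.hive.jdbc.HiveDriver
JDBC URL: jdbc:hive2://<hiveserver2-host>:10000/default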
06-09-2016
06:33 PM
Rupak, unfortunately we don't; the main reason being the 4 GB memory limitation of a 32-bit OS, and HDP requires at least 8 GB of RAM to run smoothly. Are you using VMware Workstation or VirtualBox? Remember, even if your physical machine has an x86-64 CPU, the OS on which you run the virtual host must also be 64-bit, which I suspect is why you still have the issue even after turning on the CPU options in the BIOS. I would suggest looking into the Azure Marketplace; the HDP 2.4 sandbox is available there and you can get a 30-day free trial.
06-09-2016
06:27 PM
I have not recently, but you've raised my curiosity :). Doing some research, their sandbox appears to be built on top of HDP 2.2, so the functionality the HDP sandbox provides should also be available (Ambari, Hive, etc.). I'm going to download it and take a look; I'll keep you posted. If you're looking for an entity discovery/analytics/lineage product in general, I would suggest looking into Novetta, another partner of ours. https://hortonworks.com/wp-content/uploads/2014/05/Novetta-Entity-Analytics-and-Hortonworks-Solution-Overview.pdf They have a sandbox available on the AWS Marketplace as well, built on HDP. It comes with a comprehensive tutorial too.
06-09-2016
02:41 AM
1 Kudo
Manoj, here's a helpful webcast on how HDP and Waterline can work together to implement a data lake with data governance. http://bit.ly/1XbjOzq
05-13-2016
03:22 AM
2 Kudos
@rbalam, what you're trying to do looks good on paper, but remember that cloud provisioning is always "instance" based (yes, the obvious :)). Without getting too philosophical here, think about how Docker would manage a pool of physical hardware and know where to deploy those containers. So, long story short, I would strongly suggest setting up OpenStack; it's relatively straightforward and will save you time. Been there, done that a few times. Feel free to reach out if you need any help.