Member since: 03-24-2016
Posts: 184
Kudos Received: 239
Solutions: 39

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2541 | 10-21-2017 08:24 PM
 | 1555 | 09-24-2017 04:06 AM
 | 5580 | 05-15-2017 08:44 PM
 | 1680 | 01-25-2017 09:20 PM
 | 5509 | 01-22-2017 11:51 PM
03-30-2016
01:56 PM
@Attila Kanto So if you have object storage and use Cloudbreak to install Hadoop, HDFS will sit on top of the object storage once everything is installed; I get that. But will the data still show up in HDFS if the cluster is taken down and then brought back up with Cloudbreak, or will it have to be reloaded?
03-30-2016
01:41 PM
I have recently built a demo that does something like this. Keep in mind that as soon as you add a CEP tool like Storm or a queuing system like Kafka or JMS into the architecture, your solution becomes asynchronous. For the synchronous portion, just set up a basic REST web service that makes a synchronous call to the backend. For the asynchronous-with-analytics part, try this:

- NiFi listens on HTTP to receive the REST web service call from the mobile app, shapes the request, and routes it to the correct Kafka topic in case you need multiple points of entry. In either case it gives you the flexibility to change the data model of the request and/or the response without having to change both the client and the server.
- Kafka queues the request and provides durable, reliable delivery.
- Storm consumes the request and applies whatever analytics you need, pulling whatever data is required from the data serving layer. You can use a cache or a data grid like Ignite or GemFire; however, unless you need sub-millisecond response or you are doing 250K TPS or more, I would just go with HBase as the data serving layer (HBase can handle 250K+ TPS, but you need more region servers and some tuning).
- At this point Storm has the response and can either post it back to Kafka or to HTTP, where NiFi can consume it and deliver it to the mobile app through something like Google Cloud Messaging.

This architecture will give you a very flexible, very near real-time asynchronous analytics platform that will scale as far as you want to go.
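To make the Kafka hand-off concrete, here is a minimal Scala sketch of publishing a request that NiFi has already shaped onto a topic. The broker address, the topic name mobile-requests, and the requestId/requestJson values are hypothetical placeholders; in practice NiFi's Kafka processors can perform this step for you, so this only shows the shape of the hand-off.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical broker address; point this at your Kafka cluster.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// The shaped request from NiFi, keyed by a correlation id so Storm's response
// can later be routed back to the right mobile client.
val requestId = "req-123"
val requestJson = """{"user":"42","action":"balanceCheck"}"""
producer.send(new ProducerRecord[String, String]("mobile-requests", requestId, requestJson))
producer.close()

Storm would consume from the same topic (for example via a Kafka spout), run the analytics, and publish the response to a second topic that NiFi watches.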
03-30-2016
01:14 PM
@Sridhar Babu M You can see the details of what Spark is doing by clicking on the application master link for the Spark job in the Resource Manager UI; it will take you to the Spark UI and show you the job in detail. You may just have to make sure that the Spark History Server is running in Ambari, or the page may come up blank. If you actually need to change the value in the file, then you will need to export the resulting DataFrame to a file. The save function that is part of the DataFrame class creates a file for each partition, so if you need a single file, convert back to an RDD and use coalesce(1) to get everything down to a single partition. Make sure that you add the dependency, either in Zeppelin with

%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")

or on the command line with

spark-shell --packages com.databricks:spark-csv_2.10:1.4.0

and then:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

case class Person(name: String, age: Int)

// Parse each line of the source file into a Person and convert to a DataFrame.
var personRDD = sc.textFile("/user/spark/people.txt")
var personDF = personRDD.map(x => x.split(",")).map(x => Person(x(0), x(1).trim.toInt)).toDF()
personDF.registerTempTable("people")

// Change the value: add 2 to Justin's age, leave everyone else as-is.
var agedPerson = personDF.map(x =>
  if (x.getAs[String]("name") == "Justin") Person(x.getAs[String]("name"), x.getAs[Int]("age") + 2)
  else Person(x.getAs[String]("name"), x.getAs[Int]("age"))).toDF()
agedPerson.registerTempTable("people")

var agedPeopleDF = sqlContext.sql("SELECT * FROM people")
agedPeopleDF.show

// Writes one CSV file per partition.
agedPeopleDF.select("name", "age").write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).save("agedPeople")

// Coalesce to a single partition to get a single output file.
var agedPeopleRDD = agedPeopleDF.rdd
agedPeopleRDD.coalesce(1).saveAsTextFile("agedPeopleSingleFile")
03-29-2016
06:17 PM
4 Kudos
Atlas is a great data governance tool, as it provides visibility into what data exists and how it got there. What use cases are people using Atlas for in production?
Labels:
- Apache Atlas
03-29-2016
06:14 PM
3 Kudos
Spark provides a lot of powerful capabilities for working with graph data structures. Which graph-oriented database is best to use in combination with Spark GraphX, and why?
Labels:
- Apache Spark
03-29-2016
06:02 PM
Just looked through the Metron project, and none of the POM files seem to reference Spark. I bet they are either using PMML or exporting weights. I did a bit more reading as well, and the more I think about it, the more it seems like that pattern is just not a great idea. Thanks for your input.
03-29-2016
05:42 PM
1 Kudo
+1 azeltov. We can take that one step further and make the Hive tables available through Spark SQL via JDBC from outside the cluster:

$SPARK_HOME/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port={port to listen} --hiveconf hive.server2.thrift.bind.host={host to bind to} --master yarn-client

This will start a HiveServer2 instance that has access to the metastore but turns the SQL into Spark instruction sets and RDDs under the covers. You should now be able to use a HiveServer2-compliant JDBC driver to connect, access the power of Spark SQL, and still leverage all of the existing investment and assets in Hive.
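For the client side, here is a minimal sketch, assuming the Thrift server above is listening on a hypothetical sparksql-host:10015 and the Hive JDBC driver is on the classpath:

import java.sql.DriverManager

// Hypothetical host/port; use the values passed to start-thriftserver.sh above.
val conn = DriverManager.getConnection("jdbc:hive2://sparksql-host:10015/default", "hive", "")
val stmt = conn.createStatement()

// The SQL is accepted by the Thrift server and executed as Spark jobs under the covers.
val rs = stmt.executeQuery("SELECT * FROM people LIMIT 10")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()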
03-29-2016
04:32 PM
@Neeraj Sabharwal I have built systems on IMDGs before and have read how the HDFS acceleration is supposed to work. I am asking whether you yourself have tried it and have any insight. Does it live up to what the marketing material claims it can do? Do you see potential for it in the field?
03-29-2016
04:27 PM
@azeltov How large is the implementation (how many tables, and how much is cached versus read through to the source)? In the case where they are caching an entire table, how do they ensure the data is not stale? Are the tables temporary or saved in the Hive metastore?
03-29-2016
02:54 PM
1 Kudo
Just wanted to point out that Hazelcast is an in-memory data grid (IMDG), not a data store. With the standard configuration, the data lives entirely in volatile memory. You can use a backing store to ensure that data is persisted between restarts, but the purpose of Hazelcast and IMDGs in general is application acceleration, not data storage. IMDGs are also capable of receiving and distributing instruction sets across the cluster (sending compute to the data), similar to Hadoop, and they can execute instructions on every individual get/put/delete operation that hits the cluster. At the moment, IMDGs are not designed to scale past several TB, so they would generally be used to augment a big data architecture, not replace it. However, the acceleration an IMDG can provide to an OLTP use case can be orders of magnitude.
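For a sense of what the acceleration layer looks like in practice, here is a minimal Scala sketch against the Hazelcast API; the map name and keys are hypothetical, and the backing-store wiring (Hazelcast's MapStore interface) that would persist entries between restarts is omitted:

import com.hazelcast.core.Hazelcast

// Start (or join) a Hazelcast member with default configuration;
// a production cluster would be configured explicitly.
val hz = Hazelcast.newHazelcastInstance()

// A distributed map acting as the in-memory acceleration layer.
val cache = hz.getMap[String, String]("customer-cache")

// Gets and puts are served from memory, partitioned across the grid,
// which is where the OLTP acceleration comes from.
cache.put("cust-42", """{"name":"Justin","age":21}""")
println(cache.get("cust-42"))

hz.shutdown()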