Member since: 03-24-2016
Posts: 184
Kudos Received: 239
Solutions: 39
My Accepted Solutions
Title | Views | Posted
---|---|---
| 2608 | 10-21-2017 08:24 PM
| 1610 | 09-24-2017 04:06 AM
| 5704 | 05-15-2017 08:44 PM
| 1729 | 01-25-2017 09:20 PM
| 5700 | 01-22-2017 11:51 PM
04-28-2016
07:49 PM
3 Kudos
Modern data-driven applications require a "Connected Platform" capable of moving data to and from the Internet of Things, mobile users, and social media in real time. In order to monetize all of that real-time data, the platform must be able to process petabytes of data to create adaptive learning algorithms and apply those algorithms in real time as the data streams in and out of the platform. However, the modern data application cannot be effectively utilized or operated without an application tier that allows the business to visualize, interact with, and act on the massive volumes of data and insight flowing through the platform in real time and accumulating in storage. The Hortonworks Connected Platform (HDP+HDF) can act as a PaaS that hosts the application tier of the modern data application alongside all of the data processing. It is possible to use Apache Slider to run a Dockerized application, managed by YARN, inside the Hadoop cluster, much like an application PaaS. This can be accomplished as follows:

1. Create a web application project that includes the application server embedded in the package. The resulting package should be runnable on its own, much like a Java runnable jar. This can be accomplished using Maven. Here is an example of the application packaging:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>BigData</groupId>
<artifactId>ShopFloorUI</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<docker.registry.name></docker.registry.name>
<docker.repository.name>${docker.registry.name}vvaks/biologicsmanufacturingui</docker.repository.name>
<tomcat.version>7.0.57</tomcat.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.tomcat.embed</groupId>
<artifactId>tomcat-embed-core</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat.embed</groupId>
<artifactId>tomcat-embed-logging-juli</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat.embed</groupId>
<artifactId>tomcat-embed-jasper</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-jasper</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-jasper-el</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-jsp-api</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-client</artifactId>
<version>9.3.6.v20151106</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
<version>9.3.6.v20151106</version>
</dependency>
<dependency>
<groupId>org.cometd.java</groupId>
<artifactId>cometd-api</artifactId>
<version>1.1.5</version>
</dependency>
<dependency>
<groupId>org.cometd.java</groupId>
<artifactId>cometd-java-client</artifactId>
<version>3.0.7</version>
</dependency>
<dependency>
<groupId>javax.el</groupId>
<artifactId>el-api</artifactId>
<version>2.2</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>javax.servlet.jsp</groupId>
<artifactId>jsp-api</artifactId>
<version>2.2</version>
</dependency>
<dependency>
<groupId>javax.servlet.jsp.jstl</groupId>
<artifactId>javax.servlet.jsp.jstl-api</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.13</version>
</dependency>
</dependencies>
<build>
<finalName>${project.artifactId}</finalName>
<sourceDirectory>src/</sourceDirectory>
<resources>
<resource>
<directory>src/main/webapp</directory>
<targetPath>META-INF/resources</targetPath>
</resource>
<resource>
<directory>src/main/resources</directory>
<targetPath>META-INF/resources</targetPath>
</resource>
</resources>
<outputDirectory>classes/</outputDirectory>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>com.hortonworks.iot.shopfloorui.ShopFloorUIMain</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-war-plugin</artifactId>
<version>2.1.1</version>
<configuration>
<webappDirectory>webapp/</webappDirectory>
<finalName>ShopFloorUI</finalName>
</configuration>
</plugin>
<plugin>
<groupId>org.jolokia</groupId>
<artifactId>docker-maven-plugin</artifactId>
<version>0.13.3</version>
<configuration>
<images>
<image>
<!-- <alias>${project.artifactId}</alias>
<name>${docker.repository.name}:${project.version}</name> -->
<alias>biologicsmanufacturingui</alias>
<name>${docker.repository.name}</name>
<build>
<from>java:8-jre</from>
<maintainer>vvaks</maintainer>
<assembly>
<descriptor>docker-assembly.xml</descriptor>
</assembly>
<ports>
<port>8090</port>
</ports>
<cmd>
<shell>java -jar \
/maven/ShopFloorUI-jar-with-dependencies.jar server \
/maven/docker-config.yml</shell>
</cmd>
</build>
</image>
</images>
</configuration>
</plugin>
</plugins>
</build>
</project>
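The docker-maven-plugin section of the pom above references a docker-assembly.xml descriptor that is not shown in the post. As a rough sketch (the jar name comes from the assembly plugin configuration above; your actual descriptor may differ), it could simply copy the runnable jar into the image, where the plugin exposes the assembly under /maven:

<!-- docker-assembly.xml: minimal sketch, assuming the jar-with-dependencies built above is all the image needs -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <id>docker</id>
  <files>
    <file>
      <!-- assumes the default target/ build directory -->
      <source>target/ShopFloorUI-jar-with-dependencies.jar</source>
      <outputDirectory>/</outputDirectory>
    </file>
  </files>
</assembly>

With Docker running locally, building and later publishing the image (steps 2 and 3 below) can then be driven by the plugin's goals, roughly:

mvn clean package docker:build
mvn docker:push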
2. Create a Docker image that contains the runnable package and the command to start that package on startup. The docker-maven-plugin automates building the image from the runnable jar created by the Maven assembly plugin. In order for the plugin to work, Docker must be installed and accessible from the Eclipse session.

3. Create an account on https://hub.docker.com/. You will need this account to publish the Docker image that you created locally. This is important because the Slider client will attempt to download the image from Docker Hub, not from the local repository. Otherwise it would be necessary to distribute the image to every single node in the cluster, since YARN can decide to start it on any node manager.

4. Create the Slider configuration files:

appConfig.json - This file contains the command that the node manager will execute to start the Docker container locally, as well as the command to run periodically to check the health of the container. The example below starts two Docker containers, one called MAPUI and another called COMETD.

{
"schema": "http://example.org/specification/v2.0.0",
"metadata": {},
"global": {},
"components": {
"MAPUI": {
"mapui.commandPath": "/usr/bin/docker",
"mapui.options":"-d --net=host",
"mapui.statusCommand":"docker inspect -f {{.State.Running}} ${CONTAINER_ID} | grep true"
},
"COMETD": {
"cometd.commandPath": "/usr/bin/docker",
"cometd.options":"-d --net=host",
"cometd.statusCommand":"docker inspect -f {{.State.Running}} ${CONTAINER_ID} | grep true"
}
}
}
metainfo.json - This file specifies the image to download from Docker Hub, as well as the ports the container listens on. The component names must match across all three configuration files.

{
"schemaVersion": "2.1",
"application": {
"name": "MAPUI",
"components": [
{
"name": "MAPUI",
"type": "docker",
"dockerContainers": [
{
"name": "mapui",
"commandPath": "/usr/bin/docker",
"image": "vvaks/mapui",
"ports": [{"containerPort" : "8091", "hostPort" : "8091"}]
}
]
},
{
"name": "COMETD",
"type": "docker",
"dockerContainers": [
{
"name": "cometd",
"commandPath": "/usr/bin/docker",
"image": "vvaks/cometd",
"ports": [{"containerPort" : "8090", "hostPort" : "8090"}]
}
]
}
]
}
}
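For intuition, the commandPath and options from appConfig.json combine with the image from metainfo.json into something close to the following invocation on whichever node manager YARN picks. This is an approximation for illustration only, not the exact command line Slider generates (Slider also wires in the container name and port mappings):

/usr/bin/docker run -d --net=host vvaks/mapui
/usr/bin/docker run -d --net=host vvaks/cometd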
resources.json - This file describes the resources required by the application. Slider will use these specifications to request the required containers from YARN. The component names must match across all three configuration files.

{
"schema": "http://example.org/specification/v2.0.0",
"metadata": { },
"global": { },
"components": {"slider-appmaster": { },
"MAPUI": {
"yarn.role.priority": "1",
"yarn.component.instances": "1",
"yarn.memory": "256"
},
"COMETD": {
"yarn.role.priority": "2",
"yarn.component.instances": "1",
"yarn.memory": "256"
}
}
}
5. Make sure that a Slider client is available on the host from which you will launch the request and that it is configured to point at the target YARN cluster's Resource Manager. Then create the application:

slider create mapui --template /home/docker/dockerbuild/mapui/appConfig.json --metainfo /home/docker/dockerbuild/mapui/metainfo.json --resources /home/docker/dockerbuild/mapui/resources.json

Slider will reach out to YARN, request the containers specified in resources.json, and then instruct YARN to run the command specified in appConfig.json with the details specified in metainfo.json. At this point you should see the application listed as a Slider-type application in the YARN Resource Manager UI. You should be able to click on the application link and view the logs generated by the containers as the application starts up. Of course, Docker must be installed and running on the nodes that make up the queue where Slider will request the application to start. A few additional Slider client commands that are useful once the application is running are sketched at the end of this post.

It should be noted that this approach does not solve all of the problems that a PaaS does. The issue of an application instance registry still has to be dealt with: there is no out-of-the-box mechanism for discovering the application and routing clients to it after it starts or upon container failure. The following link addresses how to deal with this issue: https://slider.incubator.apache.org/design/registry/a_YARN_service_registry.html

All of these issues will be addressed by the YARN.Next initiative, which the HDP engineering team is hard at work on. YARN.Next will embed all of the capabilities described above in core YARN. This will allow the creation of a modern data application, including all components like Storm, HBase, and the application tier, by simply providing YARN with a JSON descriptor. The application will start with all of the required components pre-integrated and discoverable via standard DNS resolution. Stay tuned for the next installment.

For working examples, check out these repos. Each is a working example of a modern data application running on the Hortonworks Connected Platform, including the application tier.

https://community.hortonworks.com/content/repo/27236/credit-fraud-prevention-demo.html
https://community.hortonworks.com/content/repo/29196/biologics-manufacturing-optimization-demo.html
https://community.hortonworks.com/content/repo/26288/telecom-predictive-maintenance.html
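As referenced above, here is a sketch of a few handy Slider client commands once the application is up (the application name mapui matches the create command above):

slider list                               # show running Slider applications
slider status mapui                       # dump current component/container status
slider flex mapui --component MAPUI 2     # request a second MAPUI container from YARN
slider stop mapui                         # stop (freeze) the application
slider destroy mapui                      # remove the application definition
yarn application -list                    # confirm the application from the YARN side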
04-28-2016
12:43 PM
@Jeeva Jeeva Your library dependency for SparkSQL looks to be 1.0.0:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.1.0"

The toJSON method was not added until Spark 1.2. Load SparkSQL 1.2 and then give it another try.
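For example, a sketch with both artifacts aligned on the same release (here 1.2.0; any 1.2+ release should do):

libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.2.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.2.0"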
04-27-2016
06:57 PM
8 Kudos
Data Federation discussions are becoming more and more commonplace as organizations embark on their Big Data journey. New data platforms like the Hortonworks Connected Platform (HDP+HDF), NoSQL, and NewSQL data stores are reducing the cost and difficulty of storing and working with vast volumes of data, empowering organizations to leverage and monetize their data like never before. However, legacy data infrastructures still play an important role in the overall technology architecture. In order to fully realize the power of both the new and traditional data platforms, it is often necessary to integrate the data. One obvious approach is to simply move the data from where it sits in the existing data platform over to the target data platform. However, in many cases it is desirable to leave the data in place and enable a "Federation" tier to act as a single point of access to data from multiple sources. For details on the concepts and implementation of Data Federation see https://community.hortonworks.com/articles/27387/virtual-integration-of-hadoop-with-external-system.html. This article focuses on how to use SparkSQL to integrate, expose, and accelerate multiple sources of data from a single "Federation" tier.

First, it is important to point out that SparkSQL is not a pure Data Federation tool and hence does not have some of the really advanced capabilities generally associated with Data Federation. SparkSQL does not facilitate predicate pushdown to the source system beyond the query that defines what data from the underlying source should be made available through SparkSQL. Also, because it was not designed to be a true Data Federation engine, there is no user-friendly interface to easily set up the external sources, the schemas associated with the target data, or the ingest of the target data. All of this work has to be done through the SparkSQL API and requires relatively advanced knowledge of Spark and of data architecture principles in general. For these reasons, SparkSQL will not be the right solution in every Data Federation scenario. However, what SparkSQL lacks in terms of an "easy button" it makes up for in versatility, relatively low cost, sheer processing potential, and in-memory capabilities.

SparkSQL exposes most of its capabilities via the Data Frame API and the SQL context. Data can be ingested into Spark's native data structure (the RDD) from an RDBMS, from HDFS (including Hive/HBase/Phoenix), and generally from any source that has an API Spark can access (HTTP/JDBC/ODBC/NoSQL/cloud storage). The Data Frame allows the definition of a schema and the application of that schema to the RDD containing the target data. Once the data has been transformed into a Data Frame with a schema, it is a single line of code away from becoming what looks exactly like a relational table. That table can then be stored in Hive (assuming a Hive context was created) if it needs to be accessed on a regular basis, or registered as a temp table that exists only as long as the parent Spark application and its executors (the application can run indefinitely). If enough resources are available and really fast query responses are required, any or all of the tables can be cached and served from memory. Assuming a properly tuned infrastructure, and a clear understanding of how and when the data changes, this can make query response times extremely fast. Imagine caching the main fact table and leveraging map joins for the dimension tables.

All of the tables that have been registered can then be made available as a JDBC/ODBC data source via the Spark Thrift server. The Spark Thrift server supports virtually the same API and many of the features supported by the battle-tested Hive Thrift server. At this point, OLAP and reporting BI tools can be used to display data from far and wide across the organization's enterprise data architecture. As stated earlier, it is certainly not the right choice in every situation and must be thought out carefully. However, it should be noted that this very design pattern is being used by large traditional software vendors to enhance their existing product sets. One great example is SAP Vora, which extends the capabilities of Spark to enable an organization to greatly augment the processing and storage capabilities of HANA by leveraging Spark on Hadoop. There is definitely value in the work that vendors are doing to make SparkSQL more accessible. However, because Spark is open source, it can also be implemented without a capital acquisition cost. In general, SparkSQL is an excellent option for data processing and data federation; it can greatly improve BI performance and expand the range of available data. This design pattern is not for the faint of heart, but when implemented properly it can lead to great progress for an organization on the Big Data journey. For a working example of using SparkSQL for Data Federation, check out: https://community.hortonworks.com/content/repo/29883/sparksql-data-federation-demo.html
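To make the pattern concrete, here is a minimal Scala sketch of such a federation tier (the JDBC URL, credentials, and table names are placeholders; it assumes Spark 1.4+ for the DataFrameReader API and that the JDBC driver jar is on the classpath):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object SparkSqlFederation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparksql-federation"))
    val hiveContext = new HiveContext(sc)

    // Pull a table from a legacy RDBMS over JDBC (connection details are placeholders)
    val orders = hiveContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://legacy-db:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl")
      .option("password", "secret")
      .load()

    // Register it next to the Hive tables already visible through the HiveContext
    orders.registerTempTable("orders")
    hiveContext.cacheTable("orders") // optionally pin the fact table in memory

    // Expose everything over JDBC/ODBC to BI tools via the Spark Thrift server
    HiveThriftServer2.startWithContext(hiveContext)

    // Keep the driver (and therefore the registered temp tables) alive
    Thread.sleep(Long.MaxValue)
  }
}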
04-26-2016
01:05 PM
1 Kudo
@Randy Gelhausen You should be able to use the ExecuteProcess processor to execute a spark-submit, assuming the jar with your job code is already available on that system. You would need the jar with your job code and the Spark client available on each of the NiFi cluster nodes, but spark-submit would just call out to the YARN Resource Manager, assuming you have a direct network path from NiFi to YARN (or perhaps you are running Spark standalone on the same cluster). I do agree with Simon that just using NiFi for most of this is probably a better solution.
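For illustration, the ExecuteProcess processor would simply invoke something along these lines (the spark-client path, class name, and jar location are placeholders for your own environment and job):

/usr/hdp/current/spark-client/bin/spark-submit \
  --class com.example.MyStreamingJob \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-memory 2G \
  /opt/jobs/my-spark-job.jar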
04-26-2016
12:55 PM
@vadivel sambandam On ingest, Spark relies on HDFS settings to determine the splits based on block size, and each split maps 1:1 to an RDD partition. However, Spark then gives you fine-grained control over the number of partitions at run time: transformations like repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed. When these transformations are used correctly, they can greatly improve the efficiency of a Spark job.
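A quick Scala sketch of the difference (it assumes an existing SparkContext sc; the path and partition counts are illustrative):

val raw = sc.textFile("hdfs:///data/events")   // partition count comes from the HDFS splits
println(raw.partitions.length)

val widened  = raw.repartition(200)            // full shuffle; increases parallelism
val narrowed = widened.coalesce(50)            // narrows partitions, avoiding a full shuffle

// For pair RDDs, repartition and sort within each partition in a single shuffle
import org.apache.spark.HashPartitioner
val pairs  = raw.map(line => (line.split(",")(0), line))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(50))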
04-23-2016
05:57 PM
3 Kudos
@Adam Doyle In Spark 1.6, the Thrift server runs in multi-session mode by default, which means each JDBC/ODBC connection owns its own copy of the SQL configuration and temporary function registry (cached tables are still shared). You are registering a temp table, so in order to see it you need to run the Thrift server in single-session mode. In spark-defaults.conf, set spark.sql.hive.thriftServer.singleSession to true. When you then start an instance of the Thrift server from your code, it should come up in single-session mode, and once you initialize and register the temp table it should show up when you connect and issue a SHOW TABLES command. Alternatively, you can create a permanent table, in which case it will show up in multi-session mode and from Hive (you have the code to do that, but it's commented out).
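In other words, a sketch of the setting (Spark 1.6+), either in the defaults file or passed on submit:

# conf/spark-defaults.conf
spark.sql.hive.thriftServer.singleSession  true

# or equivalently at submit time
--conf spark.sql.hive.thriftServer.singleSession=true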
04-20-2016
03:23 PM
@Sridhar Babu M Since cores per container are controlled by the YARN configuration, you will need to set the number of executors and the number of cores per executor based on that configuration to control how many executors and cores get scheduled. So if YARN allocates 1 core per container and you want two cores for the job, ask spark-submit for 2 executors with 1 core each; that should give you two containers with 1 executor each. I don't think YARN will give you an executor with 2 cores if a container can only have 1 core. But if you can have 8 cores per container, then you can have 8 executors with 1 core or 4 executors with 2 cores per container. Of course, you can continue to add executors as long as your YARN queue has capacity for more containers.

# Run on a YARN cluster
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 2 --executor-cores 1 /path/to/examples.jar
04-17-2016
02:32 PM
@mike harding What about cores? The YARN RM UI should show the number of cores that YARN has available to allocate. Are there any cores still available? Are there any other jobs in the running state? If you click on the Application Master link in the YARN RM UI, it should take you to the Spark UI; is it showing any jobs as incomplete?
04-13-2016
06:40 PM
2 Kudos
@mike harding This looks like YARN is not able to allocate containers for the executors. When you look at the YARN Resource Manager UI, is there a job from Zeppelin in the ACCEPTED state? If so, how much memory is available for YARN to allocate (it should be on the same UI)? If the job is in the ACCEPTED state and there is not enough memory available, the job will not start until YARN gets resources freed up. If this is the case, try allocating more memory to YARN in Ambari.
04-07-2016
09:20 PM
@hoda moradi If I understand your question correctly, you could try to use a state management function with updateStateByKey (http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams), where the key is the schema type field (I am assuming this is a String). Create a global map with the schema type field as the key and the corresponding data frame as the value. The function itself would look up the data frame object in the map you created earlier and then operate on that data frame; the data you want to save should also be passed to the function. The stateful function is typically used to keep a running aggregate, but because the state is maintained per key it should allow you to write generic logic where you look up the specifics (like target table and columns) at run time. Let me know if that makes sense; a rough sketch follows, and I can post more code if not.
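A rough Scala sketch of that idea (names like ssc, recordStream, and extractSchemaType are placeholders for your own streaming context, input DStream, and parsing logic; the state here is just a running count per schema type, and the lookup/write against the right data frame would go inside the update function):

import org.apache.spark.streaming.dstream.DStream

// Key each record by its schema type
val keyed: DStream[(String, String)] = recordStream.map(rec => (extractSchemaType(rec), rec))

// updateStateByKey requires a checkpoint directory
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

// State per schema type; here just a running record count
val updateFunc = (newRecords: Seq[String], state: Option[Long]) => {
  // Look up the data frame / target table for this schema type in your global map
  // and persist newRecords here, then update whatever running state you need.
  Some(state.getOrElse(0L) + newRecords.size)
}

val countsBySchemaType: DStream[(String, Long)] = keyed.updateStateByKey[Long](updateFunc)
countsBySchemaType.print()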