Member since
03-24-2016
184
Posts
239
Kudos Received
39
Solutions
12-21-2017
08:53 AM
I also tried spark-llap on HDP-2.6.2.0 with Spark 1.6.3 and http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.5.5.5-2/spark-llap-1.0.0.2.5.5.5-2-assembly.jar. Unfortunately, when I ran a simple "select count" query in beeline, I got the following error:
0: jdbc:hive2://node-05:10015/default> select count(*) from ods_order.cc_customer;
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#56L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#59L])
+- Scan LlapRelation(org.apache.spark.sql.hive.llap.LlapContext@690c5838,Map(table -> ods_order.cc_customer, url -> jdbc:hive2://node-01.hdp.wiseda.com.cn:10500))[] (state=,code=0)
The corresponding Thrift Server log messages are in the attached "thriftserver-err-msg.txt".
12-29-2016
06:40 PM
6 Kudos
This tutorial is a follow-on to the Apache Spark Fine Grain Security with LLAP Test Drive tutorial. These two articles cover the entire range of security authorization capabilities available for Spark on the Hortonworks Data Platform.
Getting Started
Install an HDP 2.5.3 Cluster via Ambari. Make sure the following components are installed:
- Hive
- Spark
- Spark Thrift Server
- HBase
- Ambari Infra
- Atlas
- Ranger
Enable LLAP
- Navigate to the Hive Configuration Page and click Enable Interactive Query. Ambari will ask what host group to put the Hiveserver2 service into. Select the Host Group with the most available resources.
- With Interactive Query enabled, Ambari will display new configuration options. These options provide control of resource allocation for the LLAP service. LLAP is a set of long-lived daemons that facilitate interactive query response times and fine grain security for Spark. Since the goal of this tutorial is to test out fine grain security for Spark, LLAP only needs a minimal allocation of resources. However, if more resources are available, feel free to crank up the allocation and run some Hive queries against the Hive Interactive server to get a feel for how LLAP improves Hive's performance.
- Save configurations, confirm and proceed.
- Restart all required services.
- Navigate to the Hive Summary tab and ensure that Hiveserver2 Interactive is started.
Download Spark-LLAP Assembly
- From the command line as root:
wget -P /usr/hdp/current/spark-client/lib/ http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.5.3.0-37/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
- Copy the assembly to the same location on each host where Spark may start an executor. If queues are not enabled, this likely means all hosts running a node manager service.
- Make sure all users have read permissions to that location and the assembly file.
Configure Spark for LLAP
- In Ambari, navigate to the Spark service configuration tab:
- Find Custom-spark-defaults,
- click add property and add the following properties:
  - spark.sql.hive.hiveserver2.url=jdbc:hive2://{hiveserver-interactive-hostname}:10500
  - spark.jars=/usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
  - spark.hadoop.hive.zookeeper.quorum={some-or-all-zookeeper-hostnames}:2181
  - spark.hadoop.hive.llap.daemon.service.hosts=@llap0
- Find Custom spark-thrift-sparkconf,
- click add property and add the following properties:
  - spark.sql.hive.hiveserver2.url=jdbc:hive2://{hiveserver-interactive-hostname}:10500
  - spark.jars=/usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
  - spark.hadoop.hive.zookeeper.quorum={some-or-all-zookeeper-hostnames}:2181
  - spark.hadoop.hive.llap.daemon.service.hosts=@llap0
- Find Advanced-spark-env
- Set spark_thrift_cmd_opts attribute to --jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
- Save all configuration changes
- Restart all components of Spark
- Make sure Spark-Thrift server is started
Enable Ranger for Hive
- Navigate to Ranger Service Configs tab
- Click on Ranger Plugin Tab
- Click the switch labeled "Enable Ranger Hive Plugin"
- Save Configs
- Restart All Required Services
Stage Sample Data in an External Hive Table
- From Command line
cd /tmp
wget https://www.dropbox.com/s/r70i8j1ujx4h7j8/data.zip
unzip data.zip
sudo -u hdfs hadoop fs -mkdir /tmp/FactSales
sudo -u hdfs hadoop fs -chmod 777 /tmp/FactSales
sudo -u hdfs hadoop fs -put /tmp/data/FactSales.csv /tmp/FactSales
beeline -u jdbc:hive2://{hiveserver-host}:10000 -n hive -e "CREATE TABLE factsales_tmp (SalesKey int ,DateKey timestamp, channelKey int, StoreKey int, ProductKey int, PromotionKey int, CurrencyKey int, UnitCost float, UnitPrice float, SalesQuantity int, ReturnQuantity int, ReturnAmount float, DiscountQuantity int, DiscountAmount float, TotalCost float, SalesAmount float, ETLLoadID int,LoadDate timestamp, UpdateDate timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tmp/FactSales'"
Move data into Hive Tables
- From Command line
beeline -u jdbc:hive2://{hiveserver-host}:10000 -n hive -e "CREATE TABLE factsales (SalesKey int ,DateKey timestamp, channelKey int, StoreKey int, ProductKey int, PromotionKey int, CurrencyKey int, UnitCost float, UnitPrice float, SalesQuantity int, ReturnQuantity int, ReturnAmount float, DiscountQuantity int, DiscountAmount float, TotalCost float, SalesAmount float, ETLLoadID int, LoadDate timestamp, UpdateDate timestamp) clustered by (saleskey) into 7 buckets stored as ORC"
beeline -u jdbc:hive2://{hiveserver-host}:10000 -n hive -e "INSERT INTO factsales SELECT * FROM factsales_tmp"
View Meta Data in Atlas
- Navigate to the Atlas Service
- Click on Quicklinks --> Atlas Dashboard
- user: admin password: admin
- Create a new Tag called "secure"
- Click on Search --> Flip the Switch to "DSL" --> Select "hive_table" and submit the search
- When we created the sample Hive tables earlier, the Hive Hook updated Atlas with meta data representing the newly created data sets
- Click on Factsales to see details including lineage and schema information for the Factsales Hive table
- Scroll down and click on the Schema tab
- Click on the Plus sign next to the Storekey column to add a tag, and add the "secure" tag we created earlier
- The storekey column of the factsales hive table is now tagged as "secure". We can now configure Ranger to secure access to the storekey field based on meta data in Atlas.
Configure Ranger Security Policies
- Navigate to the Ranger Service
- Click on Quicklinks --> Ranger Admin UI
- user: admin password: admin
- Click on Access Manager --> Tag Based Policies
- Click the Plus Sign to add a new Tag service
- Click Add New Policy, name and add the new service
- The new tag service will show up as a link. Click the link to enter the tag service configuration screen.
- Click Add New Policy
- Name the policy and enter "secure" in the TAG field. This tag refers to the tag we created in Atlas. Once the policy is configured, the Ranger Tag-Synch service will look for notifications from Atlas that the "secure" tag was added to an entity. When it sees that notification, it will update authorization as described by the Tag based policies.
- Scroll down and click on the link to expand the Deny Condition section
- Set the User field to User hive and the component Permission section to Hive
- Click Add to finalize and create the policy. Now Atlas will notify Ranger whenever an entity is tagged as "secure" or the "secure" tag is removed.
The "secure" tag policy permissions will apply to any entity tagged with the "secure" tag.
- Click on Access Manager and select Resource Based Policies
- Next to the {clustername}_hive service link, click the edit icon (looks like a pen on paper). Make sure to click the icon and not the link.
- Select the Tag service we created earlier from the drop down and click save. This step is important as this is how Ranger will associate the tag notifications coming from Atlas with the Hive security service.
- You should find yourself at the Resource Based Policies screen again. This time click on the {clustername}_hive service link, under the Hive section
- Several default Hive security policies should be visible.
- User hive is allowed access to all tables and all columns
- The cluster is now secured with Resource and Tag based policies. Let's test out how these work together using Spark.
Test Fine Grain Security with Spark
- Connect to Spark-Thrift server using beeline as hive User and verify sample tables are visible
beeline -u jdbc:hive2://{spark-thrift-server-host}:10015 -n hive
Connecting to jdbc:hive2://{spark-thrift-server-host}:10015
Connected to: Spark SQL (version 1.6.2)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://{spark-thrift-server-host}:10015> show tables;
+----------------+--------------+--+
| tableName | isTemporary |
+----------------+--------------+--+
| factsales | false |
| factsales_tmp | false |
+----------------+--------------+--+
2 rows selected (0.793 seconds)
- Get the Explain Plan for a simple query
0: jdbc:hive2://sparksecure01-195-1-0:10015> explain select storekey from factsales;
| == Physical Plan == |
| Scan LlapRelation(org.apache.spark.sql.hive.llap.LlapContext@44bfb65b,Map(table -> default.factsales, url -> jdbc:hive2://sparksecure01-195-1-0.field.hortonworks.com:10500))[storekey#66] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
2 rows selected (1.744 seconds)
- The explain plan should show that the table will be scanned using the LlapRelation class. This confirms that Spark is using LLAP to read from HDFS.
- Recall that the User hive should have complete access to all databases, tables, and columns per the Ranger resource based policy.
- Attempt to select storekey from factsales as the User hive
- Even though User hive should have full access to the factsales table, we were able to restrict access to the storekey column by designating it as "secure" using a tag in Atlas.
- Attempt to select saleskey from factsales as the User hive. The saleskey column is not designated as secure via tag.
- Access to the saleskey field is allowed since the User hive has access and the field is not designated as secure.
- Return to the Factsales page in Atlas and remove the "secure" tag from the storekey column.
- Wait 30-60 seconds for the notification from Atlas to be picked up, processed, and propagated.
- Attempt to select storekey from factsales as the User hive once again.
- This time access is allowed since the "secure" tag has been removed from the storekey column of the factsales table in Atlas.
- Back in the Ranger UI, click on Audit to see all of the access attempts that have been recorded by Ranger.
- Notice that the first access attempt was denied based on the tag [secure].
Ranger already provides extremely fine grain security for both Hive and Spark. However, in combination with Atlas, yet another level of security can be added. Tag based security for Spark provides additional flexibility in controlling access to datasets.
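The three access attempts in the test above can be summarized with a small toy model. This is only a sketch of the behaviour observed in the walkthrough, not Ranger's actual policy engine; the names Column, TAG_DENY, RESOURCE_ALLOW, and is_access_allowed are hypothetical and exist only for this illustration.

```python
# Toy model of the behaviour observed in this walkthrough.
# This is NOT Ranger's real policy engine; every name below is hypothetical.
from dataclasses import dataclass, field


@dataclass
class Column:
    table: str
    name: str
    tags: set = field(default_factory=set)  # tags applied to the column in Atlas


# Tag based policy from the tutorial: deny user "hive" on anything tagged "secure".
TAG_DENY = {"secure": {"hive"}}
# Resource based policy from the tutorial: "hive" may read all tables and columns.
RESOURCE_ALLOW = {"hive": {("*", "*")}}


def is_access_allowed(user: str, column: Column) -> bool:
    # 1. A matching tag based deny wins over any allow.
    for tag in column.tags:
        if user in TAG_DENY.get(tag, set()):
            return False
    # 2. Otherwise fall back to the resource based allow rules.
    for table, col in RESOURCE_ALLOW.get(user, set()):
        if table in ("*", column.table) and col in ("*", column.name):
            return True
    # 3. No matching policy at all: deny by default.
    return False


storekey = Column("factsales", "storekey", {"secure"})
saleskey = Column("factsales", "saleskey")

print(is_access_allowed("hive", storekey))  # column is tagged "secure"
print(is_access_allowed("hive", saleskey))  # column is untagged
storekey.tags.discard("secure")             # tag removed in Atlas
print(is_access_allowed("hive", storekey))
```

The ordering (deny on tag first, then resource based allow) is the part that matters here: it is why a user with full table access can still be blocked on a single tagged column.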
01-19-2017
02:30 PM
This is a great article. I have a question about the Thrift Server. The article description says "SparkSQL, Ranger, and LLAP via Spark Thrift Server.." but the implementation uses HiveServer2. So can Ranger work with the Spark Thrift Server? Is there a Ranger plugin for the Spark Thrift Server?
08-15-2016
03:47 AM
6 Kudos
This article is a companion to the article "Avro Schema Registry with Apache Atlas for Streaming Data Management". https://community.hortonworks.com/articles/51379/avro-schema-registry-with-apache-atlas-for-streami.html The article explores how an Avro schema registry can bring data governance to streaming data and the benefits that come with it. This tutorial demonstrates the implementation of this concept and some of the resulting features.
Download the HDP 2.5 Sandbox.
Modify the hosts file on the local machine to resolve sandbox.hortonworks.com to 127.0.0.1.
SSH to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222).
Run ambari-admin-password-reset and make sure to set the Ambari password to "admin".
Log into Ambari (http://sandbox.hortonworks.com:8080/) and start the following services: HBase, Log Search, Kafka, Atlas.
From the SSH console:
git clone https://github.com/vakshorton/AvroSchemaShredder
cd /root/AvroSchemaShredder
chmod 755 install.sh
./install.sh
java -jar AvroSchemaShredder-jar-with-dependencies.jar
Open a second SSH session to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222):
cd /root/AvroSchemaShredder
curl -u admin:admin -d @schema/schema_1.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema
Curl will make a REST API call to the AvroSchemaShredder service to parse the sample Avro schema and store it in Atlas.
Log into Atlas: http://sandbox.hortonworks.com:21000 (usr:admin, pass:admin).
Search for "avro_schema". The search should return a list of schemas that were created when the request to register schemas was made via the REST service call.
Click into one of the schemas and notice the available information about the top level record. The record will have a "fields" attribute that contains links to other sub elements and, in some cases, other schemas.
Now any of the fields of any registered schema can be searched and tagged. Schemas can be associated with Kafka topics, allowing discovery of streaming data sources on those topics. Also, notice that the curl REST call returned a GUID. That GUID can be used to access the schema that was registered. This means that a message can be automatically deserialized from a Kafka topic based on the "fingerprint" associated with the message on the Kafka topic. This could be achieved using a standard client that depends on the Avro Schema Registry to deserialize messages.
To retrieve the Avro compliant schema notation, get the GUID that the curl command returned after the sample schema was registered:
curl -u admin:admin -X GET http://sandbox.hortonworks.com:8090/schemaShredder/getSchema/{GUID}
The response should be an Avro compliant schema descriptor.
This prototype does not handle schema validation or compatibility enforcement. It also does not do any caching to optimize performance or leverage Kafka for asynchronous notification. However, it does demonstrate how the described capabilities can be achieved.
Repo: https://community.hortonworks.com/content/repo/51366/avro-schema-shredder.html
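For readers who prefer to script the registration step rather than use curl, here is a minimal Python sketch that builds the same POST request using only the standard library. The URL, credentials, and header come from the curl command above; the function name build_store_request and the sample schema are hypothetical, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request

# Endpoint and credentials mirror the curl command in this tutorial.
STORE_URL = "http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema"


def build_store_request(schema: dict, user: str = "admin",
                        password: str = "admin") -> urllib.request.Request:
    """Build (but do not send) the POST that registers an Avro schema."""
    body = json.dumps(schema).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        STORE_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # same effect as curl -u user:password
        },
    )


# Hypothetical sample schema; this is not the repo's schema/schema_1.json.
sample_schema = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "deviceId", "type": "string"},
        {"name": "temperature", "type": "double"},
    ],
}

request = build_store_request(sample_schema)
print(request.get_method(), request.full_url)

# Sending requires the AvroSchemaShredder service to be running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Per the tutorial, the service's response to this call contains the GUID needed for the later getSchema call; the exact response format is not shown in the article, so parsing it is left out of the sketch.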
08-13-2016
08:59 PM
6 Kudos
Data is becoming the new precious resource. As the world produces more and more data,
business units find increasingly more ways to monetize that data. This means that data that
used to be retained for a short time or not at all, is now being persisted long term. This
data is being gathered from more and more sources and not necessarily from within the
organization that uses it. It is also increasingly being generated by machines, meaning
that the volume, velocity, and variety of the data proliferate at an overwhelming rate.
There are now lots of tools that enable an organization to address the challenges imposed by
the proliferation of data. However, many organizations have been focused on dealing with
volume and velocity while not focusing on the challenges created by the lack or
inconsistency of structure.
In order to truly unlock the power of all that data, an organization must first apply a
consistent set of guidelines for governance of the data. Getting value from new data
sources often requires imposing schemas on unstructured or semi-structured data.
This is because the new data often has to be combined with existing structured data in order for it
to be useful. Schemas can also be important for security as sensitive bits of data are
often mixed in data sets that are generally considered non-sensitive. Finally, business
units generally do not create the technologies that monetize the data. That job falls to
many different engineering groups that are often decentralized. In order to effectively
create the tools that enable harvesting value from data, engineering teams need to agree on
how that data should be used, modified, and enriched. Consider a scenario where two
different engineering teams are working on requirements from two different business units
and have no knowledge of the other's work. When team A wants to evolve the schema of some
data set, they must be sure that the change will not disrupt the work of team B. This is
challenging since team A may not know that team B is using the same data or what they are
doing with it. In addition, team B will likely derive a new data set from the existing
data. That new data set may be exactly what team A needs to deliver what the business has
asked for. Team A needs to be able to discover the fact that team B has produced a new data
set from the one that both teams were using.
It used to be that data was primarily stored in siloed relational databases in a
structured format. The very existence of data was predicated on the existence of a well defined schema.
In the new world of Big Data platforms, data is often stored without a schema and in some cases
the data is a stream of messages in a queueing system. Data Governance tools like
Apache Atlas can help with management of data sets and processes that evolve them. The flexibility of
Atlas enables creation of new managed Types that can be used to govern
data sets from just about any data source. In fact, as of Hortonworks Data Platform 2.5, Atlas is used to visualize
and track cross component lineage of data ingested via Apache Hive, Apache Sqoop, Apache Falcon,
Apache Storm, Apache Kafka, and in the future, Apache NiFi. Schemas for Hive tables are stored and
governed, thus covering many data at rest use cases. It makes a lot of sense to manage schemas for streaming
data sources within Atlas as well. Kafka topics are captured as part of Storm
topologies but currently, only configuration information is available. The concept of an Avro Schema Registry
combined with existing governance capabilities of Atlas, would extend the benefits of data governance
to streaming data sets.
In order to extend the concept of schema to streaming data sets, a serialization format with a built-in concept of schema is required.
Apache Avro is a commonly used serialization format for streaming data.
It is extremely efficient for writes and includes a self-describing schema as part of its specification.
The Avro schema specification allows for schema evolution that is backward or forward compatible.
Each message can be serialized with its schema so that an independent downstream
consumer is able to deserialize the message. Instead of the full schema, it is also possible
to pass a "fingerprint" that uniquely identifies the schema. This is useful when the
schema is very large. However, using a fingerprint with messages that will travel through
multiple Kafka topics requires that the consumer is able to reference the schema that the
fingerprint refers to. Atlas can be used to not only store Avro schemas but to make them
searchable, and useful for data governance, discovery, and security.
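The fingerprint idea can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the Avro specification defines its own Parsing Canonical Form and fingerprint algorithms, which this sketch does not implement. Here the "canonical form" is just JSON with sorted keys and no extra whitespace, hashed with SHA-256, and the registry is an in-memory dict standing in for Atlas. All function names are hypothetical.

```python
import hashlib
import json

# In-memory stand-in for the schema registry (Atlas in the article).
_registry = {}


def schema_fingerprint(schema: dict) -> str:
    """Hash a simplified canonical form of the schema.

    Assumption: sorted keys and compact separators are enough to make two
    textual variants of the same schema hash identically. The real Avro
    Parsing Canonical Form has more rules than this.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def register(schema: dict) -> str:
    """Store the schema and return the fingerprint a producer would send."""
    fingerprint = schema_fingerprint(schema)
    _registry[fingerprint] = schema
    return fingerprint


def lookup(fingerprint: str) -> dict:
    """Resolve a fingerprint back to the full schema on the consumer side."""
    return _registry[fingerprint]


# Hypothetical example schema.
schema = {
    "type": "record",
    "name": "Sale",
    "fields": [{"name": "storekey", "type": "int"}],
}

fp = register(schema)
print(fp[:16], lookup(fp)["name"])
```

A producer would attach only the fingerprint to each message; any consumer that can reach the registry can then recover the full schema, regardless of how many topics the message has passed through.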
The first step to using Atlas as an Avro Schema Registry is to add new Types that align to
the Avro Schema specification. Avro Schema supports the following types: Records, Enums, Arrays, Maps, Unions, Fixed, and Primitives.
Using the Atlas API, it is possible to create types that exhibit the same kinds of attributes
and nesting structure. The second required component is a service that is capable of parsing
an Avro Schema JSON representation and translating it to the new Atlas Avro Types. After registering
the schema, the service should return a fingerprint (GUID) that will act as the claim check for that schema on deserialization.
The service should also handle schema validation and compatibility enforcement. This set
of capabilities would allow automatic deserialization of messages from a Kafka topic.
While just having an Avro Schema Registry is valuable for streaming use cases, using Atlas
as the underlying store provides substantial value. Data discovery becomes much easier
since all of the fields in each Avro Schema can be individually indexed. This means that a user
can search for the name of a field and determine the schema and Kafka topic where it can be found.
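As a rough illustration of field-level discovery, the Python sketch below walks Avro schema JSON and builds an inverted index from field name to the schemas and Kafka topics that contain it. It is not how Atlas indexes entities; the schemas, topic names, and function names are all hypothetical.

```python
from collections import defaultdict


def iter_field_paths(schema, prefix=""):
    """Yield dotted paths for every field in a (possibly nested) Avro schema."""
    if isinstance(schema, list):                 # union: visit each branch
        for branch in schema:
            yield from iter_field_paths(branch, prefix)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for f in schema.get("fields", []):
                path = prefix + f["name"]
                yield path
                yield from iter_field_paths(f["type"], path + ".")
        elif kind == "array":
            yield from iter_field_paths(schema.get("items"), prefix)
        elif kind == "map":
            yield from iter_field_paths(schema.get("values"), prefix)
    # Plain strings are primitives or named references: nothing to index.


def build_field_index(registered):
    """Map each field name to the set of (schema name, topic) pairs using it."""
    index = defaultdict(set)
    for topic, schema in registered:
        for path in iter_field_paths(schema):
            leaf = path.rsplit(".", 1)[-1]
            index[leaf].add((schema["name"], topic))
    return index


# Hypothetical schemas and the Kafka topics they are associated with.
sale = {
    "type": "record", "name": "Sale",
    "fields": [
        {"name": "storekey", "type": "int"},
        {"name": "customer", "type": {
            "type": "record", "name": "Customer",
            "fields": [{"name": "id", "type": "long"},
                       {"name": "email", "type": "string"}],
        }},
    ],
}
click = {
    "type": "record", "name": "Click",
    "fields": [{"name": "email", "type": ["null", "string"]},
               {"name": "url", "type": "string"}],
}

index = build_field_index([("sales", sale), ("clicks", click)])
print(sorted(index["email"]))
```

Searching the index for "email" returns both schemas and their topics, which is the kind of lookup the paragraph above describes.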
In many use cases the messages flowing through the Kafka topic flow into a Hive table,
HDFS location, or some NoSQL store. Engineering teams can use the cross component lineage
visualization in Atlas to understand the effects that schema evolution will have downstream.
Atlas also provides the ability to apply tags and business taxonomies. These capabilities
make it really easy to curate, understand, and control how streaming data is deployed and secured.
For example, Apache Atlas integrates with Apache Ranger (Authorization system) to enable tag based
policies. This capability allows column level authorization for data managed by Apache Hive
based on tags applied to the meta data in Atlas. Apache Ranger is also currently able to secure
Kafka topics based on source IP or user name (in Kerberized clusters). Tag based policies
are not yet available for Kafka topics. However, it should be possible to reuse the same
tag synch subsystem used to implement tag based policies in Hive. Tags can also be used
to deprecate older schemas or prevent evolution of certain schemas through the Registry API.
Finally, because Atlas uses HBase and Solr under the covers, enterprise requirements like HA
and DR capabilities do not need to be re-implemented.
It is clear that data governance is becoming an absolutely essential component of an enterprise
data management platform. Whether the data is streaming or at rest, both business and
technology organizations need to discover, understand, govern, and secure that data. Combining
capabilities of existing data governance tools like Apache Atlas with schema aware data formats
like Apache Avro (Kafka) and Apache ORC (Hive/Pig/Spark) can make managing Big Data that
much easier.
09-12-2017
01:53 PM
Excellent Article! Thanks for sharing your thoughts.
01-29-2018
04:48 AM
Step 1: Check Service Status should use a GET request:
curl -u admin:admin -H "X-Requested-By:ambari" -i -X GET http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI
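If the status check is scripted, the service state can be pulled out of the response body. The sketch below assumes the body is JSON with a ServiceInfo.state field, as Ambari service responses usually are; the sample response is illustrative, not captured output, and the function names are hypothetical.

```python
import json

# Illustrative response body (assumed shape, values made up for this example).
SAMPLE_BODY = """{
  "href": "http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI",
  "ServiceInfo": {
    "cluster_name": "Sandbox",
    "service_name": "NIFI",
    "state": "STARTED"
  }
}"""


def body_of(raw: str) -> str:
    """Drop the header block that `curl -i` prints before the JSON body."""
    normalized = raw.replace("\r\n", "\n")
    head, sep, body = normalized.partition("\n\n")
    return body if sep and head.startswith("HTTP/") else normalized


def service_state(body: str) -> str:
    """Return the service state string, e.g. STARTED or INSTALLED."""
    return json.loads(body)["ServiceInfo"]["state"]


raw = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n" + SAMPLE_BODY
print(service_state(body_of(raw)))
```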
05-08-2016
02:01 AM
@Benjamin Leonhardi With the release of Yarn.Next, the containers will receive their own IP address and get registered in DNS. The FQDN will be available via a REST call to Yarn. If the current Yarn container dies, the docker container will start in a different Yarn container somewhere in the cluster. As long as all clients are pointing at the FQDN of the application, the outage will be nearly transparent. In the meantime, there are several options using only Slider, but they require some scripting or registration in Zookeeper. If you run:
slider lookup --id application_1462448051179_0002
2016-05-08 01:55:51,676 [main] INFO impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2016-05-08 01:55:53,847 [main] WARN shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2016-05-08 01:55:53,868 [main] INFO client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
{
"applicationId" : "application_1462448051179_0002",
"applicationAttemptId" : "appattempt_1462448051179_0002_000001",
"name" : "biologicsmanufacturingui",
"applicationType" : "org-apache-slider",
"user" : "root",
"queue" : "default",
"host" : "sandbox.hortonworks.com",
"rpcPort" : 1024,
"state" : "RUNNING",
"diagnostics" : "",
"url" : "http://sandbox.hortonworks.com:8088/proxy/application_1462448051179_0002/",
"startTime" : 1462454411514,
"finishTime" : 0,
"finalStatus" : "UNDEFINED",
"origTrackingUrl" : "http://sandbox.hortonworks.com:1025",
"progress" : 1.0
}
2016-05-08 01:55:54,542 [main] INFO util.ExitUtil - Exiting with status 0
You do get the host the container is currently bound to. Since the instructions bind the docker container to the host IP, this would allow URL discovery but, as I said, not out of the box. This article is merely a harbinger of Yarn.Next, which will integrate the PaaS capabilities into Yarn itself, including application registration and discovery.
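The scripting mentioned above could start with something like this Python sketch, which extracts the JSON block from the mixed log and JSON output of slider lookup and reads the host and tracking URL. It assumes the JSON object starts on a line containing only "{" and ends on a line containing only "}", as in the output shown; the sample text is abbreviated from that output and the function name is hypothetical.

```python
import json

# Abbreviated from the `slider lookup` output shown above.
SAMPLE_OUTPUT = """\
2016-05-08 01:55:53,868 [main] INFO client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
{
"applicationId" : "application_1462448051179_0002",
"name" : "biologicsmanufacturingui",
"host" : "sandbox.hortonworks.com",
"state" : "RUNNING",
"origTrackingUrl" : "http://sandbox.hortonworks.com:1025",
"progress" : 1.0
}
2016-05-08 01:55:54,542 [main] INFO util.ExitUtil - Exiting with status 0
"""


def parse_slider_lookup(output: str) -> dict:
    """Return the JSON object embedded between the surrounding log lines."""
    lines = output.splitlines()
    stripped = [line.strip() for line in lines]
    start = stripped.index("{")                              # first line that is just "{"
    end = len(stripped) - 1 - stripped[::-1].index("}")      # last line that is just "}"
    return json.loads("\n".join(lines[start:end + 1]))


info = parse_slider_lookup(SAMPLE_OUTPUT)
print(info["state"], info["host"], info["origTrackingUrl"])
```

A wrapper script could run the lookup periodically and update a DNS entry, proxy, or Zookeeper node whenever the host changes.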
04-27-2016
06:57 PM
8 Kudos
Data Federation discussions are becoming more and more commonplace as organizations embark on their Big Data Journey. New data platforms like the Hortonworks Connected platform (HDP+HDF), NoSQL, and NewSQL data stores are reducing the cost and difficulty of storing and working with vast volumes of data. This is empowering organizations to leverage and monetize their data like never before. However, legacy data infrastructures still play an important role in the overall technology architecture. In order to fully realize the power of the new and traditional data platforms, it is often necessary to integrate the data. One obvious approach is to simply move the data from where it sits in the existing data platform over to the target data platform. However, in many cases it is desirable to leave the data in place and enable a "Federation" tier to act as a single point of access to data from multiple sources. For details on the concepts and implementation of Data Federation see https://community.hortonworks.com/articles/27387/virtual-integration-of-hadoop-with-external-system.html. This article focuses on how to use SparkSQL to integrate, expose, and accelerate multiple sources of data from a single "Federation Tier".
First, it is important to point out that SparkSQL is not a pure Data Federation tool and hence does not have some of the really advanced capabilities generally associated with Data Federation. SparkSQL does not facilitate predicate push down to the source system beyond the query that defines what data from the underlying source should be made available through SparkSQL. Also, because it was not designed to be a true "Data Federation" engine, there is no "user friendly" interface to easily set up the external sources, the schemas associated with the target data, or the ingest of the target data. All of this work has to be done through the SparkSQL API and requires relatively advanced knowledge of Spark and data architecture principles in general.
For these reasons, SparkSQL will not be the right solution in every Data Federation scenario. However, what SparkSQL lacks in terms of an "easy button" it makes up for in versatility, relatively low cost, sheer processing potential, and in-memory capabilities. SparkSQL exposes most of its capabilities via the Data Frame API and the SQL context. Data can be ingested into Spark's native data structure (RDD) from an RDBMS, from HDFS (supports Hive/HBase/Phoenix), and generally any source that has an API that Spark can access (HTTP/JDBC/ODBC/NoSQL/Cloud Storage). The Data Frame allows the definition of a schema and then the application of that schema to the RDD containing the target data. Once the data has been transformed into a Data Frame with a schema, it is a single line of code away from becoming what looks exactly like a relational table. That table can then be stored in Hive (assuming a Hive context was created) if it needs to be accessed on a regular basis, or registered as a temp table that will exist only as long as the parent Spark application and its executors (the application can run indefinitely). If enough resources are available and really fast query responses are required, any or all of the tables can be cached and made available in-memory. Assuming a properly tuned infrastructure, and a clear understanding of how and when the data changes, this can make query response times extremely fast. Imagine caching the main fact table and leveraging map joins for the dimension tables. All of the tables that have been registered can then be made available for access as a JDBC/ODBC data source via the Spark thrift server. The Spark thrift server supports virtually the same API and many of the features supported by the battle tested Hive thrift server. At this point, OLAP and reporting BI tools can be used to display data from far and wide across the organization's enterprise data architecture.
As stated earlier, it is certainly not the right choice in every situation and must be thought out carefully. However, it should be noted that this very design pattern is being used by large traditional software vendors to enhance their existing product sets. One great example of this is SAP Vora, which extends the capabilities of Spark to enable an organization to greatly augment the processing and storage capabilities of HANA by leveraging Spark on Hadoop. There is definitely value in the work that vendors are doing to make SparkSQL more accessible. However, because Spark is open source, it can also be implemented without a capital acquisition cost. In general, SparkSQL is an excellent option for data processing and data federation. It can greatly improve BI performance and the range of available data. This design pattern is not for the faint of heart, but when implemented properly it can lead to great progress for an organization on the Big Data Journey. For a working example of using SparkSQL for Data Federation check out: https://community.hortonworks.com/content/repo/29883/sparksql-data-federation-demo.html
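The shape of the pattern (ingest from several sources, apply a schema, register each as a table, then query everything through one SQL interface) can be shown in a few lines. To keep the example self-contained and runnable without a cluster, this sketch uses Python's built-in sqlite3 module as a stand-in for SparkSQL; it is not Spark code, and every table, column, and function name is hypothetical. In Spark, the CSV and RDBMS reads would become Data Frames registered as temp tables.

```python
import csv
import io
import sqlite3

# Source 1: CSV text standing in for a file on HDFS.
SALES_CSV = "saleskey,storekey,salesamount\n1,10,25.0\n2,11,40.0\n3,10,15.5\n"


def make_legacy_db() -> sqlite3.Connection:
    """Source 2: a separate database standing in for an external RDBMS."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE store (storekey INTEGER, city TEXT)")
    db.executemany("INSERT INTO store VALUES (?, ?)",
                   [(10, "Austin"), (11, "Boston")])
    return db


def build_federation_tier(sales_csv: str, legacy_db: sqlite3.Connection) -> sqlite3.Connection:
    """Load both sources into one in-memory SQL engine (the federation tier)."""
    tier = sqlite3.connect(":memory:")

    # Apply a schema to the raw CSV rows and register them as a table.
    tier.execute("CREATE TABLE factsales "
                 "(saleskey INTEGER, storekey INTEGER, salesamount REAL)")
    rows = csv.DictReader(io.StringIO(sales_csv))
    tier.executemany(
        "INSERT INTO factsales VALUES (?, ?, ?)",
        [(int(r["saleskey"]), int(r["storekey"]), float(r["salesamount"]))
         for r in rows],
    )

    # Pull only the needed columns from the external database.
    tier.execute("CREATE TABLE dimstore (storekey INTEGER, city TEXT)")
    tier.executemany(
        "INSERT INTO dimstore VALUES (?, ?)",
        legacy_db.execute("SELECT storekey, city FROM store").fetchall(),
    )
    return tier


tier = build_federation_tier(SALES_CSV, make_legacy_db())
QUERY = """
SELECT d.city, SUM(f.salesamount)
FROM factsales f JOIN dimstore d ON f.storekey = d.storekey
GROUP BY d.city ORDER BY d.city
"""
print(tier.execute(QUERY).fetchall())
```

A BI tool pointed at the federation tier sees only factsales and dimstore; where each table originally came from is hidden behind the single SQL endpoint, which is the role the Spark thrift server plays in the article.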