Member since: 09-24-2015
Posts: 27
Kudos Received: 69
Solutions: 3
My Accepted Solutions
Title | Views | Posted
--- | --- | ---
 | 3073 | 12-04-2015 03:40 PM
 | 18648 | 10-19-2015 01:56 PM
 | 2613 | 09-29-2015 11:38 AM
03-31-2016
11:20 PM
5 Kudos
Steps to connect from a client machine (a Mac in this case) to a Hadoop cluster using Hive JDBC.
Here are a couple of links that I used to build this out:
Hive drivers and jar files: https://streever.atlassian.net/wiki/pages/viewpage.action?pageId=4390924
HiveServer2 JDBC client setup: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
1. Here are the jar files you need to connect to HiveServer2 (HS2). For HDP 2.3 you only need two jar files for the JDBC client:
# From /usr/hdp/current/hive-client
hive-jdbc.jar (should be a symlink to the hive-jdbc-xxx-standalone.jar)
# From /usr/hdp/current/hadoop-client
hadoop-common.jar (hadoop-common-....jar)
2. Make sure JAVA_HOME is set on your machine. This is the value I have on mine:
echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home
3. Move these jar files to the Java library directory on your machine:
/Library/Java/Extensions
4. Set the Java classpath for the Hive jar file:
export CLASSPATH=$CLASSPATH:/Library/Java/Extensions/hive-jdbc.jar
5. Use any Java-based IDE (I used Eclipse) to write a simple Java class that connects to HiveServer2. In the JDBC string, specify the Hive server you are using along with the corresponding user id and password.
Here is the code:
package hive_test;
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClientv1 {

    // HiveServer2 JDBC driver class
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        // Load the Hive JDBC driver
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // Replace "hive" here with the name of the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://172.16.149.158:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        // Recreate the test table
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (key int, value string)");
        // Show tables
        // String sql = "show tables '" + tableName + "'";
        String sql = "show tables";
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}
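If you prefer not to rely on the JVM to clean up the connection, here is a minimal variant of the same client using try-with-resources so the connection, statement, and result set are closed automatically. The host, port, and user are placeholders; adjust them for your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClientv2 {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 driver (older hive-jdbc jars may not auto-register)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // try-with-resources closes the connection, statement, and result set automatically
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://<hive-server-host>:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery("show tables")) {
            while (res.next()) {
                System.out.println(res.getString(1));
            }
        }
    }
}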
Tags: client, Data Processing, Hive, How-ToTutorial, jdbc
02-08-2016
04:20 PM
I am not having any activity on HBase. I wanted to make sure it is stable before I build a demo on top of it. Should I add a custom hbase-site property named zookeeper.session.timeout and set it to 60000?
02-08-2016
11:44 AM
I have an HDP 2.3.2 sandbox up and running on VMware Fusion. I am able to start both the HBase master and the region server, but they fail after a few hours. It seems like the region server loses its connection to ZooKeeper for some reason. Here is the error that I get:
2016-02-07 22:32:46,693 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server sandbox.hortonworks.com,16020,1454868273283: regionserver:16020-0x150af1fbaec0164, quorum=sandbox.hortonworks.com:2181, baseZNode=/hbase-unsecure regionserver:16020-0x150af1fbaec0164 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Please let me know if I need to make any setting changes in the sandbox.
Labels: Apache HBase
01-14-2016
04:20 PM
5 Kudos
1 Overview
Traditionally, enterprises have been dealing with data flows or data movement within their own data centers. But as the world has become flatter and a global presence has become the norm, enterprises face the challenge of collecting and connecting data from their global footprint. This problem was daunting for the NSA a decade ago, and they came up with a solution using a product that was later named Apache NiFi. Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. Within NiFi, as you will see, I will be able to build a global data flow with minimal to no coding. You can learn the details about NiFi from the Apache NiFi website; it is one of the most well-documented Apache projects. The focus of this article is one specific feature within NiFi that I believe no other software product does as well as NiFi: "site-to-site" protocol data transfer.
2 Business use case
One of the classic business problems is to push data from a location that has a small IT footprint to the main data center, where all the data is collected and connected. This small IT footprint could be an oil rig in the middle of the ocean, a small bank branch in a remote mountain town, a sensor on a vehicle, and so on. So, your business wants a mechanism to push the data generated at various locations to, say, headquarters in a reliable fashion, with all the bells and whistles of an enterprise data flow: lineage, provenance, security, audit, ease of operations, etc. The data generated at my sources comes in various formats such as txt, csv, json, xml, audio, and image, and the sizes range from a few MBs to GBs. I want to break these files into smaller chunks, since I have low bandwidth at my source data centers, stitch them back together at the destination, and load the result into my centralized Hadoop data lake.
3 Solution architecture
Apache NiFi (aka Hortonworks DataFlow) is a perfect tool to solve this problem. The overall architecture looks something like Fig 1. We have Australian and Russian data centers from which we want to move data to US headquarters. We will have what we call an edge instance of NiFi sitting in the Australian and Russian data centers, acting as a data acquisition point. We will then have a NiFi processing cluster in the US where we receive and process all the data coming from the global locations. We will build this end-to-end flow without any coding, using just the drag-and-drop GUI.
4 Build the data flow
Here are the high-level steps to build the overall data flow:
Step 1) Set up a NiFi instance at the Australian data center that will act as the data acquisition instance. I will create a local instance of NiFi that will act as my Australian data center.
Step 2) Set up a NiFi instance on a CentOS-based virtual machine that will act as my NiFi data processing instance. This could be a NiFi cluster as well, but in my case it will be just a single instance.
Step 3) Build the NiFi data flow for the processing instance. This will have an input port, which indicates that this instance can accept data from other NiFi instances.
Step 4) Build the NiFi data flow for the data acquisition instance. This will have a "remote process group" that talks to the NiFi data processing instance via the site-to-site protocol.
Step 5) Test out the overall flow.
Attached is a document that provides detailed step-by-step instructions on how to set this up: data-flow-across-data-centers-v5.zip
Tags: Data Ingestion & Streaming, FAQ, hdf, how-to-tutorial, NiFi
12-10-2015
10:44 PM
I have a Solr Banana dashboard that shows some panels with charts and tables. Is there a way to export a dashboard with its data so that a user can play with it offline, without being connected to the Solr server?
Labels: Apache Solr
12-04-2015
03:40 PM
2 Kudos
Here are some solution options I received from Ryan Merriman, Benjamin Leonhardi & Peter Coates.
Option 1: Use split -l to break the big file into smaller ones before running iconv.
Option 2: If iconv fails, it would be a good idea to write a little program using ICU: http://userguide.icu-project.org/conversion/converters
Option 3: You can do it in Java. Here is one example: https://docs.oracle.com/javase/tutorial/i18n/text/stream.html You can use the File(Input|Output)Stream and String classes. You can specify the character encoding when reading (converting byte[] to String): String s = new String(byte[] bytes, Charset charset); and when writing it back out (String to byte[]): s.getBytes(Charset charset). This approach should solve your size limit problem.
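Building on Option 3, a streaming approach avoids holding a multi-GB file in memory as a single String. Here is a minimal sketch (file paths are placeholders) that re-encodes UTF-16LE to UTF-8 one buffer at a time, which sidesteps both the iconv size limit and Java heap pressure.

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Stream through a reader/writer pair so only a small buffer is in memory at once.
        // The source was reported as little-endian UTF-16; if the file carries a BOM,
        // StandardCharsets.UTF_16 can be used instead to auto-detect byte order.
        try (Reader in = new InputStreamReader(
                 new FileInputStream("input_utf16.csv"), StandardCharsets.UTF_16LE);
             Writer out = new OutputStreamWriter(
                 new FileOutputStream("output_utf8.csv"), StandardCharsets.UTF_8)) {
            char[] buffer = new char[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}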
12-04-2015
12:58 PM
1 Kudo
One of my clients is trying to create an external Hive table in HDP from CSV files (about 30 files, 2.5 TB in total), but the files are formatted as "Little-endian, UTF-16 Unicode text, with CRLF, CR line terminators". Here are a couple of questions: Is there an easy way to convert CSV/TXT files from Unicode (UTF-16 / UCS-2) to ASCII (UTF-8)? Is there a way for Hive to recognize this format? He tried to use iconv to convert the UTF-16 format to ASCII, but it fails when the source file is larger than about 15 GB: iconv -c -f utf-16 -t us-ascii Any suggestions?
Labels: Apache Hive
11-21-2015
03:53 PM
As of HDF 1.0, we can write custom processors for HDF using Java. Is there a plan to support other programming languages?
Labels: Apache NiFi, Cloudera DataFlow (CDF)
11-20-2015
01:09 AM
Thanks @bbende!! So we do not recommend scaling NiFi vertically by increasing the JVM heap size to a really large value?
11-19-2015
10:55 PM
Can we have a file-watcher kind of mechanism in NiFi, where the data flow gets triggered whenever a file shows up at the source? Is it the same as scheduling a GetFile processor to run always?
Labels: Apache NiFi, Cloudera DataFlow (CDF)
11-19-2015
10:46 PM
When we run HDF on a single machine, do all the data flows built on that machine run under a single JVM? I did see NiFi documentation that talks about how you can control spilling data from the JVM to hard disk, but is there an option to run multiple JVMs, say one for each flow? Also, how big of a JVM heap do you usually configure for an edge node?
Labels: Apache NiFi, Cloudera DataFlow (CDF)
11-19-2015
01:29 PM
9 Kudos
1 NiFi Custom Processor Overview
Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. It is a very simple product to use, with which you can build a "data flow" very easily. As of version 0.3.0 it has 90 prebuilt processors, but you can also extend them by adding your own custom processors. In this article I am going to talk about how you can build a custom NiFi processor on your local machine and then move the finished processor, which is a NAR file, to NiFi. This article is based on a video from YouTube: https://www.youtube.com/watch?v=3ldmNFlelhw
2 Steps to build a custom processor
Here are the steps involved to build the custom processor for NiFi. I used my Mac to build this processor.
2.1 Required software
The two pieces of software you need on your local machine are:
1. Maven
2. Java
Here is how you can quickly check if you have them installed:
mvn -version
java -version
Here are the results from my machine:
$ mvn -version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T11:29:23-06:00)
Maven home: /usr/local/Cellar/maven/3.2.5/libexec
Java version: 1.8.0_65, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.4", arch: "x86_64", family: "mac"
$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
2.2 Create a directory where you want to build the processor
I created the directory under the following location:
cd <Home Dir>/Documents/nifi/ChakraProcessor
mkdir ChakraProcessor
2.3 Create the NiFi processor with default values
Go to the new directory you just created and use the Maven archetype command to generate the required Java files:
cd <Home Dir>/Documents/nifi/ChakraProcessor
mvn archetype:generate
You will be asked for a bunch of parameters; here are the values I chose:
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): 690: nifi
Choose archetype:
1: remote -> org.apache.nifi:nifi-processor-bundle-archetype (-)
2: remote -> org.apache.nifi:nifi-service-bundle-archetype (-)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Choose org.apache.nifi:nifi-processor-bundle-archetype version:
1: 0.0.2-incubating
2: 0.1.0-incubating
3: 0.2.0-incubating
4: 0.2.1
5: 0.3.0
Choose a number: 5: 4
Downloading: https://repo.maven.apache.org/maven2/org/apache/nifi/nifi-processor-bundle-archetype/0.2.1/nifi-processor-bundle-archetype-0.2.1.jar
Downloaded: https://repo.maven.apache.org/maven2/org/apache/nifi/nifi-processor-bundle-archetype/0.2.1/nifi-processor-bundle-archetype-0.2.1.jar (12 KB at 8.0 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/apache/nifi/nifi-processor-bundle-archetype/0.2.1/nifi-processor-bundle-archetype-0.2.1.pom
Downloaded: https://repo.maven.apache.org/maven2/org/apache/nifi/nifi-processor-bundle-archetype/0.2.1/nifi-processor-bundle-archetype-0.2.1.pom (2 KB at 9.4 KB/sec)
Define value for property 'groupId': : hwx
Define value for property 'artifactId': : HWX
Define value for property 'version': 1.0-SNAPSHOT: : 1.0
Define value for property 'artifactBaseName': : demo
Define value for property 'package': hwx.processors.demo: :
[INFO] Using property: nifiVersion = 0.1.0-incubating-SNAPSHOT
Confirm properties configuration:
groupId: hwx
artifactId: HWX
version: 1.0
artifactBaseName: demo
package: hwx.processors.demo
nifiVersion: 0.1.0-incubating-SNAPSHOT
Y: : Y
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: nifi-processor-bundle-archetype:0.2.1
[INFO] ----------------------------------------------------------------------------
2.4 Modify the processor
The above command generates a MyProcessor.java file, which is where you put the code for your custom processor. Open MyProcessor.java under the following location:
<Home Dir>/Documents/nifi/ChakraProcessor/HWX/nifi-demo-processors/src/main/java/hwx/processors
Add the following lines at the end, after the "// TODO implement" section (see the sketch below for where they land):
// TODO implement
System.out.println("This is a custom processor that will receive flow file");
session.transfer(flowFile, MY_RELATIONSHIP);
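For context, here is a minimal sketch of what the onTrigger method in the generated MyProcessor class looks like with those two lines added. The surrounding names (MY_RELATIONSHIP, the null-check on the flow file) come from the NiFi processor archetype; treat this as an illustration rather than the exact generated file, which may differ slightly between archetype versions.

// These imports are already present in the generated MyProcessor.java
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

// Inside the generated MyProcessor class (which extends AbstractProcessor):
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    // TODO implement
    // Lines added for this demo: log a message and route the flow file to our relationship
    System.out.println("This is a custom processor that will receive flow file");
    session.transfer(flowFile, MY_RELATIONSHIP);
}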
2.5 Change the POM
There is one change you need to make to the POM file before you can create the package: remove "-SNAPSHOT" from the version in the pom.xml under the following location:
<Home Dir>/Documents/nifi/ChakraProcessor/HWX
2.6 Create the NAR file for your processor
cd <Home Dir>/Documents/nifi/ChakraProcessor/HWX
mvn install
Once the Maven install is done, you will have the NAR file in the target directory with the name nifi-demo-nar-1.0.nar:
cd <Home Dir>/Documents/nifi/ChakraProcessor/HWX/nifi-demo-nar/target
$ ls
classes maven-archiver maven-shared-archive-resources nifi-demo-nar-1.0.nar
2.7 Copy the NAR file to NiFi
Copy the NAR file to the bin directory of the NiFi installation. Here is a sample command to copy the file:
scp nifi-demo-nar-1.0.nar root@172.16.149.157:/opt/nifi-1.0.0.0-7/bin/
Once the NAR file is copied, you need to restart NiFi. Once restarted, you should be able to add the custom processor you built, which will show up with the name "MyProcessor".
2.8 Build a NiFi data flow
You can build a new data flow using this custom processor:
GenerateFlowFile --> MyProcessor --> LogAttribute
For "MyProcessor", you can enter some random value under the property section to make it valid.
Tags: Data Ingestion & Streaming, hdf, how-to-tutorial, NiFi, nifi-processor, tutorial
10-27-2015
12:11 PM
4 Kudos
In terms of an Azure HDInsight environment, here are a few things to be aware of on the infrastructure side:
- You have the option to install HDInsight on Windows or HDInsight on Linux (Ubuntu 12 LTS only). Apache Ambari only comes with the Linux-based install.
- The machine types available for the Linux-based install were limited to D3, D4 & D12. Not sure if this is because of my Azure account limitations.
- The HDInsight version is 3.2.1, which comes with certain HDP 2.2 components. Separate clusters are required for Hadoop, HBase, and Storm, and Spark is available as a technical preview.
- It uses Blob storage as the default for HDFS. Not sure if there is an option to add VHD or SSD.
- HDInsight 3.2 does not contain Falcon, Flume, Accumulo, Ambari Metrics, Atlas, Kafka, Knox, Ranger, Ranger KMS & Slider. It also has somewhat older versions of the Hadoop components.
Attached is a file that compares HDInsight 3.2.1 components to HDP 2.3.2: hdinsight-and-hdp-component-comparison.zip
Update by @Ancil McBarnett HDInsight Component Versioning: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/
10-22-2015
07:01 PM
Right Ancil, these are jar files specifically for Oracle SQL Developer connectivity. I thought this article would be useful for folks who have SQL Developer as a standard SQL client tool in their company and have no other workaround 🙂
10-22-2015
03:02 PM
7 Kudos
Oracle SQL Developer is one of the most common SQL client tools used by developers, data analysts, data architects, etc. for interacting with Oracle and other relational systems. So extending SQL Developer to connect to Hive is very useful for Oracle users. I found the original article on Oracle's website and made some additions based on the issues that I ran into. Here is the original link: Oracle SQL Developer support for Hive
Here are the steps that I followed.
Step 1) For the latest version of SQL Developer you need JDK 1.8, so install that on your Mac and also change the JAVA_HOME path so that it points to JDK 1.8. Download JDK 1.8
Step 2) Download the latest version of Oracle SQL Developer for Mac from Oracle and unzip it: Oracle SQL Developer Download. Move the SQL Developer file to your Applications folder so that it is available to you. When you try to open Oracle SQL Developer on the Mac, it may not open; for me it showed up in the tray, blinked for a while, and then was gone, so I had to follow these instructions to fix it: Fix for Mac SQL Developer setup. Once you have this fix in place, you should be able to open Oracle SQL Developer.
Step 3) Download a JDBC driver for Hive that can work with Oracle SQL Developer. Cloudera has one available, and here is the link for it: Link for Hive JDBC Driver for Oracle SQL Developer
Step 4) Unzip the file downloaded in step 3. Inside you will find another zip file called "Cloudera_HiveJDBC4_2.5.15.1040.zip". Unzip that file as well and move all the jars to <your home directory>/.sqldeveloper/ :
ql.jar
hive_service.jar
hive_metastore.jar
TCLIServiceClient.jar
zookeeper-3.4.6.jar
slf4j-log4j12-1.5.11.jar
slf4j-api-1.5.11.jar
log4j-1.2.14.jar
libthrift-0.9.0.jar
libfb303-0.9.0.jar
HiveJDBC4.jar
Step 5) Add these jars to SQL Developer. Go to "Oracle SQL Developer" --> Preferences, select Database and then "Third Party JDBC Drivers", and use the add entry option to add the jar files listed above. Restart SQL Developer for the change to take effect.
Step 6) Open SQL Developer and right-click on Connections to add a connection. Select the Hive tab, enter your Hive server details, and add the connection. You are all set to browse Hive tables via SQL Developer.
Tags: Data Processing, Hive, oracle sql developer
10-20-2015
11:49 AM
One of my clients is using Azure-based IaaS for their HDP cluster. They are open to using more expensive storage to get better performance. Is it recommended to use SSDs for some of the data in Hive tables to get that boost in performance? Also, what are the steps to point the temporary storage used by Tez/MR jobs to SSD?
Labels: Apache Hive
10-19-2015
01:56 PM
2 Kudos
Thanks guys for the response. I was able to modify the configuration for MS SQL Server:
Database Connection URL --> jdbc:sqlserver://a5d3iwbrq1.database.windows.net:1433;databaseName=chakra
Database Driver Class Name --> com.microsoft.sqlserver.jdbc.SQLServerDriver
Database Driver Jar Url --> file:///usr/share/java/sqljdbc4.jar
Database User --> chakra
Password --> ******
Once you have the configuration set, you also need to use GenerateFlowFile or something similar to trigger ExecuteSQL, as the Timer driven schedule does not work on the version of NiFi that I was using. Once this was done, I ran into a bug where ExecuteSQL is not able to read the source table structure and gives an Avro schema error: https://issues.apache.org/jira/browse/NIFI-1010 I am assuming that once the above bug is fixed we should be able to use ExecuteSQL for an MS SQL Server DB.
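If the controller service still will not enable, it can help to rule out the driver jar and connection string outside NiFi first. Here is a minimal, hypothetical JDBC check using the same sqljdbc4 driver class and URL format as the configuration above; the host, database name, and credentials are placeholders you would replace with your own, and the sqljdbc4.jar must be on the classpath when you run it.

import java.sql.Connection;
import java.sql.DriverManager;

public class SqlServerConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Same driver class the DBCPConnectionPool expects
        Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
        String url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;databaseName=<your-db>";
        try (Connection con = DriverManager.getConnection(url, "<user>", "<password>")) {
            System.out.println("Connected: " + !con.isClosed());
        }
    }
}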
10-17-2015
03:50 PM
2 Kudos
I am trying to build a DBCPConnectionPool that can connect to MS SQL Server. I downloaded the jar file and gave its path in the DBCPConnectionPool. Here is my configuration:
Database Connection URL --> jdbc:mysql://a5d3iwbrq1.cloudapp.net:3306/chakra
Database Driver Class Name --> /root/sqljdbc_4.0/enu/sqljdbc4.jar
Database Driver Jar Url --> No value set
Database User --> chakra
Password --> ******
However, I get an error when I enable this:
2015-10-17 14:55:54,352 ERROR [pool-28-thread-5] o.a.n.c.s.StandardControllerServiceNode [DBCPConnectionPool[id=ee00cbf3-7dd3-4c32-93a6-9a06a8e5e6a7]] Failed to invoke @OnEnabled method due to {}
org.apache.nifi.reporting.InitializationException: org.apache.commons.dbcp.SQLNestedException: Cannot load JDBC driver class '/root/sqljdbc_4.0/enu/sqljdbc4.jar
Please let me know how we can resolve this.
Labels: Apache NiFi, Cloudera DataFlow (CDF)
10-15-2015
12:33 PM
In terms of security around the SmartSense file transfer, it is mentioned that we can use regex to replace some of the pieces within the bundle. Is there a configuration file for SmartSense where this option is controlled and that clients can change? Also, can you please let me know where I can find the instructions on how a client can upload the SmartSense bundle to Hortonworks support?
Labels: Hortonworks SmartSense
10-08-2015
07:06 PM
23 Kudos
1 Overview
In this article I am going to talk about three main integration pieces between SAP and Hadoop. First, I will cover the SAP integration products and how they can be used to ingest data into Hadoop systems. Second, I am going to talk about SAP HANA and how it integrates with Hadoop; I will provide an overview of the various methods that can be used to push/pull data from SAP HANA, as well as aspects around data federation and data offload architecture. Third, I am going to touch on how the SAP BI tools integrate with Hadoop and talk briefly about SAP HANA Vora.
2 Hadoop using SAP Data Integration components
SAP has three primary products around data integration. Grouped by integration pattern, here is a brief overview of these products:
Batch processing
- SAP Data Services, aka BODS (Business Objects Data Services), is the batch ETL tool from SAP. SAP acquired this product as part of its Business Objects acquisition.
Real time ingestion
- SLT (SAP Landscape Transformation) Replication Server can be used for batch or real-time data integration and uses database triggers for replication.
- SAP Replication Server, which has much broader replication capabilities and truly is a CDC (Change Data Capture) product.
The rest of this section provides a bit more detail on how these products talk to Hadoop.
2.1 SAP Data Services
Data Services can be used to create Hive, Pig or MR (MapReduce) jobs. You can also preview files sitting in HDFS.
2.1.1 Hive
Data Services (DS) uses the Hive Adapter, which SAP provides to connect to Hive tables. This Hive Adapter can act as a data store or as a source/target for DS jobs. There are various operations you can perform using a Hive Adapter based data store. You can perform some of the "push down" functionality, where you push the entire DS job to the Hadoop cluster: DS converts the job into Hive SQL commands and runs it against the Hadoop cluster as a MapReduce job. You can also use the data store to create a SQL transform and the sql() function within DS.
2.1.2 HDFS & Pig
In DS, you can create an HDFS-based file format, where basically you are saying you have a file as a source and that file is sitting in HDFS. Now, when you apply certain transformations to that HDFS file format, such as aggregation or filtering, DS can convert the job into a Pig job. Whether a DS job is converted into a Pig job depends on the kind of transformations applied and whether DS knows how to convert those transformations into a Pig job.
2.1.3 MapReduce
The Hive, HDFS & Pig integrations all eventually get converted into MapReduce jobs that get executed in the cluster. In addition, if you use the "Text data processing" transformation within Data Services, it may also get converted into MapReduce jobs, depending on whether the source and target for the job are in Hadoop.
2.2 SAP SLT (SAP Landscape Transformation)
SAP LT Replication Server is one of the two real-time replication solution options from SAP. SAP LT can be used to load data on a batch or real-time basis from SAP as well as some non-SAP systems. This replication mechanism is trigger based, so it can replicate incremental data very effectively. SAP LT is primarily used for loading data from SAP systems into SAP HANA. However, it is bi-directional, which means you can use SAP HANA as a source and replicate to other systems as well. SAP LT does not support Hadoop integration straight out of the box, but there are a few workarounds you can use to load Hadoop via SLT:
Option 1: Build custom ABAP code within SAP LT Replication Server so that it can read the incremental data from the trigger and load it into HDFS via an API call.
Option 2: SAP LT can write its data into SAP Data Services, which does have a mechanism to load that data into Hadoop via the Hive Adapter or an HDFS file format.
Option 3: Most SAP LT jobs are designed to load data into SAP HANA. You can let SAP LT write data to SAP HANA and then push data from SAP HANA to Hadoop using the Sqoop JDBC connector.
2.3 SAP Replication Server
This is truly SAP's CDC product, which you can use to replicate data between SAP and/or non-SAP systems. It has many more features than SAP LT and it also has support for Hive on Linux. When performing Hive-based SAP replication, you need to be cautious about certain limitations:
- The list of columns between the Hive table and the table replication definition should match.
- You can only load into static partition columns, which means you should know the value of the partition column before you load.
- You have to perform some level of workaround to achieve insert/update kinds of operations on the table.
- You may have scalability challenges using this integration technique.
3 SAP HANA and Hadoop
SAP HANA combines database, data processing, and application platform capabilities in a single in-memory platform. The platform also provides libraries for predictive, planning, text processing, spatial, and business analytics. As your HANA system grows, for cost or scalability reasons you may want to offload some of the cold data to more cost-effective storage such as Hadoop. When you talk about offloading from SAP HANA to Hadoop, there are two primary ways to approach it:
Option 1) Move only the cold data from SAP HANA to Hadoop.
Option 2) Move all data to Hadoop and let your user/query engine decide to go against SAP for hot/warm data or go against Hadoop for all data.
3.1 Offload cold data to Hadoop
There are instances when you want to offload cold data (data that is not used very often for reporting) from SAP HANA to much more cost-effective storage like Hadoop. Here are some high-level integration options for offloading data to Hadoop:
- Use Hive ODBC to connect to HDP. You can install the Hive ODBC driver on the SAP HANA server and make Hive ODBC calls to the Hadoop system.
- Use Smart Data Access to provide query federation capabilities. HANA has features such as virtual tables (aka SDA) and virtual functions (aka vUDF) to access Hive, Spark and MR.
- You can also leverage the Sqoop JDBC connector to pull data from HANA into Hadoop (a minimal JDBC sketch follows at the end of this section). This integration pattern depends on you being able to read and make sense of the tables stored in HANA. However, you need to be aware that using Sqoop to extract data from HANA is not officially supported yet, and there are a few complications when you try to extract HANA tables with special characters in the table name. Here is a high-level architectural diagram that depicts this option.
SAP also has certain products that you can leverage to pull data from HANA as a file. Here are a couple of such products:
- SAP Open Hub: Using SAP Open Hub, you can get a data extract from SAP in the form of a file, which can then be inserted into Hadoop via NFS, WebHDFS, etc.
- SAP BEx: BEx is more of a query tool that can be used to pull data out of SAP cubes; those files can then be used to load Hadoop.
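As referenced above, the Sqoop/JDBC route ultimately just reads HANA tables over JDBC. Here is a minimal, hypothetical sketch of that underlying read in plain JDBC; the driver class (com.sap.db.jdbc.Driver from SAP's ngdbc.jar) and the jdbc:sap:// URL format are assumptions based on SAP's HANA JDBC driver, and the host, port, schema, table, and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HanaJdbcRead {
    public static void main(String[] args) throws Exception {
        // Assumed HANA JDBC driver class, shipped in SAP's ngdbc.jar (must be on the classpath)
        Class.forName("com.sap.db.jdbc.Driver");
        String url = "jdbc:sap://<hana-host>:30015/";
        try (Connection con = DriverManager.getConnection(url, "<user>", "<password>");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM <schema>.<table> LIMIT 10")) {
            while (rs.next()) {
                // Print the first column of each row as a quick sanity check
                System.out.println(rs.getString(1));
            }
        }
    }
}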
3.2 Move all data to Hadoop
Move all data to Hadoop and let your user/query engine decide to go against SAP for hot/warm data or go against Hadoop for all data. This option is more compelling in situations where you do not want the performance overhead that comes with query federation, and also where you do not want to build a complex process within HANA to identify the cold data that needs to be pushed to Hadoop. With this option you should be able to leverage many of your existing integration jobs and point them to Hadoop as a target besides the existing SAP HANA target. Here is a high-level architectural diagram that depicts this option.
4 Hadoop using SAP BI and Analytics tools
There are three SAP BI tools that are very commonly used within the enterprise. Here are those products and the way they integrate with Hadoop:
- SAP Lumira: uses the Hive JDBC connector to read data from Hadoop.
- SAP BO (Business Objects) & Crystal Reports: both of these products use the Hive ODBC connector to read data from Hadoop.
Besides these three products, there has been a recent announcement from SAP of a new product called SAP HANA Vora. Vora primarily uses Apache Spark at its core, but also has some additional features that SAP introduced to enhance the query execution technique as well as improve the SQL capabilities beyond what Spark SQL provides. Vora needs to be installed on all the nodes in the Hadoop cluster where you want to run Vora. It provides a local cache capability using native C code for map functions, along with Apache Spark's core capabilities. On the SQL front, it provides much richer features around OLAP (Online Analytical Processing), such as hierarchical drill-down capabilities. Vora also works with SAP HANA, but HANA is not a must-have. You can use Vora to move data between SAP HANA and Hadoop using some of SAP's proprietary integration techniques rather than the ODBC technique that Smart Data Access uses. You also have the capability to write federated queries that use both SAP HANA and the Hadoop cluster using Vora.
I hope you find this article useful! Please provide feedback to make this article more accurate and complete.
Tags: sap, sap data services, sap-hana, solutions
09-29-2015
11:38 AM
8 Kudos
The two answers above are great for Hive Metastore backup. For the Hive data itself, here are a few options:
Option 1) Hive data is stored in HDFS (Hadoop Distributed File System), so any backup or DR (Disaster Recovery) strategy you have for HDFS can be used for Hive as well. You can use the snapshot feature in HDFS to take a point-in-time image. A snapshot can cover the entire file system, a sub-tree of the file system, or just a file. You can also take incremental backups by doing a diff between two snapshots.
Option 2) You can write your own distcp code and make it part of a Falcon data pipeline. Using Distcp to copy files
Option 3) You can use the Falcon data mirroring capability to mirror the data in HDFS or Hive tables. Here is a link on that: Falcon Data Mirroring
Option 4) You can have an active-active data load to both your primary cluster and your DR cluster. For example, if you are using a Sqoop job to pull data from a particular RDBMS and load it into a Hive table, you can create two Sqoop jobs: one to load the primary cluster's Hive table and the other to load the DR cluster's Hive table.
Your choice of which option to pick depends on the SLAs (service level agreements) around DR/backup, budget, skill level, etc.
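For Option 1, snapshots can also be driven programmatically once the directory has been made snapshottable (hdfs dfsadmin -allowSnapshot). Here is a minimal sketch using the Hadoop FileSystem API; the warehouse path and snapshot name are placeholders, and it assumes the client picks up the cluster configuration (core-site.xml/hdfs-site.xml) from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HiveWarehouseSnapshot {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from the Hadoop config on the classpath
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path warehouseDir = new Path("/apps/hive/warehouse");  // placeholder Hive warehouse path
            // The directory must already be snapshottable (hdfs dfsadmin -allowSnapshot <dir>)
            Path snapshot = fs.createSnapshot(warehouseDir, "hive-backup-" + System.currentTimeMillis());
            System.out.println("Created snapshot at: " + snapshot);
        }
    }
}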