Member since: 09-29-2015
Posts: 32
Kudos Received: 55
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
| 1863 | 11-26-2015 10:19 PM
| 2095 | 11-05-2015 03:22 AM
04-08-2017
09:25 AM
Yes, for some reason after enabling Ranger it will remove the hadoop.proxyusers.root.hosts setting even if you had it before... annoying.
09-03-2016
10:38 PM
Which operating system are you installing on? Also, can you connect to this database manually? Was the database auto-created, or did you create it on the server yourself?
08-23-2016
09:51 AM
14 Kudos
Introduction

Apache NiFi 1.0 was recently released and is being integrated into Hortonworks Data Flow (HDF), which will be released very soon. In this easy tutorial we will see how we can stream data in CSV format directly into Hive tables and start working on it right away, without a single line of coding to set up the streaming.

Pre-requisites

In order to run this tutorial successfully you need to download the following:
- NiFi 1.0 or higher; you can download it from here
- HDP Sandbox 2.4 or higher; you can download it from here
- The Olympics CSV data from the attachment list below
Changing NiFi Port (Optional)

Since Ambari and NiFi both use port 8080, you will have problems starting NiFi if you are running the Sandbox and NiFi on the same machine. Once NiFi is downloaded, uncompress it, open /nifi/conf/nifi.properties and change the port number to 8089 as follows: nifi.web.http.port=8089
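If you prefer to make this change from the command line, here is a minimal sketch, assuming NiFi was uncompressed into ./nifi (the .bak suffix just keeps a backup of the original file):
sed -i.bak 's/^nifi.web.http.port=.*/nifi.web.http.port=8089/' ./nifi/conf/nifi.properties
grep '^nifi.web.http.port' ./nifi/conf/nifi.properties   # verify the new port is in place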
Starting NiFi and the Sandbox

Once NiFi is downloaded, uncompress it and start it using the command: /nifi/bin/nifi.sh start
You may open a new browser page and go to http://localhost:8089/nifi to make sure NiFi is running fine; give it a minute to load. Start the Sandbox from VMware or VirtualBox, go to Ambari on https://localhost:8080 and make sure Hive is started. Now let's work on the table and the streaming part…
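If you want to confirm from the shell that NiFi came up before opening the browser, something like this should work (assuming the same ./nifi directory as above):
/nifi/bin/nifi.sh status   # should report that NiFi is currently running
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8089/nifi   # expect 200 once the UI is ready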
Creating the Hive Table

Since we will be creating an empty external table, we need to make sure the folder for this table exists so we can store the data there without a problem. To do this, connect to the Sandbox and create the directory as the hive user:
hive-user> hadoop fs -mkdir /user/hive/olympics
Now let's move on to the table creation. From the downloaded Olympics data (olympics.zip), examine the header of any of the files:
City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal
In order for Hive Streaming to work, the following has to be in place:
- The table is stored as ORC
- The transactional property is set to "true"
- The table is bucketed
We will have to create a table in Hive to match the schema, as follows:
CREATE EXTERNAL TABLE
OLYMPICS(CITY STRING,EDITION INT,SPORT STRING,SUB_SPORT STRING,ATHLETE STRING,COUNTRY STRING,GENDER STRING,EVENT STRING,EVENT_GENDER STRING,MEDAL STRING)
CLUSTERED BY (EDITION)INTO 3 BUCKETS
ROW FORMAT DELIMITED
STORED AS ORC
LOCATION '/user/hive/olympics'
TBLPROPERTIES('transactional'='true');
Once the table is created successfully we may move on to the NiFi part.
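As an optional sanity check before wiring up NiFi, you can confirm the table picked up the ORC, bucketing and transactional settings; the JDBC URL below is the Sandbox default and may differ in your environment:
hive-user> beeline -u jdbc:hive2://localhost:10000/default -e "DESCRIBE FORMATTED olympics;"
# look for 3 buckets on EDITION, an ORC input format, and transactional=true under Table Parameters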
NiFi Template (Optional, if you are feeling lazy)

If you don't want to follow the steps below, you can easily download the template that contains the whole flow from here: hive-streaming-olympics.xml (simply start NiFi and import it). If you have done the previous part, just make sure to change the directories and the parameters in every processor to match your configuration.

Configure NiFi

At a high level, we need to create the following flow for our streaming to work:
- GetFile processor to read the data directly from the source folder
- InferAvroSchema to pre-configure how the file will look and to set any custom headers if needed
- ConvertCSVtoAvro, where the actual conversion happens before the data is forwarded to HiveStreaming
- HiveStreaming, where the data is inserted into Hive
- Optionally, PutFile to capture any unsuccessful CSVs during the streaming
For more on the Avro conversion, refer to the great write-up from @Jeremy Dyer on how to convert CSV to Avro, as it explains in greater detail how the flow works.

Pulling Data from CSV

The only thing you need to do here is configure your source directory; there are also some handy parameters to check depending on the number of CSV files, such as Batch Size (how many CSVs per pull).
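For testing, a simple way to feed the flow is to drop the extracted Olympics CSVs into the directory that GetFile watches; the directory below is just an example, use whatever you configured as the source:
mkdir -p /tmp/olympics-in
unzip olympics.zip -d /tmp/olympics-in   # GetFile's source directory should point here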
Pre-Configure the Files for Avro Conversion

Make sure Flowfile-attribute is selected for Schema Output Destination, as we will capture the flow file in the next processor. Content Type could be JSON or CSV; in our case it will be CSV. Since none of the CSVs here have a header, we will have to set the header definition ourselves in the processor; the header definition will be as follows:
City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal
If we did have a header in every file, we could simply set Get CSV Header Definition From Data to "true" and let NiFi determine the schema (make sure you skip a line in the next processor if you do that, otherwise you will have the headers ingested as well). CSV Header Skip Count is important if you have a custom header and you want to ignore whatever headers you previously had in your CSVs.
Convert to Avro

Nothing much to do here except capture the flow file generated by the previous processor using the ${inferred.avro.schema} parameter; we don't have to skip any header lines here, as none are contained within the CSVs.
Stream into Hive

Here is where all the action happens. You will need to configure the Hive Metastore URI to reflect the address of the Sandbox (I have added sandbox.hortonworks.com to my /etc/hosts file so I don't have to write the IP address). Another important thing is to grab the hive-site.xml file from your Sandbox (usually under /etc/hive/2.x.x.x-xxx/0/hive-site.xml), save it in a local directory and refer to it here. Table Name will be "Olympics", where all the data will be stored.
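To pull hive-site.xml down to the machine where NiFi runs, something like the following should work; the exact path depends on your Sandbox version, and ~/nifi-conf is just an example local directory:
macbook-ned> scp root@sandbox.hortonworks.com:/etc/hive/2.x.x.x-xxx/0/hive-site.xml ~/nifi-conf/
# refer to this local copy in the Hive streaming processor's configuration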
Catching any errors

In real-life cases, not all CSVs are good to go; we will get some corrupted ones from time to time. Configuring a processor to store those files so you can deal with them later is always a good idea: simply add the directory where the corrupt or faulty files will be stored.
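For example, creating the error directory up front (the name is arbitrary) so the failed flow files have somewhere to land:
mkdir -p /tmp/olympics-failed   # point the PutFile directory property at this path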
Start Streaming

Now simply press the play button and enjoy watching the files being streamed into Hive. Watch for any red flags on the processors, which mean there are issues to resolve.
Check the Data

Once the data is streamed, you can check it out using the Ambari Hive View or even Zeppelin to visualise it. Let's look at how the data appears in the table using the Ambari Hive View. Now, let's do some cooler stuff with NiFi!
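A quick way to confirm from the shell that rows are actually landing in the table; again, the JDBC URL is the Sandbox default and is only an assumption:
hive-user> beeline -u jdbc:hive2://localhost:10000/default -e "SELECT COUNT(*) FROM olympics;"
hive-user> beeline -u jdbc:hive2://localhost:10000/default -e "SELECT * FROM olympics LIMIT 10;"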
- Find more articles tagged with:
- Data Ingestion & Streaming
- Hive
- hive-streaming
- How-ToTutorial
- ingestion
- NiFi
- nifi-streaming
- olympics
- streaming
03-31-2016
05:06 AM
Shane, thanks for the elaboration. I agree some further work should be done in this space... appreciate it again, as this will be a great explanation for whoever is trying to figure this out!
03-31-2016
05:04 AM
hbasenet worked... thanks
03-30-2016
04:46 AM
1 Kudo
The HDInsight SDK seems to only work with HDInsight on specific HBase releases... 0.98. Wonder if anyone has had luck working it out on HDP or with HBase version 1+.
02-07-2016
12:15 AM
1 Kudo
Need more elaboration on this: how will Kerberos solve the problem?
12-08-2015
05:26 AM
3 Kudos
From what we have witnessed in the field and during some customer testing, Spark SQL (1.4.x at the time of testing) was generally 50%-200% faster when querying small datasets; by small we mean anything under 100 GB, which is usually great for data discovery, data wrangling, testing things out, or even running a production use case where the datasets tend to be numerous but relatively small. The bigger the table, especially when joins are not used effectively or you are scanning one single big table, and when you are in the BI space where SLAs are required and you can't afford a query to break and start over, Tez was able to shine: it is rigid and stable, and the bigger the dataset, the better its performance gets compared to Spark. At around 250 GB you will see very similar execution times; of course this will depend on how big the cluster is, how much memory is allocated, etc. In general, my personal opinion is that we shouldn't compare the two at this time, as each shines in a separate context: at some stage Tez might be needed, while Spark may be the better fit for smaller datasets. As I mentioned, that was based on Spark 1.4.x; I would love to re-run the tests, especially after the new cube functionalities in Spark 1.5. Hope this was helpful.
11-30-2015
09:29 PM
3 Kudos
One of the first use cases we see with HBase is loading it up with data. Most of the time we have data available in some format like CSV and would like to load it into HBase, so let's take a quick look at what the procedure looks like. First, let's examine our example data by looking at the simple structure I have for an industrial sensor:
id, temp:in, temp:out, vibration, pressure:in, pressure:out
5842, 50, 30, 4, 240, 340
First of all, make sure HBase is started on your Sandbox.

Creating the HBase Table
Log in as root to the HDP Sandbox and switch to the hbase user: root> su - hbase
Go to the HBase shell by typing: hbase> hbase shell
Create the example table by typing: hbase(main):001:0> create 'sensor','temp','vibration','pressure'
Let's make sure the table was created and examine the structure by typing: hbase(main):001:0> list
Now exit the shell by typing 'exit' and let's load some data.

Loading the Data
Let's put the hbase.csv file in HDFS; you may SCP it to the cluster first using the following command: macbook-ned> scp hbase.csv root@sandbox.hortonworks.com:/home/hbase
Now put it in HDFS using the following command: hbase> hadoop fs -copyFromLocal hbase.csv /tmp
We shall now execute the ImportTsv statement as follows (the first CSV field, id, becomes the row key): hbase> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs://sandbox.hortonworks.com/tmp/hbase.csv
Once the MapReduce job is completed, return to the hbase shell and execute: hbase(main):001:0> scan 'sensor'
You should now see the data in the table.

Remarks
The ImportTsv statement generates a massive amount of logs, so make sure you have enough space in /var/logs; in a real cluster it is always better to have the logs mounted on a separate partition to avoid an operational stoppage because logs fill up the partition.
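As a quick spot check after the scan, fetching the single sample row by its key ('5842' comes from the example line above) should return the five sensor columns:
hbase(main):001:0> get 'sensor', '5842'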
- Find more articles tagged with:
- csv
- Data Processing
- HBase
- how-to-tutorial
- import
- loadtsv
- ned
11-26-2015
10:19 PM
2 Kudos
OK, here is the latest: the R interpreter for Zeppelin has not been merged into the latest Zeppelin distribution yet; however, you can use it now from here: https://github.com/apache/incubator-zeppelin/pull/208. All the best 🙂
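If you want to try it before the merge, one way is to build Zeppelin from that PR branch; this is only a sketch, assuming a standard Maven build:
git clone https://github.com/apache/incubator-zeppelin.git && cd incubator-zeppelin
git fetch origin pull/208/head:r-interpreter && git checkout r-interpreter
mvn clean package -DskipTests   # build Zeppelin including the R interpreter from the PR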
11-20-2015
05:15 AM
This should now be solved: starting with Zeppelin 0.5.5 you don't need to rebuild for different Spark/Hadoop versions... enjoy 🙂
11-11-2015
10:33 PM
@azeltov@hortonworks.com you can, as long as you modify the ZEPPELINHUB_API_TOKEN and you have a direct internet connection from the Sandbox.
11-08-2015
11:06 PM
5 Kudos
Introduction

Hive is one of the most commonly used databases on Hadoop; its user base is doubling every year thanks to amazing enhancements and the addition of Tez and Spark, which let Hive bypass the MapReduce era in favour of in-memory execution and changed how people use Hive. In this blog post I will show you how to connect the SQuirreL SQL Client to Hive; the concept is similar for any other client out there, and as long as you use open-source libraries matching the ones here you should be fine.

Prerequisite
Download the Hortonworks Sandbox with HDP 2.2.4 and the SQuirreL SQL Client.

Step 1
Follow the SQuirreL documentation and run it on your Mac or PC.

Step 2
Follow the Hortonworks HDP installation on VirtualBox, VMware or Hyper-V and start up the virtual instance.

Step 3
Once HDP is up and running, connect to it using SSH as shown on the console. Once you are connected, you need to download some JAR files in order to establish the connection.

Step 4
If you are using macOS, while connected to your HDP instance simply search for each of the following JARs using the command:
root> find / -name JAR_FILE
Once you find the file you need, copy it to your laptop/PC using SCP:
root> scp JAR_FILE yourMacUser@yourIPAddress:/PATH_TO_JARS
The files you should look for are the following (versions will differ based on which Sandbox you are running, but different versions are unlikely to cause a problem):
- commons-logging-1.1.3.jar
- hive-exec-0.14.0.2.2.4.2-2.jar
- hive-jdbc-0.14.0.2.2.4.2-2.jar
- hive-service-0.14.0.2.2.4.2-2.jar
- httpclient-4.2.5.jar
- httpcore-4.2.5.jar
- libthrift-0.9.0.jar
- slf4j-api-1.7.5.jar
- slf4j-log4j12-1.7.5.jar
- hadoop-common-2.6.0.2.2.4.2-2.jar
If you are running Windows, you might need to install WinSCP in order to grab the files from their locations.

Step 5
Once all the JARs above are downloaded onto your local machine, open up SQuirreL, go to Drivers and add a new driver:
Name: Hive Driver (could be anything else you want)
Example URL: jdbc:hive2://localhost:10000/default
Class Name: org.apache.hive.jdbc.HiveDriver
Go to Extra Class Paths and add all the JARs you downloaded. You may change the port number or IP address if you are not running with the defaults.

Step 6
Log in to your Hadoop Sandbox and verify that HiveServer2 is running using: netstat -anp | grep 10000
If nothing is running, you can start HiveServer2 manually: hive> hiveserver2

Step 7
Once you verify HiveServer2 is up and running, you are ready to test the connection in SQuirreL by creating a new alias as follows. You are now ready to connect; once the connection is successful you should get a screen like this.

Step 8 (Optional)
With your first Hive query, SQuirreL can be buggy and complain about memory and heap size. If this ever occurs and you are on a Mac, right-click the app icon --> Show Package Contents --> open Info.plist and add the following snippet:
<key>Java</key>
<dict>
<key>VMOptions</key>
<array>
<string>-Xms128m</string>
<string>-Xmx512m</string>
</array>
</dict>
Now you can enjoy...
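As a side note on Step 6: before wiring up SQuirreL, it can save time to confirm the JDBC endpoint itself works from the Sandbox, e.g. with Beeline (which ships with Hive); the URL matches the defaults used in this post:
root> beeline -u jdbc:hive2://localhost:10000/default -e "show tables;"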
11-08-2015
10:56 PM
12 Kudos
Introduction

Apache Zeppelin (incubating at the time of writing this post) is one of my favourite tools to position and present to anyone interested in analytics. It is 100% open source, with an intelligent international team behind it in Korea (NFLabs, moving to San Francisco soon), and it is based on an interpreter concept that allows any language or data-processing backend to be plugged into Apache Zeppelin. It is very similar to IPython/Jupyter, except that the UI is probably more appealing and the set of supported interpreters is richer. At the time of writing this blog, Zeppelin supported:
- Apache Hive QL
- Apache Spark (SQL, Scala and Python)
- Apache Flink
- Postgres
- Pivotal HAWQ
- Shell
- Apache Tajo
- AngularJS
- Apache Cassandra
- Apache Ignite
- Apache Phoenix
- Apache Geode
- Apache Kylin
- Apache Lens
With this rich set of interpreters, on-boarding platforms like Apache Hadoop, or Data Lake concepts, becomes much easier: data is consolidated in one place, and different organizational units with different skill sets need to access it and perform their day-to-day duties on it, such as data discovery, queries, data modelling, data streaming and finally data science using Apache Spark.

Apache Zeppelin Overview

With the notebook-style editor and the ability to save notebooks on the fly, you can end up with some really cool notebooks, whether you are a data engineer, a data scientist or a BI specialist.
Dataset showing the health expenditure of the Australian Government over time by state.
Zeppelin also has clean, basic visualization views integrated with it, and it gives you control over what you want to include in your graph by dragging and dropping fields into your visualization, as below:
The sum of government budget healthcare expenditure in Australia by state.
Also, when you are done with your awesome notebook story, you can easily create a report out of it and either print it or send it out.
Car accident fatalities related to alcohol-impaired driving, showing the most fatal days on the streets and the most fatal accident types involving alcohol.

Playing with Zeppelin

If you have never played with Zeppelin before, visit this link for a quick way to start working with it using the latest Hortonworks tutorial. We are including Zeppelin as part of HDP as a technical preview, and official support may follow; check it out here and try out the different interpreters and how Zeppelin interacts with Hadoop.

Zeppelin Hub

I was recently given access to the beta version of Hub. Hub is supposed to make life in organizations easier when it comes to sharing notebooks between different departments or people within the organization. Let's assume an organization has Marketing, BI and Data Science practices; the three departments overlap when it comes to the datasets being used, so there is no need anymore for each department to work completely isolated from the others. They can share their experience, brag about their notebooks, and work together on the same notebook when it is complicated or when different skills are required.
Zeppelin Hub UI
Let's have a deeper look at Hub...

Hub Instances

An instance is backed by a Zeppelin installation somewhere (server, laptop, Hadoop, etc.). Every time you create a new instance, a new token is generated; this token should be added to your local Zeppelin installation in the file /incubator_zeppelin/conf/zeppelin-env.sh, e.g.
export ZEPPELINHUB_API_TOKEN="f41d1a2b-98f8-XXXX-2575b9b189"
Once the token is added, you will be able to see the notebooks online whenever you connect to Hub (http://zeppelin.hub.com).

Hub Spaces

Once an instance is added, you will be able to see all the notebooks for each instance. Since every space is either a department or a category of notebooks that needs to be shared across certain people, you can easily drag and drop notebooks into spaces, making them shared across that specific space.
Adding a notebook to a space
Showing a notebook inside Zeppelin Hub
Very cool! Since it is a beta, there is still much work to be done, like executing notebooks from Hub directly, resizing and formatting, and some other minor issues; I am sure the all-star team @nflabs will make it happen very soon, as they always have. If you are interested in playing with the beta, you may request access on the Apache Zeppelin website here.

Hortonworks and Apache Zeppelin

Hortonworks is heavily adopting Apache Zeppelin, as shown by the contributions they have made to the product and to Apache Ambari. @ali, one of the rock stars at Hortonworks, created an Apache Zeppelin View for Ambari, which gives Zeppelin authentication and allows users to have a single pane of glass for uploading datasets using the HDFS view in Apache Ambari Views and for other operational needs.
Apache Ambari with Zeppelin View integration
Apache Zeppelin notebook editor from Apache Ambari
If you want to integrate Zeppelin in Ambari with Apache Spark as well, simply follow the steps in this link.

Hortonworks Gallery for Apache Zeppelin

Recently we published a gallery where anyone can contribute and add their notebooks publicly in order to share them; all you need to do is grab the notebook folder and upload it. Check it out here. If you are not sure how to start, a great way is to take a look at the Hortonworks Gallery for Apache Zeppelin: you will get a 360-degree view of the different ways of creating different notebooks.

Helium Project

Helium is a revolutionary change in Zeppelin: it allows you to integrate almost any standard HTML, CSS or JavaScript as a visualization or a view inside Zeppelin. A Helium application consists of a view, an algorithm and access to the resource; you can get more information on Helium here.
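Going back to the Hub Instances step, here is a minimal sketch of wiring the token into the machine running Zeppelin (the token value is the placeholder from this post, and /incubator_zeppelin is assumed to be the install path):
echo 'export ZEPPELINHUB_API_TOKEN="f41d1a2b-98f8-XXXX-2575b9b189"' >> /incubator_zeppelin/conf/zeppelin-env.sh
/incubator_zeppelin/bin/zeppelin-daemon.sh restart   # restart Zeppelin so the new token is picked up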
- Find more articles tagged with:
- Data Processing
- FAQ
- notebook
- Spark
- zeppelin
- zeppelin-notebook
11-05-2015
03:27 AM
Makes perfect sense. I wonder if it will work with backward compatibility though; right now I ended up with different Zeppelin folders pointing at different Spark versions.
11-05-2015
03:22 AM
Would copying and modifying the interpreter file under the /incubator-zeppelin/interpreter folder help?
11-05-2015
03:19 AM
One of our prospects is looking at NiFi. They can't have GUI-operated tools, as they need everything to be scripted for ease of operations. Can we create and manage NiFi flows from the CLI without the need for the GUI?
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
11-05-2015
03:16 AM
2 Kudos
According to NFLabs this is coming any time now (before the end of the year).