Member since: 09-29-2015
Posts: 32
Kudos Received: 55
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4666 | 11-26-2015 10:19 PM
 | 3726 | 11-05-2015 03:22 AM
04-08-2017
09:25 AM
Yes, for some reason enabling Ranger removes the hadoop.proxyuser.root.hosts setting even if you had it set before... annoying.
09-03-2016
10:38 PM
Which operating system are you installing on? Also, can you connect to this database manually? Was the database auto-created, or did you create it on the server yourself?
08-23-2016
09:51 AM
14 Kudos
Introduction

Apache NiFi 1.0 was recently released and is being integrated into Hortonworks Data Flow (HDF), which will be released very soon. In this easy tutorial we will see how we can stream data from CSV format into Hive tables directly and start working on it right away, without a single line of coding to set up the streaming.

Pre-requisites

In order to run this tutorial successfully you need to download the following:
- NiFi 1.0 or higher, you can download it from here
- HDP Sandbox 2.4 or higher, you can download it from here
- The Olympics CSV data from the attachment list below
Changing NiFi Port (Optional)

Since Ambari and NiFi both use port 8080, you will have problems starting NiFi if you are running the sandbox and NiFi on the same machine. Once NiFi is downloaded, uncompress it, open /nifi/conf/nifi.properties and change the port number to 8089 as follows:

nifi.web.http.port=8089
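If you'd rather make that edit from the command line, here is a minimal sketch (it assumes the same /nifi install path used above; on macOS use sed -i '' instead of sed -i):

sed -i 's/^nifi.web.http.port=8080$/nifi.web.http.port=8089/' /nifi/conf/nifi.properties
grep '^nifi.web.http.port' /nifi/conf/nifi.properties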
Starting NiFi and the Sandbox

Once NiFi is downloaded, uncompress it and start it using the command:

/nifi/bin/nifi.sh start

You may open a new browser page and go to http://localhost:8089/nifi to make sure NiFi is running fine; give it a minute to load. Start the sandbox from VMware or VirtualBox, go to Ambari on https://localhost:8080 and make sure Hive is started. Now let's work on the table and the streaming part…
Creating The Hive Table

Since we will have to create an empty external table, we need to make sure that the folder for this table exists so we can store the data there without a problem. To do this, connect to the sandbox and create the directory as the hive user:

hadoop fs -mkdir /user/hive/olympics

Now let's move on to the table creation. From the downloaded Olympics data (olympics.zip), let's examine the header of any of the files:

City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal

In order for Hive Streaming to work, the following has to be in place:
- The table is stored as ORC
- The transactional property is set to "true"
- The table is bucketed

We will have to create a table in Hive to match the schema as follows:

CREATE EXTERNAL TABLE OLYMPICS(CITY STRING,EDITION INT,SPORT STRING,SUB_SPORT STRING,ATHLETE STRING,COUNTRY STRING,GENDER STRING,EVENT STRING,EVENT_GENDER STRING,MEDAL STRING)
CLUSTERED BY (EDITION) INTO 3 BUCKETS
ROW FORMAT DELIMITED
STORED AS ORC
LOCATION '/user/hive/olympics'
TBLPROPERTIES('transactional'='true');

Once the table is created successfully we may move on to the NiFi part.
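A quick way to sanity-check the table before wiring up NiFi is to connect with beeline and inspect it (a sketch; the JDBC URL assumes the sandbox defaults with HiveServer2 on port 10000, adjust host and port to your environment):

beeline -u jdbc:hive2://sandbox.hortonworks.com:10000 -n hive
-- then, inside beeline, confirm ORC, bucketing and the transactional property:
DESCRIBE FORMATTED olympics;
SHOW TBLPROPERTIES olympics;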
NiFi Template (Optional, if you are feeling lazy)

If you don't want to follow the steps below, you can easily download the template that contains the whole thing from here: hive-streaming-olympics.xml (simply start NiFi and import it). If you have done the previous part, just make sure to change the directories and the parameters in every processor to match your configuration.

Configure NiFi

At a high level, we need to create the following flow for our streaming to work (a connection sketch follows below):
- GetFile processor to read the data directly from the source folder
- InferAvroSchema to pre-configure how the file will look and to set any custom headers if needed
- ConvertCSVToAvro is where the actual conversion happens before the data is forwarded to HiveStreaming
- HiveStreaming is where the data is inserted into Hive
- We are optionally using PutFile to capture any unsuccessful CSVs during the streaming

For more on the Avro conversion, refer to the great write-up from @Jeremy Dyer on how to convert CSV to Avro, as it explains in greater detail how the flow works.
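Put together, the connections look roughly like this (the failure relationships routed to PutFile are how the flow catches bad files; exact relationship names may vary slightly by NiFi version):

GetFile -> InferAvroSchema -> ConvertCSVToAvro -> PutHiveStreaming
failure (from ConvertCSVToAvro / PutHiveStreaming) -> PutFile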
Pulling Data from CSV

Simply, the only thing you need to do here is configure your source directory. There are some handy parameters to tune based on the number of CSV files, like Batch Size (how many CSVs per pull); a sketch of the properties follows.
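The GetFile properties worth touching look roughly like this (property names as they appear in NiFi 1.0; the input directory is only an example path, point it at wherever you extracted the Olympics CSVs):

Input Directory: /tmp/olympics-csv      (example path; use your CSV folder)
Batch Size: 10                          (how many files are pulled per scheduling cycle)
Keep Source File: false                 (the default; files are removed once picked up)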
Pre-Configure the Files for Avro Conversion

Make sure Flowfile-attribute is selected for Schema Output Destination, as we will capture the flow file attribute in the next processor. Content Type could be JSON or CSV; in our case it will be CSV. Since all the CSVs here have no header, we will have to set the header definition ourselves in the processor; the header definition will be as follows:

City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal

If we did have a header in every file, we could easily set Get CSV Header Definition from Data to "true" and let NiFi determine the schema (make sure you skip a line on the next processor if you are doing that, otherwise you will have the headers ingested as well). CSV Header Skip Count is important if you have a custom header and want to ignore whatever headers you previously have in your CSVs.
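For reference, a sketch of the InferAvroSchema settings used here (names as they appear in NiFi 1.0; double-check against your version):

Schema Output Destination: flowfile-attribute
Input Content Type: csv
CSV Header Definition: City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal
Get CSV Header Definition From Data: false
CSV Header Line Skip Count: 0           (our files contain no header line to skip)
Avro Record Name: olympics              (any record name works here)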
Convert to Avro

Nothing much to do here except for capturing the schema generated by the previous processor using the ${inferred.avro.schema} attribute. We don't have to skip any header lines here, as we don't have any contained within the CSVs.
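In ConvertCSVToAvro that boils down to a couple of properties (again a sketch, property names per the Kite-based processor shipped with NiFi 1.0):

Record schema: ${inferred.avro.schema}
Use CSV header line: false              (the files carry no header row)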
Stream into Hive

Here is where all the action happens. You will need to configure the Hive Metastore URI to reflect the address of the sandbox (I have added sandbox.hortonworks.com to my /etc/hosts file so I don't have to write the IP address). Another important thing is to grab the hive-site.xml file from your sandbox (usually under /etc/hive/2.x.x.x-xxx/0/hive-site.xml), save it in a local directory and refer to it here. Table Name will be "Olympics", where all the data will be stored.
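Roughly, the Hive streaming processor (named PutHiveStreaming in NiFi 1.0) ends up configured like this; port 9083 is the usual metastore default, and the hive-site.xml path is wherever you saved your local copy:

Hive Metastore URI: thrift://sandbox.hortonworks.com:9083
Hive Configuration Resources: /path/to/hive-site.xml     (the copy pulled from the sandbox)
Database Name: default                  (the tutorial's DDL does not specify a database)
Table Name: olympics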
Catching Any Errors

In real-life cases, not all CSVs are good to go; we will get some corrupted ones from time to time. Configuring a processor to store those files so you can deal with them later is always a good idea: just add the directory where the corrupt or faulty files will be stored.
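A PutFile sketch for that (the directory is only an example; pick anywhere writable by the NiFi user):

Directory: /tmp/olympics-failed         (example path for the rejected CSVs)
Create Missing Directories: true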
Start Streaming

Now simply press the play button and enjoy watching the files being streamed into Hive. Watch for any red flags on the processors, which means there are some issues to resolve.
Check The Data

Once the data is streamed, you can check the data out using the Ambari Hive View or even Zeppelin to visualise it. Let's look at how the data appears in the table using the Ambari Hive View. Now, let's do some cooler stuff with NiFi.
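A couple of quick queries to confirm the rows actually landed (plain HiveQL, nothing specific to this tutorial):

SELECT COUNT(*) FROM olympics;
SELECT medal, COUNT(*) AS medals FROM olympics GROUP BY medal;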
Labels:
03-31-2016
05:06 AM
Shane, thanks for the elaboration, I agree some further work should be done in this space... appreciate the elaboration again, as this will be a great explanation for whoever is trying to figure this out!
03-31-2016
05:04 AM
hbasenet worked... thanks
03-30-2016
04:46 AM
1 Kudo
The HDInsight SDK seems to only work with HDInsight on specific HBase releases (0.98)... wonder if anyone has had luck getting it to work on HDP or with HBase 1.x+.
03-30-2016
03:40 AM
4 Kudos
Labels:
- Apache HBase
02-27-2016
12:24 AM
2 Kudos
Labels:
- Hortonworks Data Platform (HDP)
02-07-2016
12:15 AM
1 Kudo
Need more elaboration on this. How will Kerberos solve the problem?
02-07-2016
12:10 AM
1 Kudo
Labels:
- Apache Ambari
- Apache Hadoop
- Apache Hive