Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6961 | 09-21-2018 09:54 PM |
| | 8721 | 03-31-2018 03:59 AM |
| | 2613 | 03-31-2018 03:55 AM |
| | 2754 | 03-31-2018 03:31 AM |
| | 6174 | 03-27-2018 03:46 PM |
03-07-2017
02:19 AM
2 Kudos
@spdvnz In my experience, the cause is usually a missed host-setup prerequisite. See this: https://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.0.1/bk_ambari-installation/content/ch_getting_ready.html Check once again all the prerequisites mentioned in that document. It has happened to me to believe I had done everything right, but on going through the list again I found one that I had missed.
03-07-2017
02:10 AM
2 Kudos
@CriCL Try a Zeppelin notebook. Here is a basic example of ingesting data from a source via NiFi, storing it in Hive, and accessing it via Zeppelin: https://community.hortonworks.com/articles/87230/customer-demographics-demo-with-apache-nifi-hive-a-1.html There are several good notebooks in the HDP sandbox, and more on GitHub; those are useful if you have stored your data in HDFS or HBase and want to see how to report on it with Zeppelin. Other notebooks, e.g. IPython or Jupyter, can also be connected to HDP.
03-07-2017
02:05 AM
3 Kudos
@Gaurav Jain There are many ways to do it. Assuming you use NiFi, you could build a flow starting with a GetFile processor pointing to the folder that holds all your CSV files, followed by MergeContent and PutFile processors. Without NiFi it can be even simpler: if your files have the same structure and each has a header, a simple shell script can keep the header from one file, strip it from the others, and merge the content. You can do all of this easily with the sed unix command; a sketch of the same idea is shown below.
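For reference, here is a minimal sketch of the same merge in plain Python, assuming (hypothetically) that the input files live in an input/ folder and the result should be written to merged.csv; adjust the paths to your layout.

```python
# Minimal sketch: merge CSV files that share a header, keeping the header once.
# The input/ folder and merged.csv output name are hypothetical placeholders.
import glob

input_files = sorted(glob.glob("input/*.csv"))   # hypothetical input folder
output_file = "merged.csv"                       # hypothetical output name

with open(output_file, "w", encoding="utf-8") as out:
    for i, path in enumerate(input_files):
        with open(path, encoding="utf-8") as src:
            header = src.readline()
            if i == 0:
                out.write(header)    # keep the header from the first file only
            for line in src:         # copy the remaining rows as-is
                out.write(line)
```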
03-07-2017
01:54 AM
1 Kudo
@Sean Anderson Not directly related, but I noticed that you use the VARCHAR data type heavily. That is a known performance issue; try STRING instead. Coming back to your issue, I assume that both columns (ID, AccountType) contain no nulls; if they do have null values, that could explain the difference in the number of records dropped. A quick check is sketched below.
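If it helps, here is a minimal sketch of such a null check, assuming the table lives in Hive and using the PyHive client; the host and table name are hypothetical placeholders to adjust for your environment.

```python
# Minimal sketch: count NULLs in the two columns to see whether they account
# for the dropped records. Host and table name ("accounts_staging") are
# hypothetical; requires PyHive (pip install pyhive).
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()
cursor.execute("""
    SELECT
        SUM(CASE WHEN ID IS NULL THEN 1 ELSE 0 END)          AS null_ids,
        SUM(CASE WHEN AccountType IS NULL THEN 1 ELSE 0 END) AS null_account_types
    FROM accounts_staging
""")
print(cursor.fetchall())
```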
03-06-2017
10:16 PM
13 Kudos
Demonstrate how easy it is to create a simple data flow with NiFi, stream it to Hive, and visualize it via Zeppelin.

Pre-requisites

- Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
- Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
- From my repo for Apache NiFi: the "CSVToHive.xml" template, the customer demographics data (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and customer_demographics_orc_table_ddl.hql (database and table DDLs)
- Apache Hive 1.2.1, included with HDP 2.5.0
- Hive configured to support ACID transactions, with the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the CSVToHive.xml template (screenshot: screen-shot-2017-03-06-at-74106-pm.png).

Create Data Folder and Upload Data Files

In your home directory, create /home/username/customer_demographics and upload the data files listed above. Grant your NiFi user the access needed to read and process them via the GetFile processor, and change the directory path in the GetFile processor to match your path. Also set the "Keep Source File" property of the GetFile processor to false so each file is processed once and then deleted (for testing I kept it as true). You will also have to adjust the Hive Metastore URI to match your environment's host name.

Import Zeppelin Notebook

Import the "Customer Demographics.json" notebook into Zeppelin.

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. Each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API. As you noticed from the DDL, the Hive table is transactional; enabling Hive's global ACID feature and creating the table as transactional and bucketed is a requirement for this to work. The PutHiveStreaming processor also requires Avro input, so the flow converts the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define it from the CSV file header; the latter option was selected for this demo.

Execute Zeppelin Notebook

During the demo you can switch between NiFi and Zeppelin to show how data posted to Hive is reflected in Zeppelin by re-executing the HiveQL blocks (a minimal query sketch follows this post). The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, a data analyst, or a data scientist can benefit from the use of Zeppelin.
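As a rough illustration of the kind of HiveQL the notebook re-executes, here is a small PyHive sketch you could run from a plain Python script while the NiFi flow is streaming; the HiveServer2 host name is a hypothetical placeholder, and in Zeppelin itself the same query would simply go into a HiveQL paragraph.

```python
# Minimal sketch: watch records arrive in demo.customer_demographics while the
# NiFi flow streams data into Hive. The host name is hypothetical; requires
# PyHive (pip install pyhive).
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()

# Re-run this while the flow is running to see the count grow
cursor.execute("SELECT COUNT(*) FROM demo.customer_demographics")
print("rows streamed so far:", cursor.fetchone()[0])
```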
03-05-2017
02:50 AM
1 Kudo
Scott has mentioned some good practices for memory sizing below.
02-24-2017
08:42 PM
1 Kudo
@Sunile Manjee Like KafkaConnect, the Kafka REST Proxy is also part of the Kafka included with HDP or HDF. Neither one is supported, as far as I am aware.
02-24-2017
05:45 PM
2 Kudos
@Faruk Berksoz Kafka: yes, for all scenarios. Kafka is not for storing; Kafka is for transport. Your data still needs to land somewhere, e.g. HBase via Phoenix as you mentioned, but it could also be HDFS or Hive.

1. Yes. Flume is fine for ingest, but you still need something else to post to Kafka (a Kafka producer), e.g. KafkaConnect.
2. No. Spark Streaming is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.
3. No. Same response as for #2.
4. No. Storm is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.
5. Could work, but not recommended.

The most common architectures are:

a) Flume -> KafkaConnect -> Kafka; consumer applications built with either Storm or Spark Streaming. Other options are available but less used.
b) NiFi -> Kafka -> Storm; consumer applications built with Storm; this is the Hortonworks DataFlow stack.
c) Others (Attunity, Syncsort) -> Kafka -> consumer applications built with Storm or Spark Streaming.

I would say go with b), with Storm or Spark Streaming, or both. I'm saying that not only because I am biased, but because each of these components scales amazingly, and because I have used Flume before and don't want to go back now that I have seen what I can achieve with NiFi. Additionally, HDF is evolving into an integrated platform for stream analytics with visual definition of flows and analytics that requires minimal programming; you will be amazed by the functionality provided out of the box and via visual definition, and that is only months away. Flume is less and less used; NiFi does what Flume does and much more, and with NiFi writing the producers to Kafka is trivial (a minimal producer sketch follows this post). Think beyond your current use case: what other use cases can this enable? One more thing: for landing data in HBase you can still use NiFi and its Phoenix connector to HBase, another scalable approach.
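To make the producer point concrete, here is a minimal standalone Kafka producer sketch using the kafka-python package; the broker address and topic name are hypothetical placeholders, and in practice the records would come from your ingest tool rather than a hard-coded list.

```python
# Minimal sketch of a Kafka producer for the ingest-and-post step discussed
# above. Broker address and topic name are hypothetical; requires kafka-python
# (pip install kafka-python).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1.example.com:6667")

# Post a few sample records; in a real flow these would be the ingested events
for record in ["row-1,foo", "row-2,bar", "row-3,baz"]:
    producer.send("ingest-topic", record.encode("utf-8"))

producer.flush()   # block until buffered records are delivered
producer.close()
```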
02-22-2017
03:22 AM
1 Kudo
@James Dinkel "In-Memory Cache per Daemon" is set to none by default. Did you allocate anything to it? This configuration is also available in hive-interactive-site.
02-22-2017
12:44 AM
4 Kudos
@Connor O'Neal Looking at the tags on your question and the mention of a "consumer client", you are asking about a tool to manage the offsets of a consumer client that commits its offsets to Kafka. Bad news: there is currently no tool available to manage offsets for a consumer client that commits offsets to Kafka; that capability is only available for consumers committing offsets to Zookeeper. To manage offsets for a group that commits to Kafka, you must use the APIs available in the client to commit offsets for the group (see the sketch below). Please don't forget to vote and accept the best answer.
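As an illustration of the client-API route, here is a minimal sketch using the kafka-python package that rewinds and re-commits a group's offset for one partition; the broker address, topic, and group id are hypothetical placeholders.

```python
# Minimal sketch: manage a consumer group's offsets through the client API by
# repositioning the consumer and committing the new position back to Kafka.
# Broker, topic, and group id are hypothetical; requires kafka-python.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="broker1.example.com:6667",
    group_id="demo-consumer-group",   # the group whose offsets we manage
    enable_auto_commit=False,         # commit offsets explicitly below
)

tp = TopicPartition("ingest-topic", 0)
consumer.assign([tp])
print("currently committed:", consumer.committed(tp))

# Rewind the group's position on this partition and commit it back to Kafka,
# so the group will reprocess the partition from the beginning.
consumer.seek(tp, 0)
consumer.commit()

print("now committed:", consumer.committed(tp))
consumer.close()
```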