Member since
06-20-2016
488
Posts
433
Kudos Received
118
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3101 | 08-25-2017 03:09 PM |
| | 1964 | 08-22-2017 06:52 PM |
| | 3388 | 08-09-2017 01:10 PM |
| | 8061 | 08-04-2017 02:34 PM |
| | 8113 | 08-01-2017 11:35 AM |
08-04-2017
02:34 PM
1 Kudo
Setting this property creates a tmp directory in BOTH the local filesystem and HDFS. It does so in HDFS because other properties use hadoop.tmp.dir as a base path for storing data in HDFS. For example, dfs.name.dir=${hadoop.tmp.dir}/dfs/name creates this path in HDFS. There is no way to keep this property from also creating a path locally. See these links for a good discussion:

https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir

https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
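To illustrate the relationship, here is a hedged sketch of the two properties in config form (the /data/hadoop/tmp value is a hypothetical example, not a recommendation):

```xml
<!-- core-site.xml: hadoop.tmp.dir is the base path many other properties build on -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value> <!-- hypothetical local path; also used as an HDFS base -->
</property>

<!-- hdfs-site.xml: dfs.name.dir expands ${hadoop.tmp.dir}, so the derived
     path is created under that base as well -->
<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
</property>
```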
08-01-2017
11:35 AM
Using your sed approach, this should replace all NULL with an empty string:

```
sed 's/[\t]/,/g; s/NULL//g' > myfile.csv
```

If there is a chance that NULL is a substring of a value, you will need the following, where ^ is the beginning of line, $ is the end of line, and , is your field delimiter:

```
sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;' > myfile.csv
```

Note that if your resultset is large, it is probably best to use Pig on HDFS rather than sed (to leverage Hadoop's parallel processing and save yourself a lot of time). Note also: to have the actual Hive table treat the empty string as NULL, use the following in the DDL:

```
TBLPROPERTIES('serialization.null.format'='');
```
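A runnable sketch of the second command on a small sample (the file path and sample rows are made up for illustration):

```shell
# Hypothetical tab-delimited resultset, as a Hive CLI export might look.
printf '1\tNULL\t3\nNULL\tfoo\tNULL\n' > /tmp/result.tsv

# Tabs become commas, then NULL fields at the start, middle, and end of
# each line become empty fields.
sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;' /tmp/result.tsv
```

This prints `1,,3` and `,foo,`. One caveat: because sed substitutions are non-overlapping, two adjacent NULL fields in the middle of a line (`,NULL,NULL,`) leave the second one untouched in a single pass; running the middle substitution a second time handles that case.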
07-28-2017
09:31 PM
Just want to be sure you are using port 10500. LLAP and non-LLAP each have their own HiveServer2 (ports 10500 and 10000, respectively).
07-28-2017
08:42 PM
@Darko Milovanovic You can update it once (in version control), but unfortunately it has to be re-deployed to each separate instance in your flows. This is because each component is instantiated separately with a different global id, as described in section 5. Do note that in HDF 3.0, after you do this, NiFi keeps versions of each deployed processor, so you can use one version of a processor in one flow and another version in a different flow (all versions are available to choose from). There is active work on making reusable components shared (instantiated once), but that has not been released.
07-28-2017
08:15 PM
3 Kudos
Both share an awesome drag-and-drop UI for processing data in motion; however, they differ fundamentally in purpose and underlying technology.

**Differences**

*Purpose*

NiFi is meant for data flow management, while Streaming Analytics Manager (SAM) is meant for advanced (complex) real-time analytics. In general, for NiFi think acquiring, transforming, and routing data to target destinations; for SAM think complex analytics on data as it flows across the wire. Here is a more detailed comparison between flow management (NiFi) and stream analytics (SAM):

| | Flow Management (NiFi) | Stream Analytics (SAM) |
|---|---|---|
| data velocity | batch, microbatch, or streaming (from diverse sources) | streaming (from diverse sources) |
| data size (per content) | small (KB) to large (GB) | small (KB, MB) per message in stream |
| data manipulation | rich: parse, filter, join, transform, enrich, reformat | minimal changes to data |
| data flow management | powerful: queue prioritization, back pressure, route/merge, persist to target | minimal: mostly route/merge and persist to target |
| real-time analytics | basic | powerful |

So NiFi is great at managing the movement of data from diverse sources (small sensors, FTP locations, relational databases, REST APIs in the cloud, and so on) to similar targets, while modifying and making decisions on the data in between. SAM is great at watching real-time streams of data and doing advanced analytics (dashboarding/visualizations, alerting, predictions, etc.) as the data flows by.

*Technology*

NiFi is built around processors and connections with repositories underneath. SAM is built on top of Storm and Kafka (and Druid).

**Shared**

What do they have in common?

- Both have easy UI development that hides complexity underneath.
- Both are components of the Hortonworks Data Flow (HDF) distribution.
- Both share Kafka (see below).
- Both are managed by Ambari (admin and monitoring) and Ranger (authorization and security).
- Both can use the same Schema Registry to work with the data structure of content.
Do they connect? A very common pattern is this: stream data using NiFi (possibly filtering, transforming, and enriching it) and pass it to a Kafka queue to make it durable (persistent until consumed). SAM pulls from the queue (subscribes to a topic) and does advanced analytics from there (dashboarding/visualizations, alerting, predictions, etc.). SAM then pushes to Hadoop (HBase or Hive) to persist the data for further historical analysis and exploration (data science, business intelligence, etc.). The tutorial mentioned by @Wynner is an excellent example of this pattern and of the separate strengths of NiFi and SAM.
07-28-2017
12:55 PM
One point: if you specify a delimiter that is not the true delimiter in the file ... no error will be thrown. Rather, the full record (including its true delimiters) will be treated as a single field. In this case, the true delimiters will just be characters in a string.
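An analogous, easy-to-try illustration with cut (not Hive itself; the sample line is made up): asking for fields by the wrong delimiter raises no error and silently returns the entire record as one field.

```shell
# The line is comma-delimited, but we ask cut for '|'-delimited fields.
# No error is raised: the whole record comes back as "field 1".
echo 'a,b,c' | cut -d'|' -f1
```

This prints `a,b,c` — the commas are just characters inside the single "field".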
07-28-2017
12:34 PM
2 Kudos
@Aditya Jadhav Small mistake: you need uppercase PigStorage('|').

```
lp = load '/employee.txt' using PigStorage('|') as (aa,bb,cc,dd,ee);
```

The error shows that Pig is looking for a Java function called pigStorage and cannot find it. In addition to Pig's native functions (to which PigStorage belongs), functions can be found in referenced libraries (e.g. third-party libraries, or User Defined Functions you build yourself).
07-14-2017
04:09 PM
Thank you @ccasano. It was due to this error-handling design and InvokeHTTP not being able to establish a connection.
07-14-2017
01:55 PM
One of the Hive Interactive Query (LLAP) configs is "Hold Containers to Reduce Latency", and it is set to false by default. What specifically does this config control? And since the goal of LLAP is fast response times (down to subsecond), why is the default not true, given that the config name suggests that turning it on would reduce latency?
Labels:
- Apache Hive
07-10-2017
06:04 PM
I recall now the port differences ... it is between HiveServer2 (10000) and HiveServer2 Interactive (10500), and nothing to do with JDBC.