Member since
06-20-2016
488
Posts
433
Kudos Received
118
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3101 | 08-25-2017 03:09 PM |
| | 1964 | 08-22-2017 06:52 PM |
| | 3388 | 08-09-2017 01:10 PM |
| | 8061 | 08-04-2017 02:34 PM |
| | 8113 | 08-01-2017 11:35 AM |
08-04-2017
02:34 PM
1 Kudo
Setting this property creates a tmp directory in BOTH the local filesystem and HDFS. It does so in HDFS because other properties use hadoop.tmp.dir as a base path for storing data in HDFS. For example, dfs.name.dir=${hadoop.tmp.dir}/dfs/name creates this path in HDFS. There is no way to keep this property from also creating a path locally. See these links for a good discussion:

https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir

https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
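To illustrate the relationship, here is a hedged sketch of the two properties in config form (the /data/hadoop/tmp value is a hypothetical example, not a recommendation):

```xml
<!-- core-site.xml: hadoop.tmp.dir is the base path many other properties build on -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value> <!-- hypothetical local path; also used as an HDFS base -->
</property>

<!-- hdfs-site.xml: dfs.name.dir expands ${hadoop.tmp.dir}, so the derived
     path is created under that base as well -->
<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
</property>
```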
08-01-2017
11:35 AM
Using your sed approach, this should replace all NULL with an empty string:

```
sed 's/[\t]/,/g; s/NULL//g' > myfile.csv
```

If there is a chance that NULL is a substring of a value, you will need the following, where ^ is the beginning of line, $ is the end of line, and , is your field delimiter:

```
sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;' > myfile.csv
```

Note that if your resultset is large, it is probably best to use Pig on HDFS rather than sed (to leverage Hadoop's parallel processing and save yourself a lot of time). Note also: to have the actual Hive table treat the empty string as NULL, use the following in the DDL:

```
TBLPROPERTIES('serialization.null.format'='');
```
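A runnable sketch of the second command on a small sample (the file path and sample rows are made up for illustration):

```shell
# Hypothetical tab-delimited resultset, as a Hive CLI export might look.
printf '1\tNULL\t3\nNULL\tfoo\tNULL\n' > /tmp/result.tsv

# Tabs become commas, then NULL fields at the start, middle, and end of
# each line become empty fields.
sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;' /tmp/result.tsv
```

This prints `1,,3` and `,foo,`. One caveat: because sed substitutions are non-overlapping, two adjacent NULL fields in the middle of a line (`,NULL,NULL,`) leave the second one untouched in a single pass; running the middle substitution a second time handles that case.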
07-28-2017
09:31 PM
Just want to be sure you are using port 10500. LLAP and non-LLAP each have their own HiveServer2 (ports 10500 and 10000, respectively).
07-28-2017
08:42 PM
@Darko Milovanovic You can update it once (in version control), but unfortunately it has to be re-deployed to each separate instance in your flows. This is because each component is instantiated separately with a different global id, as described in section 5. Do note that in HDF 3.0, after you do this, NiFi keeps versions of each deployed processor, so you can use one version of a processor in one flow and another version in a different flow (all versions are available to choose from). There is active work on making reusable components shared (instantiated once), but that has not been released.
07-28-2017
08:15 PM
3 Kudos
Both share an awesome drag-and-drop UI for processing data in motion; however, they differ fundamentally in purpose and underlying technology.

**Differences**

*Purpose*

NiFi is meant for data flow management, while Streaming Analytics Manager (SAM) is meant for advanced (complex) real-time analytics. In general, for NiFi think acquiring, transforming, and routing data to target destinations; for SAM think complex analytics on data as it flows across the wire. Here is a more detailed comparison between flow management (NiFi) and stream analytics (SAM):

| | Flow Management (NiFi) | Stream Analytics (SAM) |
|---|---|---|
| data velocity | batch, microbatch, or streaming (from diverse sources) | streaming (from diverse sources) |
| data size (per content) | small (KB) to large (GB) | small (KB, MB) per message in stream |
| data manipulation | rich: parse, filter, join, transform, enrich, reformat | minimal changes to data |
| data flow management | powerful: queue prioritization, back pressure, route/merge, persist to target | minimal: mostly route/merge and persist to target |
| real-time analytics | basic | powerful |

So NiFi is great at managing the movement of data from diverse sources (small sensors, FTP locations, relational databases, REST APIs in the cloud, and so on) to similar targets, while modifying and making decisions on the data in between. SAM is great at watching real-time streams of data and doing advanced analytics (dashboarding/visualizations, alerting, predictions, etc.) as the data flows by.

*Technology*

NiFi is built around processors and connections with repositories underneath. SAM is built on top of Storm and Kafka (and Druid).

**Shared**

What do they have in common?

- Both have easy UI development that hides complexity underneath.
- Both are components of the Hortonworks Data Flow (HDF) distribution.
- Both share Kafka (see below).
- Both are managed by Ambari (admin and monitoring) and Ranger (authorization and security).
- Both can use the same Schema Registry to work with the data structure of content.
Do they connect? A very common pattern is this: stream data using NiFi (possibly filtering, transforming, and enriching it) and pass it to a Kafka queue to make it durable (persistent until consumed). SAM pulls from the queue (subscribes to a topic) and does advanced analytics from there (dashboarding/visualizations, alerting, predictions, etc.). SAM then pushes to Hadoop (HBase or Hive) to persist the data for further historical analysis and exploration (data science, business intelligence, etc.). The tutorial mentioned by @Wynner is an excellent example of this pattern and of the separate strengths of NiFi and SAM.
07-28-2017
12:55 PM
One point: if you specify a delimiter that is not the true delimiter in the file ... no error will be thrown. Rather, the full record (including its true delimiters) will be treated as a single field. In this case, the true delimiters will just be characters in a string.
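An analogous, easy-to-try illustration with cut (not Hive itself; the sample line is made up): asking for fields by the wrong delimiter raises no error and silently returns the entire record as one field.

```shell
# The line is comma-delimited, but we ask cut for '|'-delimited fields.
# No error is raised: the whole record comes back as "field 1".
echo 'a,b,c' | cut -d'|' -f1
```

This prints `a,b,c` — the commas are just characters inside the single "field".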
07-28-2017
12:34 PM
2 Kudos
@Aditya Jadhav Small mistake: you need uppercase PigStorage('|').

```
lp = load '/employee.txt' using PigStorage('|') as (aa,bb,cc,dd,ee);
```

The error shows that Pig is looking for a Java function called pigStorage and cannot find it. In addition to Pig's native functions (to which PigStorage belongs), functions can be found in referenced libraries (e.g. third-party libraries, or User Defined Functions you build yourself).
07-14-2017
04:09 PM
Thank you @ccasano. It was due to this error-handling design and InvokeHTTP not being able to establish a connection.
07-14-2017
01:55 PM
One of the Hive Interactive Query (LLAP) configs is "Hold Containers to Reduce Latency", and it is set to false by default. What specifically does this config control? And since the goal of LLAP is fast response times (down to subsecond), why is the default not true, given that the config name suggests that turning it on would reduce latency?
Labels:
- Apache Hive
07-10-2017
06:04 PM
I recall now the port differences ... it is between HiveServer2 (10000) and HiveServer2 Interactive (10500), and nothing to do with JDBC.