Member since 07-11-2016
25 Posts
1 Kudos Received
0 Solutions
04-20-2017
06:53 AM
@Wynner Thank you for the answer. I need help with one more thing: do you have any documents or references on best practices for NiFi data flow development?
04-19-2017
01:39 PM
@Wynner
We want to keep the file where it is. I am using NiFi in a cluster.
04-19-2017
10:36 AM
When getting files from an FTP server, we can use ListSFTP followed by FetchSFTP instead of the GetSFTP processor. What is the advantage of ListSFTP + FetchSFTP over GetSFTP?
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
03-09-2017
10:30 AM
@Michael Young
The _all field is not disabled, and we are getting the response below for the following query.
Query:
GET /movies/_search?pretty
{
"size": 10,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*drama*"
}
}
}
Query Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoYRhQexAEXKBamIeYy",
"_score": 1
},
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoYRuxxxAEXKBamIeY2",
"_score": 1
},
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoYRuxxxAEXKBamIeY4",
"_score": 1
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoYRhQexAEXKBamIeYw",
"_score": 1
}
]
}
}
The high-level intent is to identify the fields and values in the index that match the search, i.e. the presence of a keyword anywhere in the document, which is why the _all field is used.
03-07-2017
10:40 AM
@Michael Young We are using the default analyzer and tokenizer. The _settings endpoint for the index does not show which analyzer is being used. We are using the default mappings for fields and have not added any new templates. Please find the mappings used for the index movies below:
{
"movies": {
"mappings": {
"movie_shows": {
"properties": {
"date": {
"type": "date" },
"genres": {
"type": "text", "fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256 } } },
"id": {
"type": "long" }, "theatre": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256 } } },
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256 } } } } },
"movie_intrnl": {
"properties": {
"director": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256 } } },
"genres": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256 } } },
"id": {
"type": "long" },
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256 } } },
"year": {
"type": "long" } } } } } }
03-05-2017
11:03 PM
We are using ElasticSearch 5.0.0. Please let us know if there is a regex or any other way to perform a case-insensitive search. The data in the movies index is in the attachment. Below is the aggregation query used to find fields matching the search string "*drama*" in the movies index:
GET /movies/_search?pretty
{
"size": 0,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*drama*"
}
},
"aggs": {
"distinct_tables_1": {
"terms": {
"field": "_type"
},
"aggs": {
"distinct_col_1": {
"terms": {
"field": "genres.keyword",
"include": ".*drama.*"
}
}
}
},
"distinct_tables_2": {
"terms": {
"field": "_type"
},
"aggs": {
"distinct_col_2": {
"terms": {
"field": "director.keyword",
"include": ".*drama.*"
}
}
}
},
"distinct_tables_3": {
"terms": {
"field": "_type"
},
"aggs": {
"distinct_col_3": {
"terms": {
"field": "theatre.keyword",
"include": ".*drama.*"
}
}
}
}
}
}
We get the following response:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0 },
"hits": {
"total": 4,
"max_score": 0,
"hits": [] },
"aggregations": {
"distinct_tables_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "movie_intrnl",
"doc_count": 2,
"distinct_col_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } }, {
"key": "movie_shows",
"doc_count": 2,
"distinct_col_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } } ] },
"distinct_tables_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "movie_intrnl",
"doc_count": 2,
"distinct_col_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } }, {
"key": "movie_shows",
"doc_count": 2,
"distinct_col_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } } ] },
"distinct_tables_3": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "movie_intrnl",
"doc_count": 2,
"distinct_col_3": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } }, {
"key": "movie_shows",
"doc_count": 2,
"distinct_col_3": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } } ] } } } It can be seen from the response that there are no matching
column values in the response, even though there are documents matching the search string "drama". The regex in the aggregation's include parameter appears to be case sensitive, so no values are returned.
We used the alternate query below to find words matching Drama. However, it relies on the partial word .*rama.* instead of Drama, and a true case-insensitive search would be better.
GET /movies/_search?pretty
{
"size": 0,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*drama*"
}
},
"aggs": {
"distinct_tables_1": {
"terms": {
"field": "_type"
},
"aggs": {
"distinct_col_1": {
"terms": {
"field": "genres.keyword",
"include": ".*rama.*"
}
}
}
},
"distinct_tables_2": {
"terms": {
"field": "_type"
},
"aggs": {
"distinct_col_2": {
"terms": {
"field": "director.keyword",
"include": ".*rama.*"
}
}
}
},
"distinct_tables_3": {
"terms": {
"field": "_type"
},
"aggs": {
"distinct_col_3": {
"terms": {
"field": "theatre.keyword",
"include": ".*rama.*"
}
}
}
}
}
}
Response for the query given above:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0 },
"hits": {
"total": 4,
"max_score": 0,
"hits": [] },
"aggregations": {
"distinct_tables_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ { "key": "movie_intrnl",
"doc_count": 2,
"distinct_col_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "BiographyDrama",
"doc_count": 1 }, {
"key": "Drama",
"doc_count": 1 } ] } }, {
"key": "movie_shows",
"doc_count": 2,
"distinct_col_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ { "key":
"BiographyDrama",
"doc_count": 1 }, {
"key": "Drama",
"doc_count": 1 } ] } } ] },
"distinct_tables_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "movie_intrnl",
"doc_count": 2,
"distinct_col_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "Drama1",
"doc_count": 1 }, {
"key": "Drama4",
"doc_count": 1 } ] } }, {
"key": "movie_shows",
"doc_count": 2, "distinct_col_2":
{
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [] } } ] },
"distinct_tables_3": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ {
"key": "movie_intrnl",
"doc_count": 2,
"distinct_col_3": {
"doc_count_error_upper_bound": 0, "sum_other_doc_count":
0,
"buckets": [] } }, {
"key": "movie_shows",
"doc_count": 2,
"distinct_col_3": {
"doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
"buckets": [ {
"key": "Drama4",
"doc_count": 1 } ] } } ] } } }
- Tags:
- ElasticSearch
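One workaround for the case-sensitive `include` regex described above can be sketched as follows. This is only a sketch under assumptions: it assumes the regex syntax accepts character classes such as `[dD]` and backslash-escaped punctuation, and the helper names `ci_regex` and `build_body` are hypothetical, not part of any ElasticSearch client. It expands each letter of the keyword into a two-character class and generates one `_type` aggregation per candidate field, so the query body does not have to be written by hand for every column.

```python
import json


def ci_regex(keyword):
    """Build a regex that matches `keyword` anywhere, ignoring letter case.

    Each letter becomes a two-character class such as [dD]; digits are kept
    as-is and every other character is backslash-escaped (assumed to be
    accepted by the regex engine behind the terms aggregation).
    """
    parts = []
    for ch in keyword:
        if ch.isalpha() and ch.lower() != ch.upper():
            parts.append("[" + ch.lower() + ch.upper() + "]")
        elif ch.isalnum():
            parts.append(ch)
        else:
            parts.append("\\" + ch)
    return ".*" + "".join(parts) + ".*"


def build_body(keyword, fields):
    """Build a search body with one _type aggregation per candidate field."""
    pattern = ci_regex(keyword)
    aggs = {}
    for i, field in enumerate(fields, start=1):
        aggs["distinct_tables_%d" % i] = {
            "terms": {"field": "_type"},
            "aggs": {
                "distinct_col_%d" % i: {
                    "terms": {"field": field, "include": pattern}
                }
            },
        }
    return {
        "size": 0,
        "_source": False,
        "query": {
            "query_string": {
                "analyze_wildcard": True,
                "query": "*" + keyword + "*",
            }
        },
        "aggs": aggs,
    }


if __name__ == "__main__":
    # Field names taken from the movies index mappings in this thread.
    body = build_body(
        "drama", ["genres.keyword", "director.keyword", "theatre.keyword"]
    )
    print(json.dumps(body, indent=2))
```

The printed JSON could be pasted after `GET /movies/_search?pretty`. Whether the expanded pattern performs acceptably on large keyword fields is not something this sketch addresses.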
02-06-2017
08:13 PM
Hi @Michael Young Please find updated question. It would be great if you assist on this. Pl
02-06-2017
08:13 PM
We are new to ElasticSearch and Kibana. We are using ElasticSearch 5.0.0 to identify relationships between Hive table fields in a database by searching across all columns for specific keywords. We are open to using queries, ElasticSearch APIs, or any other solution that meets the requirement.
We have uploaded the details into a single index in ElasticSearch by installing the elasticsearch-hadoop-5.1.1 library and creating external Hive tables in ElasticSearch. We are using the _type column to store Hive table names.
Below is the initial query, which uses the highlight feature to identify matching Hive table and field names along with their values.
GET /movies/_search?pretty
{
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match": false,
"fragment_size": 2147483647
}
}
The issue with this approach is that only individual documents are returned; distinct table names and field names are not.
We tried the following terms aggregation query to get the distinct table names (_type) along with a static field, say field_1.
GET /movies/_search?pretty
{
"size": 0,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"aggs": {
"distinct_tables": {
"terms": {
"field": "_type"
},
"aggs" : {
"unique_set_2": {
"terms": {
"field": "title.keyword"
}
}
}
}
}
}
We then thought of combining the two queries: take the dynamically matched fields from highlight and run a distinct on them to get the distinct table names and the distinct column names for each table name. However, we have not been able to achieve this through queries or other means.
Please help us address this requirement, and suggest any simpler or alternative approaches, including other open source tools. Thanks.
We have used the _type field to store the Hive table names for the example index movies:
movies/movie_intrnl
movies/movie_shows
Data used for the Hive tables stored in the ElasticSearch index movies:
movie_intrnl:
Title1,1,Action1,2003,Action
Title2,2,dire2,2007,Crime
Title3,3,dire3,2004,CrimeThriller
Title4,4,Drama1,2003,Drama
Title5,5,Action2,2005,Action
Title6,6,Drama4,2007,BiographyDrama
movie_shows:
Action1,Title1,1,2017-04-03 00:00:00,Action
Theatre2,Title2,2,2016-05-07 00:00:00,Crime
Theatre3,Title2,3,2015-06-04 00:00:00,CrimeThriller
Drama4,Title4,4,2014-08-03 00:00:00,Drama
Action5,Title1,5,2019-09-05 00:00:00,Action
Theatre6,Title6,6,2017-10-07 00:00:00,BiographyDrama
ElasticSearch query to get the distinct table names (_type field in ElasticSearch):
GET /movies/_search?pretty
{
"size": 0,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"aggs": {
"distinct_tables": {
"terms": {
"field": "_type"
}
}
}
}
We got the response given below; the distinct tables appear under aggregations -> buckets -> key:
movie_intrnl
movie_shows
Response for the ES query to get distinct table names:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"distinct_tables": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "movie_intrnl",
"doc_count": 2
},
{
"key": "movie_shows",
"doc_count": 2
}
]
}
}
}
We are using the highlight feature to identify matching field names for the search query.
ES query to get matching table names and column names for a search pattern:
GET /movies/_search?pretty
{
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match": false,
"fragment_size": 2147483647
}
}
We got the response below for the matching table names and column names.
Response - 1st matching document:
{
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoEcEMrxAEXKBamIeYm",
"_score": 1,
"highlight": {
"genres": [
"<em>BiographyDrama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoEbkPFxAEXKBamIeYe",
"_score": 1,
"highlight": {
"director": [
"<em>Drama1</em>"
],
"genres": [
"<em>Drama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoEbkPFxAEXKBamIeYg",
"_score": 1,
"highlight": {
"director": [
"<em>Drama4</em>"
],
"genres": [
"<em>BiographyDrama</em>"
]
}
}
]
}
}
But this approach does not give the distinct table names and column names across all the matched documents. The expected response as per the requirement is given below.
"_type": "movie_shows"
"theatre"
"genres"
"_type": "movie_intrnl"
"director"
"genres"
Response for the search query to get matching column names and table names:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoEcEMrxAEXKBamIeYk",
"_score": 1,
"highlight": {
"theatre": [
"<em>Drama4</em>"
],
"genres": [
"<em>Drama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoEcEMrxAEXKBamIeYm",
"_score": 1,
"highlight": {
"genres": [
"<em>BiographyDrama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoEbkPFxAEXKBamIeYe",
"_score": 1,
"highlight": {
"director": [
"<em>Drama1</em>"
],
"genres": [
"<em>Drama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoEbkPFxAEXKBamIeYg",
"_score": 1,
"highlight": {
"director": [
"<em>Drama4</em>"
],
"genres": [
"<em>BiographyDrama</em>"
]
}
}
]
}
}
In the query given below, we tried an aggregation on the _type column to get distinct table names, with a sub-aggregation on a sample static field "genres". However, since the columns in the search result are dynamic, we are looking for a mechanism to apply a sub-aggregation on top of the highlight field results to get the distinct column names within each identified distinct table name.
ES query tried to get distinct table names and column names:
GET /movies/_search?pretty
{
"size": 0,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"aggs": {
"distinct_tables": {
"terms": {
"field": "_type"
},
"aggs" : {
"unique_set_2": {
"terms": {
"field": "genres.keyword"
}
}
}
}
}
}
Response to the ES query tried to get distinct table names and column names:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"distinct_tables": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "movie_intrnl",
"doc_count": 2,
"unique_set_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "BiographyDrama",
"doc_count": 1
},
{
"key": "Drama",
"doc_count": 1
}
]
}
},
{
"key": "movie_shows",
"doc_count": 2,
"unique_set_2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "BiographyDrama",
"doc_count": 1
},
{
"key": "Drama",
"doc_count": 1
}
]
}
}
]
}
}
}
@Michael Young
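Until a pure-query solution turns up, the expected output described above (distinct column names per table name) can be derived on the client side from the highlight response. The following is a minimal sketch, not an ElasticSearch feature: the function name `matched_columns_by_table` is hypothetical, and the sample response is trimmed to the `_type` and `highlight` parts of the four hits shown earlier in this post.

```python
from collections import defaultdict


def matched_columns_by_table(response):
    """Group the highlighted field names of all hits by their _type."""
    tables = defaultdict(set)
    for hit in response.get("hits", {}).get("hits", []):
        # The keys of "highlight" are the fields in which the keyword matched.
        tables[hit["_type"]].update(hit.get("highlight", {}).keys())
    return {table: sorted(cols) for table, cols in tables.items()}


# Trimmed copy of the highlight response for the "*Drama*" query above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_type": "movie_shows",
                "highlight": {
                    "theatre": ["<em>Drama4</em>"],
                    "genres": ["<em>Drama</em>"],
                },
            },
            {
                "_type": "movie_shows",
                "highlight": {"genres": ["<em>BiographyDrama</em>"]},
            },
            {
                "_type": "movie_intrnl",
                "highlight": {
                    "director": ["<em>Drama1</em>"],
                    "genres": ["<em>Drama</em>"],
                },
            },
            {
                "_type": "movie_intrnl",
                "highlight": {
                    "director": ["<em>Drama4</em>"],
                    "genres": ["<em>BiographyDrama</em>"],
                },
            },
        ]
    }
}

if __name__ == "__main__":
    for table, columns in matched_columns_by_table(sample_response).items():
        print(table, columns)
```

For a large result set the search would need to page through all hits (the default page size only returns the first few), which this sketch does not attempt.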
Labels:
- Apache Hadoop
11-22-2016
10:32 AM
Hello all, my requirement is to store multiple images, along with an identifier column, in a Hive table. Is there any way to store multiple images in Hive tables?
Labels:
- Apache Hive
09-22-2016
01:03 PM
1 Kudo
Hello Team, I have a question about Apache NiFi and Kafka. Both are messaging systems. Can someone tell me whether we can replace NiFi with Kafka or vice versa, and what the advantages of NiFi over Kafka are?
Labels:
- Apache Kafka
- Apache NiFi
08-16-2016
08:25 AM
@mclark Thanks for the response; it is appreciated. Do I also need to configure something on the back end, i.e. in nifi.properties or any other file on the cluster or node? I am facing the attached error.
08-10-2016
02:11 PM
Thanks @mclark.
I am attaching a template of a flow that extracts earthquake data from a US government site, but I am getting duplicate data as output.
eqdataus.xml
08-05-2016
02:04 PM
Thanks for responding.
I tried this, but I am getting duplicate data even within a single node. Is there an expression I can use to remove the duplicate data?
Can you check the data flow that I attached?
08-05-2016
01:15 PM
@mclark Can you please elaborate?
08-05-2016
08:11 AM
Hello Team, I am trying to create a data flow for live streaming of Twitter data using NiFi. When I run my flow, it produces duplicates (the same tweet two or more times). I have attached screenshots and a template of the flow. Can you please suggest an expression that I can put in the flow to remove the duplicates? I don't want to use the DetectDuplicate processor because the time taken to build the cache affects performance. (In the flow I am formatting the tweets, using ReplaceText to do the formatting.) @Matt Burgess Thank you.
1.jpg 2.jpg
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
08-01-2016
11:29 AM
Thanks, Pierre Villard.
My NiFi is installed in a cluster, so what settings do I need to specify in "DistributedMapCacheClientService"? I also read somewhere that we need to set "nifi.controller.service.configuration.file" in the file "nifi.properties". Can you shed some light on this as well?
08-01-2016
10:57 AM
I want to use the "DetectDuplicate" processor to remove duplicate JSON content (duplicate tweets) and merge the result into a single file. Can someone help me with this? @Jeremy Dyer, @Matt Burgess Thanks in advance.
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
07-12-2016
06:35 AM
@Abdelkrim Hadjidj @Lubin Lemarchand
Can you please let me know how, in Unix, I can write values to a file for the GetFile processor? The symbol "∾" changes into a junk value when I write it to a Unix flat file.
07-11-2016
07:43 AM
Hi @Manikandan Durairaj, can you share the template with me if you succeed with this? Regards, Yogesh