Member since: 07-19-2018
Posts: 613
Kudos Received: 101
Solutions: 117
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 4905 | 01-11-2021 05:54 AM |
| 3346 | 01-11-2021 05:52 AM |
| 8651 | 01-08-2021 05:23 AM |
| 8176 | 01-04-2021 04:08 AM |
| 36076 | 12-18-2020 05:42 AM |
08-15-2020
04:35 AM
@avi166 I think by the time you get to RouteOnAttribute you should have already read the file, but there isn't really a right or wrong answer here. One of the things I like most about NiFi is that there are many different ways to achieve the same end result. To answer your next question: you may be able to use the same flow for different CSV file structures, and you should if you can, but don't be afraid to split the flow again as I outlined above. The different structures may need different schemas and different record readers, but the same record writer; a sample schema is sketched at the end of this post. Rejoin the branches when the files are ready to converge into the same processor or branch of functionality. I also tried to point out that, at first, you may have to route some CSVs to a separate branch for their file structure. Then, once all CSV branches are finished and you know the differences between them, you should be able to build a final, more dynamic branch that replaces one or more of the earlier ones. These are tuning and optimization steps that you really can't judge until you evaluate the final flow branch against the previous versions.
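For illustration, a record reader's schema for one CSV variant might look like the Avro schema below. The field names here are hypothetical; the point is simply one schema per file structure:

```
{
  "type": "record",
  "name": "csv_variant_a",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "amount", "type": ["null", "double"], "default": null }
  ]
}
```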
08-14-2020
06:15 AM
@avi166 This is a common use case for NiFi: create a data flow that is a single entry point for data files of different expected types, up to and including all types. For example, you can create an API with HandleHttpRequest/HandleHttpResponse to accept a POST of a file. Another example is using GetFile/ListFile/etc. at the top of a flow to read a directory. A newer common example would be to get the files from Amazon S3.

After the top of the flow, where files arrive inbound, it is common to create a single flow with a single branch for a specific use case. This is how you have created it for CSV. To improve your flow, you would add RouteOnAttribute to check that the file name ends in CSV. This creates a "csv" route, which you would then direct downstream to the flow you created (see the sketch at the end of this post). Next, you similarly split the flow for the other types (TXT, AVRO, etc.), plus one route for unmatched types. Once the split is made, you can create separate branches (data flows) with the additional processors needed to prepare each type for insertion. Sometimes you can create one branch that handles multiple types, too. Some branches may take 3-5+ processors to prepare the data for DB2, while others need just 1 or 2.

When all the different branches are ready, route them all back to a single processor or process group that handles the insert into DB2. So you have a flow with a single entry point that splits into many branches and then rejoins at the bottom. While working and operating in this manner, you may build separate flow branches and realize later that you could combine them by making a new branch that is a little more dynamic. You should always be looking to improve your flows over time in this manner.

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
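For the RouteOnAttribute split described above, a sketch of the dynamic properties might look like this (the property names are just examples); each property becomes its own outbound relationship, and anything matching none of them leaves via the unmatched relationship:

```
csv  : ${filename:toLower():endsWith('.csv')}
txt  : ${filename:toLower():endsWith('.txt')}
avro : ${filename:toLower():endsWith('.avro')}
```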
08-14-2020
05:55 AM
1 Kudo
Please accept the solution as the answer. Doing this helps mark the topic as complete.
08-13-2020
08:37 AM
@elfz I believe the functions you are working with only return a Unix timestamp to the second. The .SSS pattern, which is the correct indicator for a 3-digit millisecond fraction, is being ignored entirely. Hive timestamps themselves support up to 9 fractional digits (nanosecond precision). Here is a great post on this topic: https://community.cloudera.com/t5/Support-Questions/HIVE-datatype-Timestamp-with-miliseconds/td-p/150944
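A quick illustration of the truncation (the timestamp literal here is only an example):

```sql
-- unix_timestamp() resolves to whole seconds, so the .SSS fraction is dropped:
SELECT unix_timestamp('2020-08-13 08:37:00.123', 'yyyy-MM-dd HH:mm:ss.SSS');

-- The TIMESTAMP type itself keeps the fraction (up to 9 digits),
-- so a direct cast preserves the milliseconds:
SELECT CAST('2020-08-13 08:37:00.123' AS timestamp);
```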
08-13-2020
08:31 AM
@yangcm Below is the terminal output I used to test your create statements. You may need some additional HQL for Parquet formatting, but this should get you through to working with the tables:

[root@c7301 ~]# sudo su - hdfs
[hdfs@c7301 ~]$ hdfs dfs -mkdir -p /user/a/client
[hdfs@c7301 ~]$ hdfs dfs -mkdir -p /user/a/mysql
[hdfs@c7301 ~]$ hdfs dfs -chown -R hive:hive /user/a/

Then get to hive:

sudo su - hive
hive

And run your creation statements. I ran them as single statements in my test, but you can run them together as an HQL file too. The key here is that the locations must be owned by hive or by the user executing the hive command.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://c7301.ambari.apache.org:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
20/08/13 15:23:16 [main]: INFO jdbc.HiveConnection: Connected to c7301.ambari.apache.org:10000
Connected to: Apache Hive (version 3.1.0.3.1.0.0-78)
Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
0: jdbc:hive2://c7301.ambari.apache.org:2181/> CREATE EXTERNAL TABLE default.a1 (
. . . . . . . . . . . . . . . . . . . . . . .> client_id smallint,
. . . . . . . . . . . . . . . . . . . . . . .> client_name varchar(255),
. . . . . . . . . . . . . . . . . . . . . . .> client_address varchar(1234)
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> PARTITIONED BY (month string)
. . . . . . . . . . . . . . . . . . . . . . .> stored as parquet
. . . . . . . . . . . . . . . . . . . . . . .> LOCATION '/user/a/client';
INFO : Compiling command(queryId=hive_20200813152342_b1597bb7-670b-4f18-9dbb-a9ee68cf1323): CREATE EXTERNAL TABLE default.a1 ( client_id smallint, client_name varchar(255), client_address varchar(1234) ) PARTITIONED BY (month string) stored as parquet LOCATION '/user/a/client'
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20200813152342_b1597bb7-670b-4f18-9dbb-a9ee68cf1323); Time taken: 1.44 seconds
INFO : Executing command(queryId=hive_20200813152342_b1597bb7-670b-4f18-9dbb-a9ee68cf1323): CREATE EXTERNAL TABLE default.a1 ( client_id smallint, client_name varchar(255), client_address varchar(1234) ) PARTITIONED BY (month string) stored as parquet LOCATION '/user/a/client'
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20200813152342_b1597bb7-670b-4f18-9dbb-a9ee68cf1323); Time taken: 0.584 seconds
INFO : OK
No rows affected (2.923 seconds)
0: jdbc:hive2://c7301.ambari.apache.org:2181/> CREATE EXTERNAL TABLE default.a2 (
. . . . . . . . . . . . . . . . . . . . . . .> mysql_12 double,
. . . . . . . . . . . . . . . . . . . . . . .> mysql_13 float
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> PARTITIONED BY (month string)
. . . . . . . . . . . . . . . . . . . . . . .> stored as parquet
. . . . . . . . . . . . . . . . . . . . . . .> LOCATION '/user/a/mysql';
INFO : Compiling command(queryId=hive_20200813152356_48370efa-c51c-466f-8b2d-cf8ebb30d55b): CREATE EXTERNAL TABLE default.a2 ( mysql_12 double, mysql_13 float ) PARTITIONED BY (month string) stored as parquet LOCATION '/user/a/mysql'
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20200813152356_48370efa-c51c-466f-8b2d-cf8ebb30d55b); Time taken: 0.047 seconds
INFO : Executing command(queryId=hive_20200813152356_48370efa-c51c-466f-8b2d-cf8ebb30d55b): CREATE EXTERNAL TABLE default.a2 ( mysql_12 double, mysql_13 float ) PARTITIONED BY (month string) stored as parquet LOCATION '/user/a/mysql'
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20200813152356_48370efa-c51c-466f-8b2d-cf8ebb30d55b); Time taken: 0.084 seconds
INFO : OK
No rows affected (0.42 seconds)
0: jdbc:hive2://c7301.ambari.apache.org:2181/> describe default.a1
. . . . . . . . . . . . . . . . . . . . . . .> ;
INFO : Compiling command(queryId=hive_20200813152409_da398c2e-edc1-4109-bb07-5577133b02aa): describe default.a1
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20200813152409_da398c2e-edc1-4109-bb07-5577133b02aa); Time taken: 0.372 seconds
INFO : Executing command(queryId=hive_20200813152409_da398c2e-edc1-4109-bb07-5577133b02aa): describe default.a1
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20200813152409_da398c2e-edc1-4109-bb07-5577133b02aa); Time taken: 0.295 seconds
INFO : OK
+--------------------------+----------------+----------+
|         col_name         |   data_type    | comment  |
+--------------------------+----------------+----------+
| client_id                | smallint       |          |
| client_name              | varchar(255)   |          |
| client_address           | varchar(1234)  |          |
| month                    | string         |          |
|                          | NULL           | NULL     |
| # Partition Information  | NULL           | NULL     |
| # col_name               | data_type      | comment  |
| month                    | string         |          |
+--------------------------+----------------+----------+
8 rows selected (0.891 seconds)
0: jdbc:hive2://c7301.ambari.apache.org:2181/> describe default.a2;
INFO : Compiling command(queryId=hive_20200813152426_a2916edf-e217-4b72-8398-6fd9d91dde60): describe default.a2
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20200813152426_a2916edf-e217-4b72-8398-6fd9d91dde60); Time taken: 0.293 seconds
INFO : Executing command(queryId=hive_20200813152426_a2916edf-e217-4b72-8398-6fd9d91dde60): describe default.a2
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20200813152426_a2916edf-e217-4b72-8398-6fd9d91dde60); Time taken: 0.179 seconds
INFO : OK
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| mysql_12                 | double     |          |
| mysql_13                 | float      |          |
| month                    | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| month                    | string     |          |
+--------------------------+------------+----------+
7 rows selected (0.778 seconds)
0: jdbc:hive2://c7301.ambari.apache.org:2181/>

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
08-13-2020
08:11 AM
@JonnyL I would highly recommend that you back up and create a small 3-node NiFi cluster to test this feature. Putting 2 NiFi instances on a single node does not cover the test cases you really want to be experimenting with. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
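For reference, a minimal sketch of the per-node clustering settings in nifi.properties for a 3-node cluster; the hostnames and ports below are placeholders, and each node gets its own address:

```
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi1.example.com
nifi.cluster.node.protocol.port=11443
nifi.cluster.flow.election.max.candidates=3
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```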
08-13-2020
08:07 AM
@gakroman What platform are you using Hue with (CDx/HDx)? What do the metastore and user permissions look like? Without more information (you'll need to look under the hood at the Hue logs), I think you need to look into the permissions your user has on the metastore. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
08-13-2020
07:52 AM
@scotth1 You should be able to use the QueryRecord processor with some advanced SQL to extract what you need from any values in the underlying data (a rough example follows below). Here is a great article about QueryRecord: https://community.cloudera.com/t5/Community-Articles/Running-SQL-on-FlowFiles-using-QueryRecord-Processor-Apache/ta-p/246671 If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
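For a rough idea: each dynamic property on QueryRecord holds plain SQL run against the incoming records, which are exposed as the table FLOWFILE. The column names below are made up for illustration:

```sql
-- hypothetical dynamic property "errors_only" on QueryRecord;
-- rows matching this query are routed to the "errors_only" relationship
SELECT id, status, amount
FROM FLOWFILE
WHERE status <> 'OK' AND amount > 100
```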
08-13-2020
07:44 AM
@devops There is some information around the internet on workarounds that make Ambari treat Python 3 as if it were Python 2, but the short answer is above. Please accept the answer. The workaround just creates more problems than it solves. Python 3 support was never finished in Ambari, and it doesn't look like it ever will be. In the same high-level conversation, the supported Java version is now similarly antiquated.
08-13-2020
05:20 AM
@das_dineshk In Hive, a varchar is a string type with a configurable maximum length. It is still a string. From the Hive documentation:

Varchar
Varchar types are created with a length specifier (between 1 and 65535), which defines the maximum number of characters allowed in the character string. If a string value being converted/assigned to a varchar value exceeds the length specifier, the string is silently truncated. Character length is determined by the number of code points contained by the character string. Like string, trailing whitespace is significant in varchar and will affect comparison results.

A small demonstration of the truncation is sketched at the end of this post. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
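The silent truncation in action (the table and values here are hypothetical):

```sql
-- values longer than the length specifier are cut off without error:
CREATE TABLE demo_vc (v varchar(5));
INSERT INTO demo_vc VALUES ('abcdefgh');
SELECT v FROM demo_vc;   -- returns 'abcde'
```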