Member since: 07-19-2018
Posts: 613
Kudos Received: 100
Solutions: 117

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3869 | 01-11-2021 05:54 AM |
 | 2700 | 01-11-2021 05:52 AM |
 | 7455 | 01-08-2021 05:23 AM |
 | 6790 | 01-04-2021 04:08 AM |
 | 30818 | 12-18-2020 05:42 AM |
01-19-2021
04:30 AM
@singyik Yes, I believe that is the last free public repo. Who knows how long it will remain available. If you are using it, I would recommend fully copying it and using the copy.
01-11-2021
05:54 AM
It sounds like the reader is incorrectly configured for your CSV schema.
01-11-2021
05:52 AM
2 Kudos
@Lallagreta You should be able to define the filename, or change the filename to whatever you want. That said, the filename doesn't dictate the type, so you can have Parquet saved as .txt.

One recommendation I have is to use the parquet command line tools while testing your use case. This is the best way to validate that the files look right and have the correct schema and results: https://pypi.org/project/parquet-tools/ I apologize that I do not have any exact samples, but from my recollection of a year ago, there are simple commands to check the schema of a file and another command to show the data. You may have to copy your HDFS files to the local file system to inspect them from the command line (see the sketch after this reply for one way to do the same checks in Python).

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.

Thanks,

Steven
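For reference, here is a minimal Python sketch, assuming pyarrow is installed, that performs the same two checks the parquet-tools CLI is suggested for above; the file path is a placeholder.

```python
# Minimal sketch using pyarrow instead of the parquet-tools CLI.
# The path is a placeholder; copy the file out of HDFS first
# (e.g. with `hdfs dfs -get`) if you need to inspect it locally.
import pyarrow.parquet as pq

local_path = "part-00000.parquet"  # placeholder path

# 1. Print the schema embedded in the Parquet file.
print(pq.read_schema(local_path))

# 2. Read the file and show the first few rows to sanity-check the values.
table = pq.read_table(local_path)
print(table.slice(0, 5).to_pandas())
```

Running it against a copied-out file should print the column names and types followed by a small preview of the rows.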
01-08-2021
05:23 AM
2 Kudos
@murali2425 The solution you are looking for is QueryRecord configured with a CSV Record Reader and Record Writer. You also have UpdateRecord and ConvertRecord, which can use the same Readers/Writers. This method is preferred over splitting the file and adds some nice functionality: it allows you to provide a schema for both the inbound CSV (reader) and the downstream CSV (writer).

Using QueryRecord you should be able to split the file and set the filename attribute to the value of column1. At the end of the flow you should be able to leverage that filename attribute to re-save the new file (the sketch after this reply illustrates the same split-by-column result outside NiFi). You can find some specific examples and configuration screenshots here: https://community.cloudera.com/t5/Community-Articles/Running-SQL-on-FlowFiles-using-QueryRecord-Processor-Apache/ta-p/246671

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.

Thanks,

Steven
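Purely to illustrate the intended end result outside NiFi (one output file per distinct value of column1, named after that value), here is a minimal pandas sketch. It is not a NiFi configuration, and the file name and column name are placeholders.

```python
# Plain-Python illustration of the split that QueryRecord performs in NiFi:
# group a CSV by one column and write one file per distinct value, named
# after that value. "input.csv" and "column1" are placeholder names.
import pandas as pd

df = pd.read_csv("input.csv")

for value, group in df.groupby("column1"):
    group.to_csv(f"{value}.csv", index=False)
```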
01-05-2021
01:34 PM
ARRAYs are a bit tricky, but a JSON Reader and Writer may work better.
01-05-2021
10:49 AM
@kiranps11 Did you add and start a "DistributedMapCacheServer" controller service running on port 4557? The "DistributedMapCacheClientService" controller service only creates a client that is used to connect to a server you must also create.

Keep in mind that the DistributedMapCacheServer does not offer High Availability (HA). Enabling this controller service will start a DistributedMapCacheServer on each node in your NiFi cluster, but those servers do not talk to each other. This is important to understand since you have configured your DMC Client to use localhost, which means that each node in your cluster would be using its own DMC server rather than a single shared DMC server.

For an HA solution you should use an external map cache via one of the other client offerings, such as "HBase_2_ClientMapCacheService" or "RedisDistributedMapCacheClientService", but this would require you to set up that external HBase or Redis server with HA yourself.

Hope this helps,

Matt
01-04-2021
05:07 AM
@schnell Glad you were able to find the remnant that blocked the re-install. Here is my SO reply, which gives some details about how to completely remove HDP and its components from a node's file system.

With Ambari, any service that is deleted with the UI will still exist on the original node(s) the service was installed on. You would need to manually remove it from the node(s). This process is hard to find documentation on, but it basically goes as follows:

- Delete the application from file system locations such as /etc/, /var/, /opt/, etc.
- Remove user accounts/groups (a rough sketch of this cleanup follows this reply)

You can find more details in the blog post below, which goes into some of the detail for completely removing HDP. Just follow the steps for a single service.

https://henning.kropponline.de/2016/04/24/completely-uninstall-remove-hdp-nodes/
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/ch_uninstalling_hdp_chapter.html
https://gist.github.com/hourback/085500397bb2588964c5

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.

Thanks,

Steven
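For what it is worth, here is a rough, hypothetical Python sketch of that manual cleanup. The service name, directories, and account names are placeholders only, so verify that each path really belongs to the removed service before deleting anything.

```python
# Hypothetical cleanup sketch for leftovers of a service removed via Ambari.
# The service name, directories, and accounts below are placeholders;
# confirm each one on your node before running anything destructive.
import shutil
import subprocess
from pathlib import Path

service = "zookeeper"  # placeholder service name
leftover_dirs = [      # placeholder locations to check on the node
    f"/etc/{service}",
    f"/var/log/{service}",
    f"/var/lib/{service}",
]

for d in leftover_dirs:
    if Path(d).exists():
        shutil.rmtree(d)
        print(f"removed {d}")

# Remove the service's local user and group (placeholder names).
subprocess.run(["userdel", "-r", service], check=False)
subprocess.run(["groupdel", service], check=False)
```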
12-22-2020
11:42 AM
Amazing work here sir!
12-21-2020
09:01 AM
We have some background on schema evolution in Parquet in the docs: https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html (see "Schema Evolution for Parquet Tables"). Some of the details are specific to Impala, but the concepts are the same across the engines that use Parquet tables, including Hive and Spark.

At a high level, you can think of the data files as immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the Parquet files unchanged. When the table is queried, the table schema and the Parquet file schema are reconciled, and the new column's values will all be NULL.

If you want to modify the existing rows to include new non-NULL values, that requires rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE .. AS SELECT to create an entirely new table. Keep in mind that traditional Parquet tables are not optimized for workloads with updates; Apache Kudu in particular, and also transactional tables in Hive 3+, have support for row-level updates that is more convenient and efficient. We definitely don't require rewriting the whole table every time you want to add a column; that would be impractical for large tables!
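As a rough illustration, here is a PySpark sketch of my own (not from the docs, and the path is a placeholder) showing that existing Parquet files read back with an extended, nullable schema return the new column as all NULL without any rewrite.

```python
# Sketch of Parquet schema evolution with Spark: the data files are left
# unchanged, and a column added to the schema simply reads back as NULL.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("parquet-schema-evolution").getOrCreate()
path = "/tmp/parquet_evolution_demo"  # placeholder location

# Write a small table with two columns.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"]) \
    .write.mode("overwrite").parquet(path)

# Read it back with a schema that adds a third, nullable column that the
# existing data files do not contain.
evolved = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("comment", StringType()),  # not present in the data files
])
spark.read.schema(evolved).parquet(path).show()
# The "comment" column comes back as all NULL; nothing was rewritten.
```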
12-18-2020
05:42 AM
1 Kudo
Excellent news. Can you accept the first 2 responses to close this solution?