Member since 09-23-2015

88 Posts
109 Kudos Received
1 Solution
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 8501 | 08-24-2016 09:13 PM |

12-07-2016 03:38 PM

Issues:

1) In your table definition ("create table ...") you do not specify the LOCATION attribute of your table, so Hive defaults to looking for the files under the default warehouse directory path, while the location in your screenshot is under /user/admin/. You can run "show create table ..." to see where Hive thinks the table's files are located. By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden with the LOCATION clause during table creation.

2) You are specifying the format using hive.default.fileformat. I would avoid relying on this property; instead simply use "STORED AS TEXTFILE" or "STORED AS ORC" in your table definition.

Please change the above, retest, and let us know how that works.
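As a minimal sketch of both fixes (the table name, columns, HDFS path and HiveServer2 JDBC URL below are placeholders, not values from your screenshot), submitted through beeline:

```python
# Hypothetical sketch only: table name, columns, HDFS path and JDBC URL are placeholders.
import subprocess

JDBC_URL = "jdbc:hive2://hiveserver2.example.com:10000/default"

# Explicit STORED AS instead of hive.default.fileformat, and an explicit LOCATION
# instead of the warehouse default.
DDL = """
CREATE TABLE my_table (
  col1 STRING,
  col2 INT
)
STORED AS TEXTFILE
LOCATION '/user/admin/my_table'
"""

# Confirms where Hive thinks the table's files live.
CHECK = "SHOW CREATE TABLE my_table"

for stmt in (DDL, CHECK):
    subprocess.run(["beeline", "-u", JDBC_URL, "-e", stmt], check=True)
```

SHOW CREATE TABLE should now report a LOCATION under /user/admin/ rather than the warehouse default.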
						
					
11-28-2016 07:57 PM (1 Kudo)

My guess is that your local "Cluster A" config values are superseding your use of the "-D" option to override the defaultFS parameter, i.e. the local Cluster A values may be taking higher priority. I would have expected your second command with "hadoop fs -ls" to display the remote cluster's file listing; perhaps there was a typo or some other reason why the option is not being picked up?

Could you alternatively use the WebHDFS REST API (from bash or Python) to list directories? https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#LISTSTATUS
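A rough sketch of that WebHDFS alternative (the NameNode host/port, path and user name below are placeholders for the remote cluster's values):

```python
# Hypothetical sketch only: NameNode host/port, path and user are placeholders.
import requests

NAMENODE = "http://remote-nn.example.com:50070"  # HTTP address of the remote cluster's NameNode
PATH = "/user/admin"

# WebHDFS LISTSTATUS against the remote NameNode.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1{PATH}",
    params={"op": "LISTSTATUS", "user.name": "admin"},
)
resp.raise_for_status()

# One line per entry: FILE/DIRECTORY and the entry name.
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"])
```

Because this goes straight to the remote NameNode over HTTP, it bypasses whatever your local Cluster A client configuration resolves for defaultFS.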
						
					
11-28-2016 07:46 PM

How did you write the ORC file to this location (Pig, Spark, NiFi, or something else)? Can you share the schema of the table, the contents of the folder where the ORC files are written, and any details/code on how the file was ingested?
						
					
11-23-2016 06:55 PM

							 Well done Ned! 
						
					
08-25-2016 08:23 AM

							 Appreciate the correction 😉 
						
					
08-24-2016 09:13 PM (1 Kudo)

My instinct is that the default Hive SerDe would be used and would not automatically skip over the col2 value as you've shown in your example. A few options for you:

1) Ingest the raw CSV data into a 3-column temp Hive table, then perform an "INSERT ... SELECT" from the temp table to push only the columns you need into your destination Hive table (see the sketch below).

2) Write a brief Pig script to parse the CSV data and push it to your destination Hive table.

3) Write your own Hive SerDe: https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inandCustomSerDes

Cheers!

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe
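A minimal sketch of option 1, assuming placeholder table names, columns, HDFS path and JDBC URL (yours will differ), submitted through beeline:

```python
# Hypothetical sketch of option 1: 3-column temp table, then INSERT ... SELECT
# only the needed columns into the destination table. All names/paths are placeholders.
import subprocess

JDBC_URL = "jdbc:hive2://hiveserver2.example.com:10000/default"

STATEMENTS = [
    # Temp table that matches the raw CSV layout exactly (all three columns).
    """
    CREATE TABLE temp_hive_table (
      col1 STRING,
      col2 STRING,
      col3 STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    """,
    # Load the raw CSV file that is already sitting in HDFS.
    "LOAD DATA INPATH '/user/admin/raw_data.csv' INTO TABLE temp_hive_table",
    # Push only col1 and col3 into the destination table, skipping col2.
    "INSERT INTO TABLE destination_table SELECT col1, col3 FROM temp_hive_table",
]

for stmt in STATEMENTS:
    subprocess.run(["beeline", "-u", JDBC_URL, "-e", stmt], check=True)
```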
						
					
08-24-2016 03:34 PM

To your questions:

- "Will the processing be distributed?" - Yes, the incoming flow of data will be evenly distributed across all available NiFi instances in the NiFi cluster; the NCM (NiFi Cluster Manager) acts as the load balancer.
- "Distribute the fetching of the source file" - could you elaborate on what you mean by this? In your existing example the standalone instance is the only one that has access to its local filesystem. How would you prefer to distribute this?
						
					
05-25-2016 10:40 AM

I haven't seen a single document that covers sanity-checking the entire cluster; this is often performed by the PS team during customer engagements. Side note: the most useful individual component test I use to smoke-test a cluster is Hive-TestBench.
						
					
04-07-2016 08:30 PM (2 Kudos)

Does Knox's Hive service configuration support the overloaded URLs used in the HiveServer2 + ZooKeeper HA approach? For example, would this be a supportable Knox configuration?

```xml
<service>
  <role>HIVE</role>
  <url>http://zk1.customer.com:2181,zk2.customer.com:2181,zk2.customer.com:2181/cliservice</url>
</service>
```

I'm not sure whether the ZK URL can be passed through Knox like this.
						
					
Labels: Apache Hive, Apache Knox

03-17-2016 10:06 PM (1 Kudo)

Please find below a potential disk & RAID configuration for a typical 12-disk server running NiFi. This design is intended for a simple log ingestion use case where the customer needs to retain very few provenance records but would also like reliability at the storage layer.

- FlowFile repo: 2 drives, RAID 1
- Provenance repo: 2 drives, RAID 1
- Content repo, either:
  - 4 drives (RAID 10) as /cont_repo1 and 4 drives (RAID 10) as /cont_repo2, or
  - 2 drives (RAID 1) each as /cont_repo1, /cont_repo2, /cont_repo3 and /cont_repo4

Thanks @mpayne & @Andrew Grande for the guidance!
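As a rough illustration of how that layout could be wired into NiFi's repository settings: the content-repo mount names come from the list above, the /flowfile_repo and /prov_repo mount names are assumed, and the property keys are the standard nifi.properties repository directories.

```python
# Hypothetical sketch: map the RAID-backed mount points above to nifi.properties
# repository settings. /flowfile_repo and /prov_repo are assumed mount names;
# multiple content repositories are declared with one property per directory suffix.
nifi_repo_properties = {
    "nifi.flowfile.repository.directory": "/flowfile_repo",           # 2 drives, RAID 1
    "nifi.provenance.repository.directory.default": "/prov_repo",     # 2 drives, RAID 1
    "nifi.content.repository.directory.cont_repo1": "/cont_repo1",    # RAID 10 (or RAID 1)
    "nifi.content.repository.directory.cont_repo2": "/cont_repo2",    # RAID 10 (or RAID 1)
}

# Print the lines as they would appear in nifi.properties.
for key, mount_point in nifi_repo_properties.items():
    print(f"{key}={mount_point}")
```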
						
					