Member since 09-23-2015 | 88 Posts | 109 Kudos Received | 1 Solution
12-07-2016
03:38 PM
Issues:

1) In your table definition ("create table ...") you do not specify the LOCATION attribute of the table, so Hive defaults to looking for the files under the default warehouse directory. The location in your screenshot is under /user/admin/. You can run "show create table ..." to see where Hive thinks the table's files are located. By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the path set by the hive.metastore.warehouse.dir property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden with the LOCATION clause during table creation.

2) You are specifying the format using hive.default.fileformat. I would avoid using this property. Instead, simply use "STORED AS TEXTFILE" or "STORED AS ORC" in your table definition.

Please change the above, retest, and let us know how that works.
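To make both suggestions concrete, here is a minimal sketch that assembles a table definition with an explicit STORED AS and LOCATION clause. The helper `build_create_table`, the table name, the columns and the directory are all hypothetical placeholders, and the ROW FORMAT clause is left out for brevity; adjust everything to your own table before running the statement in Hive.

```python
def build_create_table(table, columns, location, stored_as="TEXTFILE"):
    """Return a Hive CREATE TABLE statement with explicit format and location.

    `columns` is a list of (column_name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns and directory, for illustration only.
ddl = build_create_table(
    "demo_table",
    [("id", "INT"), ("name", "STRING")],
    "/user/admin/demo_table",
)
print(ddl)
```

After creating the table this way, "show create table demo_table" should report the directory you passed in rather than a path under the warehouse directory.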
11-28-2016
07:57 PM
1 Kudo
My guess is that your local "Cluster A" config values are overriding the "-D" option you use to set the defaultFS parameter, i.e. the local Cluster A values may have higher priority. I would have expected your second command with "hadoop fs -ls" to display the remote cluster's file directory. Perhaps there was a typo or some other reason why it is not being picked up? Alternatively, could you use the WebHDFS REST API (from bash or Python) to list directories? https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#LISTSTATUS
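As a rough sketch of the WebHDFS route in Python, the snippet below builds a LISTSTATUS request and extracts names and types from the JSON response. The host name and port are assumptions (use the remote cluster's NameNode HTTP address), and it assumes an unsecured cluster with WebHDFS enabled; a Kerberized cluster would need extra authentication that is not shown here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def liststatus_url(host, port, path):
    """Build the WebHDFS LISTSTATUS URL for an absolute HDFS path."""
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?op=LISTSTATUS"


def parse_liststatus(body):
    """Return (name, type) pairs from a LISTSTATUS JSON response body."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"]) for s in statuses]


def list_remote_dir(host, port, path):
    """Fetch and parse a directory listing from the remote NameNode."""
    with urlopen(liststatus_url(host, port, path)) as resp:
        return parse_liststatus(resp.read().decode("utf-8"))


# Hypothetical usage; replace host and port with the remote cluster's values:
# for name, kind in list_remote_dir("namenode-b.example.com", 50070, "/user/admin"):
#     print(kind, name)
```

Because this talks to the remote NameNode over HTTP, it does not depend on the local cluster's defaultFS setting at all.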
11-28-2016
07:46 PM
How did you write the ORC file to this location (Pig, Spark, NiFi, other)? Can you show the schema of the table, the contents of the folder where the ORC files are written, and any details or code on how the file was ingested?
11-23-2016
06:55 PM
Well done Ned!
08-25-2016
08:23 AM
Appreciate the correction 😉
08-24-2016
09:13 PM
1 Kudo
My instinct is that the default Hive SerDe would be used and would not automatically skip over the col2 value as you've shown in your example. A few options for you:

1) Ingest the raw CSV data into a 3-column temp Hive table, then perform an "Insert ... Select * from temp_hive_table" to push those three column values into your destination Hive table.
2) Write a brief Pig script to parse the CSV table and push to your destination Hive table.
3) Write your own Hive SerDe: https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inandCustomSerDes

Cheers!

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe
08-24-2016
03:34 PM
To your questions:

"Will the processing be distributed" - Yes, the incoming flow of data will be evenly distributed to all available NiFi instances in the NiFi cluster. The NCM will act as load balancer.

"Distribute the fetching of the source file" - Could you elaborate on what you mean by this? In your existing example the standalone instance is the only one which has access to its local filesystem. How would you prefer to distribute this?
05-25-2016
10:40 AM
I haven't seen a full document that covers sanity checking the entire cluster. This is often performed by the PS team at customer engagements. Side note: the individual component test I rely on most to smoke test a cluster is Hive-TestBench.
04-07-2016
08:30 PM
2 Kudos
Does Knox's Hive service configuration support the overloaded URLs used in the HiveServer2 + ZK HA approach? For example, would this be a supportable Knox configuration:

<service>
<role>HIVE</role>
<url>http://zk1.customer.com:2181,zk2.customer.com:2181,zk2.customer.com:2181/cliservice</url>
</service>

Not sure if the ZK URL can be passed through Knox?
03-17-2016
10:06 PM
1 Kudo
Please find below a potential design for disk and RAID configuration for a typical 12-disk server running NiFi. This design is intended for a simple log ingestion use case, where the customer needs very few provenance records but would also like reliability at the storage layer.

FlowFile repo: 2 drives set up as RAID 1
Provenance repo: 2 drives set up as RAID 1
Content repo, set up as either:
4 drives (RAID 10) /cont_repo1
4 drives (RAID 10) /cont_repo2
or
2 drives (RAID 1) /cont_repo1
2 drives (RAID 1) /cont_repo2
2 drives (RAID 1) /cont_repo3
2 drives (RAID 1) /cont_repo4

Thanks @mpayne & @Andrew Grande for guidance!
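As a quick sanity check on the two content repo options, the sketch below tallies drives per layout and the usable content capacity in units of one drive. It assumes two-way mirroring for both RAID 1 and RAID 10, so usable capacity is half the raw drive count; the layout names and helper functions are made up for illustration.

```python
# Each entry is (purpose, RAID level, number of drives).
COMMON = [("flowfile", "RAID 1", 2), ("provenance", "RAID 1", 2)]
OPTION_RAID10 = COMMON + [
    ("/cont_repo1", "RAID 10", 4),
    ("/cont_repo2", "RAID 10", 4),
]
OPTION_RAID1 = COMMON + [
    (f"/cont_repo{i}", "RAID 1", 2) for i in range(1, 5)
]


def total_drives(layout):
    """Total physical drives consumed by a layout."""
    return sum(drives for _, _, drives in layout)


def usable_content_drives(layout):
    """Usable content repo capacity, in drives, assuming two-way mirrors."""
    return sum(
        drives // 2
        for purpose, _, drives in layout
        if purpose.startswith("/cont_repo")
    )


for name, layout in [("RAID 10", OPTION_RAID10), ("RAID 1", OPTION_RAID1)]:
    print(name, total_drives(layout), usable_content_drives(layout))
```

Both options use all 12 drives and give the same usable content capacity under this assumption; the difference is whether that capacity is presented as two larger volumes or four smaller ones.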