Member since: 12-01-2015
Posts: 13
Kudos Received: 2
Solutions: 0
02-08-2021
09:54 AM
Thanks, I set the same Avro schema in both the reader and writer before reading this, and that worked after I marked some fields as optional. The last task is to get a better understanding of how I can control the output filename. Maybe something like this: https://community.cloudera.com/t5/Support-Questions/Nifi-date-filename/m-p/172305
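In case it helps anyone later, the schema that worked was shaped roughly like this (the record and field names here are just placeholders; the optional field uses the union-with-null pattern with a default):

{
  "type": "record",
  "name": "MyRecord",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "comment", "type": ["null", "string"], "default": null }
  ]
}

For the filename, I'm guessing the thread linked above boils down to an UpdateAttribute processor ahead of PutHDFS that overwrites the filename attribute with an expression along the lines of data-${now():format('yyyyMMdd-HHmmss')}.parquet (the prefix and format string are just an example).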
02-05-2021
02:19 PM
Thanks Matt. The author of that article recommended this, so I'll give it a try: "Also, NiFi has now a Parquet Record that you can use outside of the PutParquet. I advise you to use ConvertRecord then PutHDFS directly. This is better." I'm a little unsure of how and where I need to configure the schemas, but hopefully I'll figure it out.
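If I'm reading it right, the flow would look something like this (just a sketch based on the docs; processor and property names may vary a bit by NiFi version):

GetFile -> ConvertRecord -> PutHDFS

ConvertRecord:
  Record Reader = JsonTreeReader
  Record Writer = ParquetRecordSetWriter

JsonTreeReader / ParquetRecordSetWriter (controller services):
  Schema Access Strategy = Use 'Schema Text' Property
  Schema Text = <the shared Avro schema>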
02-05-2021
06:07 AM
I'm trying to convert a local JSON file (picked up with the GetFile processor) into a Parquet file on HDFS. I'm following this guide: https://medium.com/@abdelkrim.hadjidj/using-apache-nifi-for-json-to-parquet-conversion-945219d5caba I chose a Record Reader of JsonTreeReader, but I don't see any of the schema properties (Schema Access Strategy, Schema Text) in the PutParquet processor. How do I get those to show up, or how do I enter them manually?
Labels:
- Apache NiFi
05-02-2018
02:16 PM
Thanks Romainr, you pointed us in the right direction. I was setting it in several incorrect places in CM Impala and Hue. The place that finally worked was Hue --> Configuration --> Hue Server Advanced Configuration Snippet (Safety Valve) for hue_safety_valve_server.ini:

[impala]
query_timeout_s=86400
05-01-2018
06:21 PM
I've set QUERY_TIMEOUT_S to 604800 (1 week) in the Impala configuration safety valve and restarted Impala and Hue, but it doesn't seem to matter. My query results show up for a while in Hue, then after a few minutes change to 'query failed'. Looking at the Impala queries in CM, they show as: Query 324d091f74c2ccc2:89c7ef9000000000 expired due to client inactivity (timeout is 10m). I must be missing something. What is the default behavior for Hue Impala queries, do they expire after 10 minutes?
04-30-2018
04:14 PM
Getting this message when trying to download the results of a query that has been sitting for a while (probably more than 10 minutes):

Query db41f6674e5688ca:cc35f3ad00000000 expired due to client inactivity (timeout is 10m)

What setting controls how long these results stick around? We have the following set on our Impala daemons, which I thought would prevent sessions and results from expiring, but apparently not:

- idle_session_timeout (int32), default 0, current value 0: The time, in seconds, that a session may be idle for before it is closed (and all running queries cancelled) by Impala. If 0, idle sessions are never expired.
- idle_query_timeout (int32), default 0, current value 0: The time, in seconds, that a query may be idle for (i.e. no processing work is done and no updates are received from the client) before it is cancelled. If 0, idle queries are never expired. The query option QUERY_TIMEOUT_S overrides this setting, but, if set, --idle_query_timeout represents the maximum allowable timeout.

We'd like to keep query results around for a long time if possible. Do we have to enable proxy load balancing to get Impala to keep query results accessible for longer? Version: Cloudera Enterprise 5.13.0
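(For reference, those settings correspond to impalad startup flags, so on the daemons it amounts to something like: impalad ... --idle_session_timeout=0 --idle_query_timeout=0, with the rest of the command line omitted here.)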
Labels:
- Apache Impala
02-05-2018
02:31 PM
After some experimenting, here's what we found. I'd be interested if anyone enthusiastically agrees or passionately disagrees with this. We tested the following scenarios:
1) 128 MB block, 128 MB file, gzip
2) 128 MB block, 1 GB file, gzip
3) 1 GB block, 1 GB file, gzip
4) 128 MB block, 128 MB file, snappy
5) 128 MB block, 1 GB file, snappy
6) 1 GB block, 1 GB file, snappy
The worst in storage and performance seemed to be the two cases where the block size was much smaller than the file size, in both compression formats, so strike out #2 and #5. Performance for 1, 3, 4, and 6 all seemed very similar in the queries we tested, but gzip used only about 60% as much storage, so we're probably going with gzip. Finally, we're thinking the smaller block and file size is probably the way to go, to get a little more parallelism.
12-12-2017
08:37 AM
Thanks for the response, good info. We've decided to do some testing and profiling ourselves to have a little more confidence before we start the migration. We're going to test a much smaller data set with some variations of file size, block size, and compression and see which performs best in Impala.
12-05-2017
09:46 AM
Good question, I suppose I'm looking for an optimal balance of both, maybe with a lean toward performance rather than disk space usage. We don't care about compression speed, but we do care about compression ratio and decompression speed. We're currently using LZO-compressed sequence files and Hive to query the data. We've decided to convert the data over to Parquet to use with Impala because we see huge query performance gains in our smaller dev environment. We'd also like to pick a future-proof option that will work well with other tools that can query/analyze a large set of data in Parquet format. The data is constantly flowing in from Flume, but once it goes through our ingestion pipeline/ETL it won't change. I haven't been able to find much info on the web about whether to compress Parquet with splittable Snappy or non-splittable Gzip, or any discussion about optimal file sizes on HDFS for each.
12-04-2017
04:46 PM
Looking for some guidance on the size and compression of Parquet files for use in Impala. We have written a Spark program that creates our Parquet files, and we can control the size and compression of the files (Snappy, Gzip, etc.). Now we just need to make a decision on their size and compression.
I've read that Snappy is splittable, and that you should make your files bigger than your HDFS block size. So if our block size is 128 MB, would 1 GB Snappy-compressed Parquet files be a good choice? Is it possible to increase or decrease the amount of compression with Snappy?
Gzip gives us better compression, but since it's not splittable, what should we set the max file size to if using Gzip? Should it be no more than our HDFS block size?
We will have lots of partitions and some of them will be large, hundreds of GB or bigger. The total amount of data will be hundreds of terabytes or more.
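For context, this is roughly the shape of the write in our Spark job (a PySpark sketch; the paths, partition column, and repartition count are placeholders, and passing the Parquet row-group size as a spark.hadoop.* property is an assumption that may need adjusting for your environment):

from pyspark.sql import SparkSession

# Parquet row-group size in bytes, aligned here with a 128 MB HDFS block
# (assumption: passed through as a spark.hadoop.* property).
spark = (SparkSession.builder
         .appName("parquet-sizing-test")
         .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)
         .getOrCreate())

df = spark.read.json("/data/raw/events")  # placeholder source path

# repartition() controls how many files each write produces, and therefore
# their approximate size; 200 is a placeholder we would tune per table.
(df.repartition(200)
   .write
   .mode("overwrite")
   .option("compression", "gzip")   # or "snappy" / "none"
   .partitionBy("event_date")       # placeholder partition column
   .parquet("/data/parquet/events"))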
Labels:
- Apache Impala
- Apache Spark
- HDFS