Member since: 12-01-2015
Posts: 13
Kudos Received: 2
Solutions: 0
02-08-2021
09:54 AM
Thanks, I set the same Avro schema in both the reader and writer before reading this, and that worked after I marked some fields as optional. The last task is to get a better understanding of how I can control the output filename. Maybe something like this: https://community.cloudera.com/t5/Support-Questions/Nifi-date-filename/m-p/172305
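In case it helps anyone later, the schema that worked was shaped roughly like this (the record and field names here are just placeholders; the optional field uses the union-with-null pattern with a default):

{
  "type": "record",
  "name": "MyRecord",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "comment", "type": ["null", "string"], "default": null }
  ]
}

For the filename, I'm guessing the thread linked above boils down to an UpdateAttribute processor ahead of PutHDFS that overwrites the filename attribute with an expression along the lines of data-${now():format('yyyyMMdd-HHmmss')}.parquet (the prefix and format string are just an example).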
02-05-2021
02:19 PM
Thanks Matt. The author of that article recommended this, so I'll give it a try: "Also, NiFi has now a Parquet Record that you can use outside of the PutParquet. I advise you to use ConvertRecord then PutHDFS directly. This is better." I'm a little unsure of how and where I need to configure the schemas, but hopefully I'll figure it out.
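If I'm reading it right, the flow would look something like this (just a sketch based on the docs; processor and property names may vary a bit by NiFi version):

GetFile -> ConvertRecord -> PutHDFS

ConvertRecord:
  Record Reader = JsonTreeReader
  Record Writer = ParquetRecordSetWriter

JsonTreeReader / ParquetRecordSetWriter (controller services):
  Schema Access Strategy = Use 'Schema Text' Property
  Schema Text = <the shared Avro schema>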
02-05-2021
06:07 AM
I'm trying to convert a local JSON file (picked up with the GetFile processor) into a Parquet file on HDFS. I'm following this guide: https://medium.com/@abdelkrim.hadjidj/using-apache-nifi-for-json-to-parquet-conversion-945219d5caba I chose a Record Reader of JsonTreeReader, but I don't see any of the schema properties (Schema Access Strategy, Schema Text) in the PutParquet processor. How do I get those to show up, or how do I enter them manually?
Labels:
- Apache NiFi
05-02-2018
02:16 PM
Thanks Romainr, you pointed us in the right direction. I was setting it in several incorrect places in CM Impala and Hue. The place that finally worked was Hue --> Configuration --> Hue Server Advanced Configuration Snippet (Safety Valve) for hue_safety_valve_server.ini:

[impala]
query_timeout_s=86400
05-01-2018
06:21 PM
I've set QUERY_TIMEOUT_S to 604800 (1 week) in the Impala configuration safety valve and restarted Impala and Hue, but it doesn't seem to matter. My query results show up for a while in Hue, then after a few minutes change to 'query failed'. Looking at the Impala queries in CM, they show as: Query 324d091f74c2ccc2:89c7ef9000000000 expired due to client inactivity (timeout is 10m). I must be missing something. What is the default behavior for Hue Impala queries, do they expire after 10 minutes?
04-30-2018
04:14 PM
Getting this message when trying to download the results of a query that has been sitting for a while (probably more than 10 minutes):

Query db41f6674e5688ca:cc35f3ad00000000 expired due to client inactivity (timeout is 10m)

What setting controls how long these results stick around? We have the following set on our Impala daemons, which I thought would prevent sessions and results from expiring, but apparently not:

- idle_session_timeout (int32), default 0, current value 0: The time, in seconds, that a session may be idle for before it is closed (and all running queries cancelled) by Impala. If 0, idle sessions are never expired.
- idle_query_timeout (int32), default 0, current value 0: The time, in seconds, that a query may be idle for (i.e. no processing work is done and no updates are received from the client) before it is cancelled. If 0, idle queries are never expired. The query option QUERY_TIMEOUT_S overrides this setting, but, if set, --idle_query_timeout represents the maximum allowable timeout.

We'd like to keep query results around for a long time if possible. Do we have to enable proxy load balancing to get Impala to keep query results accessible for longer? Version: Cloudera Enterprise 5.13.0
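(For reference, those settings correspond to impalad startup flags, so on the daemons it amounts to something like: impalad ... --idle_session_timeout=0 --idle_query_timeout=0, with the rest of the command line omitted here.)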
Labels:
- Apache Impala
02-05-2018
02:31 PM
After some experimenting, here's what we found. I'd be interested if anyone enthusiastically agrees or passionately disagrees with this. We tested the following scenarios:
1) 128 MB block, 128 MB file, gzip
2) 128 MB block, 1 GB file, gzip
3) 1 GB block, 1 GB file, gzip
4) 128 MB block, 128 MB file, snappy
5) 128 MB block, 1 GB file, snappy
6) 1 GB block, 1 GB file, snappy
The worst in storage and performance seemed to be the two cases where the block size was much smaller than the file size, in both compression formats, so strike out #2 and #5. Performance for 1, 3, 4, and 6 all seemed very similar in the queries we tested, but gzip used only about 60% as much storage, so we're probably going with gzip. Finally, we're thinking the smaller block and file size is probably the way to go, to get a little more parallelism.
12-12-2017
08:37 AM
Thanks for the response, good info. We've decided to do some testing and profiling ourselves to have a little more confidence before we start the migration. We're going to test a much smaller data set with some variations of file size, block size, and compression and see which performs best in Impala.
12-05-2017
09:46 AM
Good question, I suppose I'm looking for an optimal balance of both, maybe with a lean toward performance rather than disk space usage. We don't care about compression speed, but we do care about compression ratio and decompression speed. We're currently using LZO-compressed sequence files and Hive to query the data. We've decided to convert the data over to Parquet to use with Impala because we see huge query performance gains in our smaller dev environment. We'd also like to pick a future-proof option that will work well with other tools that can query/analyze a large set of data in Parquet format. The data is constantly flowing in from Flume, but once it goes through our ingestion pipeline/ETL it won't change. I haven't been able to find much info on the web about whether to compress Parquet with splittable Snappy or non-splittable Gzip, or any discussion about optimal file sizes on HDFS for each.
12-04-2017
04:46 PM
Looking for some guidance on the size and compression of Parquet files for use in Impala. We have written a Spark program that creates our Parquet files, and we can control the size and compression of the files (Snappy, Gzip, etc.). Now we just need to make a decision on their size and compression.
I've read that Snappy is splittable, and that you should make your files bigger than your HDFS block size. So if our block size is 128 MB, would 1 GB Snappy-compressed Parquet files be a good choice? Is it possible to increase or decrease the amount of compression with Snappy?
Gzip gives us better compression, but since it's not splittable, what should we set the max file size to if using Gzip? Should it be no more than our HDFS block size?
We will have lots of partitions and some of them will be large, hundreds of GB or bigger. The total amount of data will be hundreds of terabytes or more.
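For context, this is roughly the shape of the write in our Spark job (a PySpark sketch; the paths, partition column, and repartition count are placeholders, and passing the Parquet row-group size as a spark.hadoop.* property is an assumption that may need adjusting for your environment):

from pyspark.sql import SparkSession

# Parquet row-group size in bytes, aligned here with a 128 MB HDFS block
# (assumption: passed through as a spark.hadoop.* property).
spark = (SparkSession.builder
         .appName("parquet-sizing-test")
         .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)
         .getOrCreate())

df = spark.read.json("/data/raw/events")  # placeholder source path

# repartition() controls how many files each write produces, and therefore
# their approximate size; 200 is a placeholder we would tune per table.
(df.repartition(200)
   .write
   .mode("overwrite")
   .option("compression", "gzip")   # or "snappy" / "none"
   .partitionBy("event_date")       # placeholder partition column
   .parquet("/data/parquet/events"))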
Labels:
- Apache Impala
- Apache Spark
- HDFS