Member since: 10-13-2016
Posts: 9
Kudos Received: 2
Solutions: 0
04-14-2020
12:10 AM
You can use .repartition(1) on the DataFrame, e.g. df.repartition(1) ...
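A minimal runnable sketch of that suggestion, assuming the goal is to produce a single output file; the input/output paths and formats are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; any DataFrame source works the same way.
df = spark.read.parquet("/data/input")

# repartition(1) shuffles everything into one partition, so the write produces a single file.
df.repartition(1).write.mode("overwrite").csv("/data/output_single_file", header=True)

Keep in mind that repartition(1) funnels all rows through a single task, so it only makes sense for outputs small enough for one executor to handle.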
09-15-2017
06:46 AM
1 Kudo
@Gundrathi babu you can try this with groupBy and filter in PySpark, which you mentioned in your question. Sample: grp = df.groupBy("id").count() (count() takes no arguments), then fil = grp.filter(grp["id"] == "") (DataFrame.filter expects a column expression rather than a lambda). fil will then hold the grouped result with the counts. Hope it helps!! I don't have a running Spark cluster handy to verify the code, but this flow should help you solve the issue.
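A fuller sketch of that flow, still unverified against a live cluster; the "id" column comes from the snippet above, the sample data is made up, and the empty-id filter mirrors the original lambda, so adjust the condition to whatever the question actually needs:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the questioner's DataFrame.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("", 3), ("b", 4)],
    ["id", "value"],
)

# Count rows per id.
grp = df.groupBy("id").count()

# Keep only the groups of interest; here, groups whose id is the empty string.
fil = grp.filter(F.col("id") == "")

fil.show()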
09-12-2017
10:03 AM
Hi @Gundrathi babu By using coalesce/repartition you are re-distributing the data across partitions. Whereas if the data is stored in Hive as separate partitions, there is metadata available in HCatalog/the Hive metastore that lets Hive get the count much faster than Spark. If you want the row count of each partition, and assuming table stats are enabled/collected, it will again perform better than Spark. Spark, on the other hand, does not maintain separate metadata of its own, and that is the reason for the performance difference. Hope it helps!!
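As an illustration of the difference, here is a sketch with hypothetical table and partition column names; it collects per-partition statistics into the metastore and then asks for per-partition counts. Whether the counts are actually answered from metadata depends on the engine and its settings (e.g. Hive's stats-based query answering), whereas a plain Spark count() always scans the underlying files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Collect per-partition statistics into the Hive metastore
# (my_db.my_table and the dt partition column are hypothetical).
spark.sql("ANALYZE TABLE my_db.my_table PARTITION (dt) COMPUTE STATISTICS")

# Per-partition row counts; with stats collected these can be served largely from metadata.
spark.sql("SELECT dt, COUNT(*) AS row_count FROM my_db.my_table GROUP BY dt").show()

# The equivalent pure-Spark count has no such metadata and scans the files on every run.
print(spark.table("my_db.my_table").count())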
02-08-2017
03:36 PM
1 Kudo
As for >2 GB blobs, Hive STRING or even BINARY won't handle them, AFAIK. But that is just from googling, so Hive experts please add your thoughts. Please note that the "InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit." part of your stack trace tells you that you hit the limits of Protocol Buffers, not a Hive field type limitation. That could explain the 500 MB limit you observed in your investigation. In the Hive code, in the ORC input stream implementation, I could see that there is a 1 GB protobuf limit set, but that is for the whole message, and the blob is only a part of it.