Member since: 08-03-2019
Posts: 186
Kudos Received: 34
Solutions: 26
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1956 | 04-25-2018 08:37 PM
 | 5880 | 04-01-2018 09:37 PM
 | 1593 | 03-29-2018 05:15 PM
 | 6762 | 03-27-2018 07:22 PM
 | 2006 | 03-27-2018 06:14 PM
04-01-2018
03:05 PM
1 Kudo
@Selvaraju Sellamuthu Try using the following properties to control the mapper count for your job:

set tez.grouping.min-size=16777216;    -- 16 MB min split
set tez.grouping.max-size=1073741824;  -- 1 GB max split

These parameters control the number of mappers for splittable formats with Tez. Please share your results after running with these properties.
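As a rough worked example of how these sizes translate into mapper counts (the numbers and table name below are illustrative assumptions, not from this thread):

-- Illustrative sketch: with a 1 GB max split, a ~10 GB splittable input
-- is typically grouped into roughly 10 splits, i.e. about 10 mappers;
-- lowering tez.grouping.max-size raises the mapper count.
set tez.grouping.min-size=16777216;    -- 16 MB
set tez.grouping.max-size=1073741824;  -- 1 GB
select count(*) from my_large_table;   -- hypothetical table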
03-31-2018
05:31 AM
@Félicien Catherin If I understand your question correctly, you want to check all the attributes of a flow file and then take some action based on them. For this, you can use the getAttributes() method in your script. It returns a map with attribute names as keys and attribute values as values. For example:

flowFile = session.get()
attrMap = flowFile.getAttributes()

You can iterate over the map to check whether a certain attribute exists, or take whatever other action you need (a minimal sketch follows). Hope that helps!
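A minimal Jython sketch along those lines, assuming the standard REL_SUCCESS relationship of ExecuteScript; the attribute names checked and added here are just placeholders:

flowFile = session.get()
if flowFile is not None:
    attrMap = flowFile.getAttributes()
    # log every attribute name/value pair
    for entry in attrMap.entrySet():
        log.info("attribute " + entry.getKey() + " = " + entry.getValue())
    # act only if a specific (placeholder) attribute is present
    if attrMap.containsKey("my.attribute"):
        flowFile = session.putAttribute(flowFile, "my.attribute.seen", "true")
    session.transfer(flowFile, REL_SUCCESS)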
03-31-2018
05:22 AM
Please mark the answer as accepted if it resolved your problem. This way you help other community users with similar issues find the resolution faster.
03-30-2018
10:33 AM
No problem. If your existing approach is not working, please share some info about your ConvertRecord processor configuration, along with a sample of the data it succeeds on and the data it fails on, and I can look into it further.
03-30-2018
03:05 AM
@Krishna R In your Hive session, set the following properties:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

This enables compression and sets the compression codec, gzip in this case. Now you can insert the data into an HDFS directory and the output will be written in gzip format:

insert overwrite directory 'myHDFSDirectory' row format delimited fields terminated by ',' select * from myTable;

This stores the output of the select * query in the HDFS directory. Let me know if that works for you.
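If you want to sanity-check the compressed output, one option (a sketch; the path and column list below are placeholders) is to mount the directory as an external text table, since Hive reads gzip-compressed text files transparently:

create external table gzip_check (col1 string, col2 string)  -- hypothetical column list
row format delimited fields terminated by ','
location '/path/to/myHDFSDirectory';                         -- placeholder path
select * from gzip_check limit 5;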
03-30-2018
02:55 AM
@Stefan Constantin If you are using NiFi 1.2+, I would highly recommend NOT using EvaluateJSONPath. As for the alternate approach, ConvertRecord, where you are hitting the ArrayIndexOutOfBoundsException, the problem is probably with your schema. Make the columns in your schema for the ConvertRecord processor nullable, like this:

{
  "type": "record",
  "name": "nifi_logs",
  "fields": [
    {"name": "column1", "type": ["null", "string"]},
    {"name": "column2", "type": ["null", "string"]},
    {"name": "column3", "type": ["null", "string"]},
    {"name": "column4", "type": ["null", "string"]}
  ]
}

A variant with explicit null defaults follows below. Try this and let me know if you still face any problems. Cheers!
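If some records can be missing those fields entirely, you can also declare an explicit null default per field; a sketch using the same placeholder column names:

{
  "type": "record",
  "name": "nifi_logs",
  "fields": [
    {"name": "column1", "type": ["null", "string"], "default": null},
    {"name": "column2", "type": ["null", "string"], "default": null},
    {"name": "column3", "type": ["null", "string"], "default": null},
    {"name": "column4", "type": ["null", "string"], "default": null}
  ]
}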
03-29-2018
05:15 PM
@subbiram Padala I don't think so! The certification is purely based on your Spark skills, and external jars are generally left out. Also, if there were any such requirement, it would be explicitly mentioned that you need to store the header. And, as I mentioned earlier, that may not be the case. Keep your spirits up! All the best with your exam.
03-29-2018
05:41 AM
Have you seen the filter condition in my answer above?

val rdd = data.filter(row => row != header)

Now use a similar filter condition to drop your null records, if there are any, according to your use case (see the sketch below).
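A hedged sketch of what such a filter could look like, reusing the data and header values from my earlier answer; the emptiness checks are assumptions, so adapt them to what a "null record" means in your data:

val rdd = data.filter(row => row != header)         // drop the header line
val cleaned = rdd
  .filter(row => row.trim.nonEmpty)                 // drop blank lines
  .filter(row => !row.split(",", -1).contains(""))  // drop rows with any empty field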
03-28-2018
03:26 PM
@swathi thukkaraju You can do it without the CSV package. Use the following code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = new StructType()
  .add(StructField("name", StringType, true))
  .add(StructField("age", IntegerType, true))
  .add(StructField("state", StringType, true))

val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).toInt, x(2)))
val df = sqlContext.createDataFrame(rowsRDD, schema)

After this, run df.show and you will see your data in a relational format. Now you can fire whatever queries you want on your DataFrame, for example filtering by state and saving to HDFS. PS: if you want to persist your DataFrame as a CSV file, Spark 1.6 DOES NOT support it out of the box; you either need to convert it to an RDD and then save, or use the CSV package from Databricks (a sketch of both options follows). Let me know if that helps!
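A sketch of the two persistence options mentioned in the PS; the output paths are placeholders, and the Databricks spark-csv package must be on the classpath for the second one:

// Option 1: go back through an RDD of comma-joined lines
df.rdd.map(_.mkString(",")).saveAsTextFile("/user/206571870/output_csv")

// Option 2: the Databricks spark-csv package (Spark 1.6)
df.write.format("com.databricks.spark.csv").option("header", "true").save("/user/206571870/output_csv_pkg")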
03-27-2018
09:30 PM
@Manikandan Jeyabal What are your Spark and Hive versions? If you have Hive 2.x and a Spark version below 2.2, this is a known issue that was fixed in Spark 2.2. Here is the Jira link.