Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3433 | 10-18-2017 10:19 PM |
| | 3823 | 10-18-2017 09:51 PM |
| | 13619 | 09-21-2017 01:35 PM |
| | 1446 | 08-04-2017 02:00 PM |
| | 1982 | 07-31-2017 03:02 PM |
08-21-2019
10:39 AM
@ankurkapoor_wor Hi, I am facing the same issue as @mqureshi. I am trying to fetch data from SQL Server in Avro format through NiFi and load it into Redshift through a COPY command, but the generated Avro file converts the date and timestamp datatypes to strings, because of which the COPY command loads all NULL values into the target table. So I tried to follow your approach: in my case I'm using the ExecuteSQLRecord processor to fetch the data from SQL Server, writing it out as JSON, and then trying to convert it to Avro using the ConvertJSONToAvro processor, but I am unable to get the Record Schema to parse. Could you please help me resolve this issue too? Thanks in advance! Anusha
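For reference, the Record Schema that ConvertJSONToAvro expects is just a plain Avro schema in JSON. Below is a minimal sketch of such a schema that keeps date/timestamp columns as Avro logical types, sanity-checked here with the fastavro Python library; the field names are made up and this is not the actual schema from the flow above.

    # Minimal sketch: validate an Avro schema that keeps date/timestamp as logical types.
    # Assumes the fastavro library is installed; field names are illustrative only.
    from fastavro import parse_schema

    schema = {
        "type": "record",
        "name": "example_row",
        "fields": [
            {"name": "id", "type": ["null", "int"], "default": None},
            {"name": "created_date", "type": {"type": "int", "logicalType": "date"}},
            {"name": "updated_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        ],
    }

    parsed = parse_schema(schema)  # raises SchemaParseException if the schema text is invalid
    print("schema parsed OK")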
10-23-2017
07:41 AM
1 Kudo
@mqureshi Thanks for your inputs 🙂 I would like to add that maybe, instead of a long value, the field is receiving a NULL value. Explaining my problem in detail below; here is an outline of the code I'm using:
insert overwrite table db.test_tbl
select
named_struct('end_customers_d_party_id',A.end_customers_d_party_id,'dv_cr_party_id',B.dv_cr_party_id,'original_sales_order',A.original_sales_order) as key,
case when A.manager_name is not NULL OR A.manager_name <> '' OR length(A.manager_name) > 0 then A.manager_name else '' end as manager_name,
case when G.cec_id is not NULL OR G.cec_id <> '' OR length(G.cec_id) > 0 then G.cec_id else '' end as cec_id,
case when G.primary_name is not NULL OR G.primary_name <> '' OR length(G.primary_name) > 0 then G.primary_name else '' end as primary_name,
case when E.cse_id is not NULL OR E.cse_id <> '' OR length(E.cse_id) > 0 then E.cse_id else '' end as cse_id,
case when C.companyname is not NULL OR C.companyname <> '' OR length(C.companyname) > 0 then C.companyname else '' end as companyname,
case when A.product_id is not NULL OR A.product_id <> '' OR length(A.product_id) > 0 then A.product_id else '' end as product_id
from db.amp_provision C
INNER JOIN db.table1 A
ON TRIM(C.guid) = TRIM(A.guid)
INNER JOIN db.table2 D
ON TRIM(C.guid) = TRIM(D.guid)
INNER JOIN db.table3 AUL
ON TRIM(C.guid) = TRIM(AUL.guid)
JOIN db.table4 B
ON TRIM (A.original_sales_order) = B.sales_order_num
AND B.offer_code= 'X'
INNER JOIN db.table5 E
ON TRIM (C.guid) = TRIM(E.offer_reference_id)
INNER JOIN db.table6 F
ON B.dv_cr_party_id = F.cr_party_id
AND E.cse_id = F.cs_ent_cust_id
AND E.offer_name = 'X'
The issue is that column cse_id came out as NULL for one of the persistent customers, because that customer was getting dropped by the last join condition (E.cse_id = F.cs_ent_cust_id) and was not present in table5 at all. (The same value is present in all the other tables, 1 through 4.) My question now is how I can overcome this. I want to persist some customers based on their cse_id, irrespective of its presence in table5, which has a high chance of dropping a few customers every time it is refreshed. Using a LEFT JOIN with table5 causes a VERTEX FAILURE in the Hive run, and the error is posted above. Kindly help with a sturdy solution to this. I'm happy to explain the above issue in more detail if required. Thanks all! 🙂 Swati
09-27-2017
05:40 AM
HDFS clusters do not benefit from using RAID for data storage: the redundancy that RAID provides is not required, since HDFS handles it by replicating data across different DataNodes. RAID striping, normally used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) layout used by HDFS, which round-robins writes across all disks. That is because in RAID the read/write operations are limited by the slowest disk in the array, whereas in JBOD the disk operations are independent, so the average speed of operations is greater than that of the slowest disk. If a disk fails in a JBOD setup, HDFS can continue to operate without it, but in RAID, if a disk fails, the whole array becomes unavailable. RAID is still recommended for the NameNode, to protect its metadata against corruption.
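To illustrate the JBOD layout: each disk is simply mounted separately and listed in dfs.datanode.data.dir in hdfs-site.xml, and HDFS spreads block writes across those directories. The mount paths below are examples only.

    <!-- hdfs-site.xml: one entry per locally mounted disk; paths are examples only -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data01/hadoop/hdfs/data,/data02/hadoop/hdfs/data,/data03/hadoop/hdfs/data</value>
    </property>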
09-26-2017
06:58 PM
Hi @sally sally, if you are extracting only one value into an attribute, then it is easy to use the ExtractText processor: add a new property to it with a regex like the one below.
<count>(.*)</count>
ExtractText processor configs: (screenshot omitted)
This regex captures only the value inside the <count></count> element of the message and adds a count attribute to the flowfile.
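Outside NiFi, you can quickly check what that capture group grabs, for example in Python; the sample content below is made up.

    import re

    # Made-up sample of the flowfile content; only the <count> value is of interest.
    flowfile_content = "<message><status>ok</status><count>42</count></message>"

    match = re.search(r"<count>(.*)</count>", flowfile_content)
    if match:
        print(match.group(1))  # -> 42, i.e. the value the post says ends up in the count attribute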
09-21-2017
01:35 PM
2 Kudos
@Riddhi Sam
First of all, Spark is not faster than Hadoop. Hadoop is a distributed file system (HDFS), while Spark is a compute engine running on top of Hadoop or your local file system. Spark, however, is faster than MapReduce, which was the first compute engine created when HDFS was created. So, when Hadoop was created, there were only two things: HDFS, where data is stored, and MapReduce, which was the only compute engine on HDFS. To understand how Spark is faster than MapReduce, you need to understand how both MapReduce and Spark work.

When an MR job starts, the first step is to read data from disk and run the mappers. The output of the mappers is stored back on disk. Then the shuffle and sort step starts, reads the mapper output from disk and, after shuffle and sort completes, stores the result back on disk (there is actually some network traffic as well when the keys for the reduce step are gathered on the same node, but that is true for Spark too, so let's focus on the disk steps only). Then finally the reduce step starts, reads the output of the shuffle and sort step, and stores the result back in HDFS. That's six disk accesses to complete the job. Most Hadoop clusters have 7200 RPM disks, which are ridiculously slow.

Now, here is how Spark works. Just as a MapReduce job needs mappers and reducers, Spark has two types of operations: one is a transformation and the other is an action. When you write a Spark job, it consists of a number of transformations and a few actions. When a Spark job starts, it creates a DAG (directed acyclic graph) of the job, i.e. the steps it is supposed to run as part of the job. Then, when the job runs, it looks at the DAG; assume the first five steps are transformations. It remembers the steps (the DAG) but doesn't actually go to disk to perform the transformations. Then it encounters an action. At that point the Spark job goes to disk, performs the first transformation, keeps the result in memory, performs the second transformation, keeps the result in memory, and so on until all the steps complete. The only time it goes back to disk is to write the output of the job. So, two accesses to disk. This makes Spark faster.

There are other things in Spark that make it faster than MapReduce. For example, a rich set of APIs that lets you accomplish in one Spark job what might require two or more MapReduce jobs running one after the other. Imagine how slow that would be. There are cases where Spark will spill to disk because of the amount of data, and it will be slow, but it may or may not be as slow as MapReduce, because of that richer API.
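To make the transformation/action split concrete, here is a minimal PySpark sketch; the input path is only a placeholder. The first four steps merely build the DAG, and nothing is read from disk until the action at the end.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-eval-sketch")

    # Transformations: these only build up the DAG; no data is read or shuffled yet.
    lines  = sc.textFile("hdfs:///tmp/some_input.txt")      # placeholder path
    words  = lines.flatMap(lambda line: line.split())
    pairs  = words.map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: only now does Spark read the input and execute the DAG,
    # keeping intermediate results in memory rather than writing them to disk between steps.
    print(counts.take(10))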
09-19-2017
02:09 PM
@sally sally By setting your minimums (Min Num Entries and Min Group Size) to some large value, FlowFiles that are added to a bin will not qualify for merging right away. You should then set "Max Bin Age" to the amount of time you are willing to allow a bin to hang around before it is merged, regardless of the number of entries in that bin or that bin's size. As far as the number of bins goes, a new bin will be created for each unique filename found in the incoming queue. Should the MergeContent processor encounter more unique filenames than there are bins, it will force merging of the oldest bin to free a bin for the new filename. So it is important to have enough bins to accommodate the number of unique filenames you expect to pass through this processor during the configured "Max Bin Age" duration; otherwise, you could still end up with one FlowFile per merge. Thanks, Matt
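For illustration only, such a configuration might look like the following; the values are placeholders, and the one-bin-per-filename behaviour described above assumes "filename" is used as the correlation attribute:

    Correlation Attribute Name : filename      (one bin per unique filename - assumption)
    Minimum Number of Entries  : 100000        (set high so bins never qualify for merging early)
    Minimum Group Size         : 1 GB          (set high for the same reason)
    Max Bin Age                : 5 min         (forces each bin to merge after 5 minutes)
    Maximum number of Bins     : 20            (>= unique filenames expected within that 5 minutes)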
09-19-2017
04:27 AM
@Vijay Parmar Are you using Hive on Spark? These libraries are under Hive, and if you are not using Hive on Spark then your other applications should not be affected. Regardless, I am not asking you to delete them. Just move them aside to resolve this issue, and then you can restore them in the unlikely event of anything else being impacted.
08-18-2017
02:26 PM
2 Kudos
I did create a truststore for the queue manager view. But I believe that, although the truststore is located on the Ambari server, importing the Ambari HTTPS cert into that store means it is actually used by Ambari views to connect to the Ambari HTTPS server. It is not really for other clients like SAM.
12-27-2017
02:48 PM
I deleted all the snapshots and data after getting a go-ahead from the developers...
08-08-2017
11:55 AM
@mqureshi, thanks for the answer. It makes sense. Regards, Fahim