Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3433 | 10-18-2017 10:19 PM |
| | 3823 | 10-18-2017 09:51 PM |
| | 13619 | 09-21-2017 01:35 PM |
| | 1446 | 08-04-2017 02:00 PM |
| | 1982 | 07-31-2017 03:02 PM |
08-21-2019
10:39 AM
@ankurkapoor_wor Hi, I am facing the same issue as @mqureshi. I am trying to fetch data from SQL Server in Avro format through NiFi and load it into Redshift through a COPY command, but the generated Avro file converts the date and timestamp datatypes to strings, because of which the COPY command loads all NULL values into the target table. So I tried to follow your approach: in my case I'm using the ExecuteSQLRecord processor to fetch the data from SQL Server, writing it out as JSON, and then trying to convert it to Avro using the ConvertJSONToAvro processor, but I am unable to get the Record Schema to parse. Could you please help me resolve this issue too? Thanks in advance! Anusha
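For reference, the Record Schema that ConvertJSONToAvro expects is just a plain Avro schema in JSON. Below is a minimal sketch of such a schema that keeps date/timestamp columns as Avro logical types, sanity-checked here with the fastavro Python library; the field names are made up and this is not the actual schema from the flow above.

    # Minimal sketch: validate an Avro schema that keeps date/timestamp as logical types.
    # Assumes the fastavro library is installed; field names are illustrative only.
    from fastavro import parse_schema

    schema = {
        "type": "record",
        "name": "example_row",
        "fields": [
            {"name": "id", "type": ["null", "int"], "default": None},
            {"name": "created_date", "type": {"type": "int", "logicalType": "date"}},
            {"name": "updated_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        ],
    }

    parsed = parse_schema(schema)  # raises SchemaParseException if the schema text is invalid
    print("schema parsed OK")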
10-23-2017
07:41 AM
1 Kudo
@mqureshi Thanks for your inputs 🙂 I would like to add that maybe, instead of a long value, the field is receiving a NULL value. Explaining my problem in detail below; here is an outline of the code I'm using:
insert overwrite table db.test_tbl
select
named_struct('end_customers_d_party_id',A.end_customers_d_party_id,'dv_cr_party_id',B.dv_cr_party_id,'original_sales_order',A.original_sales_order) as key,
case when A.manager_name is not NULL OR A.manager_name <> '' OR length(A.manager_name) > 0 then A.manager_name else '' end as manager_name,
case when G.cec_id is not NULL OR G.cec_id <> '' OR length(G.cec_id) > 0 then G.cec_id else '' end as cec_id,
case when G.primary_name is not NULL OR G.primary_name <> '' OR length(G.primary_name) > 0 then G.primary_name else '' end as primary_name,
case when E.cse_id is not NULL OR E.cse_id <> '' OR length(E.cse_id) > 0 then E.cse_id else '' end as cse_id,
case when C.companyname is not NULL OR C.companyname <> '' OR length(C.companyname) > 0 then C.companyname else '' end as companyname,
case when A.product_id is not NULL OR A.product_id <> '' OR length(A.product_id) > 0 then A.product_id else '' end as product_id
from db.amp_provision C
INNER JOIN db.table1 A
ON TRIM(C.guid) = TRIM(A.guid)
INNER JOIN db.table2 D
ON TRIM(C.guid) = TRIM(D.guid)
INNER JOIN db.table3 AUL
ON TRIM(C.guid) = TRIM(AUL.guid)
JOIN db.table4 B
ON TRIM (A.original_sales_order) = B.sales_order_num
AND B.offer_code= 'X'
INNER JOIN db.table5 E
ON TRIM (C.guid) = TRIM(E.offer_reference_id)
INNER JOIN db.table6 F
ON B.dv_cr_party_id = F.cr_party_id
AND E.cse_id = F.cs_ent_cust_id
AND E.offer_name = 'X'
The issue is that column cse_id came out as NULL for one of the persistent customers, because that customer was getting dropped by the last join condition (E.cse_id = F.cs_ent_cust_id) and was not present in table5 at all. (The same value is present in all the other tables, 1 through 4.) My question now is how I can overcome this. I want to persist some customers based on their cse_id, irrespective of its presence in table5, which has a high chance of dropping a few customers every time it is refreshed. Using a LEFT JOIN with table5 causes a VERTEX FAILURE in the Hive run, and the error is posted above. Kindly help with a sturdy solution to this. I'm happy to explain the above issue in more detail if required. Thanks all! 🙂 Swati
09-27-2017
05:40 AM
HDFS clusters do not benefit from using RAID for data storage: the redundancy that RAID provides is not required, since HDFS handles it by replicating data across different DataNodes. RAID striping, normally used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) layout used by HDFS, which round-robins writes across all disks. That is because in RAID the read/write operations are limited by the slowest disk in the array, whereas in JBOD the disk operations are independent, so the average speed of operations is greater than that of the slowest disk. If a disk fails in a JBOD setup, HDFS can continue to operate without it, but in RAID, if a disk fails, the whole array becomes unavailable. RAID is still recommended for the NameNode, to protect its metadata against corruption.
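To illustrate the JBOD layout: each disk is simply mounted separately and listed in dfs.datanode.data.dir in hdfs-site.xml, and HDFS spreads block writes across those directories. The mount paths below are examples only.

    <!-- hdfs-site.xml: one entry per locally mounted disk; paths are examples only -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data01/hadoop/hdfs/data,/data02/hadoop/hdfs/data,/data03/hadoop/hdfs/data</value>
    </property>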
09-26-2017
06:58 PM
Hi @sally sally, if you are extracting only one value into an attribute, then it is easy to use the ExtractText processor: add a new property to it with a regex like the one below.
<count>(.*)</count>
ExtractText processor configs: (screenshot omitted)
This regex captures only the value inside the <count></count> element of the message and adds a count attribute to the flowfile.
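Outside NiFi, you can quickly check what that capture group grabs, for example in Python; the sample content below is made up.

    import re

    # Made-up sample of the flowfile content; only the <count> value is of interest.
    flowfile_content = "<message><status>ok</status><count>42</count></message>"

    match = re.search(r"<count>(.*)</count>", flowfile_content)
    if match:
        print(match.group(1))  # -> 42, i.e. the value the post says ends up in the count attribute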
09-21-2017
01:35 PM
2 Kudos
@Riddhi Sam
First of all, Spark is not faster than Hadoop. Hadoop is a distributed file system (HDFS), while Spark is a compute engine running on top of Hadoop or your local file system. Spark, however, is faster than MapReduce, which was the first compute engine created when HDFS was created. So, when Hadoop was created, there were only two things: HDFS, where data is stored, and MapReduce, which was the only compute engine on HDFS. To understand how Spark is faster than MapReduce, you need to understand how both MapReduce and Spark work.

When an MR job starts, the first step is to read data from disk and run the mappers. The output of the mappers is stored back on disk. Then the shuffle and sort step starts, reads the mapper output from disk and, after shuffle and sort completes, stores the result back on disk (there is actually some network traffic as well when the keys for the reduce step are gathered on the same node, but that is true for Spark too, so let's focus on the disk steps only). Then finally the reduce step starts, reads the output of the shuffle and sort step, and stores the result back in HDFS. That's six disk accesses to complete the job. Most Hadoop clusters have 7200 RPM disks, which are ridiculously slow.

Now, here is how Spark works. Just as a MapReduce job needs mappers and reducers, Spark has two types of operations: one is a transformation and the other is an action. When you write a Spark job, it consists of a number of transformations and a few actions. When a Spark job starts, it creates a DAG (directed acyclic graph) of the job, i.e. the steps it is supposed to run as part of the job. Then, when the job runs, it looks at the DAG; assume the first five steps are transformations. It remembers the steps (the DAG) but doesn't actually go to disk to perform the transformations. Then it encounters an action. At that point the Spark job goes to disk, performs the first transformation, keeps the result in memory, performs the second transformation, keeps the result in memory, and so on until all the steps complete. The only time it goes back to disk is to write the output of the job. So, two accesses to disk. This makes Spark faster.

There are other things in Spark that make it faster than MapReduce. For example, a rich set of APIs that lets you accomplish in one Spark job what might require two or more MapReduce jobs running one after the other. Imagine how slow that would be. There are cases where Spark will spill to disk because of the amount of data, and it will be slow, but it may or may not be as slow as MapReduce, because of that richer API.
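To make the transformation/action split concrete, here is a minimal PySpark sketch; the input path is only a placeholder. The first four steps merely build the DAG, and nothing is read from disk until the action at the end.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-eval-sketch")

    # Transformations: these only build up the DAG; no data is read or shuffled yet.
    lines  = sc.textFile("hdfs:///tmp/some_input.txt")      # placeholder path
    words  = lines.flatMap(lambda line: line.split())
    pairs  = words.map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: only now does Spark read the input and execute the DAG,
    # keeping intermediate results in memory rather than writing them to disk between steps.
    print(counts.take(10))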
09-19-2017
02:09 PM
@sally sally By setting your minimums (Min Num Entries and Min Group Size) to some large value, FlowFiles that are added to a bin will not qualify for merging right away. You should then set "Max Bin Age" to the amount of time you are willing to allow a bin to hang around before it is merged, regardless of the number of entries in that bin or that bin's size. As far as the number of bins goes, a new bin will be created for each unique filename found in the incoming queue. Should the MergeContent processor encounter more unique filenames than there are bins, it will force merging of the oldest bin to free a bin for the new filename. So it is important to have enough bins to accommodate the number of unique filenames you expect to pass through this processor during the configured "Max Bin Age" duration; otherwise, you could still end up with one FlowFile per merge. Thanks, Matt
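For illustration only, such a configuration might look like the following; the values are placeholders, and the one-bin-per-filename behaviour described above assumes "filename" is used as the correlation attribute:

    Correlation Attribute Name : filename      (one bin per unique filename - assumption)
    Minimum Number of Entries  : 100000        (set high so bins never qualify for merging early)
    Minimum Group Size         : 1 GB          (set high for the same reason)
    Max Bin Age                : 5 min         (forces each bin to merge after 5 minutes)
    Maximum number of Bins     : 20            (>= unique filenames expected within that 5 minutes)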
09-19-2017
04:27 AM
@Vijay Parmar Are you using Hive on Spark? These libraries are under Hive, and if you are not using Hive on Spark then your other applications should not be affected. Regardless, I am not asking you to delete them. Just move them aside to resolve this issue, and then you can restore them in the unlikely event of anything else being impacted.
08-18-2017
02:26 PM
2 Kudos
I did create a truststore for the queue manager view. But I believe that, although the truststore is located on the Ambari server, importing the Ambari HTTPS cert into that store means it is actually used by Ambari views to connect to the Ambari HTTPS server. It is not really for other clients like SAM.
12-27-2017
02:48 PM
I deleted all the snapshots and data after getting a go-ahead from the developers...
08-08-2017
11:55 AM
@mqureshi, thanks for the answer. It makes sense. Regards, Fahim