
Unable to upload related entities separately

New Contributor

Hi Team,

I am working with Apache Atlas through PyApacheAtlas against Azure Purview. I have already created custom types (table and column) and the relationship definition between them. There are about 30k entities in my system to upload, and when I try to push all of them in one batch I receive a timeout.

I tried to apply the upload logic from the Atlas Jira https://issues.apache.org/jira/browse/ATLAS-4389:

first upload all parents (tables in my case), then the columns related to them. The tables batch uploaded successfully, but when the columns batch upload started I received an error:

"errorCode":"ATLAS-404-00-00A","errorMessage":"Referenced entity -1001 is not found"

-1001 is the guid of a table that is already uploaded. I noticed that when I upload a table and its columns in one batch, everything works fine (a stripped-down version of that working upload is at the end of this post).

It looks like Atlas checks whether a relationship target exists within the uploaded batch, not between the batch and already uploaded entities. Is there any way to upload related entities in separate batches, or must they be uploaded in one batch? Do you have another strategy to avoid timeouts during bulk uploads?
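
For reference, here is a stripped-down version of the single-batch upload that works for me. Type names, qualified names, and credentials are placeholders, and I am assuming relationshipAttributes can be set directly as a dict:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity

auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Negative guids are temporary placeholders that Atlas resolves
# within a single bulk request.
table = AtlasEntity(
    name="sales", typeName="custom_table",
    qualified_name="custom://db/sales", guid="-1001"
)
column = AtlasEntity(
    name="amount", typeName="custom_column",
    qualified_name="custom://db/sales#amount", guid="-1002"
)
column.relationshipAttributes = {"table": {"guid": "-1001"}}

# Both ends of the relationship are in the same batch, so -1001 resolves.
client.upload_entities(batch=[table, column])
```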

1 REPLY

Expert Contributor

Hello @nowy19 

Thanks for posting your query. Here is a detailed answer.

Your observation is correct: Atlas resolves negative guids such as -1001 only within the request that defines them; they are temporary placeholders, not stable identifiers. When the columns batch references -1001, Atlas looks for an existing entity with that guid and fails. There are a couple of approaches you could consider to avoid this and the timeouts during bulk uploads:

Uploading Related Entities in Separate Batches: It is possible to upload related entities in separate batches, but the dependencies between batches must be respected. Upload the parents first, then have the children reference them by the real guid returned in the upload response, or by uniqueAttributes (typeName plus qualifiedName), rather than by the temporary negative guid; see the sketch below.
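
Here is a minimal sketch of that ordering, reusing the placeholder type names from your post and assuming the bulk upload response includes the standard Atlas guidAssignments map (placeholder guid to real guid):

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity

auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Batch 1: parents (tables) only, using a temporary negative guid.
table = AtlasEntity(
    name="sales", typeName="custom_table",
    qualified_name="custom://db/sales", guid="-1001"
)
resp = client.upload_entities(batch=[table])

# The bulk response maps each placeholder guid to the real guid Atlas assigned.
real_table_guid = resp["guidAssignments"]["-1001"]

# Batch 2: children (columns). Reference the table by its real guid, or by
# uniqueAttributes; the placeholder -1001 only resolves inside batch 1.
column = AtlasEntity(
    name="amount", typeName="custom_column",
    qualified_name="custom://db/sales#amount", guid="-2001"
)
column.relationshipAttributes = {
    "table": {"guid": real_table_guid}
    # or, by unique attributes instead of guid:
    # "table": {"typeName": "custom_table",
    #           "uniqueAttributes": {"qualifiedName": "custom://db/sales"}}
}
client.upload_entities(batch=[column])
```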

Batch Size Management: If timeouts are an issue, reduce the batch size. Smaller batches put less load on the system per request and help avoid timeouts; split the 30k entities into smaller, more manageable chunks, as in the sketch below.
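
A minimal chunking sketch; the chunk size of 500 is an arbitrary starting point to tune against your timeout:

```python
from pyapacheatlas.core import PurviewClient

def upload_in_chunks(client: PurviewClient, entities: list, chunk_size: int = 500):
    """Upload entities in fixed-size chunks so each bulk request stays small."""
    results = []
    for i in range(0, len(entities), chunk_size):
        results.append(client.upload_entities(batch=entities[i:i + chunk_size]))
    return results
```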

Optimize Atlas Configuration: Adjusting the Atlas server configuration, such as increasing batch size limits or optimizing the backing store (e.g., indexing), may help it handle larger uploads more efficiently.

Asynchronous Upload Strategy: If possible, upload entities asynchronously so that no single long-running request hits the timeout. This lets the system handle multiple smaller requests in parallel without being overwhelmed; see the sketch below.
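
From the client side you can approximate this with a thread pool over independent chunks. This is only safe when entities in different chunks do not reference each other, and max_workers should stay low to respect API throttling:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_parallel(client, chunks, max_workers=4):
    """Upload independent chunks concurrently, one bulk request per chunk."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(client.upload_entities, batch=chunk)
                   for chunk in chunks]
        for future in as_completed(futures):
            results.append(future.result())  # re-raises any failed upload
    return results
```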

Increase Timeout Settings: If you keep hitting timeouts during bulk uploads, you can also adjust the timeout settings for the upload process, either on the Atlas server side or at the API/HTTP client level, if that is feasible in your environment.

If you want the result of a single large upload but need to avoid timeouts, breaking the process into smaller, logical steps while maintaining the required relationships (parents first, then the children that reference them) is usually the most effective approach.