Member since: 02-04-2023
Posts: 5
Kudos Received: 0
Solutions: 0
06-15-2023
04:26 AM
Thank you for your advice. We will investigate the proposed solution with spark-xml.
Best regards
06-13-2023
06:36 AM
Hi,
Thank you for your response. The documentation and samples (XSD, XML sample, documentation in XLS) are available for download: https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip
There are 4 kinds of files, and the most problematic for us is "099": DRAFT1auth.099.001.04_1.3.0.xsd / DRAFT1auth.099.001.04_non-ABCP Underlying Exposure Report.xml
Unpacked, the biggest XML file is up to 4.7 GB, but the main problem is not the size but the complex, nested structure. Have you ever dealt with such complex and huge XML files?
05-26-2023
02:47 AM
Hi,
We have huge and complex XML files. For example: 15-20 levels in the XML tree structure, approximately 180 basic types and 200 complex types, and one-to-many relations between nodes in the tree. As the output, we want to have tables in Hive or Impala and to query these tables with SQL. Could you please advise how to do that in the most effective way? By effective, we mean reducing manual coding work.
Best regards
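To illustrate the shape of the problem, here is a minimal sketch in plain Python (standard library only, not spark-xml) of how a one-to-many relation in a nested XML tree could be split into a parent table and a child table linked by a key. The tag names `Exposure`, `Collateral`, and `Id`, the sample document, and the `flatten` helper are all hypothetical and are not taken from the actual XSD; real files would also need namespace handling, which this sketch ignores.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: each Exposure has one Id and many Collateral children.
SAMPLE = b"""<Report>
  <Exposure>
    <Id>E1</Id>
    <Collateral><Type>House</Type><Value>100</Value></Collateral>
    <Collateral><Type>Land</Type><Value>50</Value></Collateral>
  </Exposure>
  <Exposure>
    <Id>E2</Id>
    <Collateral><Type>Flat</Type><Value>70</Value></Collateral>
  </Exposure>
</Report>"""


def flatten(source, parent_tag="Exposure", child_tag="Collateral", key_tag="Id"):
    """Stream the XML and return (parent_rows, child_rows) as lists of dicts."""
    parents, children = [], []
    # iterparse reads incrementally; "end" fires once an element is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != parent_tag:
            continue
        key = elem.findtext(key_tag)
        parents.append({"id": key})
        for child in elem.findall(child_tag):
            # Each repeated child becomes its own row, linked by the parent key.
            row = {"parent_id": key}
            row.update({leaf.tag: leaf.text for leaf in child})
            children.append(row)
        # Drop the processed subtree to keep memory use low on large files.
        elem.clear()
    return parents, children


parents, children = flatten(io.BytesIO(SAMPLE))
print(parents)
print(children)
```

Each list of rows corresponds to one target table; with 15-20 levels the same idea would have to be repeated per level, which is the manual coding effort we would like to avoid.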
04-26-2023
12:45 PM
Hello,
I would like to refresh this topic. Do you know whether it is possible to build an efficient document repository on HDFS? I am concerned about whether storing many small files in HDFS and retrieving them will be an effective solution.
Best regards
Tomek
02-04-2023
02:39 AM
Hello,
We are going to build a document repository (Word, PDF, Excel, pptx, ...). Is it a good idea to use HDFS + Solr for such a repository? The key requirements are:
1. Store documents along with some metadata about them
2. Full-text search of documents
3. Search for documents based on their metadata
4. Retrieve documents from the repository
5. In the future, we are going to do Natural Language Processing on Word/PDF documents.
Or would it be better to use other technologies from the Hadoop ecosystem, such as Ozone, or a database such as HBase? Let's assume that we use CDP Private Cloud.
Best regards
Tomek
Labels:
- Apache HBase
- HDFS