
insert data into Hive from IBM DataStage

New Contributor

Hello everyone; first of all, I apologise for my English.
I'm facing a big problem between IBM DataStage and Hortonworks. Let me first explain IBM DataStage: it's an ETL tool that offers several connector types for importing/exporting data from a number of data source types.
I'm trying to load data from IBM DataStage 11.7 into Hive using the Hive connector, but I'm encountering some strange behavior:

There are a couple of configuration properties for the Hive connector, the most important of which are, as I suspected:
. Record count=2000
. Batch size=2000

For a dataset with 8 columns and almost 1,000 rows, the data is inserted into Hive without problems.

For a dataset with 200 columns and 20 million rows, it behaves strangely:

For 10 columns, it works.

For more than 10 columns, the job fails at a multiple of the batch size property - I mean at 2000, 4000, or 20000 rows - with 'IIS-CONN-DAAPI-00099 Hive_Connector_7,0: java.lang.StringIndexOutOfBoundsException: String index out of bounds: 0 at java.lang.String.substring(String.java:2667)'.

I'm sure this error isn't really about a string value, because with Batch size=2000 the job loads almost 2,000 rows into the Hive table, and if I increase the value to 4000 it loads almost 4,000 records.
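
For reference, this exception class typically appears when code slices an empty string. Here is a minimal Java sketch of that failure mode; the firstChar helper is purely hypothetical, standing in for whatever internal field handling the connector does, and the exact index reported in the message can vary by JVM:

    // Minimal sketch: how a StringIndexOutOfBoundsException like the one above
    // can arise when field-handling code assumes a value is non-empty.
    // firstChar is a hypothetical stand-in, not actual connector code.
    public class EmptyFieldDemo {

        static String firstChar(String fieldValue) {
            // charAt(0) on an empty string throws:
            //   java.lang.StringIndexOutOfBoundsException: String index out of bounds: 0
            return String.valueOf(fieldValue.charAt(0));
        }

        public static void main(String[] args) {
            System.out.println(firstChar("abc")); // prints "a"
            System.out.println(firstChar(""));    // throws StringIndexOutOfBoundsException
        }
    }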

Does anyone know the reason for this error?

Thanks a lot

1 REPLY

Master Collaborator

@vhp1360 

Given the behavior you've observed with different batch sizes and column counts, it's possible that there is a memory or resource constraint causing the error when dealing with a large number of columns and rows. Here are some potential causes and troubleshooting steps to consider:

  1. Memory Constraints: Loading a dataset with 200 columns and 20 million rows can require a significant amount of memory, especially if each column contains large amounts of data. Ensure that the system running IBM DataStage has sufficient memory allocated to handle the processing requirements.

  2. Configuration Limits: Check if there are any configuration limits or restrictions in the IBM DataStage or Hive connector settings that might be causing the issue. For example, there could be a maximum allowed stack size or buffer size that is being exceeded when processing large datasets.

  3. Resource Utilization: Monitor the resource utilization (CPU, memory, disk I/O) on the system running IBM DataStage during the data loading process. High resource utilization or contention could indicate a bottleneck that is causing the error.

  4. Optimization Techniques: Consider optimizing the data loading process by adjusting parameters such as batch size, record count, or buffer size. Experiment with different configurations to find the optimal settings that can handle the larger dataset without errors; the standalone sketch after this list shows one way to test batch sizes outside of DataStage.

  5. Data Format Issues: Verify that the data format and schema of the dataset are consistent and compatible with the Hive table schema. Data inconsistencies or mismatches could potentially cause errors during the loading process.
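
To isolate whether the batch size itself is the trigger, you could reproduce the load outside of DataStage with a small standalone program. Below is a minimal sketch using the Hive JDBC driver (hive-jdbc must be on the classpath); the HiveServer2 URL, credentials, and table name are placeholders for your environment, and it assumes Hive 0.14+ so that multi-row INSERT ... VALUES is available:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Minimal sketch: load rows into Hive in configurable batches over JDBC,
    // to test whether a given batch size fails outside of DataStage.
    public class HiveBatchLoadTest {
        public static void main(String[] args) throws Exception {
            int batchSize = 2000;   // experiment with 2000, 4000, 20000, ...
            int totalRows = 20000;
            String url = "jdbc:hive2://your-hiveserver2-host:10000/default"; // placeholder

            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement()) {
                for (int offset = 0; offset < totalRows; offset += batchSize) {
                    // Build one multi-row INSERT per batch (test_table is a placeholder).
                    StringBuilder sql = new StringBuilder(
                            "INSERT INTO TABLE test_table VALUES ");
                    for (int i = 0; i < batchSize; i++) {
                        if (i > 0) sql.append(", ");
                        int id = offset + i;
                        sql.append("(").append(id).append(", 'row_").append(id).append("')");
                    }
                    stmt.execute(sql.toString());
                    System.out.println("Loaded rows up to " + (offset + batchSize));
                }
            }
        }
    }

If this standalone load succeeds at the same batch sizes that fail in DataStage, the problem is more likely in the connector's configuration or its handling of your column metadata than in Hive itself.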

 

Regards,

Chethan YM