Member since: 08-03-2019
Posts: 186
Kudos Received: 34
Solutions: 26
My Accepted Solutions
Title | Views | Posted |
---|---|---|
  | 1978 | 04-25-2018 08:37 PM |
  | 5907 | 04-01-2018 09:37 PM |
  | 1615 | 03-29-2018 05:15 PM |
  | 6793 | 03-27-2018 07:22 PM |
  | 2032 | 03-27-2018 06:14 PM |
03-27-2018
09:27 PM
@Vincent van Oudenhoven Does that help?
03-27-2018
09:18 PM
@TAMILMARAN c When you say read and write in parallel, do you mean reading data that is still in the process of being written to HDFS?
03-27-2018
09:16 PM
2 Kudos
@Christian Lunesa Refer to this link for the issue that you are facing. If this is the exact issue, you may need to update the RM config. Let me know if that helps!
03-27-2018
07:22 PM
1 Kudo
@Sami Ahmad You are creating the table wrong! There are two kinds of files when we talk about Avro:

- Avro data files, which hold the data
- .avsc files, which hold the Avro schema

When you do a sqoop import, the .avsc files end up somewhere on your local machine, probably in the outdir if you specified one. If you cannot spot the .avsc files, follow these steps to extract the schema from the Avro data and then create the table using it.

```bash
# Take a few lines from your avro file
hdfs dfs -cat <your avro file name> | head --bytes=10K > SAMPLE_FILE

# Extract the avro schema from your avro data file
java -jar $AVRO_TOOLS_PATH/avro-tools-1.7.7.jar getschema SAMPLE_FILE > AVRO_SCHEMA_FILE

# Upload the schema to HDFS
hdfs dfs -put AVRO_SCHEMA_FILE $AVRO_SCHEMA_DIR
```

```sql
-- Create the Hive table using the Avro schema
CREATE EXTERNAL TABLE sample_table
STORED AS AVRO
LOCATION 'hdfs:///user/hive/'
TBLPROPERTIES ('avro.schema.url'='<your avro schema path here>');
```

Refer to the AvroSerDe documentation for more details. PS - If you already have the Avro schema files, you can skip all the schema extraction steps and simply use the last step to create your table.
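For context, a minimal sketch of what such an import could look like (the connection string, credentials, table name, and paths below are placeholders, not taken from this thread); the generated .avsc should land in the directory given by --outdir:

```bash
# Hypothetical import; all names and paths are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table my_table \
  --as-avrodatafile \
  --target-dir /user/hive/my_table \
  --outdir /tmp/sqoop-gen   # the generated .avsc should show up here
```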
03-27-2018
07:03 PM
@Sami Ahmad The syntax is `sqoop job (generic-args) (job-args)`, i.e. the generic arguments come right after the tool name. So try changing your sqoop job to something like:

```bash
sqoop job -Dmapreduce.job.user.classpath.first=true --create incjob4 -- import <Everything else>
```

Let me know if that works!
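For illustration only, a full job definition with the generic argument in the right place could look like the sketch below; the connection details, table, and incremental settings are made-up placeholders, not the ones from your job:

```bash
# Hypothetical saved job; note the -D generic argument sits between "job"
# and "--create", before the tool arguments.
sqoop job -Dmapreduce.job.user.classpath.first=true \
  --create incjob4 \
  -- import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table my_table \
  --incremental append \
  --check-column id \
  --last-value 0

# Execute the saved job
sqoop job --exec incjob4
```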
03-27-2018
06:14 PM
@Sami Ahmad Add the following property to your sqoop job:

`-Dmapreduce.job.user.classpath.first=true`

Try this and let me know if that works for you.
03-27-2018
03:03 PM
2 Kudos
@Chen Yimu The issue is with the data type on the source side. Let's talk about the biggest "integral" data types in Avro and Hive.

As per the Hive documentation, BIGINT is an 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. As per the Avro documentation, long is a 64-bit signed integer. Both are signed, so the maximum value is 9,223,372,036,854,775,807 (19 digits). You are essentially trying to port a number from the source database that is beyond the range of Hive's/Avro's data types (the column in MySQL looks like BIGINT(20)).

Why are you not able to access the column value using an INT/BIGINT data type? The data is stored on HDFS in Avro format with your column specified as "STRING". Your table's data type does not match the data type of the data, hence the parse exception. A switch from INT to STRING works for the very same reason.

Solution: either tone down the size of the column on the source side (even BIGINT(19) is not recommended, since the values may go beyond the range of Hive BIGINT / Avro long), or keep the column as STRING in the destination (Hive) table.

PS - Avro DOES NOT have date/timestamp data types, so such columns are also converted to strings when imported. Hope that helps!
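As a minimal sketch of the second option (the table and column names here are hypothetical, not from the original question), the destination table would simply declare the oversized column as STRING:

```bash
# Hypothetical DDL; the id column exceeds the signed 64-bit range, so it is
# declared as STRING on the Hive side.
hive -e "
CREATE EXTERNAL TABLE big_ids_avro (
  id     STRING,
  amount BIGINT
)
STORED AS AVRO
LOCATION 'hdfs:///user/hive/big_ids_avro';
"
```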
03-27-2018
06:00 AM
1 Kudo
@Vincent van Oudenhoven Here is a very elementary flow to depict it using the ExecuteStreamCommand processor: GenerateFlowFile -> ExecuteStreamCommand (screenshots of the flow, the processor configuration, the trivial sample.py, and the resulting flow file content are omitted here). In the GenerateFlowFile processor, I am generating a flow file with the sample text "foobar"; in the ExecuteStreamCommand processor, I refer to my Python code, sample.py (a very silly script), and its output ends up as the content of the flow file. However, if you want to access the content of the existing flow file, I guess the only way you can do it is by converting the content to an attribute, and this can have consequences, since attributes are kept in memory and a very large attribute value, or a lot of attributes, can adversely affect performance. Let me know if that helps!
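Since the original screenshots are not reproduced here, the sketch below shows roughly what such a trivial sample.py could look like; it relies on ExecuteStreamCommand piping the incoming flow file content to the command's standard input and, unless an output attribute is configured, using its standard output as the new content. The script body itself is purely illustrative.

```bash
# Purely illustrative stand-in for the sample.py from the original post.
# The flow file content ("foobar") arrives on stdin; stdout becomes the
# new flow file content.
cat > sample.py <<'EOF'
import sys

text = sys.stdin.read()         # incoming flow file content
sys.stdout.write(text.upper())  # written back as the new content
EOF
```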
03-27-2018
12:32 AM
@Christian Lunesa If the answer helped solve your query, please mark it as accepted 🙂
03-26-2018
11:40 PM
@Vinitkumar Pandey

--driver-class-path is used to add "extra" jars to the classpath of the "driver" of the Spark job.

--driver-library-path is used to "change" the default library path (for example, for native libraries) used by the Spark driver.

--driver-class-path will only push the jars to the driver machine; if you want to ship the jars to the "executors" as well, you need to use --jars.

Hope that helps!
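A rough example of where these flags sit on a spark-submit command line (the class name, jar names, and paths are placeholders):

```bash
# Illustrative spark-submit: --driver-class-path only affects the driver's
# classpath, while --jars ships the dependency to the executors as well.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --driver-class-path /opt/libs/custom-jdbc.jar \
  --jars /opt/libs/custom-jdbc.jar \
  myapp.jar
```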