Member since: 08-03-2019
Posts: 186
Kudos Received: 34
Solutions: 26
My Accepted Solutions
Title | Views | Posted |
---|---|---|
  | 1978 | 04-25-2018 08:37 PM |
  | 5907 | 04-01-2018 09:37 PM |
  | 1615 | 03-29-2018 05:15 PM |
  | 6793 | 03-27-2018 07:22 PM |
  | 2032 | 03-27-2018 06:14 PM |
03-27-2018
09:27 PM
@Vincent van Oudenhoven Does that help?
03-27-2018
09:18 PM
@TAMILMARAN c When you say read and write in parallel, do you mean reading data that is still in the process of being written to HDFS?
03-27-2018
09:16 PM
2 Kudos
@Christian Lunesa Refer to this link for the issue that you are facing. If this is the exact issue, you may need to update the RM config. Let me know if that helps!
03-27-2018
07:22 PM
1 Kudo
@Sami Ahmad You are creating the table wrong! There are two kinds of files when we talk about Avro:

- Avro data files, which hold the data
- .avsc files, which hold the Avro schema

When you do a sqoop import, the .avsc files end up somewhere on your local machine, probably in the outdir if you specified one. If you cannot spot the .avsc files, follow these steps to extract the schema from the Avro data and then create the table using it.

```bash
# Take a few lines from your avro file
hdfs dfs -cat <your avro file name> | head --bytes=10K > SAMPLE_FILE

# Extract the avro schema from your avro data file
java -jar $AVRO_TOOLS_PATH/avro-tools-1.7.7.jar getschema SAMPLE_FILE > AVRO_SCHEMA_FILE

# Upload the schema to HDFS
hdfs dfs -put AVRO_SCHEMA_FILE $AVRO_SCHEMA_DIR
```

```sql
-- Create the Hive table using the Avro schema
CREATE EXTERNAL TABLE sample_table
STORED AS AVRO
LOCATION 'hdfs:///user/hive/'
TBLPROPERTIES ('avro.schema.url'='<your avro schema path here>');
```

Refer to the AvroSerDe documentation for more details. PS - If you already have the Avro schema files, you can skip all the schema extraction steps and simply use the last step to create your table.
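For context, a minimal sketch of what such an import could look like (the connection string, credentials, table name, and paths below are placeholders, not taken from this thread); the generated .avsc should land in the directory given by --outdir:

```bash
# Hypothetical import; all names and paths are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table my_table \
  --as-avrodatafile \
  --target-dir /user/hive/my_table \
  --outdir /tmp/sqoop-gen   # the generated .avsc should show up here
```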
03-27-2018
07:03 PM
@Sami Ahmad The syntax is `sqoop job (generic-args) (job-args)`, i.e. the generic arguments come right after the tool name. So try changing your sqoop job to something like:

```bash
sqoop job -Dmapreduce.job.user.classpath.first=true --create incjob4 -- import <Everything else>
```

Let me know if that works!
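For illustration only, a full job definition with the generic argument in the right place could look like the sketch below; the connection details, table, and incremental settings are made-up placeholders, not the ones from your job:

```bash
# Hypothetical saved job; note the -D generic argument sits between "job"
# and "--create", before the tool arguments.
sqoop job -Dmapreduce.job.user.classpath.first=true \
  --create incjob4 \
  -- import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table my_table \
  --incremental append \
  --check-column id \
  --last-value 0

# Execute the saved job
sqoop job --exec incjob4
```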
03-27-2018
06:14 PM
@Sami Ahmad Add the following property to your sqoop job:

`-Dmapreduce.job.user.classpath.first=true`

Try this and let me know if that works for you.
03-27-2018
03:03 PM
2 Kudos
@Chen Yimu The issue is with the data type on the source side. Let's talk about the biggest "integral" data types in Avro and Hive.

As per the Hive documentation, BIGINT is an 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. As per the Avro documentation, long is a 64-bit signed integer. Both are signed, so the maximum value is 9,223,372,036,854,775,807 (19 digits). You are essentially trying to port a number from the source database that is beyond the range of Hive's/Avro's data types (the column in MySQL looks like BIGINT(20)).

Why are you not able to access the column value using an INT/BIGINT data type? The data is stored on HDFS in Avro format with your column specified as "STRING". Your table's data type does not match the data type of the data, hence the parse exception. A switch from INT to STRING works for the very same reason.

Solution: either tone down the size of the column on the source side (even BIGINT(19) is not recommended, since the values may go beyond the range of Hive BIGINT / Avro long), or keep the column as STRING in the destination (Hive) table.

PS - Avro DOES NOT have date/timestamp data types, so such columns are also converted to strings when imported. Hope that helps!
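As a minimal sketch of the second option (the table and column names here are hypothetical, not from the original question), the destination table would simply declare the oversized column as STRING:

```bash
# Hypothetical DDL; the id column exceeds the signed 64-bit range, so it is
# declared as STRING on the Hive side.
hive -e "
CREATE EXTERNAL TABLE big_ids_avro (
  id     STRING,
  amount BIGINT
)
STORED AS AVRO
LOCATION 'hdfs:///user/hive/big_ids_avro';
"
```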
03-27-2018
06:00 AM
1 Kudo
@Vincent van Oudenhoven Here is a very elementary flow to depict it using the ExecuteStreamCommand processor: GenerateFlowFile -> ExecuteStreamCommand (screenshots of the flow, the processor configuration, the trivial sample.py, and the resulting flow file content are omitted here). In the GenerateFlowFile processor, I am generating a flow file with the sample text "foobar"; in the ExecuteStreamCommand processor, I refer to my Python code, sample.py (a very silly script), and its output ends up as the content of the flow file. However, if you want to access the content of the existing flow file, I guess the only way you can do it is by converting the content to an attribute, and this can have consequences, since attributes are kept in memory and a very large attribute value, or a lot of attributes, can adversely affect performance. Let me know if that helps!
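Since the original screenshots are not reproduced here, the sketch below shows roughly what such a trivial sample.py could look like; it relies on ExecuteStreamCommand piping the incoming flow file content to the command's standard input and, unless an output attribute is configured, using its standard output as the new content. The script body itself is purely illustrative.

```bash
# Purely illustrative stand-in for the sample.py from the original post.
# The flow file content ("foobar") arrives on stdin; stdout becomes the
# new flow file content.
cat > sample.py <<'EOF'
import sys

text = sys.stdin.read()         # incoming flow file content
sys.stdout.write(text.upper())  # written back as the new content
EOF
```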
03-27-2018
12:32 AM
@Christian Lunesa If the answer helped solve your query, please mark it as accepted 🙂
03-26-2018
11:40 PM
@Vinitkumar Pandey

--driver-class-path is used to add "extra" jars to the classpath of the "driver" of the Spark job.

--driver-library-path is used to "change" the default library path (for example, for native libraries) used by the Spark driver.

--driver-class-path will only push the jars to the driver machine; if you want to ship the jars to the "executors" as well, you need to use --jars.

Hope that helps!
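A rough example of where these flags sit on a spark-submit command line (the class name, jar names, and paths are placeholders):

```bash
# Illustrative spark-submit: --driver-class-path only affects the driver's
# classpath, while --jars ships the dependency to the executors as well.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --driver-class-path /opt/libs/custom-jdbc.jar \
  --jars /opt/libs/custom-jdbc.jar \
  myapp.jar
```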