Member since 07-05-2016 | 17 Posts | 1 Kudos Received | 0 Solutions
08-30-2017
04:04 PM
Also, from your log and post, hadoopctrl is the NameNode, ResourceManager, and Oozie server. Is it also a DataNode and NodeManager? It may be hitting a memory bottleneck: Oozie is trying to use memory, but YARN cannot allocate memory or write the data. Potentially try moving your Oozie server to another node, or reduce or redistribute the memory allocation; Oozie usually doesn't need too much. This would probably explain the heartbeat issue.
08-29-2017
09:07 PM
By doing hadoop fs -chmod -R 777 on your Hive table directory, we can probably rule out permission issues. This is a great puzzle. It should have shown up in the logs, but is there anything strange about your data: nulls, NAs, empty values, strange date formats, decimals, special characters? Did anything in @Artem Ervits' post help?
08-28-2017
08:15 PM
From the log, it seems your Sqoop job gets stuck in a 'Heart beat, Heart beat...' loop. This is a common symptom when something has gone wrong; try searching for 'oozie sqoop import heart beat'. But I believe it is potentially a permissions issue, since it gets through about 95%. I suspect that when you run the Sqoop job manually you run it as the 'hdfs' user. Can you confirm this? USER="hdfs" and realUser=oozie are mentioned in the logs, and I suspect the 'oozie' user does not have permission to overwrite the table. Check the ownership and permissions of the table directory (for example with hadoop fs -ls on its warehouse path), change them for diagnosis if needed, and try again.
03-23-2017
08:54 AM
So annoyingly, the nvarchar/numeric issue was resolved and now I receive a generic error message:
31728 [main] ERROR org.apache.sqoop.mapreduce.ExportJobBase - Export job failed!
31728 [main] ERROR org.apache.sqoop.tool.ExportTool - Error during export: Export job failed!
03-22-2017
11:30 AM
Thanks, the log gave me:
... 2017-03-13 14:07:47,804 ERROR [Thread-12] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread: com.microsoft.sqlserver.jdbc.SQLServerException: Error converting data type nvarchar to numeric. ...
I will investigate which column is causing this issue and try to resolve it.
03-21-2017
11:24 AM
In the release notes (https://sqoop.apache.org/docs/1.4.6/sqoop-1.4.6.releasenotes.html) it states that this is supported, under new features:
SQOOP-1403 Upsert export for SQL Server
03-20-2017
02:45 PM
Thanks Mark. I have looked into your suggestions, which have led me to LZO compression: http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ I think this may be something I try next. Do you have any suggestions on this? Doesn't HDP already come with LZO? The link is a good few years old, so should I try something else before I spend a few hours on this? My company is not keen on me spending a few hours writing a Java SequenceFile jar.
03-13-2017
02:27 PM
I have a Hive table that I can successfully export to an mssql table:
sqoop export --connect jdbc:sqlserver://{some.ip.address};database={somedatabase} \
  --username 'someuser' \
  --password-file '/some/password/file' \
  --table 'Sometable' \
  --columns ID,value1,value2 \
  --export-dir /apps/hive/warehouse/some.db/Sometable \
  --input-fields-terminated-by "||" \
  -m 2 \
  /user/oozie/share/lib/sqoop/sqljdbc4.jar
However, I wish to update on a key, so I run:
sqoop export --connect jdbc:sqlserver://{some.ip.address};database={somedatabase} \
  --username 'someuser' \
  --password-file '/some/password/file' \
  --table 'Sometable' \
  --columns ID,value1,value2 \
  --export-dir /apps/hive/warehouse/some.db/Sometable \
  --input-fields-terminated-by "||" \
  --update-key ID \
  --update-mode allowinsert \
  -m 2 \
  /user/oozie/share/lib/sqoop/sqljdbc4.jar
The logs are not very helpful (note: the Sqoop job is run through an Oozie job): ...
5972 [main] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1485423751090_3566
6016 [main] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://had003.headquarters.7layer.net:8088/proxy/application_1485423751090_3566/
6017 [main] INFO org.apache.hadoop.mapreduce.Job - Running job: job_1485423751090_3566
20284 [main] INFO org.apache.hadoop.mapreduce.Job - Job job_1485423751090_3566 running in uber mode : false
20287 [main] INFO org.apache.hadoop.mapreduce.Job - map 0% reduce 0%
27001 [main] INFO org.apache.hadoop.mapreduce.Job - map 50% reduce 0%
Heart beat
37117 [main] INFO org.apache.hadoop.mapreduce.Job - map 100% reduce 0%
38139 [main] INFO org.apache.hadoop.mapreduce.Job - Job job_1485423751090_3566 failed with state FAILED due to: Task failed task_1485423751090_3566_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
38292 [main] INFO org.apache.hadoop.mapreduce.Job - Counters: 32
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=338177
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=166
HDFS: Number of bytes written=0
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=1
Launched map tasks=2
Other local map tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=16369
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=16369
Total vcore-milliseconds taken by all map tasks=16369
Total megabyte-milliseconds taken by all map tasks=25142784
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=156
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=79
CPU time spent (ms)=960
Physical memory (bytes) snapshot=230920192
Virtual memory (bytes) snapshot=3235606528
Total committed heap usage (bytes)=162529280
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
38319 [main] INFO org.apache.sqoop.mapreduce.ExportJobBase - Transferred 166 bytes in 34.0574 seconds (4.8741 bytes/sec)
38332 [main] INFO org.apache.sqoop.mapreduce.ExportJobBase - Exported 0 records.
38332 [main] ERROR org.apache.sqoop.mapreduce.ExportJobBase - Export job failed!
38333 [main] ERROR org.apache.sqoop.tool.ExportTool - Error during export: Export job failed!
<<< Invocation of Sqoop command completed <<<
Hadoop Job IDs executed by Sqoop: job_1485423751090_3566
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://{Something}
38406 [main] INFO org.apache.hadoop.io.compress.zlib.ZlibFactory - Successfully loaded & initialized native-zlib library
38407 [main] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new compressor [.deflate]
Oozie Launcher ends
38538 [main] INFO org.apache.hadoop.mapred.Task - Task:attempt_1485423751090_3565_m_000000_0 is done. And is in the process of committing
38601 [main] INFO org.apache.hadoop.mapred.Task - Task attempt_1485423751090_3565_m_000000_0 is allowed to commit now
38641 [main] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_1485423751090_3565_m_000000_0' to hdfs://{Something}
38692 [main] INFO org.apache.hadoop.mapred.Task - Task 'attempt_1485423751090_3565_m_000000_0' done.
Does anyone have an idea why I cannot update with inserts to mssql?
Labels: Apache Sqoop
01-25-2017
11:12 AM
So I have changed the way I tar.gz the files. At first I tried to create files of about 128 MB (about 4 files), then 64 MB (about 8-10 files), and then 1 MB (100+). Obviously, this alters the number of tasks that run. The tasks run faster the smaller the files, except one: one task always takes ~50 minutes. Why does this happen, and how do I speed up this task? A small check I plan to run is sketched below.
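For reference, a minimal diagnostic I plan to run from spark-shell (the directory path below is a placeholder for our real location): list the archive sizes to see whether one .tar.gz part is much larger than the rest, since a gzipped archive is not splittable and is parsed by a single task, so one oversized part would explain a straggler.
import org.apache.hadoop.fs.{FileSystem, Path}
// Placeholder path: the directory holding the split .tar.gz parts; sc is the SparkContext provided by spark-shell.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path("hdfs://MyCluster/RawXMLData/RecievedToday/File/"))
  .filter(_.getPath.getName.endsWith(".tar.gz"))
  .sortBy(-_.getLen)                                  // largest first
  .foreach(s => println(f"${s.getLen}%12d bytes  ${s.getPath.getName}"))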
01-18-2017
10:19 AM
Please see my post below
01-18-2017
10:04 AM
Thanks for the quick reply.
I am using a mixed environment (for dev): one node with 64 GB, three with 32 GB, and three with 16 GB of memory; CPU(s): 4, Thread(s) per core: 1, Core(s) per socket: 4, Socket(s): 1; all running CentOS 7. The reason we tar.gz the files is that we receive many small XML files, around 25,000. Loading these files into Hadoop as-is takes over 4 hours; tar.gz reduces the load time to around 10 minutes, as well as reducing the size from 14 GB to 0.4 GB. I have tried removing the tar.gz, and the time becomes 1h45; this is likely the result of the many small files. To add, the Pig parser may be faster because the XML structure is hardcoded there, but we want to avoid that: we have seen machines change the way the XML is produced, so the Spark parsing is more robust. Ideally, we would like to keep the more robust Spark parser but have the load into Hadoop take around 10 minutes and the processing around 10 minutes.
Any ideas?
One idea is to tar.gz into multiple files, i.e. the 25,000 files into 10 archives: the load time should stay at ~10 minutes and the processing time should land somewhere between 10 and 50 minutes (a related variant is sketched below). Does anyone have a better idea, reasons why this may not be a good idea, or issues I may come across?
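For context, here is a minimal sketch of a related variant I am considering, assuming the raw XML files are somewhere Spark can read them (the paths below are placeholders, not our real ones): pack the small files into a handful of splittable SequenceFiles straight from spark-shell, instead of writing a separate Java SequenceFile jar.
// Placeholder paths: raw small XML files in, packed SequenceFiles out.
val rawXml = "hdfs://MyCluster/RawXMLData/RecievedToday/File/*.xml"
val packed = "hdfs://MyCluster/RawXMLData/RecievedToday/Packed"
// wholeTextFiles keeps (file path, file content) pairs, so no data is lost by packing;
// sc is the SparkContext provided by spark-shell.
sc.wholeTextFiles(rawXml)
  .coalesce(10)                // ~10 output files instead of ~25,000
  .saveAsSequenceFile(packed)
Parsing would then need to read the SequenceFiles back (e.g. with sc.sequenceFile) rather than pointing spark-xml at a .tar.gz, so this is only a rough direction, not something I have tested.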
01-16-2017
10:03 AM
Hello all, I need to import and parse XML files in Hadoop. I have an old Pig 'REGEX_EXTRACT' script parser that works fine but takes some time to run, around 10-15 minutes. In the last 6 months I have started to use Spark, with great success in improving run times, so I am trying to move the old Pig script into Spark using the Databricks XML parser, mentioned in the following posts:
http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html
http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html
The version used is:
http://github.com/databricks/spark-xml/tree/branch-0.3
The script I try to run is similar to:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
import org.apache.hadoop.fs._
import com.databricks.spark
import com.databricks.spark.xml
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
// Hive context (sc is the SparkContext provided by spark-shell)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// drop the existing table
val dfremove = hiveContext.sql("DROP TABLE FileExtract")
// Create schema
val xmlSchema = StructType(Array(
StructField("Text1", StringType, nullable = false),
StructField("Text2", StringType, nullable = false),
StructField("Text3", StringType, nullable = false),
StructField("Text4", StringType ,nullable = false),
StructField("Text5", StringType, nullable = false),
StructField("Num1", IntegerType, nullable = false),
StructField("Num2", IntegerType, nullable = false),
StructField("Num3", IntegerType, nullable = false),
StructField("Num4", IntegerType, nullable = false),
StructField("Num5", IntegerType, nullable = false),
StructField("Num6", IntegerType, nullable = false),
StructField("AnotherText1", StringType, nullable = false),
StructField("Num7", IntegerType, nullable = false),
StructField("Num8", IntegerType, nullable = false),
StructField("Num9", IntegerType, nullable = false),
StructField("AnotherText2", StringType, nullable = false)
))
// Read file
val df = hiveContext.read.format("com.databricks.spark.xml").option("rootTag", "File").option("rowTag", "row").schema(xmlSchema).load("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")
// select
val selectedData = df.select("Text1",
"Text2",
"Text3",
"Text4",
"Text5",
"Num1",
"Num2",
"Num3",
"Num4",
"Num5",
"Num6",
"AnotherText1",
"Num7",
"Num8",
"Num9",
"AnotherText2"
)
selectedData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("FileExtract")
The XML file looks similar to:
<?xml version="1.0"?>
<File>
<row>
<Text1>something here</Text1>
<Text2>something here</Text2>
<Text3>something here</Text3>
<Text4>something here</Text4>
<Text5>something here</Text5>
<Num1>2</Num1>
<Num2>1</Num2>
<Num3>1</Num3>
<Num4>0</Num4>
<Num5>1</Num5>
<Num6>0</Num6>
<AnotherText1>something here</AnotherText1>
<Num7>2</Num7>
<Num8>0</Num8>
<Num9>0</Num9>
<AnotherText2>something here</AnotherText2>
</row>
<row>
<Text1>something here</Text1>
<Text2>something else here</Text2>
<Text3>something new here</Text3>
<Text4>something here</Text4>
<Text5>something here</Text5>
<Num1>2</Num1>
<Num2>1</Num2>
<Num3>1</Num3>
<Num4>0</Num4>
<Num5>1</Num5>
<Num6>0</Num6>
<AnotherText1>something here</AnotherText1>
<Num7>2</Num7>
<Num8>0</Num8>
<Num9>0</Num9>
<AnotherText2>something here</AnotherText2>
</row>
...
...
</File>
Many XML files are zipped together, hence the tar.gz file. This runs, but for a 400 MB file it takes 50 minutes to finish. Does anyone have an idea why it is so slow, or how I might speed it up? (A small diagnostic I could add is sketched below.)
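For what it's worth, a small diagnostic I could add right after the load above (it only uses names already defined in the script) to see how many tasks the read produces; a gzipped archive is not splittable, so if it lands in a single partition, one task does all of the parsing:
// df is the DataFrame returned by the spark-xml load above.
// A partition count of 1 would mean a single task parses the whole 400 MB archive.
println(s"Input partitions: ${df.rdd.partitions.length}")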
I am running on a 7-machine cluster with about 120 GB of YARN memory, on Hortonworks HDP 2.5.3.0 with Spark 1.6.2. Many thanks in advance!
Labels: Apache Spark