When loading a CSV file into an HBase table, some bad lines are dropped. How can I identify which lines were dropped?
A similar case is described here: https://community.hortonworks.com/questions/73985/no-data-shown-in-hbase-after-importtsv.html
Running ImportTsv generates large log files, as mentioned here: https://community.hortonworks.com/articles/4942/import-csv-data-into-hbase-using-importtsv.html
Maybe the log files can help me, but I do not know where they are. I expect they are in Hadoop storage rather than local Linux storage, so I looked into the HDFS path /app-logs/hdfs/logs/application_1557438882545_0077. The folder name "application_1557438882545_0077" is the YARN application ID of the MapReduce job. There is one file inside this folder, named name1.abc.local_45454_1562965335601, and it is not in a human-readable format.
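For what it's worth, that file is not meant to be read directly: aggregated YARN logs are written in a binary (TFile) container format, and the `yarn logs` CLI decodes them. A sketch, assuming your application ID above and that you run it as a user with read access to /app-logs (the grep pattern is only a guess at the error wording):

```shell
# Decode the aggregated logs for the application into a plain-text file.
yarn logs -applicationId application_1557438882545_0077 > app_0077.log

# Then search the decoded output for likely parse-error messages.
grep -i -n "bad" app_0077.log | head
```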
If you inspect the Mapper log files, you should be able to find mention of an unparseable row when one is processed. You may have to increase the log level from INFO to DEBUG.
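One way to surface the bad lines more directly, if I am not mistaken, is through ImportTsv's own options: by default it skips unparseable lines (and, as far as I recall, reports them in a "Bad Lines" job counter at the end of the run), while `-Dimporttsv.skip.bad.lines=false` makes the job fail on the first invalid line instead. A hedged sketch, with placeholder table name, column spec and input path:

```shell
# Placeholders: my_table, the column spec, and the input path are examples only.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dmapreduce.map.log.level=DEBUG \
  -Dimporttsv.skip.bad.lines=false \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
  my_table /path/to/input.csv
```

The `mapreduce.map.log.level=DEBUG` property is how you raise the mapper log level mentioned above without touching cluster-wide config.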
Each Mapper is assigned an InputSplit which will be a contiguous group of lines from the input files that you specified (e.g. fileA lines 50 through 200). You can also use this information to work backwards.
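To work backwards from a mapper to its slice of the input, you can look for the split each map task logs when it starts. The exact message wording below is an assumption based on typical MapReduce mapper syslogs:

```shell
# Each map task's syslog typically records its split as something like:
#   Processing split: hdfs://.../input.csv:0+134217728
# i.e. file path, start byte offset + length. Grepping the decoded logs for
# these lines shows which region of the input each mapper covered.
yarn logs -applicationId application_1557438882545_0077 | grep "Processing split"
```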
Thanks for the reply. Where can I find those mapper log files? I tried the job tracker UI (port 8088) via the Ambari link "YARN - ResourceManager UI", but could not find any log records. I did locate the MapReduce job/application ID. On the "Logs" tab for this job, I picked the only attempt, "appattempt_1557438882545_0078_00001", and got the message "No container data available!"
By the way, on the Job Tracker UI page for this job, I saw a link to "Log", but that log page does not look like a mapper log: it contains sections for different log types, such as directory info, launch_container.sh, … , stderr, stdout, and syslog.
In addition, on the Ambari - YARN - Configs - Advanced page, "Enable Log Aggregation" is enabled.
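Since log aggregation is enabled, the container logs are collected into HDFS under the directory set by `yarn.nodemanager.remote-app-log-dir` (which matches the /app-logs path seen earlier) once the application finishes, and are retrieved with the `yarn logs` CLI rather than through the local filesystem. On newer Hadoop releases (roughly 2.9+/3.x, if I remember right) you can also filter to a single log type:

```shell
# Pull only the syslog sections (where mapper messages normally land),
# skipping stdout/stderr and the launch_container.sh dumps.
yarn logs -applicationId application_1557438882545_0077 -log_files syslog
```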