Member since 04-13-2016 | 80 Posts | 12 Kudos Received | 1 Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
3158 | 03-17-2017 06:10 PM |
06-10-2018
04:32 PM
@sunile.manjee Are you referring to mapPartitions() (http://apachesparkbook.blogspot.com/2015/11/mappartition-example.html)? Could you provide an example of how I might apply this? Thanks, Mike
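For reference, a minimal sketch of the per-partition pattern that mapPartitions() makes possible: each partition's rows are grouped into batches and handed to a writer once per batch, rather than once per row. The names `batched`, `make_partition_writer` and `send_batch` are hypothetical, and the database call is replaced by a plain Python callback so the sketch runs without a cluster or a Postgres instance.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable, size: int) -> Iterator[List]:
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_partition_writer(send_batch: Callable[[List], None], batch_size: int = 1000):
    """Build a function that could be passed to rdd.mapPartitions.

    `send_batch` is a hypothetical callback that would insert one batch
    into the target database (for example with one multi-row INSERT).
    """
    def write_partition(rows: Iterable) -> Iterator[int]:
        written = 0
        for chunk in batched(rows, batch_size):
            send_batch(chunk)
            written += len(chunk)
        # Emit one value per partition: the number of rows handled.
        yield written

    return write_partition


# Local stand-in for a single partition of 7 rows, batch size 3.
sent = []
writer = make_partition_writer(sent.append, batch_size=3)
counts = list(writer(range(7)))

# With PySpark this would be applied roughly as follows (an assumption,
# not executed here):
#   total = df.rdd.mapPartitions(writer).sum()
```

The point of the design is that any per-connection setup cost is paid once per partition instead of once per row; whether that helps here depends on where the 20 minutes are actually spent.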
06-08-2018
03:59 PM
I think I've found the issue: the Hive table I query to create the DataFrame has the same number of underlying HDFS blocks as the number of tasks created. Merging these blocks together has improved performance, although the job still takes 5 minutes to complete.
06-08-2018
09:25 AM
Hi all, I'm performing a write operation to a Postgres database in Spark. The DataFrame has 44k rows and is in 4 partitions, but the Spark job takes more than 20 minutes to complete. Looking at the logs (attached), I see that the map stage is the bottleneck, where more than 600 tasks are created. Does anyone have any insight into why this might be and how I could resolve the performance issue? I've also included a screenshot of my cluster metrics. Thanks, Mike
Labels: Apache Spark
06-28-2017
11:05 AM
On further investigation of the timeline server log file, I saw that a FileNotFoundException was raised periodically when the cleaner attempted to delete the earliest application log directory that still contained data in ats/done:
2017-06-28 11:25:07,910 INFO timeline.EntityGroupFSTimelineStore (EntityGroupFSTimelineStore.java:cleanLogs(462)) - Deleting hdfs://XXX:8020/ats/done/1494799829596/0000/000/application_1494799829596_0508
2017-06-28 11:25:07,924 ERROR timeline.EntityGroupFSTimelineStore (EntityGroupFSTimelineStore.java:run(899)) - Error cleaning files
java.io.FileNotFoundException: File hdfs:/XXX:8020/ats/done/1494799829596/0000/000/application_1494799829596_0508 does not exist.
It seems the cleaning process died because this file was missing from the directory. From that point on the logs have been building up, because the process has been unable to clear them, and this is causing storage problems. The question is: why does the process not continue to the next logs if it cannot find a specific file to delete? Or, in fact, why is it looking for a specific file at all, when it should just purge whatever is there once the retention period has expired?
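To make the expected behaviour concrete, here is a minimal sketch of a cleanup pass that records a missing entry and moves on to the next one instead of aborting. It is not how EntityGroupFSTimelineStore is implemented; it uses the local filesystem as a stand-in for HDFS, and `clean_expired` and its parameters are hypothetical names.

```python
import os
import shutil
import tempfile
import time
from typing import Dict, List, Optional


def clean_expired(done_dir: str, names: List[str], retain_seconds: float,
                  now: Optional[float] = None) -> Dict[str, List[str]]:
    """Delete the listed entries under done_dir that are older than retain_seconds.

    `names` may be a stale listing: an entry that no longer exists is
    recorded under "missing" and skipped, so the pass still reaches the
    remaining entries.
    """
    now = time.time() if now is None else now
    result: Dict[str, List[str]] = {"deleted": [], "missing": []}
    for name in names:
        path = os.path.join(done_dir, name)
        try:
            if now - os.path.getmtime(path) <= retain_seconds:
                continue  # still within the retention window
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            result["deleted"].append(name)
        except FileNotFoundError:
            result["missing"].append(name)
    return result


# Demo: two real directories plus one listed entry that does not exist.
demo_dir = tempfile.mkdtemp()
for app in ("application_a", "application_b"):
    os.mkdir(os.path.join(demo_dir, app))

listing = ["application_a", "application_gone", "application_b"]
# Pretend an hour has passed so both real entries exceed a 60 s retention.
result = clean_expired(demo_dir, listing, retain_seconds=60,
                       now=time.time() + 3600)
remaining = os.listdir(demo_dir)
shutil.rmtree(demo_dir)
```

In the demo the missing entry sits between two real ones, and both real entries are still removed.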
06-28-2017
09:26 AM
I've tried reducing yarn.timeline-service.entity-group-fs-store.retain-seconds (to 60 seconds), but it still doesn't seem to clear out the logs. Any ideas?
06-27-2017
08:53 PM
Hi, I've made these changes to the parameters and restarted YARN, but the large files still remain in /ats/done.
06-27-2017
06:52 PM
I want to do 'hdfs dfs -rm -R /ats/done' because there are some large files in there taking up a lot of space. Is this safe to do? I also want to clear out logs in /app-logs/ , can I also delete these manually? Thanks, Mike
Labels: Apache Hadoop, Apache YARN
03-17-2017
06:10 PM
I found the problem to be with the interpreter.json file that had somehow become corrupted/empty.
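A quick way to spot this condition before restarting is to check whether the file is empty or fails to parse as JSON, which is the same failure that json.loads reports in the traceback. This is only a sketch: `check_json_file` is a hypothetical helper, and the demo uses temporary files rather than a real interpreter.json.

```python
import json
import os
import tempfile


def check_json_file(path: str) -> str:
    """Return 'ok', 'empty', or 'invalid' for the file at `path`."""
    with open(path, "r", encoding="utf-8") as fh:
        content = fh.read()
    if not content.strip():
        return "empty"
    try:
        json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "invalid"
    return "ok"


# Demo with three temporary files standing in for the real config file.
samples = {"good": '{"interpreterSettings": {}}', "blank": "", "broken": "{not json"}
statuses = {}
tmp_dir = tempfile.mkdtemp()
for label, text in samples.items():
    sample_path = os.path.join(tmp_dir, label + ".json")
    with open(sample_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    statuses[label] = check_json_file(sample_path)
    os.remove(sample_path)
os.rmdir(tmp_dir)
```

To use it on a cluster, pass the path of interpreter.json as it exists on your install.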
03-15-2017
10:09 PM
Hi, I get the following error when restarting Zeppelin through Ambari on HDP 2.5. Nothing has changed since the last time it was running.
stderr:
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/ZEPPELIN/0.6.0.2.5/package/scripts/master.py", line 330, in <module>
    Master().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
    self.start(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/ZEPPELIN/0.6.0.2.5/package/scripts/master.py", line 184, in start
    self.update_kerberos_properties()
  File "/var/lib/ambari-agent/cache/common-services/ZEPPELIN/0.6.0.2.5/package/scripts/master.py", line 234, in update_kerberos_properties
    config_data = self.get_interpreter_settings()
  File "/var/lib/ambari-agent/cache/common-services/ZEPPELIN/0.6.0.2.5/package/scripts/master.py", line 209, in get_interpreter_settings
    config_data = json.loads(config_content)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
....
2017-03-15 22:01:28,144 - call returned (0, '')
2017-03-15 22:01:28,145 - DFS file /apps/zeppelin/zeppelin-spark-dependencies-0.6.0.2.5.3.0-37.jar is identical to /usr/hdp/current/zeppelin-server/interpreter/spark/dep/zeppelin-spark-dependencies-0.6.0.2.5.3.0-37.jar, skipping the copying
2017-03-15 22:01:28,145 - HdfsResource[None] {'security_enabled': False, 'hadoop_bin_dir': '/usr/hdp/current/hadoop-client/bin', 'keytab': [EMPTY], 'default_fs': 'hdfs://smartclean-master.lancs.ac.uk:8020', 'hdfs_resource_ignore_file': '/var/lib/ambari-agent/data/.hdfs_resource_ignore', 'hdfs_site': ..., 'kinit_path_local': '/usr/bin/kinit', 'principal_name': [EMPTY], 'user': 'hdfs', 'action': ['execute'], 'hadoop_conf_dir': '/usr/hdp/current/hadoop-client/conf'}
Command failed after 1 tries
Labels: Apache Ambari, Apache Zeppelin
01-17-2017
05:52 PM
Hi, each node in the cluster I'm setting up uses an alternative SSH port instead of the standard port 22. When performing the automated install with Ambari, is there an option to change the port Ambari uses for its SSH connections?
Labels: Apache Ambari