Member since: 08-11-2016
Posts: 9
Kudos Received: 1
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2491 | 08-19-2016 12:58 PM |
08-30-2016
06:55 AM
I started out using lxml, but now I am using the Selenium package for Python. That note might help me out in the future, thanks for that!
08-30-2016
06:52 AM
I decided to stick to regular Python; there wasn't really a need for Spark. As I had to get results quickly, I didn't even use Scrapy or Nutch, but I will certainly have a look at them. They look very interesting!
08-23-2016
07:01 AM
And would you recommend running Scrapy in a PySpark environment? I will have a look at the Pydoop API, thanks for the recommendation.
08-19-2016
12:59 PM
1 Kudo
For a use case, I am looking to scrape the prices and additional information of around 25,000 items from a specific website. The names of these items are on a separate list. The resulting prices and additional information then have to be added to the list of item names. How can this best be implemented in Hadoop? I thought about using Scrapy [1] on PySpark, then writing a script to join the prices and the item names. Is this possible? I suppose Hadoop is not really needed for such a small job, but I want to get to know the Hadoop ecosystem better (I'm a Hadoop beginner). Thanks! Nicolas [1] http://scrapy.org/
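For anyone reading along, the join step described above can be sketched in plain Python without Hadoop. This is only a minimal sketch under assumptions: `fetch_item_info` is a hypothetical placeholder that returns canned data instead of scraping a real site, and the item names and the `price` and `info` fields are made up for illustration.

```python
import csv
import io


def fetch_item_info(name):
    """Hypothetical stand-in for the real scraping step.

    A real version would request the item's page and parse out the price;
    this one returns canned data so the sketch runs offline.
    """
    canned = {
        "widget": {"price": "9.99", "info": "in stock"},
        "gadget": {"price": "24.50", "info": "2-3 days"},
    }
    return canned.get(name)


def join_items(names, fetch=fetch_item_info):
    """Return one row per item name, with scraped fields filled in when found."""
    rows = []
    for name in names:
        info = fetch(name) or {}  # items that could not be scraped stay blank
        rows.append({
            "name": name,
            "price": info.get("price", ""),
            "info": info.get("info", ""),
        })
    return rows


def to_csv(rows):
    """Serialize the joined rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price", "info"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(join_items(["widget", "gadget", "unknown-item"])))
```

Keeping the item list as the driver of the loop means every name ends up in the output, even when nothing could be scraped for it, which makes gaps easy to spot afterwards.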
Labels: Apache Spark
08-19-2016
12:58 PM
Problem solved by reinstalling using CentOS 6.
08-12-2016
06:49 AM
1) First one confirmed (works on both hosts):
hduser@ubuntuVM:~$ ssh ubuntuvm2.com
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-31-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information disabled due to load higher than 1.0
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Fri Aug 12 08:28:46 2016 from ubuntuvm.com
hduser@ubuntuvm2:~$
2) After a while, the connection is automatically interrupted ("closed by foreign host"):
hduser@ubuntuVM2:~$ telnet ubuntuvm.com 8440
Trying 192.168.10.131...
Connected to ubuntuVM.com.
Escape character is '^]'.
Connection closed by foreign host.
hduser@ubuntuVM2:~$
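The same reachability test can be scripted, which is handy when several hosts and ports need checking. The following is a rough sketch using only the Python standard library; `port_is_open` is a name chosen here for illustration. Note that it only checks whether a TCP connection can be opened. It says nothing about the connection being closed later, as happened in the telnet session above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and connects; the context
        # manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False
```

For the case in this post, the call would be `port_is_open("ubuntuvm.com", 8440)`, run from the second VM.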
08-11-2016
05:29 PM
Note that I excluded some lines beginning with "at ..." (these were just lists of services). So this is probably a problem caused by a wrong hostname?
INFO:root:BootStrapping hosts ['ubuntuvm1.com', 'ubuntuvm2.com'] using /usr/lib/python2.6/site-packages/ambari_server cluster primary OS: ubuntu14 with user 'hduser' sshKey File /var/run/ambari-server/bootstrap/6/sshKey password File null using tmp dir /var/run/ambari-server/bootstrap/6 ambari: ubuntuvm.com; server_port: 8080; ambari version: 2.2.2.0; user__run_as: root
INFO:root:Executing parallel bootstrap
Bootstrap process timed out. It was destroyed.
11 Aug 2016 18:25:16,717 INFO [pool-15-thread-1] BSHostStatusCollector:55 - Request directory /var/run/ambari-server/bootstrap/6
11 Aug 2016 18:25:16,717 INFO [pool-15-thread-1] BSHostStatusCollector:62 - HostList for polling on [ubuntuvm1.com, ubuntuvm2.com]
11 Aug 2016 18:25:17,509 ERROR [qtp-ambari-client-181] AbstractResourceProvider:280 - Caught AmbariException when creating a resource
org.apache.ambari.server.HostNotFoundException: Host not found, hostname=
at ..
11 Aug 2016 18:25:17,511 ERROR [qtp-ambari-client-181] BaseManagementHandler:57 - Caught a system exception while attempting to create a resource: An internal system exception occurred: Host not found, hostname =
org.apache.ambari.server.controller.spi.SystemException: An internal system exception occurred: Host not found, hostname=
at org.apache.ambari.server.controller.internal.AbstractResourceProvider.createResources(AbstractResourceProvider.java:282)
...
Caused by: org.apache.ambari.server.HostNotFoundException: Host not found, hostname=
at ..
11 Aug 2016 18:25:22,292 INFO [qtp-ambari-agent-314] HeartBeatHandler:309 - HeartBeatHandler.sendCommands: sending ExecutionCommand for host ubuntuvm.com, role check_host, roleCommand ACTIONEXECUTE, and command ID 88-0, task ID 646
Thanks for your quick reply!
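One way to test the wrong-hostname suspicion is to compare the host names passed to the bootstrap with the entries in the hosts file. The sketch below is only an illustration under assumptions: `parse_hosts` and `missing_hosts` are hypothetical helpers, the sample hosts content is invented (only 192.168.10.131 for ubuntuvm.com appears in the telnet output from the 08-12-2016 post), and a real check would read /etc/hosts on each machine instead.

```python
def parse_hosts(text):
    """Map every host name and alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip
    return mapping


def missing_hosts(wanted, hosts_text):
    """Return the names in `wanted` that have no entry in the hosts text."""
    known = parse_hosts(hosts_text)
    return [h for h in wanted if h.lower() not in known]


# Invented sample content; the second VM's address is made up for illustration.
SAMPLE_HOSTS = """
127.0.0.1       localhost
192.168.10.131  ubuntuvm.com   ubuntuvm
192.168.10.132  ubuntuvm2.com  ubuntuvm2
"""

if __name__ == "__main__":
    # Host list copied from the "BootStrapping hosts" log line above.
    print(missing_hosts(["ubuntuvm1.com", "ubuntuvm2.com"], SAMPLE_HOSTS))
```

With this sample content the script prints `['ubuntuvm1.com']`, since that name has no entry. Whether the same mismatch exists on the actual machines can only be confirmed against their real hosts files.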
08-11-2016
05:29 PM
I am only testing out Hadoop, using 2 separate laptops running virtual machines. Code:
INFO:root:BootStrapping hosts ['ubuntuvm1.com', 'ubuntuvm2.com'] using /usr/lib/python2.6/site-packages/ambari_server cluster primary OS: ubuntu14 with user 'hduser' sshKey File /var/run/ambari-server/bootstrap/6/sshKey password File null using tmp dir /var/run/ambari-server/bootstrap/6 ambari: ubuntuvm.com; server_port: 8080; ambari version: 2.2.2.0; user__run_as: root
INFO:root:Executing parallel bootstrap
Bootstrap process timed out. It was destroyed.
11 Aug 2016 18:25:16,717 INFO [pool-15-thread-1] BSHostStatusCollector:55 - Request directory /var/run/ambari-server/bootstrap/6
11 Aug 2016 18:25:16,717 INFO [pool-15-thread-1] BSHostStatusCollector:62 - HostList for polling on [ubuntuvm1.com, ubuntuvm2.com]
11 Aug 2016 18:25:17,509 ERROR [qtp-ambari-client-181] AbstractResourceProvider:280 - Caught AmbariException when creating a resource
org.apache.ambari.server.HostNotFoundException: Host not found, hostname=
at ..
11 Aug 2016 18:25:17,511 ERROR [qtp-ambari-client-181] BaseManagementHandler:57 - Caught a system exception while attempting to create a resource: An internal system exception occurred: Host not found, hostname =
org.apache.ambari.server.controller.spi.SystemException: An internal system exception occurred: Host not found, hostname=
at org.apache.ambari.server.controller.internal.AbstractResourceProvider.createResources(AbstractResourceProvider.java:282)
...
Caused by: org.apache.ambari.server.HostNotFoundException: Host not found, hostname=
at ..
11 Aug 2016 18:25:22,292 INFO [qtp-ambari-agent-314] HeartBeatHandler:309 - HeartBeatHandler.sendCommands: sending ExecutionCommand for host ubuntuvm.com, role check_host, roleCommand ACTIONEXECUTE, and command ID 88-0, task ID 646
Thanks for your quick reply!
08-11-2016
04:02 PM
I have one running host, but when trying to add additional nodes, the status keeps saying "failed". To make things worse, there are no log files in Ambari. This is all it says:
==========================
Creating target directory...
==========================
Command start time 2016-08-11 16:42:54
Can someone tell me where I can find what I did wrong? Or, more importantly, how can I fix this? Thanks in advance.
Specs:
Ubuntu 14.04.5 LTS
Ambari 2.2.2.0
Java 1.8.0_101
Hostnames are valid FQDNs, all servers can connect by ssh without a password, the firewall is disabled (I ran sudo ufw disable), IPv6 has been disabled (in /etc/sysctl.conf), THP has been disabled, NTP is enabled, and the Java path (asked for when setting up the Ambari server) leads to a custom Java version 1.8. Please note that I'm a complete Linux and Hadoop noob.
Labels: Apache Ambari