Member since: 01-03-2018
Posts: 11
Kudos Received: 0
Solutions: 0
10-03-2018
11:49 AM
I am new to data science and big data frameworks. Let's say I have a dataset as CSV input. From what I found on Google and other resources about the daily work of a data analyst or data scientist: once the user gets the dataset, they first manipulate it with the Python pandas library, which includes data cleaning and other steps. Then the user visualizes the data using matplotlib and other techniques, and can write machine learning algorithms to get predictions for some criteria. All of the above workflow can be summarized as data analysis and prediction. On the other hand, I found Pydoop (a Python framework for Hadoop) for operations like storage, processing, etc. I am a bit confused: where exactly does Pydoop stand in the data analysis workflow mentioned above? Please guide me.
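For context, here is a rough sketch of the pandas-style workflow described above; the file name, column names, and model choice are hypothetical placeholders, not taken from any particular dataset:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("input.csv")            # load the CSV dataset
df = df.dropna()                         # basic data cleaning
df["target"].hist()                      # quick visualization of one column
plt.savefig("target_hist.png")

model = LinearRegression()               # simple prediction model
model.fit(df[["feature1", "feature2"]], df["target"])
print(model.predict(df[["feature1", "feature2"]].head()))

Pydoop, for comparison, provides Python access to HDFS storage and Hadoop MapReduce processing, so it would typically sit at the storage/processing layer underneath a workflow like this (e.g. reading the CSV from HDFS via pydoop.hdfs rather than the local file system before handing it to pandas), and becomes relevant once the data no longer fits on a single machine.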
04-18-2018
04:10 AM
Hi @Harald Berghoff, thanks for the information.
04-17-2018
09:51 AM
Hello all, my understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation, and I'm not able to find anything that clarifies that relationship. Secondly, Spark apparently has good connectivity to Cassandra and Hive, and both have a SQL-style interface; however, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming that this is a brand-new project with no existing installation? Help me out. Thanks
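To illustrate the relationship, here is a rough PySpark sketch that assumes an existing Hive metastore; the table and column names are made up. In this setup Spark SQL is the query engine, while Hive supplies the catalogue and storage for the tables, so the two complement each other rather than compete:

from pyspark.sql import SparkSession

# Reuse an existing Hive metastore so Spark SQL can see Hive tables.
spark = (SparkSession.builder
         .appName("spark-sql-on-hive")
         .enableHiveSupport()
         .getOrCreate())

# "sales" is a hypothetical table registered in the Hive metastore.
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()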
03-13-2018
05:29 AM
I am new to Azure development. I have to select a database in Azure to store big data, so I have to finalize the data storage now. I would really like to understand why Hadoop (non-Microsoft) is used inside Azure (I expect there are strong reasons):
1) Can the available Microsoft Azure storage (e.g. Blobs) not perform like Hadoop?
2) Is there something that cannot be achieved in Azure but can be achieved in Hadoop?
3) Performance?
Lots of questions like this come to mind. Please provide clear ideas on this. Regards,
02-23-2018
09:41 AM
Hello all, I'm trying to send Splunk index data to Hadoop using Hadoop Data Roll.
However, I'm not able to establish a connection between Splunk and Hadoop at all. I get the error below on my Splunk indexer:
bash-4.1$ /opt/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -ls hdfs://hadoopnamenode.company.com:8020/ /user/splunkdevuser/
17/12/01 08:38:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
Warning: fs.defaultFS is not set when running "ls" command.
ls: `/user/splunkd1/': No such file or directory
Can someone kindly help me with this? Below is my indexes.conf:
[hadoopidx]
coldPath = $SPLUNK_DB/hadoopidx/colddb
enableDataIntegrityControl = 0
enableTsidxReduction = 0
homePath = $SPLUNK_DB/hadoopidx/db
maxTotalDataSizeMB = 20480
thawedPath = $SPLUNK_DB/hadoopidx/thaweddb
[provider:eihadoop]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.dfs.namenode.kerberos.principal = hdfs/_HOST@HADOOP.company.COM
vix.env.HADOOP_HOME = /user/splunkdev
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/java/jdk1.8.0_102
vix.family = hadoop
vix.fs.default.name = hdfs://SLPP02.HADOOP.company.COM:8020
vix.hadoop.security.authentication = kerberos
vix.hadoop.security.authorization = 1
vix.javaprops.java.security.krb5.kdc = SLP013.HADOOP.company.COM
vix.javaprops.java.security.krb5.realm = HADOOP.company.COM
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /user/splunkdev/hadoopanalytics/
vix.yarn.nodemanager.principal = yarn/_HOST@HADOOP.company.COM
vix.yarn.resourcemanager.address = https://SLPP08.HADOOP.company.COM:8090/cluster
vix.yarn.resourcemanager.principal = yarn/_HOST@HADOOP.company.COM
vix.yarn.resourcemanager.scheduler.address = https://SLPP015.HADOOP.company.COM:8090/cluster/scheduler
vix.mapreduce.jobtracker.kerberos.principal = mapred/_HOST@HADOOP.company.COM
vix.kerberos.keytab = /export/home/splunkdev/splunkdev.keytab
vix.kerberos.principal = splunkdev@TSS.company.COM
[splunk_index_archive]
vix.output.buckets.from.indexes = hadoopidx
vix.output.buckets.older.than = 172800
vix.output.buckets.path = /user/splunkdev/splunk_index_archive
vix.provider = eihadoop
Please help me with this! Thanks
01-10-2018
01:16 PM
I wanted to switch from Hadoop 1.2.1 to Hadoop 2.2. In my project I'm using Maven, and it can handle
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
</dependency>
without any problems; however, changing the version to 2.2 is not working, as it is not available in the central Maven repository.
Any ideas how I can include Hadoop 2.2 in my maven-ized project? Regards, Sarahjohn
Tags: hadoop, Hadoop Core
12-07-2017
06:45 AM
After installing Hadoop, when I try to run start-dfs.sh it shows the following error message.
I have searched a lot and found that the WARN message is because I am using a 64-bit Ubuntu OS while Hadoop is compiled against 32-bit, so that is not an issue to worry about.
But the "Incorrect configuration" message is something I am worried about, and I am also not able to start the primary and secondary namenodes.
sameer@sameer-Compaq-610:~$ start-dfs.sh
15/07/27 07:47:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
Starting namenodes on []
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
Starting secondary namenodes [0.0.0.0]
0.0.0.0: ssh: connect to host 0.0.0.0 port 22: Connection refused
15/07/27 07:47:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
My current configuration:
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/sameer/mydata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/sameer/mydata/hdfs/datanode</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.default.name </name>
<value> hdfs://localhost:9000 </value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Please find what I am doing wrong in the configuration or somewhere else. Thanks, Nicolewells
11-24-2017
08:44 AM
I get the following error when I try to run Nutch 1.5 on Hadoop 1.0.3:
hadoop jar nutch-1.5.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5
Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
I see the bug report https://issues.apache.org/jira/browse/NUTCH-1084 on Nutch 1.3, but it seems it is not yet resolved. Any help is appreciated.
I followed these tutorials:
1. http://wiki.apache.org/nutch/NutchHadoopTutorial
2. http://wiki.apache.org/nutch/NutchTutorial
3. http://wiki.apache.org/hadoop/HowToConfigure
EDIT
I followed the tutorial http://www.rui-yang.com/develop/build-nutch-1-4-cluster-with-hadoop/ and it works for me. I don't know what exactly fixed the problem. I run Hadoop on a single node. I made these changes:
1. Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, master, slaves from hadoop/conf to nutch/conf and rebuild Nutch.
2. export CLASSPATH=:$NUTCH_HOME/runtime/local/lib
I created the following tutorial: http://dataspider.blogspot.com.es/2012/09/instalacion-de-hadoop.html
10-26-2017
06:48 AM
I'm very new to Python. I'm working in the area of hydrology and I want to learn Python to assist me with processing hydrological data. At the moment I am writing a script to extract bits of information from a large data set. I have three CSV files: Complete_borelist.csv, Borelist_not_interested.csv, and Elevation_info.csv. I want to create a file that has all the bores that are in Complete_borelist.csv but not in Borelist_not_interested.csv. I also want to grab some information from Complete_borelist.csv and Elevation_info.csv for those bores which satisfy the first criterion. My Python script is as follows:
not_interested_list = []
outfile1 = open('output.csv', 'w')
outfile1.write('Station_ID,Name,Easting,Northing,Location_name,Elevation')
outfile1.write('\n')
with open('Borelist_not_interested.csv', 'r') as f1:
    for line in f1:
        if not line.startswith('Station'):  # ignore header
            line = line.rstrip()
            words = line.split(',')
            station = words[0]
            not_interested_list.append(station)
with open('Complete_borelist.csv', 'r') as f2:
    next(f2)  # ignore header
    for line in f2:
        line = line.rstrip()
        words = line.split(',')
        station = words[0]
        if not station in not_interested_list:
            loc_name = words[1]
            easting = words[4]
            northing = words[5]
            outfile1.write(station + ',' + easting + ',' + northing + ',' + loc_name + ',')
            with open('Elevation_info.csv', 'r') as f3:
                next(f3)  # ignore header
                for line in f3:
                    line = line.rstrip()
                    data = line.split(',')
                    bore_id = data[0]
                    if bore_id == station:
                        elevation = data[4]
                        outfile1.write(elevation)
                        outfile1.write('\n')
outfile1.close()
I have two issues with the script. The first is that Elevation_info.csv doesn't have information for all the bores in Complete_borelist.csv. When my loop gets to a station for which it can't find an elevation record, the script doesn't write "null" but continues writing the information for the next station on the same line. Can anyone help me fix this, please? The second is that my complete bore list is over 200,000 rows and my script runs through them very slowly. Does anyone have any suggestions to make it run faster?
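One possible rewrite, as a sketch that assumes the same column layout as above: loading Elevation_info.csv into a dictionary means one pass over each file instead of re-reading the elevation file for every station, and the dictionary lookup falls back to "null" when no elevation record exists, so every row is still terminated correctly:

import csv

# Stations we are not interested in.
not_interested = set()
with open('Borelist_not_interested.csv') as f:
    next(f)                                   # skip header
    for row in csv.reader(f):
        not_interested.add(row[0])

# One pass over the elevation file: bore id -> elevation.
elevation_by_bore = {}
with open('Elevation_info.csv') as f:
    next(f)                                   # skip header
    for row in csv.reader(f):
        elevation_by_bore[row[0]] = row[4]

with open('Complete_borelist.csv') as f, open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Station_ID', 'Easting', 'Northing', 'Location_name', 'Elevation'])
    next(f)                                   # skip header
    for row in csv.reader(f):
        station = row[0]
        if station in not_interested:
            continue
        # .get() falls back to 'null' when there is no elevation record.
        writer.writerow([station, row[4], row[5], row[1],
                         elevation_by_bore.get(station, 'null')])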
10-03-2017
11:03 AM
I understand that Splunk Hadoop Connect is a free app and that the Hunk license depends on the number of TaskTrackers. We have Splunk Enterprise in our organisation, and the goal is to perform analytics on Hadoop data and send archived data from indexes to Hadoop. I can achieve this via both Splunk Hadoop Connect and Hunk, but my question is: what is the difference between the two with respect to licensing, other than the bidirectional data movement that Hadoop Connect provides? And if I get the Splunk Hadoop Connect app, what parameters will the licensing depend on?