Member since: 03-21-2017
Posts: 18
Kudos Received: 2
Solutions: 0
06-25-2020
02:48 AM
I will check our Spark 2.4.5 application code's compatibility with Spark 2.3.2. Are Ambari and HDP going to be discontinued in the near future as part of the Cloudera and Hortonworks merger? We need to plan our choice of software accordingly.
06-25-2020
01:57 AM
Thanks for the reply. Can we install Hadoop and Spark 2.4.5 packages on a multi-node cluster without using HDP, Ambari, and Cloudera? We already have Spark applications running on Spark 2.4.5 and do not want to go back to older versions; we are even planning to upgrade them to Spark 3 soon because of better Delta Lake compatibility. If we install the Hadoop and Spark packages manually on each node of the cluster, can there be any maintenance issues at a later stage in production?
06-24-2020
02:31 AM
Hi, I need to set up a 5-node cluster with Hadoop 3.1.0 and Spark 2.4.5. Someone recommended using Ambari to do so. I checked Ambari, but it seems it can only be used to install HDP, and the latest HDP does not support Spark 2.4.5. Please advise on the best way to set up the required big data cluster.
Labels:
- Apache Ambari
- Apache Hadoop
- Apache Spark
08-13-2017
02:39 PM
Use mapPartitions if we want to add the header to every output file, or if there is a single partition:

topPriceResultsDF
  .map(x => x.mkString(","))
  .mapPartitions(iter => Iterator(header) ++ iter)
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")

Use mapPartitionsWithIndex if we want to add the header only to the first output file:

topPriceResultsDF
  .map(x => x.mkString(","))
  .repartition(2)
  .mapPartitionsWithIndex({
    case (0, iter) => Iterator(header) ++ iter
    case (_, iter) => iter
  })
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")
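For completeness, here is a minimal sketch of how the header variable used above could be built from the DataFrame's own schema, and how to end up with a single output file. This is my own illustration, assuming the Spark 1.6 APIs from the post above; coalesce(1) is my addition and only makes sense for small result sets.

// Derive the header from the DataFrame columns instead of hard-coding it
val header = topPriceResultsDF.columns.mkString(",")

topPriceResultsDF
  .map(x => x.mkString(","))                       // Row -> CSV line (returns an RDD[String] in Spark 1.6)
  .coalesce(1)                                     // one partition -> one part file -> one header (small data only)
  .mapPartitions(iter => Iterator(header) ++ iter) // prepend the header to the single partition
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")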
08-11-2017
11:32 AM
Hi All,
How can we add a header to Spark SQL query results before saving the results to a text file? The Spark version is 1.6.

val topPriceResultsDF = sqlContext.sql("SELECT * FROM retail_db.yahoo_stock_orc WHERE open_price > 40 AND high_price > 40 ORDER BY date ASC")
topPriceResultsDF.map(x => x.mkString(",")).saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40_results(comma).csv")

This saves only the data, but I need to add a header like (date,open_price,high_price,low_price,close_price,volume,adj_price) as well. Please help if anyone has an idea !! I cannot use the Databricks library. The output should look like:

date,open_price,high_price,low_price,close_price,volume,adj_price
1997-07-09,40.75008,45.12504,40.75008,43.99992,37545600,1.83333

Thanks !!
Labels:
- Apache Spark
04-01-2017
01:40 AM
I got the same issue in the Hortonworks sandbox environment. The script was correct but was throwing the error "Unable to open iterator for alias". I found that the JobHistory server was not running by default. I could not work out the connection between the two, but after starting the history server my Pig script worked in both Tez and MapReduce mode. Try it and see if it works for you as well.

[mapred@sandbox ~]$ cd /usr/hdp/current/hadoop-mapreduce-historyserver/sbin
[mapred@sandbox sbin]$ ls
mr-jobhistory-daemon.sh
[mapred@sandbox sbin]$ mr-jobhistory-daemon.sh start historyserver
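As a quick sanity check that the daemon actually came up (my own addition; 19888 is the default JobHistory web UI port, and your hostname may differ from the sandbox one):

[mapred@sandbox sbin]$ jps | grep JobHistoryServer                                        # the process should now be listed
[mapred@sandbox sbin]$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:19888/   # expect 200 from the JobHistory web UI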
03-30-2017
05:17 AM
2 Kudos
Hi All, I have downloaded the Million Song Subset data from http://static.echonest.com/millionsongsubset_full.tar.gz and tried to load it and print a sample:

songs = LOAD '/user/root/datasets/millionsongsubset_full.tar.gz';
songs_limit = LIMIT songs 10;
DUMP songs_limit;

The records are displayed as below. Please suggest how to load the downloaded data in the right format.

grunt> DUMP songs_limit;
2017-03-30 05:03:24,383 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2017-03-30 05:03:24,458 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,459 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-03-30 05:03:24,474 [main] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2017-03-30 05:03:24,479 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,507 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-03-30 05:03:24,524 [main] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
2017-03-30 05:03:24,609 [main] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://sandbox.technocrafty:8020/tmp/temp-607255022/tmp-1815565156/_temporary/0/task__0001_m_000001
2017-03-30 05:03:24,646 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-03-30 05:03:24,655 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-03-30 05:03:24,656 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(MillionSongSubset/0000755000175000017500000000000011516357374014450 5ustar thierrythierryMillionSongSubset/AdditionalFiles/0000755000175000017500000000000011516366075017501 5ustar thierrythierryMillionSongSubset/AdditionalFiles/subset_unique_tracks.txt0000644000175000017500000317201311516365717024515 0ustar thierrythierryTRAAAAW128F429D538<SEP>SOMZWCG12A8C13C480<SEP>Casual<SEP>I Didn't Mean To)
(TRAAABD128F429CF47<SEP>SOCIWDW12A8C13D406<SEP>The Box Tops<SEP>Soul Deep)
(TRAAADZ128F9348C2E<SEP>SOXVLOJ12AB0189215<SEP>Sonora Santanera<SEP>Amor De Cabaret)
(TRAAAEF128F4273421<SEP>SONHOTT12A8C13493C<SEP>Adam Ant<SEP>Something Girls)
(TRAAAFD128F92F423A<SEP>SOFSOCN12A8C143F5D<SEP>Gob<SEP>Face the Ashes)
(TRAAAMO128F1481E7F<SEP>SOYMRWW12A6D4FAB14<SEP>Jeff And Sheri Easter<SEP>The Moon And I (Ordinary Day Album Version))
(TRAAAMQ128F1460CD3<SEP>SOMJBYD12A6D4F8557<SEP>Rated R<SEP>Keepin It Real (Skit))
(TRAAAPK128E0786D96<SEP>SOHKNRJ12A6701D1F8<SEP>Tweeterfriendly Music<SEP>Drop of Rain)
(TRAAARJ128F9320760<SEP>SOIAZJW12AB01853F1<SEP>Planet P Project<SEP>Pink World)
(TRAAAVG12903CFA543<SEP>SOUDSGM12AC9618304<SEP>Clp<SEP>Insatiable (Instrumental Version))

Thanks !!
Labels:
- Apache Pig
03-21-2017
08:51 AM
Thanks Jay !! I got it. I did not need to make the change in Ambari; I could do it simply through the CLI, and it worked. The change made in Ambari was not reflected in "/etc/hive/conf/hive-site.xml"; I don't know why.

hive> set hive.exec.post.hooks;
hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.ATSHook, org.apache.atlas.hive.hook.HiveHook
hive> set hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.ATSHook;
hive> select current_database();
OK
default
Time taken: 3.074 seconds, Fetched: 1 row(s)
03-21-2017
08:22 AM
hive> select current_database();
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.registerDatabase(HiveMetaStoreBridge.java:109)
at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.registerTable(HiveMetaStoreBridge.java:270)
at org.apache.atlas.hive.hook.HiveHook.registerProcess(HiveHook.java:309)
at org.apache.atlas.hive.hook.HiveHook.fireAndForget(HiveHook.java:202)
at org.apache.atlas.hive.hook.HiveHook.run(HiveHook.java:160)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1522)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
03-21-2017
08:18 AM
In Advanced Settings > General, for the property "hive.exec.post.hooks", I removed the "org.apache.atlas.hive.hook.HiveHook" entry. Still the same error !