Member since: 08-01-2017
Posts: 19
Kudos Received: 0
Solutions: 0
07-30-2018
10:28 AM
Can somebody please help? I am completely stuck.
07-27-2018
04:35 PM
I want a Spark API that can write data into an Excel file rather than a CSV file. What is the best way to write data directly into an Excel file?
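For reference, Spark has no built-in Excel writer, but a third-party connector can fill the gap. A minimal sketch, assuming the com.crealytics:spark-excel package is on the classpath (option names differ between spark-excel releases; older versions use useHeader instead of header), with illustrative data and an illustrative output path:

```scala
import org.apache.spark.sql.SparkSession

object ExcelWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("excel-write").getOrCreate()
    import spark.implicits._

    val df = Seq(("T-1", "2018-07-26"), ("T-2", "2018-07-27"))
      .toDF("ticket_number", "as_of_date") // illustrative data

    // Requires the third-party com.crealytics:spark-excel package on the
    // classpath; Spark itself ships no Excel data source.
    df.write
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .mode("overwrite")
      .save("/tmp/report.xlsx")
  }
}
```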
Labels:
- Apache Spark
07-27-2018
04:33 PM
We have created a Spark application for client reporting. The client wants the report in CSV format; we have coded it that way and it generates the desired output in the requested format. When we look at the result data in the log, it shows the correct format and correct data (i.e. the requested date format is 2018-07-26 11:19:04.0, and that is exactly what the log shows), but when we look at the same data in the CSV file, the format has changed: it shows 6/7/2018 12:27. Why does the CSV file show a different format when the log shows correct results, even though we wrote the same data to the CSV through the file-write command? How can this be resolved? Sample code:

```scala
val selectedData = dataFrame3.select(
  concat(col("ticket_number"), lit("-"), date_format(col("as_of_date"), "yyMMdd")).as("transref"),
  col("newmCanc").as("newmCanc"),
  when(col("trade_action") === "CXL",
      concat(col("master_ticket_num"), lit("-"), date_format(col("as_of_date"), "yyMMdd")))
    .otherwise("").as("relTransref"),
  col("trader_name").as("portfolioIdAm"),
  col("portfolioIdKvg"),
  col("name").as("portfolioName"),
  when(col("buy_sell_desc") === "Buy", "BUY")
    .when(col("buy_sell_desc") === "Sell", "SELL")
    .otherwise("OTHER").as("buyisell"),
  col("trade_feed_trade_amount").as("quantity"),
  col("secIdType"),
  col("id_isin").as("secId"),
  when(col("instrument_name").isNotNull, col("instrument_name"))
    .otherwise(col("security_name")).as("secName"),
  format_number(col("trade_price").cast("Double"), 2).as("price"),
  col("currency").as("tradeCCY"),
  format_number(col("settlement_costs_in_settlement_currency").cast("Double"), 2).as("tradeComm"),
  format_number(col("Transaction_Cost_2_Amount").cast("Double"), 2).as("fees"),
  format_number(col("Transaction_Cost_3_Amount").cast("Double"), 2).as("tax"),
  format_number(col("Transaction_Cost_5_Amount").cast("Double"), 2).as("others"),
  format_number(col("Accrued_Interest"), 2).as("interest"),
  format_number(col("settlement_total_in_settlement_currency").cast("Double"), 2).as("settlAmount"),
  when(col("Number_of_days_accrued_interest").isNull, "0")
    .otherwise(col("Number_of_days_accrued_interest")).as("interestDays"),
  date_format(col("as_of_date"), "yyyy-MM-dd").cast("String").as("tradeDate"),
  date_format(col("receiveddate"), "yyyy-MM-dd HH:mm:ss").cast("String").as("executionTimestamp"),
  date_format(col("settlement_date"), "yyyy-MM-dd").cast("String").as("settlementDate")
)
```
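A common explanation for this symptom is that the raw CSV is actually correct but is being viewed in Excel, which re-renders date-like strings in its own locale format; checking the file in a plain text editor shows what was really written. Since tradeDate and executionTimestamp above are already cast to String, Spark's CSV writer emits them verbatim. A minimal sketch of the write step, using standard DataFrameWriter CSV options and an illustrative output path:

```scala
// Continues from selectedData above. The formatted String columns are
// written exactly as they appear in the log; timestampFormat controls
// rendering only for columns still of timestamp type.
selectedData.write
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .mode("overwrite")
  .csv("/tmp/client_report_csv") // illustrative path
```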
Labels:
- Apache Spark
03-05-2018
12:07 PM
We have a project that currently uses shell scripts, Hive, and Tez as the execution engine. As a POC we tried replacing the shell scripts with Spark and executed the HQLs through Spark. One of the clients came back with the question of why we would need a Spark application at all, since we can set Spark as the execution engine and keep running our regular shell scripts and Oozie workflows. Which is the better option: simply setting
set hive.execution.engine=spark; OR building a Spark application and executing the HQLs with Spark APIs? If performance is the same for both, then why do we need to write code in Spark? What is the advantage of writing a Spark application using Spark SQL?
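For comparison, a minimal sketch of the "Spark application" side of the question: the same HQL a shell script would hand to the hive CLI can be issued through the Spark SQL API, and the result comes back as a DataFrame that stays in memory for further transformations, which is the main thing switching the execution engine alone does not give you. Table and column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object HqlViaSpark {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport makes Hive metastore tables visible to spark.sql.
    val spark = SparkSession.builder()
      .appName("hql-via-spark")
      .enableHiveSupport()
      .getOrCreate()

    // The same HQL a shell script would pass to hive/beeline runs here,
    // but the result can be cached, re-joined, or fed into non-SQL logic.
    val daily = spark.sql(
      "SELECT trade_date, COUNT(*) AS trades FROM trades GROUP BY trade_date")
    daily.cache().show()
  }
}
```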
Labels:
- Apache Hive
- Apache Spark
08-13-2017
12:01 PM
I need to process a CSV file through Spark and load it into Hive tables. However, the file itself contains commas in the data in several places, not as separators but as content. In this case there are two questions: 1) How will Spark identify that such a comma is not a separator and treat it as part of the data? 2) How can we process such data and load it into Hive, keeping the commas that are content and not separators? Please share some techniques to achieve the above.
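For reference, the standard approach is to quote fields that contain commas and let the CSV reader honor the quoting. A minimal sketch using Spark's built-in CSV reader (quote and escape are standard DataFrameReader options; the path and table name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object QuotedCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quoted-csv")
      .enableHiveSupport() // needed to write Hive tables below
      .getOrCreate()

    // A comma inside a quoted field is treated as data, not a separator:
    //   id,comment
    //   1,"loaded, verified"
    val df = spark.read
      .option("header", "true")
      .option("quote", "\"")   // fields wrapped in double quotes
      .option("escape", "\"")  // doubled quote inside a field
      .csv("/data/input_with_commas.csv")

    // The embedded commas survive intact into the Hive table.
    df.write.mode("overwrite").saveAsTable("default.clean_input")
  }
}
```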
Labels:
- Apache Hive
- Apache Spark
08-02-2017
11:21 AM
I have a Spark job, and while submitting it I specify X executors and Y memory. However, somebody else is also using the same cluster and wants to run several jobs during that same window, also with X executors and Y memory, and neither of us knows about the other. In this case, how should the number of executors and the memory for our Spark job be calculated?
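One common approach when jobs from unaware teams share a cluster is to stop hard-coding executor counts and let YARN scale each job between a floor and a ceiling (YARN queues with the Capacity Scheduler address the same problem at the cluster level). A minimal sketch using Spark's standard dynamic-allocation settings; the numbers are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: rather than a fixed X executors / Y memory, let YARN grow and
// shrink this job between a floor and a ceiling so two jobs that do not
// know about each other can share the queue. Requires the external
// shuffle service to be running on the NodeManagers.
val spark = SparkSession.builder()
  .appName("shared-cluster-job")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
```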
Labels:
- Apache Spark
08-02-2017
10:40 AM
I see people ask what optimization techniques you use for your Spark jobs. What optimization techniques can we use for Spark jobs, whether while writing the job code, while submitting it, or to run the job with optimal resources?
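For illustration, a minimal sketch of a few frequently cited code-level techniques: read a columnar format, filter early, broadcast the small side of a join, and cache only data that is reused. Paths, tables, and column names are assumptions for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

object OptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("opt-sketch").getOrCreate()

    // Columnar input lets Spark prune unread columns, unlike raw CSV.
    val trades = spark.read.parquet("/data/trades")
    val codes  = spark.read.parquet("/data/codes") // small lookup table

    val enriched = trades
      .filter(col("amount") > 0)           // filter early: less data shuffled
      .join(broadcast(codes), Seq("code")) // broadcast the small side: avoids a shuffle join
      .cache()                             // reused twice below, so compute once

    enriched.groupBy("trade_date").count().show()
    enriched.write.mode("overwrite").parquet("/data/enriched")
  }
}
```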
Labels:
- Apache Spark
08-02-2017
09:29 AM
@jfrazee Thank you very much. It will help me a lot in exploring Spark & Scala. Thanks for the guidance.
08-01-2017
12:39 PM
Can somebody guide me to complete Spark + Scala examples? Are there any documents, online links, or books where I can see some complete examples of Spark + Scala?
Labels:
- Apache Spark
08-01-2017
06:35 AM
Can somebody tell me what a real-world cluster configuration looks like? I have set up Hortonworks on my home system, but it is standalone. In a real project, what would the cluster configuration be: how many nodes, cluster memory and RAM, per-node memory and RAM, cluster backup, and so on? And while submitting a Spark job to YARN, how can we decide on executors, memory, and all those properties?
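As a purely illustrative sizing exercise (a commonly cited rule of thumb, not a fixed answer): on a worker node with 16 cores and 64 GB of RAM, reserve about 1 core and 1 GB for the OS and Hadoop daemons, leaving 15 cores and 63 GB. At roughly 5 cores per executor, that gives 3 executors per node, each with 63 / 3 = 21 GB, of which around 10% goes to YARN memory overhead, so about 19 GB per executor. The actual numbers depend on the hardware, the workload, and what else shares the cluster.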
Labels:
- Apache YARN
07-26-2017
10:28 AM
I need to schedule jobs that import data from external tables into HDFS through Sqoop using incremental mode. However, the import should not load duplicate values, or, if a value is a duplicate, it should overwrite the existing data in the file.
Labels:
- Apache Sqoop
07-26-2017
10:11 AM
I have two relations:
Relation A: (1, w) (2, x) (3, y)
Relation B: (1, z) (2, x) (3, y) (4, K) (5, L)
I want to merge these relations into a third relation with no duplicates, using a Pig script:
Relation C: (1, w) (2, x) (3, y) (4, K) (5, L)
Labels:
- Apache Pig
07-10-2017
02:21 PM
First of all, thanks Geoffrey for your quick response; I hope I have addressed your name correctly. Suppose I have one CSV file that I want to process through Spark, I submit the job on YARN, and I need the data to be loaded into Hive tables. In this case, where would I write my Spark code (I will write the code in Eclipse, but on which machine?), how would I submit it on YARN, and how would I access my Hive tables? Would all the components be distributed, or would Spark and Hive be on the same node? If they are on the same node, then why do we need the other three data nodes if one edge node can do all the work?
07-10-2017
07:06 AM
Hello, I have trained myself on Hadoop. I know how to work with MR, Pig, Hive, Spark, Scala, Sqoop, and so on; however, I have worked with all these components on my personal system in a single-node architecture. Now I need to know how a real, live project works. How does a multi-node setup work? If I am trying to process one CSV file, how do I access Spark, Hive, and the other components that are installed on different nodes? I need detailed documents, if somebody has them, or any article anyone is aware of that shows the complete steps and process for accessing the different components. I feel helpless, as nobody in my group or among my connections works on a real Hadoop ecosystem.
06-30-2017
05:07 PM
Thanks, it helped a lot to clear up my confusion.
06-29-2017
01:39 PM
I have gone through the URL below to understand how to load data into Hive using Spark in ORC format. I understood how to create a table in Hive using Spark; however, I have one question: how would Spark identify which database the table should be created in? Or, if I have the same table name in two different Hive databases, which table is Spark going to insert values into? The URL I followed: https://hortonworks.com/tutorial/using-hive-with-orc-from-apache-spark/
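For reference, a minimal sketch of how the target database is resolved, using the modern SparkSession API (on older HDP releases the equivalent calls go through HiveContext); the database and table names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object HiveDatabaseTarget {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-db-target")
      .enableHiveSupport() // connects spark.sql to the Hive metastore
      .getOrCreate()

    val df = spark.read.parquet("/data/staged") // illustrative source

    // An unqualified table name resolves against the session's current
    // database (initially "default"). Either switch the current database...
    spark.sql("USE reporting_db")
    // ...or, more explicitly, qualify the name so the target is
    // unambiguous even when two databases contain the same table name.
    df.write.format("orc").mode("overwrite").saveAsTable("reporting_db.trades_orc")
  }
}
```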
Labels:
- Apache Hive
- Apache Spark
04-04-2017
09:39 AM
I have installed HDP 2.5 and I am able to log into the sandbox on CentOS using the root username and the hadoop password, but how do I start working with Hadoop components like Pig, Hive, and the rest? Do I need to install and configure all the Hadoop components, or are they installed and configured by default? How do I get a prompt for Pig, Hive, Spark, Scala, and so on? Also, I am unable to log into Ambari using the admin/admin username and password.
Labels:
- Hortonworks Data Platform (HDP)