Member since
08-05-2018
73
Posts
0
Kudos Received
0
Solutions
10-21-2018
06:28 AM
Hi Shu, thanks for responding. The solution you provided appears a little difficult for something that I thought would be relatively simple. I will try your solution and let you know how I get on. In the meantime, have you seen the solution provided here: https://forums.databricks.com/questions/2848/how-do-i-create-a-single-csv-file-from-multiple-pa.html?childToView=12091
... View more
10-20-2018
02:46 PM
Hello Community, I'm trying to create a single file from an output query that is overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code; it appears to overwrite the file, but a different filename is generated each time:
example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults")
example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults/theresults.csv")
carl = example1.show()
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
example1.repartition(1).write.format("csv").mode("overwrite").save("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/thefile.csv")
Can someone show me how to write code that will result in a single file that is overwritten without changing the filename?
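A note on why the filename keeps changing: Spark writes output as a directory of part-* files, and the writer API offers no way to choose the final filename. A common workaround (a sketch, not Spark's own API — the staging path and helper name below are hypothetical) is to coalesce to one partition, write to a staging directory, then rename the single part file yourself. On ADLS the rename would go through the Hadoop FileSystem API; this sketch shows the rename step with only the standard library, as it would work on a local path:

```python
import glob
import os
import shutil

def publish_single_csv(staging_dir: str, final_path: str) -> None:
    """Move the lone part-* CSV that Spark wrote into staging_dir
    to a stable filename, overwriting any previous run's output."""
    part_files = glob.glob(os.path.join(staging_dir, "part-*.csv"))
    assert len(part_files) == 1, "expected exactly one part file (use coalesce(1))"
    if os.path.exists(final_path):
        os.remove(final_path)               # overwrite the previous run's file
    shutil.move(part_files[0], final_path)  # stable name from now on
    shutil.rmtree(staging_dir)              # drop the staging directory

# The Spark write that would precede the rename (needs a live session):
# example1.coalesce(1).write.option("header", "true") \
#     .mode("overwrite").csv("/tmp/newresults_staging")
# publish_single_csv("/tmp/newresults_staging", "/tmp/theresults.csv")
```

The commented lines show where the ordinary coalesce(1) write would run before the rename.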
... View more
10-19-2018
10:13 AM
Hi guys, I'm sorry if the question seems a little confusing. Basically, I would just like to be able to save to a single file and the file to be overwritten each time it is saved. Thanks
... View more
10-19-2018
09:49 AM
Sorry guys, I forgot to add the code: example1 = spark.sql("""SELECT
CF.CountryName AS CountryCarsSold
,COUNT(CF.CountryName) AS NumberCountry
,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
... View more
10-18-2018
10:45 PM
Sorry, I forgot to add the query: example1 = spark.sql("""SELECT
CF.CountryName AS CountryCarsSold
,COUNT(CF.CountryName) AS NumberCountry
,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
... View more
10-18-2018
10:19 PM
Hello community, I'm using the following script to output the results of a Spark SQL query to a file in Azure Data Lake Store. However, instead of creating a file called myresults.json and publishing the results to it, the script publishes the results to a randomly named file like part-0000-tid ... see image. Can someone let me know how to make sure the file is created and overwritten each time the pyspark query is run? Thanks
... View more
08-13-2018
04:11 PM
Hi Sandeep, thanks. It works very well. Thank you
... View more
08-13-2018
12:19 PM
Hi Sandeep, I should be clear about what I'm trying to achieve. I would like the output to include only the delta change. I thought that having the current date would be sufficient, but I just realized that having just the current date won't let me know if there has been a change to the data. Therefore, while you're helping me, could you also help me figure out how to include the current date and the delta change in the data? Much appreciated. Cheers
... View more
08-13-2018
12:08 PM
I'm using Python version 3, and print(currentdate) worked. Thanks. However, when I run the full query I get the following error:
<ipython-input-22-8c743396e037> in <module>()
     18 FROM HumanResources_vEmployeeDepartment
     19 ORDER BY FirstName, LastName DESC""")
---> 20 counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/myresults7-"+currentdate+".csv")
AttributeError: 'DataFrameWriter' object has no attribute 'csvCONCAT'
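On the csvCONCAT error: DataFrameWriter has no such method, which is exactly what the AttributeError says; the concatenation belongs in the path string before it is handed to the ordinary .csv() writer. A minimal sketch (the Spark call is left as a comment since it needs a live session; the path reuses the one from the traceback):

```python
import datetime

# Build the dated output path first, then hand it to the normal writer.
currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
output_path = "/home/packt/Downloads/myresults7-" + currentdate + ".csv"

# counts.coalesce(1).write.csv(output_path)  # the ordinary DataFrameWriter.csv call
```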
... View more
08-13-2018
11:20 AM
I now get the following error:
File "<ipython-input-13-588f4561c3f0>", line 7
    print currentdate()
          ^
SyntaxError: invalid syntax
The invalid syntax is currentdate(). Without the parentheses I get the following error:
File "<ipython-input-14-8d268659919b>", line 1
    print currentdate
          ^
SyntaxError: Missing parentheses in call to 'print'
... View more
08-13-2018
11:05 AM
The syntax error is with 'currentdate'
... View more
08-13-2018
11:04 AM
Sandeep, thanks for reaching out. I'm getting the following error from the import statement:
File "<ipython-input-7-3dab170099f6>", line 3
    import datetime currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
                    ^
SyntaxError: invalid syntax
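The caret points at the second statement sharing the import's line; each statement needs its own line, which clears the SyntaxError. A minimal sketch:

```python
import datetime

# The import and the assignment on separate lines:
currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
print(currentdate)  # print needs parentheses in Python 3
```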
... View more
08-13-2018
09:52 AM
Hello community,
I have created the following pyspark query: from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
counts = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
counts.coalesce(1).write.csv("/home/packt/Downloads/myresults3.csv")
I would like to add the current date and time to the file called myresults3. I think the code would look something like the following:
counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/'myresults3'-CURRENTDATE.csv")
I'm sure I'm way off the mark with the above attempt, but I'm sure you can see what I'm trying to achieve.
Any help will be appreciated. Cheers
Carlton
... View more
Labels:
- Apache Spark
08-13-2018
09:45 AM
Hello community, I have created the following pyspark query: from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
counts = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
counts.coalesce(1).write.csv("/home/packt/Downloads/myresults3.csv")
I would like to add the current date and time to the file called myresults3. I think the code would look something like the following:
counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/'myresults3'-CURRENTDATE.csv")
I'm sure I'm way off the mark with the above attempt, but I'm sure you can see what I'm trying to achieve. Any help will be appreciated. Cheers Carlton
... View more
Labels:
- Apache Spark
08-13-2018
09:42 AM
Hello community, I have created the following pyspark query: from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
counts = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
counts.coalesce(1).write.csv("/home/packt/Downloads/myresults3.csv")
I would like to add the current date and time to the file called myresults3. I think the code would look something like the following:
counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/'myresults3'-CURRENTDATE.csv")
I'm sure I'm way off the mark with the above attempt, but I'm sure you can see what I'm trying to achieve. Any help will be appreciated. Cheers Carlton
... View more
Labels:
- Apache Spark
08-06-2018
09:02 PM
Is there a way to get the results with the header info?
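A hedged pointer for the header question: Spark's CSV writer takes a header option (left as a comment below since it needs a live SparkSession); the csv-module lines just illustrate the extra first line it produces, with a hypothetical data row:

```python
# Spark sketch (assumed session; path from the earlier post):
# myresults.coalesce(1).write.option("header", "true") \
#     .csv("/home/packt/Downloads/myresults3.csv")

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["FirstName", "LastName", "JobTitle"])  # the header row
writer.writerow(["Aaron", "Alonso", "Technician"])      # hypothetical data row
header_line = buf.getvalue().splitlines()[0]
```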
... View more
08-06-2018
08:56 PM
Felix, thank you so much. It worked like a dream
... View more
08-06-2018
11:32 AM
Hello community, The pyspark query below produces the output shown in the image. The query is as follows: #%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
myresults = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
myresults.show()
Can someone show me how to save the results to a text/CSV file (or any file, please)? Thanks, Carlton
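One angle on saving: show() only prints to the console; persisting goes through the DataFrameWriter. A hedged sketch — the Spark call is a comment because it needs a live session and writes a directory of part files, while the standard-library lines below show the target file shape with a hypothetical row:

```python
import csv
import os
import tempfile

# Spark sketch (assumed session; note the path names a directory):
# myresults.coalesce(1).write.csv("/home/packt/Downloads/myresults_out")

def save_rows_as_csv(rows, path):
    """Write data rows to a CSV file, one row per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

out_path = os.path.join(tempfile.gettempdir(), "myresults_rows.csv")
save_rows_as_csv([("Zoe", "Young", "Engineer")], out_path)  # hypothetical row
```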
... View more
Labels:
- Apache Spark
08-05-2018
05:15 PM
OK, as I'm not getting much assistance with my original question, I thought I would try to figure out the problem myself. So I rewrote the pyspark.sql query as follows: #%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('Person_Person')
myresults = spark.sql("""SELECT
PersonType
,COUNT(PersonType) AS `Person Count`
FROM Person_Person
GROUP BY PersonType""")
myresults.collect()
result = myresults.collect()
result
result.saveAsTextFile("test")
However, I'm now getting the following error message:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
I think this could be an easier issue to resolve, so if someone could help resolve it, that would be most appreciated. Thanks
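On that error: collect() returns a plain Python list, so saveAsTextFile (an RDD method) is gone by that point. The collected list can be written with ordinary file I/O instead — a sketch, with the Spark part left as a comment and a hypothetical stand-in for the collected rows:

```python
import os
import tempfile

def save_rows_as_text(rows, path):
    """Write one element per line, mimicking saveAsTextFile's one-record-per-line output."""
    with open(path, "w") as f:
        for row in rows:
            f.write(str(row) + "\n")

# result = myresults.collect()       # the Spark part, needs a live session
result = [("EM", 273), ("SC", 753)]  # hypothetical stand-in for collected rows
text_path = os.path.join(tempfile.gettempdir(), "myresults_sketch.txt")
save_rows_as_text(result, text_path)
```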
... View more
08-05-2018
02:41 AM
Hello community, My first post here, so please let me know if I'm not following protocol. I have written a pyspark.sql query as shown below. I would like the query results to be sent to a text file, but I get the error: AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile' Can someone take a look at the code and let me know where I'm going wrong? #%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
def main():
spark = SparkSession.builder.appName('aggs').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/sales_info.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('sales_info')
example8 = spark.sql("""SELECT
*
FROM sales_info
ORDER BY Sales DESC""")
example8.saveAsTextFile("juyfd")
main()
Any help would be appreciated. Carlton
... View more
Labels:
- Apache Spark
02-03-2018
08:11 AM
Hello Community, My apologies for the confusing subject question. I have created the following Hadoop HQL script and deployed it in both Hadoop on Microsoft Azure and Ambari. DROP TABLE IF EXISTS HiveSampleIn;
CREATE EXTERNAL TABLE HiveSampleIn
(
anonid int,
eprofileclass int,
fueltypes STRING,
acorn_category int,
acorn_group STRING,
acorn_type int,
nuts4 STRING,
lacode STRING,
nuts1 STRING,
gspgroup STRING,
ldz STRING,
gas_elec STRING,
gas_tout STRING
)
partitioned by ( acorn_category int, acorn_categorycount int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/samplein/';
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut
(
acorn_category int,
acorn_categorycount int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';
INSERT OVERWRITE TABLE HiveSampleOut
Select
acorn_category,
count(*) as acorn_categorycount
FROM HiveSampleIn GROUP BY acorn_category;
In Ambari the results are provided as a .csv file that looks as follows (notice the column headings in red). However, in Azure the results are provided as a text file (which is fine), but it doesn't have the column headings as shown. Can someone please let me know how to include the column headings in the text file?
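One hedged possibility for the headings (an assumption based on standard Hive behaviour, not verified on this cluster): the hive.cli.print.header setting adds column names to console query output. It does not change the contents of files written by INSERT OVERWRITE, where a common workaround is to route the output through string-typed columns and UNION in a literal header row.

```sql
-- Show column headings in Hive CLI / Beeline query output
-- (session setting; does not affect INSERT OVERWRITE file contents):
SET hive.cli.print.header=true;

SELECT acorn_category, COUNT(*) AS acorn_categorycount
FROM HiveSampleIn
GROUP BY acorn_category;
```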
... View more
Labels:
- Apache Ambari
- Apache Hadoop
02-02-2018
10:26 PM
Hello Community, Can someone please provide a link to good resources where I can learn how to write Hive queries (HQL)? Or if you could recommend some good books, that would also be great. Cheers Carlton
... View more
Labels:
- Apache Hive
02-01-2018
11:59 PM
Hi Jay, I think I'm going to call it a night. Thanks for your help tonight mate. I think the only solution left is a complete rebuild
... View more
02-01-2018
11:51 PM
Jay, I just found the following link that might explain it https://github.com/hortonworks/data-tutorials/issues/411
... View more
02-01-2018
11:34 PM
Hi Jay, I have highlighted port 2222 from the command you suggested, see image
... View more
02-01-2018
11:16 PM
Hi Jay, It's hosted on ESXi 5.5 and I access it using vSphere client
... View more
02-01-2018
11:10 PM
Hi Jay, Will a reboot allow me to enter on port 2222 ? At the moment I keep on getting the error below
... View more