Created 01-17-2018 07:59 AM
HDP-2.6.3.0
/usr/hdp/current/spark2-client/bin/spark-submit --queue load --driver-memory 10g --num-executors 6 --executor-memory 30G dpi_test.py
(Note: spark-submit options must come before the application script; anything after dpi_test.py is passed to the script itself and ignored by spark-submit.)
dpi_test.py:

from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
import re
import time

warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# True when x occurs in y as a whole "word" delimited by non-alphanumerics.
cont = udf(lambda x, y: bool(re.match('^((.)*([^0-9A-Za-z])+)*' + x + '(([^0-9A-Za-z])+(.)*)*$', y)),
           BooleanType())

spark.sql("select d.msisdn, d.end_time from other_sources.stg_dpi_other_day d "
          "where msisdn='380988526911'") \
    .groupby('msisdn') \
    .agg({'end_time': 'min', 'end_time': 'max', 'end_time': 'count'}) \
    .show(n=50)
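One thing to watch in the script above, independent of the memory errors: the dict passed to agg() repeats the key 'end_time' three times, and a Python dict literal keeps only the last value for a duplicate key, so only the count is actually computed. A quick sketch of the behavior, with the explicit PySpark alternative shown in comments (assuming the same column names):

```python
# A dict literal silently keeps only the last value for a duplicate key,
# so of {'end_time': 'min', 'end_time': 'max', 'end_time': 'count'}
# only the 'count' aggregation survives.
spec = {'end_time': 'min', 'end_time': 'max', 'end_time': 'count'}
print(spec)  # {'end_time': 'count'}

# To get all three aggregates, name each one explicitly instead:
#   from pyspark.sql import functions as F
#   df.groupby('msisdn').agg(F.min('end_time'),
#                            F.max('end_time'),
#                            F.count('end_time'))
```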
Created 01-17-2018 08:04 AM
spark2-error.txt <<< error details attached
Created 01-17-2018 04:07 PM
Why are you using 10g of driver memory? What is the size of your dataset and how many partitions does it have?
I would suggest using the config below:
--executor-memory 32G \
--num-executors 20 \
--driver-memory 4g \
--executor-cores 3 \
--conf spark.driver.maxResultSize=3g \
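When sizing executors like this, keep in mind that YARN allocates the requested executor memory plus an off-heap overhead (in Spark 2.x the default is max(384 MB, 10% of executor memory)), so each container is larger than the flag suggests and the cluster must have headroom for it. A rough back-of-envelope check, assuming those defaults:

```python
def yarn_container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    # Spark 2.x default: memory overhead = max(384 MB, 10% of executor memory).
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead

# Each 32 GB executor actually requests roughly a 35.2 GB YARN container.
print(yarn_container_mb(32 * 1024))  # 36044 MB, i.e. about 35.2 GB
```

With 20 executors at 32G that is roughly 700 GB of container memory in total, so verify the load queue can actually grant it before blaming the job.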