Explorer
Posts: 9
Registered: 08-17-2017

pyspark Y2K issue?

I am getting an odd error from Spark processing, which appears to be related to Y2K.

 

Here is the error and the steps to recreate the issue.

>>> transformed_df.show(120)
[Stage 22:===============================================>     (108 + 11) / 120]18/06/01 01:16:14 WARN scheduler.TaskSetManager: Lost task 117.0 in stage 22.0 (TID 1121, ip-10-0-0-78.ec2.internal, executor 118): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/sql/types.py", line 541, in toInternal
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/sql/types.py", line 541, in <genexpr>
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/sql/types.py", line 435, in toInternal
    return self.dataType.toInternal(obj)
  File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/sql/types.py", line 191, in toInternal
    else time.mktime(dt.timetuple()))
ValueError: year out of range
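
For reference, the call at the bottom of that traceback (time.mktime(dt.timetuple())) can be reproduced outside Spark. A minimal sketch, assuming the executors run Python 2 on Linux as the CDH 5.12 parcel does: on Python 2, time.mktime rejects years before 1900, and the sample data ends with 01JAN1897 and 01JAN1884 rows.

import time
from datetime import datetime

# Assumed minimal repro, Python 2 only. A post-1900 year converts fine
# (glibc mktime accepts pre-epoch dates on Linux)...
ok = datetime.strptime('01JAN1951:00:00:00', '%d%b%Y:%H:%M:%S')
print(time.mktime(ok.timetuple()))

# ...but anything before 1900 raises the same error seen on the executor:
bad = datetime.strptime('01JAN1884:00:00:00', '%d%b%Y:%H:%M:%S')
time.mktime(bad.timetuple())  # ValueError: year out of range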


Steps to recreate:

 

1. Copy test_data.csv to HDFS:

 

hadoop fs -put test_data.csv /user/tryme/

 

2. Start the pyspark shell with the spark-csv package:

 

pyspark --packages com.databricks:spark-csv_2.10:1.5.0

 

3. Copy and paste the following script into the shell:

 

 

from pyspark.sql import Row
import re
from datetime import datetime

date_formats_no_separator = ['%Y%m%d', '%m%d%Y', '%d%m%Y', '%d%b%Y', '%b%d%Y', '%Y%b%d']
date_patterns_no_separator = ['yyyyMMdd', 'MMddyyyy', 'ddMMyyyy', 'ddMMMyyyy', 'MMMddyyyy', 'yyyyMMMdd']
date_time_separator = [' ', 'T', ':']
time_formats = ['%H:%M:%S', '%H:%M:%S.%f', '%H:%M', '%I:%M:%S %p', '%I:%M:%S.%f %p','%I:%M %p']
time_patterns = ['HH:mm:SS', 'HH:mm:SS.MS', 'HH:mm', 'HH:mm:SS AM/PM', 'HH:mm:SS.MS AM/PM', 'HH:mm AM/PM']


def transform_data(row):
    element = row[0]
    count = row[1]
    element = str(element).strip()
    col_len = len(element)
    # Don't check for date times with slash or dash separators.
    # Only check for date times with no separator inside the date part and a
    # space, 'T', or colon between the date and time parts.
    for i, date_format in enumerate(date_formats_no_separator):
        for separator in date_time_separator:
            for j, time_format in enumerate(time_formats):
                col_format = date_format + separator + time_format
                try:
                    casted_value = datetime.strptime(element, col_format)
                    truncated_value = element.split(separator)[0]
                    data_type = 'DATE'
                    col_pattern = date_patterns_no_separator[i] + separator + time_patterns[j]
                    return Row(element=element, count=count, data_type=data_type, casted_value=casted_value,
                               truncated_value=truncated_value, col_pattern=col_pattern, col_len=col_len)
                except ValueError:
                    # strptime raises ValueError when the format does not match;
                    # try the next format combination.
                    pass
    # Not a recognised date/time: build a shape pattern (letters -> 'A', digits -> '#').
    col_pattern = re.sub('[A-Za-z]', 'A', re.sub('[0-9]', '#', element))
    return Row(element=element, count=count, data_type='TEXT', casted_value='',
               truncated_value='', col_pattern=col_pattern, col_len=col_len)


df = sqlContext.read.load('/user/tryme/test_data.csv',
                          format='com.databricks.spark.csv',
                          header='true',
                          inferSchema='true')

transformed_df = df.map(transform_data).toDF()
transformed_df.show(77)
transformed_df.show(78)
transformed_df.show(120)


test_data.csv

date_of_birth,count
01JAN1951:00:00:00,2953855
01JAN1952:00:00:00,2916927
01JAN1953:00:00:00,2738573
01JAN1954:00:00:00,2615708
01JAN1950:00:00:00,2612917
01JAN1949:00:00:00,2601401
01JAN1955:00:00:00,2407693
01JAN1956:00:00:00,2370881
01JAN1948:00:00:00,2362028
01JAN1947:00:00:00,2309837
01JAN1957:00:00:00,2267749
01JAN1958:00:00:00,2060797
01JAN1946:00:00:00,1893506
01JAN1959:00:00:00,1855708
01JAN1960:00:00:00,1688705
01JAN1961:00:00:00,1668868
01JAN1962:00:00:00,1532313
01JAN1945:00:00:00,1480809
01JAN1963:00:00:00,1440424
01JAN1964:00:00:00,1353597
01JAN1944:00:00:00,1324771
01JAN1943:00:00:00,1264511
01JAN1965:00:00:00,1217577
01JAN1968:00:00:00,1185374
01JAN1967:00:00:00,1149991
01JAN1966:00:00:00,1134881
01JAN1971:00:00:00,1123190
01JAN1942:00:00:00,1105879
01JAN1970:00:00:00,1100365
01JAN1969:00:00:00,1083852
01JAN1972:00:00:00,973147
01JAN1941:00:00:00,950754
01JAN1974:00:00:00,887508
01JAN1973:00:00:00,884436
01JAN1940:00:00:00,865721
01JAN1939:00:00:00,809181
01JAN1975:00:00:00,801027
01JAN1977:00:00:00,775361
01JAN1976:00:00:00,758922
01JAN1938:00:00:00,751923
01JAN1978:00:00:00,736465
01JAN1979:00:00:00,725492
01JAN1991:00:00:00,694575
01JAN1937:00:00:00,685409
01JAN1935:00:00:00,680434
01JAN1989:00:00:00,677417
01JAN1990:00:00:00,675438
01JAN1992:00:00:00,667714
01JAN1980:00:00:00,663517
01JAN1982:00:00:00,648663
01JAN1981:00:00:00,646560
01JAN1934:00:00:00,638448
01JAN1936:00:00:00,635111
01JAN1993:00:00:00,628645
01JAN1987:00:00:00,626258
01JAN1983:00:00:00,616168
01JAN1932:00:00:00,615887
01JAN1986:00:00:00,606736
01JAN1988:00:00:00,605636
01JAN1930:00:00:00,604866
01JAN1985:00:00:00,596504
01JAN1928:00:00:00,585602
01JAN1994:00:00:00,582868
01JAN1984:00:00:00,574620
01JAN1933:00:00:00,573276
01JAN1931:00:00:00,569792
01JAN1995:00:00:00,561352
01JAN1996:00:00:00,540879
01JAN1929:00:00:00,535435
01JAN1927:00:00:00,531715
01JAN1997:00:00:00,519491
01JAN1926:00:00:00,497725
01JAN1998:00:00:00,482663
01JAN1999:00:00:00,481169
01JAN2002:00:00:00,477012
01JAN1925:00:00:00,465883
01JAN2001:00:00:00,457355
01JAN2000:00:00:00,456811
01JAN1924:00:00:00,441628
01JAN2003:00:00:00,432664
01JAN2004:00:00:00,421687
01JAN1923:00:00:00,413741
01JAN2005:00:00:00,379712
01JAN1922:00:00:00,374686
01JAN2006:00:00:00,370478
01JAN2007:00:00:00,359461
01JAN1921:00:00:00,343762
01JAN2008:00:00:00,324248
01JAN2009:00:00:00,316046
01JAN1920:00:00:00,292220
01JAN2010:00:00:00,280843
01JAN2011:00:00:00,262548
01JAN1919:00:00:00,250663
01JAN1918:00:00:00,233284
01JAN2012:00:00:00,223214
01JAN1917:00:00:00,201334
01JAN2013:00:00:00,194642
01JAN1916:00:00:00,161944
01JAN2014:00:00:00,161270
01JAN1915:00:00:00,145328
01JAN2015:00:00:00,140228
01JAN1914:00:00:00,122558
01JAN1913:00:00:00,97799
01JAN2016:00:00:00,84585
01JAN1912:00:00:00,79759
01JAN1910:00:00:00,43033
01JAN1911:00:00:00,40000
01JAN1909:00:00:00,16710
01JAN2017:00:00:00,14586
01JAN1908:00:00:00,14163
01JAN1907:00:00:00,6513
01JAN1906:00:00:00,4830
01JAN1905:00:00:00,3994
01JAN1901:00:00:00,3587
01JAN1904:00:00:00,1672
01JAN1903:00:00:00,1139
01JAN1902:00:00:00,692
01JAN1900:00:00:00,275
01JAN1897:00:00:00,135
01JAN1884:00:00:00,19

Cloudera Employee
Posts: 59
Registered: 11-16-2015

Re: pyspark Y2K issue?

Hi @rabk

 

Interesting issue. I am not sure this has anything to do with Y2K. I can read up to this row in the CSV file:

 

|1927-01-01 00:00:...| 18|ddMMMyyyy:HH:mm:SS| 531715| DATE|01JAN1927:00:00:00| 01JAN1927|
|1997-01-01 00:00:...| 18|ddMMMyyyy:HH:mm:SS| 519491| DATE|01JAN1997:00:00:00| 01JAN1997|
|1998-01-01 00:00:...| 18|ddMMMyyyy:HH:mm:SS| 482663| DATE|01JAN1998:00:00:00| 01JAN1998|
+--------------------+-------+------------------+-------+---------+------------------+---------------+
only showing top 73 rows

 

After that, it gives the same error, "ValueError: year out of range" (thanks for sharing the how-to-reproduce steps!).

Splitting the file into two parts (rows 0-73 and 74-121) and then running the same code works for the first part, and works fine for the next 30 lines of the second part, after which it throws the same error.

 

If I read it without the transform_data function, I can read the entire file fine, which makes me think this has something to do with the logic in the transform function(?)
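
That check was something along these lines, reusing the load call from the posted script with no map(transform_data) step (exact call may have differed slightly):

df = sqlContext.read.load('/user/tryme/test_data.csv',
                          format='com.databricks.spark.csv',
                          header='true',
                          inferSchema='true')
# Without the map(transform_data) step there is no Python datetime being
# converted back into Spark's internal timestamp format, and all rows display.
df.show(120)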
