Archive for category Python

Pandas.read_csv makes up dates for nulls

With the pandas library, if you have read in a csv file with a date field that is sometimes empty and are getting the error:

TypeError: can't compare offset-naive and offset-aware datetimes

It may be caused by dateutil.parser.parse which is the default function for parsing dates when the csv file is read. This function returns the current non-timezone aware date when given an empty string as input. According to the dateutil documentation “the default value is the current date, at 00:00:00am.” This causes confusion in the context of pandas for three reasons:

  1. the data in the DataFrame is not derived from the source CSV file
  2. the expected value of an empty field is numpy.NaN.
  3. the returned datetime object is not time zone aware.

Fortunately pandas.read_csv has the date_parser argument which allows you to include your own parsing function. One might thing that the following function is the right fix…

def date_or_nan_parse(str):
   if not(str):
       return numpy.NaN
   return dateutil.parser.parse(str)

BUT NO. If you then try to compare a date against the pandas.Series of the dates the same “TypeError: can’t compare offset-naive and offset-aware datetimes” may get thrown. In fact, both of these will fail:

df['datefield'] > dateutil.parser.parse("January 1 1901 00:00 UTC")
df['datefield'] > dateutil.parser.parse("January 1 1901 00:00")

This time the problem is that the comparison fails on the numpy.NaN values.

My fix is:

OLD_DATE = dateutil.parser.parse("January 1, 1901 00:00 UTC")
def date_or_nan_parse(str):
   if not(str):
       return OLD_DATE
   return dateutil.parser.parse(str)

In this case I introduce a new value rather than (the more correct) numpy.NaN. The new value is a date so it doesn’t fail during comparison operations (assuming that OLD_DATE has the same time zone awareness as the rest of your data). And at least it is a date that I’ve explicitly chosen, rather than just the current date.

I hope this saves you some confusion.

Leave a comment