Python has incredibly scalable options for exploring data. With Pandas or Dask, you can scale Jupyter up to big data. But what about small data? Personal data? Private data?
JupyterLab and Jupyter Notebook provide a great environment to scrutinize my laptop-based life.
My exploration is powered by the fact that almost every service I use has a web application programming interface (API). I use many such services: a to-do list, a time tracker, a habit tracker, and more. But there is one that almost everyone uses: a calendar. The same ideas can be applied to other services, but calendars have one cool feature: an open standard that almost all web calendars support: CalDAV
.
Parsing your calendar with Python in Jupyter
Most calendars provide a way to export into the CalDAV
format. You may need some authentication for accessing this private data. Following your service's instructions should do the trick. How you get the credentials depends on your service, but eventually, you should be able to store them in a file. I store mine in my root directory in a file called .caldav
:
import os
with open(os.path.expanduser("~/.caldav")) as fpin:
username, password = fpin.read().split()
Never put usernames and passwords directly in notebooks! They could easily leak with a stray git push
.
The next step is to use the convenient PyPI caldav library. I looked up the CalDAV server for my email service (yours may be different):
import caldav
client = caldav.DAVClient(url="https://caldav.fastmail.com/dav/", username=username, password=password)
CalDAV has a concept called the principal
. It is not important to get into right now, except to know it's the thing you use to access the calendars:
principal = client.principal()
calendars = principal.calendars()
Calendars are, literally, all about time. Before accessing events, you need to decide on a time range. One week should be a good default:
from dateutil import tz
import datetime
now = datetime.datetime.now(tz.tzutc())
since = now - datetime.timedelta(days=7)
Most people use more than one calendar, and most people want all their events together. The itertools.chain.from_iterable
makes this straightforward:
import itertools
raw_events = list(
itertools.chain.from_iterable(
calendar.date_search(start=since, end=now, expand=True)
for calendar in calendars
)
)
Reading all the events into memory is important, and doing so in the API's raw, native format is an important practice. This means that when fine-tuning the parsing, analyzing, and displaying code, there is no need to go back to the API service to refresh the data.
But "raw" is not an understatement. The events come through as strings in a specific format:
print(raw_events[12].data)
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CyrusIMAP.org/Cyrus
3.3.0-232-g4bdb081-fm-20200825.002-g4bdb081a//EN
BEGIN:VEVENT
DTEND:20200825T230000Z
DTSTAMP:20200825T181915Z
DTSTART:20200825T220000Z
SUMMARY:Busy
UID:
1302728i-040000008200E00074C5B7101A82E00800000000D939773EA578D601000000000
000000010000000CD71CC3393651B419E9458134FE840F5
END:VEVENT
END:VCALENDAR
Luckily, PyPI comes to the rescue again with another helper library, vobject:
import io
import vobject
def parse_event(raw_event):
data = raw_event.data
parsed = vobject.readOne(io.StringIO(data))
contents = parsed.vevent.contents
return contents
parse_event(raw_events[12])
{'dtend': [<DTEND{}2020-08-25 23:00:00+00:00>],
'dtstamp': [<DTSTAMP{}2020-08-25 18:19:15+00:00>],
'dtstart': [<DTSTART{}2020-08-25 22:00:00+00:00>],
'summary': [<SUMMARY{}Busy>],
'uid': [<UID{}1302728i-040000008200E00074C5B7101A82E00800000000D939773EA578D601000000000000000010000000CD71CC3393651B419E9458134FE840F5>]}
Well, at least it's a little better.
There is still some work to do to convert it to a reasonable Python object. The first step is to have a reasonable Python object. The attrs library provides a nice start:
import attr
from __future__ import annotations
@attr.s(auto_attribs=True, frozen=True)
class Event:
start: datetime.datetime
end: datetime.datetime
timezone: Any
summary: str
Time to write the conversion code!
The first abstraction gets the value from the parsed dictionary without all the decorations:
def get_piece(contents, name):
return contents[name][0].value
get_piece(_, "dtstart")
datetime.datetime(2020, 8, 25, 22, 0, tzinfo=tzutc())
Calendar events always have a start, but they sometimes have an "end" and sometimes a "duration." Some careful parsing logic can harmonize both into the same Python objects:
def from_calendar_event_and_timezone(event, timezone):
contents = parse_event(event)
start = get_piece(contents, "dtstart")
summary = get_piece(contents, "summary")
try:
end = get_piece(contents, "dtend")
except KeyError:
end = start + get_piece(contents, "duration")
return Event(start=start, end=end, summary=summary, timezone=timezone)
Since it is useful to have the events in your local time zone rather than UTC, this uses the local timezone:
my_timezone = tz.gettz()
from_calendar_event_and_timezone(raw_events[12], my_timezone)
Event(start=datetime.datetime(2020, 8, 25, 22, 0, tzinfo=tzutc()), end=datetime.datetime(2020, 8, 25, 23, 0, tzinfo=tzutc()), timezone=tzfile('/etc/localtime'), summary='Busy')
Now that the events are real Python objects, they really should have some additional information. Luckily, it is possible to add methods retroactively to classes.
But figuring which day an event happens is not that obvious. You need the day in the local timezone:
def day(self):
offset = self.timezone.utcoffset(self.start)
fixed = self.start + offset
return fixed.date()
Event.day = property(day)
print(_.day)
2020-08-25
Events are always represented internally as start/end, but knowing the duration is a useful property. Duration can also be added to the existing class:
def duration(self):
return self.end - self.start
Event.duration = property(duration)
print(_.duration)
1:00:00
Now it is time to convert all events into useful Python objects:
all_events = [from_calendar_event_and_timezone(raw_event, my_timezone)
for raw_event in raw_events]
All-day events are a special case and probably less useful for analyzing life. For now, you can ignore them:
# ignore all-day events
all_events = [event for event in all_events if not type(event.start) == datetime.date]
Events have a natural order—knowing which one happened first is probably useful for analysis:
all_events.sort(key=lambda ev: ev.start)
Now that the events are sorted, they can be broken into days:
import collections
events_by_day = collections.defaultdict(list)
for event in all_events:
events_by_day[event.day].append(event)
And with that, you have calendar events with dates, duration, and sequence as Python objects.
Reporting on your life in Python
Now it is time to write reporting code! It is fun to have eye-popping formatting with proper headers, lists, important things in bold, etc.
This means HTML and some HTML templating. I like to use Chameleon:
template_content = """
<html><body>
<div tal:repeat="item items">
<h2 tal:content="item[0]">Day</h2>
<ul>
<li tal:repeat="event item[1]"><span tal:replace="event">Thing</span></li>
</ul>
</div>
</body></html>"""
One cool feature of Chameleon is that it will render objects using its html
method. I will use it in two ways:
- The summary will be in bold
- For most events, I will remove the summary (since this is my personal information)
def __html__(self):
offset = my_timezone.utcoffset(self.start)
fixed = self.start + offset
start_str = str(fixed).split("+")[0]
summary = self.summary
if summary != "Busy":
summary = "<REDACTED>"
return f"<b>{summary[:30]}</b> -- {start_str} ({self.duration})"
Event.__html__ = __html__
In the interest of brevity, the report will be sliced into one day's worth.
import chameleon
from IPython.display import HTML
template = chameleon.PageTemplate(template_content)
html = template(items=itertools.islice(events_by_day.items(), 3, 4))
HTML(html)
When rendered, it will look something like this:
2020-08-25
- <REDACTED> -- 2020-08-25 08:30:00 (0:45:00)
- <REDACTED> -- 2020-08-25 10:00:00 (1:00:00)
- <REDACTED> -- 2020-08-25 11:30:00 (0:30:00)
- <REDACTED> -- 2020-08-25 13:00:00 (0:25:00)
- Busy -- 2020-08-25 15:00:00 (1:00:00)
- <REDACTED> -- 2020-08-25 15:00:00 (1:00:00)
- <REDACTED> -- 2020-08-25 19:00:00 (1:00:00)
- <REDACTED> -- 2020-08-25 19:00:12 (1:00:00)
Endless options with Python and Jupyter
This only scratches the surface of what you can do by parsing, analyzing, and reporting on the data that various web services have on you.
Why not try it with your favorite service?
Comments are closed.