As of sometime in mid-October 2005, Deutsche Welle has started producing its own RSS feed with enclosures. Although that makes this project unnecessary, I’m glad they got around to filling the void they had left. My feed no longer exists; its URL now redirects to their own Langsam Gesprochene Nachrichten RSS feed.
This is a Python script that scrapes a page from Deutsche Welle’s website and generates an RSS (Really Simple Syndication) file from its contents. The page in question contains ~10-minute summaries of the day’s news, slowly spoken in German, each with a transcript. The RSS file contains the transcript with the mp3 summary attached as an enclosure. To take full advantage of the feed, use an aggregator that can handle enclosures, such as iPodder.
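To make the enclosure idea concrete, here is a minimal sketch of what one feed item looks like. It is not the code from my script (which uses PyRSS2Gen); it builds the XML with the standard library instead, and the function name `build_feed` and the dictionary keys are made up for this illustration. The `url`, `length` and `type` attributes on `enclosure` are the ones RSS 2.0 requires.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    `items` is a list of dicts with the (hypothetical) keys
    'title', 'transcript', 'mp3_url' and 'mp3_bytes'.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        # The transcript becomes the readable body of the item.
        ET.SubElement(item, "description").text = entry["transcript"]
        # The mp3 rides along as an enclosure; aggregators that
        # understand enclosures download it automatically.
        ET.SubElement(
            item,
            "enclosure",
            url=entry["mp3_url"],
            length=str(entry["mp3_bytes"]),
            type="audio/mpeg",
        )

    return ET.tostring(rss, encoding="unicode")
```

Calling `build_feed` with one dict per news day yields a string that can be written straight to the `.rss` file.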
The script is set up to run every day at 2pm GMT. The news doesn’t seem to be posted at exactly 10am as the site says, so I’ve pushed the time back to be sure it catches that day’s report. Check out the results: Slowly Spoken German News.
The script itself uses BeautifulSoup to do the screen scraping and PyRSS2Gen to do the RSS generation. In between, it uses pickle to maintain a cache of the last parsed pages, so that Deutsche Welle’s site is hit only once per news page and once per 12-hour period for the index. Hopefully this keeps me under their radar.
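The caching idea can be sketched roughly like this. It is an illustration rather than the code in the script: the class name `PageCache`, the `fetch` callback and the `now` parameter are all assumptions made for the example. News pages are cached forever (no `max_age`), while the index is fetched again once its copy is older than 12 hours.

```python
import os
import pickle
import time

INDEX_MAX_AGE = 12 * 60 * 60  # 12 hours, in seconds


class PageCache:
    """Pickle-backed cache mapping url -> (fetched_at, body)."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.entries = pickle.load(f)

    def get(self, url, fetch, max_age=None, now=None):
        """Return the cached body, calling fetch(url) only when needed.

        max_age=None means the entry never expires (news pages);
        pass INDEX_MAX_AGE for the index page.
        """
        now = time.time() if now is None else now
        hit = self.entries.get(url)
        if hit is not None:
            fetched_at, body = hit
            if max_age is None or now - fetched_at < max_age:
                return body
        # Cache miss or stale entry: hit the site once and remember it.
        body = fetch(url)
        self.entries[url] = (now, body)
        return body

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.entries, f)
```

A daily run would load the cache, call `get` for the index with `max_age=INDEX_MAX_AGE` and for each news page without it, then call `save` before exiting. Only unpickle files you wrote yourself, since pickle will execute whatever a tampered file tells it to.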
Check out the source at http://www.scompt.com/svn/projects/dwellerss/gen-dwelle.py and the final RSS file at http://www.scompt.com/dwelle.rss.
Things I’ve learned while doing this:
- Screen scraping is nasty!
- I’m still very much a Java programmer. Pythonization is definitely a TODO.
- Not speaking German for 9 months may have an impact on one’s ability to speak and understand German.
TODO
- Pythonize the script a lot more. The whole thing is basically list processing, which Python is built for. I need to take more advantage of this.
- Document and clean up the code.
- Ping a couple of sites whenever a new entry is added.
- Incorporate the audio and text posted on this DW page.