The NSFPatents project correlates patent generation with funding from the National Science Foundation. I worked on the Ruby application, which downloads and scrapes US Patent and Trademark Office datasets for patent registrations that cite NSF as a contributor.
This project generated data that had never before been available, and it helps the Foundation show that the federal funding it distributes as research grants is worthwhile.
The datasets are in a sparsely documented XML format with a Document Type Definition (DTD) that changes from decade to decade. The sheer volume of patent information was also a challenge, and we came up with several optimizations to improve the parser’s speed and reliability.
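One way to cope with a DTD that shifts over time is to look at each document's root element and pick the matching lookup path for the funding statement. The sketch below illustrates that idea with REXML. It is not the project's actual parser: the root names `patent-v1` and `patent-v2`, the XPath expressions, and the helpers `funding_statement` and `cites_nsf?` are all hypothetical and do not reflect the real USPTO schema.

```ruby
require "rexml/document"

# Hypothetical mapping: root element name (one per DTD generation) to the
# XPath of the element holding the government funding statement.
# These names are illustrative only, not the real USPTO element names.
FUNDING_XPATHS = {
  "patent-v1" => "//GOVINT/PARA",
  "patent-v2" => "//description/gov-interest/p"
}.freeze

# Matches either the full name or the abbreviation, case-insensitively.
NSF_PATTERN = /National Science Foundation|\bNSF\b/i

# Returns the funding statement text, or nil when the document format is
# unrecognized or the statement is absent.
def funding_statement(xml)
  doc = REXML::Document.new(xml)
  root = doc.root
  return nil if root.nil?

  xpath = FUNDING_XPATHS[root.name]
  return nil if xpath.nil?

  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

# True when a recognized document's funding statement mentions NSF.
def cites_nsf?(xml)
  statement = funding_statement(xml)
  !statement.nil? && statement.match?(NSF_PATTERN)
end
```

Keeping the per-format paths in one table means that supporting another DTD revision is a matter of adding an entry instead of branching throughout the parser. Documents with an unrecognized root are skipped here; a production parser would more likely log them for review.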