Immense troves of data remain available solely on ink & paper. Information that has been computerized remains in private intranets. Even data that is online, organized and available remains in a format that prevents semantic contextualization - either stored as image files (TIFF) or in difficult-to-decipher compressed formats (PDF or XPS). And in the rare cases where government agencies have made information public, semantically decipherable and accessible over the internet, the problem remains of indexing that data using a common schema and storing it in related, centralized locations where people know to look for it.
In plain English: the government keeps "public" records secret by publishing them where no one knows to look, where it's incredibly complex or expensive to read them, and by scattering them all over the country.
Despite a government budget in the trillions, solving these problems has largely been left to private, non-profit organizations. I currently serve as the Director of Technology for one such organization - and I am very excited to begin collaborating with the Open Knowledge Labs initiative to further my work in this area, as well as to improve the portability of some of the tools I have put together so that like-minded developers & admins can get their sticky claws on 'em.
Open Knowledge Labs is a collaborative platform for sharing tools and code that help developers overcome obstacles to meaningfully presenting abstract information. Labs is a smaller part of the larger Open Knowledge project, which is involved with a number of other sites, like opengovernmentdata.org and datahub.io.
I'm working on a couple of projects that I look forward to making available through the Lab. Here are a few of the scripts I will be uploading in the immediate-to-near future, in case anyone wants to help out:
- on the top of the list at the moment is a script to strip executable code from large volumes of compressed files (like PDF) in order to prevent the eventuality of malicious software being injected into sensitive data dumps.
- a method of chunking, hashing and comparing massive datasets to detect file tampering.
- a ton of odds and ends for AWS management that I have lying around.
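For anyone curious what the chunk-and-hash idea in the second item might look like, here is a minimal Python sketch. It is not the script mentioned above, only an illustration under a few assumptions: fixed-size chunks, SHA-256 as the hash, and a baseline list of digests recorded while the file was known to be good. The names `chunk_hashes`, `changed_chunks` and `CHUNK_SIZE` are hypothetical and chosen just for this example.

```python
import hashlib

# Assumed chunk size (4 MiB); a real tool would likely make this configurable.
CHUNK_SIZE = 4 * 1024 * 1024


def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Return one SHA-256 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def changed_chunks(baseline, current):
    """Return the indices of chunks that differ between two digest lists.

    A chunk present in only one of the lists (the file grew or shrank)
    is reported as changed as well.
    """
    longest = max(len(baseline), len(current))
    return [
        i
        for i in range(longest)
        if i >= len(baseline) or i >= len(current) or baseline[i] != current[i]
    ]
```

Reading the file chunk by chunk keeps memory use flat no matter how large the dataset is, and comparing per-chunk digests points to roughly where a file was altered rather than only reporting that it changed. One caveat of fixed-size chunks: inserting or deleting bytes shifts every later chunk, so everything after the edit will show up as changed.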
Interested readers should check out the Labs and the related projects on the site. There is some good stuff there.