Calling all data geeks and enthusiasts! Testing is now open on DocHive—get started on GitHub.
DocHive is an open-source Ruby on Rails project for capturing data from image-based PDFs. Created for journalists and other professionals who need a more efficient way to extract meaning from tedious data, DocHive is in development and ready for testing in the community.
During the initial expansion of testing, my small team and I encountered problems with the application's prerequisites. To remedy that, we prepared a virtual machine with the prerequisites already installed.
These are the credentials for logging into the virtual machine and for the MySQL database installed on the VM.
Virtual Machine Credentials
1. Start Virtual Machine (This is the virtual machine download. Clicking Download will try to scan the file for viruses, but the scan cannot run because of the file size. Clicking 'Download Anyway' will then download the file.)
2. Log in as dochive (password is pr3v13w)
3. Double-click start.sh (on the Desktop) to start the server and the processing of background jobs
4. Launch Firefox (icon on the lower menu bar)
5. Create user account (local to the virtual machine)
6. Begin uploading documents
Download and install Oracle's VirtualBox to run the virtual machine.
You must have an Internet connection while DocHive is running, because DocHive makes connections to Google Charts.
Testing has expanded from structured form-based data to general data acquisition for researchers. I am working with a research partner, Jeff Provencher, and we are in the process of digitizing the text in his collection of image-based PDFs. The video shows the process of extracting one of his United Nations documents.
Watch this video of extracting data from a five-page PDF: walkthrough.mp4 - 04:15
Common questions and problems
It's not working: Check that you are connected to the Internet.
It's suddenly not working: Did you change locations? Restart the server and worker jobs by closing the console window and double-clicking start.sh again.
The client-side display language settings do not work: The only OCR language installed is English; the others are not developed yet. Contact me and I will walk you through installing a language pack in Tesseract.
A selfie is the same as a template, except that it is used only once.
Refresh the browser: The software does not auto-refresh when the job is complete.
Automatic template matching: It is coming soon.
My data files have duplicates: There may be an error in the background job, or each template is a selfie.
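If duplicates have already landed in a data file, one way to clean them up by hand is sketched below. This is only a rough illustration and not part of DocHive: the helper name `dedupe_lines` is made up, and it assumes the exported file is plain text with a header on the first line and one record per line, which may not match your export.

```ruby
# Hypothetical helper, not part of DocHive.
# Assumes plain text with a header on the first line and one record per line.
# Keeps the header and the first occurrence of each record, in original order.
def dedupe_lines(text)
  header, *records = text.lines.map(&:chomp)
  return "" if header.nil? # empty input, nothing to clean

  ([header] + records.uniq).join("\n") + "\n"
end

# Example with made-up data: the repeated "a,1" record is dropped.
sample = "name,value\na,1\nb,2\na,1\n"
puts dedupe_lines(sample)
```

This only removes records that are exact copies. If the duplicates come from a failing background job, restarting the server and worker jobs is still the real fix.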
Leave me a comment if you have any questions or want to chat.