Copy PDFs and add to Lucene Index

What I’m about to publish below is one of my proudest accomplishments and, honestly, one of my favorite things to work on. If you’re not familiar with Lucene, the best way I can describe it is as a very powerful and robust search indexing library that can be used in an unbelievable number of ways. If you’ve ever wondered how some websites can return a massive number of search results so quickly, the answer is probably some combination of enterprise-level computing resources and an index like the ones Lucene builds.

Lucene basically takes whatever analyzer you instantiate in your code, which in turn runs what it calls “tokenizers”, to break the input text into searchable terms and store it as a “document”. I quote and use the term document loosely in this context because you’re not actually storing any files; rather, you create an instance of the Document class and call its Add method to add a field containing the input text. Each field has a name, and that name is what you later search against in the index. This is a very simplified version of how Lucene functions, but it gets the general idea across.
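To make that a little more concrete, here is a minimal sketch of the analyzer, Document, and field idea. It assumes the Lucene.Net 3.0.x API, which may not match the version used for the original project, and it keeps the index in memory so it can run on its own. The field names "path" and "contents" and the IndexSketch class are made up for illustration, and extracting the text out of a PDF is left out entirely.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using LuceneVersion = Lucene.Net.Util.Version;

// Hypothetical helper, assuming Lucene.Net 3.0.x.
public static class IndexSketch
{
    // Indexes one piece of text, then returns the stored "path" of the
    // best match for queryText, or null when nothing matches.
    public static string IndexAndSearch(string path, string text, string queryText)
    {
        // In-memory index for the sketch; a real index would live on disk.
        var directory = new RAMDirectory();

        // The analyzer decides how the input text is tokenized.
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_30);

        using (var writer = new IndexWriter(directory, analyzer,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();

            // Stored as-is so it can be read back from a search hit.
            doc.Add(new Field("path", path,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));

            // Tokenized and searchable, but the raw text is not kept.
            doc.Add(new Field("contents", text,
                              Field.Store.NO, Field.Index.ANALYZED));

            writer.AddDocument(doc);
        } // disposing the writer commits the document

        using (var searcher = new IndexSearcher(directory, true))
        {
            // Search against the named field using the same analyzer.
            var parser = new QueryParser(LuceneVersion.LUCENE_30, "contents", analyzer);
            var hits = searcher.Search(parser.Parse(queryText), 1);

            return hits.TotalHits == 0
                ? null
                : searcher.Doc(hits.ScoreDocs[0].Doc).Get("path");
        }
    }
}
```

Calling IndexSketch.IndexAndSearch with a file path, some text, and a word from that text should hand back the same path. The point is only that the "document" is a bag of named fields, not the file itself.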

I won’t lie; most of the more up-to-date documentation is actually published for the Java implementation. It seemed to me at the time that the C# port of Lucene was a couple of versions behind, and some critical things were different enough that I had to scavenge the web for other users trying to implement similar scenarios to finally get this axle to turn.

The solution below was developed out of a need for the college to provide a way to search course syllabi and other pertinent information such as instructor CVs and course/instructor evaluations. It depends on the administrative assistants of the various departments having uploaded the necessary documents into their respective folder structures, and on the WebDAV share for each course being mapped on the workstation. This console app runs in the background as a scheduled task. It searches that WebDAV share for all PDF files, copies them over into a share on the Blackboard application server, adds them to the index, and then finally cleans up the index by removing any documents that have been removed from the course and no longer need to be searched.
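The file-handling half of that workflow can be sketched with nothing but System.IO. This is a rough illustration under my own assumptions, not the original app: the SyllabusSync class, its method names, and the idea of tracking indexed files by path are all hypothetical, and the Lucene calls are left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper illustrating the copy and cleanup steps.
public static class SyllabusSync
{
    // Copies every PDF under sourceRoot into destRoot, keeping the
    // folder layout, and returns the destination paths.
    public static List<string> CopyPdfs(string sourceRoot, string destRoot)
    {
        var copied = new List<string>();

        foreach (var source in Directory.EnumerateFiles(
                     sourceRoot, "*.pdf", SearchOption.AllDirectories))
        {
            // Path of the file relative to the source root.
            var relative = source.Substring(sourceRoot.Length)
                .TrimStart(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);

            var target = Path.Combine(destRoot, relative);
            Directory.CreateDirectory(Path.GetDirectoryName(target));
            File.Copy(source, target, overwrite: true);
            copied.Add(target);
        }

        return copied;
    }

    // Returns the indexed paths that no longer have a matching file,
    // i.e. the entries that should be removed from the index.
    public static List<string> FindStale(
        IEnumerable<string> indexedPaths, IEnumerable<string> currentPaths)
    {
        var current = new HashSet<string>(currentPaths, StringComparer.OrdinalIgnoreCase);
        return indexedPaths.Where(p => !current.Contains(p)).ToList();
    }
}
```

In a run, the list returned by CopyPdfs would feed the indexing step, and each path returned by FindStale would be deleted from the index. With Lucene.Net 3.0.x that deletion would most likely be a call to IndexWriter.DeleteDocuments with a Term on the path field, assuming the path was indexed as a single untokenized value.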

 
