From 85f2dfffa69f33058e643ffd7edd01e6137a5059 Mon Sep 17 00:00:00 2001
From: Jashank Jeremy
Date: Wed, 31 Oct 2012 22:19:59 +1100
Subject: [PATCH 1/2] faster_lsi: Massively accelerate LSI performance.

Currently, Classifier::LSI rebuilds the index every time an entry is
added. This runs into massive performance overheads on my website;
theoretically, disabling automatic index rebuilds, and explicitly
rebuilding the LSI index at the end of the LSI repopulation should
speed things up nicely.

As a side note, here, I use pandoc-ruby to provide a more featureful
Markdown transformer, so be mindful that the numbers I quote here have
artificially imposed I/O overheads.

With just the 76 posts I wrote this year (abysmal, I know), I come up
with the following figures:

Without faster_lsi:
    jekyll --lsi  16.91s user 0.88s system 97% cpu 18.302 total

With faster_lsi:
    jekyll --lsi  2.72s user 0.77s system 88% cpu 3.940 total

With 109 posts, we begin to see even better improvements:

Without faster_lsi:
    jekyll --lsi  51.00s user 1.47s system 98% cpu 53.060 total

With faster_lsi:
    jekyll --lsi  5.04s user 1.12s system 91% cpu 6.735 total

At this point, we begin to see I/O overheads being slower than LSI
when faster_lsi is active. I call that fairly conclusive.

But wait, there's more. I have 273 posts lying around... I wonder
what happens if I feed them all in. With faster_lsi, it was nice and
clippy. Without it, I simply gave up, and went and refilled my cup of
tea. And it was still going.

Without faster_lsi:
    jekyll --lsi  1277.86s user 10.90s system 99% cpu 21:30.29 total

With faster_lsi:
    jekyll --lsi  34.62s user 4.43s system 96% cpu 40.430 total

That is, in anyone's books, a major improvement.

Note, however, that I don't know just how well this will perform with
`jekyll --auto` because I don't know how it does the LSI rebuilds. I
_think_ (but please, don't commit me on this) that the LSI is rebuilt
every time Jekyll picks up a file change.

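For illustration only, the effect of deferring the rebuild can be sketched
with a toy stand-in. `ToyIndex` and `populate` below are hypothetical names,
not part of Classifier or Jekyll, and the model assumes a rebuild merely
touches every stored item once; a real LSI rebuild is likely costlier than
that, so this counts rebuild work rather than predicting timings.

```ruby
# ToyIndex is a hypothetical stand-in, NOT Classifier::LSI. It only counts
# how much work a rebuild-per-insert policy does versus a single rebuild.
class ToyIndex
  attr_reader :rebuilds, :items_processed

  def initialize(auto_rebuild: true)
    @auto_rebuild = auto_rebuild
    @items = []
    @rebuilds = 0
    @items_processed = 0
  end

  # Store an item; rebuild immediately only when auto_rebuild is on.
  def add_item(item)
    @items << item
    build_index if @auto_rebuild
  end

  # Assumption: one rebuild touches every item currently stored.
  def build_index
    @rebuilds += 1
    @items_processed += @items.size
  end
end

# Fill an index with posts, rebuilding once at the end if auto_rebuild is off.
def populate(posts, auto_rebuild:)
  index = ToyIndex.new(auto_rebuild: auto_rebuild)
  posts.each { |post| index.add_item(post) }
  index.build_index unless auto_rebuild
  index
end

posts   = (1..273).map { |n| "post #{n}" }
eager   = populate(posts, auto_rebuild: true)
batched = populate(posts, auto_rebuild: false)

puts "eager:   #{eager.rebuilds} rebuilds, #{eager.items_processed} items processed"
puts "batched: #{batched.rebuilds} rebuilds, #{batched.items_processed} items processed"
```

Under that assumption, rebuilding on every insert processes n(n+1)/2 items
for n posts, while a single rebuild at the end processes n.
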
So, all up, the performance improvement is massive, and scales
depending on how many files you have. At the last point, the
improvement is just on 3200%.

A more optimal solution would be to cache the LSI index and/or content
data somehow. I'll leave that to when faster_lsi takes over ten
minutes to run.
---
 lib/jekyll/post.rb | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/lib/jekyll/post.rb b/lib/jekyll/post.rb
index a84c9ab9..d028a290 100644
--- a/lib/jekyll/post.rb
+++ b/lib/jekyll/post.rb
@@ -162,9 +162,12 @@ module Jekyll
 
       if self.site.lsi
         self.class.lsi ||= begin
-          puts "Running the classifier... this could take a while."
-          lsi = Classifier::LSI.new
+          puts "Starting the classifier..."
+          lsi = Classifier::LSI.new :auto_rebuild => false
+          $stdout.print(" Populating LSI... ");$stdout.flush
           posts.each { |x| $stdout.print(".");$stdout.flush;lsi.add_item(x) }
+          $stdout.print("\n Rebuilding LSI index... ")
+          lsi.build_index
           puts ""
           lsi
         end

From 68333cd221ad7f1c0138aa21481d76005f09842d Mon Sep 17 00:00:00 2001
From: Jashank Jeremy
Date: Fri, 11 Jan 2013 20:02:31 +1100
Subject: [PATCH 2/2] Slight stylistic tweak to LSI initialisation.

Recommended-by: parkr
---
 lib/jekyll/post.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/jekyll/post.rb b/lib/jekyll/post.rb
index d028a290..91eacc6d 100644
--- a/lib/jekyll/post.rb
+++ b/lib/jekyll/post.rb
@@ -163,7 +163,7 @@ module Jekyll
       if self.site.lsi
         self.class.lsi ||= begin
           puts "Starting the classifier..."
-          lsi = Classifier::LSI.new :auto_rebuild => false
+          lsi = Classifier::LSI.new(:auto_rebuild => false)
           $stdout.print(" Populating LSI... ");$stdout.flush
           posts.each { |x| $stdout.print(".");$stdout.flush;lsi.add_item(x) }
           $stdout.print("\n Rebuilding LSI index... ")