annotate rhodecode/lib/indexers/daemon.py @ 2031:82a88013a3fd

merge 1.3 into stable
author Marcin Kuzminski <marcin@python-works.com>
date Sun, 26 Feb 2012 17:25:09 +0200
parents bf263968da47 324ac367a4da
children dc2584ba5fbc
rev   line source
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
1 # -*- coding: utf-8 -*-
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
2 """
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
3 rhodecode.lib.indexers.daemon
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
5
1377
78e5853df5c8 fixed daemon typos
Marcin Kuzminski <marcin@python-works.com>
parents: 1206
diff changeset
6 A daemon that reads from the task table and runs tasks
947
99850ac883d1 Fixed whoosh daemon, for depracated walk method
Marcin Kuzminski <marcin@python-works.com>
parents: 902
diff changeset
7
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
8 :created_on: Jan 26, 2010
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
9 :author: marcink
1824
89efedac4e6c 2012 copyrights
Marcin Kuzminski <marcin@python-works.com>
parents: 1711
diff changeset
10 :copyright: (C) 2010-2012 Marcin Kuzminski <marcin@python-works.com>
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
11 :license: GPLv3, see COPYING for more details.
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
12 """
1206
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
13 # This program is free software: you can redistribute it and/or modify
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
14 # it under the terms of the GNU General Public License as published by
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
15 # the Free Software Foundation, either version 3 of the License, or
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
16 # (at your option) any later version.
947
99850ac883d1 Fixed whoosh daemon, for depracated walk method
Marcin Kuzminski <marcin@python-works.com>
parents: 902
diff changeset
17 #
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
18 # This program is distributed in the hope that it will be useful,
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
19 # but WITHOUT ANY WARRANTY; without even the implied warranty of
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
20 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
21 # GNU General Public License for more details.
947
99850ac883d1 Fixed whoosh daemon, for depracated walk method
Marcin Kuzminski <marcin@python-works.com>
parents: 902
diff changeset
22 #
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
23 # You should have received a copy of the GNU General Public License
1206
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
24 # along with this program. If not, see <http://www.gnu.org/licenses/>.
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
25
1154
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
26 import os
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
27 import sys
1154
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
28 import logging
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
29 import traceback
1154
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
30
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
31 from shutil import rmtree
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
32 from time import mktime
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
33
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
34 from os.path import dirname as dn
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
35 from os.path import join as jn
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
36
547
1e757ac98988 renamed project to rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 497
diff changeset
37 # add the project root to sys.path so that rhodecode can be imported
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
38 project_path = dn(dn(dn(dn(os.path.realpath(__file__)))))
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
39 sys.path.append(project_path)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
40
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
41
691
7486da5f0628 Refactor codes for scm model
Marcin Kuzminski <marcin@python-works.com>
parents: 683
diff changeset
42 from rhodecode.model.scm import ScmModel
1154
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
43 from rhodecode.lib import safe_unicode
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
44 from rhodecode.lib.indexers import INDEX_EXTENSIONS, SCHEMA, IDX_NAME
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
45
2007
324ac367a4da Added VCS into rhodecode core for faster and easier deployments of new versions
Marcin Kuzminski <marcin@python-works.com>
parents: 1995
diff changeset
46 from rhodecode.lib.vcs.exceptions import ChangesetError, RepositoryError, \
1711
b369bec5d468 fixes issue with whoosh reindexing files that were removed or renamed
Marcin Kuzminski <marcin@python-works.com>
parents: 1451
diff changeset
47 NodeDoesNotExistError
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
48
1154
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
49 from whoosh.index import create_in, open_dir
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
50
36fe593dfe4b simplified str2bool, and moved safe_unicode out of helpers since it was not html specific function
Marcin Kuzminski <marcin@python-works.com>
parents: 1036
diff changeset
51
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
52 log = logging.getLogger('whooshIndexer')
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
53 # create logger
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
54 log.setLevel(logging.DEBUG)
491
fefffd6fd5f4 Added some more tests, rewrite testing schema, to autogenerate fresh db, new index.
Marcin Kuzminski <marcin@python-works.com>
parents: 483
diff changeset
55 log.propagate = False
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
56 # create console handler and set level to debug
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
57 ch = logging.StreamHandler()
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
58 ch.setLevel(logging.DEBUG)
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
59
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
60 # create formatter
1183
514efe34c255 fixes issue #146
Marcin Kuzminski <marcin@python-works.com>
parents: 1171
diff changeset
61 formatter = logging.Formatter("%(asctime)s - %(name)s -"
514efe34c255 fixes issue #146
Marcin Kuzminski <marcin@python-works.com>
parents: 1171
diff changeset
62 " %(levelname)s - %(message)s")
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
63
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
64 # add formatter to ch
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
65 ch.setFormatter(formatter)
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
66
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
67 # add ch to logger
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
68 log.addHandler(ch)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
69
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
70
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
71 class WhooshIndexingDaemon(object):
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
72 """
1377
78e5853df5c8 fixed daemon typos
Marcin Kuzminski <marcin@python-works.com>
parents: 1206
diff changeset
73 Daemon for atomic jobs
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
74 """
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
75
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
76 def __init__(self, indexname=IDX_NAME, index_location=None,
894
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
77 repo_location=None, sa=None, repo_list=None):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
78 self.indexname = indexname
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
79
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
80 self.index_location = index_location
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
81 if not index_location:
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
82 raise Exception('You have to provide index location')
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
83
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
84 self.repo_location = repo_location
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
85 if not repo_location:
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
86 raise Exception('You have to provide repositories location')
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
87
1036
405b80e4ccd5 Major refactoring, removed when possible calls to app globals.
Marcin Kuzminski <marcin@python-works.com>
parents: 947
diff changeset
88 self.repo_paths = ScmModel(sa).repo_scan(self.repo_location)
894
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
89
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
90 if repo_list:
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
91 filtered_repo_paths = {}
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
92 for repo_name, repo in self.repo_paths.items():
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
93 if repo_name in repo_list:
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
94 filtered_repo_paths[repo_name] = repo
894
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
95
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
96 self.repo_paths = filtered_repo_paths
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
97
465
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
98 self.initial = False
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
99 if not os.path.isdir(self.index_location):
763
0dad296d2a57 extended trending languages to more entries, implemented new faster and "fancy"
Marcin Kuzminski <marcin@python-works.com>
parents: 691
diff changeset
100 os.makedirs(self.index_location)
465
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
101 log.info('Cannot run incremental index since it does not'
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
102 ' yet exist, running full build')
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
103 self.initial = True
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
104
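The `repo_list` filtering done in `__init__` above can be sketched in isolation. This is a minimal Python 3 sketch, not the project's code: `filter_repos` and the sample dictionary are hypothetical, and plain strings stand in for the repository objects that `repo_scan` returns.

```python
def filter_repos(repo_paths, repo_list=None):
    """Return only the entries of repo_paths whose name is in repo_list.

    An empty or missing repo_list keeps everything, mirroring the
    `if repo_list:` guard in the constructor.
    """
    if not repo_list:
        return dict(repo_paths)
    wanted = set(repo_list)
    return {name: repo for name, repo in repo_paths.items() if name in wanted}


# Hypothetical scan result: repository name -> repository object (strings here).
scanned = {"alpha": "/srv/repos/alpha", "beta": "/srv/repos/beta"}
print(filter_repos(scanned, ["beta", "missing"]))
```

Names in `repo_list` that were not found by the scan are silently ignored, which matches the behaviour of the loop in the listing.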
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
105 def get_paths(self, repo):
683
341beaa9edba Implemented whoosh index building as paster command.
Marcin Kuzminski <marcin@python-works.com>
parents: 662
diff changeset
106 """recursive walk in root dir and return a set of all paths in that dir
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
107 based on repository walk function
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
108 """
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
109 index_paths_ = set()
567
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
110 try:
947
99850ac883d1 Fixed whoosh daemon, for depracated walk method
Marcin Kuzminski <marcin@python-works.com>
parents: 902
diff changeset
111 tip = repo.get_changeset('tip')
99850ac883d1 Fixed whoosh daemon, for depracated walk method
Marcin Kuzminski <marcin@python-works.com>
parents: 902
diff changeset
112 for topnode, dirs, files in tip.walk('/'):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
113 for f in files:
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
114 index_paths_.add(jn(repo.path, f.path))
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
115
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
116 except RepositoryError:
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
117 log.debug(traceback.format_exc())
567
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
118 pass
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
119 return index_paths_
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
120
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
121 def get_node(self, repo, path):
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
122 n_path = path[len(repo.path) + 1:]
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
123 node = repo.get_changeset().get_node(n_path)
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
124 return node
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
125
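`get_node` turns an absolute path from the index back into a repository-relative one by slicing off the repository root. The slicing step might look like this on its own; `to_repo_relative` and the sample paths are hypothetical, and the sketch assumes `repo_path` has no trailing separator.

```python
import posixpath


def to_repo_relative(repo_path, full_path):
    # Drop the repository root plus the single separator that follows it.
    return full_path[len(repo_path) + 1:]


repo_path = "/srv/repos/alpha"  # hypothetical repository root
full_path = posixpath.join(repo_path, "docs/index.rst")
print(to_repo_relative(repo_path, full_path))
```

Because the slice is purely positional, it only gives a sensible result when `full_path` really starts with `repo_path`; `get_paths` guarantees that by building the paths with `jn(repo.path, f.path)`.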
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
126 def get_node_mtime(self, node):
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
127 return mktime(node.last_changeset.date.timetuple())
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
128
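`get_node_mtime` converts the date of a node's last changeset into a Unix timestamp. A small standalone sketch of that conversion, assuming the date is a naive `datetime` in local time (`to_epoch_seconds` and the sample date are illustrative only):

```python
from datetime import datetime
from time import mktime


def to_epoch_seconds(dt):
    # timetuple() drops microseconds; mktime() reads the tuple as local time.
    return mktime(dt.timetuple())


changed = datetime(2012, 2, 26, 17, 25, 9)  # illustrative changeset date
stamp = to_epoch_seconds(changed)
print(stamp)
print(datetime.fromtimestamp(stamp) == changed)
```

The numeric value depends on the machine's time zone, so timestamps stored in the index are only comparable with timestamps computed under the same local-time settings.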
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
129 def add_doc(self, writer, path, repo, repo_name):
683
341beaa9edba Implemented whoosh index building as paster command.
Marcin Kuzminski <marcin@python-works.com>
parents: 662
diff changeset
130 """Add a document to the writer; this function itself fetches data from
341beaa9edba Implemented whoosh index building as paster command.
Marcin Kuzminski <marcin@python-works.com>
parents: 662
diff changeset
131 the instance of vcs backend"""
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
132 node = self.get_node(repo, path)
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
133
886
0736230c7d91 #92 removed content of binary files for whoosh indexer
Marcin Kuzminski <marcin@python-works.com>
parents: 885
diff changeset
134 # index content only for files with a chosen extension, and skip binary files
0736230c7d91 #92 removed content of binary files for whoosh indexer
Marcin Kuzminski <marcin@python-works.com>
parents: 885
diff changeset
135 if node.extension in INDEX_EXTENSIONS and not node.is_binary:
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
136
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
137 u_content = node.content
885
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
138 if not isinstance(u_content, unicode):
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
139 log.warning(' >> %s Could not get this content as unicode '
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
140 'replacing with empty content', path)
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
141 u_content = u''
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
142 else:
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
143 log.debug(' >> %s [WITH CONTENT]' % path)
94f7585af8a1 fixes to #92, updated changelog
Marcin Kuzminski <marcin@python-works.com>
parents: 777
diff changeset
144
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
145 else:
436
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
146 log.debug(' >> %s' % path)
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
147 # just index the file name without its content
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
148 u_content = u''
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
149
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
150 writer.add_document(owner=unicode(repo.contact),
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
151 repository=safe_unicode(repo_name),
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
152 path=safe_unicode(path),
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
153 content=u_content,
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
154 modtime=self.get_node_mtime(node),
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
155 extension=node.extension)
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
156
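The content-selection rule in `add_doc` above can be restated as a small pure function. This is a sketch in Python 3, where `str` plays the role of Python 2's `unicode`; `content_to_index` is a hypothetical helper, and the `INDEX_EXTENSIONS` set below is an invented subset rather than the project's real list.

```python
INDEX_EXTENSIONS = {"py", "rst", "txt"}  # hypothetical subset, for illustration only


def content_to_index(extension, is_binary, content):
    """Return the text to store for a file, or '' to index its name only."""
    if extension in INDEX_EXTENSIONS and not is_binary:
        if isinstance(content, str):
            return content
        # Content that is not text is replaced with empty content.
        return ""
    # Unlisted extensions and binary files are indexed by name only.
    return ""


print(repr(content_to_index("py", False, "x = 1")))
print(repr(content_to_index("png", True, b"\x89PNG")))
```

Either way the file still gets a document in the index; only the `content` field is left empty, so such files remain findable by path.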
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
157 def build_index(self):
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
158 if os.path.exists(self.index_location):
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
159 log.debug('removing previous index')
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
160 rmtree(self.index_location)
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
161
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
162 if not os.path.exists(self.index_location):
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
163 os.mkdir(self.index_location)
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
164
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
165 idx = create_in(self.index_location, SCHEMA, indexname=self.indexname)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
166 writer = idx.writer()
894
1fed3c9161bb fixes #90 + docs update
Marcin Kuzminski <marcin@python-works.com>
parents: 886
diff changeset
167
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
168 for repo_name, repo in self.repo_paths.items():
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
169 log.debug('building index @ %s' % repo.path)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
170
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
171 for idx_path in self.get_paths(repo):
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
172 self.add_doc(writer, idx_path, repo, repo_name)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
173
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
174 log.debug('>> COMMITTING CHANGES <<')
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
175 writer.commit(merge=True)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
176 log.debug('>>> FINISHED BUILDING INDEX <<<')
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
177
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
178 def update_index(self):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
179 log.debug('STARTING INCREMENTAL INDEXING UPDATE')
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
180
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
181 idx = open_dir(self.index_location, indexname=self.indexname)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
182 # The set of all paths in the index
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
183 indexed_paths = set()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
184 # The set of all paths we need to re-index
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
185 to_index = set()
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
186
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
187 reader = idx.reader()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
188 writer = idx.writer()
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
189
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
190 # Loop over the stored fields in the index
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
191 for fields in reader.all_stored_fields():
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
192 indexed_path = fields['path']
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
193 indexed_paths.add(indexed_path)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
194
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
195 repo = self.repo_paths[fields['repository']]
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
196
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
197 try:
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
198 node = self.get_node(repo, indexed_path)
1711
b369bec5d468 fixes issue with whoosh reindexing files that were removed or renamed
Marcin Kuzminski <marcin@python-works.com>
parents: 1451
diff changeset
199 except (ChangesetError, NodeDoesNotExistError):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
200 # This file was deleted since it was indexed
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
201 log.debug('removing from index %s' % indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
202 writer.delete_by_term('path', indexed_path)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
203
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
204 else:
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
205 # Check if this file was changed since it was indexed
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
206 indexed_time = fields['modtime']
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
207 mtime = self.get_node_mtime(node)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
208 if mtime > indexed_time:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
209 # The file has changed, delete it and add it to the list of
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
210 # files to reindex
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
211 log.debug('adding to reindex list %s' % indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
212 writer.delete_by_term('path', indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
213 to_index.add(indexed_path)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
214
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
215 # Loop over the files in the filesystem
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
216 # Assume we have a function that gathers the filenames of the
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
217 # documents to be indexed
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
218 for repo_name, repo in self.repo_paths.items():
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
219 for path in self.get_paths(repo):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
220 if path in to_index or path not in indexed_paths:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
221 # This is either a file that's changed, or a new file
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
222 # that wasn't indexed before. So index it!
1171
2ab211e0aecd changes for #56
Marcin Kuzminski <marcin@python-works.com>
parents: 1154
diff changeset
223 self.add_doc(writer, path, repo, repo_name)
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
224 log.debug('re indexing %s' % path)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
225
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
226 log.debug('>> COMMITTING CHANGES <<')
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
227 writer.commit(merge=True)
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
228 log.debug('>>> FINISHED REBUILDING INDEX <<<')
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
229
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
230 def run(self, full_index=False):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
231 """Run daemon"""
465
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
232 if full_index or self.initial:
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
233 self.build_index()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
234 else:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
235 self.update_index()
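
The decision logic in `update_index` above can be sketched without Whoosh or a repository backend. This is a minimal illustration, not the daemon's actual code: the function name `plan_incremental_update` and the two plain dicts (path to stored `modtime`, path to current mtime) are assumptions introduced here to stand in for `reader.all_stored_fields()` and `self.get_paths(repo)`.

```python
def plan_incremental_update(indexed, current):
    """Return (to_delete, to_add) for an incremental index refresh.

    indexed: dict of path -> modtime stored in the index (hypothetical shape)
    current: dict of path -> mtime currently in the repository (hypothetical shape)
    """
    to_delete = set()   # paths whose indexed document must be removed
    to_index = set()    # paths that changed and must be indexed again

    # First pass: walk what the index already knows about.
    for path, indexed_time in indexed.items():
        if path not in current:
            # File was deleted (or renamed) since it was indexed.
            to_delete.add(path)
        elif current[path] > indexed_time:
            # File changed: drop the stale document and re-add it later.
            to_delete.add(path)
            to_index.add(path)

    # Second pass: walk the repository; pick up changed and brand-new files.
    to_add = {p for p in current if p in to_index or p not in indexed}
    return to_delete, to_add


indexed = {"a.py": 10, "b.py": 10, "gone.py": 5}
current = {"a.py": 10, "b.py": 12, "new.py": 1}
to_delete, to_add = plan_incremental_update(indexed, current)
print(sorted(to_delete))  # ['b.py', 'gone.py']
print(sorted(to_add))     # ['b.py', 'new.py']
```

In the daemon itself, the deletions correspond to `writer.delete_by_term('path', ...)` and the additions to `self.add_doc(...)`, all applied through one writer before a single `writer.commit(merge=True)`.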