annotate rhodecode/lib/indexers/daemon.py @ 627:8d78d79b67a3

fixed bug when whoosh failed indexing an new repository.
author Marcin Kuzminski <marcin@python-works.com>
date Wed, 20 Oct 2010 17:57:55 +0200
parents 95a502d94860
children 7e536d1af60d
rev   line source
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
1 #!/usr/bin/env python
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
2 # encoding: utf-8
549
f99075170eb4 more renames for rhode code !!
Marcin Kuzminski <marcin@python-works.com>
parents: 547
diff changeset
3 # whoosh indexer daemon for rhodecode
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
4 # Copyright (C) 2009-2010 Marcin Kuzminski <marcin@python-works.com>
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
5 #
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
6 # This program is free software; you can redistribute it and/or
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
7 # modify it under the terms of the GNU General Public License
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
8 # as published by the Free Software Foundation; version 2
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
9 # of the License or (at your option) any later version of the license.
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
10 #
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
11 # This program is distributed in the hope that it will be useful,
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
12 # but WITHOUT ANY WARRANTY; without even the implied warranty of
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
13 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
14 # GNU General Public License for more details.
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
15 #
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
16 # You should have received a copy of the GNU General Public License
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
17 # along with this program; if not, write to the Free Software
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
18 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
19 # MA 02110-1301, USA.
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
20 """
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
21 Created on Jan 26, 2010
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
22
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
23 @author: marcink
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
24 A daemon that reads from the task table and runs tasks
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
25 """
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
26 import sys
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
27 import os
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
28 from os.path import dirname as dn
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
29 from os.path import join as jn
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
30
547
1e757ac98988 renamed project to rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 497
diff changeset
31 #to get the rhodecode import
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
32 project_path = dn(dn(dn(dn(os.path.realpath(__file__)))))
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
33 sys.path.append(project_path)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
34
547
1e757ac98988 renamed project to rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 497
diff changeset
35 from rhodecode.lib.pidlock import LockHeld, DaemonLock
1e757ac98988 renamed project to rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 497
diff changeset
36 from rhodecode.model.hg_model import HgModel
1e757ac98988 renamed project to rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 497
diff changeset
37 from rhodecode.lib.helpers import safe_unicode
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
38 from whoosh.index import create_in, open_dir
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
39 from shutil import rmtree
547
1e757ac98988 renamed project to rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 497
diff changeset
40 from rhodecode.lib.indexers import INDEX_EXTENSIONS, IDX_LOCATION, SCHEMA, IDX_NAME
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
41
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
42 from time import mktime
567
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
43 from vcs.exceptions import ChangesetError, RepositoryError
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
44
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
45 import logging
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
46
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
47 log = logging.getLogger('whooshIndexer')
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
48 # create logger
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
49 log.setLevel(logging.DEBUG)
491
fefffd6fd5f4 Added some more tests, rewrite testing schema, to autogenerate fresh db, new index.
Marcin Kuzminski <marcin@python-works.com>
parents: 483
diff changeset
50 log.propagate = False
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
51 # create console handler and set level to debug
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
52 ch = logging.StreamHandler()
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
53 ch.setLevel(logging.DEBUG)
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
54
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
55 # create formatter
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
56 formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
57
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
58 # add formatter to ch
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
59 ch.setFormatter(formatter)
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
60
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
61 # add ch to logger
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
62 log.addHandler(ch)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
63
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
64 def scan_paths(root_location):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
65 return HgModel.repo_scan('/', root_location, None, True)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
66
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
67 class WhooshIndexingDaemon(object):
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
68 """
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
69 Daemon for atomic jobs
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
70 """
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
71
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
72 def __init__(self, indexname='HG_INDEX', repo_location=None):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
73 self.indexname = indexname
411
9b67cebe6609 some fixes to whoosh indexer daemon
Marcin Kuzminski <marcin@python-works.com>
parents: 407
diff changeset
74 self.repo_location = repo_location
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
75 self.repo_paths = scan_paths(self.repo_location)
465
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
76 self.initial = False
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
77 if not os.path.isdir(IDX_LOCATION):
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
78 os.mkdir(IDX_LOCATION)
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
79 log.info('Cannot run incremental index since it does not'
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
80 ' yet exist running full build')
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
81 self.initial = True
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
82
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
83 def get_paths(self, repo):
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
84 """
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
85 recursive walk in root dir and return a set of all paths in that dir
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
86 based on repository walk function
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
87 """
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
88 index_paths_ = set()
567
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
89 try:
627
8d78d79b67a3 fixed bug when whoosh failed indexing an new repository.
Marcin Kuzminski <marcin@python-works.com>
parents: 603
diff changeset
90 tip = repo.get_changeset()
8d78d79b67a3 fixed bug when whoosh failed indexing an new repository.
Marcin Kuzminski <marcin@python-works.com>
parents: 603
diff changeset
91
603
95a502d94860 removed soon deprecated walk method on repository instance
Marcin Kuzminski <marcin@python-works.com>
parents: 567
diff changeset
92 for topnode, dirs, files in tip.walk('/'):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
93 for f in files:
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
94 index_paths_.add(jn(repo.path, f.path))
567
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
95 for dir in dirs:
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
96 for f in files:
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
97 index_paths_.add(jn(repo.path, f.path))
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
98
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
99 except RepositoryError:
80dc0a23edf7 fixed whoosh failure on new repository
Marcin Kuzminski <marcin@python-works.com>
parents: 561
diff changeset
100 pass
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
101 return index_paths_
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
102
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
103 def get_node(self, repo, path):
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
104 n_path = path[len(repo.path) + 1:]
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
105 node = repo.get_changeset().get_node(n_path)
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
106 return node
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
107
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
108 def get_node_mtime(self, node):
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
109 return mktime(node.last_changeset.date.timetuple())
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
110
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
111 def add_doc(self, writer, path, repo):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
112 """Adding doc to writer"""
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
113 node = self.get_node(repo, path)
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
114
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
115 #we just index the content of chosen files
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
116 if node.extension in INDEX_EXTENSIONS:
436
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
117 log.debug(' >> %s [WITH CONTENT]' % path)
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
118 u_content = node.content
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
119 else:
436
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
120 log.debug(' >> %s' % path)
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
121 #just index file name without its content
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
122 u_content = u''
441
c59c4d4323e7 added support for broken symlinks in whoosh indexer
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
123
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
124 writer.add_document(owner=unicode(repo.contact),
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
125 repository=safe_unicode(repo.name),
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
126 path=safe_unicode(path),
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
127 content=u_content,
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
128 modtime=self.get_node_mtime(node),
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
129 extension=node.extension)
441
c59c4d4323e7 added support for broken symlinks in whoosh indexer
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
130
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
131
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
132 def build_index(self):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
133 if os.path.exists(IDX_LOCATION):
560
3072935bdeed rewrote whoosh indexing to run internal repository.walk() instead of filesystem.
Marcin Kuzminski <marcin@python-works.com>
parents: 557
diff changeset
134 log.debug('removing previous index')
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
135 rmtree(IDX_LOCATION)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
136
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
137 if not os.path.exists(IDX_LOCATION):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
138 os.mkdir(IDX_LOCATION)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
139
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
140 idx = create_in(IDX_LOCATION, SCHEMA, indexname=IDX_NAME)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
141 writer = idx.writer()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
142
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
143 for cnt, repo in enumerate(self.repo_paths.values()):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
144 log.debug('building index @ %s' % repo.path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
145
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
146 for idx_path in self.get_paths(repo):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
147 self.add_doc(writer, idx_path, repo)
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
148
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
149 log.debug('>> COMMITTING CHANGES <<')
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
150 writer.commit(merge=True)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
151 log.debug('>>> FINISHED BUILDING INDEX <<<')
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
152
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
153
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
154 def update_index(self):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
155 log.debug('STARTING INCREMENTAL INDEXING UPDATE')
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
156
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
157 idx = open_dir(IDX_LOCATION, indexname=self.indexname)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
158 # The set of all paths in the index
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
159 indexed_paths = set()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
160 # The set of all paths we need to re-index
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
161 to_index = set()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
162
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
163 reader = idx.reader()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
164 writer = idx.writer()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
165
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
166 # Loop over the stored fields in the index
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
167 for fields in reader.all_stored_fields():
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
168 indexed_path = fields['path']
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
169 indexed_paths.add(indexed_path)
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
170
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
171 repo = self.repo_paths[fields['repository']]
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
172
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
173 try:
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
174 node = self.get_node(repo, indexed_path)
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
175 except ChangesetError:
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
176 # This file was deleted since it was indexed
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
177 log.debug('removing from index %s' % indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
178 writer.delete_by_term('path', indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
179
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
180 else:
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
181 # Check if this file was changed since it was indexed
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
182 indexed_time = fields['modtime']
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
183 mtime = self.get_node_mtime(node)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
184 if mtime > indexed_time:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
185 # The file has changed, delete it and add it to the list of
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
186 # files to reindex
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
187 log.debug('adding to reindex list %s' % indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
188 writer.delete_by_term('path', indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
189 to_index.add(indexed_path)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
190
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
191 # Loop over the files in the filesystem
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
192 # Assume we have a function that gathers the filenames of the
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
193 # documents to be indexed
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
194 for repo in self.repo_paths.values():
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
195 for path in self.get_paths(repo):
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
196 if path in to_index or path not in indexed_paths:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
197 # This is either a file that's changed, or a new file
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
198 # that wasn't indexed before. So index it!
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
199 self.add_doc(writer, path, repo)
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
200 log.debug('reindexing %s' % path)
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
201
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
202 log.debug('>> COMMITTING CHANGES <<')
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
203 writer.commit(merge=True)
561
5f3b967d9d10 fixed reindexing, and made some optimizations to reuse repo instances from repo scann list.
Marcin Kuzminski <marcin@python-works.com>
parents: 560
diff changeset
204 log.debug('>>> FINISHED REBUILDING INDEX <<<')
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
205
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
206 def run(self, full_index=False):
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
207 """Run daemon"""
465
e01a85f9fc90 fixed initial whoosh indexer. Build full index on first run even with incremental flag
Marcin Kuzminski <marcin@python-works.com>
parents: 452
diff changeset
208 if full_index or self.initial:
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
209 self.build_index()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
210 else:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
211 self.update_index()
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
212
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
213 if __name__ == "__main__":
451
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
214 arg = sys.argv[1:]
452
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
215 if len(arg) != 2:
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
216 sys.stderr.write('Please specify indexing type [full|incremental] '
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
217 'and path to repositories as script args \n')
451
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
218 sys.exit(1)
452
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
219
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
220
451
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
221 if arg[0] == 'full':
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
222 full_index = True
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
223 elif arg[0] == 'incremental':
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
224 # False means looking just for changes
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
225 full_index = False
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
226 else:
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
227 sys.stderr.write('Please use [full|incremental]'
452
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
228 ' as script first arg \n')
451
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
229 sys.exit(1)
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
230
452
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
231 if not os.path.isdir(arg[1]):
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
232 sys.stderr.write('%s is not a valid path \n' % arg[1])
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
233 sys.exit(1)
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
234 else:
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
235 if arg[1].endswith('/'):
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
236 repo_location = arg[1] + '*'
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
237 else:
f19d3ee89335 updated whoosh indexer to take path as second argument
Marcin Kuzminski <marcin@python-works.com>
parents: 451
diff changeset
238 repo_location = arg[1] + '/*'
451
d726f62f886e updated whoosh indexer to take index building argument type
Marcin Kuzminski <marcin@python-works.com>
parents: 443
diff changeset
239
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
240 try:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
241 l = DaemonLock()
436
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
242 WhooshIndexingDaemon(repo_location=repo_location)\
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 411
diff changeset
243 .run(full_index=full_index)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
244 l.release()
483
a9e50dce3081 Removed config names from whoosh and celery,
Marcin Kuzminski <marcin@python-works.com>
parents: 465
diff changeset
245 reload(logging)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
246 except LockHeld:
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
247 sys.exit(1)
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
248