id: 01936196 dt: j an: 01936196 au: Lee, Dik Lun; Ren, Liming ti: Document ranking on weight-partitioned signature files. so: ACM Trans. Inf. Syst. 14, No. 2, 109-137 (1996). py: 1996 pu: Association for Computing Machinery (ACM), New York, NY la: EN cc: H.3.3 H.2.2 H.3.6 H.3.1 ut: algorithms; design; experimentation; performance ci: li: doi:10.1145/226163.226164 ab: Summary: A signature file organization, called the weight-partitioned signature file, for supporting document ranking is proposed. It employs multiple signature files, each of which corresponds to one term frequency, to represent terms with different term frequencies. Words with the same term frequency in a document are grouped together and hashed into the signature file corresponding to that term frequency. This eliminates the need to record the term frequency explicitly for each word. We investigate the effect of false drops on retrieval effectiveness if they are not eliminated in the search process. We have shown that false drops introduce insignificant degradation on precision and recall when the false-drop probability is below a certain threshold. This is an important result since false-drop elimination could become the bottleneck in systems using fast signature file search techniques. We perform an analytical study on the performance of the weight-partitioned signature file under different search strategies and configurations. An optimal formula is obtained to determine for a fixed total storage overhead the storage to be allocated to each partition in order to minimize the effect of false drops on document ranks. Experiments were performed using a document collection to support the analytical results. (Provider: ACM) Review: A modified vector-space technique for ranking documents during retrieval is described. The ranking is based on similarity to a user-supplied query string. 
To improve retrieval response time on large repositories, the authors introduce weight-partitioned signature files, with one signature file per term frequency. With an optimal signature configuration, postprocessing (“false-drop” elimination) can be removed by assigning different lengths to the signature files. The length of each file depends on its frequency weight: higher-frequency files are longer. The authors thus show that, even without postprocessing, they can reduce the effect of false drops without sacrificing recall and precision. While the authors briefly describe both inversion and signature files, they do not discuss existing vector space and clustering models [1]. Their approach is a modification of existing vector space techniques. They introduce an implied search order of vector groups produced by a document. Each signature grouping has a common term frequency and signature length. As a result, any document-to-document vector similarity function, such as the cosine similarity function, can be used. A comparison to the existing vector cluster searching methods would have been appropriate. One example is latent semantic indexing (LSI), which applies singular value decomposition [2]. Overall, the paper’s ideas are well organized and well presented. I recommend it to anyone interested in document retrieval techniques. The authors’ idea of weight-partitioned files is a good extension of vector models that support weight-based querying. However, the lack of comparison to other indexing techniques, such as LSI, which can use spatial access methods, is a significant flaw. (Provider: ACM) rv: