Semantic search


Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query.[1] Semantic search seeks to improve search accuracy by understanding the searcher's intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results.

Some authors regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies and XML as found on the Semantic Web.[2] Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify their intent in more detail at query time.[3] The articulation enhances content relevance and depth by including specific places, people, or concepts relevant to the query.[4]

Models and tools


Tools like Google's Knowledge Graph provide structured relationships between entities to enrich query interpretation.[5]

Models like BERT and Sentence-BERT convert words or sentences into dense vectors for similarity comparison.[6]
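
A minimal sketch of this idea, assuming the sentence-transformers package (the model name "all-MiniLM-L6-v2" is an illustrative choice, not one mandated by the sources):

```python
# Encode sentences into dense vectors and compare them by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "Best hiking trails near Denver",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# The first two sentences score high despite sharing almost no words,
# which is exactly the kind of match lexical search would miss.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```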

Semantic ontologies like the Web Ontology Language (OWL), the Resource Description Framework (RDF), and Schema.org organize concepts and relationships, allowing systems to infer related terms and deeper meanings.[7]
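
As an illustrative sketch (assuming the rdflib package; the namespace and triples are hypothetical), an RDF graph queried with SPARQL can separate two senses of the same label:

```python
# Structured RDF data can answer a question that bare keyword matching cannot:
# which "jaguar" is the animal?
from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# Hypothetical triples: "jaguar" as both an animal and a car brand.
g.add((EX.Jaguar_animal, RDF.type, EX.Animal))
g.add((EX.Jaguar_animal, RDFS.label, Literal("jaguar")))
g.add((EX.Jaguar_cars, RDF.type, EX.CarBrand))
g.add((EX.Jaguar_cars, RDFS.label, Literal("jaguar")))

# SPARQL retrieves only entities of the intended type.
query = """
SELECT ?entity WHERE {
    ?entity rdfs:label "jaguar" .
    ?entity a ex:Animal .
}
"""
for row in g.query(query, initNs={"ex": EX, "rdfs": RDFS}):
    print(row.entity)  # -> http://example.org/Jaguar_animal
```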

Hybrid search models combine lexical retrieval (e.g., BM25) with semantic ranking using pretrained transformer models for optimal performance.[8]
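
A minimal two-stage sketch of such a pipeline, assuming the rank_bm25 and sentence-transformers packages (the corpus, model name, and candidate cutoff are all illustrative): BM25 selects lexical candidates, which are then re-ranked by embedding similarity:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "How to renew a passport online",
    "Renewing your driving licence",
    "Passport renewal fees and processing times",
]
query = "get a new passport"

# Stage 1: lexical retrieval with BM25 over whitespace tokens.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: -scores[i])[:2]

# Stage 2: semantic re-ranking of the BM25 candidates.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode([corpus[i] for i in top_k], convert_to_tensor=True)
sims = util.cos_sim(query_emb, cand_embs)[0]
for i, s in sorted(zip(top_k, sims.tolist()), key=lambda p: -p[1]):
    print(f"{s:.3f}  {corpus[i]}")
```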

Algorithms and techniques for vector databases


Text representation and embedding algorithms


Term Frequency–Inverse Document Frequency (TF–IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection (corpus). It is widely used for information retrieval and text mining.[9]
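
In the classic formulation, a term t in document d is weighted as tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. A minimal sketch using scikit-learn's TfidfVectorizer (which applies a smoothed variant of this formula):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse (3 docs x vocabulary) matrix

# Terms that appear in fewer documents receive higher IDF weights.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:>7}: idf={idf:.2f}")
```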

Latent Semantic Analysis (LSA) is a technique in natural language processing that uncovers hidden relationships between words by reducing the dimensionality of the term-document matrix using Singular Value Decomposition (SVD).[10]
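
A minimal LSA sketch using scikit-learn: a TF-IDF term-document matrix is reduced to a few latent dimensions with truncated SVD, so topically similar documents end up close together even when they share few words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a chef cooks pasta in the kitchen",
]
X = TfidfVectorizer().fit_transform(corpus)

# Project documents into 2 latent dimensions via truncated SVD.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)
print(doc_topics)  # the two vehicle documents land close together
```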

Word2Vec is a neural network-based model that learns continuous vector representations of words (embeddings) by predicting surrounding words (context).[11]
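
A minimal sketch using the gensim package (the toy corpus is illustrative; meaningful embeddings require far more training text):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]
# sg=1 selects the skip-gram objective: predict context words from a word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1,
                 epochs=50)

vec = model.wv["king"]                        # 50-dimensional embedding
print(model.wv.most_similar("king", topn=2))  # neighbours by cosine similarity
```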

GloVe is an unsupervised learning algorithm for generating word embeddings by leveraging word co-occurrence statistics across a large corpus.[12]
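
A minimal sketch of loading pretrained GloVe vectors from the plain-text format distributed by the Stanford NLP group (the file name refers to one of the published downloads and is assumed to be present locally):

```python
import numpy as np

# Each line of the file is: word v1 v2 ... v100 (space-separated).
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = np.asarray(values, dtype="float32")

def cos(a, b):
    # Cosine similarity between two pretrained word vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(embeddings["king"], embeddings["queen"]))
```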

BERT is a transformer-based model that pre-trains deep bidirectional representations by conditioning on both left and right context in all layers.[13]
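
A minimal sketch, assuming the Hugging Face transformers package, of turning BERT's contextual token representations into a single fixed-size text embedding by mean pooling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["semantic search example"], return_tensors="pt",
                  padding=True, truncation=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (1, seq_len, 768)

# Average over real tokens only, using the attention mask to skip padding.
mask = batch["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```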

Sentence-BERT (S-BERT) extends BERT by fine-tuning it with a Siamese and triplet network structure to generate semantically meaningful sentence embeddings.[14]

Similarity measurement algorithms


Cosine Similarity measures the cosine of the angle between two vectors, commonly used for text similarity and recommendation systems.[15]

Euclidean Distance is the straight-line distance between two vectors in a multidimensional space.[16]

Dot-Product Similarity calculates similarity between vectors as the inner product, often used in neural networks and recommender systems.[17]
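
A minimal NumPy sketch contrasting the three measures above; note that cosine similarity ignores vector magnitude, while Euclidean distance and the dot product do not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

dot = float(a @ b)                                      # dot-product similarity
euclidean = float(np.linalg.norm(a - b))                # straight-line distance
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only

print(dot, euclidean, cosine)  # cosine is 1.0: the directions match exactly
```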

Indexing and retrieval algorithms


Approximate Nearest Neighbor (ANN) search algorithms aim to retrieve points close to a query point in high-dimensional space while trading off exact accuracy for speed and scalability.[18]

Hierarchical Navigable Small World (HNSW) is a graph-based ANN method that constructs a hierarchical, navigable small-world graph to achieve logarithmic search complexity.[19]
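
A minimal HNSW sketch, assuming the hnswlib package (all sizes and parameters are illustrative):

```python
import hnswlib
import numpy as np

dim = 64
data = np.random.rand(10_000, dim).astype("float32")

# Build the hierarchical navigable small-world graph.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data, np.arange(10_000))

index.set_ef(50)  # higher ef -> better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)
```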

Locality-Sensitive Hashing (LSH) is a method that hashes input vectors into buckets so that similar items are more likely to map to the same bucket, enabling sub-linear search time.[20]
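
A minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity: each hyperplane contributes one sign bit, and vectors with identical bit patterns fall into the same candidate bucket:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 64, 12
planes = rng.standard_normal((n_planes, dim))

def bucket_key(v):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple((planes @ v > 0).astype(int))

buckets = defaultdict(list)
vectors = rng.standard_normal((1000, dim))
for i, v in enumerate(vectors):
    buckets[bucket_key(v)].append(i)

# A near-duplicate of item 0 will likely (not certainly) share its bucket;
# LSH yields candidates that are then verified with an exact measure.
query = vectors[0] + 0.01 * rng.standard_normal(dim)
print(0 in buckets[bucket_key(query)])
```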

Inverted File Index (IVF) is a traditional indexing technique where a mapping is maintained from content (words, features) to their document or vector identifiers. It is often combined with clustering (e.g., k-means) for ANN search.[21]
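
A minimal IVF sketch, assuming the faiss package: database vectors are clustered by k-means at training time, and a query probes only its nearest clusters:

```python
import faiss
import numpy as np

d, nlist = 64, 100                      # dimension, number of clusters
xb = np.random.rand(10_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)        # coarse quantizer over centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                         # k-means over the database vectors
index.add(xb)

index.nprobe = 8                        # search only the 8 nearest clusters
D, I = index.search(xb[:1], 5)          # distances and ids of 5 neighbours
print(I, D)
```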


Applications

  • Web Search: Google and Bing integrate semantic models into their ranking algorithms.
  • E-commerce: Intent-based product searches improve conversion and discovery.[22]
  • Enterprise Search: Corporate systems use it for document retrieval, customer support, and knowledge management.[23]
  • Healthcare and Legal Research: Facilitates retrieval of case law, research articles, and clinical data.[24][25]

Challenges

  • Ambiguity and Polysemy (e.g., "jaguar" as an animal or a car brand)
  • Bias in Training Data[26]
  • Computational Costs of deep semantic models[27]
  • Multilingual Performance[28]

Future directions

  • Conversational Search and voice interfaces
  • Multimodal Search: Incorporating video, image, and text together[29]
  • Explainability and ethical transparency in semantic systems

References

  1. ^ Bast, Hannah; Buchhold, Björn; Haussmann, Elmar (2016). "Semantic search on text and knowledge bases". Foundations and Trends in Information Retrieval. 10 (2–3): 119–271. doi:10.1561/1500000032. Retrieved 1 December 2018.
  2. ^ Dong, Hai (2008). A survey in semantic search technologies. IEEE. pp. 403–408. Retrieved 1 May 2009.
  3. ^ Ruotsalo, T. (May 2012). "Domain Specific Data Retrieval on the Semantic Web". The Semantic Web: Research and Applications. ESWC 2012. Lecture Notes in Computer Science. Vol. 7295. pp. 422–436. doi:10.1007/978-3-642-30284-8_35. ISBN 978-3-642-30283-1.
  4. ^ Nowak, Ken (2024). What is semantic SEO?. WeAreKinetica. Retrieved 21 June 2024.
  5. ^ Singhal, Amit (2012). "Introducing the Knowledge Graph: things, not strings". Official Google Blog. https://blog.google/products/search/introducing-knowledge-graph-things-not/
  6. ^ Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". EMNLP 2019. https://arxiv.org/abs/1908.10084
  7. ^ Bodenreider, Olivier (2004). "The Unified Medical Language System (UMLS): integrating biomedical terminology". Nucleic Acids Research. 32 (suppl_1): D267–D270.
  8. ^ Lin, Jimmy; Nogueira, Rodrigo; Yates, Andrew (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. https://arxiv.org/abs/2010.06467
  9. ^ Jones, Karen Spärck (1972). "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation. 28 (1): 11–21. doi:10.1108/eb026526.
  10. ^ Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science. 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
  11. ^ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
  12. ^ Pennington, Jeffrey; Socher, Richard; Manning, Christopher D. (2014). GloVe: Global Vectors for Word Representation (PDF). Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.
  13. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
  14. ^ Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084 [cs.CL].
  15. ^ Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". IEEE Data Engineering Bulletin. 24 (4): 35–43.
  16. ^ Cover, T. M.; Hart, Peter E. (1967). "Nearest Neighbor Pattern Classification". IEEE Transactions on Information Theory. 13 (1): 21–27. doi:10.1109/TIT.1967.1053964.
  17. ^ Goldberg, David (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley. Bibcode:1989gaso.book.....G.
  18. ^ Indyk, Piotr; Motwani, Rajeev (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC. pp. 604–613. doi:10.1145/276698.276876.
  19. ^ Malkov, Yu. A.; Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (4): 824–836. arXiv:1603.09320. doi:10.1109/TPAMI.2018.2889473. PMID 30602420.
  20. ^ Indyk, Piotr; Motwani, Rajeev (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC. pp. 604–613. doi:10.1145/276698.276876.
  21. ^ Zobel, Justin; Moffat, Alistair (2006). "Inverted Files for Text Search Engines". ACM Computing Surveys. 38 (2): 6. doi:10.1145/1132956.1132959.
  22. ^ Amazon Science (2021). "Using neural retrieval for semantic product search". https://www.amazon.science/blog/using-neural-retrieval-for-semantic-product-search
  23. ^ IBM Research (2020). "Using AI and machine learning for smarter enterprise search". https://www.ibm.com/blogs/research/2020/11/ai-enterprise-search/
  24. ^ Wang, Q.; et al. (2020). "COVID-19 literature retrieval with semantic search". Nature. 582: 560–561.
  25. ^ Chalkidis, Ilias; et al. (2020). "LEGAL-BERT: The Muppets Straight Out of Law School". Findings of EMNLP 2020. https://arxiv.org/abs/2010.02559
  26. ^ Bender, Emily M.; et al. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?". FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
  27. ^ Schwartz, Roy; et al. (2019). "Green AI". Communications of the ACM. 63 (12): 54–63.
  28. ^ Pires, Telmo; Schlinger, Eva; Garrette, Dan (2019). "How multilingual is Multilingual BERT?". ACL 2019. https://arxiv.org/abs/1906.01502
  29. ^ Radford, Alec; et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). https://arxiv.org/abs/2103.00020