Semantic search


Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query.[1] Semantic search seeks to improve search accuracy by understanding the searcher's intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results.

Some authors regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies and XML as found on the Semantic Web.[2] Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify their intent in more detail at query time.[3] The articulation enhances content relevance and depth by including specific places, people, or concepts relevant to the query.[4]

Models and tools


Tools like Google's Knowledge Graph provide structured relationships between entities to enrich query interpretation.[5]

Models like BERT and Sentence-BERT convert words or sentences into dense vectors for similarity comparison.[6]
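
A minimal sketch of this idea, assuming the sentence-transformers package (the model name "all-MiniLM-L6-v2" is an illustrative choice, not one mandated by the sources):

```python
# Encode sentences into dense vectors and compare them by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "Best hiking trails near Denver",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# The first two sentences score high despite sharing almost no words,
# which is exactly the kind of match lexical search would miss.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```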

Semantic ontologies like the Web Ontology Language (OWL), the Resource Description Framework (RDF), and Schema.org organize concepts and relationships, allowing systems to infer related terms and deeper meanings.[7]
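
As an illustrative sketch (assuming the rdflib package; the namespace and triples are hypothetical), an RDF graph queried with SPARQL can separate two senses of the same label:

```python
# Structured RDF data can answer a question that bare keyword matching cannot:
# which "jaguar" is the animal?
from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# Hypothetical triples: "jaguar" as both an animal and a car brand.
g.add((EX.Jaguar_animal, RDF.type, EX.Animal))
g.add((EX.Jaguar_animal, RDFS.label, Literal("jaguar")))
g.add((EX.Jaguar_cars, RDF.type, EX.CarBrand))
g.add((EX.Jaguar_cars, RDFS.label, Literal("jaguar")))

# SPARQL retrieves only entities of the intended type.
query = """
SELECT ?entity WHERE {
    ?entity rdfs:label "jaguar" .
    ?entity a ex:Animal .
}
"""
for row in g.query(query, initNs={"ex": EX, "rdfs": RDFS}):
    print(row.entity)  # -> http://example.org/Jaguar_animal
```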

Hybrid search models combine lexical retrieval (e.g., BM25) with semantic ranking using pretrained transformer models for optimal performance.[8]
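
A minimal two-stage sketch of such a pipeline, assuming the rank_bm25 and sentence-transformers packages (the corpus, model name, and candidate cutoff are all illustrative): BM25 selects lexical candidates, which are then re-ranked by embedding similarity:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "How to renew a passport online",
    "Renewing your driving licence",
    "Passport renewal fees and processing times",
]
query = "get a new passport"

# Stage 1: lexical retrieval with BM25 over whitespace tokens.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: -scores[i])[:2]

# Stage 2: semantic re-ranking of the BM25 candidates.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode([corpus[i] for i in top_k], convert_to_tensor=True)
sims = util.cos_sim(query_emb, cand_embs)[0]
for i, s in sorted(zip(top_k, sims.tolist()), key=lambda p: -p[1]):
    print(f"{s:.3f}  {corpus[i]}")
```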

Algorithms and techniques for vector databases


Text representation and embedding algorithms


Term Frequency–Inverse Document Frequency (TF–IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection (corpus). It is widely used for information retrieval and text mining.[9]
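
In the classic formulation, a term t in document d is weighted as tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. A minimal sketch using scikit-learn's TfidfVectorizer (which applies a smoothed variant of this formula):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse (3 docs x vocabulary) matrix

# Terms that appear in fewer documents receive higher IDF weights.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:>7}: idf={idf:.2f}")
```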

Latent Semantic Analysis (LSA) is a technique in natural language processing that uncovers hidden relationships between words by reducing the dimensionality of the term-document matrix using Singular Value Decomposition (SVD).[10]
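
A minimal LSA sketch using scikit-learn: a TF-IDF term-document matrix is reduced to a few latent dimensions with truncated SVD, so topically similar documents end up close together even when they share few words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a chef cooks pasta in the kitchen",
]
X = TfidfVectorizer().fit_transform(corpus)

# Project documents into 2 latent dimensions via truncated SVD.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)
print(doc_topics)  # the two vehicle documents land close together
```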

Word2Vec is a neural network-based model that learns continuous vector representations of words (embeddings) by predicting surrounding words (context).[11]
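
A minimal sketch using the gensim package (the toy corpus is illustrative; meaningful embeddings require far more training text):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]
# sg=1 selects the skip-gram objective: predict context words from a word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1,
                 epochs=50)

vec = model.wv["king"]                        # 50-dimensional embedding
print(model.wv.most_similar("king", topn=2))  # neighbours by cosine similarity
```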

GloVe is an unsupervised learning algorithm for generating word embeddings by leveraging word co-occurrence statistics across a large corpus.[12]
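
A minimal sketch of loading pretrained GloVe vectors from the plain-text format distributed by the Stanford NLP group (the file name refers to one of the published downloads and is assumed to be present locally):

```python
import numpy as np

# Each line of the file is: word v1 v2 ... v100 (space-separated).
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = np.asarray(values, dtype="float32")

def cos(a, b):
    # Cosine similarity between two pretrained word vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(embeddings["king"], embeddings["queen"]))
```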

BERT is a transformer-based model that pre-trains deep bidirectional representations by conditioning on both left and right context in all layers.[13]
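
A minimal sketch, assuming the Hugging Face transformers package, of turning BERT's contextual token representations into a single fixed-size text embedding by mean pooling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["semantic search example"], return_tensors="pt",
                  padding=True, truncation=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (1, seq_len, 768)

# Average over real tokens only, using the attention mask to skip padding.
mask = batch["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```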

Sentence-BERT (S-BERT) extends BERT by fine-tuning it with a Siamese and triplet network structure to generate semantically meaningful sentence embeddings.[14]

Similarity measurement algorithms


Cosine Similarity measures the cosine of the angle between two vectors, commonly used for text similarity and recommendation systems.[15]

Euclidean Distance is the straight-line distance between two vectors in a multidimensional space.[16]

Dot-Product Similarity calculates similarity between vectors as the inner product, often used in neural networks and recommender systems.[17]
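
A minimal NumPy sketch contrasting the three measures above; note that cosine similarity ignores vector magnitude, while Euclidean distance and the dot product do not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

dot = float(a @ b)                                      # dot-product similarity
euclidean = float(np.linalg.norm(a - b))                # straight-line distance
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only

print(dot, euclidean, cosine)  # cosine is 1.0: the directions match exactly
```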

Indexing and retrieval algorithms


Approximate Nearest Neighbor (ANN) search algorithms aim to retrieve points close to a query point in high-dimensional space while trading off exact accuracy for speed and scalability.[18]

Hierarchical Navigable Small World (HNSW) is a graph-based ANN method that constructs a hierarchical, navigable small-world graph to achieve logarithmic search complexity.[19]
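
A minimal HNSW sketch, assuming the hnswlib package (all sizes and parameters are illustrative):

```python
import hnswlib
import numpy as np

dim = 64
data = np.random.rand(10_000, dim).astype("float32")

# Build the hierarchical navigable small-world graph.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data, np.arange(10_000))

index.set_ef(50)  # higher ef -> better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)
```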

Locality-Sensitive Hashing (LSH) is a method that hashes input vectors into buckets so that similar items are more likely to map to the same bucket, enabling sub-linear search time.[20]
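
A minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity: each hyperplane contributes one sign bit, and vectors with identical bit patterns fall into the same candidate bucket:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 64, 12
planes = rng.standard_normal((n_planes, dim))

def bucket_key(v):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple((planes @ v > 0).astype(int))

buckets = defaultdict(list)
vectors = rng.standard_normal((1000, dim))
for i, v in enumerate(vectors):
    buckets[bucket_key(v)].append(i)

# A near-duplicate of item 0 will likely (not certainly) share its bucket;
# LSH yields candidates that are then verified with an exact measure.
query = vectors[0] + 0.01 * rng.standard_normal(dim)
print(0 in buckets[bucket_key(query)])
```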

Inverted File Index (IVF) is a traditional indexing technique where a mapping is maintained from content (words, features) to their document or vector identifiers. It is often combined with clustering (e.g., k-means) for ANN search.[21]
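
A minimal IVF sketch, assuming the faiss package: database vectors are clustered by k-means at training time, and a query probes only its nearest clusters:

```python
import faiss
import numpy as np

d, nlist = 64, 100                      # dimension, number of clusters
xb = np.random.rand(10_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)        # coarse quantizer over centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                         # k-means over the database vectors
index.add(xb)

index.nprobe = 8                        # search only the 8 nearest clusters
D, I = index.search(xb[:1], 5)          # distances and ids of 5 neighbours
print(I, D)
```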


Applications

  • Web Search: Google and Bing integrate semantic models into their ranking algorithms.
  • E-commerce: Intent-based product searches improve conversion and discovery.[22]
  • Enterprise Search: Corporate systems use it for document retrieval, customer support, and knowledge management.[23]
  • Healthcare and Legal Research: Facilitates retrieval of case law, research articles, and clinical data.[24][25]

Challenges

  • Ambiguity and Polysemy (e.g., "jaguar" as an animal or a car brand)
  • Bias in Training Data[26]
  • Computational Costs of deep semantic models[27]
  • Multilingual Performance[28]

Future directions

  • Conversational Search and voice interfaces
  • Multimodal Search: Incorporating video, image, and text together[29]
  • Explainability and ethical transparency in semantic systems

References

  1. ^ Bast, Hannah; Buchhold, Björn; Haussmann, Elmar (2016). "Semantic search on text and knowledge bases". Foundations and Trends in Information Retrieval. 10 (2–3): 119–271. doi:10.1561/1500000032. Retrieved 1 December 2018.
  2. ^ Dong, Hai (2008). A survey in semantic search technologies. IEEE. pp. 403–408. Retrieved 1 May 2009.
  3. ^ Ruotsalo, T. (May 2012). "Domain Specific Data Retrieval on the Semantic Web". The Semantic Web: Research and Applications. ESWC 2012. Lecture Notes in Computer Science. Vol. 7295. pp. 422–436. doi:10.1007/978-3-642-30284-8_35. ISBN 978-3-642-30283-1.
  4. ^ Nowak, Ken (2024). What is semantic SEO?. WeAreKinetica. Retrieved 21 June 2024.
  5. ^ Singhal, Amit (2012). "Introducing the Knowledge Graph: things, not strings". Official Google Blog. https://blog.google/products/search/introducing-knowledge-graph-things-not/
  6. ^ Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". EMNLP 2019. https://arxiv.org/abs/1908.10084
  7. ^ Bodenreider, Olivier (2004). "The Unified Medical Language System (UMLS): integrating biomedical terminology". Nucleic Acids Research. 32 (suppl_1): D267–D270.
  8. ^ Lin, Jimmy; Nogueira, Rodrigo; Yates, Andrew (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. https://arxiv.org/abs/2010.06467
  9. ^ Jones, Karen Spärck (1972). "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation. 28 (1): 11–21. doi:10.1108/eb026526.
  10. ^ Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science. 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
  11. ^ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
  12. ^ Pennington, Jeffrey; Socher, Richard; Manning, Christopher D. (2014). GloVe: Global Vectors for Word Representation (PDF). Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.
  13. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
  14. ^ Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084 [cs.CL].
  15. ^ Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". IEEE Data Engineering Bulletin. 24 (4): 35–43.
  16. ^ Cover, T. M.; Hart, Peter E. (1967). "Nearest Neighbor Pattern Classification". IEEE Transactions on Information Theory. 13 (1): 21–27. doi:10.1109/TIT.1967.1053964.
  17. ^ Goldberg, David (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley. Bibcode:1989gaso.book.....G.
  18. ^ Indyk, Piotr; Motwani, Rajeev (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC. pp. 604–613. doi:10.1145/276698.276876.
  19. ^ Malkov, Yu. A.; Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (4): 824–836. arXiv:1603.09320. doi:10.1109/TPAMI.2018.2889473. PMID 30602420.
  20. ^ Indyk, Piotr; Motwani, Rajeev (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC. pp. 604–613. doi:10.1145/276698.276876.
  21. ^ Zobel, Justin; Moffat, Alistair (2006). "Inverted Files for Text Search Engines". ACM Computing Surveys. 38 (2): 6. doi:10.1145/1132956.1132959.
  22. ^ Amazon Science (2021). "Using neural retrieval for semantic product search". https://www.amazon.science/blog/using-neural-retrieval-for-semantic-product-search
  23. ^ IBM Research (2020). "Using AI and machine learning for smarter enterprise search". https://www.ibm.com/blogs/research/2020/11/ai-enterprise-search/
  24. ^ Wang, Q.; et al. (2020). "COVID-19 literature retrieval with semantic search". Nature. 582: 560–561.
  25. ^ Chalkidis, Ilias; et al. (2020). "LEGAL-BERT: The Muppets Straight Out of Law School". Findings of EMNLP 2020. https://arxiv.org/abs/2010.02559
  26. ^ Bender, Emily M.; et al. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?". FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
  27. ^ Schwartz, Roy; et al. (2019). "Green AI". Communications of the ACM. 63 (12): 54–63.
  28. ^ Pires, Telmo; Schlinger, Eva; Garrette, Dan (2019). "How multilingual is Multilingual BERT?". ACL 2019. https://arxiv.org/abs/1906.01502
  29. ^ Radford, Alec; et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). https://arxiv.org/abs/2103.00020