What is NLP?

METIS
Sep 20, 2023

Natural language processing (NLP) is a branch of artificial intelligence (AI). It enables computers to understand, generate, and process human language, allowing users to interrogate data with natural-language text or speech; for this reason it is sometimes called "language in." In reality, most consumers have probably interacted with NLP without realizing it. For example, NLP is the core technology behind virtual assistants such as Oracle Digital Assistant (ODA), Siri, Cortana, and Alexa: it is what lets these assistants understand a user's question and respond in natural language. NLP supports both text and speech and works across all human languages. Beyond virtual assistants, other NLP-based tools include web search, spam filtering, automatic text or speech translation, document summarization, sentiment analysis, grammar/spelling checking, and more. For example, some email programs use NLP to read and analyze incoming messages and automatically suggest replies based on their content, helping users respond to email more efficiently.

There are terms closely related to NLP, such as natural language understanding (NLU) and natural language generation (NLG), which refer to using computers to understand and to generate human language, respectively. Of these, NLG can provide a verbal description of what has happened, so it is also called "language out": it summarizes meaningful information into text, using a concept sometimes described as "grammar of graphics."

In practice, people often use NLU interchangeably with NLP: it is because computers can understand the structure and meaning of human language (NLU) that developers and users can interact with them in natural sentences and expressions (NLP). Whereas computational linguistics (CL) is the scientific study of human language from a computational perspective, NLP is an engineering discipline that builds computational artifacts to understand, generate, or process human language.

Research on NLP began in the 1950s, shortly after the birth of digital computers, and draws on both linguistics and artificial intelligence. However, the major breakthroughs of the past few years have been driven by machine learning, a branch of artificial intelligence that develops systems that learn from data and generalize. Deep learning, a form of machine learning, can learn highly complex patterns in large data sets, making it well suited to learning the complexities of natural language from web-scale text.

Natural language processing applications

Automate routine tasks: NLP-based chatbots can replace human agents in handling a host of routine tasks, freeing up employees to handle more challenging and interesting tasks. For example, chatbots and digital assistants can identify various user requests, then find matching entries from corporate databases and create targeted responses for the user.

Optimized search: For document and FAQ retrieval, NLP can improve on plain keyword matching by disambiguating words based on context (for example, "carrier" means something different in biomedical and industrial contexts); matching synonyms (for example, retrieving documents that mention "automobile" when the user searches for "car"); and accounting for morphological variation (especially important for non-English queries). With an NLP-based academic search system, doctors, lawyers, and experts in other fields can find highly relevant, cutting-edge research more easily and conveniently.
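
As an illustration of the synonym-matching idea, here is a minimal sketch in Python. The hand-built synonym table and documents are purely illustrative; real systems derive synonyms from curated thesauri or learned word embeddings.

```python
# Toy synonym-aware keyword search. The synonym table is hypothetical;
# production systems use thesauri or embeddings instead.
SYNONYMS = {
    "car": {"car", "automobile", "auto"},
    "automobile": {"car", "automobile", "auto"},
}

def expand_query(terms):
    """Expand each query term with its known synonyms."""
    expanded = set()
    for t in terms:
        expanded |= SYNONYMS.get(t, {t})
    return expanded

def search(query, documents):
    """Return documents containing any expanded query term."""
    terms = expand_query(query.lower().split())
    return [d for d in documents if terms & set(d.lower().split())]

docs = ["The automobile market grew", "Stock prices fell"]
print(search("car", docs))  # → ['The automobile market grew']
```

A plain keyword match for "car" would miss the first document; query expansion recovers it.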

Search Engine Optimization: NLP helps businesses optimize content through search analytics to improve their organization’s ranking in online searches. Nowadays, search engines generally use NLP technology to rank results. If a company understands how to effectively use NLP technology, it can obtain a higher ranking than its competitors, thereby increasing visibility.

Analyze and organize large document collections: NLP techniques such as document clustering and topic modeling help you easily understand the diversity of content in large document collections, such as corporate reports, news articles, or scientific documents. These techniques are often used for legal forensic purposes.
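
To make document clustering concrete, here is a minimal sketch using bag-of-words vectors, cosine similarity, and a greedy single-pass grouping. This is a deliberately simple stand-in for real clustering algorithms (such as k-means or LDA topic models); the documents and similarity threshold are illustrative.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: join a doc to the first cluster it resembles."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(v, vectors[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    "court rules on patent case",
    "judge delays patent ruling",
    "quarterly revenue beat forecasts",
]
print(cluster(docs))  # → [[0, 1], [2]]
```

The two legal documents share vocabulary and are grouped together; the financial one forms its own cluster.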

Social media analytics: NLP can analyze customer reviews and social media comments to help businesses understand large amounts of information more effectively. For example, sentiment analysis can identify positive and negative comments in social media comment streams, providing a direct, real-time measure of customer sentiment. This can provide huge rewards for businesses, such as increased customer satisfaction and revenue.

Market Insights: Businesses can use NLP to analyze their customers’ language to meet their needs more effectively and learn how to better communicate with them. For example, aspect-oriented sentiment analysis can detect sentiment in social media about a specific aspect or product (e.g., “The keyboard is nice, but the screen is too dark”), providing actionable insights for product design and marketing.

Moderate content: If your business attracts a large number of user or customer reviews, NLP can help you moderate this content to ensure high quality and good etiquette by analyzing the wording, tone, and intent of the reviews.

Industrial applications of natural language processing

NLP can streamline and drive automation of a variety of business processes, especially those involving large amounts of unstructured text (e.g. emails, surveys, social media conversations, etc.). Using NLP, businesses can better analyze data and make the right decisions. Here are some real-world application examples of NLP:

Healthcare: Healthcare systems around the world are now adopting electronic medical records and need to process large amounts of unstructured data. NLP can help healthcare organizations analyze data and capture new insights about health records.

Law: When faced with a case, lawyers often spend hours studying a large number of documents and searching for relevant materials. NLP technology can automate legal discovery processes, saving lawyers time and reducing human error by quickly scrutinizing large volumes of documents.

Finance: The pace of change in the financial industry is so rapid that any competitive advantage will play a key role. NLP can help traders automatically mine information from company documents and news reports, extracting information that is highly relevant to their own portfolios and trading decisions.

Customer service: Many large businesses today use virtual assistants or chatbots to respond to basic customer inquiries and requests for information (e.g. FAQs) — and refer complex questions to real employees when necessary.

Insurance: Large insurance companies use NLP to streamline business operations by sifting through claims-related documents and reports.

NLP technology overview

Machine learning models for NLP: Modern NLP relies heavily on machine learning, an artificial intelligence technology. Machine learning can generalize from examples in a data set to make predictions. The data set is called the training data — the machine learning algorithm uses the training data to train and generate a machine learning model that can complete the target task.

For example, the training data for sentiment analysis contains sentences and their corresponding sentiments (such as positive, negative, or neutral). A machine learning algorithm reads this data set and generates a model that accepts a sentence as input and returns its sentiment. Because the model takes a sentence or document as input and returns a label, it is also called a document classification model. Document classifiers can likewise classify documents by topic (e.g. sports, finance, politics).
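
The idea can be sketched with a tiny multinomial Naive Bayes classifier trained on a handful of invented sentiment examples. This is an illustrative toy, not the method any particular product uses; production systems train far stronger models on far larger data sets.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Minimal multinomial Naive Bayes document classifier."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        words = doc.lower().split()
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior + log likelihoods with add-one smoothing
            score = log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Invented toy training data: sentences paired with sentiment labels.
train_docs = ["great movie loved it", "wonderful acting great plot",
              "terrible movie hated it", "awful plot bad acting"]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("loved the acting"))  # → positive
```

The model generalizes from the labeled examples: "loved the acting" never appears in the training data, yet its words point to the positive class.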

Another model identifies and classifies entities in documents. For every word in the document, it predicts whether it is part of an entity mention, and if so, what kind of entity it refers to. For example, in “XYZ Corp stock traded at $28 yesterday,” “XYZ Corp” is the corporate entity, “$28” is the monetary amount, and “yesterday” is the date. For entity recognition, its training data is a text collection, in which each word will be labeled to indicate the type of entity it represents. Because this model can generate a label for each input word, it is also called a sequence labeling model.
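
Training data for sequence labeling is commonly written in the BIO scheme (B marks the beginning of an entity mention, I a word inside one, O a word outside any entity). A small sketch, using the example sentence above with illustrative entity spans:

```python
def bio_tags(tokens, entities):
    """Convert (start, end, type) token spans into per-token BIO labels.
    Spans follow Python slice conventions: end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["XYZ", "Corp", "stock", "traded", "at", "$28", "yesterday"]
entities = [(0, 2, "ORG"), (5, 6, "MONEY"), (6, 7, "DATE")]
print(bio_tags(tokens, entities))
# → ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-MONEY', 'B-DATE']
```

A sequence labeling model is trained to emit exactly one such tag per input word.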

Sequence-to-sequence (seq2seq) models have only recently begun to be applied to NLP. They take a whole sentence or document as input (as a document classifier does) but produce a sentence or some other sequence (such as a computer program) as output, whereas a document classifier produces only a single label. Applications of sequence-to-sequence models include machine translation (e.g., taking an English sentence as input and returning a French sentence); document summarization (taking a document in and outputting a summary); and semantic parsing (taking an English query or request as input and outputting a computer program that implements it).

Deep learning, pre-trained models, and transfer learning: Deep learning is a type of machine learning widely used in NLP. In the 1980s, researchers developed neural networks, which combine large numbers of primitive machine learning models into a single network. By loose analogy with the brain, each simple model is a "neuron." These neurons are arranged in layers; a network with many layers is a deep neural network, and machine learning based on deep neural network models is deep learning.
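
The layered structure can be sketched with a tiny two-layer network whose "neurons" are weighted sums followed by a nonlinearity. The weights below are hand-picked for illustration so that the network computes XOR, a function famously beyond any single-layer model:

```python
def relu(xs):
    """Nonlinearity applied to each neuron's output."""
    return [max(0.0, v) for v in xs]

def dense(x, weights, bias):
    """One fully connected layer: each output neuron is a weighted sum plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

# Hand-picked (not learned) weights that make this network compute XOR.
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]  # hidden layer: 2 neurons
W2, b2 = [[1.0, -2.0]], [0.0]                   # output layer: 1 neuron

def forward(x):
    hidden = relu(dense(x, W1, b1))  # first layer of "neurons"
    return dense(hidden, W2, b2)[0]  # output layer

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, forward(x))  # XOR: 0.0, 1.0, 1.0, 0.0
```

In real deep learning, the weights are not written by hand but learned from training data, and networks have millions to billions of such parameters.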

Deep neural networks are very complex: they usually require large amounts of data to train, and training demands substantial computing power and time. Modern deep-neural-network NLP models are trained on data from diverse sources (such as the entire contents of Wikipedia and data scraped from the web); the training data may run to 10 GB or more, and even on a high-performance cluster, training can take a week or longer. Researchers have found that training deeper models on larger datasets yields higher performance, so there is a race to increase both dataset size and model depth.

Deep neural networks' heavy data and computational requirements can severely limit their usefulness. Transfer learning addresses this: an already-trained deep neural network is trained further to perform a new task, using far less training data and computing power. The simplest form of transfer learning is fine-tuning, in which a model is first trained on a large general dataset (such as Wikipedia) and then trained further on a smaller dataset labeled for the actual target task. The fine-tuning dataset can be very small (perhaps only a few hundred or even a few dozen training examples), and fine-tuning may take only minutes on a single CPU. Transfer learning makes it easy for enterprises to deploy deep learning models across the organization.

Today, enterprises can obtain pre-trained deep learning models trained on various combinations of languages, datasets, and pre-training tasks through a complete ecosystem of providers. They can download these pre-trained models and fine-tune them according to their target tasks.

Examples of NLP preprocessing techniques

Tokenization: Tokenization segments raw text (such as a sentence or document) into a sequence of tokens, and is usually the first step in an NLP processing pipeline. A token is a recurring sequence of text that is treated as an atomic unit in subsequent processing; it may be a word, a sub-word (a morpheme, such as the English prefix "un-" or suffix "-ing"), or even a single character.
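
A minimal sketch of word-level and character-level tokenization using a regular expression. Production systems typically use learned sub-word tokenizers (such as BPE or WordPiece) rather than a fixed rule like this one:

```python
import re

def word_tokenize(text):
    """Split text into word tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Character-level tokenization: every character is a token."""
    return list(text)

print(word_tokenize("Don't panic!"))  # → ['Don', "'", 't', 'panic', '!']
```

Note that even this simple rule must decide what to do with apostrophes and punctuation; different tokenizers make different choices.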

Bag-of-words model: The bag-of-words model treats a document as an unordered collection of its words or tokens. In other words, a bag is like a set, except that it also tracks the number of times each element appears. The bag-of-words model completely ignores word order and therefore cannot distinguish statements such as "dog bites man" and "man bites dog." Nonetheless, in large-scale information retrieval tasks such as search engines, the bag-of-words model is efficient, and on longer documents it often produces accurate results.
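
A bag of words maps naturally onto a multiset of token counts; a minimal sketch:

```python
from collections import Counter

def bag_of_words(text):
    """A bag of words: each word mapped to its number of occurrences."""
    return Counter(text.lower().split())

print(bag_of_words("dog bites man"))
print(bag_of_words("man bites dog"))
# The two bags are identical: word order is discarded entirely.
```

This is exactly the confusion described above: the two sentences produce the same bag, so a bag-of-words model cannot tell them apart.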

Stop word removal: "Stop words" are tokens that can be ignored in subsequent processing, typically short, high-frequency words such as "a," "the," and "an." Bag-of-words models and search engines generally discard stop words to shorten processing time and reduce storage requirements. Deep neural networks, which typically do take word order into account (and are therefore not bag-of-words models), do not remove stop words, because stop words can convey subtle differences in meaning: "the package was lost" and "a package is lost" are identical after stop-word removal, yet they mean different things.
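
A minimal sketch of stop-word removal, using a tiny illustrative stop-word list (real lists are much larger and curated per language), applied to the example above where the removal erases a difference in meaning:

```python
# A tiny illustrative stop-word list; real systems use much larger curated lists.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "to"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the package was lost".split()))
print(remove_stop_words("a package is lost".split()))
# Both reduce to ['package', 'lost'], illustrating the lost nuance.
```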

Stemming and lemmatization: Morphemes are the smallest meaningful units of a language, usually smaller than words. For example, "revisited" consists of the prefix "re-," the stem "visit," and the past-tense suffix "-ed." Stemming and lemmatization, which map words to their stem forms (e.g. "revisit" + PAST), were key preprocessing steps before deep learning. Deep learning models, however, typically learn these regularities from the training data, making explicit stemming or lemmatization steps unnecessary.
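
A minimal sketch of suffix-stripping stemming with a few hand-picked suffix rules. Real stemmers (such as the Porter stemmer) apply far more careful, language-specific rules:

```python
# Toy suffix-stripping stemmer; the suffix list and length guard are illustrative.
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["visited", "visiting", "visits", "visit"]])
# → ['visit', 'visit', 'visit', 'visit']
```

All four surface forms map to the same stem, which is exactly what a bag-of-words retrieval system wants.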

Part-of-speech tagging and syntactic analysis: Part-of-speech (PoS) tagging labels each word with its part of speech (noun, verb, adjective, and so on), while syntactic analysis identifies how words combine into phrases, clauses, and whole sentences. The former is a sequence labeling task, and the latter an extended kind of sequence labeling; deep neural networks are the state of the art for both. Before deep learning, PoS tagging and syntactic analysis were essential steps in sentence understanding. However, modern deep learning NLP models generally make little (if any) explicit use of part-of-speech or syntactic information, so these steps play only a minor role in deep learning NLP.

NLP programming language

Python:
NLP libraries and toolkits are generally available in Python, and most NLP projects are currently developed using Python. Python’s interactive development environment allows users to easily develop and test new code.

Java and C++:
C++ and Java code are more efficient and are often the languages of choice when working with large amounts of data.
