AI technologies have been widely applied across many fields, but have you ever heard of using AI to monitor webpages for defacement? WebOrion is glad to introduce a new engine for the WebOrion defacement monitoring platform: the AI Natural Language Processing (NLP) Engine. Rather than simply sending out a warning that something on the webpage has changed, this engine analyzes each webpage change and tells you directly whether it is malicious.
Why AI NLP?
Websites continue to be targeted by attackers because they are readily available attack surfaces to probe and hack. Web defacements remain very prevalent, with between 400,000 and 1 million public websites hacked globally every year. Victims include government, enterprise, non-profit and small business websites. Defaced websites carry significant business and security implications, including reputational damage and loss of business and trust, and a defacement is sometimes a symptom of deeper security issues on the website owner's side.
Some of our observations include the following:
- Defaced webpages can show hacker images, texts or even audio/video clips
- Defaced webpages typically have a black background, although we are seeing quite a few with a white background or no background image
- Some common keywords used in defaced webpages include “hacked”, “owned”, “pwned”, etc.
- Defacement text may or may not be in English.
We decided to use AI to learn from more than 10 million samples of real defaced webpages collected over the past 10 years. This allows WebOrion to add a new level of intelligence beyond simple webpage change or keyword detection. We explored both AI computer vision (CV) and text analysis techniques. Our researchers found that, because defaced webpages do not have representative color patterns, text analysis is a better choice than CV. Thus, we decided to use AI NLP to learn from past defacements and to classify whether a change is a true defacement or not.
Our AI NLP Model-T Considerations
In the field of natural language processing (NLP), we factored in the following considerations during the research and development of our AI model:
- Language detection. We check the language probability and the proportion of English in the HTML text content of a sample. If there is a major change of language in the monitored webpage (e.g., from English to non-English), we trigger an alert, as webmasters typically do not change the language of a webpage.
- Stopwords. Most stopwords are removed from our processing to enhance the accuracy of the AI model.
- Syntax and Semantics. Beyond just detecting keywords, our AI model will learn from the syntax and semantics of the sentences.
- Enhanced training from a well-known Transformer Attention Model for NLP. We have named our first AI model Model-T. It builds on proven AI NLP models, with our own testing and enhancements using defacement data, to provide the optimal accuracy in classifying defacements.
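The language check and stopword removal described above can be sketched roughly as follows. This is a minimal illustration, not WebOrion's implementation: the tiny stopword list, the function names, the `min_drop` parameter, and the use of the share of ASCII-only words as a crude stand-in for "proportion of English" are all assumptions made for the example. A production system would use a proper language identifier.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is",
             "of", "on", "or", "our", "the", "to", "with", "your"}


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_tokens(html):
    """Return lowercase word tokens (letters only) from the visible HTML text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return re.findall(r"[^\W\d_]+", " ".join(parser.parts).lower())


def english_proportion(tokens):
    """Crude proxy: share of tokens written with ASCII letters only."""
    if not tokens:
        return 0.0
    return sum(t.isascii() for t in tokens) / len(tokens)


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def language_shift(old_html, new_html, min_drop=0.5):
    """Flag a change when the English proportion drops by at least min_drop."""
    old_p = english_proportion(html_to_tokens(old_html))
    new_p = english_proportion(html_to_tokens(new_html))
    return (old_p - new_p) >= min_drop
```

Note that the ASCII heuristic cannot tell English apart from other Latin-script languages, which is why it is only a placeholder for real language detection.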
Model-T Training and Inference
Based on the above considerations, we trained an AI NLP model that produces a classification score for any input HTML text. Our internal testing shows an accuracy of more than 90% on real-world HTML text.
Once the final optimal Model-T Classifier has been trained, we integrate Model-T with the WebOrion monitoring platform so it can perform inference each time it checks a webpage.
In a production scenario, WebOrion checks the monitored webpage for changes on a regular basis. Any HTML text change is pre-processed (e.g., language checks, removal of stopwords) and sent to the Model-T classifier. Model-T then performs its AI NLP analysis and returns a defacement classification (yes/no) and a confidence level (e.g., 80%). Changes classified as defaced get a high alert, while non-defaced text gets a regular alert or no alert. In this way, WebOrion customers can respond promptly to high alerts while filtering out regular alerts, or reduce false positives by turning off alerts for non-defaced changes.
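The flow above can be sketched as a small function. This is only an illustration under stated assumptions: `check_change`, `Verdict`, `alert_on_benign` and `toy_classifier` are hypothetical names, and the keyword-based toy classifier merely stands in for Model-T so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    defaced: bool
    confidence: float  # defacement confidence in [0, 1]
    alert: str         # "high", "regular" or "none"


def check_change(old_text: str,
                 new_text: str,
                 preprocess: Callable[[str], str],
                 classify: Callable[[str], float],
                 threshold: float = 0.5,
                 alert_on_benign: bool = True) -> Optional[Verdict]:
    """Classify a detected text change; return None when nothing changed."""
    if old_text == new_text:
        return None
    confidence = classify(preprocess(new_text))
    defaced = confidence > threshold
    if defaced:
        alert = "high"
    else:
        alert = "regular" if alert_on_benign else "none"
    return Verdict(defaced, confidence, alert)


def toy_classifier(text: str) -> float:
    """Stand-in for the real model (assumption): reacts to one keyword only."""
    return 0.9 if "hacked" in text else 0.1
```

Setting `alert_on_benign=False` corresponds to the "no alerts" option for changes that are not classified as defaced.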
How to Use and Configure AI NLP?
The new AI NLP Engine focuses on analyzing the text content of a webpage to determine whether the webpage is defaced or not. We use two state-of-the-art AI NLP classifier models, Model-T and Model X. Both models have been trained on the defaced and non-defaced datasets collected over the past 10 years. For users new to the AI NLP system, we recommend starting with the Model-T Classifier only.
Model-T is an enhanced version of a well-known transformer-based model, which adopts the mechanism of self-attention to simulate how humans read text. It weighs the significance of each part of the input data differently, which allows it to analyze context and semantics when making its prediction.
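To make the idea of weighing each part of the input differently more concrete, here is a minimal sketch of generic scaled dot-product self-attention in plain Python. It is not Model-T's architecture, whose details are not described here. As a simplifying assumption, the toy example uses the same vectors as queries, keys and values, whereas real transformer models apply learned projections first.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention.

    Returns (outputs, weights); weights[i][j] is how strongly position i
    attends to position j, and each row of weights sums to 1.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights


# Toy 2-dimensional "word vectors" (made up for illustration).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` shows how much one position draws on every other position when its output vector is computed.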
The defacement confidence threshold is set to 50% by default, a value that was optimized during our AI NLP training phase. If the defacement confidence exceeds this threshold, the model considers the webpage defaced, and a high alert is sent out to notify the customer.
For more advanced users, Model X is a decision-tree-based model that outperforms other algorithms or frameworks. It takes the TF-IDF (term frequency-inverse document frequency) weights of the words in the document as input and compares these frequencies with those of defaced and non-defaced webpages to determine whether the webpage is defaced or not.
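As a rough illustration of the kind of input Model X works with, the sketch below computes TF-IDF weights using only the standard library. It uses the plain textbook formula (term frequency times log(N / document frequency)). The exact variant, vocabulary and tree model used by Model X are not specified here, so treat the function and the toy documents as assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs is a list of token lists; the result is one {term: weight} dict
    per document. Terms that occur in every document get weight 0.
    """
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up toy corpus for illustration.
docs = [
    ["hacked", "by", "team"],
    ["welcome", "to", "our", "store"],
    ["our", "team", "welcomes", "you"],
]
vectors = tf_idf(docs)
```

In the first toy document, "hacked" receives a higher weight than "team" because "team" also appears in another document. Vectors of this kind are what a tree-based classifier would consume as features.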
By training on the defaced and non-defaced webpages, the two models learn the patterns in the training data. They are then able to analyze the context as well as the word frequencies, and each model gives a defacement confidence score indicating whether it considers the webpage defaced or not.
Since two different models are provided, we offer different notification policies. You can choose to be notified only when the selected model considers the webpage defaced. Or, if you want a stricter rule, you can choose to be notified when either or both of the models consider the webpage defaced.
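One way to picture these notification policies is as a small decision function over the two confidence scores. The policy names and the `should_notify` helper below are hypothetical and do not reflect the actual configuration options in the product. In particular, the sketch treats "either" and "both" as two separate policies, which is one possible reading of the description above.

```python
from enum import Enum


class Policy(Enum):
    MODEL_T_ONLY = "model_t_only"  # notify when Model-T flags the page
    MODEL_X_ONLY = "model_x_only"  # notify when Model X flags the page
    EITHER = "either"              # notify when at least one model flags it
    BOTH = "both"                  # notify only when both models agree


def should_notify(policy, score_t, score_x, threshold=0.5):
    """Decide whether to notify, given two defacement confidence scores."""
    flagged_t = score_t > threshold
    flagged_x = score_x > threshold
    if policy is Policy.MODEL_T_ONLY:
        return flagged_t
    if policy is Policy.MODEL_X_ONLY:
        return flagged_x
    if policy is Policy.EITHER:
        return flagged_t or flagged_x
    if policy is Policy.BOTH:
        return flagged_t and flagged_x
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, with scores of 0.8 from Model-T and 0.3 from Model X, the `EITHER` policy notifies while the `BOTH` policy does not.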
Beyond just providing an AI blackbox where we just provide the classification results, we have added a really cool AI NLP Explainability Tool as well. This is extremely useful to help you to decipher some of the results of our AI NLP engine.
For example, in a typical scenario, you may be alerted by the AI NLP Engine even though you are only making normal updates. This is because the AI NLP Engine might not have learned words such as company names or product names, so it may classify the webpage as defaced when it encounters them.
Fortunately, we have a solution to that issue: a tool that explains prediction results. The tool is meant to help you understand why the AI engine considers the webpage defaced. It gives each word a weight, which indicates how strongly that word pushes the model toward its prediction. You can then have the words with the highest weights ignored in future predictions, so that the number of false positives is reduced. In this way, the AI NLP Explainability Tool helps with any fine-tuning required.
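How the Explainability Tool computes its weights is not described here. As a rough illustration of per-word weights and an ignore list, the sketch below uses a simple leave-one-word-out technique, which is an assumption and not necessarily the product's method. The toy scorer, the vocabulary and the made-up product name "acmezon" exist only to make the example runnable.

```python
# Toy vocabulary the stand-in model "knows" (assumption for illustration).
KNOWN_WORDS = {"launch", "new", "product", "welcome"}


def toy_score(tokens):
    """Stand-in for a model score: share of tokens outside the known vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in KNOWN_WORDS for t in tokens) / len(tokens)


def word_weights(tokens, score):
    """Leave-one-word-out weights.

    A word's weight is how much the score drops when every occurrence of
    that word is removed; positive weights push toward "defaced".
    """
    base = score(tokens)
    weights = {}
    for word in set(tokens):
        reduced = [t for t in tokens if t != word]
        weights[word] = base - score(reduced)
    return weights


def apply_ignore_list(tokens, ignored):
    """Drop ignored words before the text is scored again."""
    return [t for t in tokens if t not in ignored]


tokens = ["acmezon", "acmezon", "launch"]
weights = word_weights(tokens, toy_score)
top_word = max(weights, key=weights.get)
rescored = toy_score(apply_ignore_list(tokens, {top_word}))
```

Here the unknown product name carries the highest weight, and once it is placed on the ignore list the toy score falls below the default 50% threshold.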
Benefits and Conclusions of Using AI NLP Classifier
Using the AI NLP Classifier provides the following benefits:
- Beyond change or keyword detection, using AI NLP Classifier adds a new level of intelligence to analyze and classify the type of changes on your monitored webpages.
- Detection of the language used in your webpage.
- Explainability tool to help you analyze and understand the AI classification of your webpage contents.
- Provides differentiated alerts for higher risk changes
- Potential to reduce false positives if the system is configured to send no alerts for HTML text changes that are not classified as defaced.
The launch of the AI NLP engine complements our other advanced engine to make WebOrion Defacement Monitor currently the most comprehensive offering in the industry. The AI NLP engine is available in our WebOrion Monitor AI Pack. Do reach out to us or our sales team if you are interested in finding out more about our AI NLP engine!