DarkBERT: The AI trained on Dark Web data to fight cybercriminals is here
Researchers are training artificial intelligence to index data on the Dark Web to fight cybercrime and track malicious activities.
Highlights
- The AI has been trained on Dark Web data
- Has been created to boost cybersecurity
- The tool is based on Facebook’s RoBERTa
Large language models like ChatGPT and Bard are extremely popular right now. They are trained on various types of text available on the internet, including websites, articles, and books, which makes their responses diverse and impressive. However, some researchers have experimented with training models like ‘DarkBERT’ on data from the Dark Web instead. This has led to interesting and unexpected outcomes.
“DarkBERT: A Language Model for the Dark Side of the Internet” [1].
A new academic paper on a LLM trained from the dark web.
Yes it is exactly what you fear, but no license or law will stop it.
So we study it to understand humanity and our failings.
[1]… pic.twitter.com/wwAxPCqMtK
— Brian Roemmele (@BrianRoemmele) May 21, 2023
But what is the Dark Web?
The dark web refers to the part of the internet that is not easily accessible and not indexed by search engines like Google. It is called the “Dark Web” because it is known for being a hidden and anonymous space where people can engage in various questionable activities privately. Unlike the regular internet, which is easily accessible through search engines and requires no special tools, the dark web requires special software, such as the Tor browser, to access it.
Tor helps to protect the identity and location of users by masking their IP addresses, making it difficult for others to track them. The dark web is often associated with illegal activities, such as buying and selling drugs, weapons, or stolen data. It is also known for hosting various types of marketplaces where illegal goods and services can be exchanged.
However, it is important to note that not everything on the dark web is illegal. It also serves as a platform for whistleblowers, journalists, and activists who need to communicate anonymously and securely.
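For readers curious about the mechanics, the sketch below shows one common way to route an HTTP request through a locally running Tor daemon from Python. The port, the .onion address, and the library setup are illustrative assumptions here, not something described in the DarkBERT paper.

```python
# Minimal sketch: routing an HTTP request through a local Tor SOCKS proxy.
# Assumes the Tor daemon is running on its default SOCKS port (9050) and
# that requests is installed with SOCKS support: pip install requests[socks]
import requests

# socks5h (rather than socks5) resolves hostnames through Tor itself,
# which is required for .onion addresses
TOR_PROXY = "socks5h://127.0.0.1:9050"
proxies = {"http": TOR_PROXY, "https": TOR_PROXY}

# Hypothetical hidden-service URL, used purely for illustration
url = "http://exampleonionaddress.onion"

response = requests.get(url, proxies=proxies, timeout=60)
print(response.status_code)
```

Crawlers like the one used to build DarkBERT's corpus work on the same principle, just at a much larger scale and with page discovery and filtering layered on top.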
What is DarkBERT?
A group of researchers from South Korea has published a paper outlining the development of a large language model (LLM) using a vast collection of data sourced from the Dark Web, specifically obtained by crawling the Tor network.
The dataset encompassed a range of questionable websites spanning categories such as cryptocurrency, pornography, hacking, weaponry, and more. However, to address ethical concerns and prevent malicious individuals from extracting sensitive information, the team put guardrails in place by filtering the pre-training corpus before training DarkBERT.
The name DarkBERT was chosen because the language model is built upon the RoBERTa architecture, originally introduced by Facebook researchers in 2019. RoBERTa is a transformer-based model that serves as the foundation for DarkBERT, providing the framework for its development and functioning.
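To make that relationship concrete, here is a minimal sketch of the RoBERTa foundation in action via the Hugging Face transformers library. It uses the public roberta-base checkpoint as a stand-in; DarkBERT is the same architecture further pre-trained on a filtered Dark Web corpus, and this snippet is not the authors' code.

```python
# Minimal sketch of the RoBERTa architecture DarkBERT builds on.
# Requires: pip install transformers torch
from transformers import pipeline

# Masked language modelling is the pre-training objective RoBERTa uses;
# DarkBERT continues this pre-training on Dark Web text instead
fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is <mask>
for prediction in fill_mask("The dark web is accessed through the <mask> browser."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

The intuition behind DarkBERT is that continuing this pre-training on dark web text gives the model vocabulary and context (jargon, leak-site phrasing, forum conventions) that a model trained on the surface web never sees.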
What is the purpose of DarkBERT?
DarkBERT, despite its sinister-sounding name, is designed for security and law enforcement purposes rather than for engaging in malicious activities. Since it was trained using data from the dark web, which contains numerous illicit websites where large sets of stolen passwords are often found, DarkBERT outperforms existing language models in cybersecurity and cyber threat intelligence applications.
Researchers who developed DarkBERT have showcased its effectiveness in detecting ransomware leak sites. Hackers and ransomware groups frequently upload compromised sensitive information, such as passwords and financial data, to the dark web with the intention of selling it.
The research paper proposes that DarkBERT can assist security researchers in automatically identifying these websites. Furthermore, it can crawl through various dark web forums to monitor them for any illicit exchanges of information. However, the researchers acknowledge that because there is limited publicly available data specific to dark web tasks, certain tasks may still require further customisation and fine-tuning of DarkBERT.
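As a rough illustration of how such page classification could look in practice, the sketch below attaches a binary classification head to a RoBERTa-style encoder. The checkpoint name, labels, and example text are assumptions; the head here is freshly initialised and untrained, so a real system would first fine-tune it on labelled leak-site pages, as the paper describes doing with DarkBERT itself.

```python
# Hedged sketch of DarkBERT-style ransomware leak-site detection.
# Requires: pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"  # stand-in; the paper fine-tunes DarkBERT itself

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2  # assumed labels: 0 = benign page, 1 = leak site
)

# Illustrative page text; real input would be crawled dark web pages
page_text = "LEAKED DATA - full database dump available, contact for price"
inputs = tokenizer(page_text, return_tensors="pt", truncation=True)

# The classification head is untrained here, so this output is meaningless
# until the model is fine-tuned on labelled examples
with torch.no_grad():
    logits = model(**inputs).logits

label = logits.argmax(dim=-1).item()
print("suspected leak site" if label == 1 else "benign page")
```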
What’s next?
There has been a lot of progress in the development of DarkBERT. The researchers are working on adding support for multiple languages to the pre-trained model. With broader language coverage, DarkBERT is expected to perform even better and gather data from a wider range of sources.