Enhancing Machine Learning Models in Cybersecurity to Combat Evasive Malware: EMBER2024
Malware keeps evolving, and the tools built to detect it have to evolve with it. The latest advance is EMBER2024, an update to the original EMBER dataset, which was first released in 2018.
EMBER2024, presented at the SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2025) in Toronto in August 2025, builds on the influential original. The accompanying academic paper, "EMBER2024: A Benchmark Dataset for Holistic Evaluation of Malware Classifiers", describes the new dataset in detail.
The new version of the dataset, developed by researchers Robert J. Joyce, Gideon Miller, Phil Roth, Richard Zak, Elliott Zaresky-Williams, Hyrum Anderson, Edward Raff, and James Holt, includes a challenge set of 6,315 files that no AV product in VirusTotal initially detected as malicious but that were later identified as malicious. By highlighting the hardest files to classify, this set reflects the difficulties of training and shipping a real commercial AV solution.
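The selection rule behind the challenge set can be sketched in a few lines. This is a minimal illustration only: the ScanRecord class and its field names are hypothetical and do not reflect the actual EMBER2024 schema or the VirusTotal API.

```python
from dataclasses import dataclass


@dataclass
class ScanRecord:
    """Hypothetical summary of one file's scan history (not the EMBER2024 schema)."""
    sha256: str
    first_scan_detections: int   # number of AV engines flagging the file when first seen
    later_label_malicious: bool  # label assigned after later analysis


def select_challenge_candidates(records):
    """Keep files that no engine flagged at first but that were later labeled malicious."""
    return [
        r for r in records
        if r.first_scan_detections == 0 and r.later_label_malicious
    ]


records = [
    ScanRecord("aa11", 0, True),    # missed at first, later malicious: a challenge file
    ScanRecord("bb22", 12, True),   # detected from the start
    ScanRecord("cc33", 0, False),   # never flagged and still considered benign
]
challenge = select_challenge_candidates(records)
print([r.sha256 for r in challenge])
```

A classifier that scores well on ordinary test data can still do poorly on files selected this way, which is why a separate challenge set is useful.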
The EMBER2024 dataset spans more than 3.2 million files across six file formats and includes metadata, labels, and calculated features. The feature calculation code was updated to use the pefile library instead of LIEF, a change intended to keep it compatible with future versions of Python.
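To give a sense of what a "calculated feature" can look like, here is a generic, format-agnostic sketch: a normalized byte histogram and its entropy. It is an assumption chosen for illustration, uses only the Python standard library, and is not the EMBER2024 feature code (which relies on pefile for PE-specific parsing).

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return a 256-bin histogram of byte values, normalized to sum to 1."""
    total = len(data)
    if total == 0:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(hist: list[float]) -> float:
    """Shannon entropy, in bits per byte, of a normalized histogram."""
    return -sum(p * math.log2(p) for p in hist if p > 0)


sample = bytes([0, 0, 255, 255])  # toy input standing in for a file's raw bytes
hist = byte_histogram(sample)
entropy = shannon_entropy(hist)
print(hist[0], hist[255], entropy)
```

Features like these work on any file format, which matters when a dataset covers several formats; format-specific features (for example, PE header fields) need a parser for that format.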
Because it includes malware that has demonstrated the ability to evade antivirus products, EMBER2024 gives data scientists conducting cybersecurity research an extensive, modern dataset for training and evaluating machine learning models for malware detection.
The dataset also provides seven types of labels and tags that support training classifiers on seven common tasks, including malicious/benign detection, malware family classification, and malware behavior identification. Together, these labels give a broader view of the malware landscape and help in developing more robust and versatile detection systems.
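The three tasks named above need differently shaped training targets: detection is binary, family classification is multiclass, and behavior identification is multi-label. The sketch below shows one way to derive each from the same records. The record layout, field names, and tag values are hypothetical and are not the EMBER2024 schema.

```python
# Hypothetical records; field names and values are illustrative only.
samples = [
    {"sha256": "aa11", "malicious": True, "family": "familyA", "behaviors": ["ransomware"]},
    {"sha256": "bb22", "malicious": True, "family": "familyB", "behaviors": ["spyware", "dropper"]},
    {"sha256": "cc33", "malicious": False, "family": None, "behaviors": []},
]


def detection_targets(samples):
    """Binary task: 1 for malicious, 0 for benign."""
    return [int(s["malicious"]) for s in samples]


def family_targets(samples):
    """Multiclass task: a class index for each file that has a family label."""
    families = sorted({s["family"] for s in samples if s["family"]})
    index = {name: i for i, name in enumerate(families)}
    pairs = [(s["sha256"], index[s["family"]]) for s in samples if s["family"]]
    return pairs, families


def behavior_targets(samples):
    """Multi-label task: one 0/1 indicator per behavior tag for every file."""
    vocab = sorted({b for s in samples for b in s["behaviors"]})
    rows = [[int(b in s["behaviors"]) for b in vocab] for s in samples]
    return rows, vocab


print(detection_targets(samples))
print(family_targets(samples))
print(behavior_targets(samples))
```

Note that the family task naturally covers only the files that carry a family label, so different tasks can end up training on different subsets of the same data.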
The paper reports 14 benchmark models trained on different subsets of the data and on different classification tasks. Their results give future research a baseline to compare against.
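When comparing a new detector against published baselines, one metric often used for malware classifiers is the detection rate at a fixed false positive rate. The sketch below computes it in plain Python. It is an assumption about how such a comparison might be done, not necessarily the evaluation protocol used in the paper, and the labels and scores are made up.

```python
def tpr_at_fpr(labels, scores, max_fpr):
    """True positive rate at the lowest threshold whose false positive rate is <= max_fpr.

    labels: 1 for malicious, 0 for benign. scores: higher means more likely malicious.
    A file is flagged when its score is strictly greater than the threshold.
    """
    benign = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    malicious = [s for y, s in zip(labels, scores) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")
    allowed_fp = int(max_fpr * len(benign))  # how many benign files may be flagged
    if allowed_fp < len(benign):
        threshold = benign[allowed_fp]  # flagging above this keeps false positives <= allowed_fp
    else:
        threshold = float("-inf")       # every file may be flagged
    detected = sum(1 for s in malicious if s > threshold)
    return detected / len(malicious)


# Made-up labels and scores for illustration.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.9, 0.25, 0.6, 0.8, 0.95]
print(tpr_at_fpr(labels, scores, max_fpr=0.25))
print(tpr_at_fpr(labels, scores, max_fpr=0.0))
```

Fixing the false positive rate reflects a practical constraint: a detector that flags many benign files is hard to deploy, however many malicious files it catches.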
The popularity of the original EMBER dataset has led to related projects such as EMBERSim, and now to EMBER2024. The infrastructure code released with EMBER2024 could also be used to build a larger dataset for larger models, or to study how benign and malicious software evolve over time.
Open-source initiatives like EMBER2024 are an example of the industry-wide cooperation that drives innovation and supports continuous product improvement. The EMBER2024 public release includes the code used to construct the dataset, so researchers with access to VirusTotal can replicate the dataset in the future.
The EMBER2024 project reflects an ongoing commitment to research in the cybersecurity industry. As of this writing, the original EMBER paper has been cited in academic research more than 700 times since its publication in 2018, underscoring its significance in the field.
In conclusion, EMBER2024 gives researchers and practitioners in cybersecurity a comprehensive, up-to-date dataset for training and evaluating malware detection models. Its focus on challenging files and evasive malware makes it especially useful for testing detectors against the cases that are hardest to catch.