Event Details

Malware Detection and Categorization Using ML and LLMs

Presenter: Damanpreet Singh
Supervisor:

Date: Tue, December 2, 2025
Time: 12:30:00 - 00:00:00
Place: Zoom - see below.

ABSTRACT

Join Zoom Meeting

https://uvic.zoom.us/j/88492710303?pwd=363d5QLjoowCZRoHXadtHtgxiSpyyy.1

 

Meeting ID: 884 9271 0303

Password: 349942

One tap mobile

+17789072071,,88492710303# Canada

+16475580588,,88492710303# Canada

 

Abstract: The rapid growth of malware attacks has created an urgent need for automated systems capable of accurately detecting and understanding malicious behavior. This project presents a comprehensive framework for malware detection and categorization using machine learning and Large Language Models (LLMs). The system's goal is to improve cybersecurity by not only detecting malware but also producing concise, intelligible descriptions of every threat it finds. The project uses the Microsoft Malware Classification dataset, which comprises approximately 21,000 malware samples grouped into nine primary families, with corresponding .byte and .asm files. Since the original dataset contains only malicious samples, roughly 15,000 benign files were added to enable binary classification of malicious and non-malicious programs. Several machine learning models, including XGBoost, LightGBM, SVM (RBF), and KNN, were trained and tested independently on both datasets. Applying the SMOTE technique reduced class imbalance, improving classification accuracy and mitigating bias toward the majority malware families. The models were evaluated for both binary classification (malicious vs. benign) and multi-class family prediction, achieving high detection performance. To enhance interpretability, an LLM-based explanation module was integrated. Following classification, the predicted malware family is sent to an LLM (through Ollama), which produces a natural-language summary outlining the malware's traits and actions and the defenses against it. An intuitive Gradio interface lets users upload files, view predictions, and read the generated explanations in real time. The developed system thus combines machine learning for precise detection with large language models for explainable analysis, providing both technical accuracy and human interpretability.
By assisting researchers and security analysts in proactive malware defense, this method advances the field of intelligent cybersecurity by bridging the gap between detection and comprehension.
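
For readers unfamiliar with SMOTE, the oversampling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the presenter's implementation: the function name smote_oversample, the toy feature vectors, and the parameters k and seed are all assumptions made for illustration. It shows only the core idea, which is that each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        base = rng.choice(minority)
        # Find its k nearest minority neighbours (excluding itself) by Euclidean distance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for an under-represented malware family (hypothetical values).
minority_family = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_oversample(minority_family, n_new=6)
balanced = minority_family + new_points
print(len(balanced))  # 10
```

In practice a library implementation would normally be used instead, and oversampling is typically applied to the training split only so that synthetic samples do not leak into evaluation.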