NSF REU: Big Data Security and Privacy
When malicious information—such as links, videos, photographs, microblogs, and metadata—is posted online, one fundamental question arises: can it propagate to viral proportions?
In this project, we investigate whether social influence features extracted from a sequence of posts in a forum thread can distinguish viral from non-viral cascades. Anonymous communities on secure sites and forums now carry an unprecedented flow of ideas, malware, and exploits. Without visibility into this new offensive industrial base, the production pipeline is hidden from defenders. This has prompted the development of machine learning models that identify viral cascades in their infancy, which security specialists can then use to prevent mass adoption of malware.
The motivation behind this research is to construct classification models that anticipate hacktivism campaigns and mass adoption of cyber threats using structural features extracted from the cascade [1,2]. We frame this as binary classification: identifying a viral cascade in its early stages. Previous research has focused on identifying cascades in social networks through a combination of temporal and structural features. We extend these features to malicious forums on the Dark Web, deriving them from the structural characteristics of each cascade through social network analysis. The analysis is strictly topology-driven: it relies only on social network metrics, and no content from the forum posts is analyzed. Given a sequence of user posts in a thread, the problem is to predict whether the cascade will reach viral proportions. For the scope of our research, a “viral” cascade is any thread whose number of user adoptions grows by a given multiplier.
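To make the labeling concrete, here is a minimal sketch of how a thread might be tagged as viral or non-viral. It is an illustration under assumptions, not the project's actual code: the function name `is_viral`, the `early_size` observation point, and the default `multiplier` of 2.0 are all hypothetical, and an "adoption" is assumed to mean a distinct user posting in the thread.

```python
from typing import Hashable, Optional, Sequence


def is_viral(
    posts: Sequence[Hashable],
    early_size: int = 10,      # hypothetical: adopters seen at prediction time
    multiplier: float = 2.0,   # hypothetical: required growth factor
) -> Optional[bool]:
    """Label a thread from the ordered user IDs of its posts.

    Returns None when the thread never reaches `early_size` distinct
    users (it would not be a candidate cascade under this assumption),
    True when the final number of distinct users is at least
    `multiplier * early_size`, and False otherwise.
    """
    total_adopters = len(set(posts))  # each user counts once
    if total_adopters < early_size:
        return None
    return total_adopters >= multiplier * early_size


# Six distinct users post in this toy thread (u1 posts twice).
thread = ["u1", "u2", "u1", "u3", "u4", "u5", "u6"]
label = is_viral(thread, early_size=3, multiplier=2.0)
```

Varying `multiplier` in a sketch like this would produce datasets with different levels of cascade growth; the larger the factor, the rarer the positive class becomes.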
To test our approach, we train a variety of classifiers, including Random Forest, AdaBoost, and Naive Bayes. The data, collected through the CYR3CON API, is imbalanced: across datasets with various levels of cascade growth, the negative (non-viral) class heavily outnumbers the positive (viral) class. We address the resulting bias by sampling the training dataset one-to-one between the two classes.
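One way to read the one-to-one sampling step is as random undersampling of the majority class, so that the training set holds equal numbers of viral and non-viral cascades. The sketch below assumes that reading (oversampling the minority class would be another option), and the helper name `balance_one_to_one` and its `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def balance_one_to_one(
    samples: Sequence, labels: Sequence[int], seed: int = 0
) -> Tuple[List, List[int]]:
    """Undersample so both classes have the same number of examples.

    `labels` uses 1 for viral and 0 for non-viral. The larger class is
    randomly reduced to the size of the smaller one.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(keep)  # avoid all positives preceding all negatives
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data: 3 viral cascades against 17 non-viral ones.
toy_labels = [1] * 3 + [0] * 17
toy_samples = list(range(20))
X_bal, y_bal = balance_one_to_one(toy_samples, toy_labels)
```

Only the training split should be balanced this way; evaluating on data that keeps the original class ratio gives a more realistic picture of how a classifier would behave on real forum traffic.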
This work provides the following main contributions:
- First application of machine learning for virality detection on malicious cascades.
- Research into the extraction of social features from malicious cascades, both viral and non-viral.
- Comparison of different machine learning models with different levels of cascade growth.
My poster can be found here.
REFERENCES
- [1] Cheng, J., Adamic, L., Dow, P. A., Kleinberg, J. M., & Leskovec, J. (2014, April). Can cascades be predicted? In Proceedings of the 23rd International Conference on World Wide Web (pp. 925-936).
- [2] Marin, E., Guo, R., & Shakarian, P. (2020, May). Measuring time-constrained influence to predict adoption in online social networks. ACM Transactions on Social Computing, 3(3), Article 13, 26 pages.
Note: Source code can be found here: https://github.com/DhanushKarthikeyan/Network-Analysis-Research
