Author Identification Based on NLP (Published)
The amount of textual content is increasing exponentially, especially through the publication of articles, and the problem is further complicated by the growth of anonymous textual data. Researchers are therefore looking for methods to predict the author of an unknown text, a task known as author identification. This study uses Bag of Words (BOW) and Latent Semantic Analysis (LSA) features. The “All the news” dataset on Kaggle is used for experimentation and to compare the performance of BOW and LSA on the author identification task. Support vector machine, random forest, logistic regression (LR), and Bidirectional Encoder Representations from Transformers (BERT) classifiers are used for author prediction. In the first scope, which has 20 authors with 100 articles each, the highest accuracy is obtained by logistic regression using BOW, followed by random forest, also using BOW; across all algorithms, BOW scored better than LSA. The BERT model was also applied in this scope and achieved 70.33% accuracy. In the second scope, which increases the number of articles to 500 per author and decreases the number of authors to 10, BOW achieves its best performance with the logistic regression algorithm, at 93.86% accuracy. Moreover, the best accuracy, 94.9%, is obtained with LR when the BOW and LSA features are merged, which proved better than applying BOW or LSA individually, with a reported improvement of almost 0.1% compared with BOW alone. Finally, BERT achieved 86.56% accuracy and a log loss of 0.51.
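As a rough illustration of the Bag of Words representation named in the abstract above, the following minimal sketch builds count vectors over a shared vocabulary using only the Python standard library. The function names, the whitespace tokenizer, and the two-sentence toy corpus are assumptions made for illustration; they are not the study's pipeline and are not drawn from the “All the news” dataset.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector for one document; tokens outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy corpus (assumption: invented sentences, one per "author").
docs = ["the senate passed the bill", "the team won the match"]
vocab = build_vocabulary(docs)
matrix = [bag_of_words(d, vocab) for d in docs]

for row in matrix:
    print(row)
```

Each row of `matrix` is the feature vector a classifier such as logistic regression would receive. LSA features are typically obtained by applying a truncated singular value decomposition to a document-term matrix of this kind.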
Empirical Study of Features and Unsupervised Sentiment Analysis Techniques for Depression Detection in Social Media (Published)
This study provides an empirical evaluation of diverse traditional learning, deep learning, and unsupervised techniques, based on diverse sets of features, for the problem of depression detection among Twitter and Reddit users. The main objective is to identify the most appropriate features, document representations, and text classifiers for depression detection on social media microblogs, such as tweets, as well as macroblogs, such as posts on Reddit. The investigation concentrates on linguistic characteristics, blogging behavior, and topics as features; on multi-word and word embeddings for document representation; and on unsupervised learning for text clustering. The best approaches in the literature are selected as baselines and examined in practice on the dataset of depressive and non-depressive blogs designed for this work. Integrations and ensembles of the selected baselines are also tested in order to recommend a design for an effective social media blog classifier based on unsupervised learning and word embedding (WE) document representation. The experiments showed that a stacking ensemble of Adam Deep Learning with SOM clustering, followed by Agglomerative Hierarchical clustering with topic features and pre-trained word2vec embeddings, achieved an accuracy of more than 92% on the Twitter and Reddit depression analysis datasets.
The primary purpose of this study is to discuss the prediction of student admission to university based on numerous factors, using logistic regression. Many prospective students apply for Master’s programs, and the admission decision depends on criteria set by the particular college or degree program. The independent variables in this study will be measured statistically to predict graduate school admission. If successful, the exploration and data analysis would support predictive models that allow better prioritization in the screening of applicants to Master’s degree programs, which in turn helps offer admission to the right candidates.
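To make the idea of predicting admission with logistic regression concrete, here is a minimal sketch that fits the model by batch gradient descent using only the Python standard library. The two input features (a scaled GPA and a scaled test score), the six toy applicants, the learning rate, and the epoch count are all hypothetical; the abstract does not name the independent variables or the fitting procedure.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this applicant: p(admit) minus the true label.
            error = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(x, weights, bias):
    """Estimated probability of admission for one applicant."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical applicants (assumption): [GPA scaled to 0-1, test score scaled to 0-1].
X = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.4, 0.5], [0.5, 0.3], [0.3, 0.4]]
y = [1, 1, 1, 0, 0, 0]  # 1 = admitted, 0 = not admitted

weights, bias = fit_logistic_regression(X, y)
strong = predict_proba([0.95, 0.9], weights, bias)
weak = predict_proba([0.3, 0.3], weights, bias)
print(f"strong applicant: {strong:.2f}, weak applicant: {weak:.2f}")
```

Ranking applicants by the predicted probability is one way such a model could support prioritization during screening; in practice the features would need to come from real admission records and the model would need validation on held-out data.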