Challenge
Escalating incidents of misinformation circulating on social media affect a wide range of societal issues. They influence elections, exacerbate volatile community situations (sometimes spurring groups to violence), and spread falsehoods about serious public health issues. In 2020, misinformation about the COVID-19 pandemic was distributed across many social media channels, largely unchecked.
Twitter*, in particular, has inadvertently been a large-scale purveyor of misinformation due to the ease with which retweets can circulate through the population of users.1 MIT Sloan Professor Sinan Aral and MIT Media Lab researchers Deb Roy and Soroush Vosoughi conducted a study in 2018, which was published in Science magazine.1 In this study, they determined that falsehoods are 70 percent more likely to be retweeted than the truth. More alarmingly, falsehoods reach the first 1,500 people six times faster than the truth does.
Efforts by Twitter administrators to remove fake news and ban inauthentic sources still allow debunked content to escape detection and continue to circulate. Manual techniques to identify and remove misinformation before it can spread are time-consuming and ineffective. Identifying content that needs fact-checking is a difficult and laborious process. In the case of COVID-19, factually incorrect information can cause unnecessary deaths.2
Dr. Durga Toshniwal is a professor at the Indian Institute of Technology, Roorkee, India. With deep knowledge and expertise in data mining across numerous fields, AI, machine learning (ML), and big data analytics, Dr. Durga sought an AI-based solution to detect social media misinformation. “One of the primary challenges in our work,” Dr. Durga said, “is data collection and preprocessing. Social media generates a huge amount of raw data, which needs to be parsed to extract the relevant features. These features are preprocessed to a form appropriate for AI/ML modeling. This process requires high computing power.”
“We tried to address these challenges,” she continued, “using accelerated AI/ML libraries and other high-performance computing capabilities to speed up the data analysis.”
Solution
AI has been successfully applied to a broad range of large-scale pattern recognition tasks across numeric, text, and image datasets. Text analytics includes a set of AI technologies that deal with natural language processing (NLP), which enables machines to learn from and respond to text-based data. Social media has gained popularity because it allows people to express their views freely as unstructured written text. Individuals and society at large benefit from the analysis of such text data.
In collaboration with one of her students, Aarzoo Dhiman, Dr. Durga proposed a framework to analyze Twitter posts containing content about COVID-19. The NLP framework consisted of three basic steps.
- The framework converted tweets into word vectors.
- Critical clusters of word vectors were created using unsupervised machine learning to identify tweets with the highest popularity and influence. Transfer learning with a pretrained language representation model was employed to generate the tweet vector representations. For this purpose, they used a deep learning model called JoSE (Joint Spherical Embedding). This model trains the embeddings while preserving their contextual and spherical characteristics as directional information. In addition, the spherical clustering model SK-means was used to cluster the tweets, in alignment with the spherical embeddings generated in the first step. Previous research indicates that spherical embeddings tend to perform better with spherical clustering approaches.
- The veracity of the tweets was assessed by cross-checking credible sources based on four categories. For this purpose, they used a keyword search approach coupled with dynamic extension of the seed lists. The lists were based on the user descriptions of the verified users in the incoming data. The framework factors in news sources, medical organizations, medical influencers, and verified accounts to evaluate the credibility of content.
Figure 1. Steps in the proposed framework
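The clustering step can be illustrated with a minimal sketch of spherical k-means in plain Python. This is not the authors' implementation: the toy two-dimensional vectors stand in for JoSE tweet embeddings, and the function names, iteration count, and seed are assumptions made for illustration. The idea is that vectors are normalized to unit length, each one is assigned to the centroid with the highest cosine similarity, and each centroid is projected back onto the unit sphere after every update.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit length so that it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Dot product; for unit vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors by direction. Returns (labels, centroids)."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    # Illustrative initialization: pick k distinct points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: each point joins the most similar centroid.
        labels = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in points
        ]
        # Update: sum the members, then project the sum back onto the sphere.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                summed = [sum(column) for column in zip(*members)]
                centroids[c] = normalize(summed)
    return labels, centroids


# Hypothetical toy data: two groups of vectors pointing in different directions.
toy_vectors = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.0],
]
labels, centroids = spherical_kmeans(toy_vectors, k=2)
print(labels)
```

Because similarity depends only on direction, two tweets whose vectors differ in magnitude but point the same way land in the same cluster, which is why this family of methods pairs naturally with spherical embeddings.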
Dr. Durga Toshniwal and Aarzoo Dhiman authored a research paper, An Unsupervised Misinformation Detection Framework to Analyze the Users Using COVID-19 Twitter Data,3 on this topic. The paper was presented at the 2020 IEEE International Conference on Big Data (IEEE Big Data 2020). The project is part of a larger ongoing effort to explore and refine techniques to detect and address the spread of false content on social media at an early stage. The identified content and the source user can then be removed from the social media platform.
History
Data collection using the Twitter API began in January 2020 and continued until April 2020, ingesting about 62 million tweets.
“We proposed an NLP-based framework,” Dr. Durga said, “to identify the misinformed content and the source of the spread. We also compared the geographic spread and emotional intensity of the misinformed tweets with the other tweets. The manual search of the identified fake news-spreading users showed they were already suspended from Twitter. We also compared our model with other state-of-the-art fake news detection models, where our method achieved the best precision and recall.”
Figure 2 shows a breakdown across five user categories, compared by emotion intensity score. This data was collected from March 27, 2020 to April 10, 2020.
Figure 2. Proportion of tweets with emotion intensity scores greater than 0.5
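The metric plotted in Figure 2 can be sketched as a simple per-category proportion. The example below is only an illustration: the category names and scores are hypothetical, and the source does not describe how the emotion intensity scores themselves were computed.

```python
from collections import defaultdict


def proportion_above(records, threshold=0.5):
    """Return, per user category, the share of tweets scoring above threshold.

    records is an iterable of (category, emotion_intensity_score) pairs.
    """
    totals = defaultdict(int)
    above = defaultdict(int)
    for category, score in records:
        totals[category] += 1
        if score > threshold:  # strictly greater than, as in the figure caption
            above[category] += 1
    return {category: above[category] / totals[category] for category in totals}


# Hypothetical sample records, not data from the study.
sample_records = [
    ("news", 0.7), ("news", 0.2),
    ("misinformed", 0.9), ("misinformed", 0.8), ("misinformed", 0.4),
    ("verified", 0.5),
]
proportions = proportion_above(sample_records)
print(proportions)
```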
“The proposed methodology leverages the popularity of the tweets in the form of reposts to identify the influential content on social media,” said Dr. Durga.
“This influential content,” she continued, “is extracted in the form of critical clusters. Along with textual information, the credibility of the source of popular tweets is assessed by using four primary credible sources: news, medical organizations, medical influencers, and verified users. These categories are dynamically updated using the semantic similar textual information from the user descriptions of these user categories. The framework has been tested on two public annotated datasets from FakeNewsNet and has been compared with the baseline classification, clustering, and hybrid models for misinformation detection.”
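A keyword search with a dynamically extended seed list might look like the sketch below. It rests on several assumptions: the seed keywords, stopword list, and sample descriptions are hypothetical, and simple co-occurrence counting is used here as a stand-in for the semantic similarity the framework relies on.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept short for illustration.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}


def tokens(text):
    """Lowercase a description and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(description, seeds):
    """Return the category whose seed keywords match best, or None."""
    words = set(tokens(description))
    best = max(seeds, key=lambda category: len(words & seeds[category]))
    return best if words & seeds[best] else None


def extend_seeds(seeds, verified_descriptions, min_count=2):
    """Add words that recur in verified descriptions already matching a category.

    Assumption: co-occurrence counts approximate semantic similarity.
    """
    counts = {category: Counter() for category in seeds}
    for description in verified_descriptions:
        category = categorize(description, seeds)
        if category is not None:
            new_words = set(tokens(description)) - STOPWORDS - seeds[category]
            counts[category].update(new_words)
    extended = {category: set(keywords) for category, keywords in seeds.items()}
    for category, counter in counts.items():
        extended[category] |= {w for w, n in counter.items() if n >= min_count}
    return extended


# Hypothetical seed lists and verified-user descriptions.
seed_lists = {
    "news": {"news", "journalist"},
    "medical_org": {"hospital", "health"},
}
verified = [
    "Breaking news reporter at the Daily",
    "Journalist and reporter covering politics",
    "Public health agency and clinic",
    "Hospital and clinic network",
]
extended_lists = extend_seeds(seed_lists, verified)
print(categorize("Freelance reporter", extended_lists))
```

In this sketch, a description that matched no seed keyword before the extension can be categorized afterward, which is the point of updating the lists from incoming data.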
Figure 3 shows the breakdown of user interactions globally on Twitter from March 7, 2020 to March 26, 2020.
Figure 3. Top user interactions compiled from the Twitter COVID-19 research
Dr. Durga noted that the project is unique as it can automatically ingest and process massive amounts of unlabeled data using unsupervised methods, thus eliminating laborious manual annotation of data. Beyond academic and research applications, the project has potential commercial applications.
Enabling Technologies
Hardware based on Intel® architecture accessed through the Intel® DevCloud4 delivered the processing power to contend with the large volumes of data during the coding and testing stages of the project.
Dr. Durga observed, “As we progressed, we could get Intel support in the form of oneAPI resources and toolkits, and DevCloud [sic], which helped us accelerate the implementation.”
Rather than dividing the processing between CPUs and GPUs, Dr. Durga worked only with the CPU, an Intel® Xeon® Scalable processor available through the Intel DevCloud. “The model’s Xeon [sic] performance,” she said, “was already much better than competing platforms and models.” Dr. Durga relied on the Intel® VTune™ Profiler to analyze and optimize memory usage.
Summarizing the project and the supporting role played by Intel to efficiently enable the AI and NLP work, Dr. Durga said, “The hardware and software support provided by Intel improved all the dimensions of the project. Intel oneAPI data science and AI toolkits (Intel® oneAPI AI Analytics Toolkit) provide optimized Python* libraries for accelerated machine learning algorithms, including scikit-learn*. XGBoost—optimized for Intel—and deep learning frameworks, such as TensorFlow* and PyTorch*, enhance performance across multiple architectures. Intel DevCloud provides easy access to powerful hardware and these optimized AI toolkits, providing a unified cloud computing platform for project development.”
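For readers who want to try the optimized scikit-learn path mentioned above, the sketch below shows one common way to enable it with the Intel extension for scikit-learn (the `sklearnex` module from the scikit-learn-intelex package). The source does not say how the toolkits were enabled in this project, so treat this as an assumption; the helper name `enable_intel_sklearn` is hypothetical, and the code falls back gracefully when the extension is not installed.

```python
def enable_intel_sklearn():
    """Try to switch scikit-learn to its Intel-optimized implementations.

    Returns True if the patch was applied and False if the extension
    is not installed in the current environment.
    """
    try:
        # Provided by the scikit-learn-intelex package.
        from sklearnex import patch_sklearn
    except ImportError:
        return False
    # Call this before importing any scikit-learn estimators.
    patch_sklearn()
    return True


accelerated = enable_intel_sklearn()
print("Intel acceleration enabled:", accelerated)
```

Estimators imported from scikit-learn after a successful patch use the optimized implementations where they are available, without further code changes.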
The work is ongoing and Dr. Durga and her students continue to explore techniques and technologies to enhance the accuracy and speed of AI processes and the development of effective frameworks.
Figure 4. Professor Durga Toshniwal during discussions with PhD students (left to right) Aarzoo Dhiman, Narayan Chaturvedi, and Deeksha Arya
Conclusion
The milestones that Dr. Durga Toshniwal and Aarzoo Dhiman achieved with this project demonstrate a viable path forward to reduce the instances of misinformation on social media networks in a reliable and automated way, powered by AI. To contend with the tremendous volumes of data involved in the detection of misinformation on Twitter during a four-month period, the Intel® Xeon® Gold 6128 processor, 3.40 GHz (24 cores) with 200 GB RAM, accessed through the Intel DevCloud and used with the Intel® oneAPI toolkits, proved efficient and capable. It helped to accelerate and optimize the complex AI operations.
Resources and Recommendations
Intel® DevMesh
Find out more about the underlying technologies for this project.
Intel DevCloud for oneAPI
Get started quickly in this free environment that allows you to develop, test, and optimize remotely, and avoid system configuration headaches. Sign up to get access.
Intel VTune Profiler
Locate and fix application bottlenecks.
oneAPI Training Catalog
Access self-paced training for oneAPI and Data Parallel C++ (DPC++).
Simplify AI with Intel® Artificial Intelligence Technologies
Discover the latest technologies for building AI solutions.
References
- MIT Sloan School of Management. March 8, 2018. Study: False News Spreads Faster Than the Truth
- BBC News. August 12, 2020. 'Hundreds Dead' Because of COVID-19 Misinformation
- 2020 IEEE International Conference on Big Data (Big Data). 2020. An Unsupervised Misinformation Detection Framework to Analyze the Users Using COVID-19 Twitter Data. Volume: 1, Pages: 679-688. DOI: 10.1109/BigData50022.2020.9378250
- Intel DevCloud