The Project
PBnJ is a Chrome extension that rates the political bias of online news articles and then finds and summarizes similar articles from across the political spectrum.
Our Goals: Accessibility and Granularity
Instead of requiring users to visit a separate page to cross-reference publications on a limited set of topics, our Chrome extension lets them directly compare the articles they regularly view. This flexibility comes from using machine learning models, rather than human editorial reviews, to collect and present source bias. Additionally, we measure bias on a more granular spectrum rather than with a single Left/Right/Center label, and we provide a summary of the key takeaways from articles across the political spectrum.
Bias is a social construction that is difficult to define because the political "mainstream" shifts over time. We have chosen to emulate AllSides's source-level rating system, which combines ratings from:
- Editorial Review
- Blind Surveys
- Third-Party Analysis
- Independent Review
- Community Feedback
We have thus created a proof-of-concept model demonstrating that machine learning can be effectively applied to political bias using predefined rules and metrics.
Defining Bias
The Dataset
The dataset we used consists of ~37,000 articles annotated by AllSides. The articles are categorized into three classes and labeled at the source level, that is, according to their publishers. We used the holdout method to separate subsets of the dataset for training and testing.
While we ultimately want to classify political bias at the article level, we did not have access to a sufficiently large article-level dataset for this project. We would like to pursue this route in the future, given more time and resources.
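As an illustration of the holdout method mentioned above, here is a minimal sketch in plain Python. The 80/20 ratio, the stratification by label, the label names, and the function name `holdout_split` are all assumptions made for this example; they are not details of the actual PBnJ pipeline.

```python
import random
from collections import defaultdict


def holdout_split(records, test_fraction=0.2, seed=0):
    """Split (text, label) records into train and test lists.

    The split is stratified: each label contributes the same fraction
    of its records to the test set (an assumption for this sketch).
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy stand-in for the source-labeled articles (hypothetical data).
articles = [
    (f"article {i}", label)
    for i, label in enumerate(["left", "center", "right"] * 10)
]
train, test = holdout_split(articles, test_fraction=0.2)
```

The test subset is set aside before training and used only to evaluate the trained model, which is the point of a holdout split.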
Data Pipeline
Classifying Bias Model
Our bias classification model is an ensemble of three TF-IDF random forests, dubbed “the random forest forest.” Each forest is a binary classifier assigned to identifying a specific label (e.g. right versus not right). To determine the final label and score of an article, we take the maximum of the positive-class probabilities across the three forests.
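The final decision step can be sketched as follows. This is only an illustration of the "take the maximum positive probability" rule: the function name `combine_one_vs_rest`, the label names, and the example probabilities are hypothetical. In practice each value would come from one of the three binary forests (for instance, the positive-class column of a classifier's probability output).

```python
def combine_one_vs_rest(positive_probs):
    """Pick the label whose binary classifier is most confident.

    positive_probs maps each label to the probability that its own
    binary forest assigned to the positive class ("is this label").
    Returns (label, score), where score is that winning probability.
    """
    if not positive_probs:
        raise ValueError("expected at least one label probability")
    label = max(positive_probs, key=positive_probs.get)
    return label, positive_probs[label]


# Hypothetical outputs of the three forests for a single article.
probs = {"left": 0.21, "center": 0.35, "right": 0.64}
label, score = combine_one_vs_rest(probs)
print(label, score)  # right 0.64
```

Because the three forests are trained independently, their positive probabilities need not sum to 1; the rule only compares them against each other.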
The summarization of related alternative articles uses gensim's summarization function to condense each article into a few sentences.
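To give a feel for what extractive summarization does, here is a small stand-in written in plain Python. It is not gensim's algorithm and not the code used in PBnJ: it simply scores sentences by average word frequency and keeps the top few in their original order. The function name `summarize_extractive` and the sample text are made up for this sketch.

```python
import re
from collections import Counter


def summarize_extractive(text, max_sentences=2):
    """Return the highest-scoring sentences of text, in original order.

    A sentence's score is the average frequency (over the whole text)
    of the words it contains. This is a simple frequency heuristic,
    used here as a stand-in for a real summarizer.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


sample = (
    "The senate passed the budget bill. "
    "Critics said the bill raises the deficit. "
    "Weather was mild. "
    "Supporters said the bill funds schools."
)
summary = summarize_extractive(sample, max_sentences=2)
```

Like any extractive approach, this selects sentences that already exist in the article rather than generating new text.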