Subhabrata (Subho) Mukherjee

Machine Learning Scientist

 email: subhabrata.mukherjee.ju[aT]


I am a Machine Learning Scientist at Amazon (Seattle) leading the information extraction efforts to build the Amazon Product Knowledge Graph. I develop machine learning and deep learning models for information extraction and natural language understanding. My doctoral thesis at the Max Planck Institute for Informatics (Germany) on misinformation and fact-checking was the runner-up for the prestigious SIGKDD 2018 Doctoral Dissertation Award (one of the top-3 best doctoral dissertations world-wide in data mining). My research interests involve representation learning and graphical models to capture the joint interaction between structure, content, and dynamics of information --- with a particular focus on interpretability and user-centric information needs.

In my PhD dissertation, I worked on probabilistic graphical models to extract "credible", "trustworthy" and "expert" information from large-scale, non-expert, user-generated online content. I developed machine learning models that exploit the joint interaction between users, language, and their evolution in online communities for tasks such as credibility analysis, personalized content recommendation, (latent) experience-aware item recommendation, finding (latent) topic-specific experts in online communities, and spam and anomaly detection.

[PhD Thesis on Credibility Analysis]     [SIGKDD 2018 Dissertation Talk Slides]     [CV]

Research Interests

  • Information Extraction (IE) (specifically: user-centric IE and IR)

  • Natural Language Understanding

  • Applied Machine Learning (specifically: Deep Learning, Generative Models, Graphical and Topic Models)

  • Misinformation, Fact-checking

Recent News

  • Paper accepted as an oral presentation in WWW 2019, USA (18% acceptance rate). We present GhostLink, an unsupervised probabilistic graphical model that automatically learns the latent influence network underlying a review community -- given only the temporal traces (timestamps) of users' posts and their content -- to improve item recommendation and detect influential users. Congrats to my co-author Stephan Guennemann from TU Munich.

  • Serving in the PC of SIGKDD 2019 (Research Track + Applied Data Science Track)

  • SIGKDD (2018) Doctoral Dissertation Award Runner-up (one of the top-3 best doctoral dissertations world-wide in data mining). Dissertation talk at SIGKDD 2018 [Slides]

  • Paper accepted as an oral presentation in EMNLP 2018, Brussels, Belgium (26% acceptance rate). We present "DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning" --- an end-to-end neural network model for fact-checking arbitrary textual claims that also generates human-interpretable evidence for its verdict. Congrats to all my co-authors Kashyap Popat, Andrew Yates and Gerhard Weikum from Max Planck Institute.

  • Paper accepted as an oral presentation in SIGKDD 2018, London (8% acceptance rate). Our work "OpenTag: Open Attribute Value Extraction from Product Profiles" [Slides] brings deep learning and active learning together for a state-of-the-art imputation and open entity extraction system. Congrats to all my co-authors Guineng Zheng (Univ. of Utah), Xin Luna Dong (Amazon), and Feifei Li (Univ. of Utah).

  • Presenting 2 tutorials in SIGKDD 2018, London, UK

  • Our demo "CredEye: A Credibility Lens for Analyzing and Explaining Misinformation" was accepted in WWW '18. Try it out [here] and watch the video on how it works [here]. Congrats to the co-authors Kashyap, Jannik, and Gerhard!

  • Joined Amazon (Seattle) in Oct 2017 as a Machine Learning Scientist to build the Amazon Product Knowledge Graph

  • Defended my [PhD Thesis on Credibility Analysis] summa cum laude, July 2017
    Dissertation Committee: Prof. Dr. Gerhard Weikum, Prof. Dr. Jiawei Han, Prof. Dr. Stephan Guennemann, Prof. Dr. Dietrich Klakow


Selected Invited Talks

Probabilistic Graphical Models for Credibility Analysis in Evolving Online Communities
  • SIGKDD 2018 Doctoral Dissertation Award Talk, London, UK
  • MIT Media Lab, Cambridge, USA
  • Amazon, Seattle, USA
  • Bell Labs, Cambridge, UK
  • IBM Research Lab, Zurich, Switzerland
Recorded Talks

Recent Positions

  • Oct 2017 - present:
    Machine Learning Scientist at Amazon (Seattle)
    • Leading the information extraction (IE) efforts to build the Amazon Product Knowledge Graph --- the authoritative Knowledge Base of every item in the world. Developing large-scale machine learning and deep learning models to extract structured knowledge from large-scale unstructured data for tasks such as Named Entity Recognition (NER), Data Imputation, OpenIE, and Common Sense Knowledge Integration.

  • Mar 2017 - Sep 2017:
    Postdoctoral Researcher at Max Planck Institute for Informatics
    • Areas: Credibility Analysis, Recommender Systems, Influence Networks

  • Aug 2015 - Dec 2015:
    Intern at Google Research (Mountain View, CA) in Machine Learning and Intelligence
    • Worked on semantic annotation of large-scale datasets (audio, video, web-tables, map-reduce job logs, etc.) with the Knowledge Graph to improve Google Datasearch by making it aware of the salient semantic types of the entities present in any dataset.

  • Oct 2012 - Oct 2013:
    Research Engineer at IBM Research (India) in Human Language Technologies
    • Domain Cartridge: Unsupervised framework for constructing domain ontologies from a corpus of knowledge articles that improves the recall of Question-Answering systems (e.g., Watson) by making them aware of domain-specific entities and their relations.
    • Self-Assist Systems: Unsupervised framework for self-assist systems that can serve as virtual call center agents to guide the customer in performing various domain-dependent tasks (like troubleshooting a problem, changing settings in devices, etc.).
    • Personalized Sentiment Analysis: Generative models for personalized recommendation that take into account user preferences, intent, latent item facets, etc.
    • Intent Classification for Voice Search: Intent classification of voice queries on mobile devices (e.g., map, command-and-control, navigational, and knowledge-based queries for voice search).

  • July 2012 - Sep 2012:
    Technology Analyst at Credit Suisse Business Analytics Pvt. Ltd. (India)
    • Worked on High Frequency Trading

Academic Service

  • Organizer: Domain Specific Speech and Language Understanding Workshop, Amazon Machine Learning Conference (AMLC 2018); Knowledge Graphs: Construction, Management and Querying, Semantic Web Journal (Editorial Board Member)

  • Panelist: National Science Foundation (NSF)

  • Program Committee: SIGKDD 2019 (Research Track + Applied Data Science Track), Amazon Research Awards (ARA 2017, ARA 2018), Amazon Machine Learning Conference (AMLC 2018, AMLC 2019), Humanizing Artificial Intelligence (IJCAI 2018), Natural Language Interfaces for Web of Data (ISWC 2018), Exploiting AI for Data Management Systems (SIGMOD 2018), Interactive Data Exploration and Analytics (KDD 2017), Social Aspects in Personalization and Search (ECIR 2018)

  • Journal Reviewer: ACM Transactions on Knowledge Discovery from Data (TKDD), IEEE Transactions on Knowledge and Data Engineering (TKDE), Information Systems (Journal), Data Mining and Knowledge Discovery (DAMI), Artificial Intelligence (Journal), IEEE Transactions on Computational Social Systems (TCSS), Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Journal of Web Semantics, Journal of Human-Computer Studies

Mentees (Interns and PhD Collaborators)

  • Dongxu Zhang, University of Massachusetts Amherst (Topic: OpenIE Knowledge Integration, Universal Schema for Information Extraction)

  • Guineng Zheng, University of Utah (Topic: Deep Sequence Tagging, Named Entity Recognition)

  • Hyun Ah Song, Carnegie Mellon University (Topic: Wisdom Graph for Common Sense Reasoning)

  • Kashyap Popat, Max Planck Institute for Informatics (Topic: Misinformation, Fact-checking)

  • Rakshit Trivedi, Georgia Institute of Technology (Topic: Wisdom Graph for Common Sense Reasoning)

Research Areas and Publications [DBLP] [Google Scholar]

Information Extraction, Representation Learning

Finding Experts, Personalized Recommendation, User / Community Evolution, Topic / Generative Models, Review Communities

Credibility Analysis, Conditional Random Fields (CRF)

Domain Ontology, Sentiment Aggregation

Sentiment Analysis

Dialogue Systems, Intent Classification

  • Help Yourself: A Virtual Self-Assist Agent [Tags: @IBM Research]
    Subhabrata Mukherjee and Sachindra Joshi
    In WWW 2014, Seoul, South Korea [Demo Paper] [Slide V1] [Slide V2]

  • Intent Classification of Voice Queries on Mobile Devices [Tags: Voice Search, @IBM Research]
    Subhabrata Mukherjee, Ashish Verma and Kenneth W. Church
    In WWW 2013, Rio de Janeiro, Brazil [Poster] [Slide]

  • YouCat: Weakly Supervised Youtube Video Categorization System from Meta Data & User Comments using WordNet & Wikipedia
    Subhabrata Mukherjee and Pushpak Bhattacharyya
    In COLING 2012, Mumbai, India [Paper] [Slides] (Acceptance rate: 16%)