Ph.D. student
Office ISI 1038
Information Sciences Institute (ISI), University of Southern California
4676 Admiralty Way, Marina del Rey, CA, 90292
Email: binhlvu@usc.edu or binhvu@isi.edu

Biography

I am a Ph.D. student in Computer Science at the University of Southern California (USC), advised by Prof. Craig Knoblock. Before joining USC in 2016, I earned my B.E. in Computer Science from HCMC University of Technology (honors program) in 2015.

I am broadly interested in machine learning, with a focus on techniques for knowledge graph construction. My current research is on semantic modeling for automatically publishing structured data sources to knowledge graphs. More information is available on my resume.

Projects

SAND: A Tool for Creating Semantic Descriptions of Structured Sources

Building semantic descriptions of tables is a vital step in data integration. However, this task is expensive and time-consuming, as users often need to examine the table data, its metadata, and ontologies to find the most appropriate description. SAND is a tool for creating semantic descriptions semi-automatically. It integrates easily with semantic modeling systems that predict or suggest semantic descriptions to users, and it works with different knowledge graphs (KGs). Besides its modeling capabilities, SAND provides browsing and querying tools that let users explore the data in a table and discover how it is typically modeled in KGs.
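Under the hood, a semantic description is a small graph that links table columns to ontology classes and properties. A minimal Python sketch of that data structure, using real Wikidata identifiers for illustration only (this is not SAND's internal representation):

    from dataclasses import dataclass, field

    @dataclass
    class ClassNode:
        """An ontology class instantiated by the table, e.g. wd:Q6256 (country)."""
        uri: str

    @dataclass
    class SemanticDescription:
        """A graph mapping columns to ontology terms. Each edge
        (class_node, property_uri, column_index) says: the column holds
        values of property_uri for instances of class_node."""
        nodes: list[ClassNode] = field(default_factory=list)
        edges: list[tuple[ClassNode, str, int]] = field(default_factory=list)

    # Hypothetical description of a table with columns [country, population]:
    country = ClassNode("wd:Q6256")       # wd:Q6256 = country
    desc = SemanticDescription(
        nodes=[country],
        edges=[
            (country, "rdfs:label", 0),   # column 0 names the country
            (country, "wdt:P1082", 1),    # column 1 is its population
        ],
    )
    for node, prop, col in desc.edges:
        print(f"column {col} -> {prop} of {node.uri}")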

Paper GitHub
A Graph-based Approach for Inferring Semantic Descriptions of Wikipedia Tables

Wikipedia contains millions of high-quality tables that cover many domains and contain useful information. To use these tables for data discovery or data integration, we need precise descriptions of the concepts and relationships in the data, known as semantic descriptions. However, creating semantic descriptions is a complex process requiring considerable manual effort and can be error-prone. We present a novel probabilistic approach for automatically building semantic descriptions of Wikipedia tables. Our approach leverages hyperlinks in a Wikipedia table and existing knowledge in Wikidata to construct a graph of possible relationships in the table and its context, then uses collective inference to distinguish genuine from spurious relationships and form the final semantic description. In contrast to existing methods, our solution can handle tables that require complex semantic descriptions of n-ary relations (e.g., the population of a country in a particular year) or implicit contextual values to describe the data accurately. In our empirical evaluation, our approach outperforms state-of-the-art systems on the SemTab2020 dataset and exceeds those systems by as much as 28% in F1 score on a large set of Wikipedia tables.
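At its core, the approach is candidate-then-select: enumerate possible relationships between columns using the KG, then keep a consistent subset. A heavily simplified Python sketch of that pattern; the scoring and greedy selection below are placeholders for the paper's collective inference, and the stubbed KG lookup stands in for real Wikidata queries:

    from itertools import combinations

    def candidate_relationships(columns, kg_lookup):
        """Enumerate candidate (source_col, property, target_col, score) edges.
        kg_lookup(col_a, col_b) is a stand-in for a Wikidata query returning
        {property: number_of_rows_supporting_it}."""
        candidates = []
        for i, j in combinations(range(len(columns)), 2):
            for prop, support in kg_lookup(columns[i], columns[j]).items():
                score = support / max(len(columns[i]), 1)  # fraction of supporting rows
                candidates.append((i, prop, j, score))
        return candidates

    def select_description(candidates, threshold=0.5):
        """Greedy stand-in for collective inference: keep the best-supported
        property per column pair."""
        best = {}
        for i, prop, j, score in candidates:
            if score >= threshold and score > best.get((i, j), ("", 0.0))[1]:
                best[(i, j)] = (prop, score)
        return {pair: prop for pair, (prop, _) in best.items()}

    # Toy usage: two columns whose cells are linked by wdt:P36 (capital).
    cols = [["Norway", "Sweden"], ["Oslo", "Stockholm"]]
    stub = lambda a, b: {"wdt:P36": 2}
    print(select_description(candidate_relationships(cols, stub)))
    # {(0, 1): 'wdt:P36'}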

Paper GitHub Video
Dataset Representation Language for Reading Heterogeneous Datasets to RDF or JSON

  • Reading public datasets is a laborious task and frequently requires writing custom code, because data are stored in many different formats (CSV, JSON, spreadsheets, NetCDF, etc.) with different layouts (row-based, matrix, hierarchical).
  • To address this problem, we created D-REPR, a language for representing heterogeneous datasets, and an efficient D-REPR processor that reads datasets in their native formats into a common representation (see the sketch below).
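The sketch below illustrates the idea for a row-based CSV: a declarative spec, not custom code, tells a generic reader where each attribute lives and what it means. The spec format here is invented for illustration and is not D-REPR's actual syntax; see the GitHub repository for the real language:

    import csv, io

    # Hypothetical mapping spec (not real D-REPR syntax): each attribute is a
    # column index, and `semantics` gives the RDF predicate it maps to.
    SPEC = {
        "attributes": {"city": 0, "population": 1},
        "semantics": {"city": "rdfs:label", "population": "wdt:P1082"},
    }

    def read_to_triples(raw_csv, spec):
        """Read a row-based CSV into (subject, predicate, object) triples,
        minting one blank-node subject per row."""
        triples = []
        for row_id, row in enumerate(csv.reader(io.StringIO(raw_csv))):
            subject = f"_:row{row_id}"
            for attr, col in spec["attributes"].items():
                triples.append((subject, spec["semantics"][attr], row[col]))
        return triples

    print(read_to_triples("Oslo,700000\nStockholm,980000", SPEC))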

Paper GitHub
Learning Semantic Models of Data Sources Using Probabilistic Graphical Models

  • The problem is to automatically build semantic descriptions of data in order to publish structured sources to knowledge graphs.
  • We use beam search to generate candidate semantic models and a probabilistic graphical model as the scoring function that guides the search (a sketch of the search loop follows this list).
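A minimal sketch of that search loop; expand and score are stubs here, with score standing in for the probabilistic graphical model:

    def beam_search(initial_model, expand, score, beam_width=5, steps=10):
        """Generic beam search: keep the beam_width best partial semantic
        models at each step. expand(m) yields extensions of model m (e.g.,
        mapping one more column); score(m) stands in for the PGM."""
        beam = [initial_model]
        for _ in range(steps):
            candidates = [c for m in beam for c in expand(m)]
            if not candidates:
                break
            beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        return max(beam, key=score)

    # Toy usage: "models" are tuples of letters, scored by length.
    best = beam_search((), lambda m: [m + (c,) for c in "ab"] if len(m) < 3 else [],
                       score=len, beam_width=2)
    print(best)  # ('a', 'a', 'a')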

Paper GitHub
From Sketches to HTML/CSS Code

Using deep learning to synthesize HTML/CSS programs from mock-up UIs (course project).
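A minimal PyTorch sketch of the underlying image-to-sequence idea (a CNN encodes the mock-up, an LSTM decodes markup tokens); layer sizes and vocabulary are illustrative, not the project's actual architecture:

    import torch
    import torch.nn as nn

    class SketchToMarkup(nn.Module):
        """Encode a mock-up image with a CNN, then decode HTML/CSS tokens
        with an LSTM conditioned on the image embedding."""
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, hidden_dim),
            )
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, token_ids):
            h0 = self.encoder(images).unsqueeze(0)   # (1, batch, hidden)
            dec, _ = self.decoder(self.embed(token_ids), (h0, torch.zeros_like(h0)))
            return self.out(dec)                     # (batch, seq, vocab) logits

    # Toy forward pass: 2 images, 5 previously generated tokens each.
    model = SketchToMarkup(vocab_size=50)
    logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 50, (2, 5)))
    print(logits.shape)  # torch.Size([2, 5, 50])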

Blog Post GitHub
Identifying Potential Company Hardware/Software Vulnerabilities

Automatically crawling online sources to retrieve the expertise of a company's employees, then predicting the software and hardware used in the company. A list of potential vulnerabilities is obtained by linking that software and hardware to the CVE database.
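A minimal sketch of the final linking step, matching predicted product names against a locally downloaded CVE feed; the feed structure is simplified for illustration (NVD's real feeds identify affected products via CPE entries, not a flat products list):

    import json, re

    def load_cve_index(feed_path):
        """Build a product-name -> [CVE IDs] index from a simplified feed
        where each entry has an `id` and a list of affected `products`."""
        index = {}
        with open(feed_path) as f:
            for entry in json.load(f):
                for product in entry["products"]:
                    index.setdefault(product.lower(), []).append(entry["id"])
        return index

    def link_vulnerabilities(predicted_software, cve_index):
        """Normalize predicted software names and look them up in the index."""
        hits = {}
        for name in predicted_software:
            key = re.sub(r"\s+", " ", name.strip().lower())
            if key in cve_index:
                hits[name] = cve_index[key]
        return hits

    # Toy usage with an inline index (CVE-2021-41773 is a real Apache HTTP
    # Server vulnerability, used here only as an example ID):
    index = {"apache http server": ["CVE-2021-41773"]}
    print(link_vulnerabilities(["Apache  HTTP Server"], index))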

Selected Publications

Awards