My research interests are in the broad areas of Data Mining, Machine Learning and Artificial Intelligence. Specifically, I am interested in Data Cleaning, Data Analytics, Data Exploration, Hidden Web Databases, Crowdsourcing, Social Content Mining and Social Networks.
Structured hidden databases are widely prevalent on the Web. They provide restricted, form-like search interfaces that let users specify desired attribute values of the sought-after tuples; the system responds by returning only a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable ranking function. The top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. This research involves developing effective techniques for retrieving more than the top-k tuples for any query and for supporting additional rank-based analytics, such as estimating the rank of a tuple or comparing the ranks of two arbitrary tuples to determine which is ranked higher. Our techniques access the hidden structured databases via their public interfaces and operate without any knowledge of the underlying static ranking function.
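To make the top-k constraint concrete, here is a minimal sketch of one generic way to get past it: whenever a query overflows, refine it by adding one more attribute condition and recurse. This is an illustration under stated assumptions, not the techniques developed in this project. The toy table, the `search` and `crawl` functions, and the assumption that the interface reports whether a query overflowed are all hypothetical.

```python
# Toy stand-in for a hidden web database; all names and data are hypothetical.
ROWS = [
    {"id": 1, "make": "A", "color": "red", "price": 10},
    {"id": 2, "make": "A", "color": "blue", "price": 20},
    {"id": 3, "make": "A", "color": "green", "price": 30},
    {"id": 4, "make": "B", "color": "red", "price": 15},
    {"id": 5, "make": "B", "color": "blue", "price": 25},
    {"id": 6, "make": "B", "color": "green", "price": 35},
]
DOMAINS = {"make": ["A", "B"], "color": ["red", "blue", "green"]}
K = 2  # the interface never returns more than K tuples per query


def search(conditions, k=K):
    """Simulated form interface: top-k matches plus an overflow flag.

    The ranking (by price) is hidden from the caller, who only sees the
    returned tuples. The overflow flag is an assumption about the interface.
    """
    matches = [r for r in ROWS if all(r[a] == v for a, v in conditions.items())]
    matches.sort(key=lambda r: r["price"])
    return matches[:k], len(matches) > k


def crawl(conditions=None, k=K):
    """Retrieve all tuples matching `conditions` by refining overflowing queries."""
    conditions = conditions or {}
    top, overflow = search(conditions, k)
    if not overflow:
        return top  # the answer is complete, nothing is hidden
    free = [a for a in DOMAINS if a not in conditions]
    if not free:
        return top  # fully specified query still overflows; the rest stays hidden
    attr = free[0]
    found = {}
    for value in DOMAINS[attr]:
        for row in crawl({**conditions, attr: value}, k):
            found[row["id"]] = row  # de-duplicate by tuple id
    return list(found.values())


all_rows = crawl()
print(sorted(r["id"] for r in all_rows))  # [1, 2, 3, 4, 5, 6]
```

The sketch issues one query per refinement, so its cost grows with the attribute domains; it also says nothing about rank estimation or rank comparison, which need separate machinery.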
The growing popularity of online collaborative content sites such as Netflix/MovieLens (movie ratings), Flickr (images), and YouTube (videos) has produced enormous amounts of data that let us peer into the collective mind of customers. Knowing what items customers like and why they like them is essential for any successful business. Various user-item interactions such as visits, likes, +1s, ratings, and reviews provide a rich window into what users like, but knowing why a user likes an item is much trickier, as few users leave elaborate comments explaining their preferences. While users are drawn to an item because of a subset of its features, a user-item interaction only expresses a preference over the entire item, not over its component features. This project concerns developing data mining and exploration algorithms for performing aggregate analytics over the user interactions (visits, likes, +1s, ratings, etc.) available from collaborative content sites, and using the resulting information to aid customers' consumption decisions, rank features, or identify frequently liked sets of features.
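As a rough illustration of what "frequently liked sets of features" could mean, the sketch below counts how often each pair of features appears among liked items and keeps the pairs above a support threshold. It is a naive counting baseline assumed for illustration only, not the algorithms developed in this project; the function name, the sample items, and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_feature_sets(item_features, likes, size, min_support):
    """Count feature sets of a given size over liked items.

    item_features: dict mapping item id -> set of feature names
    likes: iterable of (user, item) pairs, one per "like" interaction
    Returns the feature sets liked at least `min_support` times.
    """
    counts = Counter()
    for _user, item in likes:
        features = sorted(item_features[item])  # sort for a canonical tuple order
        for combo in combinations(features, size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Hypothetical sample data.
item_features = {
    "m1": {"comedy", "90s"},
    "m2": {"comedy", "90s", "sequel"},
    "m3": {"drama"},
}
likes = [("u1", "m1"), ("u2", "m1"), ("u1", "m2"), ("u3", "m3")]

result = frequent_feature_sets(item_features, likes, size=2, min_support=2)
print(result)  # {('90s', 'comedy'): 3}
```

A like on an item is credited to every feature set the item contains, which is exactly the ambiguity described above: the interaction does not say which features actually drew the user in.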
Social networks such as Facebook and microblogging platforms such as Twitter have experienced phenomenal growth in popularity in recent years, making them attractive platforms for research in diverse fields from computer science to sociology. However, most of these platforms impose strict access restrictions (e.g., API rate limits) that prevent scientists with limited resources from leveraging the wealth of microblogs for analytics. In this project, we consider multiple novel problems, such as enabling efficient aggregate estimation over social networks and microblog platforms. In addition, we investigate the feasibility of supporting complex queries over the limited search interfaces provided by these platforms, and the various tradeoffs involved in enabling advertising over microblogs.
Crowdsourcing systems have gained popularity in a variety of domains. Next-generation crowdsourcing systems will be collaborative and knowledge-intensive in nature. They need to treat crowdsourcing not as a set of isolated optimization problems, but as a single adaptive optimization problem that seamlessly handles the three main crowdsourcing processes (worker skill estimation, task assignment, and task evaluation) and incorporates the uncertainty stemming from human factors. The main thrust of this project is to develop algorithms for such an adaptive, knowledge-intensive crowdsourcing scenario by quantifying human factors and incorporating them into the three major crowdsourcing processes.
In addition to the above focussed projects, I am also involved in multiple other cool, but smaller, projects. Taking part in them has introduced me to some awesome collaborators and new subfields! This is a catchall place to list the publications arising from these projects.
Teleherence uses web and phone technologies to optimize adherence to treatment. It calls the client at agreed-upon times, delivers reminders and messages, asks questions, graphs responses, sends desired alerts, and flags potential problems or opportunities using smart algorithms. It uses text-to-speech and speech recognition along with landline, cell, smartphone, SMS (texting), and VoIP phone technology. The system can also deliver pre-recorded audio files such as motivational messages from the care manager. Development is in partnership with Mental Health Mental Retardation of Tarrant County, with support from the National Institutes of Health, National Library of Medicine.