Curriculum

Cutting-Edge Data Science Meets Real-World Impact

The Master of Advanced Studies in Data Science and Engineering program brings together the skills of a software programmer, database manager, and statistician to build mathematical models of data, identify trends and deviations, and present the results in visual forms that others can readily understand.

Students complete a focused 38-unit curriculum that blends programming, analysis, applications management, and visualization. The program includes three foundational courses and six advanced core courses, culminating in a capstone project that applies classroom learning to real-world challenges.

Graduates gain more than technical expertise and real-world application knowledge—they unlock access to UC San Diego’s world-class ecosystem of innovation, industry partnerships, and professional development opportunities.

Foundational Courses (3)

DSE 200: Python for Data Analysis

The goal of this course is to bring students with diverse backgrounds and experience to a common level of competency in programming in the context of complex and noisy data. Solid competency in Python programming gives students autonomy and independence in their work. Topics include an introduction to object-oriented programming using Python, regular expressions, NumPy and numerical processing, IPython and plotting, data analysis using pandas, web page scraping using Scrapy, the Twitter API, and NLTK.

DSE 201: Database Management Systems

This course provides an introduction to the management of structured data, beginning with database models, including relational, hierarchical, and network approaches. It also covers topics in database system implementation, including query languages and system architectures; parallel, column-oriented, and array-based database systems; advanced SQL features, including user-defined functions (UDFs), triggers, and statistical functions; and support for spatial data.

DSE 203: Data Integration and ETL

The course provides students with the fundamentals of data integration, including schema mapping and matching, entity disambiguation, ontology development and management, data provenance, and crowdsourcing and machine learning as strategies for integration. The course also requires hands-on projects in which students work on a data integration problem that combines two or more datasets taken from an application domain of their choice (e.g., geospatial data, healthcare, financial applications, or bioinformatics).

Advanced Core Courses (6)

DSE 210: Statistics and Probability Using Python

Probability and statistics for data science. Distributions over the real line; independence, expectation, variance, correlation. Central limit theorem. Chernoff/Hoeffding bounds. Statistical tests. Bonferroni correction.

DSE 220: Machine Learning

This course provides a broad introduction to the practical side of machine learning and data analysis. Topics include supervised learning methods, such as k-nearest neighbor classifiers, decision trees, boosting, and perceptrons, and unsupervised learning methods, such as k-means, PCA, and Gaussian mixture models.

DSE 230: Scalable Data Analysis

The course exercises the data scientist’s scalability toolbox, covering concepts such as map-reduce, streaming analysis, and external memory algorithms, as well as their implementation options in popular frameworks (e.g., Hadoop and its ecosystem: HBase, Hive, Pig, and Spark). The class includes assignments that analyze large existing databases.

DSE 241: Data Visualization

The goal of the course is to use visualization as a tool to explore trends and relationships, confirm hypotheses, communicate findings, and gain insight into data. The course focuses on teaching students the principles and techniques for creating visual representations from raw data. The course exercises are based on publicly available datasets and use freely available tools such as D3.js and VisIt. The course is modeled on Stanford’s CS 448 visualization course. Topics include an introduction to visualization, a review of visualization foundations, color, interaction, dashboards and heat maps, an introduction to D3.js, high-dimensional data, network data, geographic data, text data, scientific visualization (isosurfaces and volume rendering), and an introduction to VisIt.

DSE 250: Beyond Relational Data Models

The course covers data models, query languages, and models of computation beyond those employed in relational databases. It addresses new developments that have gained attention with the advent of the Web 2.0 and big data revolutions. The topics are presented in a unifying framework and include key-value pairs as a data model, as used in Google’s Bigtable; the object-oriented data model, with its practical support in relational databases via object-relational mapping (involving the ODMG standards ODL and OQL, and more recent systems such as Ruby on Rails); semi-structured databases (data organized as a graph with labels on nodes and edges, with query languages based on reachability constraints between nodes, such as conjunctive regular path queries); XML databases, as a special case of semi-structured databases in which the graph is a tree (involving associated standards such as XML Schema, XPath, and XQuery); and RDF databases (with the associated OWL and SPARQL standards).

DSE 290: Case Studies in Data Science

Case studies discussed by speakers from industry, government, and academia expose students to the needs and uses of different technologies and their roles in model building.

Capstone (1)

DSE 260A/B: Capstone Project

A team design project in the final two quarters of the program culminates in a final report and an oral presentation of the capstone project. In addition, there may be a demonstration of a working prototype. The project starts by identifying a domain of interest and the available data sources that will be used to study the domain. From this starting point there are two parallel and interdependent lines of work: data extraction, transformation, and loading (ETL), and statistical analysis and model building. The ultimate goal is to present a processing pipeline that transforms the raw data into more usable forms, together with models that separate the predictable from the unpredictable aspects of the underlying system. Examples of previous capstone projects can be found here.