When COVID-19 temporarily shuttered school buildings across the country in spring 2020, EA immediately sprang into action to build a platform that teachers could use to learn about their new students in fall 2020. The usual data sources teachers rely on to get to know an incoming class, like attendance, end-of-year statewide assessments, and reliable gradebook information, were largely unavailable. We started with thousands of data files from disparate sources that needed to be integrated and stored. We ended with the Rally Analytics Platform, an equity-focused and action-oriented dashboard that reports student well-being and academic data, analytics, and predictions.
Here, we describe the process of building the underlying data infrastructure for the Rally Analytics Platform: the technologies we chose, the roadblocks we encountered, the design decisions we made, and the lessons we learned along the way.
What were we trying to build?
Product Goals, Requirements, and Constraints
The Rally Analytics Platform was initially developed as an immediate response to the COVID-19 crisis, but we have always maintained a longer-term vision for a platform that allows educators (and those who support educators) to see detailed student assessment data, well-being data, summaries of that data, and predictions to inform where students need support to meet goals in one user-friendly place.
The initial product goal was to integrate data from at least 100 school districts across two states and to present those data in one unified user application. The scope of the product included statistical transformations and predictions, as well as links between granular assessment analytics and relevant, state-specific curriculum standards and resources. At the outset, the fundamental product requirements included:
- Sourcing data about teachers, principals, and their associations to students and schools from multiple data vendors in order to set up permissions and provide access for users
- Sourcing state assessments, interim assessments, and student survey data from a variety of vendors with different data transfer specifications and mechanisms
- Updating the data nightly with any new interim tests or surveys received
- Updating statistical predictions and quantitative transformations nightly
- Presenting data in a series of tables and views for a web application accessible to users
To meet these needs, we established the following design principles:
- Allow features to be developed independently and in parallel where possible, to allow us to build the first draft of the entire system in three months
- Leverage open source data standards wherever possible
- Ensure transparency and auditability of data flows and business rules
- Create opportunities for extensibility and ease of maintainability
- Allow for rapid scale-up of infrastructure if needed to support a large number of users
How did we approach this build?
Technology & Design Choices
Once we understood the key requirements of the product, we made some foundational choices about the overall infrastructure we would build:
- Make it entirely cloud-based for quick scalability
- Use open source technology wherever possible to avoid getting locked in to proprietary systems
- Create clean interfaces to allow parallel development between extracting/loading, transforming, and modeling data
- Transform as much data as possible in the database directly
In the diagram below, data moves from left to right through the system. Airflow pulls the data each night, stores a record of the data in Amazon S3 file storage, and then loads it to a PostgreSQL database. Next, Airflow runs a dbt job, which itself orchestrates a series of data transformations on the raw data to create the structures needed for the Rally application. At the end of the run, we execute a series of dbt tests; if they pass, the run is promoted to production and becomes immediately accessible to the Rally application.
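The nightly flow described above can be sketched as plain control logic. This is a minimal stdlib sketch with stubbed-out steps, not EA's actual Airflow code; the function names are illustrative, and in the real system each step is an Airflow task calling out to S3, Postgres, or dbt.

```python
# Sketch of the nightly run: extract -> archive to S3 -> load to Postgres
# -> dbt run -> dbt test -> promote to production. All steps are stubs;
# names are hypothetical, not EA's real code.

log = []

def extract_from_vendors():    log.append("extract")
def archive_raw_files_to_s3(): log.append("archive")
def load_raw_to_postgres():    log.append("load")
def run_dbt_transformations(): log.append("dbt_run")

def run_dbt_tests() -> bool:
    log.append("dbt_test")
    return True  # tests pass in this sketch

def promote_to_production():   log.append("promote")

def nightly_run():
    extract_from_vendors()
    archive_raw_files_to_s3()
    load_raw_to_postgres()
    run_dbt_transformations()
    # Production is only updated if every dbt test passes, so a bad
    # vendor file can never silently reach the Rally application.
    if run_dbt_tests():
        promote_to_production()

nightly_run()
```

The key design point is the gate at the end: promotion is conditional on the test suite, which is what makes the nightly run safe to fully automate.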
Data extract, load, and orchestration
We used Apache Airflow for data orchestration, and we wrote Airflow operators (using custom Python code) to interact with custom external data sources.
We typically receive data in one of three ways:
- Other vendors' SFTPs, which transfer data to us directly
- API endpoints for extracts from operational systems at school districts or other education agencies
- EA’s Sharefile (FTPS) that allows staff at the partnering education agency to extract data from operational systems and transfer it to us
In most cases, we try to keep data transformation out of these steps, and we often convert data from CSVs into unstructured JSON files because of variability and unpredictability in the specs we receive (columns that are added or removed, columns that arrive out of order, and so on). We prefer to inhale whatever data we receive into the database, and then use the next step in our data flow process to figure out what to do with it.
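This "load whatever arrives" approach can be illustrated with a small sketch (the helper name and sample files are hypothetical): each CSV row becomes a JSON object keyed by that file's own headers, so added, dropped, or reordered columns never break the load step, and interpreting the columns is deferred to the transformation layer.

```python
import csv
import io
import json

def csv_to_json_records(csv_text: str) -> list[str]:
    """Turn each CSV row into a JSON object keyed by the file's own headers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

# Two files from the same vendor with different column sets and orders
# still load without error:
file_a = "student_id,score\n123,450\n"
file_b = "score,student_id,grade\n460,124,5\n"

records = csv_to_json_records(file_a) + csv_to_json_records(file_b)
```

In the real pipeline these JSON records land in raw Postgres tables, and dbt decides downstream which keys to trust and how to rectangularize them.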
We aimed for simple, separable processes (DAGs) for these operations to give us flexibility in the scheduling and separability of development.
We used dbt to construct and orchestrate the data transformations in our pipeline. dbt allows us to easily maintain complex, ordered processes and to break up our SQL into logical steps without the overhead of maintaining all of the tables and dependencies.
We separate the process into these steps:
- Clean up and validate data coming from different sources
- This includes handling missing, extraneous, or unordered columns, and converting the data to tabular or rectangular form
- We also automatically detect which student IDs we are receiving from different sources by attempting joins; we use districts’ roster data as a source of truth (since we have found we can’t reliably identify IDs by column labels, in part because they differ across districts and years)
- Transform data coming from different places to look roughly the same and align the data to open source data standards, then link identifiers to metadata we maintain
- Shape the data into a dimensional data warehouse, defining key dimensions and facts
- Produce predictions of student academic achievement in the future
- Create purpose-built views of the dimensional model for particular analytics and aggregations presented in the Rally application
A Snippet of a Visual Representation of the dbt Process:
When working with data from different sources in Rally, we don't only transform the data structure into a common standard; we also transform the data themselves to a common metric for use in analytics and reporting. We produced analytics to convert interim test events (from different assessment vendors) into a unified scale: units of the state test. Then, as new interim assessment data are inhaled into the system, we update predictions of how students will score on end-of-year state tests, compared to pre-pandemic trends. To accomplish this, we used R to conduct exploratory statistical research and calibrate the models, and then we created dbt models to apply those model parameters to incoming data, which generated predictions in our nightly run.
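The calibrate-offline, apply-nightly split can be illustrated with a deliberately simplified sketch. The actual model form isn't described here, so this stand-in uses a simple linear conversion with made-up parameters; the point is the division of labor, with parameters estimated offline (in R) and applied mechanically to each incoming score during the nightly run.

```python
# Illustrative only: model form and parameters are hypothetical.
# Parameters are calibrated offline per vendor and grade, then applied
# row-by-row (in the real system, by dbt models) to convert interim
# scores into state-test units.

PARAMS = {
    ("VendorA", 5): {"intercept": 120.0, "slope": 1.8},  # made-up values
}

def to_state_scale(vendor: str, grade: int, interim_score: float) -> float:
    p = PARAMS[(vendor, grade)]
    return p["intercept"] + p["slope"] * interim_score

print(to_state_scale("VendorA", 5, 200.0))  # 120 + 1.8 * 200 = 480.0
```

Because the apply step is pure arithmetic over stored parameters, it fits naturally into SQL, which is why the nightly scoring lives in dbt rather than in R.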
What did we learn, and where do we go from here?
Dealing with these many data standards is hard
It is labor-intensive to accommodate many different data standards from multiple data vendors, and the complexity grows with the number of unique sources. Schools in the same state use a variety of assessment, survey administration, and student information systems that all have different ways of storing and transferring data. Even when standardizing the data as early as possible in the pipeline, we can't necessarily trust the data specs or the labels on key variables like student IDs. We think the long-term solution to this problem is an interoperable data standard called Ed-Fi, which defines the data elements commonly stored in different education data providers' systems and the relationships among them. We are currently implementing an integration that enables us to source data from this standard.
dbt is great
We have found dbt to be an incredibly powerful tool that reduces much of the overhead involved in creating and maintaining many tables, views, and interdependencies. It allows us to break up our SQL transformation into logical, auditable steps, and it allows us to develop and iterate quickly. It also uses Jinja templating to write code that allows us to use concepts like conditional logic and looping to write more concise, modular SQL. We have made dbt a part of our standard toolset on similar projects and are continuing to delve deeper into its features and capabilities.
Postgres is good, but we may be hitting its limits for this purpose
Our most common data access patterns in the initial Rally system involved looking up the specific slice of data available to an individual teacher within a large dataset, which aligns well with Postgres' row-oriented architecture. As we expand Rally to allow for more complex aggregations and views at higher levels (like school- and district-level aggregates of individual- and classroom-level metrics), a column-oriented data warehouse may be faster for those workloads without unacceptably slowing row-wise lookups. In addition, we are beginning to cross the threshold of roughly 100-million-row tables, at which point Postgres may struggle to perform well on analytic summary queries. We plan to investigate Snowflake as an alternative database for this purpose.
Over time, in addition to adding features to the Rally Analytics Platform, we plan to continue to leverage dbt, replace some of the more arduous custom data integrations by leveraging Ed-Fi data interoperability, and investigate Snowflake for our database technology. We think that leveraging open data standards and scalable cloud technology can allow us to provide useful data to support more educators in more states at a low cost for education agencies of all sizes.
Interested in working in Data Engineering?
Click the button to explore current opportunities at Education Analytics.