At Education Analytics, we seek to deploy and scale the most advanced evidence-based knowledge, generated by practitioners and researchers alike, for low-cost, high-impact use across the country. Our mission requires that educational research be both possible (as one needs data to do empirical research) and applicable (so the results of research should not simply be published in academic journals). Over the last ten years, these two constraints have forced us to evolve as an organization to integrate the full “stack” of the education data, technology, and research fields. In other words, we have had to develop technology teams (such as data engineering, analytics engineering, cloud engineering, software engineering, and enterprise data management) alongside research teams (like econometrics, psychometrics, statistics, data analytics, and high-performance computing).
We believe that deep integration of education research and technology has the power to create a new breed of modern, data-centric education organizations capable of achieving breakthrough student outcomes. This integration requires applying research-grade thinking about causal inference, theory of action, use cases, and empirical evidence to how we build and scale technology and data infrastructure; it also means that we desperately need to accelerate the field of education research to catch up to the advances of modern technology and the modern data stack.
In this blog, we will focus on one example of this necessary acceleration by arguing that the concept of a dataset in education ought to be replaced by the concept of a data stream. We'll start by tackling this issue from the perspective of an educational researcher, and we will make a case for why this shift is not only necessary but also an opportunity. Later, we'll look at this issue from the perspective of a technologist, and we will lay out a vision for the potential future of a fully interoperable data system that enables not only researchers, but also practitioners, policymakers, and other stakeholders to tap into data streams. By bringing both fields and audiences together in working towards this vision, we hope to help education agencies across the country move away from being data rich but information poor towards a future where "big data" can yield big insights for educators.
From the perspective of an educational researcher
Educational researchers have no shortage of questions they wish to answer with data. The limiting factor in many cases is the availability and accuracy of the data necessary to answer a given question.
One concrete example of the need for educational researchers to tap into data streams is the following research topic we might want to examine. According to a recent nationally representative survey of teachers by Educators For Excellence, there is as great a need as ever for more teachers who are Black, Indigenous, and People of Color (BIPOC); at the same time, BIPOC teachers appear to be the most likely to want to leave the profession. If we were to try to understand national trends related to BIPOC teacher retention, how would we go about doing so?
There are several research questions that we may want to pose, for example:
- How much are BIPOC teachers paid compared to the general population of teachers?
- Which classrooms are BIPOC teachers teaching in, and what are the characteristics of those classrooms compared to others?
- What professional supports do BIPOC teachers receive (or are even available) compared to non-BIPOC teachers?
- Are the curriculum and teaching materials that are available to BIPOC teachers suitable for their success?
We are sure there are many other questions that relate to the topic of BIPOC teacher hiring, retention, and turnover—and these are just scratching the surface. But even for this subset of questions we might ask, there is no dataset that has the information needed to answer these questions. Instead, there are a host of data systems, including things like:
- A Human Resources Information System (HRIS) or Enterprise Resource Planning (ERP) system, which provides teacher demographic data and salary information
- A Student Information System (SIS), which provides classroom assignments and student demographic data
- A survey platform (either a proprietary platform or a freely available tool), which provides school culture and climate survey data
- A professional development system (if available), which provides documentation about available and provided professional supports
- A learning object repository (LOR; if available), which identifies what curricular materials are available and in use
Building a dataset that pulls all of these data together would require a district staff member to go to at least five different systems and then combine the data from each of them in a coherent way.
Within one district, this would be a fairly complex and expensive task, because it would require, for each individual source system, that a person:
- Understand the data request;
- Map out how that request relates to the data schema, API, or flat file extract of each source system;
- Create the data query, API pull, or flat file download process to obtain the data; and
- Securely transfer those data to a central repository.
And then, once all of that is done, that analyst (or more likely, another analyst) would have to then:
- Collect all of the data extracts from each system (potentially many of them per system);
- Review the context and contents of each data set;
- Combine the datasets using (if lucky) identifiers (IDs) that are unique to each student or teacher yet standard across the data systems—or, more likely, figure out how to merge datasets using IDs that do not match across the systems;
- Verify that this new combined dataset did not accidentally create personally identifiable information (PII) through data combination; and
- Provide the de-identified data to the researcher or analyst answering the research question(s).
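The merge step above is where much of the effort concentrates. A minimal sketch, using pandas and invented extracts (the system names, ID schemes, and values are hypothetical), shows both the easy case, where a shared key exists, and the flag an analyst would use to find the rows that fail to match:

```python
import pandas as pd

# Hypothetical extracts: the HRIS keys teachers by a local payroll ID,
# while the SIS uses a state-assigned staff ID.
hris = pd.DataFrame({
    "payroll_id": ["P-001", "P-002", "P-003"],
    "state_staff_id": ["WI1001", "WI1002", "WI1003"],
    "salary": [52000, 61000, 58000],
})
sis = pd.DataFrame({
    "staff_id": ["WI1001", "WI1003"],
    "classroom": ["Room 12", "Room 9"],
})

# When a shared key exists (here, the state staff ID), the merge is direct.
combined = hris.merge(
    sis, left_on="state_staff_id", right_on="staff_id", how="left",
    indicator=True,  # adds a _merge column flagging rows that failed to match
)

# Rows marked "left_only" matched nothing in the SIS extract and need manual
# review -- the realistic, expensive case when IDs do not line up across systems.
unmatched = combined[combined["_merge"] == "left_only"]
```

In practice the IDs rarely line up this cleanly, which is exactly why the step consumes so many skilled hours.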
During any of these processes, the analyst(s) will likely need to consult a subject matter expert on both the data sources and the research question to make sure the data being pulled and combined are accurate and error-free. Without a large investment of highly skilled staff hours (which districts typically cannot spare, given other priorities), the probability of success is very low. Replicability is generally non-existent, because the source systems are ephemeral in structure and, most often, are not governed by the district itself (districts use vendor products they have no expectation of controlling).
This is just at the district level. This same process repeats itself as data moves from districts to the state, and from the state to the federal agencies; and with each additional step "up," the data analyst at each level loses the ability to talk to the local subject matter experts to get context questions answered—questions like:
- What are the district's business rules for assigning a race/ethnicity code to a teacher?
- What was the actual content taught in the course with this particular course code?
- What is the difference between the codes “Absent - 47” and “Absent - 43?”
Thus, as a dataset gets created and then moved and then aggregated and then moved and then aggregated again, context and granularity must decline to the point of not being able to answer any but the most general questions, and at great cost.
The concept of a "dataset"
Historically, applied research in all fields has relied heavily on static data collection systems that involve a few standard steps:
1. Generate a hypothesis
2. Design a model or experiment that could use observable events to then test such a hypothesis
3. Collect data based on (2)
4. Analyze the data
5. Determine if the theory and data validate or reject the hypothesis
6. Publish the result (usually without any data)
As described above, the data collection process is extremely labor intensive, and it has typically served a single purpose, which is to enable step 3 above. Most typically, the result of step 3 is stored in what some call a dataset. In the simplest terms, this is one spreadsheet of data collected (possibly by hand), and in more complex terms, this can be multiple data sources wrestled into one dataset that statistical programs can use. In this world, the notion of data standards is beneficial, because then it is possible (even if not necessarily easy or advisable) to combine datasets across contexts. There have been many efforts to standardize the definitions and the structures of data in education throughout the last few decades (e.g., Common Education Data Standards, or CEDS; School Courses for the Exchange of Data, or SCED, etc.), all seemingly with this legacy definition of a dataset in mind.
But this model of “datasets for research” is outdated and detrimental to the conversation. It has contributed to an education system that is data rich but information poor. The reality of the modern education system is that data are being collected in real-time across hundreds of live data systems within a single school system, as a result of thousands of vendors providing tools used by students, teachers, and administrators. We can imagine that for many purposes, the cost of data collection is “free” since it is already being captured somewhere. What becomes difficult and expensive is accessing and structuring these data. They do not sit in datasets; they sit in operational databases that are structured to allow machines to talk to each other, and that are governed under rules of ownership and transfer rights that most often are unknown and unarticulated. These systems are live and dynamic; the notion of a static dataset from these systems is unlikely to retain much meaning.
Instead, we propose that the modern version of a dataset is a data stream and that the skillsets needed to access these data streams are new and different. In place of data standards, we now have data system interoperability standards. In place of data collection, we now have data access and governance. The ability to access and manage these kinds of structures has traditionally sat with technologists, but we believe that researchers and policymakers have the capability and the need to develop new skills to access and manage these new concepts.
From the perspective of a technologist
Data infrastructure in education is in the process of being dramatically modernized and reformed. Within the last decade, and especially in the last three years, we have seen innovations in interoperability and cloud technology that enable decentralized, locally governed, systems-level interoperability that avoids the naïve solution of “just putting the data all in one spot.” In general, these technologies can be grouped into a few categories. Although these categories likely appear overwhelming to a typical education researcher or policymaker, these are a standard part (and a non-exhaustive list) of the modern technology stack not only available to, but necessary for, modern data system architecture:
- Application Programming Interface (API) technologies applied to the data transport layers being adopted by education technology products;
- Interoperability frameworks like Ed-Fi, SIF, and 1EdTech transport layers, and CIID Generate (based on CEDS) standardized data transformation layers;
- Cloud-scale databases, such as managed row-level databases (traditional SQL Server or PostgreSQL in the cloud), columnar store databases (Amazon Redshift, Google BigQuery, Snowflake, etc.), and data lake technologies built on Apache Spark and similar engines;
- Open-source modular software deployment and automation technologies, such as Docker, Kubernetes, Airflow, dbt, and cloud-based CI/CD pipeline technology; and
- Open-source statistical software like R and Python that interact natively with these technologies through API calls.
At the moment, this ecosystem is nascent and fragmented. There are competing interoperability standards, data standards, API specifications, research patterns, and governance models. The current state of this infrastructure for a typical state could be depicted like this:
National Educational Data Infrastructure: Current State
The current state features individual source systems within a district that can be connected, but are not yet fully interoperable. Data like curricular data and student data tend to be siloed and isolated from other data sources. Student data tend to live in their own source systems, with some technologies like OneRoster available to connect student information system (SIS) data to a learning management system (LMS). District operational data are used mostly (or only) for complying with state reporting requirements, and in turn, state-level data are aggregated for use in mandated federal reporting (i.e., EDFacts).
In a few states that are on the cutting edge, an API might be used to pull some of the different data sources into a district-level operational data store (ODS) enabled by Ed-Fi technology. In turn, the data from district-level ODSs can be transported and combined into a state-level Ed-Fi ODS. At the federal level, there are various storage mechanisms, such as a state-level CEDS integrated data store (IDS), that enable federal reporting. Together, these data systems take a major step towards interoperability, but as of yet fall short of a fully interoperable data ecosystem.
Federal investment in statewide longitudinal data systems (SLDS) has laid the groundwork for an interoperable data ecosystem by incentivizing a community of state education agencies to collaborate on strategies and on the development of federal data standards, such as CEDS. However, further leadership and investment are needed to realize the vision of full interoperability. When SLDS was conceived, many (or most) of the technologies listed above did not exist. As a result, the SLDS vision did not truly require an interoperability framework; rather, it focused on how to unify data within a state system, without treating ground-level data systems as a key source or pursuing a unified strategy across states. This has led to many different strategies in each state, which makes the problem we are facing more challenging and more expensive from a federal perspective. Now, we have the opportunity—and in our view, the need—to leverage these technologies to build towards a nationwide strategy of data interoperability.
What the future could look like
Let's return to the example research question above to examine what a future state could look like—one where investments in data streams, data interoperability, and data governance have been made across the country. For this example, we will use specific technologies just to limit ambiguity; however, we do not expect that one particular standard or technology will end up being the gold standard in the future—and as a result, there is an imperative need for leadership at all levels of the educational ecosystem to consolidate around one or a few patterns of interoperability technology.
In the example above, we sought to answer questions about BIPOC teacher turnover and retention. We needed data on staff and student assignments, demographic data about teachers and students, compensation data for teachers, school culture and climate data, professional development resources and usage data, and data about curricular materials availability and usage. This sounds like a large amount of data; however, we are sure that this only scratches the surface of the needs to help design evidence-based policies to support BIPOC teachers and improve outcomes for students.
National Educational Data Infrastructure: Future State
In the future state, there is a live, fully integrated Ed-Fi operational data store within each local education agency (LEA) that is capturing transactional data from all of the LEA’s operational systems, including the SIS, ERP, HRIS, transportation system, LMS, nutrition system, and so on. To enable this, almost all vendors of the various education technology systems have integrated with standards-based interoperability tools. All of these tools can “talk” to each other through the Ed-Fi ODS (technology managed by the Ed-Fi Alliance)—at the discretion of the LEA’s governance structure. This linking enables an entirely new universe of data access possibilities, such as:
- The LMS and the SIS can easily share up-to-date rosters of teachers and students.
- The district can have a learning objects repository (LOR) that is connected to their LMS via an interoperability standard called LTI (Learning Tools Interoperability, technology managed by 1EdTech), so they can procure or develop best-in-class content and make it available within their best-in-class LMS without content or platform lock-in.
- Because the survey platform is connected to Ed-Fi as well, we can access survey results about school culture and climate, linked to data points such as content availability, teacher wages, etc. Assessment data of all kinds (state summative, standardized interims, locally created formative assessments) can be integrated seamlessly into the system.
- Since the LMS, HRIS, and SIS are connected via Ed-Fi (and the LOR is connected to the LMS by LTI), we can know which teachers are using what content at what rate with which students—and then, we can start to understand if BIPOC teachers have different supports or content availability that might lead to teacher turnover.
Because all of these systems have been connected via interoperable technologies, they can also be connected to larger groups of districts like counties, education service centers, cross-state consortia facilitated by RELs, states, and then the federal research centers (using protocols such as CEDS with a new CEDS API end point). And because all of these technologies have built-in security protocols, each data owner can decide what passes on to another agency or level and how often.
This future state is currently being built; the future state diagram above illustrates what we are trying to build towards. We see examples of parts of this vision across the country in many different contexts. Texas and South Carolina are implementing interoperable technology for every district in their state. Wisconsin, Arizona, Nebraska, Delaware, and Georgia (among many other states) have been implementing interoperable technologies at the state level. There are active proposals in the CEDS community to build interoperable technologies into the CEDS data standards. System-to-system interoperability between LMS and LOR systems has been built in South Carolina. Leading local education agencies such as Boston Public Schools, Denver Public Schools, and San Francisco Unified (among many others) have started implementing interoperability technologies. The list goes on and on.
Bringing the pieces together: How data streams can empower and accelerate education research
So how can we leverage this vision of modern data infrastructure to radically transform and empower the field of education?
Data collection to enable education research would become many orders of magnitude simpler than the very manual process we described at the beginning of this post. An education researcher could simply:
- Obtain a data release agreement (DRA) with the appropriate education agency or agencies;
- Receive a cryptographic key and a paired secret (referred to as a key and secret, or a key/secret pair) associated with a privacy claim on the data stream;
- Use an API connection along with the key/secret pair to access all of the relevant data, which has already been integrated;
- Transform the data from the native machine structure to a structure for analysis; and
- Perform the analysis.
Importantly, this process is then infinitely replicable, because the same code can be reused in other districts or states with the same data streams by anyone granted a DRA and security keys in the same way.
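The access pattern described above can be sketched in a few lines of Python. This is a hypothetical illustration, not a definitive client: the host, endpoint paths, and resource names are assumptions modeled loosely on an Ed-Fi-style REST API secured with an OAuth2 client-credentials flow.

```python
import requests

# Hypothetical API host; a real deployment would publish its own base URL.
BASE_URL = "https://api.example.edu/v5.3/api"

def resource_url(resource: str, limit: int = 500, offset: int = 0) -> str:
    """Build the collection URL for one resource on the data stream."""
    return f"{BASE_URL}/data/v3/ed-fi/{resource}?limit={limit}&offset={offset}"

def get_access_token(client_key: str, client_secret: str) -> str:
    """Exchange the key/secret pair granted under the DRA for a bearer token."""
    resp = requests.post(
        f"{BASE_URL}/oauth/token",
        auth=(client_key, client_secret),           # HTTP basic auth with the key/secret
        data={"grant_type": "client_credentials"},  # OAuth2 client-credentials flow
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def fetch_resource(token: str, resource: str, page_size: int = 500) -> list:
    """Page through one resource (e.g. a staff collection) until exhausted."""
    records, offset = [], 0
    while True:
        resp = requests.get(
            resource_url(resource, limit=page_size, offset=offset),
            headers={"Authorization": f"Bearer {token}"},
        )
        resp.raise_for_status()
        page = resp.json()
        if not page:
            return records
        records.extend(page)
        offset += page_size
```

The records returned would then feed the transform-and-analyze steps, and the same script could be pointed at another agency's endpoint with a different key/secret pair, which is what makes the process replicable.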
With this great interconnection of education data will come an ever-increasing need for new tools that help us access these systems. Because of this explosion of data capabilities, the skillsets needed to leverage these data will also expand. Most likely, researchers will need to get much better with database technologies (SQL) and API access technologies (JSON manipulation using R or Python), or at a minimum, will need to collaborate with (and perhaps employ) data scientists and data engineers with these skillsets. Because data volumes will increase exponentially over time, high-performance computing skills will be required. In essence, the empirical research lab in education will look more and more like a software development group with specialized technical roles.
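As one small example of the JSON manipulation mentioned above, here is a sketch of flattening a nested API payload into an analysis-ready table with pandas; the payload shape and field names are invented for illustration.

```python
import pandas as pd

# Invented records, shaped like what a roster API might return:
# nested structures that statistical tools cannot model on directly.
payload = [
    {
        "staffUniqueId": "WI1001",
        "name": {"first": "Ada", "last": "Rivera"},
        "assignments": [{"school": "Lincoln Middle", "role": "Teacher"}],
    },
    {
        "staffUniqueId": "WI1002",
        "name": {"first": "Grace", "last": "Okafor"},
        "assignments": [{"school": "Badger High", "role": "Teacher"}],
    },
]

# json_normalize expands the nested assignments into rows and carries the
# listed metadata along, yielding one flat row per assignment with dotted
# column names (e.g. name.first) for nested fields.
staff = pd.json_normalize(
    payload,
    record_path="assignments",
    meta=["staffUniqueId", ["name", "first"], ["name", "last"]],
)
```

The result is a flat table (columns like school, role, staffUniqueId, name.first) that drops straight into the statistical workflows researchers already know.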
On the positive side, as we standardize the data transport layer, each new tool can be scaled rapidly and cost effectively. For instance, at Education Analytics, we are building an open-source analytics and research data system that sits on top of this data streaming ecosystem. This system gives researchers inside a school system access to live, highly structured, pre-integrated, identified data, and gives approved external researchers access to a live, connected set of data systems that pre-anonymize data along different dimensions (based on research use cases). These tools are very complex and expensive to build, but since they can replicate to any standard data stream once connected, the cost of further research data collection and access is close to zero. Instead of spending costly time and effort to transform data from a machine structure to an analytical structure, a researcher could start work immediately in a system they are already familiar with.
It is our belief that compared to our research colleagues in traditionally data-rich fields—such as physics, computer science, or finance—the education research field is woefully underskilled and not well structured for this imminent data revolution. EA is actively working towards preparing ourselves and the broader education community for this impending and exciting revolution. By bringing together state-of-the-art data infrastructure, technology, and technical expertise with best-in-class research expertise, we are closing the gap between experts who know how to structure data and experts who know how to make meaning from it. In doing so, we aim to contribute to an educational system that is data rich and information rich, which ultimately is in service of improved outcomes for students.
Be sure to subscribe to our updates (at the bottom of this page) to receive future blogs and resources laying out our vision for how interoperability will accelerate education research.