Student data-science competition adds teams outside Pittsburgh region, including from California Native American youth center and New Jersey schools

Amara Sanchez was already a data scientist. She just didn’t know it.

“One day I got back [to the Pala Youth Center] from school and April Cantu asked me if I wanted to work with Kim on a data science project,” says the middle-school student.

Sanchez had followed her usual routine of coming straight to the Youth Center from school. A member of the Pala Band of Mission Indians in southern California, Sanchez was always up for new, interesting projects, which she pursued through the nearby Pala Learning Center. But she didn’t know what “data science” was. Relatively few adults do.

Cantu sent her to “Kim” — Kimberly Mann Bruch, of the San Diego Supercomputer Center (SDSC), and Doretta Musick, director of the Learning Center. They shared with her a book called Everyday Data Science.

Sanchez looked through the book.

“Oh, I already do that,” she told them. Then she showed them her hand.

The connection between the Pittsburgh DataJam and Sanchez and her classmate Maniya Zwicker is a story of persistence, coincidence, networking and, above all, the magic that happens when a great idea meets minds that are prepared to make the most of it. It parallels the growth of DataJam from a purely Pittsburgh phenomenon to one that now spans the North American continent, from the Atlantic to the Pacific.

Mostly, it’s about the discovery that young people can not only understand data science; they can excel at it.

DataJam began in 2013, when the Pittsburgh DataWorks — a collaboration of companies and universities in Pittsburgh with an interest in data science — decided that a great way to support data science in the region would be to nurture the next generation of data scientists. Saman Haqqi of IBM, Raja Sooriamurthi at Carnegie Mellon University and Oracle data scientist Brian Macdonald joined with the Pittsburgh Supercomputing Center’s (PSC’s) Cheryl Begandy to design an extracurricular program to train high school teachers to coach student teams through an informal data-science competition.

Starting in 2013-14, DataJam became an intensive, in-person yearly effort to help the teachers acquire the skills to be comfortable both doing and teaching data science. The teachers in turn helped the kids pick and investigate a topic of their own choice, ensuring their interest. Unlike most competitions of this sort, the DataJam learning activity would last the whole school year, teaching data-analysis skills over time. The next school year, Pitt neuroscientist Judy Cameron, who ran science outreach for the university, gave students there who wanted to do outreach the opportunity to become mentors for the DataJam contestants. This innovation, also new to data competitions, proved to be both enduring and popular with the kids and their mentors.

“Through 2019, we focused on schools in the Pittsburgh area,” says Cameron, now director of DataWorks. “We were expanding into the suburbs and areas around Allegheny County. But we saw ourselves as a local activity.”

Then COVID-19 hit. Everything shut down. The kids were shut in their homes in quarantine. One-on-one mentoring was out of the question, let alone an end-of-term, in-person award celebration.

No one knew what was going to become of DataJam.

On Sanchez’s hand, written in ink, was a series of symbols.

“In class I was doing this thing of making hearts, stars and smiley faces on my hand every time I said ‘hi’ to someone and gave them a hug,” she explains. “My hand was full of these.”

Having introduced herself to data collection, Sanchez was up for taking on a more ambitious project in data analysis. She and Zwicker worked with Cantu, Musick and Mann Bruch on a DataJam project focusing on water quality in a small section of a local river on the Pala Reservation.

“It was just like any other project we’ve had in the past,” says Musick. “It takes a while to get it off the ground and get it started. It depends on if we strike up the interest, whereas Amara and Maniya were totally [invested] in this one.”

The San Luis Rey River runs through the Pala Reservation, serving as a major source of water for the community. The students decided they wanted to learn more about the quality of their river’s water and better understand how it compared to other areas throughout San Diego County.

Of course, DataJam’s problem was writ large across the entire U.S. educational system. What could be done to educate students when schools were closed down?

As did many school systems, DataJam turned to remote learning.

The idea posed challenges. Could they ensure that students in less-affluent communities had access? Without in-person tutoring, could they train either teachers or students? One by one, they started overcoming the issues. And, when DataJam became an online-only competition, a surprising thing happened.

DataJam grew.

“From the end of 2020 into the spring of 2021, we started to get inquiries from all over the place,” Cameron says. “We said ‘yes’ to everybody; we figured if we were going to have to mentor by videoconferencing … what difference did it make where the school was?”

A number of human connections, many of them serendipitous, helped bring in teams from across the U.S.

Cheryl Begandy introduced Cameron to investigators in the NSF-funded Northeast Big Data Innovation Hub, with which she’d been working as part of her outreach efforts for the PSC, thinking that the Hub might be interested in supporting the DataJam.

One of these investigators, Catherine Cramer of the Woods Hole Institute, worked with Cameron and two other scientists to write a pilot project proposal in the summer of 2021 to expand the DataJam to other states in the northeast.

This got the interest of Rich Chomko, a teacher at the Passaic Academy in New Jersey. The school stood up several teams for the 2022 DataJam.

“I was working with one of the [Passaic] teams,” says Jackson Filosa, a DataJam mentor about to be a senior at Pitt. “They were completely new to the world of Big Data … they were a team of freshmen without a ton of statistics experience.”

The Passaic team nonetheless dove into the work. They studied the relationship between median income and COVID-19 cases in N.J. counties. They found that, for the most part, higher-income areas had fewer COVID-19 cases, and formulated hypotheses as to why that was so. Differences in working conditions, quality of health care and social-class-associated adherence to precautions all seemed possible explanations. “The highest COVID cases were in Passaic County, where they lived,” Filosa says. The students would like to continue their work and dig deeper into the question. “It showed them that Big Data is not just these random numbers. It’s relevant; it’s all around; it’s everyone.”

The connection with the Pala Band of Mission Indians was another unexpected turn. And it was more indirect than you might think, if you knew how many connections there are between Pittsburgh and southern California.

For one thing, SDSC, which connected the Pala team to the DataJam, has long worked with PSC. Through the soon-to-end, 10-year Extreme Science and Engineering Discovery Environment (XSEDE) program, which coordinated the National Science Foundation-funded supercomputing centers across the U.S., SDSC and PSC staff have worked together on projects as varied as allocating computing time to researchers, supporting and advising them on how to approach computing projects, teaching tutorials on computing tools and communicating XSEDE’s and the centers’ contributions to science to the computing community and the general public.

The contact that helped create the Pala DataJam team had nothing to do with this relationship, though. Instead, it came through Cramer, who also worked part-time at SDSC and brought the idea of DataJam participation to Mann Bruch, who in turn shared it with the Pala Learning Center leadership.

In their data collection, the Pala DataJam team focused on pH. They measured the acidity of water from the San Luis Rey on Reservation land.

“They were granted access to a pretty much closed-off part of the river,” says their Pitt mentor, Louise Hicks, who was a senior while working with them last year. “They took long walks to get to a secluded portion of the river. It was an opportunity to interact physically with the environment [as well as do a data project].”

The Pala students discovered something surprising. Their data showed that the river on the Reservation land was slightly more acidic than what the state had been measuring elsewhere. The difference wasn’t enough to raise an alarm, but it was interesting and worth further study. Partly because of their presentation to the DataJam judges on their findings, they would like to take more measurements over time and find out whether the difference is persistent. They will pursue this, among other projects, via an internship at SDSC this summer.

The team presented their final DataJam project, “Examining pH Data in the San Luis Rey River within the Pala Native American Reservation and Beyond!” in both English and their Native language of Cupeño. This work impressed the judges, gaining them recognition as Best New Team at the virtual DataJam Finale in April 2022. In July, the project gained another honor, with Hicks being invited to present her work with the Pala team at the PEARC22 national supercomputing conference in Boston.

Photo: Louise Hicks, the Pala team’s Pitt mentor, presenting the work at the PEARC22 supercomputing conference in Boston.

For DataJam, 2022 marked new possibilities and effectively shattered the project’s image as “only” a local endeavor. This year 21 teams competed, ranging geographically from the Pala team in California to three teams from Passaic Academy in New Jersey. The first-place award went to “Analyzing the Legacy of Redlining in Pittsburgh,” by the Norwin High School, Pa., team; second place went to “Forecasting COVID-19 Infections & Vaccinations in PA,” by the Central Dauphin High School team in Harrisburg, Pa., in the middle of the state; and third place went to “Availability of Inclusive Parks & ADA Accessibility within the City of Pittsburgh,” by the Hampton High School team.

The 2022 DataJam project poster-reports and a YouTube video of the DataJam Finale are available online.

DataWorks is now focused on the right way to take DataJam national.

In addition to forming ties with the Pala community via SDSC and with the Passaic Academy, says Cameron, “we have also made new contacts [with] the director of the Northeast Big Data Hub and are putting in an NSF grant proposal to be much more inclusive, providing Big Data science support to areas of the country including the Northeast and the Midwest.” Other grant applications include one to work with Duquesne University, the University of Pittsburgh at Bradford and the University of California San Diego (SDSC’s parent institution) to tailor the program to be more accessible for urban, low-resource, rural, and Native American students. SDSC, for example, has connections with four Native communities in addition to the Pala that may be interested in forming teams in the future.

“It’s great to have University of Pittsburgh students mentoring these teams all over the place, but that’s not really sustainable,” Cameron adds. “If we really want to sustain [at the national level] we have to grow the pool of mentors.”

In coming years, mentors for N.J. students will be coming from their home state. Chomko, the teacher at Passaic Academy, knew Salvatore Ferraro, a professor at Caldwell University in N.J. who runs a program through which Caldwell reaches out to STEM teachers in the state. Chomko told Ferraro about DataJam. After speaking with Cameron, Ferraro decided to teach Caldwell students to be mentors as well.

In addition to Caldwell, Duquesne University in Pittsburgh, Pitt Bradford and UCSD are all working on adding DataJam mentor training to the courses they offer. Grant proposals have been written to expand the program using a collaborative blended synchronous learning environment (BSLE) strategy, in which all the universities connect by videoconference so that a single course can reach students at several universities.

But the DataWorks leadership understands that more will be necessary.

“Best-case scenario, we have DataJam ‘nodes,’” Cameron says. “Pittsburgh/western Pennsylvania was the first, and now we’re forming nodes in San Diego and New Jersey.” The Midwest is another promising possibility for a node.

Each node, Cameron envisions, could offer mentors to guide local teams, with local, in-person DataJam Finales. Then they’d submit to a national DataJam, run by Pittsburgh DataWorks, for judging and a national Finale.

Good communication between the nodes is key.

“We already have a really good rubric, which we revised for video use during COVID,” she says. “That’s a start, but we’ll be putting more effort into training judges so we get consistency, which we think will be important if you’re going to have a national program.”

Proposals from student teams for the 2022-23 DataJam are due by Dec. 2, 2022, with the Finale to follow on April 27, 2023. For more information and to apply, see the DataJam website.