The Future Of Reproducible Research - Powered by Kubeflow

created : 2022-08-05T15:54:31+00:00
modified : 2022-08-05T15:55:39+00:00

kubecon devops kubeflow

Motivation

Articles About Why Reproducible Research is Important

The Replication Crisis: What Is It?

  • Wikipedia Article Paraphrase:
    • Many scientific studies are difficult or impossible to reproduce.
    • Most prevalent in psychology and medicine, but also serious in other natural and social sciences.
    • Term coined in eary 2010s, gave rise to meta-science discipline.

The Replication Crisis : Causes

  • Wikipiedia Article Paraphrase:
    • C ommodification of Science
    • Publish or Perish Culture in Academia
    • Fraud and otherwise “Questionable” Research Practices
    • Statistical Issues
    • Base Rate Hypotheses Accuracy

The Replication Crisis: Consequences

  • Wikipedia Article Paraphrase:
    • Political repercussions
    • Public awareness and perceptions
    • Response in Academia

The Replication Crisis: Potential Remedies

  • Wikipedia Article Paraphase:
    • Reforms in publishing
    • Statistical Reform
    • Replication Efforts
    • Changes to scientific approach

My Experience Trying to Reproduce Research

  • Grad Student/ Academic Papers
  • Working on someone else’s old junk code
  • Working on my own old junk code

What we did

Tower of Babel: Making Apache Spark, K8s, and Kubeflow Play Nice

10 Minute Quick Overview of KF4COVID

  • Early days of pandemic - everyone was scared, no solutions were out of bounds.
  • Various ERs turned to CT scans and ultrasounds to detect ‘ground glass occlusions’ a hallmark of covid (technique has been used in ERs in the past for rapid pneumonia detection).
  • CT Scans deliver high dose of radition
  • Low Dose CT Scans deliver, low dose of radiation, but produce ‘noisy’ images.
  • We used K8s, Apache Spark, Apache Mahout & Kubeflow to denoise CT Scans

Rapid Testing Needed -Desperately

  • Mental Time Machine - to March 2020.
    • No one understands Coronavirus - but hospitals are being overrune and people are dying.
    • Slow Tests
    • Rapid test “issues”
    • No answer was ‘out of bounds’

The Pipeline: Overview

  • S3 Buckets of images (can be easily swapped out to other image repo)
  • PyDiCOM to turn CT scan into numerical matrix, write matrix to disk
  • Load matrix in apache spark (~500 MB each) then wrap RDD into Mahout DRM
  • DS-SVD on Mahout DRM (why couldn’t do this in Numpy?)
  • DS-SVD results in two matrices- one of basis vectors, one of weights per image - to “de noise” you only use first X% of basis vectors. These get output and can be easily rastered using a laptop.

Call to action / How you can do the same

  • Assume they won’t be using your laptop.

Use Kubeflow

Assuming someone will want to replicate your work, and that they won’t have access to your machine, Kubeflow provides a nice framework for reproducing results.

What is Kubeflow and Why Will it Help?

  • Talk about Kubeflow pipelines- a seris of docker containers that execute steps then hand off data to next step

Conclusion / Q&A

  • Buy our book

개인 생각

  • 전체적으로 유쾌하면서도 전달하고자하는바가 명확한 강연이였다.
  • 요약하자면, 과학자들은 실험을 하고 재현하려고 노력하는데 개발자들은 그렇지 않은 경우가 많고, 자신의 컴퓨터에서만 동작하면 끝인줄 안다. 재현 가능함 여부는 굉장히 중요하며 이를 위해서 kubeflow 를 사용했다. + 추가적을 자기 책 사주면 좋겠다.
  • kubeflow 를 한번도 안써봐서 좀 찾아봐야겠다. 이런 목표의 프로젝트인지 잘 몰랐다.
  • 참고자료
    • Kubeflow 의 목적은 machine learning workflow 를 kubernetes에 배포하는 것을 단순화 시키는 것
      • 더 빠르고 일관된 배포에 초점을 맞추어 이 강연은 진행되었다.
  • 추가적인 궁금점
    • kubernetes 에서 gpu device 는 어떻게 지원되고, 어떻게 세팅되는가
      • 전부를 다 볼필요는 없고 대충 세팅이 가능하다만 국내 블로그 있나 찾아봤다. 있다. 그러면 뭐 잘 되나보지 하고 일단 덮어뒀다.
    • kubernetes 에서 gpu 는 어떻게 scheduling 될까?
      • 확실치는 않은데 개수 단위로 요청하는 것 같다. 출처
      • 흐음… GPU 머신이 있으면 좋겠는데 조금 아쉽다. 실제로 테스트 해보고 싶은데