How to Design Experiments for Data Collection

Tips on how to collect raw data

Benjamin Obi Tayo Ph.D.
May 31, 2022

Photo by Science in HD on Unsplash

Key Takeaways

  • Designing experiments for data collection is important when the data required for analysis isn’t available.
  • The key goal is to design a way to collect the best subset of data quickly and efficiently.

Data plays a central role in data science and machine learning. We often assume that the data needed for analysis or model building is readily available and free. Sometimes, however, the data does not exist yet, and collecting the full dataset is either impossible or would take too long. In that case, we need to design a way to collect the best subset of data we can, quickly and efficiently. The process of designing an experiment for collecting data is called design of experiments. Surveys and clinical trials are two examples.

We now discuss four main factors to keep in mind when designing and running experiments for data collection.

Time

We need to make sure the experiment can be designed and carried out within a reasonable period of time. For example, suppose the customer service department of an organization is experiencing exponential growth in the number of calls and long call center wait times. The organization can design surveys for employees and customers to complete. This has to be done promptly so that the data collected can be analyzed and used for data-driven decisions that improve the customer service experience. If the experiment is not designed, and the collected data not analyzed, in a timely manner, sales and profits could suffer.

Quantity of Data

In designing experiments, we need to make sure the data collected will be sufficient to answer the questions we are asking. The amount of data collected (the sample) has to be small compared to the total expected data (the population); otherwise, it would take too long to collect. At the same time, the sample must be representative of the whole population. For example, an experiment designed to study the efficacy of a medication should be demographically representative…
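
One common way to keep a small sample representative is proportional stratified sampling: split the population into groups and draw the same fraction from each. The sketch below is only an illustration and is not taken from the article. The function name `stratified_sample`, the `age_group` field, the 10% fraction, and the 1,000-patient population are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the records by the value of the stratification field.
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for members in strata.values():
        # Take at least one record per stratum so small groups are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 1,000 patients in three age groups (50% / 35% / 15%).
population = (
    [{"id": i, "age_group": "18-39"} for i in range(500)]
    + [{"id": i, "age_group": "40-64"} for i in range(500, 850)]
    + [{"id": i, "age_group": "65+"} for i in range(850, 1000)]
)

sample = stratified_sample(population, key="age_group", fraction=0.1)

# Count how many sampled records fall in each age group.
counts = defaultdict(int)
for rec in sample:
    counts[rec["age_group"]] += 1
print(dict(counts))
```

With these made-up numbers, the 100-record sample keeps the same 50/35/15 split as the population, which a simple random draw of 100 records would only match approximately.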

Benjamin Obi Tayo Ph.D.

Physicist, Data Science Educator, Writer. Interests: Data Science, Machine Learning, AI, Python & R, Personal Finance Analytics, Materials Sciences, Biophysics