# Machine Learning

It is not enough that your data are sampled from a population. Statistical tests are also based on the assumption that each subject (or each experimental unit) was sampled independently of the rest. Data are independent when any random factor that causes a value to be too high or too low affects only that one value. If a random factor (one that you didn’t account for in the analysis of the data) can affect more than one value, but not all of the values, then the data are not independent.

*The concept of independence can be difficult to grasp. Consider the following three situations.*

- You are measuring blood pressure in animals. You have five animals in each group, and measure the blood pressure three times in each animal. You do not have 15 independent measurements. If one animal has higher blood pressure than the rest, all three measurements in that animal are likely to be high. You should average the three measurements in each animal. Now you have five mean values that are independent of each other.
- You have done a biochemical experiment three times, each time in triplicate. You do not have nine independent values, as an error in preparing the reagents for one experiment could affect all three triplicates. If you average the triplicates, you do have three independent mean values.
- You are doing a clinical study and recruit 10 patients from an inner-city hospital and 10 more patients from a suburban clinic. You have not independently sampled 20 subjects from one population. The data from the 10 inner-city patients may be more similar to each other than to the data from the suburban patients. You have sampled from two populations and need to account for that in your analysis.
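The blood pressure situation above can be sketched in a few lines of Python. The readings below are made-up illustrative values, not real data, and the names `readings` and `animal_means` are hypothetical. The point is only that 15 raw measurements collapse into five independent per-animal means.

```python
from statistics import mean

# Hypothetical readings (mmHg): three measurements for each of five animals.
readings = {
    "animal_1": [118, 121, 119],
    "animal_2": [131, 129, 133],
    "animal_3": [110, 112, 111],
    "animal_4": [125, 124, 127],
    "animal_5": [140, 138, 141],
}

# Average the repeated measurements within each animal.
animal_means = {animal: mean(values) for animal, values in readings.items()}

# The number of independent values is the number of animals, not measurements.
n_independent = len(animal_means)

print(animal_means)
print("independent values:", n_independent)
```

Any later test would then be run on the five means rather than on the 15 raw readings.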

**How you can use statistics to extrapolate from sample to population**

Statisticians have devised three basic approaches to make conclusions about populations from samples of data:

The first method is to assume that parameter values for populations follow a special distribution, known as the Gaussian (bell shaped) distribution. Once you assume that a population is distributed in that manner, statistical tests let you make inferences about the mean (and other properties) of the population. Most commonly used statistical tests assume that the population is Gaussian. These tests are sometimes called parametric tests.

The second method is to rank all values from low to high and then compare the distributions of ranks. This is the principle behind most commonly used nonparametric tests, which are used to analyze data from non-Gaussian distributions.

The third method is known as resampling. With this method, you create a population of sorts by repeatedly sampling values from your sample. This is best understood by an example. Assume you have a single sample of five values, and want to know how close that sample mean is likely to be to the true population mean. Write each value on a card and place the cards in a hat. Create many pseudo samples by drawing a card from the hat, writing down that number, and then returning the card to the hat. Generate many samples of N=5 this way. Since you can draw the same value more than once, the samples won’t all be the same (but some might be). When randomly selecting cards gets tedious, use a computer program instead. The distribution of the means of these computer-generated samples gives you information about how accurately you know the mean of the entire population. The idea of resampling can be difficult to grasp.
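The cards-in-a-hat procedure can be sketched as a small Python program. This is a minimal illustration, assuming five made-up values and a fixed random seed. The function name `bootstrap_means` and the simple percentile interval at the end are choices made for this sketch, not part of the description above.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=10_000, seed=0):
    """Draw pseudo samples with replacement and return their means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # rng.choices draws with replacement, like returning each card to the hat.
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical sample of five values.
sample = [4.1, 5.3, 6.0, 4.8, 5.6]

means = sorted(bootstrap_means(sample))

# One simple way to summarize the spread: the middle 95% of resampled means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print("sample mean:", mean(sample))
print("middle 95% of resampled means:", lower, "to", upper)
```

A narrow range suggests the sample mean is a fairly precise estimate, while a wide range suggests it is not.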

# Independent Sampling

# What are the Sampling Methods?

__Methods of Sampling__

Most sampling methods fall into two categories:

(A) Probability Sampling Methods

(B) Nonprobability Sampling Methods

**(A) Probability Sampling Methods** are those that clearly specify the probability or likelihood of inclusion of each element or individual in the sample.

The major probability sampling methods are the following:

- Simple Random Sampling
- Stratified Random Sampling

**1. Simple Random Sample**– A simple random sample is one in which every individual in the population has an equal chance of being included in the sample, and the selection of one individual in no way depends on the selection of another.

**2. Stratified Random Sample**– In stratified random sampling, the population is first divided into two or more strata. The strata may be based on a single criterion, such as gender (male and female), or on a combination of two or more criteria, such as gender and education, yielding four strata: male undergraduates, male graduates, female undergraduates and female graduates. Once the population has been divided into strata, which are considered internally homogeneous, a simple random sample of the desired size is taken from each stratum.
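As a rough sketch of the two probability methods above: the code below groups a made-up population into gender and education strata, then takes a simple random sample from each stratum. The function `stratified_sample`, the record fields, and the stratum sizes are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, n_per_stratum, seed=0):
    """Group the population into strata, then sample randomly within each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    # rng.sample gives every member of a stratum an equal chance of selection.
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical population: 10 people in each gender/education combination.
population = []
for gender in ("male", "female"):
    for education in ("undergraduate", "graduate"):
        for _ in range(10):
            population.append(
                {"id": len(population), "gender": gender, "education": education}
            )

chosen = stratified_sample(
    population,
    stratum_of=lambda p: (p["gender"], p["education"]),
    n_per_stratum=3,
)

for stratum, members in chosen.items():
    print(stratum, [m["id"] for m in members])
```

Calling `rng.sample` on the whole population instead of on each stratum would give a plain simple random sample.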

**(B) Nonprobability Sampling Methods** are those in which there is no way of assessing the probability that an element or group of elements of the population will be included in the sample. Important nonprobability sampling techniques are:

- Quota Sampling
- Accidental Sampling
- Judgmental or Purposive Sampling
- Systematic Sampling
- Snowball Sampling

**Quota Sampling**– This type of sampling is superficially similar to stratified random sampling. Here, the investigator identifies the different strata of the population and, from each stratum, selects the required number of individuals arbitrarily rather than randomly. This constitutes the quota sample.

Suppose the investigator knows that the population he is going to study has three strata in terms of shifts: Morning, Afternoon and Evening. Further suppose he knows that there are 100 people in the Morning shift, 700 people in the Afternoon shift and 200 people in the Evening shift. Thus, the population consists of 1000 individuals. If he wants to select 100 individuals and selects 10 from the Morning shift, 70 from the Afternoon shift and 20 from the Evening shift, according to his convenience (and not randomly), this constitutes a quota sample.
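The quotas in the shift example are proportional to the size of each shift, which can be checked with a short Python sketch. The function name `proportional_quotas` is hypothetical, and the integer division used here ignores any remainder, which is fine for these round numbers but would need more care in general.

```python
def proportional_quotas(stratum_sizes, sample_size):
    """Split a total sample size across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    return {
        name: size * sample_size // total
        for name, size in stratum_sizes.items()
    }

shifts = {"Morning": 100, "Afternoon": 700, "Evening": 200}
quotas = proportional_quotas(shifts, sample_size=100)

print(quotas)  # {'Morning': 10, 'Afternoon': 70, 'Evening': 20}
```

The code only fixes how many people to take from each shift. In quota sampling, who fills each quota is still decided by the investigator's convenience, not by random selection.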

**Purposive Sample**– This type of sample is based on the *typicality* of the cases to be included in the sample. The investigator has some belief that the handpicked sample is a very good representative of the population. A purposive sample is also known as a judgmental sample because the investigator, on the basis of his impression, makes a judgment about which cases are typical of the population.

Before general elections, purposive samples are often taken in an attempt to forecast the national result. The investigator selects persons from those states whose results in previous elections have approximated the actual national result and have thus been typical of the whole population.

**Accidental Sampling**– It refers to a sampling procedure in which the investigator selects the persons according to his convenience. Here he does not care about including the people with some specific or designated traits, rather he is mainly guided by convenience & economy. This is a crude method of sampling & the investigator knows that little can be generalized from the sample thus drawn.

**Systematic Sampling**– This may be defined as selecting every *n*th person from a predetermined list of elements or individuals. Selecting every 5th roll number in a class of 60 students constitutes systematic sampling. Likewise, drawing every 8th name from a telephone directory is an example of systematic sampling.
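The roll number example can be written as a one-line slice in Python. This is a minimal sketch, and the function name `systematic_sample` is hypothetical. It assumes counting starts at the first element, so the first person selected is the 5th on the list.

```python
def systematic_sample(items, k):
    """Return every k-th element of a list, starting with the k-th."""
    return items[k - 1::k]

roll_numbers = list(range(1, 61))  # a class of 60 students
selected = systematic_sample(roll_numbers, 5)

print(selected)       # [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
print(len(selected))  # 12
```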

**Snowball Sampling**– This type of sampling is basically sociometric. It is defined as having all the persons in a group or organization identify their friends, who in turn identify their friends and associates, until the researcher observes that a constellation of friendships converges into some type of definite social pattern. Snowball sampling has important research applications in relatively small business and industrial organizations where *N* is not expected to exceed 100. Such sampling is well suited to studies of social change and the diffusion of information among specific segments of social organizations.
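One way to picture the referral chain is as a walk over a friendship network: start with a few people, add the friends they name, then the friends those friends name, and so on. The sketch below assumes the friendships are already known and stored in a dictionary, which a real study would have to collect by asking people. All names and the function `snowball_sample` are hypothetical.

```python
from collections import deque

def snowball_sample(friends, seeds, max_size):
    """Follow friend referrals outward from the seeds, up to max_size people."""
    sampled = list(dict.fromkeys(seeds))[:max_size]  # de-duplicate, keep order
    seen = set(sampled)
    queue = deque(sampled)
    while queue and len(sampled) < max_size:
        person = queue.popleft()
        for friend in friends.get(person, []):
            if friend not in seen:
                seen.add(friend)
                sampled.append(friend)
                queue.append(friend)
                if len(sampled) == max_size:
                    break
    return sampled

# Hypothetical friendship lists; "Fay" is named by nobody.
friends = {
    "Ana": ["Ben", "Cy"],
    "Ben": ["Ana", "Dee"],
    "Cy": ["Ana"],
    "Dee": ["Ben", "Eli"],
    "Eli": ["Dee"],
    "Fay": [],
}

sample = snowball_sample(friends, seeds=["Ana"], max_size=100)
print(sample)  # ['Ana', 'Ben', 'Cy', 'Dee', 'Eli']
```

Anyone whom nobody names, like "Fay" here, is never reached, which is one reason the method is nonprobabilistic.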

# How to Calculate Sample Size?

__Sample Size Calculation:__

How many responses do you really need? This simple question is a never-ending quandary for researchers. A larger sample can yield more accurate results — but excessive responses can be pricey.

Consequential research requires an understanding of the statistics that drive sample size decisions. A simple equation will help you put the migraine pills away and sample confidently.

Before you can calculate a sample size, you need to determine a few things about the target population and the sample you need:

- **Population Size** – How many total people fit your demographic? For instance, if you want to know about mothers living in the US, your population size would be the total number of mothers living in the US. Don’t worry if you are unsure about this number. It is common for the population to be unknown or approximated.
- **Margin of Error (Confidence Interval)** – No sample will be perfect, so you need to decide how much error to allow. The confidence interval determines how much higher or lower than the population mean you are willing to let your sample mean fall. If you’ve ever seen a political poll on the news, you’ve seen a confidence interval. It will look something like this: “68% of voters said yes to Proposition Z, with a margin of error of +/- 5%.”
- **Confidence Level** – How confident do you want to be that the actual mean falls within your confidence interval? The most common confidence levels are 90%, 95%, and 99%.
- **Standard Deviation** – How much variance do you expect in your responses? Since we haven’t actually administered our survey yet, the safe decision is to use .5. This is the most forgiving number and ensures that your sample will be large enough.

Your confidence level corresponds to a Z-score. This is a constant value needed for this equation. Here are the z-scores for the most common confidence levels:

- 90% – Z Score = 1.645
- 95% – Z Score = 1.96
- 99% – Z Score = 2.576

If you choose a different confidence level, use a Z-score table to find your score.
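As an alternative to looking the value up in a table, the Z-score for any confidence level can be computed with Python's standard library. This is a small sketch. The function name `z_score` is hypothetical, and it assumes the usual two-sided interval, where the leftover probability is split evenly between the two tails.

```python
from statistics import NormalDist

def z_score(confidence_level):
    """Two-sided Z-score for a confidence level given as a fraction (e.g. 0.95)."""
    tail = (1 - confidence_level) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: Z = {z_score(level):.3f}")
```

Rounded to three decimals, the results should match the values listed above.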

Next, plug your Z-score, standard deviation, and margin of error (confidence interval) into this equation:

**Necessary Sample Size = (Z-score)² × StdDev × (1 − StdDev) / (margin of error)²**

Here is how the math works assuming you chose a 95% confidence level, .5 standard deviation, and a margin of error (confidence interval) of +/- 5%.

((1.96)² x .5(.5)) / (.05)²

(3.8416 x .25) / .0025

.9604 / .0025

384.16

Rounding up to the next whole person, 385 respondents are needed.
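The same arithmetic can be wrapped in a small Python function so that other confidence levels and margins of error are easy to try. This is a sketch of the equation above. The function name `necessary_sample_size` is hypothetical, and the result is rounded up because you cannot survey a fraction of a person.

```python
import math

def necessary_sample_size(z, std_dev, margin_of_error):
    """Z² × StdDev × (1 − StdDev) / (margin of error)², rounded up."""
    return math.ceil(z**2 * std_dev * (1 - std_dev) / margin_of_error**2)

# 95% confidence level, .5 standard deviation, +/- 5% margin of error.
print(necessary_sample_size(1.96, 0.5, 0.05))  # 385
```

Swapping in 1.645 or 2.576 for the Z-score gives the sample sizes for 90% and 99% confidence with the same margin of error.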
