Question

Give a step-by-step solution with explanation and final answer:

Problem 2. Decision Tree (21 points)

(1) (7 points) Table 1 consists of training data from an employee database. department, age, and
salary are attributes of the employee. For example, ‘36...45’ for age represents the age
range of 36 to 45, ‘36K...45K’ for salary represents the salary range of 36,000 to 45,000, and
‘sales’ represents an employee who belongs to the sales department. Let status indicate the
categorical labels of these 10 employees. Please calculate the information gain of each
attribute. Based on your calculation, identify which attribute should be used to split.

| department | age     | salary    | status |
|------------|---------|-----------|--------|
| sales      | 36...45 | 36K...45K | senior |
| sales      | 26...35 | 26K...35K | junior |
| sales      | 36...45 | 26K...35K | junior |
| systems    | 46...55 | 46K...55K | senior |
| systems    | 36...45 | 46K...55K | senior |
| systems    | 36...45 | 36K...45K | junior |
| systems    | 26...35 | 26K...35K | junior |
| marketing  | 36...45 | 36K...45K | senior |
| marketing  | 36...45 | 26K...35K | junior |
| marketing  | 26...35 | 26K...35K | junior |

Table 1: Attributes of 10 employees.

(2) (7 points) Consider the dataset given in Table 1. Please calculate the gain ratio of each
attribute. Based on this calculation, identify w

VIPSolutions · Accepted Answer

Let's solve **Problem 2(1): Information Gain Calculation** step by step.

---

### **Step 1: Extract Data from Table 1**

| department |   age   |   salary   | status |
|------------|---------|------------|--------|
| sales      | 36...45 | 36K...45K  | senior |
| sales      | 26...35 | 26K...35K  | junior |
| sales      | 36...45 | 26K...35K  | junior |
| systems    | 46...55 | 46K...55K  | senior |
| systems    | 36...45 | 46K...55K  | senior |
| systems    | 36...45 | 36K...45K  | junior |
| systems    | 26...35 | 26K...35K  | junior |
| marketing  | 36...45 | 36K...45K  | senior |
| marketing  | 36...45 | 26K...35K  | junior |
| marketing  | 26...35 | 26K...35K  | junior |

- **status** (target): 4 senior, 6 junior

---

### **Step 2: Calculate Base Entropy**

The entropy for the dataset ($S$):

$$
p_{senior} = \frac{4}{10}, \quad p_{junior} = \frac{6}{10}
$$
$$
Entropy(S) = -p_{senior} \log_2 p_{senior} - p_{junior} \log_2 p_{junior}
$$
$$
= -\frac{4}{10} \log_2 \frac{4}{10} - \frac{6}{10} \log_2 \frac{6}{10}
$$
$$
= -0.4 \cdot (-1.322) - 0.6 \cdot (-0.737)
$$
$$
= 0.529 + 0.442 = 0.971
$$
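
The arithmetic above can be checked numerically. The following is a minimal Python sketch; the helper name `entropy` is an illustrative assumption and does not come from the problem statement.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    # H = sum of p * log2(1/p); zero counts are skipped because
    # the term 0 * log2(0) is treated as 0 by convention.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# 4 senior and 6 junior employees
print(round(entropy([4, 6]), 3))  # 0.971
```

Passing counts rather than probabilities keeps the helper reusable for each subset produced by a split.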

---

### **Step 3: Calculate Information Gain for Each Attribute**
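
For reference, the per-attribute calculations in this step follow the standard definition of information gain. The symbols $A$ (an attribute), $v$ (one of its values), and $S_v$ (the subset of $S$ where $A = v$) are notation assumed here for illustration:

$$
E(A) = \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v), \qquad IG(A) = Entropy(S) - E(A)
$$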

#### **A. department**

- **Sales**: 3 (1 senior, 2 junior)
- **Systems**: 4 (2 senior, 2 junior)
- **Marketing**: 3 (1 senior, 2 junior)

##### Entropy for each split:

- **Sales**: $p_{senior}=1/3$, $p_{junior}=2/3$
$$
Entropy = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.918
$$

- **Systems**: $p_{senior}=2/4$, $p_{junior}=2/4$
$$
Entropy = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1
$$

- **Marketing**: $p_{senior}=1/3$, $p_{junior}=2/3$
$$
Entropy = 0.918
$$

Weighted average entropy:
$$
E(department) = \frac{3}{10}\cdot0.918 + \frac{4}{10}\cdot1 + \frac{3}{10}\cdot0.918
= 0.2755 + 0.4 + 0.2755 = 0.951
$$

**Information Gain:**
$$
IG(department) = 0.971 - 0.951 = 0.020
$$

---

#### **B. age**

- **26...35**: 3 (all junior)
- **36...45**: 6 (3 senior, 3 junior)
- **46...55**: 1 (senior)

Entropies:
- **26...35**: $p_{senior}=0$, $p_{junior}=1$ ⇒ Entropy = 0
- **36...45**: $p_{senior}=3/6$, $p_{junior}=3/6$
$$
Entropy = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1
$$
- **46...55**: $p_{senior}=1$, $p_{junior}=0$ ⇒ Entropy = 0

Weighted entropy:
$$
E(age) = \frac{3}{10}\cdot0 + \frac{6}{10}\cdot1 + \frac{1}{10}\cdot0 = 0 + 0.6 + 0 = 0.6
$$

**Information Gain:**
$$
IG(age) = 0.971 - 0.6 = 0.371
$$

---

#### **C. salary**

- **26K...35K**: 5 (all junior)
- **36K...45K**: 3 (2 senior, 1 junior)
- **46K...55K**: 2 (all senior)

Entropies:
- **26K...35K**: $p_{senior}=0$, $p_{junior}=1$ ⇒ Entropy = 0
- **36K...45K**: $p_{senior}=2/3$, $p_{junior}=1/3$
$$
Entropy = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918
$$
- **46K...55K**: $p_{senior}=1$, $p_{junior}=0$ ⇒ Entropy = 0

Weighted entropy:
$$
E(salary) = \frac{5}{10}\cdot0 + \frac{3}{10}\cdot0.918 + \frac{2}{10}\cdot0 = 0.2755
$$

**Information Gain:**
$$
IG(salary) = 0.971 - 0.2755 \approx 0.695
$$

---

### **Step 4: Identify the Best Attribute to Split**

- **department**: 0.020
- **age**: 0.371
- **salary**: 0.695

**Final Answer:**
> The attribute **salary** should be used to split, as it has the highest information gain (**0.695**).

---

## **Summary Table**

| Attribute   | Information Gain |
|-------------|-----------------|
| department  | 0.020           |
| age         | 0.371           |
| salary      | 0.695           |

**Best attribute to split:** **salary** (with information gain 0.695)
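
Because hand tallies are easy to get wrong, it can help to recompute the gains directly from the rows of Table 1. The sketch below is one way to do that in Python; the names `ROWS`, `ATTRIBUTES`, `entropy`, and `information_gain` are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Rows of Table 1: (department, age, salary, status)
ROWS = [
    ("sales", "36...45", "36K...45K", "senior"),
    ("sales", "26...35", "26K...35K", "junior"),
    ("sales", "36...45", "26K...35K", "junior"),
    ("systems", "46...55", "46K...55K", "senior"),
    ("systems", "36...45", "46K...55K", "senior"),
    ("systems", "36...45", "36K...45K", "junior"),
    ("systems", "26...35", "26K...35K", "junior"),
    ("marketing", "36...45", "36K...45K", "senior"),
    ("marketing", "36...45", "26K...35K", "junior"),
    ("marketing", "26...35", "26K...35K", "junior"),
]
ATTRIBUTES = {"department": 0, "age": 1, "salary": 2}

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of the status labels minus the weighted entropy after splitting."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])  # collect status per attribute value
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted

gains = {name: information_gain(ROWS, col) for name, col in ATTRIBUTES.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best split:", max(gains, key=gains.get))
```

Grouping the status labels by attribute value before computing entropy mirrors the manual procedure: tally each subset, compute its entropy, then weight by subset size.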

VIPSolutions · Answer

Certainly! Let's go through the detailed step-by-step calculations for **Problem 2(1): Calculating the Information Gain for each attribute** using the data provided.

---

### **Step 1: Extract the Data from the Table**

| Employee | department | age       | salary      | status  |
|------------|--------------|-----------|--------------|---------|
| 1          | sales        | 36...45   | 36K...45K    | senior  |
| 2          | sales        | 26...35   | 26K...35K    | junior  |
| 3          | sales        | 36...45   | 26K...35K    | junior  |
| 4          | systems      | 46...55   | 46K...55K    | senior  |
| 5          | systems      | 36...45   | 46K...55K    | senior  |
| 6          | systems      | 36...45   | 36K...45K    | junior  |
| 7          | systems      | 26...35   | 26K...35K    | junior  |
| 8          | marketing    | 36...45   | 36K...45K    | senior  |
| 9          | marketing    | 36...45   | 26K...35K    | junior  |
| 10         | marketing    | 26...35   | 26K...35K    | junior  |

**Target class (status):**  
- Senior: Employees 1, 4, 5, 8 → total 4  
- Junior: Employees 2, 3, 6, 7, 9, 10 → total 6

---

### **Step 2: Calculate the Entropy of the Whole Dataset**

$$
p_{senior} = \frac{4}{10} = 0.4
$$
$$
p_{junior} = \frac{6}{10} = 0.6
$$

Entropy formula:

$$
Entropy(S) = -p_{senior} \log_2 p_{senior} - p_{junior} \log_2 p_{junior}
$$

Calculations:

$$
-0.4 \times \log_2 0.4 = -0.4 \times (-1.322) = 0.529
$$
$$
-0.6 \times \log_2 0.6 = -0.6 \times (-0.737) = 0.442
$$

Total entropy:

$$
\boxed{
Entropy(S) = 0.529 + 0.442 = \mathbf{0.971}
}
$$

---

### **Step 3: Calculate the Information Gain for Each Attribute**

---

## **A. Attribute: department**

**Possible categories: sales, systems, marketing**

Count the employees in each category and their class distribution:

| Department | Total | Senior | Junior |  
|--------------|--------|---------|---------|  
| sales        | 3      | 1       | 2       |  
| systems      | 4      | 2       | 2       |  
| marketing    | 3      | 1       | 2       |
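
A count table like the one above can also be produced mechanically, which guards against tallying slips. This is a small Python sketch under the assumption that Table 1 is stored as tuples; `ROWS` and `class_distribution` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Rows of Table 1: (department, age, salary, status)
ROWS = [
    ("sales", "36...45", "36K...45K", "senior"),
    ("sales", "26...35", "26K...35K", "junior"),
    ("sales", "36...45", "26K...35K", "junior"),
    ("systems", "46...55", "46K...55K", "senior"),
    ("systems", "36...45", "46K...55K", "senior"),
    ("systems", "36...45", "36K...45K", "junior"),
    ("systems", "26...35", "26K...35K", "junior"),
    ("marketing", "36...45", "36K...45K", "senior"),
    ("marketing", "36...45", "26K...35K", "junior"),
    ("marketing", "26...35", "26K...35K", "junior"),
]

def class_distribution(rows, column):
    """Map each value of the chosen column to a Counter of status labels."""
    table = {}
    for row in rows:
        table.setdefault(row[column], Counter())[row[-1]] += 1
    return table

# Column 0 is department; use 1 for age or 2 for salary.
for value, counts in class_distribution(ROWS, 0).items():
    print(value, sum(counts.values()), counts["senior"], counts["junior"])
# sales 3 1 2
# systems 4 2 2
# marketing 3 1 2
```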

### **Step 3.1: Calculate entropy for each category**

**1. sales**

$$
p_{senior} = \frac{1}{3} \approx 0.333
$$
$$
p_{junior} = \frac{2}{3} \approx 0.667
$$

$$
Entropy_{sales} = -0.333 \log_2 0.333 - 0.667 \log_2 0.667
$$
$$
= -0.333 \times (-1.585) - 0.667 \times (-0.585) = 0.528 + 0.390 = 0.918
$$

**2. systems**

$$
p_{senior} = \frac{2}{4} = 0.5
$$
$$
p_{junior} = 0.5
$$

$$
Entropy_{systems} = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = -0.5 \times (-1) - 0.5 \times (-1) = 1
$$

**3. marketing**

$$
p_{senior} = \frac{1}{3} \approx 0.333
$$
$$
p_{junior} = \frac{2}{3} \approx 0.667
$$

$$
Entropy_{marketing} = 0.918 \quad \text{(same as sales)}
$$

### **Step 3.2: Calculate weighted average entropy**

$$
E(department) = \frac{3}{10} \times 0.918 + \frac{4}{10} \times 1 + \frac{3}{10} \times 0.918
$$
$$
= 0.3 \times 0.918 + 0.4 \times 1 + 0.3 \times 0.918
$$
$$
= 0.2755 + 0.4 + 0.2755 = 0.951
$$

### **Step 3.3: Compute Information Gain**

$$
IG(department) = Entropy(S) - E(department) = 0.971 - 0.951 = \boxed{0.020}
$$

---

## **B. Attribute: age**

Age categories are:

- 26...35 (Employees 2, 7, 10): total 3
- 36...45 (Employees 1, 3, 5, 6, 8, 9): total 6
- 46...55 (Employee 4): total 1

### **Step 3.1: Class distribution in each age category**

| Age Range   | Total | Senior | Junior |
|-------------|-------|--------|--------|
| 26...35     | 3     | 0      | 3      |
| 36...45     | 6     | 3      | 3      |
| 46...55     | 1     | 1      | 0      |

### **Step 3.2: Calculate entropy for each age category**

- **26...35**: all juniors → entropy = 0 (using the convention $0 \log_2 0 = 0$)
$$
p_{senior} = 0, \quad p_{junior} = 1
$$
$$
Entropy = -0 \times \log_2 0 - 1 \times \log_2 1 = 0
$$

- **36...45**: 3 senior, 3 junior

$$
p_{senior} = 0.5, \quad p_{junior} = 0.5
$$
$$
Entropy = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1
$$

- **46...55**: the single employee is senior

$$
p_{senior} = 1, \quad p_{junior} = 0
$$
$$
Entropy = 0
$$

### **Step 3.3: Calculate weighted average entropy**

$$
E(age) = \frac{3}{10} \times 0 + \frac{6}{10} \times 1 + \frac{1}{10} \times 0 = 0 + 0.6 + 0 = 0.6
$$

### **Step 3.4: Compute Information Gain**

$$
IG(age) = 0.971 - 0.6 = \boxed{0.371}
$$

---

## **C. Attribute: salary**

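Salary ranges:

- 26K...35K: Employees 2, 3, 7, 9, 10 → total 5
- 36K...45K: Employees 1, 6, 8 → total 3
- 46K...55K: Employees 4, 5 → total 2

### **Step 3.1: Class distribution**

| Salary Range | Total | Senior | Junior |
|--------------|-------|--------|--------|
| 26K...35K    | 5     | 0      | 5      |
| 36K...45K    | 3     | 2      | 1      |
| 46K...55K    | 2     | 2      | 0      |

### **Step 3.2: Entropies**

- **26K...35K**: all juniors → entropy = 0
- **36K...45K**: 2 seniors, 1 junior

$$
p_{senior} = \frac{2}{3} \approx 0.667
$$
$$
p_{junior} = \frac{1}{3} \approx 0.333
$$
$$
Entropy = -0.667 \log_2 0.667 - 0.333 \log_2 0.333 = 0.390 + 0.528 = 0.918
$$

- **46K...55K**: all seniors → entropy = 0

### **Step 3.3: Calculate weighted average entropy**

$$
E(salary) = \frac{5}{10} \times 0 + \frac{3}{10} \times 0.918 + \frac{2}{10} \times 0 = 0.2755
$$

### **Step 3.4: Compute Information Gain**

$$
IG(salary) = 0.971 - 0.2755 \approx \boxed{0.695}
$$

---

### **Step 4: Identify the Best Attribute to Split**

Salary has the highest information gain of the three attributes, so **salary** should be used for the split.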
