I have a dataframe with a lot of entries similar to the table on the left below. I want to query it with SQL to get a result like the table on the right below, so that I can plot a stacked bar chart where each bar represents a state and the severity counts S03 and S04 stack on top of each other.
+--+-----+--------+
|ID|State|Severity|
+--+-----+--------+
|01| NY  |   3    |
|02| CA  |   4    |      +-----+---+---+
|03| NY  |   4    |      |State|S03|S04|
|04| CA  |   3    |  =>  +-----+---+---+
|05| CA  |   4    |      | CA  | 1 | 3 |
|06| CA  |   4    |      | NY  | 1 | 1 |
+--+-----+--------+      +-----+---+---+
I tried the following SQL query, but it returns the same S03 value for every state, and likewise for S04.
city_accidents = spark.sql("""
    SELECT State,
           (SELECT COUNT(ID) FROM us_accidents WHERE Severity = 3) AS S03,
           (SELECT COUNT(ID) FROM us_accidents WHERE Severity = 4) AS S04
    FROM accidents
    GROUP BY State
    ORDER BY State DESC
    LIMIT 10
""")
city_accidents.show()
+-----+---+---+
|State|S03|S04|
+-----+---+---+
| NY  | 1 | 3 |
| CA  | 1 | 3 |
+-----+---+---+
That is probably because I haven't put any filter on State in the inner SELECT statements, so they count over the whole table. Is there a way to reference the outer query's columns inside those inner SELECTs? What I mean is something like changing them to `(SELECT COUNT(ID) FROM us_accidents WHERE Severity = 3 AND State = this.State) AS S03`.
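For reference, here is a minimal runnable sketch of the shape of result I'm after, using conditional aggregation (`SUM(CASE WHEN ...)`) instead of per-column subqueries. I'm using an in-memory SQLite table as a stand-in for the Spark session; the table name and sample rows mirror the example above, and I believe the same SELECT would work unchanged inside `spark.sql(...)`, though I haven't verified that part.

```python
import sqlite3

# Stand-in for the Spark table, populated with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us_accidents (ID TEXT, State TEXT, Severity INTEGER)")
conn.executemany(
    "INSERT INTO us_accidents VALUES (?, ?, ?)",
    [("01", "NY", 3), ("02", "CA", 4), ("03", "NY", 4),
     ("04", "CA", 3), ("05", "CA", 4), ("06", "CA", 4)],
)

# Conditional aggregation: one scan of the table, one row per state,
# with each severity counted into its own column.
rows = conn.execute("""
    SELECT State,
           SUM(CASE WHEN Severity = 3 THEN 1 ELSE 0 END) AS S03,
           SUM(CASE WHEN Severity = 4 THEN 1 ELSE 0 END) AS S04
    FROM us_accidents
    GROUP BY State
    ORDER BY State
""").fetchall()
print(rows)  # [('CA', 1, 3), ('NY', 1, 1)]
```

This avoids the correlated-subquery question entirely: the CASE expression does the per-state filtering that the inner SELECTs were missing.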