I have been trying to clean a dataset named logbook.csv, which records the fuel usage of users around the world. The first step is to clean a column named "date_fueled", which holds the date on which each user purchased fuel. Most values in this column are dates in a format like "Apr 12 2020", but it also contains non-date values that have commas in them, e.g. "Cooling System, Heating System, Lights, Spark Plugs". I have tried to clean this column using several libraries (lubridate, parsedate, dplyr and readr), but I keep getting either errors or a column in which all my dates have turned into NA. I restarted RStudio to start over and realised that I get a warning message right after importing the dataset.
The warning message is as follows:
> library(readr)
> logbook <- read_csv("C:/Users/theet/Downloads/logbook.csv")
Rows: 1174870 Columns: 9
── Column specification ─────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): date_fueled, date_captured, cost_per_gallon, total_spent, user_url
dbl (3): gallons, mpg, miles
num (1): odometer
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
> View(logbook)
After reading the warning, I ran problems() on my data frame (problems(logbook)) and got the following output:
problems(logbook)
# A tibble: 398 × 5
     row   col expected actual     file
   <int> <int> <chr>    <chr>      <chr>
 1  5409     4 a double 8,583.478  C:/Users/theet/Downloads/logbook.csv
 2  5790     8 a double 1,182.5    C:/Users/theet/Downloads/logbook.csv
 3  9681     8 a double 1,888.2    C:/Users/theet/Downloads/logbook.csv
 4 12023     4 a double 10,738.000 C:/Users/theet/Downloads/logbook.csv
 5 12140     7 a double 1,049.2    C:/Users/theet/Downloads/logbook.csv
 6 12140     8 a double 2,713.3    C:/Users/theet/Downloads/logbook.csv
 7 13609     8 a double 132,388.0  C:/Users/theet/Downloads/logbook.csv
 8 16234     4 a double 2,817.502  C:/Users/theet/Downloads/logbook.csv
 9 20879     4 a double 16,378.667 C:/Users/theet/Downloads/logbook.csv
10 26262     8 a double 49,725.2   C:/Users/theet/Downloads/logbook.csv
# ℹ 388 more rows
# ℹ Use `print(n = ...)` to see more rows
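From this output, the values that failed to parse all look like numbers written with a thousands separator. Below is a minimal sketch of what I think is happening. It uses made-up strings in the same shape as the ones above rather than my actual file, and the names x, as_double and as_number are just for illustration:

```r
library(readr)

# Made-up values shaped like the "actual" column reported by problems()
x <- c("12.5", "8,583.478", "1,182.5")

# parse_double() is strict: entries containing a grouping comma fail and become NA
as_double <- suppressWarnings(parse_double(x))

# parse_number() ignores the grouping mark before converting to a number
as_number <- parse_number(x)

print(as_double)
print(as_number)
```

If that is right, reading the affected columns as character and converting them with parse_number() might avoid the warning, but I have not confirmed this on the full dataset.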
The link to my dataset is: https://drive.google.com/file/d/18TbpdmNS7hsBtUU-wkItEK9IBEfy9Hqr/view?usp=drive_link
Here is the code I wrote. It loads lubridate and parsedate, although the conversion itself uses base R's as.Date():
library(parsedate)
library(lubridate)
library(dplyr)
library(readr)
logbook2 <- read_csv("C:/Users/theet/Downloads/logbook.csv")
# Convert date_fueled to actual date objects
logbook2 <- logbook2 %>%
mutate(date_fueled = as.Date(date_fueled, format = "%b %d %Y")
# Replace NA values in date_fueled with NA
logbook2 <- logbook2 %>%
mutate(date_fueled = ifelse(is.na(date_fueled), NA, date_fueled))
head(logbook2)
The above code gave me this error:
Error: unexpected symbol in:
"#Replace NA values in date_fueled with NA
logbook2"
>
Please help me fix this error, and let me know if there are any other mistakes in my code.