Identifying Crime Hotspots in LA

R
Geospatial Analysis
USA
Los Angeles
Author

Derek Rodriguez

Published

December 15, 2024

In this post, I look at reported crimes in LA (2020-2023) to see which areas have high and low incidence, which might be one way to gauge the level of safety in those areas. I will be using R, specifically the tidyverse, sf, tmap, spatstat, sfdep and spdep packages.

Travelling Woes

We will be travelling to LA for vacation in a little over a month, and it would be good to learn a bit more before we get there. As a foreigner and given everything that’s happening in the US and in the world, I am quite concerned about my safety when visiting an unfamiliar place. Are there places that I need to avoid? Did we choose the right location for our hotel?

To attempt to answer these questions myself, one dimension I can look at is the location of crimes in LA. Data from 2020 to date are available on the US open data website data.gov. The website warns of data issues in January 2024 and a system transition from March 2024, so we will restrict ourselves to pre-2024 data from this source.

LA Crimes 2020-2023

I have downloaded the data from the website as a csv file, but other formats are available.

Importing and Transforming the Data

The code chunk below loads tidyverse and then reads the CSV into the object la_crimes. tidyverse is a collection of packages commonly used for data wrangling and cleaning.

pacman::p_load(tidyverse)
la_crimes <- read_csv("data/raw/Crime_Data_from_2020_to_Present.csv", show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

The file contains 987K rows and 28 columns. We can check all column names using colnames().

colnames(la_crimes)
 [1] "DR_NO"          "Date Rptd"      "DATE OCC"       "TIME OCC"      
 [5] "AREA"           "AREA NAME"      "Rpt Dist No"    "Part 1-2"      
 [9] "Crm Cd"         "Crm Cd Desc"    "Mocodes"        "Vict Age"      
[13] "Vict Sex"       "Vict Descent"   "Premis Cd"      "Premis Desc"   
[17] "Weapon Used Cd" "Weapon Desc"    "Status"         "Status Desc"   
[21] "Crm Cd 1"       "Crm Cd 2"       "Crm Cd 3"       "Crm Cd 4"      
[25] "LOCATION"       "Cross Street"   "LAT"            "LON"           

I am only concerned about the location of the crimes, the types of crimes, and the date (especially the year) the crimes were committed. I might also be interested in whether the crime happened indoors or outdoors. This means that I can just focus on five columns/variables, which I will define in an object cols_to_keep for easy selection later. I will also make the column names clearer by replacing them with the ones defined in cols_newnames.

cols_to_keep <- c("DATE OCC", "Crm Cd Desc", "LAT", "LON", "Premis Desc")
cols_newnames <- c("Date", "Crime Type", "latitude", "longitude", "Premise")

We can check the selected columns by using head() on a subset of the data.

head(select(la_crimes, all_of(cols_to_keep)))
# A tibble: 6 × 5
  `DATE OCC`             `Crm Cd Desc`                   LAT   LON `Premis Desc`
  <chr>                  <chr>                         <dbl> <dbl> <chr>        
1 03/01/2020 12:00:00 AM VEHICLE - STOLEN               34.0 -118. STREET       
2 02/08/2020 12:00:00 AM BURGLARY FROM VEHICLE          34.0 -118. BUS STOP/LAY…
3 11/04/2020 12:00:00 AM BIKE - STOLEN                  34.0 -118. MULTI-UNIT D…
4 03/10/2020 12:00:00 AM SHOPLIFTING-GRAND THEFT ($95…  34.2 -118. CLOTHING STO…
5 09/09/2020 12:00:00 AM VEHICLE - STOLEN               34.1 -118. STREET       
6 05/02/2020 12:00:00 AM VEHICLE - STOLEN               34.1 -118. STREET       

This shows that the date column (DATE OCC) is currently a character string, with the first portion being the date in US format (i.e., month before day).

In the following code chunk, we will reload the data and keep only the five columns using select(). The names can be changed by assigning a character vector to colnames(). We then convert the Date column to the right type and keep only records from before 2024.

la_crimes <- read_csv("data/raw/Crime_Data_from_2020_to_Present.csv", show_col_types = FALSE) %>%
  select(all_of(cols_to_keep)) #Keep five columns
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
colnames(la_crimes) <- cols_newnames #Rename columns

la_crimes$Date <- mdy_hms(la_crimes$Date) #Convert date column from string

la_crimes <- la_crimes[la_crimes$Date < ymd("2024-01-01"),] #Keep only pre2024 data
head(la_crimes)
# A tibble: 6 × 5
  Date                `Crime Type`                    latitude longitude Premise
  <dttm>              <chr>                              <dbl>     <dbl> <chr>  
1 2020-03-01 00:00:00 VEHICLE - STOLEN                    34.0     -118. STREET 
2 2020-02-08 00:00:00 BURGLARY FROM VEHICLE               34.0     -118. BUS ST…
3 2020-11-04 00:00:00 BIKE - STOLEN                       34.0     -118. MULTI-…
4 2020-03-10 00:00:00 SHOPLIFTING-GRAND THEFT ($950.…     34.2     -118. CLOTHI…
5 2020-09-09 00:00:00 VEHICLE - STOLEN                    34.1     -118. STREET 
6 2020-05-02 00:00:00 VEHICLE - STOLEN                    34.1     -118. STREET 

The new object now has 867K rows and 5 columns, from the original 987K rows and 28 columns.

I will also add a few date-derived columns to make the analyses easier later. The code chunk below creates columns for the year, the month and the day of the week.

la_crimes$Year <- year(la_crimes$Date)
la_crimes$Month <- month(la_crimes$Date, label = TRUE, abbr = FALSE)
la_crimes$Day_of_Week <- wday(la_crimes$Date, label = TRUE, abbr = FALSE)
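For readers who prefer a single pipeline, the whole import-and-transform sequence (selecting, renaming, date conversion, filtering and the derived columns) can be sketched as one chain. This is only an equivalent restatement of the steps above, assuming the same file path and the cols_to_keep / cols_newnames vectors defined earlier:

```r
# One-pipeline sketch of the import-and-transform steps above.
la_crimes <- read_csv("data/raw/Crime_Data_from_2020_to_Present.csv",
                      show_col_types = FALSE) %>%
  select(all_of(cols_to_keep)) %>%          # keep the five columns
  rename_with(~ cols_newnames) %>%          # positional rename
  mutate(Date = mdy_hms(Date)) %>%          # parse the US-format date string
  filter(Date < ymd("2024-01-01")) %>%      # keep only pre-2024 records
  mutate(
    Year        = year(Date),
    Month       = month(Date, label = TRUE, abbr = FALSE),
    Day_of_Week = wday(Date, label = TRUE, abbr = FALSE)
  )
```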

Data Quality Checking

Before going into the analysis, I can perform a few data quality checks on the data that we have left.

One possible check is to see the range of the dates. Do they all fall between Jan 1st 2020 and Dec 31st 2023? Are there any invalid dates? summary() gives the range and quartiles, so it can be used to display both the minimum and maximum. The function is.na() returns TRUE for missing (NA) values, so summing over it counts the dates that failed to parse.

summary(as.Date(la_crimes$Date))
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2020-01-01" "2020-11-09" "2022-02-25" "2022-01-05" "2023-01-24" "2023-12-31" 
sum(is.na(la_crimes$Date))
[1] 0

The output above shows that the dates are indeed between the beginning of 2020 and the end of 2023, and that there are no missing dates.
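A similar quick check can be run on the coordinates. The code below is a sketch: the bounding box is a rough approximation of the LA area (not an official extent), used to count missing, zeroed-out and out-of-range coordinate pairs.

```r
# Sketch: count suspicious coordinates. The bounds are rough assumptions.
la_crimes %>%
  summarise(
    missing_coords = sum(is.na(latitude) | is.na(longitude)),
    zero_coords    = sum(latitude == 0 & longitude == 0, na.rm = TRUE),
    out_of_bounds  = sum(!between(latitude, 33.5, 34.9) |
                         !between(longitude, -119.0, -117.5), na.rm = TRUE)
  )
```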

Another check that can be done is to look for irregularities or errors in the string columns: Crime Type and Premise. I can first check the number of unique values that these two columns contain using the function n_distinct().

n_distinct(la_crimes$'Crime Type')
[1] 138
n_distinct(la_crimes$Premise)
[1] 306

There is a very large number of unique values in these two columns. The unique() function can show the unique values in a column. The code chunk below displays the unique values for Crime Type as a single string with entries separated by '//'.

print(paste(unique(la_crimes$'Crime Type'), collapse = " // "))
[1] "VEHICLE - STOLEN // BURGLARY FROM VEHICLE // BIKE - STOLEN // SHOPLIFTING-GRAND THEFT ($950.01 & OVER) // ARSON // BURGLARY // PIMPING // PANDERING // OTHER MISCELLANEOUS CRIME // VANDALISM - MISDEAMEANOR ($399 OR UNDER) // INTIMATE PARTNER - SIMPLE ASSAULT // ROBBERY // THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD // ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT // THEFT OF IDENTITY // BATTERY - SIMPLE ASSAULT // SHOPLIFTING - PETTY THEFT ($950 & UNDER) // BUNCO, GRAND THEFT // VIOLATION OF COURT ORDER // VIOLATION OF RESTRAINING ORDER // THEFT PLAIN - PETTY ($950 & UNDER) // VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) // RAPE, FORCIBLE // THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER) // TRESPASSING // VEHICLE - ATTEMPT STOLEN // RESISTING ARREST // EMBEZZLEMENT, GRAND THEFT ($950.01 & OVER) // BURGLARY FROM VEHICLE, ATTEMPTED // LETTERS, LEWD  -  TELEPHONE CALLS, LEWD // CRIMINAL THREATS - NO WEAPON DISPLAYED // SEX OFFENDER REGISTRANT OUT OF COMPLIANCE // UNAUTHORIZED COMPUTER ACCESS // THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) // CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER) // BRANDISH WEAPON // BURGLARY, ATTEMPTED // DISCHARGE FIREARMS/SHOTS FIRED // BATTERY POLICE (SIMPLE) // VEHICLE, STOLEN - OTHER (MOTORIZED SCOOTERS, BIKES, ETC) // ORAL COPULATION // INDECENT EXPOSURE // THEFT FROM PERSON - ATTEMPT // CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT // OTHER ASSAULT // DISTURBING THE PEACE // INTIMATE PARTNER - AGGRAVATED ASSAULT // BOMB SCARE // FAILURE TO YIELD // CONTEMPT OF COURT // ATTEMPTED ROBBERY // ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER // DOCUMENT FORGERY / STOLEN FELONY // BUNCO, PETTY THEFT // SEXUAL PENETRATION W/FOREIGN OBJECT // SHOTS FIRED AT INHABITED DWELLING // CHILD STEALING // DEFRAUDING INNKEEPER/THEFT OF SERVICES, $950 & UNDER // KIDNAPPING - GRAND ATTEMPT // SHOTS FIRED AT MOVING VEHICLE, TRAIN OR AIRCRAFT // THEFT, PERSON // CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT // EXTORTION 
// CHILD NEGLECT (SEE 300 W.I.C.) // TILL TAP - GRAND THEFT ($950.01 & OVER) // SEX,UNLAWFUL(INC MUTUAL CONSENT, PENETRATION W/ FRGN OBJ // BATTERY WITH SEXUAL CONTACT // HUMAN TRAFFICKING - COMMERCIAL SEX ACTS // CHILD ANNOYING (17YRS & UNDER) // DOCUMENT WORTHLESS ($200.01 & OVER) // RAPE, ATTEMPTED // FALSE IMPRISONMENT // THROWING OBJECT AT MOVING VEHICLE // LEWD CONDUCT // PEEPING TOM // KIDNAPPING // CRIMINAL HOMICIDE // STALKING // THEFT PLAIN - ATTEMPT // SODOMY/SEXUAL CONTACT B/W PENIS OF ONE PERS TO ANUS OTH // VIOLATION OF TEMPORARY RESTRAINING ORDER // CHILD PORNOGRAPHY // WEAPONS POSSESSION/BOMBING // DRIVING WITHOUT OWNER CONSENT (DWOC) // THEFT FROM MOTOR VEHICLE - ATTEMPT // PICKPOCKET // SHOPLIFTING - ATTEMPT // COUNTERFEIT // BUNCO, ATTEMPT // DEFRAUDING INNKEEPER/THEFT OF SERVICES, OVER $950.01 // CRUELTY TO ANIMALS // FALSE POLICE REPORT // PROWLER // DISHONEST EMPLOYEE - GRAND THEFT // THREATENING PHONE CALLS/LETTERS // PURSE SNATCHING // EMBEZZLEMENT, PETTY THEFT ($950 & UNDER) // DOCUMENT WORTHLESS ($200 & UNDER) // ILLEGAL DUMPING // LEWD/LASCIVIOUS ACTS WITH CHILD // BATTERY ON A FIREFIGHTER // PETTY THEFT - AUTO REPAIR // MANSLAUGHTER, NEGLIGENT // RECKLESS DRIVING // TILL TAP - PETTY ($950 & UNDER) // PURSE SNATCHING - ATTEMPT // LYNCHING - ATTEMPTED // CREDIT CARDS, FRAUD USE ($950.01 & OVER) // CREDIT CARDS, FRAUD USE ($950 & UNDER // THEFT, COIN MACHINE - PETTY ($950 & UNDER) // HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE // BIKE - ATTEMPTED STOLEN // CONTRIBUTING // BRIBERY // BOAT - STOLEN // CONSPIRACY // GRAND THEFT / INSURANCE FRAUD // DRUGS, TO A MINOR // CHILD ABANDONMENT // THEFT, COIN MACHINE - GRAND ($950.01 & OVER) // DISRUPT SCHOOL // THEFT, COIN MACHINE - ATTEMPT // DISHONEST EMPLOYEE - PETTY THEFT // LYNCHING // FIREARMS RESTRAINING ORDER (FIREARMS RO) // REPLICA FIREARMS(SALE,DISPLAY,MANUFACTURE OR DISTRIBUTE) // GRAND THEFT / AUTO REPAIR // DRUNK ROLL // PICKPOCKET, ATTEMPT // TELEPHONE PROPERTY - DAMAGE // BEASTIALITY, 
CRIME AGAINST NATURE SEXUAL ASSLT WITH ANIM // FIREARMS EMERGENCY PROTECTIVE ORDER (FIREARMS EPO) // BIGAMY // INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES) // BLOCKING DOOR INDUCTION CENTER // INCITING A RIOT // DISHONEST EMPLOYEE ATTEMPTED THEFT // FAILURE TO DISPERSE"

This shows a very diverse list of crime types, some of which are specific variants of broader crimes. It would be useful to consolidate this into fewer types, but I can wait until I dive into the data to see whether focusing on a few types is enough to answer my questions.
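If consolidation does turn out to be needed later, one hypothetical approach is a case_when() over str_detect() patterns. The bucket names and patterns below are illustrative assumptions of mine, not an official taxonomy; the first matching rule wins.

```r
# Hypothetical sketch: collapse the 138 descriptions into broad buckets.
la_crimes <- la_crimes %>%
  mutate(Crime_Group = case_when(
    str_detect(`Crime Type`, "THEFT|STOLEN|BURGLARY|ROBBERY|SHOPLIFTING") ~ "Theft/Burglary",
    str_detect(`Crime Type`, "ASSAULT|BATTERY")  ~ "Assault/Battery",
    str_detect(`Crime Type`, "VANDALISM")        ~ "Vandalism",
    TRUE                                         ~ "Other"  # everything else
  ))
```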

The code chunk below does the same and shows the unique values for Premise.

print(paste(unique(la_crimes$Premise), collapse = " // "))
[1] "STREET // BUS STOP/LAYOVER (ALSO QUERY 124) // MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC) // CLOTHING STORE // PUBLIC STORAGE // JEWELRY STORE // OTHER BUSINESS // PARKING LOT // ALLEY // GAS STATION // CAR WASH // SINGLE FAMILY DWELLING // CONDOMINIUM/TOWNHOUSE // RESTAURANT/FAST FOOD // SIDEWALK // NURSING/CONVALESCENT/RETIREMENT HOME // MARKET // GOVERNMENT FACILITY (FEDERAL,STATE, COUNTY & CITY) // DRIVEWAY // MINI-MART // YARD (RESIDENTIAL/BUSINESS) // VEHICLE, PASSENGER/TRUCK // GARAGE/CARPORT // PARKING UNDERGROUND/BUILDING // NAIL SALON // PORCH, RESIDENTIAL // DRUG STORE // MTA BUS // MTA - RED LINE - VERMONT/BEVERLY // OTHER PREMISE // DAY CARE/CHILDREN* // BUS STOP // POLICE FACILITY // MISSIONS/SHELTERS // CHURCH/CHAPEL (CHANGED 03-03 FROM CHURCH/TEMPLE) // PHARMACY INSIDE STORE OR SUPERMARKET* // DISCOUNT STORE (99 CENT,DOLLAR,ETC. // MTA - RED LINE - 7TH AND METRO CENTER // OFFICE BUILDING/OFFICE // OTHER STORE // DELIVERY SERVICE (FED EX, UPS, COURIERS,COURIER SERVICE)* // HOTEL // MORTUARY // OTHER RESIDENCE // OTHER/OUTSIDE // MTA - EXPO LINE - EXPO/LA BREA // JUNIOR HIGH SCHOOL // NIGHT CLUB (OPEN EVENINGS ONLY) // COFFEE SHOP (STARBUCKS, COFFEE BEAN, PEET'S, ETC.) 
// CYBERSPACE // HIGH SCHOOL // CONSTRUCTION SITE // STUDIO (FILM/PHOTOGRAPHIC/MUSIC) // BEAUTY/BARBER SHOP // MOTEL // VALET // PARK/PLAYGROUND // TRANSITIONAL HOUSING/HALFWAY HOUSE // LAUNDROMAT // AUTO SALES LOT // ELEMENTARY SCHOOL // LA UNION STATION (NOT LINE SPECIFIC) // MTA - EXPO LINE - EXPO/VERMONT // BUS DEPOT/TERMINAL, OTHER THAN MTA // SPECIALTY SCHOOL/OTHER // CONVENTION CENTER // DETENTION/JAIL FACILITY // DIY CENTER (LOWE'S,HOME DEPOT,OSH,CONTRACTORS WAREHOUSE) // NURSERY/FLOWER SHOP // GROUP HOME // RECYCLING CENTER // 7TH AND METRO CENTER (NOT LINE SPECIFIC) // FRAT HOUSE/SORORITY/DORMITORY // TAXI // NA // HOSPITAL // DEPARTMENT STORE // PROJECT/TENEMENT/PUBLIC HOUSING // TRANSIENT ENCAMPMENT // MTA - RED LINE - WESTLAKE/MACARTHUR PARK // THE GROVE // SINGLE RESIDENCE OCCUPANCY (SRO'S) LOCATIONS // COLLEGE/JUNIOR COLLEGE/UNIVERSITY // VETERINARIAN/ANIMAL HOSPITAL // CHECK CASHING* // UNDERPASS/BRIDGE* // HEALTH SPA/GYM // MAIL BOX // MEDICAL/DENTAL OFFICES // LIQUOR STORE // AUTO REPAIR SHOP // TUNNEL // AUTO SUPPLY STORE* // MTA - EXPO LINE - EXPO/WESTERN // SHOPPING MALL (COMMON AREA) // BEAUTY SUPPLY STORE // DRIVE THRU* // THE BEVERLY CONNECTION // PUBLIC RESTROOM(INDOORS-INSIDE) // MTA - GOLD LINE - UNION STATION // POST OFFICE // APARTMENT/CONDO COMMON LAUNDRY ROOM // SYNAGOGUE/TEMPLE // MASSAGE PARLOR // MUNICIPAL BUS LINE INCLUDES LADOT/DASH // ABANDONED BUILDING ABANDONED HOUSE // PAWN SHOP // HARDWARE/BUILDING SUPPLY // STAPLES CENTER * // MTA - BLUE LINE - WASHINGTON // LIBRARY // FURNITURE STORE // STAIRWELL* // STORAGE SHED // HIGH-RISE BUILDING // THE BEVERLY CENTER // TV/RADIO/APPLIANCE // TOBACCO SHOP // MTA - RED LINE - UNION STATION // PATIO* // CELL PHONE STORE // MUSEUM // MTA - PURPLE LINE - CIVIC CENTER/GRAND PARK // CLEANER/LAUNDROMAT // MTA - RED LINE - CIVIC CENTER/GRAND PARK // AUTOMATED TELLER MACHINE (ATM) // SHORT-TERM VACATION RENTAL // PET STORE // MTA - GOLD LINE - HIGHLAND PARK // ELEVATOR // MOBILE 
HOME/TRAILERS/CONSTRUCTION TRAILERS/RV'S/MOTORHOME // DRIVE THRU BANKING (WINDOW)* // FREEWAY // BAR/COCKTAIL/NIGHTCLUB // MEMBERSHIP STORE (COSTCO,SAMS CLUB)* // BANK // DAY CARE/ADULTS* // MTA - RED LINE - PERSHING SQUARE // SLIPS/DOCK/MARINA/BOAT // BALCONY* // THEATRE/MOVIE // VEHICLE STORAGE LOT (CARS, TRUCKS, RV'S, BOATS, TRAILERS, ETC.) // CEMETARY* // MTA - EXPO LINE - FARMDALE // PRIVATE SCHOOL/PRESCHOOL // CREDIT UNION // WAREHOUSE // OIL REFINERY // SKATEBOARD FACILITY/SKATEBOARD PARK* // MTA - RED LINE - VERMONT/SUNSET // BEACH // TRAIN TRACKS // GUN/SPORTING GOODS // COLISEUM // MEDICAL MARIJUANA FACILITIES/BUSINESSES // WEBSITE // FOSTER HOME BOYS OR GIRLS* // MTA - RED LINE - VERMONT/SANTA MONICA // MTA - EXPO LINE - EXPO/SEPULVEDA // ELECTRONICS STORE (IE:RADIO SHACK, ETC.) // TOW YARD* // VISION CARE FACILITY* // TRASH CAN/TRASH DUMPSTER // TRANSPORTATION FACILITY (AIRPORT) // MTA - RED LINE - HOLLYWOOD/HIGHLAND // AIRCRAFT // SEX ORIENTED/BOOK STORE/STRIP CLUB/GENTLEMAN'S CLUB // BAR/SPORTS BAR (OPEN DAY & NIGHT) // AUTO DEALERSHIP (CHEVY, FORD, BMW, MERCEDES, ETC.) 
// PEDESTRIAN OVERCROSSING // MTA - SILVER LINE - HARBOR GATEWAY TRANSIT CTR // BASKETBALL COURTS // MTA - EXPO LINE - 7TH AND METRO CENTER // ENTERTAINMENT/COMEDY CLUB (OTHER) // MTA - EXPO LINE - EXPO/BUNDY // GOLF COURSE* // FIRE STATION // TELECOMMUNICATION FACILITY/LOCATION // FACTORY // MTA - RED LINE - HOLLYWOOD/VINE // SPORTS VENUE, OTHER // PUBLIC RESTROOM/OUTSIDE* // MTA - PURPLE LINE - WESTLAKE/MACARTHUR PARK // HOSPICE // MTA - BLUE LINE - SAN PEDRO // BUS-CHARTER/PRIVATE // RIVER BED* // MTA - ORANGE LINE - RESEDA // MTA - SILVER LINE - ROSECRANS // MANUFACTURING COMPANY // TATTOO PARLOR* // OPTICAL OFFICE INSIDE STORE OR SUPERMARKET* // MTA - EXPO LINE - EXPO PARK/USC // MTA - SILVER LINE - HARBOR FWY // MTA - ORANGE LINE - TAMPA // POOL-PUBLIC/OUTDOOR OR INDOOR* // METROLINK TRAIN // MTA - EXPO LINE - EXPO/CRENSHAW // MTA - EXPO LINE - LA CIENEGA/JEFFERSON // EQUIPMENT RENTAL // VACANT LOT // MTA - BLUE LINE - GRAND/LATTC // MTA - BLUE LINE - PICO // BANKING INSIDE MARKET-STORE * // AMUSEMENT PARK* // MTA - EXPO LINE - JEFFERSON/USC // SPORTS ARENA // MTA - BLUE LINE - 7TH AND METRO CENTER // MTA - RED LINE - HOLLYWOOD/WESTERN // SWAP MEET // BUS, SCHOOL, CHURCH // MTA - GOLD LINE - SOUTHWEST MUSEUM // GREYHOUND OR INTERSTATE BUS // TRADE SCHOOL (MEDICAL-TECHNICAL-BUSINESS)* // MTA - PURPLE LINE - UNION STATION // MTA - GOLD LINE - SOTO // MTA - BLUE LINE - 103RD/WATTS TOWERS // BOWLING ALLEY* // MTA - ORANGE LINE - WOODMAN // MTA - EXPO LINE - PICO // MTA - ORANGE LINE - VAN NUYS // OTHER PLACE OF WORSHIP // MTA PROPERTY OR PARKING LOT // MTA - PURPLE LINE - 7TH AND METRO CENTER // OTHER RR TRAIN (UNION PAC, SANTE FE ETC // BANK DROP BOX/MONEY DROP-OUTSIDE OF BANK* // COMPUTER SERVICES/REPAIRS/SALES // MTA - GOLD LINE - LINCOLN/CYPRESS // ABATEMENT LOCATION // MTA - GOLD LINE - MARIACHI PLAZA // MTA - EXPO LINE - WESTWOOD/RANCHO PARK // BOOK STORE // SAVINGS & LOAN // MTA - SILVER LINE - SLAUSON // MTA - ORANGE LINE - BALBOA // ABORTION 
CLINIC/ABORTION FACILITY* // TRAIN DEPOT/TERMINAL, OTHER THAN MTA // MTA - PURPLE LINE - PERSHING SQUARE // MTA - ORANGE LINE - WOODLEY // MTA - GOLD LINE - HERITAGE SQ // RECORD-CD MUSIC/COMPUTER GAME STORE // MTA - BLUE LINE - VERNON // MTA - SILVER LINE - UNION STATION // SEWAGE FACILITY/PIPE // DODGER STADIUM // MTA - SILVER LINE - 37TH ST/USC // MTA - GOLD LINE - CHINATOWN // MTA - GOLD LINE - INDIANA // TOOL SHED* // METHADONE CLINIC // PAY PHONE // AMTRAK TRAIN // MTA - SILVER LINE - DOWNTOWN STREET STOPS // VIDEO RENTAL STORE // TRUCK, COMMERICAL // MTA - SILVER LINE - PACIFIC COAST HWY // SURPLUS SURVIVAL STORE // HORSE RACING/SANTA ANITA PARK* // CATERING/ICE CREAM TRUCK // FINANCE COMPANY // DAM/RESERVOIR // CULTURAL SIGNIFICANCE/MONUMENT // MTA - ORANGE LINE - SEPULVEDA // MTA - GOLD LINE - LITTLE TOKYO/ARTS DISTRICT // MTA - GOLD LINE - PICO/ALISO // MASS GATHERING LOCATION // WATER FACILITY // OTHER INTERSTATE, CHARTER BUS // MTA - EXPO LINE - LATTC/ORTHO INSTITUTE // MTA - ORANGE LINE - VALLEY COLLEGE // ESCALATOR* // MTA - ORANGE LINE - SHERMAN WAY // MTA - ORANGE LINE - NORTH HOLLYWOOD // MTA - RED LINE - NORTH HOLLYWOOD // MTA - PURPLE LINE - WILSHIRE/NORMANDIE // ENERGY PLANT/FACILITY // MTA - ORANGE LINE - CHATSWORTH // MTA - PURPLE LINE - WILSHIRE/WESTERN // MTA - RED LINE - UNIVERSAL CITY/STUDIO CITY // MTA - RED LINE - WILSHIRE/VERMONT // MTA - GREEN LINE - AVALON // MTA - PURPLE LINE - WILSHIRE/VERMONT // MTA - GREEN LINE - HARBOR FWY // MTA - EXPO LINE - PALMS // SKATING RINK* // HANDBALL COURTS // MTA - ORANGE LINE - CANOGA // MTA - ORANGE LINE - NORDHOFF // TERMINAL, OTHER THAN MTA // MUSCLE BEACH // MTA - GREEN LINE - AVIATION/LAX // ARCADE,GAME ROOM/VIDEO GAMES (EXAMPLE CHUCKIE CHEESE)* // GARMENT MANUFACTURER // MTA - ORANGE LINE - LAUREL CANYON // MTA - ORANGE LINE - PIERCE COLLEGE // MTA - ORANGE LINE - ROSCOE // MOSQUE* // MTA - ORANGE LINE - DE SOTO // MTA - SILVER LINE - LAC/USC MEDICAL CENTER // DEPT OF DEFENSE FACILITY // 
RETIRED (DUPLICATE) DO NOT USE THIS CODE // HOCKEY RINK/ICE HOCKEY // MTA - SILVER LINE - MANCHESTER // CHEMICAL STORAGE/MANUFACTURING PLANT // TRAIN, OTHER THAN MTA (ALSO QUERY 809/810/811) // HARBOR FRWY STATION (NOT LINE SPECIFIC)"

While this is a longer list, it looks like there are a lot more identifiable groups. I will attempt to reduce this into a smaller list of items, which I will store in a new column called Premise_Type. The first step is to create the new column as a copy of Premise.

la_crimes$'Premise_Type' <- la_crimes$Premise

My goal is primarily to make certain types of outdoor or public locations easier to find since these will be more relevant for me as a tourist.

Consolidate MTA/Train

I will be taking public transport during this upcoming trip, so crimes that occur on trains and at stations are important.

n_distinct(la_crimes$Premise_Type)
[1] 306
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "MTA"), "MTA/Train", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "LA UNION STATION"), "MTA/Train", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "TRAIN"), "MTA/Train", Premise_Type))

n_distinct(la_crimes$Premise_Type)
[1] 211

Consolidate Bus / Bus Station

For the same reason as trains, I will also be interested in crimes that occur on buses and at bus stations.

n_distinct(la_crimes$Premise_Type)
[1] 211
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "BUS"), "Bus (or Stop/Station)", Premise_Type))

n_distinct(la_crimes$Premise_Type)
[1] 201

Consolidate Schools

There are many premises that represent schools. While we are not going to visit any of these, consolidating them will make it easy to filter or pick them out when looking at results.

n_distinct(la_crimes$Premise_Type)
[1] 201
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "SCHOOL"), "School", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "COLLEGE"), "School", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "DAY CARE"), "School", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "DORMITORY"), "School", Premise_Type))

n_distinct(la_crimes$Premise_Type)
[1] 193

Consolidate Residential

This will also make it easier to filter out private residences, as we have no plans to visit or stay in any residence other than our hotel.

n_distinct(la_crimes$Premise_Type)
[1] 193
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "DWELLING"), "Residential", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "CONDO"), "Residential", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "RESIDENTIAL"), "Residential", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "RESIDENCE"), "Residential", Premise_Type))
la_crimes <- la_crimes %>% mutate(Premise_Type = if_else(str_detect(Premise_Type, "MOBILE HOME"), "Residential", Premise_Type))

n_distinct(la_crimes$Premise_Type)
[1] 186
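The four consolidation passes above each rescan the column. They can equally be expressed as a single case_when() over the original Premise values, which makes the precedence of the rules explicit. The patterns are the same ones used above, and the first matching rule wins, so MTA is tested before BUS, just as in the sequential version:

```r
# Equivalent single-pass version of the consolidations above.
la_crimes <- la_crimes %>%
  mutate(Premise_Type = case_when(
    str_detect(Premise, "MTA|LA UNION STATION|TRAIN")        ~ "MTA/Train",
    str_detect(Premise, "BUS")                               ~ "Bus (or Stop/Station)",
    str_detect(Premise, "SCHOOL|COLLEGE|DAY CARE|DORMITORY") ~ "School",
    str_detect(Premise, "DWELLING|CONDO|RESIDENTIAL|RESIDENCE|MOBILE HOME") ~ "Residential",
    TRUE ~ Premise                                           # leave the rest as-is
  ))
```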

Exploratory Data Analysis

I will perform some exploratory data analysis (EDA) even before bringing in the geospatial data. The charts will be generated with ggplot2, which is already included in tidyverse, so there is no need to load it separately.

Number of Crimes Over Time (Total)

I first want to see whether there are trends or seasonality in the occurrence of crimes overall. The code below produces a bar chart for the total number of crimes recorded per year.

chart_data <- la_crimes %>%
    group_by(Year) %>%
    summarise(count = round(n()/1000,1), .groups = 'drop')

ggplot(chart_data, aes(x = factor(Year), y = count)) +
    geom_bar(stat = "identity", fill = "skyblue") +
    geom_text(aes(label = count), vjust = -0.5, size = 3.5) +
    labs(title = "Number of Crimes Recorded by Year (Thousands)", x = "Year") +
    theme_minimal() +
    theme(
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()
    )

The chart shows that, with the exception of 2021, there were between 230K and 255K crimes reported per year in LA. The number of crimes reported in 2021 deviates with only 146K crimes reported.

I can also look at any seasonality across the different years. The code chunk below creates a line chart which shows the number of crimes recorded per month.

chart_data <- la_crimes %>%
  group_by(Year, Month) %>%
  summarise(count = n(), .groups = 'drop')

ggplot(chart_data, aes(x = Month, y = count, color = factor(Year), group = Year)) +
  geom_line() +
  geom_point() +
  labs(title = "Number of Crimes Recorded by Month", x = "Month", y = "Number of Crimes", color = "Year") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.grid.major.x = element_blank())

Visually, the only recurring pattern I see is in the period from August to December, where there is an alternating drop and rise. October is the month with the highest number of reported crimes in 2022 and 2023.
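To put a number on that observation, the monthly counts already held in chart_data can be reduced to each year's peak month, for example:

```r
# Peak month per year, using the chart_data tibble built above.
chart_data %>%
  group_by(Year) %>%
  slice_max(count, n = 1) %>%
  ungroup()
```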

Number of Crimes By Type

The code below produces a table showing the top 20 crime types over the four-year period based on the number of reports. As some of the descriptions are very long, I opted for a table rather than a chart to make sure they are all readable.

chart_data <- count(la_crimes, `Crime Type`) %>%
  arrange(desc(n)) %>%
  mutate(
        Percentage = n / sum(n) * 100,
        Running_Total_Percentage = cumsum(n / sum(n) * 100)
    ) %>%
  slice_head(n = 20)

chart_data <- chart_data %>%
    rename(
        `Number of Records` = n,
        `Percentage of Total` = Percentage,
        `Running Total Percentage` = Running_Total_Percentage
    )

print(chart_data)
# A tibble: 20 × 4
   `Crime Type` `Number of Records` `Percentage of Total` Running Total Percen…¹
   <chr>                      <int>                 <dbl>                  <dbl>
 1 VEHICLE - S…               91496                 10.6                    10.6
 2 BATTERY - S…               68879                  7.94                   18.5
 3 THEFT OF ID…               55353                  6.38                   24.9
 4 BURGLARY FR…               53415                  6.16                   31.0
 5 BURGLARY                   53030                  6.12                   37.2
 6 VANDALISM -…               52686                  6.08                   43.2
 7 ASSAULT WIT…               49165                  5.67                   48.9
 8 THEFT PLAIN…               44529                  5.14                   54.0
 9 INTIMATE PA…               43319                  5.00                   59.0
10 THEFT FROM …               33712                  3.89                   62.9
11 THEFT FROM …               30216                  3.49                   66.4
12 ROBBERY                    29634                  3.42                   69.8
13 THEFT-GRAND…               28529                  3.29                   73.1
14 VANDALISM -…               22945                  2.65                   75.8
15 SHOPLIFTING…               21859                  2.52                   78.3
16 CRIMINAL TH…               17786                  2.05                   80.3
17 BRANDISH WE…               13390                  1.54                   81.9
18 TRESPASSING                12847                  1.48                   83.4
19 INTIMATE PA…               11763                  1.36                   84.7
20 VIOLATION O…               10886                  1.26                   86.0
# ℹ abbreviated name: ¹​`Running Total Percentage`

The table shows that the top type reported is stolen vehicles, followed by battery and identity theft. These three collectively make up 25% of all the reports. While a tourist like me might not be at risk of the first type, witnessing or being in the vicinity of a vehicle being stolen is still something I want to avoid. The locations of the crimes are then probably more important for someone like me.

Number of Crimes By Premise

The code below creates a similar table which shows the number of reports by Premise_Type.

chart_data <- count(la_crimes, `Premise_Type`) %>%
  arrange(desc(n)) %>%
  mutate(
        Percentage = n / sum(n) * 100,
        Running_Total_Percentage = cumsum(n / sum(n) * 100)
    )

chart_data <- chart_data %>%
    rename(
        `Number of Records` = n,
        `Percentage of Total` = Percentage,
        `Running Total Percentage` = Running_Total_Percentage
    )

print(chart_data)
# A tibble: 186 × 4
   Premise_Type `Number of Records` `Percentage of Total` Running Total Percen…¹
   <chr>                      <int>                 <dbl>                  <dbl>
 1 Residential               267393                 30.8                    30.8
 2 STREET                    218432                 25.2                    56.0
 3 PARKING LOT                59710                  6.89                   62.9
 4 Bus (or Sto…               49163                  5.67                   68.6
 5 SIDEWALK                   37215                  4.29                   72.9
 6 VEHICLE, PA…               25513                  2.94                   75.8
 7 GARAGE/CARP…               16781                  1.94                   77.8
 8 DRIVEWAY                   14115                  1.63                   79.4
 9 RESTAURANT/…               10932                  1.26                   80.7
10 DEPARTMENT …               10883                  1.26                   81.9
# ℹ 176 more rows
# ℹ abbreviated name: ¹​`Running Total Percentage`

The largest portion of reports, 31% of the total, comes from residential premises. These are not very concerning for me, as there is little chance that I will be exposed to these types of crime as a tourist staying in a hotel. The same is true for any other location that is neither public nor a hotel.

I can create a subset of the data that only includes crimes that happened in public by removing the ones that I think happen in private. I define a vector of values, private, to exclude, and then use filter() to remove them in the code chunk below. The list is not comprehensive, but it should at least exclude the top premises that a tourist like me will not encounter.

private <- c("Residential", "GARAGE/CARPORT", "OFFICE BUILDING/OFFICE",
             "WAREHOUSE", "CONSTRUCTION SITE", "MAIL BOX", "STORAGE SHED",
             "MEDICAL/DENTAL OFFICES", "MISSIONS/SHELTERS", "TRANSIENT ENCAMPMENT",
             "GROUP HOME", "PROJECT/TENEMENT/PUBLIC HOUSING",
             "DETENTION/JAIL FACILITY", "MANUFACTURING COMPANY",
             "SHORT-TERM VACATION RENTAL", "FOSTER HOME BOYS OR GIRLS*",
             "WATER FACILITY", "HOSPICE", "GARMENT MANUFACTURER",
             "RETIRED (DUPLICATE) DO NOT USE THIS CODE")

la_crimes_public <- la_crimes %>%
    filter(!Premise_Type %in% private)

# Display the resulting dataframe
head(la_crimes_public)
# A tibble: 6 × 9
  Date                `Crime Type`        latitude longitude Premise  Year Month
  <dttm>              <chr>                  <dbl>     <dbl> <chr>   <dbl> <ord>
1 2020-03-01 00:00:00 VEHICLE - STOLEN        34.0     -118. STREET   2020 March
2 2020-02-08 00:00:00 BURGLARY FROM VEHI…     34.0     -118. BUS ST…  2020 Febr…
3 2020-03-10 00:00:00 SHOPLIFTING-GRAND …     34.2     -118. CLOTHI…  2020 March
4 2020-09-09 00:00:00 VEHICLE - STOLEN        34.1     -118. STREET   2020 Sept…
5 2020-05-02 00:00:00 VEHICLE - STOLEN        34.1     -118. STREET   2020 May  
6 2020-07-07 00:00:00 ARSON                   34.1     -118. STREET   2020 July 
# ℹ 2 more variables: Day_of_Week <ord>, Premise_Type <chr>

Number of Crimes - Public

I now want to rerun some of the earlier charts with this new dataset. Will I see the same trends and top types of crimes for just the ‘public’ ones?

The code below recreates the number of crimes reported by month.

chart_data <- la_crimes_public %>%
  group_by(Year, Month) %>%
  summarise(count = n(), .groups = 'drop')

ggplot(chart_data, aes(x = Month, y = count, color = factor(Year), group = Year)) +
  geom_line() +
  geom_point() +
  labs(title = "Number of Public Crimes Recorded by Month", x = "Month", y = "Number of Crimes", color = "Year") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.grid.major.x = element_blank())

There is no big difference in the distribution of the crimes across years or months. I can still see some alternating in the latter part of the year, and the three years still appear to have very similar numbers of crimes reported.

The code below then recreates the number of reports by the type of crime.

chart_data <- count(la_crimes_public, `Crime Type`) %>%
  arrange(desc(n)) %>%
  mutate(
        Percentage = n / sum(n) * 100,
        Running_Total_Percentage = cumsum(n / sum(n) * 100)
    ) %>%
  slice_head(n = 20)

chart_data <- chart_data %>%
    rename(
        `Number of Records` = n,
        `Percentage of Total` = Percentage,
        `Running Total Percentage` = Running_Total_Percentage
    )

print(chart_data)
# A tibble: 20 × 4
   `Crime Type` `Number of Records` `Percentage of Total` Running Total Percen…¹
   <chr>                      <int>                 <dbl>                  <dbl>
 1 VEHICLE - S…               88381                15.5                     15.5
 2 BURGLARY FR…               45118                 7.91                    23.4
 3 BATTERY - S…               44514                 7.81                    31.2
 4 VANDALISM -…               41455                 7.27                    38.5
 5 ASSAULT WIT…               38470                 6.75                    45.2
 6 THEFT FROM …               30819                 5.40                    50.6
 7 ROBBERY                    27324                 4.79                    55.4
 8 THEFT FROM …               25465                 4.47                    59.9
 9 THEFT PLAIN…               24423                 4.28                    64.2
10 BURGLARY                   22959                 4.03                    68.2
11 SHOPLIFTING…               21707                 3.81                    72.0
12 THEFT-GRAND…               17994                 3.16                    75.2
13 VANDALISM -…               14805                 2.60                    77.8
14 THEFT OF ID…               13135                 2.30                    80.1
15 INTIMATE PA…               12980                 2.28                    82.3
16 BRANDISH WE…                8888                 1.56                    83.9
17 CRIMINAL TH…                8542                 1.50                    85.4
18 TRESPASSING                 6197                 1.09                    86.5
19 SHOPLIFTING…                4312                 0.756                   87.2
20 ATTEMPTED R…                4194                 0.735                   88.0
# ℹ abbreviated name: ¹​`Running Total Percentage`

Stolen vehicles still appear as the top type of crime, but identity theft no longer follows it. Burglary from vehicles and battery follow as the second and third most frequent types of crime reported.

Moving forward, I will limit the analysis to 2023, the most recent full year. This reduces the amount of data I need to process, and should not be an issue since it is the most recent data and does not deviate much from the other years (2021 aside).

The code below keeps only the records with a Year of 2023 in la_crimes_public

la_crimes_public <- la_crimes_public[la_crimes_public$Year == 2023, ]

Importing the LA Map

In order to visualize the data on a map and apply geospatial techniques, we need to bring in the geospatial data. I have taken the district-level map from the LA County website. The following code loads the sf package and then imports the map in shapefile format into R using st_read(). sf is the package for importing and working with sf, or simple features, objects in R for geospatial operations.

pacman::p_load(sf)

la_district <- st_read(dsn = "data/geospatial", layer="LA_City_Council_Districts")
Reading layer `LA_City_Council_Districts' from data source 
  `C:\drkrodriguez\datawithderek\posts\la-crimes\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 15 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6359580 ymin: 1714641 xmax: 6514613 ymax: 1945516
Projected CRS: NAD83 / California zone 5 (ftUS)

The output shows that there are 15 features, or districts, in the object, all of the polygon type. We can use st_crs() to display information on the CRS, or coordinate reference system, of the object.

st_crs(la_district)
Coordinate Reference System:
  User input: NAD83 / California zone 5 (ftUS) 
  wkt:
PROJCRS["NAD83 / California zone 5 (ftUS)",
    BASEGEOGCRS["NAD83",
        DATUM["North American Datum 1983",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4269]],
    CONVERSION["SPCS83 California zone 5 (US survey foot)",
        METHOD["Lambert Conic Conformal (2SP)",
            ID["EPSG",9802]],
        PARAMETER["Latitude of false origin",33.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-118,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",35.4666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",34.0333333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8824]],
        PARAMETER["Easting at false origin",6561666.667,
            LENGTHUNIT["US survey foot",0.304800609601219],
            ID["EPSG",8826]],
        PARAMETER["Northing at false origin",1640416.667,
            LENGTHUNIT["US survey foot",0.304800609601219],
            ID["EPSG",8827]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["US survey foot",0.304800609601219]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["US survey foot",0.304800609601219]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura."],
        BBOX[32.76,-121.42,35.81,-114.12]],
    ID["EPSG",2229]]

The output shows that the object is based on EPSG 2229, which is a valid system, but its unit is the US survey foot. We can project it to a different CRS, EPSG 3310, which uses meters, by using st_transform().

la_district <- st_read(dsn = "data/geospatial", layer="LA_City_Council_Districts") %>%
  st_transform(3310)
Reading layer `LA_City_Council_Districts' from data source 
  `C:\drkrodriguez\datawithderek\posts\la-crimes\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 15 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6359580 ymin: 1714641 xmax: 6514613 ymax: 1945516
Projected CRS: NAD83 / California zone 5 (ftUS)

Geospatial data can be visualized as maps using the tmap package. I use qtm() below to quickly visualize the contents of la_district.

pacman::p_load(tmap)

qtm(la_district)

Converting Report Locations to sf Format

The crime data cannot be used directly with the map data since it is not yet in sf format. The conversion can be done using st_as_sf(), passing the coordinate columns into the coords argument. The CRS code assigned first is 4326, which is WGS 84, a geodetic system, so that the degree coordinates are interpreted correctly. The code chunk ends by transforming to the same CRS code of 3310 that was used earlier for the district map.

la_crimes_sf <- select(la_crimes_public, c("Crime Type", "Month", "Day_of_Week",
                                          "latitude", "longitude")) %>%
  st_as_sf(coords = c("longitude", "latitude"),
                       crs=4326) %>%
  st_transform(crs = 3310)

In order to visualize these on the district map, we can add individual elements as layers with tmap using tm_shape(). The first layer will be the district map, followed by the crime locations. Colors can be defined for the elements of each layer. I added an alpha argument to the crime locations so they have some transparency, which gives a relative idea of the density.

tm_shape(la_district) +
  tm_borders("black") +
tm_shape(la_crimes_sf) +
  tm_dots("red", alpha = 0.1)

There are clearly some areas that have a lot more crimes than others. Unfortunately, I am not familiar with the locations in LA, and I am not sure which district our hotel is in. I have defined a variable hotel which contains our Downtown LA hotel's location based on its coordinates, converted to sf.
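The creation of hotel is not shown above, so here is a minimal sketch of how it could be built. The coordinates below are illustrative placeholders for somewhere in Downtown LA, not the actual hotel location.

```r
pacman::p_load(sf)

# Placeholder Downtown LA coordinates (longitude, latitude) -- not the actual hotel
hotel <- st_sfc(st_point(c(-118.25, 34.05)), crs = 4326)

# Project to the same CRS used for the district map
hotel <- st_transform(hotel, crs = 3310)
```

The point is created in WGS 84 (EPSG 4326) first so the degree coordinates are read correctly, then projected to EPSG 3310 to match the district map.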

I can then find the district that this point is in by using st_contains(). Setting sparse = FALSE makes the function return a matrix of TRUE/FALSE values rather than a sparse index list.

hotel_dist <- la_district[st_contains(la_district, hotel, sparse = FALSE),]

I can now redraw the map to highlight the district of our hotel. In the code below, I add it as a layer with a green border or outline.

tm_shape(la_district) +
  tm_borders("black") +
tm_shape(la_crimes_sf) +
  tm_dots("red", alpha = 0.1) +
tm_shape(hotel_dist) +
  tm_borders("green")

Our downtown hotel, unsurprisingly, is in one of the districts with a high number of crimes. It is hard to pinpoint which area within the district has the highest incidence of reports, or whether the district's count is higher or lower than its neighbors'.

While the previous visual already gives us an idea of the density of crime reports, it is hard to tell objectively where the incidence of (public) crimes is high or low. There are, however, some techniques that can. In this post, I will use a few and see whether they tell me the same thing, and whether I should be more careful when stepping out of our hotel:

  1. Simple density calculation

  2. KDE or Kernel Density Estimation

  3. Using LISA or local indicators of spatial association

Visualizing Using (Simple) Density

One way to describe how crime-prone an area is is to simply count the number of crimes reported in that area. The code below uses st_intersects() to check which crimes were reported in each district and uses lengths() to count them, storing the result in a new column called Crimes.

la_district$Crimes <- lengths(st_intersects(la_district, la_crimes_sf))
head(la_district)
Simple feature collection with 6 features and 6 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 122729.1 ymin: -442872.4 xmax: 167861.7 ymax: -412202.8
Projected CRS: NAD83 / California Albers
  DIV_CITY Dist_Str Shape_Leng Shape_Le_1 Shape_Area
1        1        1   65684.47   178255.1  387763426
2        2        2  133851.86   362762.5 1385362790
3        3        3   64802.21   175710.7 1152116605
4        4        4   98313.54   266752.6  848830556
5        5        5  141215.59   383106.5 1314559236
6        6        6   70725.56   191612.7  742012710
                        geometry Crimes
1 POLYGON ((161792.9 -432253....  10063
2 POLYGON ((147756.5 -422606,...   7674
3 POLYGON ((136499 -422382.3,...   8381
4 POLYGON ((158353.1 -437130....  11076
5 POLYGON ((145559.7 -436402....   8730
6 POLYGON ((138282.7 -420588....   8119

We can visualize this in a map by using the counts as a parameter for the fill of the districts.

tm_shape(la_district) +
  tm_borders("black") +
  tm_fill('Crimes') +
tm_shape(hotel_dist) +
  tm_borders("green") +
tm_layout(main.title.size = 1, main.title = "2023 Crimes Per District", legend.width = 5)

The district our hotel is in does appear to be the one with the highest number of reported crimes in 2023. The district level might be too coarse, though, so I want to look at a lower level to see if this still holds.

I have downloaded the neighborhood council maps from the same source. The code below imports them into R and locates our hotel within them.

la_nbhood <- st_read(dsn = "data/geospatial", layer="LA_City_Neighborhood_Councils") %>%
  st_transform(3310) %>%
  st_crop(st_bbox(la_district))
Reading layer `LA_City_Neighborhood_Councils' from data source 
  `C:\drkrodriguez\datawithderek\posts\la-crimes\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 99 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -118.6681 ymin: 33.70374 xmax: -117.6573 ymax: 34.91415
Geodetic CRS:  WGS84(DD)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
hotel_nbhood <- la_nbhood[st_contains(la_nbhood, hotel, sparse = FALSE),]

tm_shape(la_nbhood) +
  tm_borders("black") +
tm_shape(hotel_nbhood) +
  tm_fill("green")

I then count the number of crimes per neighborhood (council) using the same approach as earlier. Note that since we are checking for points in polygons, st_intersects() can actually be replaced with st_contains().

la_nbhood$Crimes <- lengths(st_intersects(la_nbhood, la_crimes_sf))

To visualize the counts in a map, we can again use the tmap function and just pass the neighborhood-level data instead of the district-level ones.

tm_shape(la_nbhood) +
  tm_borders("black") +
  tm_fill('Crimes') +
tm_shape(hotel_nbhood) +
  tm_borders("green") +
tm_layout(main.title.size = 1, main.title = "2023 Crimes Per Neighborhood", legend.width = 5)

At the neighborhood level, our hotel still appears to be in the densest area in terms of the number of crimes reported. For the other neighborhoods in LA, it is hard to see which one comes next, and just looking at raw counts might be misleading since neighborhoods have different sizes or land areas. Looking at the density might be more appropriate.

To compute the density, I first need the area of each neighborhood. While there is a Shape_Area column, I am not sure about its accuracy or its units, so it is better to compute the areas myself. I can do this by simply using st_area() on the sf object.

la_nbhood$AREA <- st_area(la_nbhood)

As the projection is in meters, the computed areas are in \(m^2\). To get the density, I just need to divide the number of crimes by these areas. The resulting numbers will be very small when presented as crimes per square meter. Since the units are carried along as an attribute, we load the units package to do the conversion and keep the displayed units consistent. The last line converts the units from "1/m^2" to "1/km^2", which multiplies the original values by a factor of one million.

la_nbhood$crime_density <- la_nbhood$Crimes / la_nbhood$AREA

pacman::p_load(units)

la_nbhood$crime_density <- set_units(la_nbhood$crime_density, "1/km^2", mode = "standard")

I can now use this new column to generate the plot of the density.

tm_shape(la_nbhood) +
  tm_borders("black") +
  tm_fill('crime_density', title = "Crimes Per Sq Km") +
tm_shape(hotel_nbhood) +
  tm_borders("green") +
tm_layout(main.title.size = 1, main.title = "Density of 2023 Crime Reports", legend.width = 5)

This updated chart shows that a different neighborhood now has the highest density (at least 1000 reports per square kilometer), followed by a handful of neighborhoods with 500-1000 reports per sq km. Every other neighborhood has a density below 500 per square km.

While this is good, I don't want to see 90 neighborhoods in a single group. tmap allows the user to adjust how groups or classes are defined via the style argument. One possible value is "quantile", which splits the data into five groups with (roughly) equal numbers of elements.

tm_shape(la_nbhood) +
  tm_borders("black") +
  tm_fill('crime_density', title = "Crimes Per Sq Km", style = "quantile", palette = "Reds") +
tm_shape(hotel_nbhood) +
  tm_borders("green") +
tm_layout(main.title.size = 1, main.title = "Density of 2023 Crime Reports", legend.width = 5)

It is now easier to see which neighborhoods are in the top or bottom 20%, which are roughly 20 neighborhoods each as there are 99 neighborhoods in the dataset.
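As a side note, the breaks behind the "quantile" style can be reproduced in base R. The values below are toy stand-ins for the actual densities, just to show how the classes are formed:

```r
# Toy density values standing in for crime_density
x <- c(5, 12, 40, 75, 130, 220, 340, 510, 820, 1400)

# Five classes need break points at the 0/20/40/60/80/100th percentiles
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))

# Assign each value to a class; every class gets (roughly) the same count
classes <- cut(x, breaks = breaks, include.lowest = TRUE)
table(classes)  # two values per class here
```

This is why each color band on the map covers about a fifth of the neighborhoods, regardless of how skewed the densities are.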

We can further improve the map and make it easier to interpret by making it interactive by setting the mode to "view". I have also added an id argument to the fill element so the neighborhood names display on mouseover.

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(la_nbhood) +
  tm_borders("black") +
  tm_fill('crime_density', id = "name", title = "Crimes Per Sq Km", style = "quantile", palette = "Reds") +
tm_shape(hotel_nbhood) +
  tm_borders("green")
tmap_mode("plot")
tmap mode set to plotting

Again, Downtown LA is in the top 20% of the neighborhoods. The interactive map also lets me see that the small neighborhood beside Downtown LA, Westlake South, was the one with the highest density at 1423 crimes reported per square kilometer. I am also able to see the names of the neighborhoods with the lowest densities, like Coastal San Pedro and Wilmington in the south.

Visualizing Using KDE

While the density approach produced some good visualizations, it’s dependent on the partitioning used and only describes the data in the confines of those partitions. The partitions might also not represent the true distribution of the data.

An alternative approach is KDE, or kernel density estimation. In very simple terms, KDE with geospatial point data models the number of points, events, or, in our case, crime reports that are within range of every location within the 'map' or study area. The range is referred to as the bandwidth and can be either a fixed number or a dynamic, adaptive one. This also implies that the densities are computed over overlapping areas, since a single event is within range of every location inside its bandwidth.
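For reference, with a fixed bandwidth \(h\) and kernel function \(k\), the intensity estimate at a location \(s\) can be written (edge correction aside) as:

\[
\hat{\lambda}(s) = \frac{1}{h^2} \sum_{i=1}^{n} k\!\left(\frac{s - s_i}{h}\right)
\]

where \(s_1, \dots, s_n\) are the event (crime report) locations. Each event spreads its contribution over a neighborhood of radius governed by \(h\), which is why neighboring events reinforce each other into hotspots.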

In R, KDE can be performed using functions from the spatstat package.

pacman::p_load(spatstat)

Removing Duplicates

Before proceeding, I need to check that there are no duplicated points in the data as some of the functions do not work in their presence. If there are duplicates, then some transformation is needed to make the data fit for use.

The check can be done using the duplicated() function and it needs to be done at the geometry or location of the points.

any(duplicated(la_crimes_sf$geometry))
[1] TRUE

The output shows that there is indeed some duplication in the point locations. We can introduce 'jitter', a small random shift, to ensure that points do not occupy the same spot. For sf objects, this can be done with the st_jitter() function, where an amount can be specified for the range of jitter introduced.

la_crimes_jitt <- st_jitter(la_crimes_sf$geometry, 5)

We can doublecheck that there is no more duplication using the same function. Note that the new object la_crimes_jitt only contains the geometry or the points, so we can pass the whole object into duplicated().

any(duplicated(la_crimes_jitt))
[1] FALSE

Data Transformations for KDE

Additional data transformations are required as spatstat does not directly work with sf objects. One option is to use the ppp format. Converting the crime report data to this format takes a few steps.

The first step is to create an owin object from the map of LA, which defines the geographic boundaries. This is done by passing the boundaries into as.owin(). Since a single boundary is required, we use st_union() to merge the districts into one polygon first.

la_owin <- as.owin(st_union(st_geometry(la_district)))

plot(la_owin)

The next code chunk completes the transformation by converting the crime locations into a ppp object using as.ppp() and then combining it with the owin using the second line of code.

la_crimes_ppp <- as.ppp(la_crimes_jitt)
la_crimes_ppp <- la_crimes_ppp[la_owin]

plot(la_crimes_ppp)

If we use the data as is, all the units will be in meters, which means the density values will be very small. We can convert the distance units to kilometers using rescale.ppp(), passing a factor of 1000 as a parameter.

la_crimes_ppp <- rescale.ppp(la_crimes_ppp, 1000, "km")

KDE with Fixed Bandwidth

The KDE can be computed using the density() function. This computes the KDE with a user-fixed bandwidth, which is passed into the sigma argument. There are a number of other possible arguments, but I am using just two of them in the upcoming code chunks:

  1. The edge argument indicates whether edge correction is applied.

  2. The kernel argument specifies the smoothing kernel, the function used to smooth and interpolate the density values. Densities are contributed by the observed points, so there needs to be a way to estimate the density in between them. The most common kernel is gaussian, which represents a normal curve.

The code chunks below compute the KDE for four different bandwidths and then produces the plots by passing them into the plot() function.

BW = 250m

kde_la_crimes_250m <- density(la_crimes_ppp,
                         sigma = 0.25,
                         edge = TRUE,
                         kernel = "gaussian")
plot(kde_la_crimes_250m, main = "KDE with 250m bandwidth")

BW = 500m

kde_la_crimes_500m <- density(la_crimes_ppp,
                         sigma = 0.5,
                         edge = TRUE,
                         kernel = "gaussian")
plot(kde_la_crimes_500m, main = "KDE with 500m bandwidth")

BW = 1km

kde_la_crimes_1km <- density(la_crimes_ppp,
                         sigma = 1,
                         edge = TRUE,
                         kernel = "gaussian")
plot(kde_la_crimes_1km, main = "KDE with 1km bandwidth")

BW = 5km

kde_la_crimes_5km <- density(la_crimes_ppp,
                         sigma = 5,
                         edge = TRUE,
                         kernel = "gaussian")
plot(kde_la_crimes_5km, main = "KDE with 5km bandwidth")

The plots continue to show that the area around Downtown LA has the highest density. The largest bandwidth produces a very smooth gradient, which is not useful for identifying specific hotspot locations. The smaller bandwidths are more useful for this purpose.

These charts continue to show that the central area, including Downtown LA, is where the crime reports are most concentrated. There are a few spots in the north which have high incidence of reports, but most of the northern part is generally the least dense in terms of the number of crime reports.

Note that there are more formal or scientific methods to determine the ideal bandwidth, but I will not cover them in this post.

KDE with Adaptive Bandwidth

The downside of a fixed bandwidth is that it can hide variation where points are dense, while producing large variances where points are sparse. An adaptive bandwidth addresses this by adjusting the bandwidth depending on the density of the points in the region.

The KDE with adaptive bandwidth can be computed by using adaptive.density(). The method argument specifies the estimation method used and can take on one of three values: “kernel”, “voronoi” or “nearest”.

kde_la_crimes_adaptive <- adaptive.density(la_crimes_ppp, method = "kernel")
write_rds(kde_la_crimes_adaptive, "data/rds/kde_la_crimes_adaptive.rds")
plot(kde_la_crimes_adaptive, main = "KDE with Adaptive bandwidth")

The output continues to highlight downtown LA as a hotspot, but also produced some hotspots scattered across the map which can be thought of as local hotspots.

Visualizing Using LISA

The last option I will explore in this post is LISA, or local indicators of spatial association, specifically Moran's I. Spatial association, or autocorrelation, describes whether there is systematic variation over space for a variable. Tests can be done to conclude whether there is (statistically significant) clustering or dispersion for a variable. These tests and statistics can be computed at a global or local level.

The local tests and methods can also be used to classify hotspots and outliers in the data, which is something that could help me answer the main problem.
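For reference, the local Moran's I statistic for area \(i\), with value \(x_i\) and spatial weights \(w_{ij}\), takes the form:

\[
I_i = \frac{x_i - \bar{x}}{S^2} \sum_{j} w_{ij}\,(x_j - \bar{x}), \qquad S^2 = \frac{\sum_{k} (x_k - \bar{x})^2}{n - 1}
\]

A positive \(I_i\) means area \(i\) and its neighbors are similar (both high or both low), while a negative value flags an area that differs from its neighbors, i.e. an outlier.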

I will be using the sfdep package to perform most of the steps here.

pacman::p_load(sfdep)

One thing to note is that this method uses area attributes rather than point data. This means the visualization will be more similar to the first method (density by neighborhood) than to the second (KDE).

I will be using the neighborhood-level density of crime reports for this approach.

Deriving Weights

Instead of just looking at each cell's or object's own value, spatial autocorrelation looks at an object together with its neighbors. The 'value' of a cell is then an average, sum, or some other combination of its own and its neighbors' values, depending on the weights.
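As a toy illustration of what row-standardised weights do, the spatially lagged value of a cell is just the weighted average of its neighbors. The three-area example below is hypothetical, using only base R:

```r
# Values for three toy areas
x <- c(10, 20, 60)

# Neighbor lists: area 1 borders 2; area 2 borders 1 and 3; area 3 borders 2
nb <- list(c(2L), c(1L, 3L), c(2L))

# Row-standardised weights: each neighbor of area i gets 1 / (number of i's neighbors)
wt <- lapply(nb, function(n) rep(1 / length(n), length(n)))

# Spatial lag: weighted average of each area's neighbors
lag_x <- mapply(function(n, w) sum(x[n] * w), nb, wt)
lag_x  # 20, 35, 20
```

Comparing each area's own value against its lag is the core of what the local statistics below formalize.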

The first step is deriving the list of neighbors per neighborhood. If we observe the map, we see that some borders overlap and some do not even touch. This makes functions like st_contiguity(), which derives a list of neighbors based on adjacency, infeasible without cleaning up the data.

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(la_nbhood) +
  tm_borders("black") +
  tm_fill("lightblue") +
tm_shape(hotel_nbhood) +
  tm_fill("green")
tmap_mode("plot")
tmap mode set to plotting

Instead of contiguity, we will use distance to determine the list of neighbors. The first step is to determine the distance to use; it should ensure that every neighborhood has at least one neighbor.

la_nbhood_poly <- st_cast(la_nbhood, "POLYGON")
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
centroid_x <- map_dbl(la_nbhood_poly$geometry, ~st_centroid(.x)[[1]])
centroid_y <- map_dbl(la_nbhood_poly$geometry, ~st_centroid(.x)[[2]])
coords <- cbind(centroid_x, centroid_y)
head(coords)
     centroid_x centroid_y
[1,]   144466.2  -418250.3
[2,]   165487.4  -433628.8
[3,]   157666.4  -429928.5
[4,]   144533.1  -433198.1
[5,]   165313.5  -440626.4
[6,]   159993.3  -447374.8

Next, I need to find the distance from each neighborhood to its nearest neighbor, which does not require them to touch on the map. For this, I need functions from the spdep package. The code below does the following:

  1. Look for each neighborhood's nearest neighbor using knearneigh(). The output is converted into a neighbor class, nb, using knn2nb().

  2. Return the distances to those neighbors using nbdists()

pacman::p_load(spdep)

k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords))
summary(k1dists)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  751.6  1731.0  2280.4  2546.8  3093.7  6088.8 

The output shows that the largest nearest-neighbor distance is 6088.8 meters, which means a distance of at least that should ensure that every neighborhood finds at least one neighbor. I will use 6090.

To return a list of distance based neighbors, I use st_dist_band() in the code chunk below. This needs to be applied to the centroids so I create a new object for the neighborhood centroids and then define a range of 0 to 6090 meters for the neighbors.

la_nbhood_centroids <- st_centroid(la_nbhood)
Warning: st_centroid assumes attributes are constant over geometries
la_nbhood_centroids$nb <- st_dist_band(la_nbhood_centroids,
                                       lower = 0,
                                       upper = 6090)

The next step is to assign weights, which can be done using st_weights(). It requires a neighbor list as input, and I pass "W" to the style argument for row-standardised weights, where each of an area's neighbors is weighted equally.

la_nbhood_centroids$wt <- st_weights(la_nbhood_centroids$nb, style = "W")

Performing Local Moran’s I Test

I will use Moran's I as the local statistic, which is useful for identifying outliers and clusters. This can be done using the local_moran() function of sfdep. The function needs three required arguments: the values to be used, the neighbor list, and the weights. The nsim argument sets the number of simulations used for significance testing. I set a seed at the top of the block to make sure the results are reproducible.

set.seed(1234)

la_nbhood_centroids <- la_nbhood_centroids %>%
  mutate(local_moran = local_moran(
    Crimes, nb, wt, nsim = 99),
    .before = 1) %>%
  unnest(local_moran)

Note that the dataframe used does not contain the actual neighborhood maps so it is not suitable for mapping. We can transfer the results, which are in the first 11 columns of the output, into the original neighborhood map using the following code.

results <- as.data.frame(la_nbhood_centroids)[1:11]
la_nbhood_moran <- bind_cols(la_nbhood, results)

Visualising Local Indicators

We can visualize the results by using the tmap package. The test statistic is stored in the column ii while the p values for the simulations are stored in p_ii_sim. In general, values above zero for the test statistics indicate clustering while those below indicate dispersion.

tm_shape(la_nbhood_moran) +
  tm_fill(c("ii", "p_ii_sim"), title = c("Local Moran's I","P Value")) +
  tm_borders(alpha = 0.5) +
  tm_view(set.zoom.limits = c(6,8)) +
  tm_layout(
    main.title = "Local Moran's I and P-values",
    legend.position = c("left", "bottom"))
Variable(s) "ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

One useful output of local_moran() is the classification of elements, one of which is stored in the column mean. This classifies clusters and outliers and indicates whether they have high or low values. Note that these classifications are available for all elements, regardless of whether the test finds them significant.

The code chunk below produces an interactive map showing the classifications for the neighborhoods where the p-value is less than 5%. Clusters are labeled "Low-Low" or "High-High", while the two other labels indicate outliers: a high-valued neighborhood among low-valued ones shows as "High-Low", and the reverse shows as "Low-High".

la_nbhood_moran_sig <- la_nbhood_moran %>%
  filter(p_ii < 0.05)

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(la_nbhood_moran) +
  tm_polygons(id = "name") +
  tm_borders(alpha = 0.5) +
tm_shape(la_nbhood_moran_sig) +
  tm_fill("mean", id = "name") +
  tm_borders(alpha = 0.4)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).
tmap_mode("plot")
tmap mode set to plotting

The results show a statistically significant cluster with high crime report density, again including Downtown LA. There are four outliers identified around this cluster's borders. There is also a significant low-density cluster of neighborhoods a little north of the high-density cluster.

What do I do now?

It appears that no matter how I look at it, Downtown LA and its surroundings have the highest density of crime reports. This doesn't necessarily mean that it is unsafe to step out of the hotel; it just means that more crimes were reported there. It is also not surprising, since these are also the densest areas in terms of the number of people. Regardless of what I see here, we will still go ahead with our hotel booking since it's fully paid already. XD