DR. OLCAY SAHIN

2020 TRB Annual Meeting

2019-12-04T00:00:00+00:00

I am going to be attending the Transportation Research Board (TRB)’s Annual Meeting in January 2020 at the Walter E. Washington Convention Center, in Washington, D.C. This year I have one accepted paper as the corresponding author. TRB’s Annual Meeting is one of the largest gatherings of transportation professionals and is expected to attract more than 13,000 participants from around the world.

Paper:

Sahin, O., Cetin, M., Ustun, I. (2020). Empty Platform Semi-Trailer classification using side-fire LIDAR data for supporting Freight Analysis and Planning

Lectern Session:

Innovations in Data Collection, Analysis, and Fusion to Address Persistent Freight Data Gaps
Standing Committee on Freight Transportation Data (ABJ90)
Tuesday, 1-14-2020 – 1:30 PM - 3:15 PM
Convention Center, 144A

ABSTRACT

Empty truck trips constitute an important aspect of commodity-based freight planning and modelling. But this information is generally not available to State DOTs or Metropolitan Planning Organizations (MPOs) since detecting empty trips is a challenge with traditional vehicle sensors. In this study, we propose a method for detecting empty and loaded platform semi-trailers using data from a multi-array LIDAR sensor. From the LIDAR cloud points, 3D profiles of trucks can be generated, and these profiles allow extracting useful information (e.g. body type, empty and loaded platforms). Since only platform semi-trailers’ load is observable from their 3D profiles, we only consider open platform trailers which constitute 20% of the truck trailer population in the USA. This paper shows how point-cloud data from a 16-beam LIDAR sensor are processed to extract useful information and features to distinguish between empty and loaded platform semi-trailers versus all other major truck body types (e.g. dry van, container, tank, automobile transport, etc.). Several machine learning (ML) models, in particular, K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Adaptive Boosting (AdaBoost.M2), and Support Vector Machines (SVM) are implemented on the field data collected on a freeway segment that includes over nine-thousand trucks. The results show that all major semi-trailers and empty platform semi-trailers can be distinguished with very high level of accuracies of 99% and 97% respectively.

Motor Carrier Management Information System

2019-11-07T00:00:00+00:00

RShiny application for downloading and visualizing Motor Carrier Management Information System (MCMIS) data. More information can be found here: link

I wanted to have an example for downloading data from RShiny and PostgreSQL powered web application.

Using this application, data can be filtered and downloaded.

I will also include some simple analysis.

Data is open to public and can be downloaded from this link. This data updates every month. I downloaded October 2019 dataset. If I have time, I will write a PHP script for automatically downloads and updates previously created database.

Virtal Private Server Setup

I rented a low cost virtual private server from OVH ($4 per month (a cup of coffee!)) in the following configuration 1 vCore 2GB memory and 20GB SSD space.

Database Setup

I created a local PostgreSQL database in the server for this data set. I could also use real-time reading from file using data.table library, but it could take some time to read this data from file in this server. This server is a low-cost server for lightweight works.

So I created a table in the database. You can see the SQL file in the repository.

Shiny Server Setup

Shiny server installation is straightforward. Just follow steps from official RStudio Shiny tutorial: link

Database Credentials protection

Since I am going to upload this script to the Github public repository, I am going to hide my credentials. In order to do so, I created a .Renviron file and included necessary credentials in this file as the following:

dbname = "name"
dbuser = "user"
dbpass = "password"
dbhost = "Ip address or localhost"
dbport = 5432 #Generally port for PostgreSQL is 5432. If you have multiple version check port number in postgres config
dbtable = "table name" #Not really necessary if you have multiple table in your database.

When an application or RStudio get started, these information are loaded to environment. There are so many ways to protect your credentials which is explained detail in Rstudio’s maual page: link

NGINX Web Server

Shiny server has it own port number (3838) to serve its applications. However I don’t want to enable any port other than 80. Therefore, I setup a proxy in Nginx web server. In this way, you can set a subdomain from domain name provider (e.g., GoDaddy, Name.com, etc) and point to Virtual Server IP address. Since we added the proxy information to the NGINX server, it will handle the routing.

Anybody who are not familier with the server and proxy can copy the below configuration. Don’t forget to update your information.

Below file is located at /etc/nginx/sites-available/mcmis

server {
    server_name mcmis.olcaysahin.com;
    root /opt/shiny-server/samples/mcmis/;
    gzip on;
    gzip_types text/plain text/css application/xml application/x-javascript;
    location / {
        proxy_set_header X-Real-IP  $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_buffering off;
        proxy_read_timeout 300s;
        proxy_pass http://127.0.0.1:3838/mcmis/;
        client_max_body_size 1000m;
     }
}

Symbolic Link

As you can see from the above root folder location, created RShiny Application located in a secure location. Therefore this location must be linked using Linux “Symbolic Link” command. See the example below:

ln -s /opt/shiny-server/samples/mcmis/ /srv/shiny-server/

Now the application can be seen under my porfolio’s subdomain as: http://mcmis.olcaysahin.com

RStudio Reticulate Setup

2019-11-01T00:00:00+00:00

This code written in the Rstudio. Reticulate library enables to run Python codes in Rstudio. I copied example from Reticulate manual. I will post my own examples soon.

R Code:

    library(reticulate)
    use_condaenv("r-reticulate",required = T)
    py_run_string("import os as os")
    py_run_string("os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = '../Local/Continuum/anaconda3/Library/plugins/platforms/'")

Python Code:

    import pandas
    import matplotlib.pyplot as plt
    import numpy as np
    t = np.arange(0.0,2.0,0.01)
    s = 1 + np.sin(2*np.pi*t)
    plt.plot(t,s)
    plt.grid(True)
    plt.show()

Chicago Ride-share Data Analysis

2019-10-22T00:00:00+00:00

I just have written a recent R code to analyze “rideshare trips” data from the Chicago data portal. Data contains individuals’ trips from origin coordinates to destination coordinates along with travel time and cost of the trip and some other information. The goal of using this data is for travel demand modeling. In travel demand modeling, we need to know zone-to-zone interaction. I am going to count how many trips has been made within the zones. However, there are some issues with the data set:

(1) Almost for each trip, census trac information is given. But in order to have a complete data set, missing census tracts are needed to be found. So we need to download the Chicago census tracts polygon data to find missing tracts for trips. Luckily, this Census tract data is also available in the Chicago data portal. As a result, we need to do a spatial match between Census tracts and missing data points in the trips data set.

(2) Downloaded data is about 19GB which has around 73M rows. My computer’s memory is 16GB. If I want to analyze this data either I have to rent an AWS or Google Cloud resources or split the data to the chunks and analyze individually. I could also use big data frameworks for this analysis, but it is not that big to rent Hadoop resources as well. I have to use my computer. I developed the following R code to handle this big data and analysis on the fly.

Source for the trip data: link

Source for the census tracts data: link

I had experience with all of the data handling libraries. For big data sets, I can suggest “data.table” library. For my spatial analysis, I used to analyze with PostgreSQL and PostGIS but there is a recent great library by Edzer Pebesma et. al (2019) for R which is “sf”. It is fast and easy to use. I don’t have to write spatial SQL code anymore.

Here is the code: This code also can be found in my github page: link

#' To see how long all process will take time...
startTime_FullProcess <- Sys.time()

library(data.table)
library(sf)
library(stringr)
library(dplyr)
library(readr)
library(fasttime)
library(mapview)

options(scipen=999)
#Read for the column names for the rest of readings.
TNP_ColNames <- fread('Transportation_Network_Providers_-_Trips (1).csv', nrows = 0)
#Column names has gaps. These gaps sometimes make problems. Thats why I added "_" if there is a gap.
setnames(TNP_ColNames, str_replace_all(names(TNP_ColNames), pattern = " ", replacement = "_"))

#' I also wanted write the results to the disc to not repeat the same process again.
#' For this purpose I can append to whenever the chunk of the data is analyzed.
#' But first I need to prepare a base empty file for appending
#' I am also adding 2 additional columns for newly found census tracts
TNP_ColNames_Base <- TNP_ColNames %>% mutate(pickup_census_tract=NA,dropoff_census_tract=NA)
fwrite(TNP_ColNames_Base,'TNP_Pickup_Dropoff.csv',append = F)

#' Read the census tracts shape file:
censusTracs <- read_sf("./geo_export_a33b8e6a-8cde-49fa-bca2-5446460ee02b.shp")
#' Fix the datum here. Transformation is also needed for the spatial analysis.
censusTracs <- st_transform(censusTracs, 4326)

#' Lets see the census tracts in our map.
mapview(censusTracs)

#' Let's start the fun part.
#' First set the number of chunks for each read.
nChunks = 1e6
#' I know we have around 73M of rows.
#' Let's set the chunks to read from data.
chunks <- seq(0,73e6,nChunks)
#' There are many ways to write this for loop. But I like to lapply which is faster.
#' I use bind_rows fcuntion from dplyr. This should be ok for this work.

mergedData <- bind_rows(lapply(chunks, function(chunk){
  #' I just want to see how many minutes or seconds to take to process and analyze the chunk of the data.
  startTime <- Sys.time()
  #' Read given amount of rows:
  #' I use fread function from data.table library. This is very fast.
  #' "chunk" variable passed through lapply function
  TNP_Part <- fread('Transportation_Network_Providers_-_Trips (1).csv', skip = (chunk+1),nrows = nChunks, header = F)

  #' I disabled the reading header names because we are not reading the always from top of the file.
  #' That's why we need to add header names.
  #' setnames is fucntion from data.table
  setnames(TNP_Part, names(TNP_ColNames))
  #' Just check the if there is a duplicated trips. According to description there should not be duplicated trip.
  #length(unique(TNP_Part$Trip_ID))

  #' Let's start with the spatial match with census tracts.
  #' Lets start with the missing pickups.
  TNP_Pickup <- TNP_Part[,.(Pickup_Centroid_Longitude,Pickup_Centroid_Latitude)]

  #' Convert the points to geospatial object
  TNP_Pickup <- sf::st_as_sf(TNP_Pickup, coords=c("Pickup_Centroid_Longitude","Pickup_Centroid_Latitude"), crs=4326, na.fail = F)

  #' Let's use the st_intersects from sf package.
  #' Results returns a list. I want to convert to data.frame. Thanks to dplyr for pipe function.
  #' This function returns 2 value. One is index value for pickups and other one is index vvalue for the census tracts.
  #' I could filter out the ones already have census tracts but I want to match with the given one and found one.
  pickup_CT <- sf::st_intersects(TNP_Pickup, censusTracs, sparse = T) %>% as.data.frame()

  #' Set a new column name for new census pickup tracts
  TNP_Part$pickup_census_tract <- NA
  #' Now lets update the values.
  TNP_Part$pickup_census_tract[pickup_CT$row.id] <- censusTracs$geoid10[pickup_CT$col.id]  

  #' Below is same process for drop offs
  TNP_Dropoff <- TNP_Part[,.(Dropoff_Centroid_Longitude,Dropoff_Centroid_Latitude)]
  TNP_Dropoff <- sf::st_as_sf(TNP_Dropoff, coords=c("Dropoff_Centroid_Longitude","Dropoff_Centroid_Latitude"), crs=4326, na.fail = F)

  dropoff_CT <- sf::st_intersects(TNP_Dropoff, censusTracs, sparse = T) %>% as.data.frame()

  TNP_Part$dropoff_census_tract <- NA
  TNP_Part$dropoff_census_tract[dropoff_CT$row.id] <- censusTracs$geoid10[dropoff_CT$col.id]  

  #' Lets start the data analysis now.
  #' Never use the for loop. If you select to use, you will wait forever.
  #' Use the advantage of data.table process.
  #' Time stamps format are need to be converted from text to timestamp.
  TNP_Part[, `:=` (Trip_Start_Timestamp_formatted = as.POSIXct(Trip_Start_Timestamp, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"),
                   Trip_End_Timestamp_formatted = as.POSIXct(Trip_Start_Timestamp, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"))]
  #' Add dates and times for grouping purposes.
  TNP_Part[, `:=` (Trip_Start_Date = as.IDate(Trip_Start_Timestamp_formatted), # Extract the trip start date
                   Trip_End_Date = as.IDate(Trip_End_Timestamp_formatted), # Extract the trip end date
                   start_hour_of_day = hour(Trip_Start_Timestamp_formatted), # Extract the trip start hour
                   end_hour_of_day = hour(Trip_End_Timestamp_formatted), # Extract the trip end hour
                   wday_trip_start = wday(Trip_Start_Timestamp_formatted), # Extract the trip start weekday
                   wday_trip_end = wday(Trip_End_Timestamp_formatted))]  # Extract the trip end weekday

  #' Thanks data.table. I want to buy a cup of coffee.
  #' This is increadbly fast.
  #' Now get the number of trips for each census tracts depending on the what resolution you need.
  TNP_Summary <- TNP_Part[,  list(nTrip=.N, #Number of trips
                                  aveTrip_TT_sec=mean(Trip_Seconds)), # AVerage travel time
                          by = c('Trip_End_Date',
                                 'end_hour_of_day',
                                 'pickup_census_tract',
                                 'dropoff_census_tract')] %>% na.omit()

  print(Sys.time() - startTime)

  #' If you need to write this file to local disk you can use the append.
  #' fwrite is a function from data.table lib.
  #fwrite(TNP_Part,'TNP_Pickup_Dropoff.csv',append = T)

  #' That's it.
  #' Now it is time to return the analysed chunk of data set to bind_rows fucntion for collecting all the chunks in one data frame.
  #' My memory now can handle this
  return(TNP_Summary)

}))

#' Let see how long it will take to finish all analysis
print(Sys.time() - startTime_FullProcess)
#' Time difference of 1.033505 hours
#' Of course if I filtered out the ones already have the census tracts, it would take way much less time.

#' Now let's make the final analysis. Then move the second part of the analysis.
finalMergedData <- mergedData[, list(sum=sum(nTrip)), by = c('Trip_End_Date', 'end_hour_of_day', 'pickup_census_tract','dropoff_census_tract')]

#' As a final word, this script is memory friendly and fast.
#' I can also do this process in parallel mode, but source data is big. Therfore, it could crash.
#' It is kind of still slow because spatial analysis make it slow.
#' In my computer (Windows 10 Intel Xeon CPU E5-2687W v2 @ 340GHz) RStudio consumes average 3.6GB memory.

Data.Table Samples

2019-05-21T00:00:00+00:00

In this post I am going to post some useful and handy data.table examples which I implemented in my codes.

Data.table cheat sheet link

Read list of files as a light speed: I had around 10k of same format csv files. Below code reads them incredibly fast.

data_files <- list.files('../../../../../Data/', full.names = T, recursive = T, pattern = '.csv')
l <- lapply(data_files, fread, sep = ',') #Read the files
data <- rbindlist(l, fill = T)

TSSC2019 Signalized Intersections Challenge

2019-03-19T00:00:00+00:00

TSSC2019 - Big Data Challenge on Signalized Intersections

Another traffic data related challenge. This time Traffic Signals Systems Committee (AHB25) organizing a challenge on developing a visualization tool and an algorithm to assist the decision making of the Utah Department of Transportation. Me and my old colleague Ilyas Ustun will attend the competition to provide a useful tool to UDOT.

Here is the link for the competition Github Page.

TRB 2019

2019-01-02T00:00:00+00:00

2019 TRB Annual Meeting

I am going to be attending the Transportation Research Board (TRB)’s Annual Meeting in January 2019 at the Walter E. Washington Convention Center, in Washington, D.C. I have two accepted papers at this conference, one as the author and the other as a co-author. TRB’s Annual Meeting is one of the largest gatherings of transportation professionals and is expected to attract more than 13,000 participants from around the world.
Papers:

Sahin, O., R.V. Nezafat, Cetin, M. (2019). Classification of Truck Trailers Based on Side-Fire LIDAR Data In Transportation Research Board 98th Annual Meeting.

Poster presentation: Monday 10:15 AM- 12:00 PM Convention Center, Hall A
R.V. Nezafat, Sahin, O., Cetin, M. (2019). A Deep Transfer Learning Approach for Classification of Truck Body Types Based on Side-Fire LIDAR Data In Transportation Research Board 98th Annual Meeting.

Presentation: Tuesday 1:30 PM- 3:15 PM Convention Center, 151B

Transfor19

2018-12-10T00:00:00+00:00

TRB-ABJ70 Transportation Forecasting Competition

Me and my old colleague Ilyas Ustun attended to the TRB-ABJ70 Transportation Forecasting Competition . The results will be announce on December 17th.

** Update: According to results we were not able to selected in the first 5 top attendees. The source code will be open in the competition’s Github page.

VDOT Hackathon

2018-04-30T00:00:00+00:00

VDOT SmarterRoads Hackaton

Old Dominion University’s Transportation Research Institute (TRI) attended the VDOT SmarterRoads Hackathon. We formed 3 teams and we have succesfully accomplished the projects in a given short time.

Me and my colleague Gulsevi developed a web application of toll-based route guidance. Toll data has been parsed real time from SmarterRoads Data Portal, travel time and direction data parsed from Google Maps Api.

When the user enter origin and destination locations, parsed data analyzed and suggests multiple route options with the tolling information. User can select desired route option for the destination.

The web application written in R with shiny package The source code can be found in here. Leaflet map has been used for displaying the waypoints of selected route.

The other teams projects are SmartPave and Vi-Care.

Unfortunately my project was not selected but I had a great time and experience.

In the hackathon, based on the submission rules, we used only provided data feeds. Actually there is a plenty of resources for the data feeds. The list is here. Download it before it has gone!

Here is the snapshot of the user interface. I want to note that it needs be still improved.

HRBT Overheight Trucks Problem

2017-09-10T00:00:00+00:00

HRBT Overheight Trucks

I have started to a new project which is analyzing the overheight truck turn arounds at Hampton Roads Bridge Tunnel (HRBT).

More information can be found in here.