Getting Started with `tezr`

This vignette walks through the core features of tezr. Each section builds on the previous one, starting with simple keyword searches and progressing to multi-filter queries, detail retrieval, and cache management.

Package builds show the code without running live requests. Set TEZR_LIVE_DOCS=true before rendering if you want to refresh all outputs.

library(tezr)
library(dplyr)

tezr has three search functions: search_basic(), search_advanced(), and search_detailed(). The sections below cover each function in order.

Basic Search

search_basic() searches the National Thesis Center (NTC) database by keyword. It checks all fields by default, so it works well when you do not know where your term appears.

# Search all fields for "tarımsal sulama"
ag_irrigation <- search_basic("tarımsal sulama")

The output is a tibble with one row per thesis.

# Column names and types
dplyr::glimpse(ag_irrigation)

Targeting Specific Fields

You can use the search_field argument to restrict matching to a single field.

# Search only in thesis titles
ag_irrigation_title <- search_basic(
  "tarımsal sulama", 
  search_field = "title")
dplyr::glimpse(ag_irrigation_title)

Available search field values are "all" (default), "title", "author", "supervisor", "subject", "index", and "abstract". Use search_detailed(thesis_no = ...) for thesis-number lookup.

# Search abstracts
abstract_search <- search_basic(
  "production function", 
  search_field = "abstract")

# Search by author name
author_search <- search_basic(
  "Işıl Şirin Selçuk", 
  search_field = "author")

Filtering by Thesis Type and Access Status

search_basic() also accepts thesis_type and access_type filters. These filters are applied server-side, so you download fewer records.

Available thesis_type values are: "all" (default), "masters", "phd", "medical_specialty", "arts", "dentistry", "medical_sub", "pharmacy".

# PhD dissertations only
phd_results <- search_basic(
  "ekonometri", 
  thesis_type = "phd")
dplyr::glimpse(phd_results)

Available access type values are: "all" (default), "open", "restricted".

# Open access theses only
open_results <- search_basic(
  "hanehalkı", 
  access_type = "open")

The 2000-Result Limit

Basic search returns at most 2000 results. This is a server-side limit. If your query matches more than 2000 records, the function warns you. To retrieve everything, set max_search_results = Inf: search_basic() then delegates to search_advanced(), which paginates past the limit. The Auto-Pagination section below explains how this works.

# This stops at 2000
climate_change <- search_basic("climate change")

# Delegate to advanced search with auto-pagination
climate_change_all <- search_basic(
  keyword = "climate change",
  max_search_results = Inf
)

Advanced Search

search_advanced() adds year range, language, thesis type, access type, and thesis status filters to keyword search.

The NTC advanced search form supports up to three keyword rows combined with Boolean operators (AND, OR, NOT), each targeting a different field. search_advanced() exposes only the first keyword row.

R packages that interface with academic databases, such as rentrez (PubMed) and europepmc (Europe PMC), often pass Boolean logic as a single query string (for example, "term1 AND term2"). NTC does not accept free-form Boolean strings. It uses structured form fields for each keyword row, so that pattern is not applicable here. To keep the interface simple, search_advanced() does not expose Boolean row combinations. For equivalent results, you can use the following approaches.

AND: Use search_detailed() with its field-specific parameters (title, author, supervisor, etc.).
OR: Run separate searches and combine with dplyr::bind_rows() |> dplyr::distinct().
NOT: Run both searches and exclude with dplyr::anti_join().
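
The OR and NOT approaches can be tried offline with toy data. This is a minimal sketch: term1 and term2 are hypothetical stand-ins for two search results, and it assumes thesis_no identifies a thesis uniquely, as in the deduplication step described later in this vignette.

```r
library(dplyr)

# Hypothetical stand-ins for two search results. Real tezr output has more
# columns; thesis_no is used here as the join and deduplication key.
term1 <- tibble(thesis_no = c(1, 2, 3), title = c("A", "B", "C"))
term2 <- tibble(thesis_no = c(3, 4), title = c("C", "D"))

# OR: rows matching either term, keeping one row per thesis
either <- bind_rows(term1, term2) |>
  distinct(thesis_no, .keep_all = TRUE)

# NOT: rows matching term1 but not term2
only_term1 <- anti_join(term1, term2, by = "thesis_no")
```

Deduplicating on thesis_no rather than on whole rows keeps the result stable even if two searches return slightly different values in other columns for the same thesis.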

Year and Language Filters

# Keyword search with year range
recent_climate <- search_advanced(
  keyword = "iklim değişikliği",
  year_start = 2015,
  year_end = 2024
)

# English-language theses only
# language accepts ISO 639 codes ("tr", "en", "fr", "de", ...), or
# full names ("Turkish", "French")
english_growth <- search_advanced(
  keyword = "economic growth",
  language = "en"
)

# French-language theses
french_theses <- search_advanced(
  keyword = "migration",
  language = "fr"
)

Thesis Status

status controls whether results include only approved theses or also in-preparation ones: "approved" (default), "all", "in_preparation".

# In-preparation theses (not yet defended)
ongoing_ml <- search_advanced(
  keyword = "makine öğrenmesi",
  status = "in_preparation"
)

Combining Filters

You can combine filters to build precise keyword queries. Start with a minimal query, then add constraints.

# Open-access PhD theses with "ekonometri" in the title, 2000-2024
complex_query <- search_advanced(
  keyword = "ekonometri",
  search_field = "title",
  thesis_type = "phd",
  year_start = 2000,
  year_end = 2024,
  access_type = "open"
)

Auto-Pagination

When max_search_results is greater than 2000 (including Inf) and the server reports more than 2000 matches, tezr switches to iterative year-range pagination. If you do not supply year_start and year_end, the package uses 1959:current_year as the search window.

It then creates year chunks with weighted split points (pre-2000, 2000-2010, post-2010) and a safety target below the hard 2000-row cap. Each chunk is requested with the same filters as the original query. If a chunk is still capped by the server limit, that chunk is split again and retried until the range is small enough (or a single year remains). During this process, tezr updates split weights from observed uncapped chunk densities to bias later splits toward denser periods.

Finally, chunk results are merged, deduplicated by thesis_no, and returned. If a single year still exceeds 2000 results, the package cannot paginate further for that year and warns you to narrow the query with additional filters.

# Retrieve all title matches (auto-paginate by year)
all_eu <- search_advanced(
  keyword = "avrupa",
  search_field = "title",
  year_start = 2010,
  year_end = 2020,
  max_search_results = Inf
)
dplyr::glimpse(all_eu)
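
The split-and-retry idea behind auto-pagination can be illustrated offline. This is a simplified sketch, not tezr's implementation: it bisects year ranges evenly instead of using weighted split points, and matches_per_year, fetch_range(), and split_until_uncapped() are hypothetical names with made-up counts.

```r
# Hypothetical per-year match counts standing in for the server.
matches_per_year <- c(`2010` = 900, `2011` = 1500, `2012` = 2500, `2013` = 400)

# Hypothetical fetch: reports how many rows a year range would return,
# capped at 2000 rows like the real portal.
fetch_range <- function(from, to, cap = 2000) {
  total <- sum(matches_per_year[as.character(from:to)])
  list(rows = min(total, cap), capped = total > cap)
}

# Bisect a year range until each chunk fits under the cap
# or only a single year remains.
split_until_uncapped <- function(from, to, cap = 2000) {
  res <- fetch_range(from, to, cap)
  if (!res$capped || from == to) {
    return(data.frame(from = from, to = to,
                      rows = res$rows, capped = res$capped))
  }
  mid <- (from + to) %/% 2
  rbind(
    split_until_uncapped(from, mid, cap),
    split_until_uncapped(mid + 1, to, cap)
  )
}

chunks <- split_until_uncapped(2010, 2013)
chunks
```

With these made-up counts, every year ends up in its own chunk, and 2012 is still flagged as capped because it cannot be split further. That is the single-year overflow case where additional filters are needed.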

Detailed Search

search_detailed() provides field-specific keyword search. Use it when you need to target thesis titles, authors, supervisors, subjects, index terms, or abstracts. It supports the same auto-pagination flow as search_advanced().

Supported keyword parameters in search_detailed() are title, author, supervisor, abstract, keyword, and subject. You can combine those with university, university_id, group, thesis_type, year_start, year_end, language, access_type, status, max_search_results, and ignore_cache.

YÖK’s redesigned detailed form supports field-specific and institutional filters. tezr sends university, institute, division, subject, discipline, group, and thesis-number filters through that form when you use search_detailed().

Finding Valid Filter Values

You can still use the list_*() functions to inspect YÖK’s metadata tables and interpret result fields.

# All universities
unis <- list_universities()
head(unis)

# Subjects have Turkish and English names
subjects <- list_subjects()
subjects |>
  dplyr::filter(stringr::str_detect(name_tr, "Ekonomi"))

# Other list functions (each returns 'name' and 'id' columns)
institutes <- list_institutes()
divisions <- list_divisions()
disciplines <- list_disciplines()

Filtering by Subject

# All econometrics theses
econ_all <- search_detailed(subject = "Ekonometri")

Filtering by Supervisor

You can also filter results by supervisor names.

# Find theses supervised by a specific supervisor
supervisor_theses <- search_detailed(supervisor = "Mustafa Kadir Doğan")
head(supervisor_theses)

Vector-Valued Parameters

The YÖK web portal accepts only one value per keyword field. tezr removes this restriction for supported keyword fields and selected filters. When you pass multiple values, the package expands them into separate requests (one per combination of values), combines the results, and deduplicates by thesis_no.

# Search multiple subjects
multi_subject <- search_detailed(
  subject = c("Ekonomi", "Ekonometri")
)

# Multiple thesis types
multi_type <- search_detailed(
  subject = "Ekonomi",
  thesis_type = c("phd", "masters")
)

# Multiple languages (ISO 639 codes)
multi_lang <- search_detailed(
  subject = "Ekonomi",
  language = c("tr", "en", "fr")
)

# Search multiple subjects with pagination
multi_subject_all <- search_detailed(
  subject = c("Ekonomi", "Ekonometri"),
  max_search_results = Inf,
  ignore_cache = TRUE
)
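
Because every combination of values becomes its own request, it can help to estimate the request count before running a query. The sketch below uses base R's expand.grid() to enumerate combinations. The params list is a hypothetical set of arguments, and this is only an illustration of the cartesian product, not tezr's internal code.

```r
# Hypothetical argument values, mirroring the examples above.
params <- list(
  subject = c("Ekonomi", "Ekonometri"),
  thesis_type = c("phd", "masters"),
  language = c("tr", "en", "fr")
)

# One row per request: every combination of the supplied values.
request_grid <- expand.grid(params, stringsAsFactors = FALSE)
n_requests <- nrow(request_grid)
n_requests
```

Here 2 subjects, 2 thesis types, and 3 languages produce 12 combinations. Pagination can multiply that number further, so adding one more multi-valued argument is rarely free.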

Retrieving Detailed Metadata

Search results contain core metadata (title, author, university, year, type, subject). If you need full details, such as abstracts, keywords, supervisor names, page counts, and PDF links, use the detail() function.

Single Thesis

Pass a search-result row to detail() to fetch the full record. When the row includes encrypted_no, detail() uses it automatically to request citation metadata.

# Search and get details for the first match
econ_phd <- search_detailed(
  subject = "Ekonometri",
  thesis_type = "phd",
  year_start = 2024,
  year_end = 2025
)

econ_phd_details <- detail(econ_phd[1, ])

dplyr::glimpse(econ_phd_details)

# English abstract
econ_phd_details$abstract_translation

Batch Retrieval

You can also pass a multi-row search result to detail() to fetch details for several theses at once. The function shows text progress updates by default and fetches uncached records in parallel (up to 5 active requests).

# Fetch details for all results
econ_phd <- search_detailed(
  subject = "Ekonometri",
  thesis_type = "phd",
  year_start = 2025,
  year_end = 2026
)

# Batch retrieval
econ_phd_all_details <- detail(econ_phd)

Aggregate Statistics

These functions return summary statistics tables from the NTC.

# Thesis counts by year
year_stats <- stats_years()
tail(year_stats)

# Thesis counts by university
uni_stats <- stats_universities()
head(uni_stats)

# Thesis counts by subject
subject_stats <- stats_subjects()
head(subject_stats)

# Total counts by thesis type
type_stats <- stats_types()
type_stats

Cache Management

tezr caches search results, detail records, year-range queries, and lookup lists in memory. Caching speeds up repeated queries and reduces server load.

Viewing Cache Status

# Shows: enabled status, item counts, and TTL settings
cache_info()

The output includes search_count, range_count, detail_count, search_ttl, and detail_ttl. Search cache defaults to 3600 seconds (1 hour). Detail cache defaults to NULL (session lifetime; entries stay until you clear them or restart R).

Clearing Cache

You can clear specific cache types or everything at once. The "lookups" option clears cached university/subject/division lists.

# Clear search results only
cache_clear("searches")

# Clear detail records only
cache_clear("details")

# Clear lookup lists (universities, subjects, etc.)
cache_clear("lookups")

# Clear everything
cache_clear("all")

Configuring Cache TTL

You can also adjust time-to-live settings or disable caching entirely. TTL values are in seconds. NULL means entries persist for the entire session.

# 2-hour search cache, 1-week detail cache
cache_config(
  search_ttl = 7200,
  detail_ttl = 604800
)

# Disable caching entirely (every call hits the server)
cache_config(enable = FALSE)

# Re-enable with defaults
cache_config(enable = TRUE, search_ttl = 3600, detail_ttl = NULL)
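
To make the TTL semantics concrete, here is a minimal sketch of an in-memory cache with expiry. It is an illustration under assumptions, not tezr's internal code: cache, cache_set(), and cache_get() are hypothetical names, and the only behavior taken from this vignette is that TTLs are in seconds and NULL means no expiry during the session.

```r
# Hypothetical in-memory cache: an environment keyed by query string.
cache <- new.env(parent = emptyenv())

cache_set <- function(key, value, now = Sys.time()) {
  assign(key, list(value = value, stored_at = now), envir = cache)
}

# ttl is in seconds; NULL means the entry never expires in this session.
cache_get <- function(key, ttl = 3600, now = Sys.time()) {
  if (!exists(key, envir = cache, inherits = FALSE)) return(NULL)
  entry <- get(key, envir = cache)
  age <- as.numeric(difftime(now, entry$stored_at, units = "secs"))
  if (!is.null(ttl) && age > ttl) {
    rm(list = key, envir = cache)  # drop the stale entry
    return(NULL)
  }
  entry$value
}

t0 <- as.POSIXct("2025-01-01 00:00:00", tz = "UTC")
cache_set("q", "result", now = t0)
fresh <- cache_get("q", ttl = 3600, now = t0 + 1800)     # within TTL
expired <- cache_get("q", ttl = 3600, now = t0 + 7200)   # past TTL, dropped

cache_set("d", "detail", now = t0)
kept <- cache_get("d", ttl = NULL, now = t0 + 1e6)       # NULL TTL never expires
```

Passing the current time as an argument keeps the example deterministic. A real cache would rely on Sys.time() at call time.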

Working with Results

Search results are returned as tibbles, so they work directly with dplyr and other tidyverse tools.

climate_change <- search_basic("climate change")

# Count by year
climate_change |>
  dplyr::count(year)

# Filter recent PhDs
climate_change |>
  dplyr::filter(thesis_type_tr == "Doktora", year >= 2020) |>
  dplyr::select(author, year, title_original, university)

# Most common subjects
climate_change |>
  dplyr::count(subject_tr, sort = TRUE) |>
  dplyr::slice_head(n = 10)

See the Analysis Examples vignette for complete analysis workflows with visualizations.

Limitations and Best Practices

Technical Limitations

No official API. tezr scrapes <tez.yok.gov.tr> by simulating browser requests and parsing HTML/JavaScript responses. Any change to the portal’s page structure, form parameters, or JavaScript patterns will break the package until updated. This is the primary fragility risk.
Single-year overflow. Auto-pagination splits year ranges to work around the 2000-result server cap, but cannot split below a single calendar year. If a query matches more than 2000 theses in one year, the package retrieves only the first 2000 for that year and issues a warning. You should narrow your search with supported filters such as thesis type, language, access type, status, subject, or year range.
In-memory cache only. All cached data (searches, details, lookups) is stored in R environment objects and lost when the session ends. You can save results to disk with readr::write_rds() or saveRDS() for persistence across sessions.
SSL verification disabled. The YÖK server has certificate issues, so SSL peer verification is turned off (ssl_verifypeer = FALSE). This is a security trade-off required for the package to function.
Fixed rate limiting. Requests use a built-in 2-second rate limit that is not user-configurable, so fetching large datasets still takes time.
Vector parameter expansion. search_detailed() expands vector-valued supported parameters into separate requests via cartesian product. Passing many multi-valued parameters can generate a large number of requests.
Lookup tables. The list_*() lookup functions expose metadata tables from YÖK. The IDs they return can be passed to detailed searches to skip name lookup.
Metadata only. The package retrieves thesis metadata. PDF URLs are included in detail records, but full-text files are not downloaded. You can use those URLs to download the PDFs yourself.

Best Practices

Cache and save results. Run large queries once and save locally with readr::write_rds(), readr::write_csv(), or similar functions. Then reload from disk in later sessions.
Filter before paginating. Add year ranges, thesis types, languages, access types, statuses, or supported keyword fields to keep result sets manageable before setting max_search_results = Inf.
Minimize server load. Use cached results when possible. Avoid repeating identical queries.
Validate data quality. Metadata may have inconsistencies (missing fields, encoding issues). Clean and validate before analysis.
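
The "run once, save, reload" practice can be wrapped in a small helper. The sketch below uses only base R. load_or_fetch() and fake_query() are hypothetical names, and fake_query() stands in for a slow tezr call so the example runs offline.

```r
# Hypothetical wrapper: run query_fn once, then reuse the saved copy.
load_or_fetch <- function(path, query_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- query_fn()
  saveRDS(result, path)
  result
}

# Stand-in for a slow search call; counts how often it actually runs.
calls <- 0
fake_query <- function() {
  calls <<- calls + 1
  data.frame(thesis_no = c(101, 102), year = c(2020, 2021))
}

path <- tempfile(fileext = ".rds")
first <- load_or_fetch(path, fake_query)   # runs the query and saves it
second <- load_or_fetch(path, fake_query)  # reads from disk instead
```

In real use, query_fn could be something like function() search_basic("climate change"), and path would point to a file in your project directory rather than a temporary file.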