Vector Similarity Search And RAG-Lite in R

rstats
data-wrangling
vector-search
vss
rag
Author

@hrbrmstr

Published

August 11, 2024

(If you’re not coming here from the Daily Drop parent post, please take a look at this previous project to get up to speed before diving in.)

We’re going to partially replicate the ^^ sibling project, but also kick things up a notch and use the embeddings matrix and KEV corpus to build a wee RAG in R, with some help from Ollama and a few of its models.

Please note that the models we’ll be using will eat up some memory. Like, a lot. This is one of the really bad things about the LLM/GPT craze. Sure, I have a beefy, old-ish M1 Max MacBook Pro which doesn’t break a sweat with these operations, but not everyone has 64GB of RAM (shared between Apple’s GPUs and CPUs) available.

Don’t worry, you only need that RAM for the RAG part, so you can skip that if you like.

Vector Similarity Search In R

R is perfectly capable of performing vector similarity search on its own without a need for a database. The “gotcha” is that the content has to fit into memory. Most of the “content” you and I want to poke through will generally fit comfortably in memory for these operations.
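
If you’ve never seen how little machinery is involved, here’s a tiny, self-contained illustration (not part of the pipeline below; the names are mine) of brute-force cosine similarity against an embedding matrix:

# cosine similarity of each row of matrix `m` against query vector `q`
cosine_sim <- function(m, q) {
  as.vector(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

# e.g. order(cosine_sim(embeddings_matrix, query_embedding), decreasing = TRUE)[1:10]
# would give the indices of the ten most similar items

We’ll use a proper approximate nearest neighbor index below, but for modest corpora the brute-force version is often fast enough.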

We’ll use CISA’s KEV again, but this time with a super lightweight embeddings model — nomic-embed-text. It’s wicked fast, has an 8K token context (~32K characters; most blog posts and other individual content pages are well within this limit), and the embedding dimension is 768, so the matrix operations are generally super fast.

First Things First

These are the packages we’ll be using. The two new ones are {RcppHNSW} (for the search stuff) and {ollamar} which provides some useful helpers for Ollama.

suppressPackageStartupMessages({
  library(ollamar)
  library(RcppHNSW)
  library(stringi)
  library(tidyverse)
})

While {ollamar} is nice, it’s imperfect, so we need a few helpers to make our lives easier.

embed_quietly <- quietly(ollamar::embeddings)
chat_quietly <- quietly(ollamar::chat)

response_content <- \(.x) {
  .x |> 
    getElement("result") |> 
    httr2::resp_body_json() |> 
    getElement("message") |> 
    getElement("content")
}

nomic_embed <- function(input, task = c("search_document", "search_query")) {
  input <- glue::glue("{task[1]}: {input[1]}")

  embed_quietly(
    prompt = input,
    model = "nomic-embed-text",
  ) |> 
    getElement("result")
}

If you don’t have nomic-embed-text in your local Ollama, you can pull it via the command line or {ollamar}’s pull() (that clobbers {dplyr}’s pull(), so either use {conflicted} to control which one gets precedence, or :: your way to sanity).
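
If it helps, that looks something like this (assuming a local Ollama instance is running):

ollamar::pull("nomic-embed-text") # or, from a shell: ollama pull nomic-embed-text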

The Nomic embed model has some keywords that help it create the right embeddings for different contexts, hence the task parameter in the embedding helper function.
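
As a quick sanity check (purely illustrative; the object names are mine), both calls below should hand back a 768-element numeric vector, and the only difference between them is the task prefix glued onto the text:

doc_vec   <- nomic_embed("Apache Log4j2 Remote Code Execution Vulnerability", task = "search_document")
query_vec <- nomic_embed("log4j remote code execution", task = "search_query")
length(doc_vec) # 768 (the embedding dimension mentioned above)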

Now, we’ll pull KEV from the source of truth:

cols(
  cveID = col_character(),
  vendorProject = col_character(),
  product = col_character(),
  vulnerabilityName = col_character(),
  dateAdded = col_date(format = ""),
  shortDescription = col_character(),
  requiredAction = col_character(),
  dueDate = col_date(format = ""),
  knownRansomwareCampaignUse = col_character(),
  notes = col_character(),
  cwes = col_logical()
) -> kev_cols

kev <- read_csv("https://www.cisa.gov/sites/default/files/csv/known_exploited_vulnerabilities.csv", col_types = kev_cols)

And make embeddings from the vulnerabilityName and shortDescription (play around with adding other columns, or layer in information from NVD).

Just like last time, this takes a bit, so set .progress to TRUE in your interactive session to help pass the time (45 seconds on my laptop).

(system.time(kev |> 
  mutate(
    embedding = glue::glue("{vulnerabilityName}. {shortDescription}") |> 
      map(nomic_embed, .progress = FALSE)
  ) -> kev))
   user  system elapsed 
 18.193   0.711  46.340 

Now, we need to build the index, just like we did in SQLite and DuckDB.

# we need to pass a matrix to the indexer
kev_mat <- do.call(rbind, kev$embedding)

# TODO4U: see if changing the `distance` does anything to the results
ann <- hnsw_build(kev_mat, distance = "cosine")
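
If you want to poke at that TODO, one low-effort sketch (row 1 is an arbitrary query; swap in whatever CVE you care about) is to build a second index with a different distance and compare the neighbor orderings:

ann_ip <- hnsw_build(kev_mat, distance = "ip") # inner product instead of cosine

qry <- matrix(kev_mat[1, ], nrow = 1)
hnsw_search(qry, ann, k = 10)$idx    # neighbors under cosine
hnsw_search(qry, ann_ip, k = 10)$idx # neighbors under inner product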

Finally, we replicate the “find CVEs similar to Log4Shell” example from the previous post:

kev |> 
  filter(
    cveID == "CVE-2021-44228"
  ) |> 
  select(embedding) |> 
  dplyr::pull(embedding) |> # get the KEV embedding—
  unlist() |> 
  matrix(nrow = 1) |> # —as a 1 row matrix
  hnsw_search(ann, k = 10) |> # k == # similar items
  getElement("idx")  |> # `idx` has the indices of the nearest neighbors; `dist` in the object has the computed distances
  as.vector() |> 
  kev[i = _, c("cveID", "product", "vulnerabilityName")]
# A tibble: 10 × 3
   cveID          product                        vulnerabilityName              
   <chr>          <chr>                          <chr>                          
 1 CVE-2021-44228 Log4j2                         Apache Log4j2 Remote Code Exec…
 2 CVE-2021-45046 Log4j2                         Apache Log4j2 Deserialization …
 3 CVE-2010-1871  JBoss Seam 2                   Red Hat Linux JBoss Seam 2 Rem…
 4 CVE-2016-8735  Tomcat                         Apache Tomcat Remote Code Exec…
 5 CVE-2017-12149 JBoss Application Server       Red Hat JBoss Application Serv…
 6 CVE-2017-12617 Tomcat                         Apache Tomcat Remote Code Exec…
 7 CVE-2019-10758 mongo-express                  MongoDB mongo-express Remote C…
 8 CVE-2022-41128 Windows                        Microsoft Windows Scripting La…
 9 CVE-2017-5638  Struts                         Apache Struts Remote Code Exec…
10 CVE-2013-0422  Java Runtime Environment (JRE) Oracle JRE Remote Code Executi…

Feel empowered to try the clustering as well, but I highly recommend doing it with DBSCAN.
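
A minimal sketch of that, assuming you have {dbscan} installed (the eps value is a guess and will need tuning for your corpus):

library(dbscan)

# dbscan() works on Euclidean distance by default; L2-normalizing the rows first
# makes that roughly equivalent to clustering on cosine distance
kev_mat_norm <- kev_mat / sqrt(rowSums(kev_mat^2))

clusters <- dbscan(kev_mat_norm, eps = 0.5, minPts = 5)
table(clusters$cluster) # cluster 0 is the "noise" bucket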

RAG-Time

Now that we have this data in the bestest programming language, let’s see what it takes to build a naive RAG (a.k.a. RAG-lite). I say “naive” and “lite” as we’re doing literally the bare minimum (perhaps a bit less, truth be told) in terms of implementation. A real deployment would need many safeguards against malicious input, a layer of full-text search, and more. But there are some general concepts here that translate well to many RAG applications.

We’re using llama3.1 8b to help us rewrite the questions we’ll get from our humans (yes, all the fancy online chat things you talk to do this as well; they’re way better at writing prompts than we are; sorry-not-sorry if you wanted a lucrative career as a prompt engineer). It’s also going to help us produce the RAG output to the human query (it has a bonkers yuge input context).

We’re also going to use UniversalNER, a LLaMA 2 7B model fine-tuned on UniversalNER data for entity-extraction tasks. In our case, we’re going to try to tease the vendor out of the query for more accurate results filtering.

Unfortunately, this is all bundled into a non-refactored massive ask_kev() function, which means you’ll need to read through the comments, below, for the commentary:

ask_kev <- function(q) {

  message("Analyzing original query…")

  # a middling prompt to Ollama and llama3.1 to get it to rewrite our input query.

  chat_quietly(
    model = "llama3.1",
    messages = list(list(role = "user", content = sprintf(r"(
  From the input at the end, identify the core topic:
    
  - focus on the most important aspect of the query
  - remove unnecessary words and phrases
  - create a concise phrase suitable for vector search and only return the new prompt without any other text or commentary: ```%s```)", q[1]))), 
    stream = FALSE
  ) |> 
    response_content() -> new_q # {ollamar} is super chatty, sigh
  
  message("Generated embeddings…")

  nomic_embed(
    input = new_q[1],
    task = "search_query" # tell nomic we're querying, now
  ) |> 
    matrix(
      nrow = 1
    ) -> X
  
  message("Searching corpus…")

  # find the closest matches in the corpus
  answer <- hnsw_search(X, ann, 10)
  result <- kev[as.numeric(answer$idx),]

  message("Trying to identify mentioned vendors…")

  # there will inevitably be mis-hits, so we try to help the process out
  # by seeing if we can narrow them down to any vendors specified in the query.
  # this is imperfect, but might be better if we could send tuning parameters to
  # ollama from {ollamar}, but we can't, for now.

  chat_quietly(
    model = "zeffmuks/universal-ner",
    messages = list(
      list(
        role = "user",
        content = sprintf("%s  Organization", q[1])
      )
    )
  ) |> 
    response_content() |> 
    jsonlite::fromJSON() |> 
    stri_trans_tolower() -> vendors

  if (length(vendors)) {
    message("  - Narrowing results down to: ", paste0(vendors, collapse=", "))
    result |> 
      filter(
        stri_trans_tolower(vendorProject) %in% vendors
      ) -> result
  }

  # now we build up the corpus we'll send along with an intricate prompt to Ollama and llama3.1

  result |> 
    with(
      glue::glue(
        r"(CVE: {cveID}
Vendor: {vendorProject}
Product: {product}
Vulnerability: {vulnerabilityName}
Description: {stri_replace_all_regex(shortDescription, "[[:space:]]", " ")}
----
)"
      )
    ) |>
    paste0(collapse = "\n") -> corpus

  message("Shunting the question and corpus to the LLM…")

  messages <- list(
    list(role = "assistant", content = r"(For each received question, use the provided search results, and generate a response following these guidelines:

1. Create a concise list in markdown of relevant CVEs and their associated vulnerability names. Only include results that match the specific technology, product, vendor, or vulnerability type mentioned in the query.

2. Format each list item as follows:
   - CVE-YYYY-NNNNN: (Vendor) Vulnerability Name

3. After the list, select one vulnerability from the results and provide a detailed explanation, including:
   - Full vulnerability name
   - Affected systems or software
   - Technical details of the vulnerability
   - Potential impact if exploited
   - Recommended mitigation or patch information

Ensure your response is precise, thorough, and directly addresses the query without unnecessary preamble.)"),
    list(role = "user", content = sprintf("%s\n Search results: ````\n%s\n````", q[1], corpus))
  )
  
  chat_quietly(
    model = "llama3.1",
    messages = messages, 
    stream = FALSE
  ) |> 
    response_content()   

}

Let’s give it a go!

answer <- ask_kev("I'd like to know about Java deserialization vulnerabilities in Oracle products?")
Analyzing original query…
Generated embeddings…
Searching corpus…
Trying to identify mentioned vendors…
  - Narrowing results down to: oracle
Shunting the question and corpus to the LLM…
cat(answer)

Here is a concise list of relevant CVEs for Java deserialization vulnerabilities in Oracle products:

Affected Vulnerabilities

  • CVE-2013-2465: (Oracle) Oracle Java SE Unspecified Vulnerability
  • CVE-2010-0840: (Oracle) Oracle JRE Unspecified Vulnerability

Detailed Explanation of Selected Vulnerability: CVE-2010-0840

Full Vulnerability Name

Oracle JRE Unspecified Vulnerability (CVE-2010-0840)

Affected Systems/Software

This vulnerability affects the Java Runtime Environment (JRE) in Oracle’s Java SE product.

Technical Details

The specific details of this vulnerability are not publicly disclosed due to its unspecified nature. However, it is categorized as affecting confidentiality, integrity, and availability via Unknown vectors related to unspecified components. Deserialization vulnerabilities typically occur when untrusted input is fed into an object deserialization mechanism, potentially leading to code execution or data corruption.

Potential Impact

If exploited, this vulnerability could allow remote attackers to access sensitive information, disrupt system functionality, or even gain control over the affected systems. The nature of the impact would depend on how the vulnerability is exploited and what parts of the Java SE environment are compromised.

List of Microsoft Exchange vulnerabilities:

  • CVE-2022-41080: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
  • CVE-2021-31207: (Microsoft) Microsoft Exchange Server Security Feature Bypass Vulnerability
  • CVE-2018-8581: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
  • CVE-2022-41082: (Microsoft) Microsoft Exchange Server Remote Code Execution Vulnerability
  • CVE-2021-34523: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
  • CVE-2024-21410: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
  • CVE-2022-41040: (Microsoft) Microsoft Exchange Server Server-Side Request Forgery Vulnerability
  • CVE-2021-33766: (Microsoft) Microsoft Exchange Server Information Disclosure
  • CVE-2021-26858: (Microsoft) Microsoft Exchange Server Remote Code Execution Vulnerability
  • CVE-2021-26857: (Microsoft) Microsoft Exchange Server Remote Code Execution Vulnerability

Detailed explanation of CVE-2022-41082:

Full vulnerability name:

CVE-2022-41082 - Microsoft Exchange Server Remote Code Execution Vulnerability

Affected systems or software:

Microsoft Exchange Server

Technical details of the vulnerability:

This vulnerability exists in Microsoft Exchange Server and allows for authenticated remote code execution. It is chainable with CVE-2022-41040, which makes it possible to execute arbitrary code on a vulnerable server.

Potential impact if exploited:

An attacker could exploit this vulnerability to gain elevated privileges within the Exchange server environment or even further across the network if other vulnerabilities are successfully chained together. This vulnerability represents part of the ProxyNotShell exploit chain, highlighting the severe risk posed by successful exploitation.

List of Relevant CVEs

  • CVE-2023-42916: (Apple) Apple Multiple Products WebKit Out-of-Bounds Read Vulnerability
  • CVE-2023-28204: (Apple) Apple Multiple Products WebKit Out-of-Bounds Read Vulnerability
  • CVE-2023-32435: (Apple) Apple Multiple Products WebKit Memory Corruption Vulnerability
  • CVE-2023-42917: (Apple) Apple Multiple Products WebKit Memory Corruption Vulnerability
  • CVE-2023-37450: (Apple) Apple Multiple Products WebKit Code Execution Vulnerability
  • CVE-2023-41993: (Apple) Apple Multiple Products WebKit Code Execution Vulnerability
  • CVE-2023-32439: (Apple) Apple Multiple Products WebKit Type Confusion Vulnerability
  • CVE-2021-30663: (Apple) Apple Multiple Products WebKit Integer Overflow Vulnerability
  • CVE-2023-23529: (Apple) Apple Multiple Products WebKit Type Confusion Vulnerability
  • CVE-2023-28205: (Apple) Apple Multiple Products WebKit Use-After-Free Vulnerability

Detailed Explanation of a Selected Vulnerability

CVE-2023-32435: Apple Multiple Products WebKit Memory Corruption Vulnerability

Full Vulnerability Name: Apple Multiple Products WebKit Memory Corruption Vulnerability

Affected Systems or Software: The vulnerability affects various Apple products and software that use the WebKit browser engine, including but not limited to: - iOS - iPadOS - macOS - tvOS - watchOS - Safari

Technical Details of the Vulnerability: This vulnerability involves a memory corruption issue within WebKit. It allows an attacker to craft malicious web content that leads to code execution by manipulating how WebKit processes HTML. The technical specifics involve a manipulation of resources in WebKit that are not properly protected, leading to corruption and potential execution of arbitrary code.

Potential Impact if Exploited: If exploited successfully, this vulnerability could lead to unauthorized access to sensitive information or the execution of malicious code on affected systems. This could result in various negative consequences depending on the system and data compromised.

Recommended Mitigation or Patch Information: Apple typically addresses vulnerabilities through security updates and patches for their operating systems and software products. Users are advised to ensure their devices and browsers are updated with the latest available patches, which will mitigate this vulnerability among others that may be present in older versions of affected products.

Please note that while I have provided information on how to address the identified vulnerability, it’s crucial for users to follow specific instructions from Apple or other relevant vendors regarding updates and security patches. Additionally, general best practices such as avoiding suspicious links or downloads can also help prevent exploitation attempts.

If we could do some tuning, the results would be far more deterministic, but modifying {ollamar} for this weekend bonus post was just not happening thanks to something called “yard work” (sigh).

Play Around!

We did not do much defensive coding for the LLM; it could use some more keyword filtering, and we could perhaps even ask UniversalNER to give us technologies as well as vendors.
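
A hedged sketch of that last idea, reusing the same prompt format as the vendor lookup in ask_kev() (“Technology” as an entity label is my guess and may need experimentation):

q <- "I'd like to know about Java deserialization vulnerabilities in Oracle products?"

chat_quietly(
  model = "zeffmuks/universal-ner",
  messages = list(
    list(
      role = "user",
      content = sprintf("%s  Technology", q) # append the entity label, mirroring the vendor lookup
    )
  )
) |> 
  response_content() |> 
  jsonlite::fromJSON() |> 
  stri_trans_tolower() -> technologies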

If you manage to get good results with smaller models, please let me know! The more we can do with our own hardware and open source tools like Ollama/llamafile, and open weights models, the more this side of data science can be democratized.