suppressPackageStartupMessages({
library(ollamar)
library(RcppHNSW)
library(stringi)
library(tidyverse)
})
(If you’re coming here not from the Daily Drop parent post, please take a look at this previous project to get up to speed before diving in here.)
We’re going to partially replicate the ^^ sibling project, but also kick things up a notch and use the embeddings matrix and KEV corpus to build a wee RAG in R, with some help from Ollama and a few of its models.
Please note that the models we’ll be using will eat up some memory. Like, a lot. This is one of the really bad things about the LLM/GPT craze. Sure, I have a beefy, old-ish M1 MacBook Pro Max which doesn’t break a sweat with these operations, but not everyone has 64GB of RAM (shared between Apple’s GPUs and CPUs) available.
Don’t worry, you only need that RAM for the RAG part, so you can skip that if you like.
Vector Similarity Search In R
R is perfectly capable of performing vector similarity search on its own without a need for a database. The “gotcha” is that the content has to fit into memory. Most of the “content” you and I want to poke through will generally fit comfortably in memory for these operations.
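To make "similarity search without a database" concrete, here is a minimal brute-force sketch in base R. The toy matrix and the helper names (`unit_rows()`, `top_k_cosine()`) are made up for illustration and are not part of the workflow below, which uses an approximate index instead.

```r
# Toy "corpus": 5 documents embedded in 4 dimensions (made-up numbers)
set.seed(42)
docs <- matrix(rnorm(5 * 4), nrow = 5,
               dimnames = list(paste0("doc", 1:5), NULL))

# Scale each row to unit length so a dot product equals cosine similarity
unit_rows <- function(m) m / sqrt(rowSums(m^2))

# Return the indices of the k rows of `corpus` most similar to `query`
top_k_cosine <- function(query, corpus, k = 3) {
  sims <- drop(unit_rows(corpus) %*% (query / sqrt(sum(query^2))))
  head(order(sims, decreasing = TRUE), k)
}

# Querying with doc2's own vector should rank doc2 first
top_k_cosine(docs[2, ], docs, k = 3)
```

Brute force like this scans every row on every query, which is fine for small corpora; an index trades a little accuracy for speed as the matrix grows.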
We’ll use CISA’s KEV again, but this time we’ll use a super lightweight embeddings model: nomic-embed-text. It’s wicked fast, has an 8K context size (~32K characters; most blog posts and other individual content pages are well within this limit), and the embeddings dimensions are 768, so the matrix operations are generally super fast.
First Things First
These are the packages we’ll be using. The two new ones are {RcppHNSW} (for the search stuff) and {ollamar} which provides some useful helpers for Ollama.
So, while {ollamar} is nice, it’s imperfect, so we need to make some helpers to make our life easier.
embed_quietly <- quietly(ollamar::embeddings)
chat_quietly <- quietly(ollamar::chat)

# dig the reply text out of a quietly()-wrapped chat response
response_content <- \(.x) {
  .x |>
    getElement("result") |>
    httr2::resp_body_json() |>
    getElement("message") |>
    getElement("content")
}

# embed one string, prefixed with the task keyword nomic expects
nomic_embed <- function(input, task = c("search_document", "search_query")) {
  input <- glue::glue("{task[1]}: {input[1]}")
  embed_quietly(
    prompt = input,
    model = "nomic-embed-text"
  ) |>
    getElement("result")
}
If you don’t have nomic-embed-text in your local Ollama, you can pull it via the command line or {ollamar}’s pull() (that clobbers {dplyr}’s pull(), so either use {conflicted} to make sure which one gets precedence, or :: your way to sanity).
The Nomic embed model has some keywords that help it to create the right embeddings for different contexts, hence the task parameter in the embedding helper function.
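As a quick illustration of what those keywords look like on the wire, here is a tiny sketch that only builds the prefixed strings (no call to Ollama). `with_task_prefix()` is a hypothetical name for this example; it adds `match.arg()`, which the helper above does not use, so a misspelled task fails early.

```r
# Hypothetical helper: prepend the task prefix before sending text for embedding
with_task_prefix <- function(input, task = c("search_document", "search_query")) {
  task <- match.arg(task)  # errors early on a misspelled task
  paste0(task, ": ", input)
}

# Documents going into the corpus use the default "search_document" prefix
with_task_prefix("Apache Log4j2 Remote Code Execution Vulnerability")

# Questions asked against the corpus use "search_query"
with_task_prefix("log4j remote code execution", task = "search_query")
```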
Now, we’ll pull KEV from the source of truth:
cols(
  cveID = col_character(),
  vendorProject = col_character(),
  product = col_character(),
  vulnerabilityName = col_character(),
  dateAdded = col_date(format = ""),
  shortDescription = col_character(),
  requiredAction = col_character(),
  dueDate = col_date(format = ""),
  knownRansomwareCampaignUse = col_character(),
  notes = col_character(),
  cwes = col_logical()
) -> kev_cols

kev <- read_csv("https://www.cisa.gov/sites/default/files/csv/known_exploited_vulnerabilities.csv", col_types = kev_cols)
And make embeddings from the vulnerabilityName and shortDescription (play around with adding other columns or layer in information from NVD).
Just like last time, this takes a bit, so set .progress to TRUE in your interactive session to help pass the time (45 seconds on my laptop).
system.time((
  kev |>
    mutate(
      embedding = glue::glue("{vulnerabilityName}. {shortDescription}") |>
        map(nomic_embed, .progress = FALSE)
    ) -> kev
))
   user  system elapsed
 18.193   0.711  46.340
Now, we need to build the index, just like we did in SQLite and DuckDB.
# we need to pass a matrix to the indexer
kev_mat <- do.call(rbind, kev$embedding)

# TODO4U: see if changing the `distance` does anything to the results
ann <- hnsw_build(kev_mat, distance = "cosine")
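One hint for that TODO, sketched in base R on made-up data: when every row has unit length, cosine and Euclidean distance rank neighbours identically, so swapping metrics only changes results if the embeddings are not normalized. Whether the vectors Ollama returns for nomic-embed-text are unit length is an assumption worth checking on your own `kev_mat` (e.g., by looking at `sqrt(rowSums(kev_mat^2))`).

```r
set.seed(1)
m <- matrix(rnorm(6 * 8), nrow = 6)
m <- m / sqrt(rowSums(m^2))   # force unit-length rows (toy data)

q <- m[1, ]
cos_dist <- 1 - drop(m %*% q)                 # cosine distance to q
euc_dist <- sqrt(rowSums(sweep(m, 2, q)^2))   # Euclidean distance to q

# For unit vectors, squared Euclidean distance equals 2 * cosine distance
all.equal(euc_dist^2, 2 * cos_dist)

# ... so both metrics put the neighbours in the same order
identical(order(cos_dist), order(euc_dist))
```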
Finally, we replicate the “find CVEs similar to Log4Shell” example from the previous post:
kev |>
  filter(
    cveID == "CVE-2021-44228"
  ) |>
  select(embedding) |>
  dplyr::pull(embedding) |> # get the KEV embedding...
  unlist() |>
  matrix(nrow = 1) |> # ...as a 1 row matrix
  hnsw_search(ann, k = 10) |> # k == # similar items
  getElement("idx") |> # `idx` has the indices of the nearest neighbors; `dist` in the object has the computed distances
  as.vector() |>
  kev[i = _, c("cveID", "product", "vulnerabilityName")]
# A tibble: 10 × 3
   cveID          product                        vulnerabilityName
   <chr>          <chr>                          <chr>
 1 CVE-2021-44228 Log4j2                         Apache Log4j2 Remote Code Exec…
 2 CVE-2021-45046 Log4j2                         Apache Log4j2 Deserialization …
 3 CVE-2010-1871  JBoss Seam 2                   Red Hat Linux JBoss Seam 2 Rem…
 4 CVE-2016-8735  Tomcat                         Apache Tomcat Remote Code Exec…
 5 CVE-2017-12149 JBoss Application Server       Red Hat JBoss Application Serv…
 6 CVE-2017-12617 Tomcat                         Apache Tomcat Remote Code Exec…
 7 CVE-2019-10758 mongo-express                  MongoDB mongo-express Remote C…
 8 CVE-2022-41128 Windows                        Microsoft Windows Scripting La…
 9 CVE-2017-5638  Struts                         Apache Struts Remote Code Exec…
10 CVE-2013-0422  Java Runtime Environment (JRE) Oracle JRE Remote Code Executi…
Feel empowered to try the clustering as well, but I highly recommend doing it with DBSCAN.
RAG-Time
Now that we have this data in the bestest programming language, let’s see what it takes to build a naive RAG (a.k.a. RAG-lite). I say “naive” and “lite”, as we’re doing literally the bare minimum (perhaps a bit less, truth be told) in terms of implementation. You need to put in many safeguards against malicious input, and we should layer in full-text search, and more. But, there are some general concepts, here, that translate well to many RAG applications.
We’re using llama3.1 8b to help us rewrite the questions we’ll get from our humans (yes, all the fancy online chat things you talk to do this as well; they’re way better at writing prompts than we are; sorry-not-sorry if you wanted a lucrative career as a prompt engineer). It’s also going to help us produce the RAG output to the human query (it has a bonkers yuge input context).
We’re also going to use UniversalNER, a LLaMA 2 7B model fine tuned with UniversalNER data for entity extraction tasks. In our case, we’re going to try to tease the vendor out of the query for more accurate results filtering.
Unfortunately, this is all bundled into a non-refactored massive ask_kev() function, which means you’ll need to read through the comments, below, for the commentary:
ask_kev <- function(q) {

  message("Analyzing original query…")

  # a middling prompt to Ollama and llama3.1 to get it to rewrite our input query.
  chat_quietly(
    model = "llama3.1",
    messages = list(list(role = "user", content = sprintf(r"(
From the input at the end, identify the core topic:
- focus on the most important aspect of the query
- remove unnecessary words and phrases
- create a concise phrase suitable for vector search and only return the new prompt without any other text or commentary: ```%s```)", q[1]))),
    stream = FALSE
  ) |>
    response_content() -> new_q # {ollamar} is super chatty, sigh

  message("Generated embeddings…")

  nomic_embed(
    input = new_q[1],
    task = "search_query" # tell nomic we're querying, now
  ) |>
    matrix(
      nrow = 1
    ) -> X

  message("Searching corpus…")

  # find the closest matches in the corpus
  answer <- hnsw_search(X, ann, 10)
  result <- kev[as.numeric(answer$idx),]

  message("Trying to identify mentioned vendors…")

  # there will inevitably be mis-hits, so we try to help the process out
  # by seeing if we can narrow them down to any vendors specified in the query.
  # this is imperfect, but might be better if we could send tuning parameters to
  # ollama from {ollamar}, but we can't, for now.
  chat_quietly(
    model = "zeffmuks/universal-ner",
    messages = list(
      list(
        role = "user",
        content = sprintf("%s Organization", q[1])
      )
    )
  ) |>
    response_content() |>
    jsonlite::fromJSON() |>
    stri_trans_tolower() -> vendors

  if (length(vendors)) {
    message(" - Narrowing results down to: ", paste0(vendors, collapse=", "))
    result |>
      filter(
        stri_trans_tolower(vendorProject) %in% vendors
      ) -> result
  }

  # now we build up the corpus we'll send along with an intricate prompt to Ollama and llama3.1
  result |>
    with(
      glue::glue(
r"(CVE: {cveID}
Vendor: {vendorProject}
Product: {product}
Vulnerability: {vulnerabilityName}
Description: {stri_replace_all_regex(shortDescription, "[[:space:]]", " ")}
----
)"
      )
    ) |>
    paste0(collapse = "\n") -> corpus

  message("Shunting the question and corpus to the LLM…")

  messages <- list(
    list(role = "assistant", content = r"(For each received question, use the provided search results, and generate a response following these guidelines:
1. Create a concise list in markdown of relevant CVEs and their associated vulnerability names. Only include results that match the specific technology, product, vendor, or vulnerability type mentioned in the query.
2. Format each list item as follows:
- CVE-YYYY-NNNNN: (Vendor) Vulnerability Name
3. After the list, select one vulnerability from the results and provide a detailed explanation, including:
- Full vulnerability name
- Affected systems or software
- Technical details of the vulnerability
- Potential impact if exploited
- Recommended mitigation or patch information
Ensure your response is precise, thorough, and directly addresses the query without unnecessary preamble.)"),
    list(role = "user", content = sprintf("%s\n Search results: ````\n%s\n````", q[1], corpus))
  )

  chat_quietly(
    model = "llama3.1",
    messages = messages,
    stream = FALSE
  ) |>
    response_content()

}
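One thing the vendor-narrowing step above does not guard against is the filter matching nothing, which would hand the LLM an empty corpus. A minimal base R sketch of a gentler version follows; `narrow_by_vendor()` is a hypothetical helper, and the CVE IDs in the toy data frame are made up.

```r
# Hypothetical helper: narrow search hits to the named vendors, but keep the
# original hits if the filter would leave nothing to send to the LLM
narrow_by_vendor <- function(hits, vendors) {
  vendors <- tolower(trimws(vendors))
  if (!length(vendors)) return(hits)
  keep <- tolower(hits$vendorProject) %in% vendors
  if (any(keep)) hits[keep, , drop = FALSE] else hits
}

hits <- data.frame(
  cveID = c("CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"),  # made-up IDs
  vendorProject = c("Oracle", "Apache", "Oracle")
)

narrow_by_vendor(hits, "oracle")     # keeps the two Oracle rows
narrow_by_vendor(hits, "Microsoft")  # no match, so all three rows come back
```

Falling back to the unfiltered hits is a judgment call: it favors giving the model something to work with over strict precision.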
Let’s give it a go!
answer <- ask_kev("I'd like to know about Java deserialization vulnerabilities in Oracle products?")
Analyzing original query…
Generated embeddings…
Searching corpus…
Trying to identify mentioned vendors…
- Narrowing results down to: oracle
Shunting the question and corpus to the LLM…
cat(answer)
Here is a concise list of relevant CVEs for Java deserialization vulnerabilities in Oracle products:
Affected Vulnerabilities
- CVE-2013-2465: (Oracle) Oracle Java SE Unspecified Vulnerability
- CVE-2010-0840: (Oracle) Oracle JRE Unspecified Vulnerability
Detailed Explanation of Selected Vulnerability: CVE-2010-0840
Full Vulnerability Name
Oracle JRE Unspecified Vulnerability (CVE-2010-0840)
Affected Systems/Software
This vulnerability affects the Java Runtime Environment (JRE) in Oracle’s Java SE product.
Technical Details
The specific details of this vulnerability are not publicly disclosed due to its unspecified nature. However, it is categorized as affecting confidentiality, integrity, and availability via Unknown vectors related to unspecified components. Deserialization vulnerabilities typically occur when untrusted input is fed into an object deserialization mechanism, potentially leading to code execution or data corruption.
Potential Impact
If exploited, this vulnerability could allow remote attackers to access sensitive information, disrupt system functionality, or even gain control over the affected systems. The nature of the impact would depend on how the vulnerability is exploited and what parts of the Java SE environment are compromised.
Recommended Mitigation/Patch Information
Given the unspecified nature of the vulnerability, detailed mitigation steps beyond ensuring the latest patches and updates are installed cannot be provided. It is essential for users to stay updated with Oracle’s security advisories for any specific instructions related to CVE-2010-0840 or similar vulnerabilities.
How about an easy one:
answer <- ask_kev("Are there any Microsoft Exchange vulnerabilities?")
Analyzing original query…
Generated embeddings…
Searching corpus…
Trying to identify mentioned vendors…
Shunting the question and corpus to the LLM…
cat(answer)
List of Microsoft Exchange vulnerabilities:
- CVE-2022-41080: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
- CVE-2021-31207: (Microsoft) Microsoft Exchange Server Security Feature Bypass Vulnerability
- CVE-2018-8581: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
- CVE-2022-41082: (Microsoft) Microsoft Exchange Server Remote Code Execution Vulnerability
- CVE-2021-34523: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
- CVE-2024-21410: (Microsoft) Microsoft Exchange Server Privilege Escalation Vulnerability
- CVE-2022-41040: (Microsoft) Microsoft Exchange Server Server-Side Request Forgery Vulnerability
- CVE-2021-33766: (Microsoft) Microsoft Exchange Server Information Disclosure
- CVE-2021-26858: (Microsoft) Microsoft Exchange Server Remote Code Execution Vulnerability
- CVE-2021-26857: (Microsoft) Microsoft Exchange Server Remote Code Execution Vulnerality
Detailed explanation of CVE-2022-41082:
Full vulnerability name:
CVE-2022-41082 - Microsoft Exchange Server Remote Code Execution Vulnerability
Affected systems or software:
Microsoft Exchange Server
Technical details of the vulnerability:
This vulnerability exists in Microsoft Exchange Server and allows for authenticated remote code execution. It is chainable with CVE-2022-41040, which makes it possible to execute arbitrary code on a vulnerable server.
Potential impact if exploited:
An attacker could exploit this vulnerability to gain elevated privileges within the Exchange server environment or even further across the network if other vulnerabilities are successfully chained together. This vulnerability represents part of the ProxyNotShell exploit chain, highlighting the severe risk posed by successful exploitation.
Recommended mitigation or patch information:
Microsoft has released patches and updates for CVE-2022-41082 as part of broader security updates to address this and related vulnerabilities. Ensuring that Exchange Server installations are up-to-date with the latest security patches is crucial for mitigating this vulnerability. Additionally, implementing robust network segmentation, enforcing least privilege access policies, and conducting regular security audits can help minimize potential impacts if exploitation occurs.
(Yeah, UniversalNER misses Microsoft every other run. That prompt needs work.)
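Until that prompt improves, one cheap backstop is plain keyword matching: check whether any vendor name we already know from KEV literally appears in the question. This is a rough sketch, not what ask_kev() does; `vendors_in_query()` is a hypothetical helper, and in practice you would pass it `kev$vendorProject`.

```r
# Hypothetical fallback: find known vendor names that literally appear in the query
vendors_in_query <- function(q, known_vendors) {
  known <- unique(tolower(known_vendors))
  hits <- vapply(known, function(v) grepl(v, tolower(q), fixed = TRUE), logical(1))
  unname(known[hits])
}

vendors_in_query(
  "Are there any Microsoft Exchange vulnerabilities?",
  c("Microsoft", "Apple", "Oracle", "Microsoft")
)
```

Substring matching is blunt (a very short vendor name can match inside an unrelated word), so treat it as a supplement to the NER step rather than a replacement.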
And, why not pick on Apple, too?
answer <- ask_kev("What type of vulnerabilities does Safari have?")
Analyzing original query…
Generated embeddings…
Searching corpus…
Trying to identify mentioned vendors…
Shunting the question and corpus to the LLM…
cat(answer)
List of Relevant CVEs
- CVE-2023-42916: (Apple) Apple Multiple Products WebKit Out-of-Bounds Read Vulnerability
- CVE-2023-28204: (Apple) Apple Multiple Products WebKit Out-of-Bounds Read Vulnerability
- CVE-2023-32435: (Apple) Apple Multiple Products WebKit Memory Corruption Vulnerability
- CVE-2023-42917: (Apple) Apple Multiple Products WebKit Memory Corruption Vulnerability
- CVE-2023-37450: (Apple) Apple Multiple Products WebKit Code Execution Vulnerability
- CVE-2023-41993: (Apple) Apple Multiple Products WebKit Code Execution Vulnerability
- CVE-2023-32439: (Apple) Apple Multiple Products WebKit Type Confusion Vulnerability
- CVE-2021-30663: (Apple) Apple Multiple Products WebKit Integer Overflow Vulnerability
- CVE-2023-23529: (Apple) Apple Multiple Products WebKit Type Confusion Vulnerability
- CVE-2023-28205: (Apple) Apple Multiple Products WebKit Use-After-Free Vulnerability
Detailed Explanation of a Selected Vulnerability
CVE-2023-32435: Apple Multiple Products WebKit Memory Corruption Vulnerability
Full Vulnerability Name: Apple Multiple Products WebKit Memory Corruption Vulnerability
Affected Systems or Software: The vulnerability affects various Apple products and software that use the WebKit browser engine, including but not limited to: - iOS - iPadOS - macOS - tvOS - watchOS - Safari
Technical Details of the Vulnerability: This vulnerability involves a memory corruption issue within WebKit. It allows an attacker to craft malicious web content that leads to code execution by manipulating how WebKit processes HTML. The technical specifics involve a manipulation of resources in WebKit that are not properly protected, leading to corruption and potential execution of arbitrary code.
Potential Impact if Exploited: If exploited successfully, this vulnerability could lead to unauthorized access to sensitive information or the execution of malicious code on affected systems. This could result in various negative consequences depending on the system and data compromised.
Recommended Mitigation or Patch Information: Apple typically addresses vulnerabilities through security updates and patches for their operating systems and software products. Users are advised to ensure their devices and browsers are updated with the latest available patches, which will mitigate this vulnerability among others that may be present in older versions of affected products.
Please note that while I have provided information on how to address the identified vulnerability, it’s crucial for users to follow specific instructions from Apple or other relevant vendors regarding updates and security patches. Additionally, general best practices such as avoiding suspicious links or downloads can also help prevent exploitation attempts.
If we could do some tuning, the results would be far more deterministic, but modifying {ollamar} for this weekend bonus post was just not happening thanks to something called “yard work” (sigh).
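For the curious, here is a rough sketch of what that tuning could look like by talking to Ollama's REST API directly with {httr2} instead of going through {ollamar}. Ollama's /api/chat endpoint accepts an `options` object with fields such as `temperature` and `seed`. The names `build_chat_body()` and `chat_tuned()` are hypothetical, the host assumes Ollama's default local port, and a low temperature plus a fixed seed should make output more repeatable rather than guarantee identical answers.

```r
# Build the JSON body for Ollama's /api/chat endpoint with sampling pinned down
build_chat_body <- function(model, messages, temperature = 0, seed = 42) {
  list(
    model = model,
    messages = messages,
    stream = FALSE,
    options = list(temperature = temperature, seed = seed)
  )
}

# Hypothetical wrapper (not part of {ollamar}); assumes Ollama on its default port
chat_tuned <- function(model, messages, host = "http://localhost:11434") {
  httr2::request(host) |>
    httr2::req_url_path("/api/chat") |>
    httr2::req_body_json(build_chat_body(model, messages)) |>
    httr2::req_perform() |>
    httr2::resp_body_json() |>
    getElement("message") |>
    getElement("content")
}

# Inspect the body without needing a running Ollama
body <- build_chat_body(
  "llama3.1",
  list(list(role = "user", content = "Are there any Microsoft Exchange vulnerabilities?"))
)
str(body$options)
```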
Play Around!
We did not do much defensive coding for the LLM. It could use some more keyword filtering, perhaps even asking UniversalNER to give us technologies as well as vendors.
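As one small example of defensive coding, the vendor step in ask_kev() pipes the model's reply straight into jsonlite::fromJSON(), which throws if the reply is not valid JSON. A minimal sketch of a more forgiving parser is below; `parse_vendors()` is a hypothetical helper, and it assumes the model normally answers with a JSON array of strings.

```r
# Hypothetical helper: parse the NER reply into a lowercase character vector,
# falling back to "no vendors" when the reply is not valid JSON.
# Assumes the model normally answers with a JSON array of strings, e.g. ["Oracle"].
parse_vendors <- function(reply) {
  parsed <- tryCatch(
    jsonlite::fromJSON(reply),
    error = function(e) character()
  )
  tolower(as.character(unlist(parsed)))
}

parse_vendors('["Oracle", "Microsoft"]')                     # two lowercase names
parse_vendors("Sorry, I could not find any organizations.")  # empty character vector
```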
If you manage to get good results with smaller models, please let me know! The more we can do with our own hardware and open source tools like Ollama/llamafile, and open weights models, the more this side of data science can be democratized.