SecurityOnion GPT

Introduction

I was recently catching up on some conference videos and saw a talk by Roberto Rodriguez on Empowering Security Teams with Generative AI: GPT models. This got me thinking about how to integrate GPT into hunting with Security Onion.

Goals:

  1. Summarize activity found in Security Onion
  2. Enrich activity with MITRE ATT&CK attribution
  3. Convert English questions to Kibana Query Language to hunt

In this post, I’ll tackle goals 1 and 2. I’ll cover goal 3 in a separate post. These experiments are conducted in JupyterLab.

1. Summarize activity found in Security Onion

First we need to connect Jupyter to SO to search for malicious activity. I’ve run some ransomware attacks in the lab, so we’ll test out these goals by looking for pre-ransom activity, which includes deleting the Windows Volume Shadow Copies as part of Inhibit System Recovery (T1490).

Connect to Elasticsearch

!pip3 install pandas openai autogen elasticsearch elasticsearch-dsl python-dotenv
import pandas as pd
import os
import json
import openai
import urllib3
import autogen
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from dotenv import load_dotenv

Here we connect to the Elasticsearch service that’s running on SO. search_so() is a helper function to reduce boilerplate code when searching for activity.

terms is a list of Elasticsearch DSL queries. They will take the form of:

Q('TYPE', **{field1: value1}),

where TYPE is ‘match’ for normal or fuzzy searches, ‘term’ for precise values, or ‘range’ to look for values between two limits. ‘range’ will be used on the @timestamp field to focus in on when the activity occurred. The **{} syntax is needed for the @timestamp field, as the Search DSL documentation says:

In some cases [Pass all the parameters as keyword arguments] is not possible due to python’s restriction on identifiers - for example if your field is called @timestamp. In that case you have to fall back to unpacking a dictionary: Range(** {‘@timestamp’: {‘lt’: ‘now’}})
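For reference, these Q objects compile down to plain Elasticsearch query JSON. The bool query built later in this post (same field names and values) roughly corresponds to:

```python
import json

# The JSON that Q('match', ...), Q('wildcard', ...) and Q('range', ...)
# produce when combined under Q('bool', must=[...])
query = {
    "bool": {
        "must": [
            {"match": {"event.dataset": "process_creation"}},
            {"wildcard": {"process.command_line": "*shadows*"}},
            {"range": {"@timestamp": {"gte": "2023-11-12", "lte": "2023-11-15"}}},
        ]
    }
}
print(json.dumps(query, indent=2))
```

Seeing the raw JSON also makes it easier to debug a query by pasting it into Kibana’s Dev Tools console.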

# Security Onion setup
load_dotenv()
soindex='*:so-*'
so_user = os.getenv("SO_USERNAME")
so_pass = os.getenv("SO_PASSWORD")
sohost = os.getenv("SO_HOST")
so_api_key = os.getenv("SO_API_KEY")
es = Elasticsearch([f'https://{sohost}:9200'], ca_certs=False, verify_certs=False, api_key=so_api_key)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Note: the time bounds must be passed as a 'range' clause inside `terms`;
# the start_date/end_date defaults are kept only as a reference window.
def search_so(terms: list, start_date=(datetime.utcnow() - timedelta(days=3)).isoformat(), end_date=datetime.utcnow().isoformat()):
    search = Search(using=es, index=soindex)
    search = search.query(
        Q('bool',
          must=terms
        )
    )
    response = search.execute()
    if response.success():
        df = pd.DataFrame(d.to_dict() for d in search.scan())
        return df
    else:
        print(f"Query failed: {response}")
        return None

Verifying that Jupyter and Elasticsearch can communicate:

es.info()

ObjectApiResponse({‘name’: ‘securityonion’, ‘cluster_name’: ‘securityonion’, ‘cluster_uuid’: ‘WVA9WFJETpCLgQPeLtGk1A’, ‘version’: {‘number’: ‘8.10.4’, ‘build_flavor’: ‘default’, ‘build_type’: ‘docker’, ‘build_hash’: ‘b4a62ac808e886ff032700c391f45f1408b2538c’, ‘build_date’: ‘2023-10-11T22:04:35.506990650Z’, ‘build_snapshot’: False, ‘lucene_version’: ‘9.7.0’, ‘minimum_wire_compatibility_version’: ‘7.17.0’, ‘minimum_index_compatibility_version’: ‘7.0.0’}, ‘tagline’: ‘You Know, for Search’})

Now we can search for activity. This will be the equivalent of the query string

@timestamp:[2023-11-12T00:00:00 TO 2023-11-15T00:00:00] and event.dataset:process_creation and process.command_line:*shadows*

Note that periods in the field names are written as double underscores in the query kwargs; elasticsearch-dsl maps them back to dots.
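A quick illustration of that mapping (a toy helper, not part of the library):

```python
def kwarg_to_field(name: str) -> str:
    """elasticsearch-dsl translates double underscores in keyword
    arguments to dots in the real Elasticsearch field name."""
    return name.replace("__", ".")

print(kwarg_to_field("event__dataset"))         # event.dataset
print(kwarg_to_field("process__command_line"))  # process.command_line
```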

# Replace 'your_term_field' and 'your_term_value' with your term field and value
field1 = 'event__dataset'
value1 = 'process_creation'
field2 = 'process__command_line'
value2 = '*shadows*'
# Replace 'your_date_field' with your date field
date_field = '@timestamp'

# Replace the date range as needed
start_date = datetime(2023, 11, 12)
end_date = datetime(2023, 11, 15)

# Create a Bool query with Match, Wildcard, and Range clauses
terms = [
  Q('match', **{field1: value1}),
  Q('wildcard', **{field2: value2}),
  Q('range', **{date_field: {'gte': start_date, 'lte': end_date}})
]

# Execute the search
response = search_so(terms, start_date, end_date)

# Access the search results
if response is not None:
    df = response
df
(DataFrame output truncated for readability: six process_creation events from the hosts SCR-ACT-PC2, scr-sales-pc1, and scr-off-pc1, recorded between 02:32 and 03:03 UTC on 2023-11-14. Each is a Sysmon “Process Create” event running as NT AUTHORITY\SYSTEM, with columns including metadata, agent, process, winlog, log, message, tags, observer, @timestamp, file, ecs, @version, host, event, user, and hash.)

The process information is a JSON object, so we can use json_normalize to pull that information into its own dataframe.

processes = pd.json_normalize(df.process)
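To see what json_normalize does, here is a toy example with made-up records shaped like the process objects above:

```python
import pandas as pd

# Hypothetical nested process records, similar to what search_so() returns
records = [
    {"executable": "C:\\Windows\\System32\\cmd.exe",
     "command_line": "vssadmin delete shadows /all /quiet",
     "parent": {"executable": "C:\\Windows\\Temp\\sample.exe"}},
]
flat = pd.json_normalize(records)
print(sorted(flat.columns))
# ['command_line', 'executable', 'parent.executable']
```

The nested parent dict becomes a dotted column name, so fields like parent.executable can be selected directly.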

Connect to OpenAI

Now we’ll set up a connection to OpenAI’s API. We’ll ask ChatGPT to summarize what happened in the command_line values.

# OpenAI setup
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

def chat_gpt(prompt):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
prompt = f"""
Provide a summary of the following commandline activity.
Give me a few sentences summarizing what actually happened based
on the parent and process command_line values.

Review: `{processes[['parent.executable','working_directory','executable', 'command_line']].to_json()}`
"""

vss_response = chat_gpt(prompt)
print(vss_response)

The commandline activity involves executing the command “vssadmin delete shadows /all /quiet” using the cmd.exe executable. This command is being run multiple times, with different parent executables and working directories. The command is initiated by a process with the executable C:\Windows\Temp\6341D6.exe, which then calls the cmd.exe executable with the given command line.

That seems pretty “okay”. It’s not a great description of why a threat actor would do this. Let’s adjust the prompt and see if we can get something better.

prompt = f"""
You are a cyber security analyst and found this activity. Why would a threat actor perform these actions?
Review: `{processes[['parent.executable','working_directory','executable', 'command_line']].to_json()}`
"""

vss_response2 = chat_gpt(prompt)
print(vss_response2)

Based on the provided activity, it appears that a threat actor has performed the following actions: 1. Executed a file named “6341D6.exe” located in the “C:\Windows\Temp" directory with the working directory set to “C:\WINDOWS\system32". 2. Executed the command “C:\Windows\System32\cmd.exe” with the working directory set to “C:\WINDOWS\system32". 3. Executed the command “vssadmin delete shadows /all /quiet”. 4. Executed the command “C:\Windows\System32\vssadmin.exe”. 5. Executed the command “vssadmin delete shadows /all /quiet”. 6. Executed the command “C:\Windows\System32\vssadmin.exe”. 7. Executed the command “vssadmin delete shadows /all /quiet”. 8. Executed the command “C:\Windows\System32\cmd.exe” with the working directory set to “C:\Windows\system32". From these actions, it can be inferred that the threat actor is attempting to delete Volume Shadow Copies (VSS) on the system. VSS is a feature in Windows that creates backups of files and allows users to restore them to a previous state. By deleting the VSS, the threat actor may be trying to cover their tracks or prevent file recovery tools from accessing previous versions of files, thus hindering potential forensic investigation or restoring system to a previous state.
Deleting VSS can be a common tactic used by threat actors involved in activities such as ransomware attacks, where they aim to prevent victims from restoring their files from backup copies. It can also be used to hide the presence of malicious files or activities on the system by removing potential traces. As a cyber security analyst, it is important to investigate further, analyze the impact of the VSS deletion, and take appropriate actions to mitigate further risks. This may include conducting a thorough system analysis, monitoring for any additional suspicious activities, and implementing security measures to prevent future unauthorized access.

This is much more relevant, but it’s also a lot to read through. When we enrich this later, we’ll see that we exceed the GPT model’s token limits, so we’d want this summarized more tightly.

2. Enrich activity with MITRE ATT&CK attribution

Now let’s look at taking what we found above and try to attribute it to threat actors. First we need to create a knowledge base of threat actors that GPT can look at. We do this with text embeddings.
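The core idea behind embeddings: each document and each query is mapped to a vector, and retrieval returns the documents whose vectors sit closest to the query’s. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and the file names here are hypothetical):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings" for two group documents
docs = {
    "wizard_spider.md": [0.9, 0.1, 0.2],
    "lazarus_group.md": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # embedding of the analyst's question

best = max(docs, key=lambda name: cosine(query_vec, docs[name]))
print(best)  # wizard_spider.md
```

ChromaDB does exactly this kind of nearest-vector lookup for us, just at scale and with a proper index.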

Create text embeddings

Cyb3rWard0g has already parsed ATT&CK groups and stored them as .md files. His repo references a ChromaDB, but the database itself was not committed, so we need to index the markdown files and create the database ourselves by copying and running Roberto’s code.

import glob
from langchain.document_loaders import UnstructuredMarkdownLoader
documents_directory = "/mnt/storage/GenAI-Security-Adventures/experiments/RAG/Threat-Intelligence/ATTCK-Groups/source-knowledge/documents"
# variables
group_files = glob.glob(os.path.join(documents_directory, "*.md"))

# Loading Markdown files
md_docs = []
print("[+] Loading Group markdown files..")
for group in group_files:
    print(f' [*] Loading {os.path.basename(group)}')
    loader = UnstructuredMarkdownLoader(group)
    md_docs.extend(loader.load())

print(f'[+] Number of .md documents processed: {len(md_docs)}')
[+] Loading Group markdown files..

[+] Number of .md documents processed: 134

Next we tokenize the documents.

import tiktoken

tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')
token_integers = tokenizer.encode(md_docs[0].page_content, disallowed_special=())
num_tokens = len(token_integers)
token_bytes = [tokenizer.decode_single_token_bytes(token) for token in token_integers]

print(f"token count: {num_tokens} tokens")
print(f"token integers: {token_integers}")
print(f"token bytes: {token_bytes}")

token count: 3241 tokens

def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=() #To disable this check for all special tokens
    )
    return len(tokens)

# Get token counts
token_counts = [tiktoken_len(doc.page_content) for doc in md_docs]

print(f"""[+] Token Counts:
Min: {min(token_counts)}
Avg: {int(sum(token_counts) / len(token_counts))}
Max: {max(token_counts)}""")
[+] Token Counts:
Min: 155
Avg: 1789
Max: 8131

Here Roberto splits the documents into chunks to deal with token limits.
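The splitter’s behavior can be approximated with a simple sketch: greedily fill each chunk up to the size limit, then start the next chunk a little earlier so consecutive chunks overlap. This toy version counts words instead of tokens, which is a simplification:

```python
def split_text(text, chunk_size=10, overlap=2):
    """Greedy word-based splitter: each chunk holds at most chunk_size
    words and repeats the last `overlap` words of the previous chunk
    so context isn't cut off at chunk boundaries."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(25))
parts = split_text(doc)
print(len(parts))  # 3
print(parts[1])    # starts at w8, overlapping the end of the first chunk
```

RecursiveCharacterTextSplitter is smarter about where it cuts (preferring paragraph and line breaks before falling back to spaces), but the size/overlap mechanics are the same.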

from langchain.text_splitter import RecursiveCharacterTextSplitter
# Chunking Text
print('[+] Initializing RecursiveCharacterTextSplitter..')
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)
[+] Initializing RecursiveCharacterTextSplitter..
print('[+] Splitting documents in chunks..')
chunks = text_splitter.split_documents(md_docs)

print(f'[+] Number of documents: {len(md_docs)}')
print(f'[+] Number of chunks: {len(chunks)}')
[+] Splitting documents in chunks..
[+] Number of documents: 134
[+] Number of chunks: 694

Next we take the split documents, apply the embedding function to create the vectors, and load them into the database.

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")
persist_directory = './chroma_db'
db = Chroma.from_documents(chunks, embedding_function, collection_name="groups_collection", persist_directory=persist_directory)
# Roberto's test
query = "What threat actors send text messages to their targets?"
relevant_docs = db.similarity_search(query)

# print results
print(relevant_docs[0].page_content)
Molerats - G0021

Created: 2017-05-31T21:31:55.093Z

Modified: 2021-04-27T20:16:16.057Z

Contributors:

Aliases

Molerats,Operation Molerats,Gaza Cybergang

Description

Molerats is an Arabic-speaking, politically-motivated threat group that has been operating since 2012. The group's victims have primarily been in the Middle East, Europe, and the United States.(Citation: DustySky)(Citation: DustySky2)(Citation: Kaspersky MoleRATs April 2019)(Citation: Cybereason Molerats Dec 2020)

Techniques Used
# Test a search using our technique
query = "What technique is delete the volume shadow copies?"
relevant_docs = db.similarity_search(query)

# print results
print(relevant_docs[0].page_content)
Matrix | Domain | Platform | Technique ID | Technique Name | Use
mitre-attack | enterprise-attack | Linux,macOS,Windows,Network | T1090 | Proxy | CopyKittens has used the AirVPN service for operational activity.(Citation: Microsoft POLONIUM June 2022)
mitre-attack | enterprise-attack | PRE | T1588.002 | Tool | CopyKittens has used Metasploit, Empire, and AirVPN for post-exploitation activities.(Citation: ClearSky and Trend Micro Operation Wilted Tulip July 2017)(Citation: Microsoft POLONIUM June 2022)
mitre-attack | enterprise-attack | macOS,Windows,Linux | T1564.003 | Hidden Window | CopyKittens has used -w hidden and -windowstyle hidden to conceal PowerShell windows.(Citation: ClearSky Wilted Tulip July 2017)
mitre-attack | enterprise-attack | Linux,macOS,Windows | T1560.003 | Archive via Custom Method | CopyKittens encrypts data with a substitute cipher prior to exfiltration.(Citation: CopyKittens Nov 2015)
mitre-attack | enterprise-attack | Windows | T1218.011 | Rundll32 | CopyKittens uses rundll32 to load various tools on victims, including a lateral movement tool named Vminst, Cobalt Strike, and shellcode.(Citation: ClearSky Wilted Tulip July 2017)
mitre-attack | enterprise-attack | Linux,macOS,Windows | T1560.001 | Archive via Utility | CopyKittens uses ZPP, a .NET console program, to compress files with ZIP.(Citation: ClearSky Wilted Tulip July 2017)
mitre-attack | enterprise-attack | Windows | T1059.001 | PowerShell

This was not highly accurate. Roberto’s data is based on MITRE’s Groups. The technique I’m looking at is employed by ransomware, which is software, not a group. LockBit does not show up in MITRE’s Groups list, nor does it show up in Software. We may want to build an additional database that focuses on TTPs. But first, let’s try querying OpenAI.

I add CompressibleAgent to try to overcome the token limits we hit when using OpenAI’s lengthy explanation of the observed activity as the RAG question.

# Set up AutoGen config list
config_list = autogen.oai.config_list_from_models(
    model_list=["gpt-3.5-turbo", "gpt-4"]
)

# Set up LLM Config
llm_config = {
    "timeout" : 600,
    "seed" : 42,
    "config_list" : config_list,
    "temperature" : 0
}
from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
import chromadb
ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    retrieve_config={
        "task": "qa",
        "collection_name": "groups_collection",
        "model": config_list[0]["model"],
        "client": chromadb.PersistentClient(path='./chroma_db'),
        "embedding_model": "all-mpnet-base-v2", #Sentence-transformers model
    },
)
assistant = RetrieveAssistantAgent(
    name="assistant", 
    system_message="You are a helpful assistant.",
    llm_config=llm_config,
)
from autogen.agentchat.contrib.compressible_agent import CompressibleAgent
compressed_assistant = CompressibleAgent(
    name="assistant", 
    system_message="You are a cyber security analyst.",
    llm_config={
        "timeout": 600,
        "cache_seed": 42,
        "config_list": config_list,
    },
    compress_config={
        "mode": "COMPRESS",
        "trigger_count": 600, # set this to a large number for less frequent compression
        "verbose": True, # to allow printing of compression information: context before and after compression
        "leave_last_n": 2,
    }
)

INFO:autogen.token_count_utils:gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.

At this point, Roberto can prompt the RAG agent on different MITRE ATT&CK groups. Let’s try asking about the pre-ransomware activity that was found:

assistant.reset()

qa_problem = f"What threat actors use the following techniques: {vss_response2}"
ragproxyagent.initiate_chat(compressed_assistant, problem=qa_problem)
User's question is: What threat actors use the following techniques: Based on the provided activity, it appears that a threat actor has performed the following actions:
1. Executed a file named "6341D6.exe" located in the "C:\Windows\Temp\" directory with the working directory set to "C:\WINDOWS\system32\".
2. Executed the command "C:\Windows\System32\cmd.exe" with the working directory set to "C:\WINDOWS\system32\".
3. Executed the command "vssadmin delete shadows /all /quiet".
4. Executed the command "C:\Windows\System32\vssadmin.exe".
5. Executed the command "vssadmin delete shadows /all /quiet".
6. Executed the command "C:\Windows\System32\vssadmin.exe".
7. Executed the command "vssadmin delete shadows /all /quiet".
8. Executed the command "C:\Windows\System32\cmd.exe" with the working directory set to "C:\Windows\system32\".
From these actions, it can be inferred that the threat actor is attempting to delete Volume Shadow Copies (VSS) on the system. VSS is a feature in Windows that creates backups of files and allows users to restore them to a previous state. By deleting the VSS, the threat actor may be trying to cover their tracks or prevent file recovery tools from accessing previous versions of files, thus hindering potential forensic investigation or restoring system to a previous state.
Deleting VSS can be a common tactic used by threat actors involved in activities such as ransomware attacks, where they aim to prevent victims from restoring their files from backup copies. It can also be used to hide the presence of malicious files or activities on the system by removing potential traces.
As a cyber security analyst, it is important to investigate further, analyze the impact of the VSS deletion, and take appropriate actions to mitigate further risks. This may include conducting a thorough system analysis, monitoring for any additional suspicious activities, and implementing security measures to prevent future unauthorized access.
...
assistant (to ragproxyagent):
Based on the provided activity, the threat actor is attempting to delete Volume Shadow Copies (VSS) on the system.
--------------------------------------------------------------------------------

This response misses the mark. We can probably do better by engineering a better prompt. The initial OpenAI response was very wordy; if we can make that first prompt return a more focused response, we can feed that output into the RAG, and the RAG should then provide a better answer. Let’s manually test with a more focused prompt.
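As a sketch of that idea, a more constrained version of the first prompt might look like this (the wording and helper name build_focused_prompt are hypothetical; chat_gpt() is the helper defined earlier):

```python
def build_focused_prompt(activity_json: str) -> str:
    # Hypothetical prompt: cap the length and ask for the technique by name,
    # so the output fits comfortably into a follow-up RAG question
    return (
        "You are a cyber security analyst. In one sentence, state what the "
        "threat actor did and name the MITRE ATT&CK technique it maps to.\n"
        f"Review: `{activity_json}`"
    )

prompt = build_focused_prompt('{"command_line": "vssadmin delete shadows /all /quiet"}')
# vss_short = chat_gpt(prompt)  # the short summary would then feed the RAG
print(prompt)
```

The goal is output shaped like “this threat actor deleted volume shadow copies (T1490)” rather than a multi-paragraph narrative.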

assistant.reset()

qa_problem = "What ransomware focused threat actors delete volume shadow copies?"
ragproxyagent.initiate_chat(compressed_assistant, problem=qa_problem)
User's question is: What ransomware focused threat actors delete volume shadow copies?
...
assistant (to ragproxyagent):
EXOTIC LILY, Wizard Spider
--------------------------------------------------------------------------------

The Wizard Spider page says:

Wizard Spider has used WMIC and vssadmin to manually delete volume shadow copies. Wizard Spider has also used Conti ransomware to delete volume shadow copies automatically with the use of vssadmin.[7]

And EXOTIC LILY says:

EXOTIC LILY is a financially motivated group that has been closely linked with Wizard Spider and the deployment of ransomware including Conti and Diavol.

This is encouraging - the EXOTIC LILY page does not explicitly say that vssadmin is used, but it seems like the link to Wizard Spider provided enough association to make a connection. So, we’ll want to engineer the original prompt to produce output that looks like “this threat actor deleted volume shadow copies”. On the other hand, there are several other groups that do the same actions, but those were not returned.

As a control, let’s see if the GPT will correctly answer a question about another threat actor. The MITRE group page for Lazarus does not list any volume shadow copy interaction.

assistant.reset()

qa_problem = "What commands does Lazarus Group use to delete volume shadow copies?"
ragproxyagent.initiate_chat(compressed_assistant, problem=qa_problem)

User's question is: What commands does Lazarus Group use to delete volume shadow copies?
...
assistant (to ragproxyagent):
The Lazarus Group does not use specific commands to delete volume shadow copies.
--------------------------------------------------------------------------------

The GPT correctly identifies that Lazarus does not delete volume shadow copies.

Next steps

At this point, we can manually query Elastic and feed results to a GPT for summarization. Using gpt-3.5 provides okay results and we can mitigate some of its shortcomings (context length, training end date) with RAG. My next goal is to have a GPT save time in creating queries for Elastic. Eventually, I’d like to have an agent that you can provide with a hunting lead, have the agent create and run an Elastic query, then summarize and explain results.

This post is licensed under CC BY 4.0 by the author.