Wild hunt, or how to automate the process of malware collection
When you are just starting to learn malware analysis, finding fresh malicious samples is always frustrating: the ones from the Practical Malware Analysis labs are a bit old, and you have already mastered them. These thoughts pushed me to research different malware sources and ways to automate this routine.
Here is a great list of malware sample sources: https://www.megabeets.net/fantastic-malware-and-where-to-find-them
There are a lot of different platforms where you can get malicious files — VirusTotal, VirusBay, VirusShare — or maybe you have your own custom honeypot network. The choice is yours, but I picked Twitter. If you didn’t know, many security researchers post information about malware hosting servers under the hashtag #opendir on Twitter.
For example, @makflwana posted information about a malicious server with a bunch of binaries. We just need to visit the listed links and download all the samples for further malware analysis.
Wait, you said something about automation before, right? This is where Python comes into play.
First, we need to get those malicious links. The most convenient way to do this is the Twitter API, but I ~~couldn’t obtain Twitter creds~~ like challenges, so I decided to parse the HTML page manually with BeautifulSoup. To be able to retrieve tweets, you need to provide a valid User-Agent header; otherwise Twitter redirects you to the login page. The next step is to find the tweet body on the HTML page. For this task I used the Firefox inspector: just right-click on a tweet, choose Inspect Element (Q), and you will see that the element js-tweet-text-container contains the tweet’s text.
I wrote a function which searches Twitter for the provided hashtag and returns an array of tweets.
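A minimal sketch of such a function could look like this. The function names and the search URL format are my assumptions; the js-tweet-text-container selector matched the classic Twitter web UI at the time of writing and may have changed since.

```python
import requests
from bs4 import BeautifulSoup

# Legacy Twitter search page; assumed URL format for illustration.
SEARCH_URL = "https://twitter.com/search"

# A desktop User-Agent is required, otherwise Twitter redirects to the login page.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) "
                  "Gecko/20100101 Firefox/68.0"
}


def parse_tweets(html):
    """Extract tweet texts from the search results page."""
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div", class_="js-tweet-text-container")
    return [c.get_text(strip=True) for c in containers]


def get_tweets(hashtag):
    """Search Twitter for a hashtag and return an array of tweet texts."""
    resp = requests.get(SEARCH_URL,
                        params={"q": "#" + hashtag, "f": "tweets"},
                        headers=HEADERS)
    return parse_tweets(resp.text)
```

Splitting the HTML parsing into its own function makes it easy to test without hitting the network.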
The final code of the project is available at the link
After we get the text from a tweet, we can easily extract URLs with regular expressions and use urlparse to validate them. Let’s examine how security researchers post malicious links. I found a couple of patterns:
- hxxp://<url> — protocol letters replaced with x
- http://malicious[.]site — dots escaped with square brackets
- #opendir <some keywords> <url>
- pastebin url
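Patterns 1 and 2 can be handled by "refanging" the URL before validation. Here is a rough sketch, assuming these two defanging styles; the regex is deliberately greedy and may need trimming for trailing punctuation in real tweets.

```python
import re
from urllib.parse import urlparse

# Matches both plain (http/https) and defanged (hxxp/hxxps) URLs.
URL_RE = re.compile(r"\bh[tx]{2}ps?://\S+", re.IGNORECASE)


def refang(url):
    """Turn hxxp:// back into http:// and remove the [.] escaping."""
    url = re.sub(r"^h[tx]{2}p", "http", url, flags=re.IGNORECASE)
    return url.replace("[.]", ".")


def extract_urls(text):
    """Extract and normalize URLs from tweet text, validating with urlparse."""
    urls = []
    for match in URL_RE.findall(text):
        url = refang(match)
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(url)
    return urls
```

For example, `extract_urls("#opendir hxxp://evil[.]site/files/")` yields a clean `http://evil.site/files/` ready for the crawler.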
I skipped case 3, because I don’t need 100% coverage, at least not right now. So in the end you should have something like this.
When we are done with the links, we need to visit each extracted page and recursively download all the malicious files (i.e., write our own crawler). Here I recommend not reinventing the wheel and using Scrapy, a nice full-featured framework.
The most important part of our crawler is the spider (in Scrapy, spiders are classes which define how a certain site will be scraped). In most cases, a file listing hides behind the malicious link under the hashtag #opendir, so our spider should download samples from the file listing and recursively visit all folders. The best example of such a file listing is Python’s SimpleHTTPServer.
I wrote a simple spider with all the functionality I needed: it visits folders recursively and downloads files. I also made the local directory structure of downloaded files mirror the one in the file listing.
We are almost done!
Attention required! You need to be careful while working with malware or Command & Control servers, so I also recommend using a proxy or Tor. Scrapy can’t work with a SOCKS proxy, but you can set up Polipo, which will act as a bridge to the Tor SOCKS proxy. Detailed instructions on how to do this are available here.
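The core of the setup is just a few lines in the Polipo config, pointing it at Tor’s default SOCKS port (the ports below are the defaults and may differ on your system):

```
# /etc/polipo/config — forward Polipo's HTTP proxy to Tor's SOCKS port
socksParentProxy = "127.0.0.1:9050"
socksProxyType = socks5
proxyAddress = "127.0.0.1"
proxyPort = 8123
```

Then export `http_proxy="http://127.0.0.1:8123"` before running the crawler; Scrapy’s built-in HttpProxyMiddleware picks up the `http_proxy` environment variable.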
After running the code above I got the following results, which look pretty awesome!
Today I shared my experience with getting the latest malware samples and automating this process with Python. We also wrote our own Twitter parser and web crawler. A good idea would be to run this code on a regular basis with cron and send the collected samples to a sandbox.
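A crontab entry for that could look like this (the paths and script names are examples, not from the project):

```
# Collect fresh #opendir samples every day at 03:00
0 3 * * * cd /opt/malware-hunt && python collect_tweets.py && scrapy runspider opendir_spider.py
```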
Stay safe with Tor/VPN and …